The World Health Organization forecasts a population of 2 billion people over 60 years of age by 2050, with 7% of this population suffering from dementia, a public health priority. Regular evaluation of older adults allows early detection of the disease and improves patients' quality of life. Accordingly, research and development of innovative technological systems for managing the growing number of patients with cognitive diseases has increased in recent years. These systems integrate data collection and its automatic processing based on geriatric metrics using artificial intelligence (AI) methods, so that they can detect the disease at an early stage and follow its progression, supporting the increase in patients expected in the clinical setting in the coming years. This research presents an interactive web platform that allows users with an internet connection to remotely perform an automated assessment of the Montreal Cognitive Assessment (MoCA) test from any mobile device, computer, laptop, or other device. This test detects and assesses cognitive deterioration. We use AI and neural network methods for binary and multiclass classification to obtain assessment scores according to geriatric metrics. Subsequently, the test is validated remotely by a mental health specialist. The tests carried out show correct handling of the information, and the results correspond well with the reference data used for comparison. Our system provides an automated and easy-to-use digital evaluation metric.
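As a rough illustration of the binary-classification step, the sketch below trains a small neural network on synthetic MoCA sub-domain scores. The feature layout and the synthetic score distribution are illustrative assumptions, not the platform's actual pipeline; the 26-point cutoff is the standard MoCA screening threshold.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical feature matrix: one row per patient, columns are the seven
# MoCA sub-domain scores (visuospatial/executive, naming, attention,
# language, abstraction, delayed recall, orientation), max 30 in total.
maxes = np.array([5, 3, 6, 3, 2, 5, 6])
# Synthetic scores skewed toward the maximum, mimicking a screening sample.
X = (maxes - rng.binomial(maxes, 0.15, size=(200, 7))).astype(float)
# Binary label: 1 = possible impairment (total score below the 26 cutoff).
y = (X.sum(axis=1) < 26).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
clf.fit(X_train, y_train)
print(f"test accuracy: {clf.score(X_test, y_test):.2f}")
```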
Semantic segmentation is a high-level computer vision task that associates each pixel of an image with a semantic (class) label. Fine semantic segmentation is a pixel-level task that provides the detailed information needed to easily identify the region of the object of interest. Hands are one of the main channels of communication, enhancing human-object and human-environment interaction, and in egocentric videos they are ubiquitous and at the center of vision and activities; hence our interest in hand segmentation. Fine semantic segmentation of hands locates, identifies, and groups together the pixels associated with the hands under a hand semantic label. We performed fine semantic segmentation of hands by improving the architecture of a state-of-the-art deep convolutional neural network (RefineNet). We achieve a finer and more accurate result by amending the process of obtaining and combining high- and low-level features and the pixel grouping for pixel-level classification. We performed this task on a public egocentric video dataset (EgoHands). We evaluate the performance of our model (RefineNet-Pix) using an existing pixel-level metric, mean precision (mPrecision). Compared with the baseline reported in Urooj's work, we obtain accuracy higher than the 87.9% benchmark. Our finer and more accurate semantic segmentation guarantees good performance under various lighting conditions and complex backgrounds, making it suitable for both indoor and outdoor environments. Fine hand semantic segmentation can be applied in image analysis, medical systems (with a focus on understanding hand motion for prediction, diagnosis, and monitoring), hand gesture recognition (human-computer interaction and action understanding), and robotics (grasp and manipulation of objects).
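A pixel-level mean precision metric can be sketched as follows. This is a generic formulation (per-class precision TP / (TP + FP), averaged over classes that appear in the prediction), not the exact evaluation code used with EgoHands.

```python
import numpy as np

def mean_precision(pred, gt, num_classes):
    """Pixel-level mean precision: average per-class precision
    TP / (TP + FP) over classes present in the prediction."""
    precisions = []
    for c in range(num_classes):
        predicted_c = (pred == c)
        if predicted_c.sum() == 0:
            continue  # class never predicted; skip to avoid 0/0
        tp = np.logical_and(predicted_c, gt == c).sum()
        precisions.append(tp / predicted_c.sum())
    return float(np.mean(precisions))

# Toy 4x4 masks: 0 = background, 1 = hand (one false-positive hand pixel).
gt   = np.array([[0,0,1,1],[0,0,1,1],[0,0,0,0],[0,0,0,0]])
pred = np.array([[0,0,1,1],[0,1,1,1],[0,0,0,0],[0,0,0,0]])
print(round(mean_precision(pred, gt, 2), 3))  # → 0.9
```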
Good health and functional ability are important for individuals to lead fulfilling mental, psychological, and social lives. Diseases such as dementia cause irreversible damage and a decline in cognition, function, and behavior, which translates into difficulty performing daily tasks independently. Studies have shown that the assessment of instrumental activities of daily living (IADLs) correlates with a person's cognitive and functional status. Biomechanical markers such as hand movement and use have been analyzed with artificial intelligence (AI). We present an optimized AI algorithm for hand detection in the analysis of egocentric video recordings. This improved algorithm is based on a probabilistic approach in which hand regions are detected in egocentric videos and then feed the human functional pattern recognition process. To evaluate the performance of our proposal, we use a dataset containing four functional patterns organized into four classes, based on the prehensile patterns of the hands (strength, precision) and on the kinematics of the instruments (displacement, handling). This work was inspired by previous work by our group, in which biomechanical markers were analyzed during the performance of IADL activities to recognize the human functional pattern. Our proposal yielded an accuracy of 87.5% in recognizing strength-precision and displacement-handling movement patterns when evaluating the test database with information from segmented and non-segmented videos; only a single video changed its classification between the two subsets. This has great potential for the development of technological tools to support an automated model for the diagnosis of early Alzheimer's disease.
KEYWORDS: Image processing, Human vision and color perception, Image retrieval, RGB color model, Feature extraction, Cones, Databases, Eye, Visualization, Content based image retrieval
Bridging the semantic gap between the low-level visual features extracted by computers, such as color, texture, or shape, and the high-level semantic concepts perceived by humans is the main challenge in increasing the precision of semantic results in Content-Based Image Retrieval (CBIR). This challenge has been approached with the technique known as Relevance Feedback (RF). RF can be applied through two methods: biased subspace learning or query movement. The query-movement method is based on the Rocchio algorithm. In this paper, we present a new optimization of the Relevance Feedback technique through query movement to develop a CBIR system with better semantic precision. We modify the color-channel composition of the input images in the additive color space (Red, Green, Blue) and the perceptual additive color space (Hue, Saturation, Value) by representing the images according to human photopic vision behavior, which provides the semantic perception of colors. With the proposed representation we obtain a more accurate behavior of the Color Histogram (CH), Color Coherence Vector (CCV), and Local Binary Patterns (LBP) descriptors in the Rocchio algorithm and, thus, a query movement oriented more toward the user's semantics. The optimization performance was measured with a subset of 137 classes, with 100 images each, from the Caltech256 object database. The results show a significant improvement in semantic precision in comparison with the P. Mane RF method with prominent features, as well as with the performance of CBIR systems without RF using the mentioned descriptors.
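The query-movement step rests on the standard Rocchio update, which can be sketched as follows; the descriptor vectors and weights here are illustrative toy values, not the paper's tuned configuration.

```python
import numpy as np

def rocchio_update(query, relevant, non_relevant,
                   alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query vector toward the centroid of relevant feature
    vectors and away from non-relevant ones (standard Rocchio form)."""
    q = alpha * query
    if len(relevant):
        q += beta * np.mean(relevant, axis=0)
    if len(non_relevant):
        q -= gamma * np.mean(non_relevant, axis=0)
    return q

# Toy 3-bin color histograms standing in for CH/CCV/LBP descriptors.
query = np.array([0.2, 0.5, 0.3])
relevant = np.array([[0.1, 0.7, 0.2], [0.2, 0.6, 0.2]])
non_relevant = np.array([[0.8, 0.1, 0.1]])
print(rocchio_update(query, relevant, non_relevant))
```

After the update, the query sits closer to the images the user marked relevant, so the next retrieval round ranks semantically similar images higher.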
In areas such as computer vision, recognizing the content of an image is a topic of interest in applications such as search engines, biometric security, and autonomous cars, among others, since the computer must recognize all the objects an image may contain. This poses the challenge of localizing and classifying different objects inside a single image efficiently. In recent years, this challenge has been approached with region-based convolutional neural networks (R-CNN), systems that learn to recognize different objects from their representation in a series of images. The region proposal stage is essential to the performance of R-CNN in locating the individual objects of the image accurately and in the shortest time. In this article we propose a modification to a region-proposal method based on the density of SIFT-like feature points that describe the objects within the image. Regions are selected by a decision based on the values of the cumulative distribution function of a normal distribution constructed from the point density. The results show a significant reduction in the processing time required for object localization, with only slight variations in classification accuracy with respect to methods such as KDRP and selective search.
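The selection rule described above can be sketched as follows, assuming each candidate region is scored by its count of SIFT-like feature points; the density values and quantile threshold are hypothetical.

```python
import numpy as np
from scipy.stats import norm

def select_dense_regions(densities, threshold=0.5):
    """Keep candidate regions whose feature-point density falls above
    the given quantile of a normal distribution fitted to all densities."""
    mu, sigma = densities.mean(), densities.std()
    cdf = norm.cdf(densities, loc=mu, scale=sigma)
    return np.where(cdf >= threshold)[0]

# Hypothetical SIFT-like point counts for seven candidate regions.
densities = np.array([3., 5., 40., 52., 4., 61., 2.])
print(select_dense_regions(densities, threshold=0.6))  # → [2 3 5]
```

Discarding low-density candidates early is what reduces the time spent by the downstream R-CNN classifier.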
Nowadays there is a trend toward the use of unimodal databases for multimedia content description, organization, and retrieval applications that handle a single type of content, such as text, voice, or images. In contrast, bimodal databases make it possible to semantically associate two different types of content, such as audio-video or image-text, among others. Generating a bimodal audio-video database implies creating a connection between the multimedia content through the semantic relation that associates the actions of both types of information. This paper describes in detail the characteristics and methodology used to create a bimodal database of violent content; the semantic relationship is established by the proposed concepts that describe the audiovisual information. The use of bimodal databases in applications related to audiovisual content processing increases semantic performance if and only if these applications process both types of content. This bimodal database contains 580 annotated audiovisual segments, with a total duration of 28 minutes, divided into 41 classes. Bimodal databases are a tool for generating applications for the semantic web.
The automatic identification and classification of musical genres based on sound similarities that form musical textures is a very active research area. In this context, musical genre recognition systems have been created, composed of time-frequency feature extraction methods and classification methods. The selection of these methods is important for the good performance of recognition systems. In this article we propose Mel-Frequency Cepstral Coefficients (MFCC) as the feature extractor and Support Vector Machines (SVM) as the classifier for our system. The MFCC parameters established in the system through our time-frequency analysis represent the range of musical genres of Mexican culture considered in this article. For a musical genre classification system to be precise, the descriptors must represent the correct spectrum of each genre; to achieve this, a correct parametrization of the MFCC, like the one we present in this article, is required. With the developed system we obtain satisfactory detection results, where the lowest genre identification percentage was 66.67% and the highest precision was 100%.
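A minimal sketch of the MFCC-plus-SVM pipeline is shown below. Synthetic vectors stand in for per-track mean MFCCs (in practice these would be extracted from audio with a tool such as librosa), and the two toy "genres" with shifted means are purely illustrative.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# Stand-ins for per-track feature vectors: each row plays the role of
# the mean of 13 MFCCs computed over one track. Two toy genres whose
# coefficient means differ, so the classes are separable.
genre_a = rng.normal(loc=0.0, scale=1.0, size=(60, 13))
genre_b = rng.normal(loc=1.5, scale=1.0, size=(60, 13))
X = np.vstack([genre_a, genre_b])
y = np.array([0] * 60 + [1] * 60)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y)

clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X_train, y_train)
print(f"test accuracy: {clf.score(X_test, y_test):.2f}")
```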
KEYWORDS: Convolutional neural networks, Buildings, Image classification, RGB color model, Visualization, Cultural heritage, Image processing, Data modeling, Convolution, Video
We propose a convolutional neural network to classify images of buildings using sparse features at the network's input in conjunction with primary color pixel values. As a result, a trained neural model is obtained that classifies Mexican buildings into three classes according to architectural style: pre-Hispanic, colonial, and modern, with an accuracy of 88.01%. We face the problem of scarce information in the training dataset due to the unequal availability of cultural material, and propose a data augmentation and oversampling method to solve it. The results are encouraging and allow for prefiltering of content in search tasks.
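One way to sketch oversampling combined with augmentation is shown below, assuming horizontal flips as the augmentation; the class labels, image sizes, and flip-only scheme are toy choices for illustration, not the paper's exact method.

```python
import numpy as np

rng = np.random.default_rng(2)

def oversample_with_flips(images, labels):
    """Duplicate minority-class images (with random horizontal flips)
    until every class matches the majority-class count."""
    labels = np.asarray(labels)
    counts = {c: np.sum(labels == c) for c in np.unique(labels)}
    target = max(counts.values())
    out_imgs, out_labels = [images], [labels]
    for c, n in counts.items():
        deficit = target - n
        if deficit == 0:
            continue
        idx = rng.choice(np.where(labels == c)[0], size=deficit)
        extra = images[idx]                  # fancy indexing copies
        flip = rng.random(deficit) < 0.5
        extra[flip] = extra[flip, :, ::-1]   # horizontal flip
        out_imgs.append(extra)
        out_labels.append(np.full(deficit, c))
    return np.concatenate(out_imgs), np.concatenate(out_labels)

# Toy imbalanced set: 5 "pre-Hispanic", 2 "colonial", 1 "modern" image.
imgs = rng.random((8, 4, 4))
labs = np.array([0, 0, 0, 0, 0, 1, 1, 2])
bal_imgs, bal_labs = oversample_with_flips(imgs, labs)
print(np.bincount(bal_labs))  # → [5 5 5]
```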
In the computing world, the consumption and generation of multimedia content are in constant growth due to the popularization of mobile devices and new communication technologies. Retrieving information from multimedia content to describe Mexican buildings is a challenging problem. Our objective is to determine patterns related to three building eras: pre-Hispanic, colonial, and modern. For this purpose, existing recognition systems need to process a large number of videos and images. Automatic learning systems train their recognition capability with a semantically annotated database. We built the database taking into account high-level feature concepts, user knowledge, and experience. The annotations help correlate context and content to understand the data in multimedia files. Without a method, a user would need to remember everything and register this data manually. This article presents a methodology for quick image annotation using a graphical interface and intuitive controls, emphasizing the two most important aspects: the time consumed by the annotation task and the quality of the selected images. Accordingly, we classify images only by era and quality. Finally, we obtain a dataset of Mexican buildings that preserves contextual information with semantic annotations for the training and testing of building recognition systems. Research on low-level content descriptors is another possible use for this dataset.
Current search engines are based upon search methods that involve the combination of words (text-based search), which has been efficient until now. However, the Internet's growing demand shows more diversity with each passing day. Text-based searches are becoming limited, as most of the information on the Internet is found in different types of content denominated multimedia content (images, audio files, video files).
What needs to be improved in current search engines is search content and precision, as well as an accurate display of the results the user expects. Any search can be more precise if it uses more text parameters, but that does not improve the content or speed of the search itself. One solution is to improve search engines through the characterization of the content of multimedia files. In this article, an analysis of new-generation multimedia search engines is presented, focusing on the needs arising from new technologies.
Multimedia content has become a central part of the flow of information in our daily life. This reflects the need for multimedia search engines, as well as for knowing the real tasks they must fulfill. Through this analysis, it is shown that few search engines can perform content-based searches. Research on new-generation multimedia search engines is a multidisciplinary area in constant growth, generating tools that satisfy the different needs of new-generation systems.
Biometrics refers to identifying people through their physical or behavioral characteristics, such as fingerprints, face, DNA, hand geometry, and retina and iris patterns. Typically, the iris pattern is acquired at short distance to recognize a person; however, in the past few years it has become a challenge to identify a person by their iris pattern at a distance in non-cooperative environments. This challenge comprises: 1) high-quality iris images, 2) light variation, 3) blur reduction, 4) specular reflection reduction, 5) the distance from the acquisition system to the user, and 6) standardizing the iris size and the pixel density of the iris texture. Solving this challenge will add robustness and enhance iris recognition rates. For this reason, we describe the technical issues that must be considered during iris acquisition. Some of these considerations are the camera sensor, the lens, and the mathematical analysis of depth of field (DOF) and field of view (FOV) for iris recognition. Finally, based on these issues, we present experiments comparing captures obtained with our camera at a distance with captures obtained with cameras at very short distance.
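To illustrate why the DOF analysis matters at capture distance, the sketch below applies the standard thin-lens depth-of-field formulas with hypothetical parameters (a 200 mm lens at f/2.8, a 5 µm circle of confusion, subject at 1.5 m); the resulting in-focus zone is on the order of a millimeter, showing how tight the focus tolerance becomes for iris capture at a distance.

```python
# Thin-lens depth-of-field estimate for iris capture at a distance.
# All parameters below are hypothetical, chosen only to illustrate
# the order of magnitude of the problem.
f = 200.0    # focal length, mm
N = 2.8      # f-number
c = 0.005    # circle of confusion, mm (5 um)
s = 1500.0   # subject distance, mm

H = f * f / (N * c) + f                 # hyperfocal distance, mm
near = s * (H - f) / (H + s - 2 * f)    # near limit of sharp focus
far = s * (H - f) / (H - s)             # far limit of sharp focus
dof = far - near
print(f"DOF = {dof:.2f} mm (near {near:.1f} mm, far {far:.1f} mm)")
```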
Video annotation is important for web indexing and browsing systems. Indeed, in order to evaluate the performance of video query and mining techniques, databases with concept annotations are required. It is therefore necessary to generate a database with a semantic indexing that represents the digital content of the Mexican bullfighting environment. This paper proposes a scheme for making complex annotations in a video within the framework of a multimedia search engine project. Each video is partitioned using our segmentation algorithm, which creates shots of different lengths and different numbers of frames. To make complex annotations about the video, we use the ELAN software. The annotations are done in two steps: first, we take note of the whole content of each shot; second, we describe the actions as camera parameters such as direction, position, and depth. As a consequence, we obtain a more complete descriptor of every action. In both cases we use the concepts of the TRECVid 2014 dataset, and we also propose new concepts. This methodology allows generating a database with the information necessary to create descriptors and algorithms capable of detecting actions, in order to automatically index and classify new bullfighting multimedia content.
Multimedia content production and storage in repositories are now an increasingly widespread practice. Indexing concepts for search in multimedia libraries are very useful to the users of these repositories. However, content-based retrieval and automatic video tagging tools still lack consistency. Regardless of how these systems are implemented, it is vitally important to have many videos tagged with ground-truth concepts (training and testing sets). This paper describes a novel methodology for making complex annotations on video resources using the ELAN software. The concepts, related to Mexican nature, are annotated as High-Level Features (HLF) from the development set of TRECVID 2014 in a collaborative environment. Based on this set, each observed nature concept is tagged on each video shot using concepts of the TRECVid 2014 dataset. We also propose new concepts, such as tropical settings, urban scenes, actions, events, weather, and places, to name a few, as well as specific concepts that best describe video content of Mexican culture. We have been careful to tag the database with nature concepts and ground truth. It is evident that a collaborative environment is more suitable for the annotation of concepts related to ground truth and nature. As a result, a Mexican nature database was built, which also serves as the basis for testing and training sets to automatically classify new multimedia content of Mexican nature.
We present a new method for estimating non-planar rotations, i.e., rotations around axes parallel to the image plane, in the context of video compression applications. This method is based on a non-planar rotation model that assumes the moving objects have a planar surface. The proposed block-based motion estimation approach is performed between consecutive or non-consecutive images, which may contain large displacements, and aims at minimizing the motion compensation error. The efficiency of the method has been compared with the results obtained with the classical full-search block matching approach. Experiments were run on real video sequences. The results show a significant gain in terms of PSNR for the motion-compensated P or B frames compared with classical full-search block matching, while the coding cost of the additional motion information is very low, which demonstrates the interest of the proposed rotation model in the context of motion compensation for video compression.
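The classical full-search block matching baseline used for comparison can be sketched as follows; the block size, search range, and SAD cost are generic choices, not the paper's exact configuration.

```python
import numpy as np

def full_search_block_match(ref, cur, block, search):
    """Exhaustive (full-search) block matching: for each block of the
    current frame, find the displacement into the reference frame that
    minimizes the sum of absolute differences (SAD)."""
    h, w = cur.shape
    vectors = {}
    for by in range(0, h - block + 1, block):
        for bx in range(0, w - block + 1, block):
            tgt = cur[by:by+block, bx:bx+block]
            best, best_mv = np.inf, (0, 0)
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    y, x = by + dy, bx + dx
                    if y < 0 or x < 0 or y + block > h or x + block > w:
                        continue
                    sad = np.abs(ref[y:y+block, x:x+block] - tgt).sum()
                    if sad < best:
                        best, best_mv = sad, (dy, dx)
            vectors[(by, bx)] = best_mv
    return vectors

# Toy frames: a bright 4x4 patch moves 2 px right between ref and cur,
# so blocks covering it point back 2 px left into the reference.
ref = np.zeros((16, 16)); ref[4:8, 4:8] = 1.0
cur = np.zeros((16, 16)); cur[4:8, 6:10] = 1.0
mv = full_search_block_match(ref, cur, block=4, search=3)
print(mv[(4, 4)], mv[(4, 8)])  # → (0, -2) (0, -2)
```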