In this paper we propose a new method for inferring human social interactions using techniques typically adopted in the literature for visual search and information retrieval. The main piece of information we use to discriminate among different types of interaction is provided by proxemic cues acquired by a tracker, which are used to distinguish between intentional and casual interactions. The proxemic information is obtained through the analysis of two different metrics: on the one hand we observe the current distance between subjects, and on the other hand we measure the O-space synergy between subjects. The obtained values are taken at every time step over a temporal sliding window and processed in the Discrete Fourier Transform (DFT) domain. The features are then merged into a single array and clustered using the K-means algorithm. The clusters are reorganized over a second, larger temporal window into a Bag of Words framework, so as to build the feature vector that feeds the SVM classifier.
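The first stage of the pipeline above, windowed DFT features over the two proxemic signals, can be sketched as follows. This is a minimal illustration, not the authors' implementation; the function name, window length, and hop size are illustrative assumptions, and the clustering and SVM stages are omitted.

```python
import numpy as np

def dft_window_features(distance, synergy, win=8, hop=4):
    """Slide a temporal window over two proxemic signals (inter-subject
    distance and O-space synergy) and describe each window by the
    magnitudes of its DFT coefficients, concatenated into one vector.
    Names and window parameters are illustrative, not from the paper."""
    feats = []
    for start in range(0, len(distance) - win + 1, hop):
        # one-sided magnitude spectrum of each signal in this window
        fd = np.abs(np.fft.rfft(distance[start:start + win]))
        fs = np.abs(np.fft.rfft(synergy[start:start + win]))
        # merge the two spectra into a single descriptor for the window
        feats.append(np.concatenate([fd, fs]))
    return np.array(feats)
```

In the full method, the per-window descriptors would then be quantized (e.g., by K-means) and pooled into Bag of Words histograms over the larger window before classification.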
Face analysis in a real-world environment is a complex task, as it must deal with challenging problems such as pose variations, illumination changes, and complex backgrounds. The use of active appearance models for facial feature detection is often successful in restricted environments, but performance decreases when they are applied in unconstrained environments. Therefore, in this paper, we introduce a novel method that integrates the knowledge of a face detector inside the shape and appearance models by using what we call a 'virtual structuring element' (VSE). In this way the possible settings of the active appearance models are constrained in an appearance-driven manner. The use of a virtual structuring element in an active appearance model yields increased accuracy and robustness over standard active appearance models applied to different environments.
Recent advances in computing, communications, and storage technology have made multimedia data prevalent. Multimedia has gained enormous potential for improving processes in a wide range of fields, such as advertising and marketing, education and training, entertainment, medicine, surveillance, wearable computing, biometrics, and remote sensing. The rich content of multimedia data, built through the synergies of the information contained in different modalities, calls for new and innovative methods for modeling, processing, mining, organizing, and indexing this data for effective and efficient searching, retrieval, delivery, management, and sharing of multimedia content, as required by applications in the above-mentioned fields. The objective of this paper is to present our views on the trends that should be followed when developing such methods, to elaborate on the related research challenges, and to introduce the new conference, Multimedia Content Analysis, Management and Retrieval, as a premium venue for presenting and discussing these methods with the scientific community. Starting from 2006, the conference will be held annually as a part of the IS&T/SPIE Electronic Imaging event.
In content-based image retrieval, the comparison of a query image with each of the database images is defined by a similarity distance obtained from the two feature vectors involved. These feature vectors can be seen as sets of noisy indexes. Unlike text matching, which is exact, image matching is only approximate, leading to ranking methods. Only images at the top ranks (within the scope) are returned as retrieval results. Image retrieval performance characterization has mainly been based on measures available from probabilistic text retrieval, in the form of Precision-Recall or Precision-Scope graphs. However, these graphs offer an incomplete overview of the image retrieval system under study. Essential information about how the success of the query is influenced by the size and type of the irrelevant images is missing. Due to the inexactness of the visual matching process, the irrelevant embedding, represented by the additional performance measure generality, plays an important role.
In general, a performance graph will therefore be three-dimensional: a Generality-Recall-Precision graph. By choosing appropriate scope values, a new two-dimensional performance graph, the Generality-Recall graph, is proposed to replace the commonly used Precision-Recall graph as the better choice for total-recall studies.
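The three quantities underlying such graphs follow directly from the standard counts. As a minimal sketch (the function name is illustrative), for a single query with a fixed scope:

```python
def retrieval_point(n_rel_retrieved, scope, n_rel, n_db):
    """One point on a performance graph for a single query.
    Precision: fraction of the returned scope that is relevant.
    Recall: fraction of all relevant images that were returned.
    Generality: fraction of the database that is relevant -- the
    measure that captures the size of the irrelevant embedding."""
    precision = n_rel_retrieved / scope
    recall = n_rel_retrieved / n_rel
    generality = n_rel / n_db
    return precision, recall, generality
```

Sweeping the scope for queries of varying generality traces out the three-dimensional graph described above.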
Recent technological advances have enabled human users to interact with computers in ways previously unimaginable. Beyond the confines of the keyboard and mouse, new modalities for human-computer interaction such as voice, gesture, and force feedback are emerging. Despite important advances, one necessary ingredient for natural interaction is still missing: emotions. Emotions play an important role in human-to-human communication and interaction, allowing people to express themselves beyond the verbal domain. The ability to understand human emotions is desirable for the computer in several applications. This paper explores new ways of human-computer interaction that enable the computer to be more aware of the user's emotional and attentional expressions. We present the basic research in the field and the recent advances in emotion recognition from facial, voice, and physiological signals, where the different modalities are treated independently. We then describe the challenging problem of multimodal emotion recognition and advocate the use of probabilistic graphical models for fusing the different modalities. We also discuss the difficult issues of obtaining reliable affective data, obtaining ground truth for emotion recognition, and the use of unlabeled data.
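The simplest probabilistic graphical model for such fusion assumes the modalities are conditionally independent given the emotion class (a naive Bayes structure). The sketch below is an illustration of that baseline only, not the model proposed in the paper; the function name is hypothetical.

```python
import numpy as np

def fuse_modalities(posteriors):
    """Late fusion of per-modality class posteriors under a
    conditional-independence (naive Bayes) assumption: multiply
    the per-modality distributions and renormalise. Each row of
    `posteriors` is one modality's distribution over emotion classes."""
    p = np.prod(np.asarray(posteriors, dtype=float), axis=0)
    return p / p.sum()
```

Richer graphical models relax this independence assumption by modeling dependencies between modalities and over time.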
In this paper, we present a general guideline for establishing the relation between a noise distribution model and its corresponding error metric. By designing error metrics, we obtain a much richer set of distance measures besides the conventional Euclidean distance, or SSD (sum of squared differences), and the Manhattan distance, or SAD (sum of absolute differences). The corresponding nonlinear estimations, such as the harmonic mean and the geometric mean, as well as their generalized nonlinear operations, are derived. This approach not only offers more flexibility than the conventional metrics but also discloses the coherent relation between the noise model and its corresponding error metric. We experiment with different error metrics for similarity noise estimation and compare the accuracy of the different methods in three kinds of applications: content-based image retrieval from a large database, stereo matching, and motion tracking in video sequences. In all the experiments, robust results are obtained for noise estimation based on the proposed error metric analysis.
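A well-known instance of this noise-model/error-metric correspondence can be sketched concretely: minimizing SSD (the maximum-likelihood criterion under Gaussian noise) yields the arithmetic mean, minimizing SAD (Laplacian noise) yields the median, and minimizing SSD on log-values (multiplicative, log-normal noise) yields the geometric mean. The code below illustrates only these textbook cases, not the paper's full derivation; the function name is illustrative.

```python
import numpy as np

def ml_estimate(samples, metric):
    """Location estimate that minimizes the given error metric,
    i.e., the ML estimator under the matching noise model."""
    x = np.asarray(samples, dtype=float)
    if metric == "ssd":        # Gaussian noise -> arithmetic mean
        return x.mean()
    if metric == "sad":        # Laplacian noise -> median
        return float(np.median(x))
    if metric == "log-ssd":    # log-normal noise -> geometric mean
        return float(np.exp(np.log(x).mean()))
    raise ValueError(metric)
```

Choosing the metric to match the actual noise distribution is what makes the resulting estimate robust.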
Content-based image retrieval has become one of the most active research areas in recent years. Most research attention has been focused on indexing techniques based on global feature distributions. However, these global distributions have limited discriminating power because they are unable to capture local image information. Applying global Gabor texture features greatly improves retrieval accuracy, but they are computationally complex. In this paper, we present a wavelet-based salient point extraction algorithm. We show that extracting color and texture information at the locations given by these points yields significantly improved results in terms of retrieval accuracy, computational complexity, and storage space of feature vectors, compared to global feature approaches.
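The underlying idea, wavelet detail coefficients with large magnitude mark locations of strong local variation, can be illustrated with a toy single-level Haar decomposition. This is a deliberately simplified sketch, not the multiscale tracking algorithm of the paper; the function name and parameters are illustrative.

```python
import numpy as np

def haar_salient_points(img, n_points=5):
    """Rank 2x2 blocks by Haar detail energy and return the top-left
    corners of the most salient blocks. By Parseval's relation, the
    sum of squared deviations from a 2x2 block's mean equals the
    energy of its LH, HL, and HH Haar detail coefficients."""
    a = np.asarray(img, dtype=float)
    h, w = a.shape[0] // 2 * 2, a.shape[1] // 2 * 2
    b = a[:h, :w].reshape(h // 2, 2, w // 2, 2)
    mean = b.mean(axis=(1, 3), keepdims=True)
    energy = ((b - mean) ** 2).sum(axis=(1, 3))  # detail energy per block
    idx = np.argsort(energy.ravel())[::-1][:n_points]
    rows, cols = np.unravel_index(idx, energy.shape)
    return list(zip(2 * rows, 2 * cols))
```

In the full algorithm, local color and texture descriptors would then be computed only in small neighborhoods around such points, which is where the savings in computation and storage come from.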
The growing capacity of computers, the abundance of digital cameras, and the increased connectivity of the world all point to large digital multimedia archives. These include images and videos from the World Wide Web, museum objects, flowers, trademarks, and views from everyday life. The faster these archives grow, the more prominent the need for efficient access to the content of the images and videos becomes.
In this short course, we will give a survey of the most recent developments on image and video search engines. First, the important step of feature extraction will be discussed in detail including color, shape, and texture information, with particular attention to discriminatory power and invariance. We will then focus on the concepts of indexing and genre classification as an intermediate step to sort the data. We will pay attention to interactive ways to perform browsing and retrieval by means of information visualization and relevance feedback. Methods will be discussed to localize the retrieved objects in their images.