Recent advances in saliency detection have used deep learning to obtain high-level features for detecting salient regions, and have demonstrated superior results over previous works that rely on handcrafted low-level features. We propose a convolutional neural network (CNN) model that learns high-level features for saliency detection. Compared to other methods, ours has two merits. First, during feature extraction, in addition to the convolution and pooling steps, we add a restricted Boltzmann machine to the CNN framework to obtain more accurate features at the intermediate stage. Second, to avoid manually annotated data, we add a deep belief network classifier at the end of the model to classify salient and non-salient regions. Quantitative and qualitative experiments on three benchmark datasets demonstrate that our method performs favorably against state-of-the-art methods.
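The abstract above inserts a restricted Boltzmann machine between convolutional stages. As a hedged illustration only (not the authors' implementation; the class name, sizes, and learning rate here are invented for the sketch), a minimal Bernoulli RBM trained with one-step contrastive divergence (CD-1, using mean-field probabilities rather than samples) can be written in NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)

class RBM:
    """Minimal Bernoulli RBM trained with one-step contrastive divergence
    (CD-1), sketching an intermediate feature-refinement stage."""

    def __init__(self, n_vis, n_hid, lr=0.1):
        self.W = rng.normal(0.0, 0.01, (n_vis, n_hid))  # visible-hidden weights
        self.b_v = np.zeros(n_vis)                      # visible biases
        self.b_h = np.zeros(n_hid)                      # hidden biases
        self.lr = lr

    @staticmethod
    def _sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def hidden_probs(self, v):
        return self._sigmoid(v @ self.W + self.b_h)

    def cd1(self, v0):
        """One CD-1 update on a batch v0; returns reconstruction error."""
        h0 = self.hidden_probs(v0)
        v1 = self._sigmoid(h0 @ self.W.T + self.b_v)  # mean-field reconstruction
        h1 = self.hidden_probs(v1)
        n = len(v0)
        self.W += self.lr * (v0.T @ h0 - v1.T @ h1) / n
        self.b_v += self.lr * (v0 - v1).mean(axis=0)
        self.b_h += self.lr * (h0 - h1).mean(axis=0)
        return float(np.mean((v0 - v1) ** 2))
```

Repeated `cd1` calls on a fixed batch of binary feature vectors should drive the reconstruction error down, which is the sense in which the RBM refines intermediate features.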
KEYWORDS: Visualization, Signal to noise ratio, Visual process modeling, Computer science, Target detection, Electronic imaging, Neurons, Human vision and color perception, Visual analytics, Statistical modeling
Visual attention to salient and relevant scene regions is crucial for an animal's survival in
the natural world. It is guided by a complex interplay of at least two factors: image-driven,
bottom-up salience [1] and knowledge-driven, top-down guidance [2, 3]. For instance, a ripe
red fruit among green leaves captures visual attention due to its bottom-up salience, while a
non-salient camouflaged predator is detected through top-down guidance to known predator
locations and features. Although both bottom-up and top-down factors are important for
guiding visual attention, most existing models and theories are either purely top-down [4]
or bottom-up [5, 6]. Here, we present a combined model of bottom-up and top-down visual
attention.
We are developing a distributed system for tracking people and objects in complex scenes and environments using biologically based algorithms. An important component of such a system is its ability to track targets from multiple cameras at multiple viewpoints. As such, our system must be able to extract and analyze the features of targets in a manner that is sufficiently invariant to viewpoint, so that the cameras can share information about targets for purposes such as tracking. Since biological organisms are able to describe targets to one another from very different visual perspectives, it is hoped that by discovering the mechanisms by which they understand objects, such abilities can be imparted to a system of distributed agents with many camera viewpoints. Our current methodology draws from work on saliency and center-surround competition among visual components, which allows for real-time location of targets without the need for prior information about a target's visual features. For instance, gestalt principles of color opponency, continuity and motion form a basis for locating targets in a logical manner. From this, targets can be located and tracked relatively reliably for short periods. Features can then be extracted from salient targets, allowing a signature to be stored which describes the basic visual features of a target. This signature can then be used to share target information with other cameras at other viewpoints, or may be used to create the prior information needed for other types of trackers. Here we discuss such a system which, without the need for prior target feature information, extracts salient features from a scene, binds them, and uses the bound features as a set for understanding trackable objects.
One of the important components of a multi-sensor “intelligent” room, which can observe, track and react to its occupants, is a multi-camera system. This system involves the development of algorithms that enable a set of cameras to communicate and cooperate with each other effectively so that they can monitor the events happening in the room. To achieve this, the cameras typically must first build a map of their relative locations. In this paper, we discuss a novel RF-based technique for estimating distances between cameras. The proposed RF algorithm can estimate distances with relatively good accuracy even in the presence of random noise.
We discuss a toolkit for use in scene understanding where prior information about targets is not necessarily available. As such, we give it a notion of connectivity so that it can classify features in an image for the purpose of tracking and identification. The tool, VFAT (Visual Feature Analysis Tool), is designed to work in real time in an intelligent multi-agent room. It is built around a modular design and includes several fast vision processes. The first components discussed perform feature selection using visual saliency and Monte Carlo selection. Features selected from an image are then mixed into more useful and complex features. All features are then reduced in dimension and contrasted using a combination of Independent Component Analysis and Principal Component Analysis (ICA/PCA). Once this has been done, we classify features using a custom non-parametric classifier (NPclassify) that does not require hard parameters such as class size or number of classes, so that VFAT can create classes without stringent priors about class structure. These classes are then generalized using Gaussian regions, which allows easier storage of class properties and computation of probabilities for class matching. To speed up the creation of Gaussian regions, we use a system of rotations instead of the traditional pseudo-inverse method. In addition to discussing the structure of VFAT, we discuss training of the current system, which is relatively easy to perform. ICA/PCA is trained by giving VFAT a large number of random images; the ICA/PCA matrix is computed from features extracted by VFAT. The non-parametric classifier NPclassify is trained by presenting it with images of objects and having it decide how many objects it thinks it sees. The difference between what it sees and what it is supposed to see, in terms of the number of objects, is used as the error term, allowing VFAT to learn to classify based upon the experimenter's subjective idea of good classification.
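As a rough illustration of the dimension-reduction step described above, here is a plain-PCA stand-in via SVD (a sketch only: VFAT's actual ICA/PCA machinery is more involved, and `pca_reduce` is a hypothetical helper name):

```python
import numpy as np

def pca_reduce(X, k):
    """Project feature vectors (rows of X) onto their top-k principal
    components: center the data, take the SVD, keep k right-singular
    vectors as the projection basis."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T
```

In a full pipeline the PCA output would then be rotated by an ICA unmixing matrix; here only the variance-preserving reduction is shown.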
We apply a biologically-motivated algorithm that selects visually-salient regions of interest in video streams to multiply-foveated video compression. Regions of high encoding priority are selected based on nonlinear integration of low-level visual cues, mimicking processing in primate occipital and posterior parietal cortex. A dynamic foveation filter then blurs (foveates) every frame, increasingly with distance from high-priority regions. Two variants of the model (one with continuously-variable blur proportional to saliency at every pixel, and the other with blur proportional to distance from three independent foveation centers) are validated against eye fixations from 4-6 human observers on 50 video clips (synthetic stimuli, video games, outdoors day and night home video, television newscasts, sports, talk shows, etc.). Significant overlap is found between human and algorithmic foveations on every clip with one variant, and on 48 of 50 clips with the other. Compressed file sizes are reduced by half on average for foveated compared to unfoveated clips. These results suggest the general-purpose usefulness of the algorithm in improving compression ratios of unconstrained video.
We have developed a method for clustering features into objects by taking those features which include intensity,
orientations and colors from the most salient points in an image as determined by our biologically motivated
saliency program. We can train a program to cluster these features by only supplying as training input the number of
objects that should appear in an image. We do this using a technique that links nodes in a minimum spanning tree not only by distance, but by a density metric as well. We can then form classes over objects, or perform object segmentation, in a novel validation set by training over a set of seven soft and hard parameters. We also discuss the uses of such a flexible method in landmark-based navigation, since a robot using such a method may have a better ability to generalize over features and objects.
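A minimal sketch of MST-based clustering in the spirit described above, cutting long edges of a minimum spanning tree (the density metric and the seven learned parameters of the actual method are omitted, and these function names are invented for illustration):

```python
import numpy as np

def mst_edges(points):
    """Prim's algorithm on Euclidean distances; returns (i, j, dist) edges."""
    n = len(points)
    visited = np.zeros(n, dtype=bool)
    visited[0] = True
    best = np.linalg.norm(points - points[0], axis=1)  # best dist into the tree
    parent = np.zeros(n, dtype=int)
    edges = []
    for _ in range(n - 1):
        j = int(np.argmin(np.where(visited, np.inf, best)))
        edges.append((int(parent[j]), j, float(best[j])))
        visited[j] = True
        d = np.linalg.norm(points - points[j], axis=1)
        closer = d < best
        best = np.where(closer, d, best)
        parent = np.where(closer, j, parent)
    return edges

def cluster(points, cut):
    """Drop MST edges longer than `cut`; label the remaining components
    with a tiny union-find."""
    labels = np.arange(len(points))
    def find(i):
        while labels[i] != i:
            i = labels[i]
        return i
    for i, j, d in mst_edges(points):
        if d <= cut:
            labels[find(i)] = find(j)
    return np.array([find(i) for i in range(len(points))])
```

Cutting purely by edge length is the simplest variant; the method in the abstract additionally weighs local density when deciding which links to keep.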
We describe a neurobiological model of visual attention and eye/head movements in primates, and its application to the automatic animation of a realistic virtual human head watching an unconstrained variety of visual inputs. The bottom-up (image-based) attention model is based on the known neurophysiology of visual processing along the occipito-parietal pathway of the primate brain, while the eye/head movement model is derived from recordings in freely behaving Rhesus monkeys. The system is successful at autonomously saccading towards and tracking salient targets in a variety of video clips, including synthetic stimuli, real outdoors scenes and gaming console outputs. The resulting virtual human eye/head animation yields realistic rendering of the simulation results, both suggesting applicability of this approach to avatar animation and reinforcing the plausibility of the neural model.
Utilizing off-the-shelf, low-cost parts, we have constructed a robot that is small, light, powerful and relatively inexpensive (< $3900). The system is constructed around the Beowulf concept of linking multiple discrete computing units into a single cooperative system. The goal of this project is to demonstrate a new robotics platform with sufficient computing resources to run biologically-inspired vision algorithms in real time. This is accomplished by connecting two dual-CPU embedded PC motherboards using fast gigabit Ethernet. The motherboards contain integrated Firewire, USB and serial connections to handle camera, servomotor, GPS and other miscellaneous inputs/outputs. Computing systems are mounted on a servomechanism-controlled off-the-shelf “Off Road” RC car. Using the high performance characteristics of the car, the robot can attain relatively high speeds outdoors. The robot is used as a test platform for biologically-inspired as well as traditional robotic algorithms, in outdoor navigation and exploration activities. Leader following using multi-blob tracking and segmentation, and navigation using statistical information and decision inference from image spectral information, are discussed. The design of the robot is open-source and is constructed in a manner that enhances ease of replication. This is done to facilitate construction and development of mobile robots at research institutions where large financial resources may not be readily available, as well as to put robots into the hands of hobbyists and help lead to the next stage in the evolution of robotics: a home hobby robot with potential real-world applications.
In view of the growing complexity of computational tasks and their design, we propose that certain interactive systems may be better designed by utilizing computational strategies based on the study of the human brain. Compared with current engineering paradigms, brain theory offers the promise of improved self-organization and adaptation to the current environment, freeing the programmer from having to address those issues in a procedural manner when designing and implementing large-scale complex systems. To advance this hypothesis, we discuss a multi-agent surveillance system where 12 agent CPUs, each with its own camera, compete and cooperate to monitor a large room. To cope with the overload of image data streaming from 12 cameras, we take inspiration from the primate visual system, which allows the animal to perform real-time selection of the few most conspicuous locations in visual input. This is accomplished by having each camera agent utilize the bottom-up, saliency-based visual attention algorithm of Itti and Koch (Vision Research 2000;40(10-12):1489-1506) to scan the scene for objects of interest. Real-time operation is achieved using a distributed version that runs on a 16-CPU Beowulf cluster composed of the agent computers. The algorithm guides cameras to track and monitor salient objects based on maps of color, orientation, intensity, and motion. To spread camera viewpoints or create cooperation in monitoring highly salient targets, camera agents bias each other by increasing or decreasing the weights of different feature vectors in other cameras, using mechanisms similar to excitation and suppression that have been documented in electrophysiology, psychophysics and imaging studies of low-level visual processing. In addition, if cameras need to compete for computing resources, allocation of computational time is weighed based upon the history of each camera.
A camera agent that has a history of seeing more salient targets is more likely to obtain computational resources. The system demonstrates the viability of biologically inspired systems in real-time tracking. In future work we plan to implement additional biological mechanisms for cooperative management of both the sensor and processing resources in this system, including top-down biasing for target specificity, as well as novelty and the activity of the tracked object in relation to sensitive features of the environment.
We describe a new mobile robotics platform specifically designed for the implementation and testing of neuromorphic vision algorithms in unconstrained outdoors environments. The new platform includes significant computational power (four 1.1GHz CPUs with gigabit interconnect), a high-speed four-wheel-drive chassis, standard Linux operating system, and a comprehensive toolkit of C++ vision classes. The robot is designed with two major goals in mind: real-time operation of sophisticated neuromorphic vision algorithms, and off-the-shelf components to ensure rapid technological evolvability. A preliminary embedded neuromorphic vision architecture that includes attentional, gist/layout, object recognition, and high-level decision subsystems is finally described.
When confronted with cluttered natural environments, animals still perform orders of magnitude better than artificial vision systems in visual tasks such as orienting, target detection, navigation and scene understanding. To better understand biological visual processing, we have developed a neuromorphic model of how our visual attention is attracted towards conspicuous locations in a visual scene. It replicates processing in the dorsal ('where') visual stream in the primate brain. The model includes a bottom-up (image-based) computation of low-level color, intensity, orientation and flicker features, as well as a nonlinear spatial competition that enhances salient locations in each feature channel. All feature channels feed into a unique scalar 'saliency map' which controls where to next focus attention onto. In this article, we discuss a parallel implementation of the model which runs at 30 frames/s on a 16-CPU Beowulf cluster, and the role of flicker (temporal derivatives) cues in computing salience. We show how our simple within-feature competition for salience effectively suppresses strong but spatially widespread motion transients resulting from egomotion. The model robustly detects salient targets in live outdoors video streams, despite large variations in illumination, clutter, and rapid egomotion. The success of this approach suggests that neuromorphic vision algorithms may prove unusually robust for outdoors vision applications.
We describe an integrated vision system which reliably detects persons in static color natural scenes, or other targets among distracting objects. The system is built upon the biologically-inspired synergy between two processing stages: A fast trainable visual attention front-end (where), which rapidly selects a restricted number of conspicuous image locations, and a computationally expensive object recognition back-end (what), which determines whether the selected locations are targets of interest. We experiment with two recognition back-ends: One uses a support vector machine algorithm and achieves highly reliable recognition of pedestrians in natural scenes, but is not particularly biologically plausible, while the other is directly inspired from the neurobiology of inferotemporal cortex, but is not yet as robust with natural images. Integrating the attention and recognition algorithms yields substantial speedup over exhaustive search, while preserving detection rate. The success of this approach demonstrates that using a biological attention-based strategy to guide an object recognition system may represent an efficient strategy for rapid scene analysis.
Bottom-up or saliency-based visual attention allows primates to detect non-specific conspicuous targets in cluttered scenes. A classical metaphor, derived from electrophysiological and psychophysical studies, describes attention as a rapidly shiftable 'spotlight'. The model described here reproduces the attentional scanpaths of this spotlight: Simple multi-scale 'feature maps' detect local spatial discontinuities in intensity, color, orientation or optical flow, and are combined into a unique 'master' or 'saliency' map. The saliency map is sequentially scanned, in order of decreasing saliency, by the focus of attention. We study the problem of combining feature maps, from different visual modalities and with unrelated dynamic ranges, into a unique saliency map. Four combination strategies are compared using three databases of natural color images: (1) Simple normalized summation, (2) linear combination with learned weights, (3) global non-linear normalization followed by summation, and (4) local non-linear competition between salient locations. Performance was measured as the number of false detections before the most salient target was found. Strategy (1) always yielded poorest performance and (2) best performance, with a 3- to 8-fold improvement in time to find a salient target. However, (2) yielded specialized systems with poor generalization. Interestingly, strategy (4) and its simplified, computationally efficient approximation (3) yielded significantly better performance than (1), with up to 4-fold improvement, while preserving generality.
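Strategy (3), global non-linear normalization followed by summation, can be sketched as follows. This is a simplified stand-in for the normalization operator: each map is rescaled to [0, 1] and weighted by the squared difference between its global maximum and its mean, so maps with one dominant peak are promoted and maps with many comparable peaks are suppressed (the published operator uses the mean of local maxima rather than the global mean):

```python
import numpy as np

def normalize_map(fmap):
    """Rescale a feature map to [0, 1], then weight it by (max - mean)^2:
    a single strong peak keeps a high weight, uniform activity is squashed."""
    fmap = fmap - fmap.min()
    if fmap.max() > 0:
        fmap = fmap / fmap.max()
    return fmap * (fmap.max() - fmap.mean()) ** 2

def saliency(feature_maps):
    """Sum normalized maps from different modalities into one saliency map."""
    return sum(normalize_map(m) for m in feature_maps)
```

A flat map contributes nothing after normalization, so a single-peak map dominates the combined saliency map, mirroring why strategy (3) preserves generality across modalities with unrelated dynamic ranges.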