This paper describes an efficient method and system for representing, processing, and understanding multimodal sensory data. More specifically, it describes a computational method and system for processing and remembering multiple locations in multimodal sensory space (e.g., visual, auditory, somatosensory). The multimodal representation and memory are based on a biologically inspired hierarchy of spatial representations implemented with novel analogues of
real representations used in the human brain. The novelty of the work is in the computationally efficient and robust
spatial representation of 3D locations in multimodal sensory space as well as an associated working memory for storage
and recall of these representations at the desired level for goal-oriented action. We describe (1) a simple and efficient method for human-like hierarchical spatial representations of sensory data, along with how to associate, integrate, and convert between these representations (e.g., head-centered and body-centered coordinate systems); (2) a robust method for
training and learning a mapping of points in multimodal sensory space (e.g., camera-visible object positions, location of
auditory sources, etc.) to the above hierarchical spatial representations; and (3) a specification and implementation of a
hierarchical spatial working memory based on the above for storage and recall at the desired level for goal-oriented
action(s). This work is most useful for any machine or human-machine application that requires processing multimodal sensory inputs, making sense of them from a spatial perspective (e.g., where the sensory information originates with respect to the machine and its parts), and then taking goal-oriented action based on this spatial understanding. A multi-level spatial representation hierarchy means that heterogeneous sensory inputs (e.g., visual,
auditory, somatosensory, etc.) can map onto the hierarchy at different levels. When controlling various machine/robot degrees of freedom, the desired movements and actions can be computed from these different levels of the hierarchy. The most basic embodiment of this machine could be a pan-tilt camera system, an array of microphones, a machine with an arm/hand-like structure, and/or a robot with some or all of the above capabilities. We describe the approach and system and present preliminary results on a real robotic platform.
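The hierarchical representation and working memory described in items (1) and (3) can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes a simple eye-head-body frame hierarchy with one pan joint per frame, and all class and function names (`HierarchicalSpatialMemory`, `make_transform`, etc.) are hypothetical. It shows the core idea of storing a 3D location at one level of the hierarchy and recalling it at another via chained coordinate transforms.

```python
import numpy as np

def make_transform(pan_deg, translation):
    """Homogeneous 4x4 transform into the parent frame:
    rotation about the z-axis (pan) followed by a translation."""
    theta = np.radians(pan_deg)
    c, s = np.cos(theta), np.sin(theta)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    T[:3, 3] = translation
    return T

class HierarchicalSpatialMemory:
    """Stores 3D target locations at one level of an assumed
    eye -> head -> body spatial hierarchy, and recalls them
    at any enclosing level via chained frame transforms."""

    def __init__(self):
        # Map from each child frame to (parent frame, transform to parent).
        self.to_parent = {"eye": ("head", np.eye(4)),
                          "head": ("body", np.eye(4))}
        self.targets = {}  # target name -> (frame, homogeneous point)

    def set_joint(self, frame, pan_deg, offset):
        """Update a joint's pan angle and positional offset."""
        parent, _ = self.to_parent[frame]
        self.to_parent[frame] = (parent, make_transform(pan_deg, offset))

    def store(self, name, point_xyz, frame):
        """Remember a sensed 3D location in the frame where it was observed."""
        self.targets[name] = (frame, np.append(point_xyz, 1.0))

    def recall(self, name, level):
        """Recall a stored location expressed in `level` coordinates,
        walking up the hierarchy one transform at a time."""
        frame, p = self.targets[name]
        while frame != level:
            frame, T = self.to_parent[frame]
            p = T @ p
        return p[:3]

# A camera sees an object 1 m ahead in eye coordinates; the head is
# panned 90 degrees and mounted 0.3 m above the body origin.
mem = HierarchicalSpatialMemory()
mem.set_joint("eye", 0.0, [0.0, 0.0, 0.0])
mem.set_joint("head", 90.0, [0.0, 0.0, 0.3])
mem.store("cup", [1.0, 0.0, 0.0], frame="eye")
print(mem.recall("cup", "head"))  # the same point in head coordinates
print(mem.recall("cup", "body"))  # the same point in body coordinates
```

A heterogeneous sensor would simply call `store` with whatever frame it natively reports in (e.g., a microphone array at the head level), and an effector controlled at a different level would call `recall` at its own level, which is the sense in which inputs and actions attach to the hierarchy at different levels.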