Scene perception has become increasingly important with the rapid development of artificial intelligence and autonomous driving. Unsupervised deep learning methods have shown a degree of robustness and accuracy in some challenging scenes, and because they infer depth from a single input image without ground-truth labels, they save considerable time and resources. However, unsupervised depth estimation still lacks robustness and accuracy in complex environments, which can be improved by modifying the network structure and incorporating information from other modalities. In this paper, we propose an unsupervised monocular depth estimation network that achieves high speed and accuracy, together with a learning framework built around this network that improves depth performance by incorporating images transformed across modalities. The depth estimator is an encoder-decoder network that generates multi-scale dense depth maps. Sub-pixel convolutional layers replace the up-sampling branches to obtain super-resolved depth. Cross-modal depth estimation using near-infrared and RGB images improves on depth estimation from RGB images alone. During training, both images are transferred to the same modality, and super-resolved depth estimation is then carried out for each stereo camera pair. Experiments show that, compared with the initial results obtained using only RGB images, our depth estimation network with the proposed cross-modal fusion system achieves better performance on public datasets and on a multi-modal dataset collected with our stereo vision sensor.
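The sub-pixel convolutional layer mentioned in this abstract ends with a fixed rearrangement step, often called pixel shuffle, that trades channels for spatial resolution. The following is a minimal pure-Python sketch of that rearrangement only, assuming the common channel ordering in which input channel c*r*r + i*r + j feeds output pixel offset (i, j). The function name `pixel_shuffle` and the nested-list tensor layout are illustrative assumptions, not the authors' implementation.

```python
def pixel_shuffle(x, r):
    """Rearrange a (C*r*r, H, W) nested list into (C, H*r, W*r).

    Assumed convention: input channel c*r*r + i*r + j supplies the value
    at sub-pixel offset (i, j) inside each r x r output block.
    """
    channels_in = len(x)
    if r < 1 or channels_in % (r * r) != 0:
        raise ValueError("channel count must be a multiple of r*r")
    c_out = channels_in // (r * r)
    h, w = len(x[0]), len(x[0][0])

    # Allocate the upscaled output, then fill it block by block.
    out = [[[0.0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for i in range(r):
            for j in range(r):
                src = x[c * r * r + i * r + j]
                for row in range(h):
                    for col in range(w):
                        out[c][row * r + i][col * r + j] = src[row][col]
    return out
```

In a depth decoder of this kind, a convolution would first produce the C*r*r channels at low resolution, and this rearrangement would then yield a depth map r times larger in each spatial dimension without a separate interpolation step.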
Visually impaired (VI) people around the world have difficulty socializing and traveling because of the limitations of traditional assistive tools. In recent years, practical assistance systems for scene text detection and recognition have allowed VI people to obtain text information from their surroundings. However, real-world scene text features complex backgrounds, low resolution, variable fonts, and irregular arrangement, which make robust scene text detection and recognition difficult. In this paper, we propose a scene text recognition system to help VI people. First, we propose a high-performance neural network that detects and tracks objects and is applied to specific scenes to obtain regions of interest (ROIs). To achieve real-time detection, we build a lightweight deep neural network using depth-wise separable convolutions, which enables the system to be integrated into mobile devices with limited computational resources. Second, we train the neural network on textural features to improve the precision of text detection. Our algorithm suppresses the effects of spatial transformations (including translation, scaling, rotation, and other geometric transformations) by means of spatial transformer networks. An open-source optical character recognition (OCR) engine is trained on the scene text individually to improve the accuracy of text recognition. The interactive system ultimately conveys the number and distance of inbound buses to visually impaired users. Finally, a comprehensive set of experiments on several benchmark datasets demonstrates that our algorithm achieves a strong trade-off between precision and resource usage.
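The resource savings from depth-wise separable convolutions can be illustrated by counting weights. The sketch below compares a standard k x k convolution with its depth-wise separable counterpart (a k x k depth-wise filter per input channel followed by a 1 x 1 point-wise convolution), ignoring biases. The function names and the layer sizes used in the usage note are illustrative assumptions and do not come from the paper.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (biases ignored)."""
    return k * k * c_in * c_out


def separable_conv_params(k, c_in, c_out):
    """Weights in a depth-wise separable convolution (biases ignored)."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


def reduction_ratio(k, c_in, c_out):
    """Separable / standard weight count; equals 1/c_out + 1/(k*k)."""
    return separable_conv_params(k, c_in, c_out) / standard_conv_params(k, c_in, c_out)
```

For a hypothetical layer with k = 3, 32 input channels, and 64 output channels, the standard convolution has 18432 weights and the separable version has 2336, roughly an eightfold reduction.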
In recent years, with the development of computer vision and robotics, a wide variety of localization approaches have been proposed. However, it remains challenging to design a localization algorithm that performs well in both indoor and outdoor environments. In this paper, we propose an algorithm that fuses a camera, an IMU, GPS, and a digital compass to solve this problem. Our algorithm has two phases: (1) the monocular RGB camera and the IMU are fused into a visual-inertial odometry (VIO) module that estimates the approximate orientation and position; (2) the absolute position and orientation measured by the GPS and the digital compass are merged with the first-phase estimates to obtain a refined result in the world coordinate frame. A bag-of-words-based algorithm is used for loop detection and relocalization. We also built a prototype and conducted two experiments to evaluate the effectiveness and robustness of the localization algorithm in both indoor and outdoor environments.
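The abstract does not state how the second phase merges the VIO estimate with the GPS and compass measurements. As one plausible illustration, the sketch below uses an inverse-variance weighted update (the scalar Kalman update) for a position coordinate, and the same update with angle wrapping for heading. The function names and the variance inputs are assumptions made for this example and should not be read as the authors' method.

```python
import math


def fuse_scalar(est, var_est, meas, var_meas):
    """Blend an estimate with a measurement, weighting by inverse variance."""
    gain = var_est / (var_est + var_meas)
    fused = est + gain * (meas - est)
    fused_var = (1.0 - gain) * var_est
    return fused, fused_var


def wrap_angle(angle):
    """Map an angle in radians into the interval [-pi, pi)."""
    return (angle + math.pi) % (2.0 * math.pi) - math.pi


def fuse_heading(est, var_est, meas, var_meas):
    """Same update for headings; the difference is wrapped so that
    179 degrees and -179 degrees are treated as 2 degrees apart."""
    gain = var_est / (var_est + var_meas)
    innovation = wrap_angle(meas - est)
    fused = wrap_angle(est + gain * innovation)
    fused_var = (1.0 - gain) * var_est
    return fused, fused_var
```

In such a scheme, `fuse_scalar` would be applied per coordinate to the VIO position and the GPS fix, and `fuse_heading` to the VIO yaw and the compass reading. When GPS is unavailable indoors, a very large `var_meas` leaves the VIO estimate nearly unchanged.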
Feature matching underlies many computer vision algorithms such as SLAM, a technology widely used in areas ranging from intelligent vehicles (IV) to assistance for the visually impaired (VI). This article presents an improved detector and a novel semantic-visual descriptor, coined SORB (Semantic ORB), which combines binary semantic labels with the traditional ORB descriptor. Compared with the original ORB feature, SORB achieves a more uniform distribution of features and more accurate matching. We demonstrate this through experiments on several open-source datasets and real-world images captured with a RealSense device.
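The abstract says that SORB combines binary semantic labels with the ORB descriptor but does not give the encoding. The sketch below shows one hypothetical way to do this: a one-hot semantic code is placed above the binary descriptor bits, and descriptors are compared by Hamming distance, as is usual for binary features. The names `make_semantic_descriptor`, `hamming`, and `match`, the one-hot encoding, and the integer bit-string representation are all assumptions made for illustration.

```python
def hamming(a, b):
    """Number of differing bits between two descriptors stored as ints."""
    return bin(a ^ b).count("1")


def make_semantic_descriptor(orb_bits, label, num_classes, orb_len=256):
    """Place a one-hot semantic code above the binary descriptor bits.

    Hypothetical encoding: two features with different labels differ
    in exactly 2 extra bits, in addition to their visual distance.
    """
    if not 0 <= label < num_classes:
        raise ValueError("label out of range")
    if not 0 <= orb_bits < (1 << orb_len):
        raise ValueError("descriptor does not fit in orb_len bits")
    return ((1 << label) << orb_len) | orb_bits


def match(query, train, max_dist):
    """Brute-force nearest neighbour; returns (query_idx, train_idx, dist)."""
    if not train:
        return []
    matches = []
    for qi, q in enumerate(query):
        ti, dist = min(
            ((i, hamming(q, t)) for i, t in enumerate(train)),
            key=lambda pair: pair[1],
        )
        if dist <= max_dist:
            matches.append((qi, ti, dist))
    return matches
```

With this encoding, a semantic mismatch acts as a fixed penalty on the matching distance. A stricter design could reject cross-label matches outright, and either choice is a design decision that the abstract leaves open.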