Amblyopia, also known as lazy eye, affects 2-3% of children. If amblyopia is not treated successfully during early childhood, it will persist into adulthood. One cause of amblyopia is strabismus, a misalignment of the eyes. In this paper, we have investigated several neural network architectures as universal feature extractors for two tasks: (1) classification of eye images to detect strabismus, and (2) detection of the need for referral to a specialist. We have examined several state-of-the-art backbone architectures for feature extraction, as well as several classifier frameworks. Through these experiments, we observed that the combination of VGG19 features and a random forest classifier offers the best overall performance for both classification tasks. We also observed that fusing the top-performing architectures, even with simple rules such as a median filter, improves overall performance.
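The abstract does not include code, so the following is a minimal sketch of the described pipeline, assuming an ImageNet-pretrained VGG19 backbone from Keras, a scikit-learn random forest, and hypothetical NumPy arrays of eye images and labels; the image size, number of trees, and file names are illustrative assumptions, not the authors' settings.

```python
# Illustrative sketch only: dataset loader, image size, and hyperparameters
# below are assumptions, not the settings used in the paper.
import numpy as np
from tensorflow.keras.applications import VGG19
from tensorflow.keras.applications.vgg19 import preprocess_input
from sklearn.ensemble import RandomForestClassifier

# VGG19 without its classification head acts as the universal feature extractor.
backbone = VGG19(weights="imagenet", include_top=False, pooling="avg")

def extract_features(images):
    """images: float array of shape (N, 224, 224, 3), RGB order."""
    return backbone.predict(preprocess_input(images), verbose=0)

# Hypothetical arrays: eye images and binary labels (strabismus vs. normal,
# or refer vs. no-refer for the referral task).
train_images, train_labels = np.load("train_x.npy"), np.load("train_y.npy")
test_images = np.load("test_x.npy")

clf = RandomForestClassifier(n_estimators=500, random_state=0)
clf.fit(extract_features(train_images), train_labels)
pred = clf.predict(extract_features(test_images))

# Fusion of several backbones could take the per-sample median of their
# predicted probabilities (one simple rule the abstract mentions), e.g.:
# fused = np.median(np.stack([p_vgg19, p_resnet, p_densenet]), axis=0)
```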
In this work, we aim to address the needs of human analysts to automatically summarize the content of large swaths of overhead imagery. We present our approach to this problem using deep neural networks, providing detection and segmentation information to enable fine-grained description of scene content for human consumption. Four different perception systems were run on blocks of large-scale satellite imagery: (1) semantic segmentation of roads, buildings, and vegetation; (2) zone segmentation to identify commercial, industrial, residential, and airport zones; (3) classification of objects such as helipads, silos, and water towers; and (4) object detection to find vehicles. Results are filtered based on a user's zoom level in the swath and subsequently summarized as textual bullets and statistics. Our framework tiles the image swaths into blocks at a resolution of approximately 30 cm for each perception system. For semantic segmentation, overlapping imagery is processed to avoid edge artifacts and improve segmentation results by voting for the category label of each pixel visible from multiple chips. Our approach to zone segmentation is based on classification models that vote for a chip belonging to a particular zone type; regions surrounded by chips classified as a particular category are assigned a higher score. We also provide an overview of our experience using OpenStreetMap (OSM) for pixel-wise annotations (semantic segmentation), image-level labels (classification), and end-to-end captioning (image to text). These capabilities are envisioned to aid the human analyst through an interactive user interface, whereby scene content is automatically summarized and updated as the user pans and zooms within the imagery.
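As an illustration of the overlapping-chip voting described above, the sketch below accumulates per-pixel votes from a sliding window of chips and takes the plurality label; the chip size, stride, number of classes, and the `segment_chip` callable are assumptions made for illustration, not the authors' published code.

```python
# Sketch of per-pixel label voting over overlapping chips (values assumed).
import numpy as np

CHIP, STRIDE, NUM_CLASSES = 512, 256, 4  # 50% overlap between chips

def vote_segmentation(swath, segment_chip):
    """swath: (H, W, 3) image block; segment_chip: fn returning (CHIP, CHIP) labels."""
    h, w = swath.shape[:2]
    votes = np.zeros((h, w, NUM_CLASSES), dtype=np.int32)
    for y in range(0, h - CHIP + 1, STRIDE):
        for x in range(0, w - CHIP + 1, STRIDE):
            labels = segment_chip(swath[y:y + CHIP, x:x + CHIP])
            # Each overlapping chip casts one vote per pixel for its label.
            ys, xs = np.mgrid[y:y + CHIP, x:x + CHIP]
            votes[ys, xs, labels] += 1
    # Final label = category with the most votes at each pixel; pixels seen by
    # several chips are less affected by edge artifacts of any single chip.
    return votes.argmax(axis=-1)
```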
KEYWORDS: Video, Visualization, Image segmentation, Semantic video, Video surveillance, Data storage, RGB color model, Visual process modeling, Convolution, Video processing
In this work, we aim to address the needs of human analysts to consume and exploit data given the proliferation of overhead imaging sensors. We have investigated automatic captioning methods capable of describing and summarizing scenes and activities by providing textual descriptions in natural language for overhead full motion video (FMV). We have integrated methods to provide three types of outputs: (1) summaries of short video clips; (2) semantic maps, where each pixel is labeled with a semantic category; and (3) dense object descriptions that capture object attributes and activities. We show results obtained on the publicly available VIRAT and Aeroscapes datasets.
In this work, we address the problem of losing details in the overhead remote sensing image acquisition and generation process due to sensor resolution and distance to target by leveraging state-of-the-art deep neural network architectures. The goal is to recover such details by super-resolving the images acquired by overhead imaging sensors so that human analysts can interpret the data more accurately and, consequently, automated visual exploitation algorithms can be applied more effectively. We have developed a super-resolution framework operating on overhead full motion video (FMV) and still imagery (e.g., satellite images). Our framework consists of a neural network capable of learning the mapping between low- and high-resolution images in order to produce plausible details about the scene. It combines Generative Adversarial Networks (GANs) and Recurrent Neural Networks (RNNs) to process low-resolution signals both spatially and, in the case of FMV, temporally. We have applied the output of our system to several visual perception tasks, including object detection, object tracking, and semantic segmentation. We have also applied our methods to data from different geographical areas, sensors, and even modalities to demonstrate broad and generalized applicability.
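As a rough illustration of how recurrent and generative components might be combined for video super-resolution, the sketch below builds only the generator side in Keras, fusing a short low-resolution frame sequence with a ConvLSTM and upsampling with a sub-pixel (depth-to-space) layer; the layer sizes, 4x scale factor, and clip length are assumptions, and the adversarial discriminator and losses used for GAN training are omitted.

```python
# Generator-only sketch; all sizes are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers

def build_recurrent_sr_generator(frames=5, h=64, w=64, scale=4):
    lr_seq = layers.Input(shape=(frames, h, w, 3))           # low-res FMV clip
    x = layers.ConvLSTM2D(64, 3, padding="same",
                          return_sequences=False)(lr_seq)    # temporal fusion
    x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
    # Sub-pixel upsampling: expand channels, then rearrange them into space.
    x = layers.Conv2D(3 * scale * scale, 3, padding="same")(x)
    sr = layers.Lambda(lambda t: tf.nn.depth_to_space(t, scale))(x)
    return tf.keras.Model(lr_seq, sr, name="recurrent_sr_generator")

generator = build_recurrent_sr_generator()
# In GAN training this generator would be paired with a discriminator and an
# adversarial plus reconstruction loss; still imagery is the single-frame case.
```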