Visually detecting camouflaged objects is a hard problem for both humans and computer vision algorithms. Strong similarities between object and background appearance make the task significantly more challenging than traditional object detection or segmentation. Current state-of-the-art models use either convolutional neural networks or vision transformers as feature extractors. They are trained in a fully supervised manner and thus need a large amount of labeled training data. In this paper, both self-supervised and frugal learning methods are introduced to the task of Camouflaged Object Detection (COD). The overall goal is to fine-tune two COD reference methods, namely SINet-V2 and HitNet, pre-trained for camouflaged animal detection, to the task of camouflaged human detection. To this end, we use the public dataset CPD1K, which contains camouflaged humans in a forest environment. We create a strong baseline using supervised frugal transfer learning for the fine-tuning task. Then, we analyze three pseudo-labeling approaches to perform the fine-tuning in a self-supervised manner. Our experiments show that pure self-supervision achieves performance similar to fully supervised frugal learning.
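The three pseudo-labeling approaches are not spelled out in this abstract. As a rough illustration only, the following PyTorch sketch shows generic confidence-thresholded pseudo-labeling, where `teacher` stands for a COD model such as the animal-pre-trained SINet-V2 returning per-pixel foreground logits; all names and thresholds are hypothetical.

```python
import torch

def generate_pseudo_labels(teacher, unlabeled_loader, conf_thresh=0.9):
    """Generic confidence-thresholded pseudo-labeling sketch (not the
    paper's concrete schemes). `teacher` is assumed to map images to
    per-pixel foreground logits."""
    teacher.eval()
    pseudo_set = []
    with torch.no_grad():
        for images in unlabeled_loader:              # unlabeled CPD1K frames
            probs = torch.sigmoid(teacher(images))   # (B, 1, H, W)
            fg = probs > conf_thresh                 # confident foreground
            bg = probs < (1.0 - conf_thresh)         # confident background
            labels = fg.float()
            valid = (fg | bg).float()                # mask: trusted pixels only
            pseudo_set.append((images.cpu(), labels.cpu(), valid.cpu()))
    return pseudo_set
```

The `valid` mask would then gate the fine-tuning loss so that low-confidence pixels do not contribute.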
Cameras digitize real-world scenes as pixel intensity values with a limited value range given by the available bits per pixel (bpp). High Dynamic Range (HDR) cameras capture luminance values at higher resolution by increasing the number of bpp. Most displays, however, are limited to 8 bpp. Naïve HDR compression methods lose much of the rich information contained in HDR images. In this paper, we investigate tone mapping algorithms for thermal infrared images with 16 bpp that preserve this information. An optimized multi-scale Retinex algorithm sets the baseline. This algorithm is then approximated with a deep learning approach based on the popular U-Net architecture. The noise remaining in the images after tone mapping is reduced implicitly by a self-supervised deep learning approach that can be trained jointly with the tone mapping approach in a multi-task learning scheme. Further discussions are provided on denoising and deflickering for thermal infrared video enhancement in the context of tone mapping. Extensive experiments on the public FLIR ADAS Dataset prove the effectiveness of our proposed method in comparison with the state-of-the-art.
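For reference, a minimal textbook multi-scale Retinex for a 16 bpp image might look as follows. The paper's optimized variant and its parameters are not reproduced here; the scales below are common defaults, not the authors' choices.

```python
import cv2
import numpy as np

def multi_scale_retinex(img16, sigmas=(15, 80, 250), eps=1.0):
    """Textbook multi-scale Retinex tone mapping from 16 bpp to 8 bpp:
    average of single-scale Retinex outputs log(I) - log(G_sigma * I),
    followed by a linear stretch to the display range."""
    img = img16.astype(np.float64) + eps              # avoid log(0)
    msr = np.zeros_like(img)
    for sigma in sigmas:
        blur = cv2.GaussianBlur(img, (0, 0), sigma)   # illumination estimate
        msr += (np.log(img) - np.log(blur)) / len(sigmas)
    msr = (msr - msr.min()) / (msr.max() - msr.min() + 1e-12)
    return (255.0 * msr).astype(np.uint8)
```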
Multispectral person detection aims at automatically localizing humans in images that consist of multiple spectral bands. Usually, the visual-optical (VIS) and the thermal infrared (IR) spectra are combined to achieve higher robustness for person detection, especially in insufficiently illuminated scenes. This paper focuses on analyzing existing detection approaches for their generalization ability. Generalization is a key feature for machine learning based detection algorithms that are supposed to perform well across different datasets. Inspired by recent literature regarding person detection in the VIS spectrum, we perform a cross-validation study to empirically determine the most promising dataset to train a well-generalizing detector. To this end, we pick one reference Deep Convolutional Neural Network (DCNN) architecture as well as three different multispectral datasets. The Region Proposal Network (RPN) that was originally introduced for object detection within the popular Faster R-CNN is chosen as the reference DCNN. The reason for this choice is that a stand-alone RPN is able to serve as a competitive detector for two-class problems such as person detection. Furthermore, all current state-of-the-art approaches initially apply an RPN followed by individual classifiers. The three considered datasets are the KAIST Multispectral Pedestrian Benchmark including recently published improved annotations for training and testing, the Tokyo Multi-spectral Semantic Segmentation dataset, and the OSU Color-Thermal dataset including recently released annotations. The experimental results show that the KAIST Multispectral Pedestrian Benchmark with its improved annotations provides the best basis to train a DCNN with good generalization ability compared to the other two multispectral datasets. On average, this detection model achieves a log-average Miss Rate (MR) of 29.74% evaluated on the reasonable test subsets of the three analyzed datasets.
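The reported metric can be computed in the Caltech-style convention as the geometric mean of the miss rate sampled at nine FPPI points evenly log-spaced in [1e-2, 1e0]. A sketch, assuming the detector's curve is given as arrays sorted by ascending FPPI:

```python
import numpy as np

def log_average_miss_rate(fppi, miss_rate):
    """Log-average miss rate (MR): geometric mean of the miss rate at
    nine FPPI reference points evenly log-spaced in [1e-2, 1e0]."""
    ref_points = np.logspace(-2.0, 0.0, num=9)
    samples = []
    for r in ref_points:
        idx = np.where(fppi <= r)[0]
        # Take the miss rate at the largest FPPI <= r; if the curve
        # never gets that low, fall back to its first value.
        samples.append(miss_rate[idx[-1]] if idx.size else miss_rate[0])
    samples = np.clip(np.asarray(samples), 1e-10, None)  # guard log(0)
    return float(np.exp(np.mean(np.log(samples))))
```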
Image stacking is a well-known method that is used to improve the quality of images in video data. A set of consecutive images is aligned by applying image registration and warping. In the resulting image stack, each pixel has redundant information about its intensity value. This redundant information can be used to suppress image noise, resharpen blurry images, or even enhance the spatial image resolution as done in super-resolution. Small moving objects in the videos usually get blurred or distorted by image stacking and thus need to be handled explicitly. We use image stacking in an innovative way: image registration is applied to the small moving objects only, and image warping blurs the stationary background that surrounds them. Our video data come from a small fixed-wing unmanned aerial vehicle (UAV) that acquires top-view gray-value images of urban scenes. Moving objects are mainly cars but also other vehicles such as motorcycles. The resulting images, after applying our proposed image stacking approach, are used to improve baseline algorithms for vehicle detection and segmentation. We improve precision and recall by up to 0.011, which corresponds to a reduction of the number of false positive and false negative detections by more than 3 per second. Furthermore, we show how our proposed image stacking approach can be implemented efficiently.
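For contrast with the proposed object-centric registration, a conventional background-aligned stacking step could look like the sketch below, using OpenCV's ECC registration and a per-pixel median; the paper's actual registration and warping details are not reproduced here.

```python
import cv2
import numpy as np

def stack_frames(frames, ref_idx=0):
    """Background-aligned image stacking sketch: register every frame
    to a reference via ECC, warp it, and take a per-pixel median to
    suppress noise. The paper inverts this idea and registers on the
    small moving objects, so the surrounding background blurs instead."""
    ref = frames[ref_idx].astype(np.float32)
    aligned = [ref]
    criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 50, 1e-6)
    for i, frame in enumerate(frames):
        if i == ref_idx:
            continue
        cur = frame.astype(np.float32)
        warp = np.eye(2, 3, dtype=np.float32)         # affine initialization
        _, warp = cv2.findTransformECC(ref, cur, warp,
                                       cv2.MOTION_AFFINE, criteria)
        aligned.append(cv2.warpAffine(
            cur, warp, ref.shape[::-1],
            flags=cv2.INTER_LINEAR | cv2.WARP_INVERSE_MAP))
    return np.median(np.stack(aligned), axis=0).astype(frames[0].dtype)
```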
Person recognition is a key issue in visual surveillance. It is needed in many security applications such as intruder detection in military camps, but also for gaining situational awareness in a variety of different safety applications. We present a solution for long-wave infrared (LWIR) videos from a moving camera that is based on hot spot classification to distinguish persons from background clutter and other objects. We especially consider objects at greater distances that appear small in the image. Hot spots are detected and tracked along the videos. Various image features are extracted from the spots, and different classifiers such as SVM or AdaBoost are evaluated and extended to utilize the temporal information. We demonstrate that taking advantage of this temporal context can improve the classification performance.
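One simple way to exploit the temporal context along a track, given here as an illustration rather than the paper's actual extension, is to pool a frame-wise classifier's decision values over the tracked hot spot before thresholding:

```python
import numpy as np
from sklearn.svm import SVC

def classify_track(clf: SVC, track_features):
    """Average the SVM decision values of all per-frame feature vectors
    belonging to one tracked hot spot; `clf` is assumed to be already
    fitted on labeled spot features (person vs. clutter/other)."""
    scores = clf.decision_function(np.asarray(track_features))
    return bool(scores.mean() > 0.0)
```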
Spaceborne SAR imagery offers high capability for wide-ranging maritime surveillance, especially in situations where AIS (Automatic Identification System) data are not available. In such cases, maritime objects have to be detected, and additional information such as size, orientation, or object/ship class is desired. In recent research work, we proposed a SAR processing chain consisting of pre-processing, detection, segmentation, and classification for single-polarimetric (HH) TerraSAR-X StripMap images to finally assign detection hypotheses to the classes "clutter", "non-ship", "unstructured ship", "ship structure 1" (bulk carrier appearance), or "ship structure 2" (oil tanker appearance). In this work, we extend the existing processing chain to handle full-polarimetric (HH, HV, VH, VV) TerraSAR-X data. Exploiting the different polarizations for better noise suppression, we slightly improve both the segmentation and the classification process. In several experiments, we demonstrate the potential benefit for segmentation and classification. The precision of size and orientation estimation as well as the correct classification rates are calculated individually for single- and quad-polarization data and compared to each other.
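One standard way to combine the four channels for noise suppression is the polarimetric span (total power) image; whether the chain uses exactly this combination is an assumption, and the inputs are taken to be complex single-look channel images.

```python
import numpy as np

def polarimetric_span(hh, hv, vh, vv):
    """Span (total power) of a quad-pol measurement. Reciprocity is
    assumed (HV ~ VH), so the cross-pol channels are averaged before
    entering the incoherent power sum."""
    xpol = 0.5 * (hv + vh)
    return np.abs(hh)**2 + 2.0 * np.abs(xpol)**2 + np.abs(vv)**2
```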
Small and medium-sized UAVs such as the German LUNA have long endurance and, in combination with sophisticated image exploitation algorithms, constitute a very cost-efficient platform for surveillance. At Fraunhofer IOSB, we have developed the video exploitation system ABUL with the goal of meeting the demands of small and medium-sized UAVs. Several image exploitation algorithms such as multi-resolution processing, super-resolution, image stabilization, geocoded mosaicking, and stereo images/3D models have been implemented and are used with several UAV systems. Among these algorithms is moving target detection with compensation of sensor motion. Moving objects are of major interest during surveillance missions, but due to the movement of the sensor on the UAV and the small object size in the images, developing reliable detection algorithms under real-time constraints on limited hardware resources is a challenging task. Based on the compensation of sensor motion by fast and robust estimation of geometric transformations between images, independent motion is detected relative to the static background. From independent motion cues, regions of interest (bounding boxes) are generated and used as initial object hypotheses. A novel classification module is introduced to perform an appearance-based analysis of the hypotheses. Various texture features are extracted and evaluated automatically to achieve a good feature selection for successfully classifying vehicles and people.
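A minimal sketch of the motion-compensation idea, with ORB matching and RANSAC standing in for ABUL's (unspecified) fast and robust transformation estimator:

```python
import cv2
import numpy as np

def independent_motion_mask(prev, cur, diff_thresh=25):
    """Estimate the frame-to-frame homography from matched ORB features,
    warp the previous gray-value frame onto the current one, and
    threshold the residual difference: what remains after compensating
    the sensor motion is independent motion."""
    orb = cv2.ORB_create(1000)
    kp1, des1 = orb.detectAndCompute(prev, None)
    kp2, des2 = orb.detectAndCompute(cur, None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des1, des2)
    src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
    stabilized = cv2.warpPerspective(prev, H, cur.shape[::-1])
    diff = cv2.absdiff(stabilized, cur)
    return (diff > diff_thresh).astype(np.uint8)
```

Connected components of the resulting mask would then yield the bounding-box hypotheses passed on to the classification module.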
It is expected that ship detection and classification in SAR satellite imagery will be part of future downstream services for various applications, e.g., surveillance of fishery zones or tracking of cargo ships. Due to the requirements of operational services and the potential of high-resolution SAR (e.g., TerraSAR-X), there is a need for composing, optimizing, and validating specific fully automated image processing chains. The presented processing chain covers all steps from land masking, screening, object segmentation, and feature extraction to classification and parameter estimation. The chain is the basis for experiments with both open-sea and harbor scenes for ship detection and monitoring. Within this chain, a classification component for the SAR ship/non-ship decision is investigated. Based on many extracted image features and numerous image chips for training and testing, some promising results are presented and discussed. Since the classification can reduce the false alarms of the screening component, the processing chain is expected to work on images with poorer weather and signal conditions and to extract ships with weaker reflections.
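Screening in SAR ship detection is typically CFAR-based; whether this chain uses exactly cell-averaging CFAR is an assumption, but a minimal CA-CFAR sketch conveys the idea of the step whose false alarms the classifier suppresses:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def ca_cfar(intensity, guard=4, train=8, factor=5.0):
    """Cell-averaging CFAR: compare each pixel against the mean clutter
    level in a training ring around it (guard cells excluded) scaled by
    a threshold factor; bright outliers become detection hypotheses."""
    img = intensity.astype(np.float64)
    size = 2 * (guard + train) + 1                   # full window width
    inner = 2 * guard + 1                            # guard window width
    total = uniform_filter(img, size) * size**2      # window sums via means
    guard_sum = uniform_filter(img, inner) * inner**2
    clutter = (total - guard_sum) / (size**2 - inner**2)
    return img > factor * clutter
```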