This PDF file contains the front matter associated with SPIE Proceedings Volume 13539, including the Title Page, Copyright information, Table of Contents, and Conference Committee information.
Underwater object detection is an important computer vision task widely used in marine life identification and tracking. However, low contrast, occlusion, unbalanced illumination, and small, densely packed objects pose a series of challenges. To meet these challenges, several methods have been proposed to extract features more efficiently, and attention mechanisms in particular have proven powerful for feature extraction. However, the attention mechanism ignores the internal structure of the captured object, and conventional regular patch division is too coarse. We therefore apply graph attention to irregular patches and propose an Irregular-patch Graph Attention Network (IPGA). First, superpixel segmentation is used to partition the image and reduce noise. Second, global and local graphs are constructed with clustering methods to capture internal structure. Finally, to handle occlusion and small objects, a distinctive Feature Interaction (FIA) module is proposed to fuse information from the global and local graphs. To demonstrate the effectiveness of the proposed method, we conduct comprehensive evaluations on four challenging underwater datasets: DUO, Brackish, TrashCan, and WPBB. Experimental results show that the proposed IPGA achieves superior performance on three of these challenging underwater datasets.
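As a rough illustration of attention over irregular (superpixel) patches, the sketch below applies a single graph-attention layer to superpixel node features. The node features, adjacency construction, and layer sizes are illustrative assumptions, not the IPGA architecture.

```python
# Minimal sketch of attention over irregular (superpixel) patches.
# Node features, graph construction, and dimensions are illustrative
# assumptions; this is not the IPGA implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphAttentionLayer(nn.Module):
    """Single-head graph attention over superpixel node features."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim, bias=False)
        self.attn = nn.Linear(2 * out_dim, 1, bias=False)

    def forward(self, x, adj):
        # x: (N, in_dim) superpixel features, adj: (N, N) 0/1 adjacency
        h = self.proj(x)                                   # (N, out_dim)
        n = h.size(0)
        # Pairwise concatenation for attention logits.
        pairs = torch.cat([h.unsqueeze(1).expand(n, n, -1),
                           h.unsqueeze(0).expand(n, n, -1)], dim=-1)
        e = F.leaky_relu(self.attn(pairs).squeeze(-1))     # (N, N)
        e = e.masked_fill(adj == 0, float('-inf'))         # keep only graph edges
        alpha = torch.softmax(e, dim=-1)
        return alpha @ h                                   # aggregated node features

# Toy usage: 6 superpixel nodes with 16-dim appearance features.
x = torch.randn(6, 16)
adj = (torch.rand(6, 6) > 0.5).float()
adj.fill_diagonal_(1)                                      # self-loops
out = GraphAttentionLayer(16, 32)(x, adj)
print(out.shape)   # torch.Size([6, 32])
```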
With the rapid development of satellite technology, multimodal object detection between optical and SAR (Synthetic Aperture Radar) images has attracted growing interest in the field of remote sensing image interpretation. However, this area lacks a dataset that covers novel and diverse challenges tied to practical applications, which is crucial for both training and evaluating recent detectors. In this paper, we introduce a new, challenging multimodal dataset for optical-SAR object detection, OGSOD-2.0, aiming to advance detection of tiny-scale and crowded objects under complex backgrounds. Building on OGSOD-1.0, the proposed dataset adds 5,130 more optical-SAR image pairs from the Sentinel satellite series with 24,421 instance annotations, covering four significant types of static objects: bridges, harbors, oil tanks, and playgrounds. These objects exhibit a wide variety of scales, aspect ratios, and orientations under complex aerial scenarios. Notably, most objects have relatively low resolution, sometimes smaller than 12 pixels, and are clustered together at high densities, further increasing the difficulty for existing detection methods. To evaluate the challenging aspects of OGSOD-2.0, we compare the proposed dataset with existing optical-SAR datasets over several state-of-the-art methods, including single-modal, cross-modal, and multimodal detectors. Comprehensive experiments show that the proposed OGSOD-2.0 is quite challenging and closely tied to practical applications. This multimodal benchmark will be publicly available.
Saliency maps have been increasingly employed in detection and recognition tasks because they highlight areas of an image that attract human attention. Fusing a saliency map with the original image can help detect objects in complex scenes or under heavy occlusion. However, we find that the existing element-wise addition and element-wise multiplication fusion techniques directly alter image pixels, which may destroy the shape or texture information of the target and thus degrade detection performance. Therefore, we propose a saliency-guided RT-DETR for object detection, which promotes the integration of the original image and the saliency map while preserving object details. Specifically, we design a dual-stream fusion module that feeds the saliency-weighted image and the original image to the backbone as dual-stream inputs for independent analysis and uses cross-attention feature-enhancement units to align features, facilitating feature interaction and enhancement. We then apply a hybrid feature fusion module to effectively fuse multi-scale features and capture target information comprehensively. The proposed method achieves the best values on all metrics of the newly designed COCO dataset, improving accuracy by 1.4% and 0.9% over element-wise addition fusion and element-wise multiplication fusion, respectively. Additionally, in the saliency-weighted object precision test, our model demonstrates superior performance across all metrics.
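A minimal sketch of dual-stream fusion via cross-attention is shown below: features from the original-image stream attend to features from the saliency-weighted stream. The token count, feature dimension, and residual fusion rule are illustrative assumptions, not the paper's module.

```python
# Hedged sketch of cross-attention between a saliency-weighted stream and the
# original-image stream; layer sizes and token counts are illustrative only.
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feat_orig, feat_sal):
        # feat_orig, feat_sal: (B, N, C) flattened backbone features.
        # Queries from the original stream attend to saliency-stream keys/values.
        fused, _ = self.cross(query=feat_orig, key=feat_sal, value=feat_sal)
        return self.norm(feat_orig + fused)     # residual fusion

B, N, C = 2, 400, 256          # e.g. a 20x20 feature map flattened to 400 tokens
orig = torch.randn(B, N, C)
sal = torch.randn(B, N, C)
print(CrossAttentionFusion()(orig, sal).shape)  # torch.Size([2, 400, 256])
```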
To tackle the issue of inaccurate position estimation in traditional vision SLAM systems when dealing with dynamic objects, we developed a dynamic vision SLAM system that leverages a lightweight target detection network. This system enhances localization accuracy and robustness by identifying and removing dynamic feature points within the environment. By replacing the YOLOv5s backbone with the lightweight PP-LCNet, we introduced the PPLCNet-YOLOv5s target detection network, which reduces network parameters by 44.72% and boosts operating speed by 39%. This improved network is integrated into the tracking thread of the ORB-SLAM3 system to filter out dynamic feature points. Experiments on the TUM dataset demonstrate that this approach significantly improves SLAM system performance in dynamic environments, resulting in reduced trajectory errors and enhanced position estimation accuracy, thereby effectively strengthening the system’s robustness.
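The core filtering idea, removing feature points that fall inside boxes of detected dynamic objects before pose estimation, can be sketched as below. The box format, class names, and keypoint representation are assumptions for illustration; the actual system operates inside the ORB-SLAM3 tracking thread.

```python
# Illustrative sketch of removing feature points that fall inside detected
# dynamic-object boxes; box format and class names are assumptions.
import numpy as np

DYNAMIC_CLASSES = {"person", "car", "bicycle"}   # assumed dynamic labels

def filter_dynamic_keypoints(keypoints, detections):
    """keypoints: (N, 2) array of (x, y); detections: list of (label, x1, y1, x2, y2)."""
    keep = np.ones(len(keypoints), dtype=bool)
    for label, x1, y1, x2, y2 in detections:
        if label not in DYNAMIC_CLASSES:
            continue
        inside = ((keypoints[:, 0] >= x1) & (keypoints[:, 0] <= x2) &
                  (keypoints[:, 1] >= y1) & (keypoints[:, 1] <= y2))
        keep &= ~inside                          # drop points on dynamic objects
    return keypoints[keep]

kps = np.array([[50, 60], [200, 220], [305, 310]], dtype=float)
dets = [("person", 180, 200, 260, 400), ("chair", 290, 290, 320, 330)]
print(filter_dynamic_keypoints(kps, dets))       # keeps [50, 60] and [305, 310]
```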
The brain midline is a virtual boundary between the left and right hemispheres; it is usually straight under normal conditions owing to the dynamic constancy of intracranial contents. However, conditions such as hemorrhagic stroke can produce a mass effect from hematoma and edema, causing Midline Shift (MLS). MLS is crucial for assessing the severity of hemorrhagic stroke. Traditionally, midline identification is performed manually by clinicians, which is time-consuming and prone to inconsistencies. In recent years, Computer-Aided Diagnosis (CAD) methods have emerged to automate this process, improving efficiency and accuracy. These CAD approaches generally fall into two categories: symmetry-based and landmark-based methods. While these methods have advanced MLS detection, they are limited in capturing the elongated structure of the midline and in providing continuous predictions. In this study, we propose a novel method for brain MLS estimation. First, we propose an ellipse-based correction method to align CT images and ensure consistent orientation. More importantly, we propose the Deformable Midline Shift Network (DMSNet), which utilizes Dynamic Snake Convolution (DSConv) to better extract the elongated structural features of the brain midline. The model's regression-based design allows continuous and smooth midline predictions, improving accuracy. DMSNet outperforms traditional convolutional networks and state-of-the-art models, achieving superior results on multiple evaluation metrics. Moreover, the model demonstrates high accuracy in assessing surgical indications based on MLS, further confirming its clinical relevance.
Carbon fiber defect detection presents unique challenges due to the homogeneous nature of the material and the scarcity of labeled defect data. Traditional pre-training and fine-tuning approaches, while effective for natural scenes, often struggle with the limited semantic diversity and low recognizability characteristic of carbon fiber surfaces. In this paper, we introduce an enhanced prototypical contrastive learning method specifically tailored for carbon fiber defect detection in scenarios with limited labeled data. Our approach leverages the visual similarity of normal samples in carbon fiber materials to develop a more effective self-supervised pre-training strategy. By addressing the gap between natural image datasets and carbon fiber scenes, we enhance feature learning and improve the adaptability of models to the specific requirements of this domain. We propose a novel dense contrastive learning branch and a clustering-based prototype identification technique to better capture the subtle variations indicative of defects. Following a pretrain-finetune paradigm, we demonstrate that our approach significantly outperforms existing self-supervised learning techniques on carbon fiber defect detection tasks.
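To make the prototypical contrastive idea concrete, the sketch below shows a minimal loss that pulls each embedding toward its assigned cluster prototype and away from the others. The cluster assignments, temperature, and dimensions are illustrative assumptions, not the paper's exact formulation.

```python
# A minimal sketch of a prototypical contrastive loss under assumed cluster
# assignments (e.g. from k-means over normal-sample features).
import torch
import torch.nn.functional as F

def proto_contrastive_loss(embeddings, prototypes, assignments, tau=0.1):
    """embeddings: (N, D); prototypes: (K, D); assignments: (N,) prototype ids."""
    z = F.normalize(embeddings, dim=1)
    p = F.normalize(prototypes, dim=1)
    logits = z @ p.t() / tau                  # (N, K) cosine similarities / temperature
    return F.cross_entropy(logits, assignments)

z = torch.randn(32, 128)                      # features from normal carbon-fiber patches
protos = torch.randn(8, 128)                  # e.g. k-means centroids of those features
assign = torch.randint(0, 8, (32,))
print(proto_contrastive_loss(z, protos, assign))
```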
Few-shot object detection trains a model to recognize and locate novel object classes from a very limited number of labeled samples. However, because samples of the novel classes are scarce, the model often lacks sufficient feature discrimination between base classes and few-shot novel classes, and conventional classification methods are prone to confusing similar categories, which hurts detection accuracy. In this paper, we propose a query calibration method based on graph-centrality prototypes (GC-PQC) together with a Gradual Gradient Isolation (GGI) module. The GC-PQC method constructs intra-class correlation models using graph centrality measures to enhance feature representation and alleviate inter-class confusion. Meanwhile, the GGI module progressively decouples backpropagated gradients to promote feature independence during fine-tuning. Together, the two methods improve few-shot object detection performance and enhance the feature representation of novel classes. Experiments show that our framework achieves a 2%-5% increase in accuracy on novel classes across multiple datasets, demonstrating its effectiveness in addressing the challenges of few-shot object detection.
Object detection in remote sensing images relies on large amounts of annotated data for training, but adequately annotating novel classes is difficult. Few-Shot Object Detection (FSOD) addresses this problem by fine-tuning on a small number of annotated novel-class samples; as a result, the observed within-class variation is limited, making it difficult to fit the true within-class distribution. To alleviate this problem, this paper proposes a few-shot object detection method for remote sensing based on adaptive feature modification and guided by Gaussian dynamic dilated convolution. First, in view of the difficulty of recognizing novel classes, an adaptive feature modification algorithm is proposed: exploiting the two-branch architecture of meta-learning, the support features are injected into the query feature extraction network as convolutional biases to adaptively modify the query features. Second, to simulate intra-class noise variation (feature perturbation caused by noise in the foreground), the backbone weights are reparameterized as Gaussian-distributed variables during fine-tuning, so that parameter variation produces feature perturbation. Finally, to simulate intra-class scale variation (different foreground sizes), the convolution is replaced with a dynamic dilated convolution, and different dilation rates are used to mimic feature scale variation. Extensive experiments on two widely used remote sensing datasets (NWPU.V2 and DIOR) demonstrate the effectiveness of the proposed method.
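The final step, running the same kernel at several dilation rates to mimic foreground-scale variation, can be sketched as below. The rates and the plain averaging used to merge outputs are illustrative assumptions rather than the paper's dynamic selection mechanism.

```python
# Sketch of a multi-dilation convolution: the same 3x3 kernel is run at several
# dilation rates to mimic scale variation; rates and the averaging are assumptions.
import torch
import torch.nn.functional as F

def multi_dilation_conv(x, weight, rates=(1, 2, 3)):
    """x: (B, C_in, H, W); weight: (C_out, C_in, 3, 3)."""
    outs = []
    for r in rates:
        # "Same" padding for a 3x3 kernel with dilation r is exactly r pixels.
        outs.append(F.conv2d(x, weight, padding=r, dilation=r))
    return torch.stack(outs, dim=0).mean(dim=0)   # merge scales (here: plain mean)

x = torch.randn(1, 64, 32, 32)
w = torch.randn(128, 64, 3, 3)
print(multi_dilation_conv(x, w).shape)            # torch.Size([1, 128, 32, 32])
```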
Automated aircraft surface damage detection faces numerous issues in engineering practice. This paper presents a portable, machine-vision-based technology for automatic aircraft surface damage detection. It analyzes the difficulties of aircraft surface inspection with machine vision and refines the classification of aircraft surface damage according to machine-vision requirements, laying a foundation for improving detection effectiveness. YOLOv8 was selected as the base algorithm and Jetson Nano as the implementation platform for the design and development of the portable detection algorithm. A suitable experimental environment was established to train and validate the model. Experimental results show that the portable automatic aircraft surface damage detection method developed in this paper achieves good detection performance, with an mAP of 69.53% and 7.48 GFLOPs. This technology can serve as an auxiliary technical means in aircraft manufacturing and maintenance, effectively enhancing the automation level of aircraft surface inspections.
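For readers unfamiliar with the tooling, the snippet below shows standard Ultralytics YOLOv8 training and validation calls that a pipeline of this kind might use. The dataset YAML name, epoch count, image size, and batch size are placeholders, not the authors' configuration.

```python
# Generic Ultralytics YOLOv8 usage, not the authors' exact configuration:
# the dataset YAML, image size, and epoch count below are placeholders.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")                       # small model suited to Jetson-class devices
model.train(data="aircraft_damage.yaml",         # hypothetical dataset config
            epochs=100, imgsz=640, batch=16)
metrics = model.val()                            # reports mAP on the validation split
print(metrics.box.map50)                         # mAP@0.5
```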
Background noise can easily lead to incorrect classification in anomaly detection, a crucial step in industrial product quality inspection. In this paper, we introduce Mask-Patchcore, a simple yet effective algorithm designed to mitigate the impact of background noise. We employ a segmentation-based mask generator whose output is combined with the input image in the Mask-Patchcore detection network, allowing different weights to be assigned to different regions of the image and thereby sharpening the focus on regions of interest. By dynamically adjusting attention across image regions via the mask generator, Mask-Patchcore significantly improves the anomaly detection algorithm's ability to pinpoint and identify target areas, boosting overall detection accuracy and robustness. We extensively evaluate Mask-Patchcore on public datasets such as MVTEC-AD, VISA, and MPDD. Experimental results demonstrate its superior performance compared to existing algorithms. Furthermore, we showcase its successful application to detecting map boundary anomalies.
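A minimal sketch of the weighting idea is given below: patch features are scaled by a foreground mask so that background patches contribute less to downstream anomaly scoring. The pooling scheme and weighting rule are assumptions, not the Mask-Patchcore code.

```python
# Illustrative sketch of weighting backbone patch features by a foreground mask
# before building per-patch embeddings; the weighting rule is an assumption.
import torch
import torch.nn.functional as F

def mask_weighted_patches(feat, mask):
    """feat: (B, C, H, W) backbone features; mask: (B, 1, h, w) in [0, 1]."""
    mask = F.interpolate(mask, size=feat.shape[-2:], mode="bilinear",
                         align_corners=False)
    weighted = feat * mask                        # down-weight background regions
    B, C, H, W = weighted.shape
    return weighted.permute(0, 2, 3, 1).reshape(B, H * W, C)   # per-patch embeddings

feat = torch.randn(2, 256, 28, 28)
mask = torch.rand(2, 1, 224, 224)
print(mask_weighted_patches(feat, mask).shape)    # torch.Size([2, 784, 256])
```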
Monocular 3D object detection is a challenging task in autonomous driving, and numerous methods have emerged in response. Among them, Transformer-based methods have demonstrated superior performance, predicting 3D attributes from a single 2D image in an end-to-end manner. Most existing Transformer-based methods leverage both visual and depth representations to detect objects, and in these models the quality of the learned query points has a large impact on detection accuracy. However, the existing unsupervised Transformer attention mechanism generates many low-quality queries because its receptive field is inaccurate. To alleviate this problem, this paper introduces a novel Edge Module (EM) for monocular 3D object detection. EM leverages edge information to better locate objects in the image and improve the accuracy of the receptive field. Specifically, the edge module also enhances low-level features and then interactively fuses them with high-level features after optimization. After flattening the result of this interactive fusion, it interacts with learnable object queries initialized by the decoder to improve query quality. In addition, we utilize a Feature Separation Module (FSM) to separate low-level features from high-level features. We then use the edge-guided Transformer to produce edge-aware queries, which are fed into detection heads for object detection. On the KITTI benchmark with monocular images as input, our method achieves state-of-the-art performance compared to existing approaches and requires no extra data.
Fatigue driving is one of the main causes of traffic accidents. In this paper, face region detection and driver state analysis are investigated using infrared video images. An improved Kalman filter algorithm is proposed to track the face, and geometric constraints based on the "three courts and five eyes" facial-proportion rule are applied to define the driver's eye and mouth detection regions. The CamShift algorithm is used for facial feature recognition. Based on statistical analysis, the driver state is judged by combining the detection results for eye state and mouth state. The algorithm is verified experimentally on video of different drivers recorded under different lighting conditions; the results show that the proposed algorithm reaches an accuracy of more than 96% and that driver-state detection holds up in a real cab environment, giving the method practical application value.
Camouflaged Object Detection (COD) aims to accurately identify targets seamlessly blended into complex surroundings, a challenging yet critical visual task. However, existing camouflaged object detection methods perform sub-optimally in scenes with cluttered backgrounds and subtle edge information, primarily because of limitations in capturing local fine-grained features and fusing multi-scale features. To address these challenges, we propose LSFNet, a novel approach designed to enhance the representational capability of the decoder by introducing detail-enhanced feature guidance and an improved feature fusion module. The model comprises two main components. The Local Guidance Augmentation Module (LGAM) supplements high-quality detail information such as boundaries and textures by combining high-level semantic guidance, ensuring accurate identification even where edges are hard to distinguish. The Selective Feature Fusion Perceptor (SFFP) filters the features extracted by the backbone network, selectively integrates multi-scale contextual information from top to bottom, and effectively suppresses noise, yielding refined predictions. Extensive experiments on four benchmark datasets demonstrate that LSFNet significantly outperforms 18 state-of-the-art methods, showcasing its outstanding performance in camouflaged object detection.
The main aim of interactive image segmentation is to accurately segment regions of interest using simple user-provided interactions. The most popular training approach uses the previous segmentation result as supplementary information for the next segmentation step, establishing a connection between consecutive interactions. This iterative training scheme propagates all the information carried by the interactions to the global level without considering their local effect, which hurts both the final segmentation results and training efficiency. To address this problem, we propose an iterative training strategy called Double-Click: the image is cropped around the newly added clicks, and the cropped image together with the new clicks is fed separately into a smaller branch network to obtain predictions for the local region. The local segmentation result is then fused with the previous segmentation result to produce the final outcome. The refined result in turn guides the next segmentation, so that more precise predictions steer subsequent steps while the clicks retain their guidance over the global image. Our training strategy improves training effectiveness by handling two clicks in each iteration, and segmentation accuracy is ensured through both local and global predictions. We also design a dual-branch network to accommodate the proposed training strategy. The strategy is validated on four benchmarks, namely GrabCut, Berkeley, SBD, and DAVIS. Experimental results demonstrate that the Double-Click method achieves better performance with fewer interactions than previous methods.
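A rough sketch of the local-refinement step is shown below: crop a window around the new click, run a stand-in local predictor, and paste the result back into the previous global mask. The crop size, fusion rule, and dummy predictor are illustrative assumptions, not the paper's branch network.

```python
# Sketch of cropping around a new click and fusing a local prediction back into
# the previous global mask; crop size and fusion rule are assumptions.
import numpy as np

def refine_around_click(global_mask, image, click_xy, local_predictor, crop=128):
    """global_mask: (H, W) in [0, 1]; image: (H, W, 3); click_xy: (x, y)."""
    H, W = global_mask.shape
    x, y = click_xy
    x1, y1 = max(0, x - crop // 2), max(0, y - crop // 2)
    x2, y2 = min(W, x1 + crop), min(H, y1 + crop)
    local_pred = local_predictor(image[y1:y2, x1:x2])   # local mask in [0, 1]
    fused = global_mask.copy()
    # Simple fusion rule: trust the local branch inside the crop.
    fused[y1:y2, x1:x2] = local_pred
    return fused

dummy_predictor = lambda patch: np.ones(patch.shape[:2], dtype=np.float32)
mask = np.zeros((480, 640), dtype=np.float32)
img = np.zeros((480, 640, 3), dtype=np.uint8)
print(refine_around_click(mask, img, (320, 240), dummy_predictor).sum())  # 128*128 pixels set
```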
The Circle of Willis, a critical arterial ring structure located at the base of the brain, is essential for maintaining cerebral blood supply and is closely associated with various cerebrovascular diseases. Accurate segmentation of the Circle of Willis is crucial for assessing its structural integrity and functional status in clinical diagnosis and neurosurgical planning. This study developed a novel deep learning model based on Time-of-Flight Magnetic Resonance Angiography (TOF-MRA) technology. The model significantly improves the continuity of vascular segmentation and the recognition of fine vascular structures through a vascular skeleton-assisted segmentation mechanism and a multi-scale integrated fusion module. The innovatively introduced Multi-task Feature Attention Fusion Module integrates the semantic information of the vascular skeleton, effectively optimizing segmentation accuracy and structural integrity. Compared to existing techniques, our method has achieved promising results.
Stroke, as a disease with high incidence and mortality rates, is receiving increasing attention. Although the rapid development of deep learning in the medical field has brought excellent performance to AI-assisted diagnosis, automated segmentation of stroke still poses significant challenges: the similarity between hemorrhagic regions and the background, the irregularity of hemorrhagic areas, and the vast variability in hemorrhage sizes. To address these challenges, this paper introduces a new network architecture that incorporates a multi-scale channel joint attention module and a cascaded feature-assisted enhancement module, taking into account the anisotropy and asymmetry of the images. The method aims to accurately segment lesion tissue from chronic stroke brain images in T1-weighted MRI. Experimental results demonstrate that the proposed method achieves superior performance compared to other methods. This study provides a promising solution for stroke segmentation.
Medical image segmentation plays an important role in computer-aided medical therapy. Simply substituting Kolmogorov-Arnold Networks for the Multilayer Perceptron in medical image segmentation has strong limitations in segmentation performance. In this paper, we propose an improved medical image segmentation method based on an enhanced Kolmogorov-Arnold Networks structure. First, we apply Gaussian smoothing to reduce noise in the label map, so the resulting segmentation boundary is smoother than that of the traditional Kolmogorov-Arnold Networks model. At the same time, we compute a grid boundary-condition loss and add it to the smoothed grid regularization loss, which reduces possible errors in the segmentation process and improves overall performance over the traditional Kolmogorov-Arnold Networks model. Finally, we attach an ECA layer (eca_layer) at the back end of the Kolmogorov-Arnold Networks to implement an efficient channel attention mechanism, enabling the network to focus on important features and enhancing the feature representation capability of the model relative to the traditional Kolmogorov-Arnold Networks model. Experiments show that our model achieves higher accuracy and better segmentation performance on a variety of medical image segmentation tasks: we obtain a 79.73% Dice score and 66.72% IoU on the BUSI dataset, 92.89% Dice and 86.73% IoU on the GlaS dataset, and 91.91% Dice and 85.20% IoU on the CVC dataset, standing out from many other state-of-the-art models.
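For reference, the sketch below shows a standard ECA (Efficient Channel Attention) block of the kind attached to the network's back end; the 1-D kernel size is the usual ECA hyperparameter, and the value used here is an assumption.

```python
# A standard ECA channel-attention block; the kernel size k is an assumed value.
import torch
import torch.nn as nn

class ECALayer(nn.Module):
    def __init__(self, k=3):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        # x: (B, C, H, W)
        y = self.pool(x)                          # (B, C, 1, 1) channel descriptor
        y = y.squeeze(-1).transpose(1, 2)         # (B, 1, C) for 1-D conv across channels
        y = self.conv(y).transpose(1, 2).unsqueeze(-1)
        return x * self.sigmoid(y)                # channel-wise reweighting

x = torch.randn(2, 64, 32, 32)
print(ECALayer()(x).shape)                        # torch.Size([2, 64, 32, 32])
```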
Semantic segmentation of remote sensing images is crucial for global geographic information analysis. However, existing methods still fall short in accurately segmenting small target regions, often missing or misclassifying small targets. To address this issue, we propose Global-Local Transformer, a remote sensing image segmentation model based on a global-local feature fusion attention mechanism. The model adopts the classical U-Net framework, with an encoder that integrates convolutional neural networks and the Swin Transformer and a decoder composed of stacked deconvolution layers for upsampling. We also introduce a Global Information Sharing module to compensate for the Swin Transformer's limited global information extraction caused by its sliding-window approach. Experiments on the Potsdam and LoveDA datasets validate the model's effectiveness, with mIoU and mF1 improvements of 1.1% and 1.1% on Potsdam and 1.3% and 2.1% on LoveDA, respectively.
In medical image processing, accurate 3D segmentation is a crucial prerequisite for effective computer-aided diagnosis. Recently, UNet-based network architectures have become the mainstream choice for medical image segmentation. However, the inherent geometric limitations of traditional convolutions restrict their ability to perceive irregularly shaped lesions, and existing attention mechanisms often overlook positional information, which is essential for models to understand the configuration of target regions. To address these challenges, we propose an encoder-decoder network that integrates shape perception with 3D localization to generate precise segmentation masks. Central to our framework is the Deformable Coordinate Kernel Attention (DCK-Attention) module, which adaptively deforms the sampling grid by learning features from the input image, thereby enhancing attention and perception of regions of interest through spatial localization information. We validate the effectiveness of our method on the Brain Tumor Segmentation (BraTS) and Automated Cardiac Diagnosis Challenge (ACDC) datasets. Experimental results demonstrate that our approach outperforms existing segmentation methods, offering a significant advancement and providing a referenceable solution for the field of medical image segmentation.
Tooth segmentation from Cone Beam Computed Tomography (CBCT) datasets is a crucial step in generating three-dimensional tooth models. Mainstream tooth segmentation methods randomly or uniformly cut CBCT datasets into several regions without anatomical significance, known as patches. Networks trained with these patches struggle to learn the global semantic information in the CBCT data, which is important for accurate segmentation. This paper pioneers the application of the patch-free concept to the tooth segmentation task, proposing a novel patch-free tooth segmentation method. Our approach consists of two steps: whole tooth region localization and patch-free tooth segmentation. Given a CBCT dataset, we first extract a fixed-size whole tooth region containing all teeth; this region is then used to train or test a U-Net. To demonstrate the superiority of our method, we conducted quantitative and qualitative comparisons with a popular patch-based method using similar network architectures and hyperparameters. The results indicate that our patch-free approach significantly outperforms the patch-based method in accurately segmenting individual teeth.
The 3D point cloud is a fundamental data format and a vital information carrier for analyzing and understanding real-world scenes, particularly for applications like autonomous driving. Current research mainly focuses on improving model performance; however, until models reach perfect precision, accurately assessing predictive uncertainty remains crucial, especially in risk-sensitive scenarios. We apply evidential deep learning to point cloud semantic segmentation for the first time and propose a new uncertainty-aware segmentation head that predicts both epistemic and aleatoric uncertainty in a single forward pass. Furthermore, the method can be seamlessly integrated with any point cloud classification or segmentation model. In addition, by introducing aleatoric uncertainty into the loss function, overfitting to noisy points is avoided to some extent. Experimental results confirm that the approach leads in inference speed and in the accuracy of uncertainty estimation, setting a new standard in the field.
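A minimal sketch of an evidential (Dirichlet-based) head is given below: per-point evidence yields class probabilities plus an uncertainty score in one forward pass. The backbone features, head size, and the particular uncertainty formula are illustrative assumptions rather than the paper's head.

```python
# Minimal evidential segmentation-head sketch: evidence -> Dirichlet parameters
# -> expected probabilities and an uncertainty score in one forward pass.
import torch
import torch.nn as nn
import torch.nn.functional as F

class EvidentialHead(nn.Module):
    def __init__(self, in_dim, num_classes):
        super().__init__()
        self.fc = nn.Linear(in_dim, num_classes)

    def forward(self, feats):
        # feats: (N_points, in_dim) per-point features from any point-cloud backbone.
        evidence = F.softplus(self.fc(feats))     # non-negative evidence per class
        alpha = evidence + 1.0                    # Dirichlet concentration parameters
        S = alpha.sum(dim=-1, keepdim=True)
        probs = alpha / S                         # expected class probabilities
        uncertainty = alpha.size(-1) / S          # K / sum(alpha): high when evidence is low
        return probs, uncertainty.squeeze(-1)

head = EvidentialHead(in_dim=64, num_classes=13)
probs, unc = head(torch.randn(1000, 64))
print(probs.shape, unc.shape)                     # torch.Size([1000, 13]) torch.Size([1000])
```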
Segmentation of choroidal vessels is crucial for diagnosing various fundus diseases but remains a significant challenge. The complex and diverse morphology of choroidal vessels makes obtaining pixel-wise annotations extremely difficult. In this work, we introduce a labeling strategy that combines full annotation with sparse annotation, preserving the continuity of the vessels while significantly reducing the annotation workload. To learn efficiently from both types of annotation, we propose a semi-supervised segmentation framework based on collaborative learning between two networks. Within this framework, we propose an Uncertainty-Guided Dynamic Teacher Selection (UGDTS) strategy that dynamically selects the teacher network to generate pseudo-labels and supervises the other network under the guidance of an uncertainty map. Additionally, to reduce the cognitive bias of the networks when using sparse annotations, we design a Deformation Perception Consistency (DPC) scheme that enforces a weak-to-strong consistency constraint on predictions before and after choroidal structural deformations. Quantitative and qualitative results across various label proportions demonstrate the effectiveness and superiority of the proposed framework over other state-of-the-art methods.
Analysis of artery-vein differences in Optical Coherence Tomography Angiography (OCTA) is of great significance for diagnosing various eye diseases and systemic diseases (such as diabetic retinopathy, hypertension, and cardiovascular disease). We propose MKD-Net, a multimodal knowledge-distillation-guided artery-vein segmentation network for OCT/OCTA images. First, we introduce 2D projection-image information into the 3D-to-2D segmentation process to enhance the network's semantic representation in the horizontal direction. Second, we transfer knowledge from a trained multimodal (3D and 2D input) teacher network to a single-modal (3D input) student network, so that the student does not rely on projection images generated by retinal layer segmentation, avoiding segmentation failure under diseased conditions and improving its ability to capture fine details and feature changes. Extensive evaluations on the OCTA500 and OCTA100 datasets demonstrate that the proposed method achieves superior performance, with an mIoU of 84.96% and 84.25% on OCTA500 and OCTA100, respectively.
Unsupervised domain adaptation is an effective technology for cross-domain image classification which leverages the rich labeled source domain data to build an accurate classifier for the unlabeled target domain data. To effectively evaluate the distribution discrepancy between domains, the maximum mean discrepancy (MMD) has been widely used for its nonparametric advantage. Recently, a new distribution discrepancy metric named maximum mean and covariance discrepancy (MMCD) was proposed, which combines MMD with the maximum covariance discrepancy (MCD). Since MCD could evaluate the second order statistics in reproducing kernel Hilbert space, the MMCD metric can capture more information than using MMD alone and give better results. However, most existing MMCD based methods generally focus on adapting the marginal and class-conditional distributions between domains but ignore exploring discriminative information offered by the labeled source domain data. To overcome this shortcoming, in this paper we propose a discriminative MMCD based domain adaptation (DMcDA) method, which could simultaneously reduce the MMCD of the marginal and class-conditional distributions between domains and minimize the intra-class scatter of the source domain data. Moreover, a kernel version of DMcDA is also derived. Comprehensive experiments are carried out on two cross-domain image classification problems, and the results demonstrate the effectiveness of the proposed DMcDA method.
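As background for the discrepancy terms discussed above, the sketch below computes a plain (biased) RBF-kernel MMD estimate between source and target features, the basic quantity that MMCD extends with a covariance term. The kernel bandwidth is a fixed illustrative choice, and this is not the DMcDA objective itself.

```python
# Plain RBF-kernel MMD between source and target features; bandwidth is assumed.
import torch

def rbf_kernel(a, b, gamma=1.0):
    d2 = torch.cdist(a, b) ** 2
    return torch.exp(-gamma * d2)

def mmd(source, target, gamma=1.0):
    """source: (n, d), target: (m, d) feature matrices."""
    k_ss = rbf_kernel(source, source, gamma).mean()
    k_tt = rbf_kernel(target, target, gamma).mean()
    k_st = rbf_kernel(source, target, gamma).mean()
    return k_ss + k_tt - 2 * k_st

xs = torch.randn(100, 32)            # source-domain features
xt = torch.randn(80, 32) + 0.5       # shifted target-domain features
print(mmd(xs, xt).item())
```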
Multimodal prompt learning has shown remarkable generalization abilities in pretrained vision-language models, such as CLIP, enabling adaptation to various downstream tasks. Building on CoOp, previous works combine learnable text tokens with class labels to acquire domain-specific textual knowledge. However, this approach struggles with poor generalization to unseen classes, as much of the general knowledge acquired during pretraining is forgotten. To address this issue, we propose the Distance-Aware Text Optimizer (DATO), which mitigates the forgetting problem in learnable prompts. DATO introduces a distance-based penalty on text features, guiding the learning of prompts by enforcing proximity to the original prompt features. This penalty increases as the learned text features deviate further from the original ones, balancing against the classification loss. Our experiments demonstrate that the proposed Distance-Aware Text Optimizer is an effective multimodal prompt learning optimization method, achieving a 1.28% improvement over MaPLe.
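To illustrate the distance-based penalty described above, the sketch below adds a drift term between learned and frozen original prompt features to the classification loss. The cosine-distance form, penalty weight, and tensor shapes are illustrative assumptions, not the DATO formulation.

```python
# Sketch of a distance-aware penalty: the further the learnable-prompt text
# features drift from the frozen originals, the larger the added loss term.
import torch
import torch.nn.functional as F

def dato_style_loss(logits, labels, text_feat, text_feat_orig, lam=1.0):
    """logits: (B, K); text_feat, text_feat_orig: (K, D) learned vs frozen features."""
    cls_loss = F.cross_entropy(logits, labels)
    # Penalty grows with the distance between learned and original text features.
    drift = 1.0 - F.cosine_similarity(text_feat, text_feat_orig, dim=-1)   # (K,)
    return cls_loss + lam * drift.mean()

logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
t_learn, t_orig = torch.randn(10, 512), torch.randn(10, 512)
print(dato_style_loss(logits, labels, t_learn, t_orig))
```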
In Hyperspectral Image (HSI) classification, each pixel sample is assigned to a land cover category. Recently, HSI classification methods based on Convolutional Neural Networks (CNNs) have significantly improved performance due to their strong feature representation capabilities. However, these methods have a limited ability to capture deep semantic features, and computational costs increase significantly as the number of layers grows. The Vision Transformer (ViT), leveraging its self-attention mechanism, offers promising classification performance compared to CNNs, and Transformer-based methods exhibit robust global spectral feature modeling capabilities; however, they struggle to effectively extract local spatial features. In this paper, we propose a novel Transformer-based method with efficient self-attention for HSI classification that fully aggregates both local and global spatial-spectral features. The proposed method employs spectral and spatial convolution operations, integrated with attention mechanisms, to enhance structural and shape information. Initially, 3-D convolution with adaptive pooling and 3-D convolution with residual connections are employed to extract fused spatial-spectral features. Subsequently, an interactive self-attention module is applied across the height, width, and spectral dimensions, achieving a deep fusion of spatial-spectral features. Experimental analyses on three standard datasets confirm that the proposed Hybrid Spectral-Spatial ResNet and Transformer (HSSRT) outperforms existing classification techniques, delivering state-of-the-art classification performance.
Infertility, a reproductive disorder, arises from various causes, among which abnormalities of the uterus and associated reproductive organs are significant contributors. While acupuncture has proven effective in treating infertility, accurately diagnosing its underlying causes remains a challenge. Ultrasonic imaging, capable of identifying issues such as uterine fibroids, ovarian cysts, and fallopian tube blockages, facilitates the development of more targeted acupuncture treatment plans. This highlights the need for a method that enables acupuncture practitioners with limited exposure to Western medicine to interpret ultrasonic images effectively. The approach introduced in this study leverages the YOLOv5 framework to classify uterine abnormalities. Results indicate that it improves on the baseline compared with other methods. This method will enhance clinical diagnosis, aid in pinpointing the causes of infertility, and improve the precision of infertility diagnoses by acupuncture specialists, thereby boosting the efficacy of acupuncture interventions for infertility.
Lychee and longan are important economic crops in South China, and rapidly identifying and mapping their planting areas is an important task. However, lychee and longan trees look similar and therefore have similar spectral features, which makes obtaining their planting areas from remote sensing imagery a challenge. Furthermore, unlike lychee, the planting area of longan in Guangdong province is too small to be identified easily. Spatial and spectral resolution may therefore be decisive factors for the classification. To evaluate the influence of spatial and spectral resolution on lychee and longan classification, this paper compares several representative remote sensing data sources, including Sentinel-2 (high spatial resolution, multispectral), Zhuhai-1 (high spatial resolution, hyperspectral), Gaofen-5 (low spatial resolution, hyperspectral), and Ziyuan-1 02D (low spatial resolution, hyperspectral), as well as a fused image of GF-5 with Sentinel-2, aiming to find which spatial and spectral resolutions are suitable for this classification. We combine a hierarchical classifier with Convolutional Neural Networks (CNNs) to classify the classes. At the first level, the hierarchical classifier separates the land use classes and the combined lychee-longan class; at the second level, it distinguishes lychee from longan within the combined class. A CNN serves as the base classifier in the hierarchical classifier. Three branches (a spatial-feature branch, a spectral-feature branch, and a joint spatial-spectral branch) form the backbone of the network, with 1D, 2D, and 3D convolutional layers extracting the spectral, spatial, and joint features in those branches, respectively. The experimental results show that spectral resolution is the most important factor for lychee and longan classification, while spatial resolution cannot be ignored because longan occupies small planting areas. Hyperspectral images with 10 m spatial resolution achieve the best performance; images with 30 m spatial resolution can hardly distinguish longan from lychee unless the spectral resolution reaches at least 10 nm (e.g., GF-5); and the multispectral image cannot distinguish these trees.
Traditional content-based image feature extraction focuses only on low-level visual features, mainly extracting color, texture, shape, and local features, without considering the spatial information of the image. This article proposes an image classification algorithm based on topological feature extraction, which fully considers spatial position relationships and topological invariance. First, the image is binarized and then processed by a filtration function to obtain a grayscale image. A persistence image is computed and further transformed into topological feature descriptors. Finally, machine learning algorithms are used for classification. Experiments on the MNIST dataset show that the enhanced classification algorithm, which incorporates the spatial topological features of images, not only reduces algorithmic complexity but also maintains classification accuracy.
Synthetic Aperture Radar Automatic Target Recognition (SAR ATR) has long suffered from large intra-class variance and small inter-class differences, caused by aspect-angle sensitivity and by the differences between shallow and deep features. This article combines multi-view methods with multi-scale feature methods and proposes a Multi-View Residual Feature Pyramid Network (MV-ResFPN) for SAR target recognition. For multi-view targets, the four residual blocks of ResNet50 extract multi-scale feature information from each view. Multi-view features at the same scale are then fused, yielding multi-view fusion features at four different scales. These four features are fed into the proposed adaptive Feature Pyramid Network (FPN) for multi-scale fusion, which adaptively fuses shallow-to-deep features together with the final output of the ResNet. The proposed method reduces intra-class differences and makes inter-class differences more pronounced. Experiments on the Bistatic CircularSAR dataset captured by our team show that this method outperforms existing techniques in SAR ATR.
Facial Expression Recognition (FER) faces significant challenges due to the limited sensitivity of visible light images in low-light conditions. Most existing cross-domain emotion recognition studies focus on domain adaptation between different visible-light datasets, often neglecting emotion recognition under varying lighting conditions. To address this, we present the first cross-domain emotion recognition study for day and night environments, emphasizing modal migration between normal visible-light and low-light emotion maps. Considering the limited number of emotion maps and the reduced emotional detail in night-time imaging, we designed an attention-switching diversified feature capture module to focus on facial expressions and extract more local emotional features that aid migration. Given the significant domain shifts between samples captured under different lighting, we further developed a prototype feature transfer module to learn modality-independent category features and reduce the domain gap between visible and low-light features. To tackle the challenge of numerous low-sentiment-value samples in the dataset, we introduce a high-confidence blending module that filters informative visible and infrared samples for fusion, producing features that combine the styles of both domains. Compared to state-of-the-art methods in domain adaptation and cross-domain emotion recognition, our approach demonstrates superior performance, validating its effectiveness.
Locality preserving projections (LPP) is a classic linear approximation of Laplacian eigenmaps (LE) that can effectively extract local manifold structure. However, LPP is an unsupervised method and cannot make good use of the category information of samples. By introducing adjustable factors, the extended locality preserving projections (ELPP) method takes sample categories into account while preserving local information. In addition, the performance of a traditional KNN classifier varies significantly with the chosen k value, whereas the fuzzy k-nearest neighbor (FKNN) algorithm adds a fuzzy membership factor that yields stronger generalization. This paper therefore combines the fuzzy k-nearest neighbor algorithm with ELPP to form the FELPP learning model. Before using the FELPP model, the optimal value of the tunable factor must be determined. The superiority of the FELPP algorithm is then demonstrated by test results on the Yale, ORL, and AR face databases.
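To make the fuzzy-neighbor idea concrete, the sketch below is a small from-scratch fuzzy k-nearest-neighbor predictor in which neighbors vote with weights that fall off with distance. The fuzzifier m, the value of k, and the random feature data are illustrative choices, not the paper's FELPP pipeline.

```python
# From-scratch fuzzy k-nearest-neighbor prediction; m, k, and data are assumptions.
import numpy as np

def fknn_predict(X_train, y_train, x, k=5, m=2.0, eps=1e-8):
    d = np.linalg.norm(X_train - x, axis=1)
    idx = np.argsort(d)[:k]
    w = 1.0 / (d[idx] ** (2.0 / (m - 1.0)) + eps)      # fuzzy distance weights
    classes = np.unique(y_train)
    memberships = np.array([w[y_train[idx] == c].sum() for c in classes])
    memberships /= memberships.sum()                    # class membership degrees
    return classes[np.argmax(memberships)], memberships

X = np.random.randn(100, 50)                            # e.g. projected face features
y = np.random.randint(0, 5, 100)
label, mu = fknn_predict(X, y, X[0])
print(label, mu.round(3))
```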
In recent years, UAVs have been widely used for object detection and object counting thanks to their high flexibility and strong mobility. Above a certain flight altitude, a drone obtains a wider field of view, captures more objects, and improves time efficiency under limited endurance. However, higher flight altitudes also mean that the captured objects are smaller, making detection and tracking-based counting more difficult. For small-object counting with multi-object tracking, this paper proposes a counting algorithm based on ID changes and boundary awareness within a region of interest, targeting the difficult case where objects at the far end of the image are too small to track while objects at the near end keep their identity information (ID) unchanged. By setting a suitable region of interest, the counting task can be completed without having to track the smaller objects at the far end of the image. The proposed tracking-based counting method can count flexibly even when a tracked object's ID changes, effectively alleviating the adverse impact of ID switches on counting accuracy within the counting region of interest.
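One way to picture boundary-aware counting is sketched below: a track is counted only when it crosses from outside the region of interest to inside, so a new ID that first appears inside the ROI (for example after an ID switch) is not counted again. The ROI shape and per-frame track format are assumptions, not the paper's algorithm.

```python
# Illustrative boundary-aware counting sketch; ROI shape and track format are assumptions.

def make_counter(roi):
    """roi = (x1, y1, x2, y2). Returns a per-frame update function."""
    x1, y1, x2, y2 = roi
    last_inside = {}            # track_id -> was inside ROI in the previous frame
    state = {"count": 0}

    def inside(cx, cy):
        return x1 <= cx <= x2 and y1 <= cy <= y2

    def update(tracks):
        """tracks: list of (track_id, cx, cy) for the current frame."""
        for tid, cx, cy in tracks:
            now = inside(cx, cy)
            was = last_inside.get(tid, now)   # unseen IDs inherit their current side
            if now and not was:               # outside -> inside transition: count it
                state["count"] += 1
            last_inside[tid] = now
        return state["count"]

    return update

count = make_counter(roi=(100, 200, 500, 400))
print(count([(1, 80, 250)]))    # 0: track 1 still outside
print(count([(1, 150, 250)]))   # 1: track 1 crossed into the ROI
print(count([(7, 200, 300)]))   # 1: new ID appearing inside is not counted again
```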
Handwriting recognition, which identifies individuals by analyzing and comparing the features of handwriting samples, is a challenging field of biometric recognition. While deep learning techniques can handle large datasets effectively and provide high recognition accuracy, they still struggle to capture subtle features in small handwriting datasets. Although previous work has shown that the deep convolutional neural network DLS_CNN performs well on small datasets, it still falls short in extracting fine-grained handwriting features. Therefore, this paper introduces attention mechanisms into DLS_CNN to enhance the model's ability to capture subtle features. Specifically, two types of attention mechanisms are incorporated into DLS_CNN. Experimental results demonstrate that adding attention modules significantly improves recognition accuracy, with the CBAM attention mechanism being more effective than the SE attention mechanism.
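As a reference point, a minimal PyTorch sketch of the squeeze-and-excitation (SE) channel attention block mentioned above; the reduction ratio and layer sizes are illustrative, not the paper's configuration (CBAM additionally adds a spatial attention branch).

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation: reweight channels by globally pooled statistics."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                     # squeeze: global average pooling
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)   # excitation: per-channel weights
        return x * w                                            # rescale the feature maps
```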
This paper proposes a new method to compensate for asymmetric lens distortion through feature point projection transformation and linear reconstruction. The method involves traversing the world coordinates of calibration image feature points using a minimum reference grid and calculating the composite projection transformation error of all grids. The composite projection error is obtained through weighted calculations based on linear constraints, cross-ratio constraints, and parallel line constraints. After a second screening, the reference grid area with the minimum projection error is identified. Subsequently, the optimal reference grid is optimized with the goal of minimizing the composite error. Finally, the optimized feature point coordinates are solved with the corresponding world coordinates to derive the homography matrix. This allows the linear reconstruction of the entire calibration image's feature points through projection transformation, and the parameters of the asymmetric lens distortion model are optimized. Consequently, a high-precision asymmetric lens distortion model is obtained, allowing for compensation and correction of the asymmetric distortion.
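To illustrate the final step, a hedged OpenCV sketch of solving a homography from optimized feature-point coordinates and their planar world coordinates; the point values below are placeholders, not calibration data from the paper.

```python
import numpy as np
import cv2

# Hypothetical correspondences: planar world coordinates (Z = 0) of calibration
# grid corners and their optimized pixel coordinates in the calibration image.
world_pts = np.array([[0, 0], [30, 0], [30, 30], [0, 30]], dtype=np.float32)   # mm
image_pts = np.array([[102.3, 98.7], [412.8, 101.2],
                      [409.5, 405.1], [105.9, 401.4]], dtype=np.float32)       # px

# Homography mapping the world plane to the image; with 4 points a direct least-squares fit is used.
H, _ = cv2.findHomography(world_pts, image_pts, method=0)

# Linearly reconstruct (reproject) any feature point on the calibration plane.
reprojected = cv2.perspectiveTransform(world_pts.reshape(-1, 1, 2), H)
```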
Using traditional depth- and image-based methods to reconstruct eyeglass frames poses significant challenges due to their unique properties, such as minimal textural features and thin components. In this paper, we propose a novel self-supervised approach for reconstructing eyeglass frames based on vertex displacement prediction. This method enables precise 3D shape recovery of full-frame eyeglass models from multi-view images. We first establish an eyeglass frame template, leveraging prior and domain-specific knowledge. Subsequently, given an input image of an eyeglass frame characterized by thin structures and few texture features, we develop an iterative self-supervised camera pose estimation module to accurately predict the camera pose. Following this, we employ a self-supervised vertex offset prediction network to extract features from both the input multi-view images and the 3D template eyeglass frame model. This network predicts the vertex displacement of the frame to be reconstructed relative to the template frame model. We iteratively optimize the network to obtain precise vertex offsets by minimizing the image loss between rendered images and input images using differentiable rendering. Experimental findings on both synthetic and real datasets demonstrate the effectiveness of our proposed method.
Addressing the issues of reduced accuracy and robustness in traditional video shadow detection methods due to occlusions and deformations caused by lighting changes and complex backgrounds, a video shadow detection method based on the Segment Anything Model (SAM) is proposed. The decoder of SAM was fine-tuned to better suit shadow detection. Leveraging its high-precision segmentation capabilities, shadow regions were extracted, and the XMem model was introduced to integrate information from previous and subsequent frames by combining sensory, short-term, and long-term memories, thus optimizing and stabilizing the shadow detection results. Experimental results show that, compared to traditional methods on the ViSha dataset, the proposed method reduced the mean absolute error by approximately 31.8% and increased the Intersection over Union (IoU) by approximately 19.7%. Both qualitative and quantitative results indicate that this method not only improves the accuracy of video shadow processing and analysis but also exhibits good robustness.
With the rapid development of 3D point cloud data acquisition technology, its application in fields such as virtual reality, augmented reality, medical imaging analysis, and robotics is becoming increasingly widespread. Although current research mainly focuses on implicit-space point cloud processing, a growing body of evidence suggests that explicit-space methods, especially those based on point data primitives, can more effectively achieve neural surface reconstruction and point cloud deformation. This study performs reconstruction by imposing constraints in point space, normal space, and image space, and uses isometric constraints to construct a point cloud deformation field for precise deformation of point cloud models. Experiments on the Studio dataset and the Stanford 3D Scanning Repository demonstrated good neural surface reconstruction results: compared to the Instant NGP method, the reconstruction loss was reduced by 13.9% and 27.7%, respectively, and reconstruction holes were effectively avoided. Ablation experiments further confirmed the role of the AIAP constraint in improving the quality of point cloud deformation, with an optimization effect of 7.4%. The point cloud reconstruction and deformation method based on point primitives proposed in this paper, which works from explicit semantic space constraints and a guided learning process, not only completes the reconstruction and deformation of point clouds with high fidelity but also shows strong robustness and generalization.
Ball curves and surfaces play a crucial role in modeling 3D objects with varying shapes. This paper presents theorems and algorithms for the generation of a new type of generalized Ball curve: double partial Said-Ball curves. The authors also investigate the properties of the new curves, including nonnegativity, interpolation at the endpoints, and partition of unity, and give a detailed proof of the latter. Furthermore, the shape differences between the new curves and other existing generalized Ball curves are demonstrated through images.
This study is dedicated to single-image super-resolution reconstruction, in particular by introducing a multi-scale attention mechanism to enhance reconstruction effectiveness. Advances in super-resolution technology have provided unprecedented possibilities for reconstructing images with low clarity, yet challenges regarding reconstruction quality and efficiency remain. Based on a review of the literature concerning attention mechanisms and their applications in super-resolution, this paper proposes a new network architecture that incorporates a carefully designed multi-scale attention module. Experiments demonstrate that the new algorithm can effectively capture the feature information of images and achieve improvements in detail recovery. Both quantitative and qualitative results indicate that the proposed method outperforms existing approaches in image super-resolution, showcasing significant advantages and practical application value. The paper also provides a detailed analysis of the key modules within the algorithm and discusses the results obtained. Overall, this research opens up new avenues for enhancing image reconstruction performance and offers valuable insights for future work.
Light field reconstruction, a crucial domain in computer vision, generates 3D models of real-world scenes by combining multi-view images. Multi-view stereo technology is essential in this process, leveraging images from various angles to reconstruct 3D structures. It is widely used in applications such as cultural heritage preservation, robotics, and virtual reality. However, current methods often depend on rigid feature extraction and fusion techniques, restricting their capacity to fully exploit the rich multi-view data. This frequently results in blurred reconstructions, particularly in complex scenes with intricate edges and contours. To address these limitations, we propose CALFNet, a novel light field reconstruction method based on a channel attention mechanism. CALFNet combines this attention mechanism with a detail loss module, enabling the network to prioritize crucial channel information and edge features. This significantly enhances the accuracy and completeness of 3D reconstructions from multi-view images. Experiments on the DTU 3D reconstruction dataset demonstrate that CALFNet improves overall accuracy by 2.4% compared to existing methods. Additionally, tests on the Tanks and Temples dataset demonstrate the model's strong generalization capabilities, effectively handling diverse reconstruction tasks across various scenarios. This advancement paves the way for new applications in fields such as cultural heritage, robotics, and virtual reality.
This paper addresses the problem of image deblurring in the presence of impulse noise. While methods based on total variation (TV) regularization have a long-standing history in image processing, the traditional convex TVL1 model, which uses an L1 fidelity term, may introduce bias into the recovery results. To overcome this drawback, we propose a novel image deblurring model for impulse noise, called the TVcapped-L1 model, which integrates a nonconvex capped-L1 fidelity term with total variation regularization. We also provide an algorithmic framework that employs the difference-of-convex algorithm to solve the proposed model. Numerical experiments demonstrate that our proposed method outperforms existing methods in recovering images degraded by Gaussian blur and various types of impulse noise.
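In the notation of the abstract, one plausible way to write such a model (a sketch; the weight \(\lambda\), cap \(\theta\), and discrete TV operator are generic placeholders): given the blur operator \(K\) and observation \(f\),
\[
\min_{u}\; \mathrm{TV}(u) \;+\; \lambda \sum_{i} \min\bigl(\lvert (Ku - f)_i \rvert,\; \theta\bigr),
\]
where the capped-L1 fidelity saturates at \(\theta\), so large residuals caused by impulse noise no longer bias the recovery. A difference-of-convex splitting is natural here because \(\min(\lvert r\rvert,\theta) = \lvert r\rvert - \max(\lvert r\rvert - \theta, 0)\) is a difference of two convex functions.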
Multimodal Perception and Feature Fusion Techniques in Image Processing
This paper focuses on the image composition of transparent objects, where existing image matting methods suffer from composition errors due to the lack of an accurate foreground during the composition process. We propose a foreground prediction model named ALGM, which leverages the local feature extraction capabilities of Convolutional Neural Networks (CNNs) and incorporates an attention mechanism for global information modeling. The proposed alpha-assisted foreground prediction module extracts foreground information from the original image and passes it on: the extracted foreground color information is combined with the deep structural features of the encoder and used for foreground color prediction. ALGM reduces image composition errors in quantitative evaluations on the Composition-1k dataset and improves the visual quality of composed images on the AIM-500 and Transparent-460 datasets.
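The composition error mentioned above stems from the classic matting equation; per pixel,
\[
I = \alpha F + (1 - \alpha) B,
\]
so even with an accurate alpha matte \(\alpha\), composing onto a new background requires the true foreground color \(F\); ALGM's foreground prediction supplies this \(F\). (The equation is the standard compositing model, not a contribution of the paper.)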
Aiming at the problems that feature points extracted by traditional SLAM algorithms tend to pile up in texture-rich areas, and that the accuracy and speed of feature point matching cannot meet the needs of real-time scene map construction, we design a point and line feature fusion, extraction, and matching algorithm based on grid motion statistics. First, the accuracy of line feature extraction is improved by tuning the parameters of the line feature extraction algorithm; then, grid division is applied during point feature extraction to make the distribution of feature points more uniform; finally, the grid motion statistics algorithm is introduced to eliminate false matches by counting the distinguishability of matching neighborhoods under a motion smoothness constraint. Comparisons of the number of extracted feature points, time consumption, distribution, and matching effect show that the improved algorithm extracts a sufficient number of uniformly distributed feature points, making the most of the useful information in the image; in addition, compared with other similar algorithms, the number of matched pairs is greatly increased while the matching accuracy is also significantly improved, which in turn verifies the superiority of the improved algorithm.
In recent years, deep learning-based image inpainting methods have made significant progress by incorporating structural priors. However, due to the lack of information interaction between textures and structures and the difficulty of integrating them, these methods are inadequate for complex inpainting tasks, which leads to structural distortion and detail loss. This paper proposes a novel twin-stream adversarial model (TSAM) for image inpainting. The model establishes a Multilevel Conditional Encoding (MCE) network in the twin-stream encoding stage, which enriches the information interaction between textures and structures. The MCE network incorporates both structure-constrained texture synthesis and texture-guided structure reconstruction, allowing the two to reinforce each other and improving the inpainting effect. In the twin-stream decoding stage, a Progressive Feature Fusion and Decoding (PFFD) network is constructed to fuse textures and structures, guaranteeing global consistency and rich details in inpainting results on complex scenes. Qualitative and quantitative experiments on the CelebA-HQ, Paris StreetView, and Places2 datasets show the superiority of this method.
As a form of cross-media intelligence that combines computer vision and natural language processing, video semantic annotation facilitates the automatic localization of events in videos and describes video content in natural language. Unlike standard video annotation, dense video annotation requires the detection and description of multiple events in long videos, adding the complexity of localizing events in long videos. We propose a dense video semantic annotation method based on deep learning that predicts a single discrete tag sequence for a given multimodal input, which includes a title tag for each event and a time tag representing the timestamp of the event. The proposed model uses unlabeled narrated videos for pre-training, taking transcribed speech and the corresponding timestamps as a weakly supervised source of dense video annotation in place of manual annotation, which expands the size of available datasets. Furthermore, by fine-tuning the model, we can apply it to paragraph annotation, generating paragraph descriptions of the entire video. The results show that the proposed model can predict high-quality event descriptions and relatively accurate temporal boundaries in different scenarios.
Image retrieval is the process of searching for similar images based on their visual content and has significant applications in medical imaging. Quickly retrieving similar images from a medical image database can assist doctors in rapid browsing and diagnosis. Although feature fusion-based descriptors have achieved great success in image retrieval, there are relatively few cases of using fused feature descriptors for image retrieval in the medical imaging field. To address this gap, we propose a novel feature extraction model, Fuzzy-AttnNet, aimed at enhancing the performance of medical image retrieval using fused descriptors. The model employs dilated convolutions and self-attention mechanisms, comprehensively modeling both local and global features of medical images and performing feature fusion; the fused descriptors are then applied to image retrieval. Additionally, Fuzzy-AttnNet employs a parallel optimization strategy using ArcFace loss and Softmax loss, which enlarges the separation between classes and significantly improves the model's recognition performance across different categories. In subclass classification tasks, Fuzzy-AttnNet demonstrates significant advantages, with an average precision improvement of approximately 10% and a recall increase of about 30%. The experimental results indicate that our proposed feature extraction model significantly enhances the accuracy of medical image retrieval.
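For reference, the standard ArcFace formulation referred to above, with scale \(s\) and angular margin \(m\) (notation generic, not the paper's exact hyperparameters):
\[
L_{\mathrm{arc}} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{\,s\cos(\theta_{y_i}+m)}}{e^{\,s\cos(\theta_{y_i}+m)}+\sum_{j\neq y_i}e^{\,s\cos\theta_{j}}},
\]
where \(\theta_j\) is the angle between the normalized embedding of sample \(i\) and the weight vector of class \(j\); the additive margin \(m\) pushes embeddings of different classes further apart on the hypersphere.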
Recovering a complete point cloud from a partial point cloud is a critical and challenging task for many 3D applications. In this paper, a point cloud completion network is proposed that focuses on improving the point cloud feature extraction and the initial generated point cloud in the encoding phase. The local details of the original point cloud are highlighted by introducing trigonometric positional embedding for point cloud encoding. Moreover, a self-attention mechanism for feature fusion is proposed to facilitate the generation of a complete point cloud. The experiments on multiple public datasets demonstrate that our network effectively achieves 3D point cloud completion with strong generalization, outperforming recent point cloud completion methods.
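A minimal sketch of NeRF-style trigonometric (sinusoidal) positional embedding for point coordinates, which is the usual form such encodings take; the number of frequency bands is arbitrary here and the paper's exact embedding may differ.

```python
import numpy as np

def trig_positional_embedding(points, num_bands=6):
    """Map (N, 3) point coordinates to (N, 3 * 2 * num_bands) sinusoidal features."""
    freqs = 2.0 ** np.arange(num_bands) * np.pi             # 2^k * pi frequency bands
    scaled = points[:, :, None] * freqs[None, None, :]      # (N, 3, num_bands)
    emb = np.concatenate([np.sin(scaled), np.cos(scaled)], axis=-1)
    return emb.reshape(points.shape[0], -1)

pts = np.random.rand(1024, 3)             # hypothetical partial point cloud
feat = trig_positional_embedding(pts)     # per-point features fed to the encoder
```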
Hyperspectral image change detection is a hot research field. However, some methods consider only spectral or only spatial information. Moreover, feature extraction methods based on neural networks usually need a large number of training samples, while labeled training samples are difficult to collect. Therefore, a hyperspectral change detection method based on a 3D convolutional autoencoder and a feature difference constraint is proposed. First, 3D convolution is used in an unsupervised autoencoder to extract deep spatial-spectral features. Then, so that the spatial-spectral features both retain the main information of the original data and become more robust for change discrimination, the loss function of the autoencoder is constructed from reconstruction errors and feature difference constraints. Finally, using the extracted deep spatial-spectral features of the bi-temporal hyperspectral images as input, a simple classification network is designed to judge whether each input feature pair has changed. The input spatial-spectral features are low-dimensional, which effectively reduces the parameters of the classification network, so only a small amount of labeled data is needed to complete the training; this improves training efficiency and alleviates the problem of insufficient labeled data. Experiments conducted on three real hyperspectral datasets demonstrate that the proposed method achieves better detection performance than state-of-the-art methods.
Computer Aided Painting and Multimedia Applications
Unmanned Aerial Vehicle (UAV) platforms are often equipped with dual-modal cameras that capture paired optical and thermal infrared images, and using optical information to guide the Super-Resolution (SR) of thermal images has become an effective way to obtain High-Resolution (HR) thermal images from the UAV viewpoint. However, because the optical and thermal sensors differ in position and resolution, the acquired optical and thermal images are usually not aligned, which makes the optical guidance for thermal image SR less effective. This paper proposes an Alignment Fusion Network (AFNet), which aligns and fuses optical features during thermal image SR to avoid the effect of this non-alignment. First, a Feature Alignment Module (FAM) is designed to align optical images with thermal images at the feature level using deformable convolution and to enhance features using a channel attention mechanism. A Bidirectional Fusion Module (BFM) is then developed to effectively fuse optical and thermal features. Extensive experiments on public datasets show that our AFNet effectively mitigates the effect of the non-alignment and produces better HR thermal images both numerically and visually compared with state-of-the-art methods.
Low-rank matrix estimation is an important tool in modern data analysis and machine learning. Its basic idea is to approximate the original matrix with a low-rank matrix, thereby achieving dimensionality reduction, noise reduction, or simplified computation. In this paper, we propose two new matrix decomposition methods based on CUR decomposition: Row Random Decomposition and Column-Row Random Decomposition. Compared to the classic CUR decomposition, these two methods significantly improve computational efficiency, reduce randomness, and introduce a degree of determinism. We then apply Row Random Decomposition and Column-Row Random Decomposition to update the low-rank matrix estimate in the RPCA problem, resulting in two algorithms: Row Random RPCA and Column-Row Random RPCA. These two algorithms can quickly yield an accurate low-rank matrix estimate while avoiding an SVD of the original data matrix. By leveraging Row Random Decomposition and Column-Row Random Decomposition, they effectively reduce computational complexity and improve efficiency in solving the RPCA problem. Numerical experiments on synthetic datasets demonstrate the computational advantages of our algorithms, revealing that our methods not only converge linearly but also maintain high accuracy and robustness across various scenarios. Experiments on video background subtraction further highlight the excellent performance of Row Random RPCA and Column-Row Random RPCA in both speed and visual quality.
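For orientation, a NumPy sketch of the classic randomized CUR decomposition that the two proposed variants build on (uniform column/row sampling; the Row Random and Column-Row Random schemes themselves are not reproduced here):

```python
import numpy as np

def cur_decomposition(A, k, rng=np.random.default_rng(0)):
    """Classic randomized CUR: A ~ C @ U @ R with k sampled columns and rows."""
    m, n = A.shape
    cols = rng.choice(n, size=k, replace=False)    # sampled column indices
    rows = rng.choice(m, size=k, replace=False)    # sampled row indices
    C = A[:, cols]                                 # actual columns of A
    R = A[rows, :]                                 # actual rows of A
    U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)  # linking matrix minimizing ||A - CUR||_F
    return C, U, R

# Toy low-rank matrix plus noise to check the approximation quality.
A = np.outer(np.arange(50.0), np.arange(40.0)) + 0.01 * np.random.default_rng(1).standard_normal((50, 40))
C, U, R = cur_decomposition(A, k=5)
err = np.linalg.norm(A - C @ U @ R) / np.linalg.norm(A)   # relative approximation error
```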
Drawing architectural plans is a crucial step in the preservation of ancient Chinese buildings. This process is often time-consuming and typically carried out by skilled specialists. This paper presents a method for automatically extracting line art of building structures, which significantly improves the efficiency of creating architectural plans. We first introduce a dataset of images featuring Huizhou-style ancient buildings. Each building image is then segmented into various components to identify the main structure. The SUPIR model can optionally be employed to enhance the component images with super-resolution. Moreover, we developed a split-and-regroup algorithm specifically for the eave component, which often presents challenges such as broken, tilted, or occluded roof tiles. Finally, we trained a line art extraction model using ControlNet and LoRA. The experiments demonstrate that our method yields more detailed images than existing algorithms.
The objective of physics-based differentiable rendering (PBDR) is to propagate gradients between scene parameters and the intensities of image pixels in a physically correct manner. The gradients obtained can be applied in optimization algorithms for the reconstruction of 3D geometry or materials, or they can be further propagated into neural networks to learn neural representations of the scene. However, applying automatic differentiation techniques directly to the primary rendering process results in biased gradients, as the rendering integral contains moving high-dimensional discontinuities. Based on how these discontinuities are managed, either implicitly or explicitly, existing PBDR methods can be categorized into two groups: reparameterization methods and boundary sampling methods. Boundary sampling methods need to construct paths that have one segment tangent to the geometry being differentiated in order to estimate a boundary integral and address the discontinuities explicitly. Such paths are usually constructed by sampling the tangent segment first and then extending it to complete the path for subsequent processing. Fortunately, the space of such tangent segments has only three dimensions. In scenes composed solely of triangle meshes, the first dimension parameterizes all the edges on the mesh, which determines a point on the tangent segment, and the remaining two dimensions parameterize the direction of the tangent segment. However, state-of-the-art boundary sampling methods parameterize the first dimension uniformly, which is inefficient because only a small portion of the edges contributes to the boundary integral, resulting in wasted parameter space. In this paper, we parameterize the first dimension by considering both edge length and contribution, thereby allocating more parameter space to important edges. Experiments demonstrate that our method achieves lower-variance gradients in forward differentiable rendering and improved geometry reconstruction quality in inverse rendering.
Over the past few years, NeRF research has led to widespread applications across diverse fields, including virtual filmmaking and product design. However, conveniently editing NeRF scenes still faces challenges. There is a pressing need to investigate the editability of NeRF to bolster user interaction with NeRF scenes, facilitating swift modifications to scene appearance and content. At present, many investigations have focused on NeRF editing, but some are constrained to altering whole scenes, while others are confined to particular editing tasks and lack adaptability and flexibility. We introduce a method called BoxNeRF, which enables efficient and flexible editing operations on objects within NeRF scenes, including copy-paste, deletion, and affine transformations, by selecting them within a single 2D image. We use generated 3D masks to perform editing operations on objects and utilize the parameters of the original NeRF model to infer object content instead of optimizing a new NeRF model, thereby preserving texture details while maintaining accuracy in both color and geometry. Our approach empowers users to conduct high-quality scene editing with ease while maintaining scene coherence. Besides traditional explicit 3D editing operations, we also enable object transfer between scenes and recoloring, with satisfactory editing outcomes. Our method is straightforward, effective, and high quality, achieving editing speeds of seconds even on a mid-range commercial GPU.
Video Frame Interpolation (VFI) is a technique that enhances video quality by generating new frames between adjacent video frames. However, many existing VFI methods exhibit suboptimal performance when dealing with scenes involving complex backgrounds and significant motion. In this paper, we propose an innovative VFI approach named BEOUNet to address these challenges. The model adopts a bidirectional encoding structure to comprehensively capture motion and contextual information between video frames. Furthermore, we introduce depthwise over-parameterized recurrent residual convolutional units (DO-RRCUs) and Convolutional Block Attention Module (CBAM) to optimize the U-Net architecture, thereby enhancing performance while reducing computational complexity. Finally, a multi-scale frame synthesis network is employed to generate interpolated frames. We conduct benchmark tests on various datasets, and experimental results demonstrate its excellent performance in terms of efficiency and quality.
Animation rendering is the process of converting 3D models and scenes into 2D images or image sequences, visually presenting dynamic and realistic 3D environmental animations. Traditional methods render geometric scenes frame by frame along the timeline, which can waste computational resources. A novel method based on temporal interval inversion is proposed for the visual representation of implicit surface animation. The scene to be rendered is modeled implicitly, and a sparse octree grid is used to spatially partition the implicit scene. Interval arithmetic is employed to recursively subdivide the implicit scene into three parts: the interior, the exterior, and the surface of the implicit body. The deformation of consecutive multi-frame animations within a time interval is calculated using the time derivative, and the animation is selectively updated only if the deformation exceeds a given threshold. Experimental results show that this method can trade temporal accuracy for rendering speed of implicit surface animations, efficiently achieving occlusion between multiple implicit surfaces in the animation.
To address the issue that existing double integral calculators cannot calculate surface integrals, a method of converting surface integral input modes into double integral input modes is proposed. First, nine input modes for calculating double integrals are presented. Next, a method for converting surface integral input modes into double integral input modes is designed. Finally, a method for calculating surface integrals using a double integral calculator is designed. Experimental results show that the improved double integral calculator can calculate surface integrals.
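As a concrete instance of the conversion, for a surface S given explicitly by z = g(x, y) over a plane region D, the first-type surface integral reduces to a double integral:
\[
\iint_{S} f(x,y,z)\, \mathrm{d}S
= \iint_{D} f\bigl(x,y,g(x,y)\bigr)\,
\sqrt{1 + g_x^{2} + g_y^{2}}\;\mathrm{d}x\,\mathrm{d}y .
\]
For example, with f equal to 1 and g(x, y) = x + y over the unit square, the integrand becomes the constant \(\sqrt{3}\), so the surface area is \(\sqrt{3}\). (This is the standard reduction; the nine double-integral input modes of the calculator itself are not reproduced here.)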
Image Processing Technology Based on Deep Learning and Multimodal Perception
The geodesic voting method is known as a powerful tool for extracting curvilinear structures and is able to find a tree structure from a single source point. However, this method may fail to generate accurate results in complex scenarios such as complex network-like structures, due to the limitation of a single source point. To solve this problem, we propose an adaptive curvature-penalized geodesic voting method in which multiple source points with a geometric voting constraint are used to construct the voting score map. In addition, we exploit the introduced adaptive geodesic voting method for the task of retinal vessel tracking, in conjunction with a deep learning-based junction point detection procedure. Experimental results on both synthetic images and retinal images prove the efficiency of the introduced adaptive geodesic voting method.
Due to the complex nature of the underwater environment, underwater images often suffer from degradation issues such as low contrast, blurring, and color distortion. Obtaining clear underwater images is crucial for advancements in marine development. While existing convolution-based methods for underwater image enhancement have shown efficient improvement in visual quality, they still exhibit deficiencies in two key aspects: the ability to capture contextual information and the issue of information redundancy during image reconstruction. In this work, we propose MAGAN-UIE, a novel generative adversarial network for underwater image enhancement. MAGAN-UIE leverages dilated convolutions and depth-wise convolutions to enable efficient extraction of local features and contextual information. Furthermore, the model incorporates multiple attention mechanisms to mitigate information redundancy. Our extensive experiments demonstrate that the proposed method achieves significant improvements in underwater image enhancement, as evidenced by both visual inspection and quantitative evaluation metrics.
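A minimal PyTorch sketch of the two convolution types mentioned above (a depthwise convolution and a dilated convolution); channel counts and dilation rates are illustrative only, not MAGAN-UIE's actual configuration.

```python
import torch
import torch.nn as nn

channels = 64
# Depthwise convolution: one filter per channel (groups == channels), cheap local features.
depthwise = nn.Conv2d(channels, channels, kernel_size=3, padding=1, groups=channels)
# Dilated convolution: enlarged receptive field for contextual information at the same resolution.
dilated = nn.Conv2d(channels, channels, kernel_size=3, padding=2, dilation=2)

x = torch.randn(1, channels, 128, 128)     # a dummy underwater feature map
y = dilated(depthwise(x))                  # spatial size is preserved: (1, 64, 128, 128)
```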
Reversible information hiding is capable of fully reconstructing the original cover image while extracting the embedded secret information losslessly, making it a frequent choice for demanding fields such as medical diagnosis, military and remote sensing image processing. We propose a novel reversible information hiding method that embeds secret data into the Side Match Vector Quantization (SMVQ) compressed index table. Firstly, the original image is compressed through vector quantization (VQ), and then the compressed image is modified using SMVQ to obtain a transformed image, which is the SMVQ compressed index table. Finally, information hiding is performed based on the histogram distribution of the index table, resulting in a larger embedding capacity and a lower bit rate. Experimental results demonstrate that this method significantly increases the embedding capacity and further enhances the compression ratio.
In order to solve the problem of poor compression performance in parts of an image, an image compression algorithm based on data embedding is proposed. Through operations such as image blocking, feature extraction, judgment, preprocessing, and lossless embedding, the compression performance of images can be improved. The algorithm compresses the complex image blocks and the processed image separately, with the strategy of lowering the compression ratio for local regions while raising it for the overall image. The compressed image blocks are then losslessly embedded into the compressed overall image, using a method of lossless data embedding in arbitrary byte data proposed in this algorithm. This method uses the characteristics of the data itself to set keys and carry information on the data, avoiding the fixed practice of traditional methods that hide information by data type. Simulation results show that the proposed algorithm can improve the compression performance of complex image blocks by 3 dB while the performance of other image regions is unchanged, and the actual embedding rate of compressed images meets the requirements of the algorithm, so the algorithm is feasible. The overall compression performance of images is improved, and the usability of images is effectively enhanced.
Zero-shot image super-resolution methods have attracted considerable attention due to their high potential for applications with few additional training data. However, these methods often encounter challenges such as inconsistent results and blurred details. To mitigate these issues, we propose WGID: Wavelet-Guided Iterative Detail Enhancement Diffusion Models for single-image super-resolution. In this method, diffusion iteration is guided by wavelet transform to enhance details across different scales while maintaining the similarity between the reference and diffusion-generated images. In addition, the reference images are dynamically updated to provide a suitable guidance during the diffusion process. Meanwhile, the diffusion model progressively refines details, suppresses noise and preserves the natural appearance of the generated image. By integrating these two techniques, the proposed method produces reconstructed super-resolution image with enhanced visual quality, clear details, and more realistic results, accompanied by improved assessment metrics.
Leveraging the widespread applications of Synthetic Aperture Radar (SAR) imaging technology and Multiple Input Multiple Output (MIMO) arrays in millimeter wave radar, it is now possible to obtain high-resolution forward-looking (FL) images. Forward-Looking Multiple-Input Multiple-Output Synthetic Aperture Radar (FL-MIMO-SAR) represents the state-of-the-art technology for achieving high-resolution imaging of the area in front of a vehicle. This advanced imaging capability is crucial for numerous applications, including autonomous driving, advanced driver assistance systems (ADAS), as well as surveillance and reconnaissance. However, traditional forward-looking synthetic aperture radar is characterized by a left-right ambiguity problem, while MIMO array real aperture imaging struggles with low resolution in side-view regions. To address these challenges, this paper proposes a joint imaging method for FL-MIMO-SAR. This method utilizes MIMO arrays to resolve the left-right ambiguity problem, achieving high-resolution imaging in the forward-looking region. In the squint region, higher resolution can be obtained by synthetic aperture processing because the length of the synthetic aperture is usually larger than the actual aperture. Furthermore, this paper thoroughly analyzes the amplitude and phase errors of MIMO arrays and their impact on imaging, proposing an effective calibration method to address these issues. Finally, the effectiveness of the proposed algorithm is validated through experiments with real measurement data.
Low-light image enhancement (LLIE) in Raw space has posed a challenge in the field of image processing and computational photography. Unlike image processing in sRGB space, Raw images store more image information and thus leave more room for image processing. Existing methods often struggle to approximate real images in terms of detail enhancement and color restoration. Drawing inspiration from the latest advancements in denoising diffusion models, we introduce an innovative framework that leverages wavelet transforms and diffusion models to tackle the low-light image enhancement challenge in Raw space. In our method, the input low-light image is first decomposed into different frequency sub-bands by the discrete wavelet transform (DWT) at both the image and feature levels. A redesigned diffusion model is then trained to adaptively denoise and enhance the different frequency sub-bands. To achieve better results in terms of human perception, we also improve the loss function used during training. Finally, the enhanced result is reconstructed through the inverse wavelet transform (IWT) and subsequent processing, transitioning from Raw space to sRGB space. Extensive experiments on the SID and ELD datasets show that our method outperforms several existing methods, delivering superior performance both quantitatively and qualitatively.
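To make the first step concrete, a small sketch of a single-level 2D discrete wavelet decomposition and its inverse using PyWavelets; the Haar wavelet and the single decomposition level are placeholders rather than the paper's settings.

```python
import numpy as np
import pywt

raw = np.random.rand(512, 512).astype(np.float32)          # stand-in for one Raw channel

# One-level 2D DWT: low-frequency LL plus horizontal/vertical/diagonal detail sub-bands.
LL, (LH, HL, HH) = pywt.dwt2(raw, 'haar')

# ... a diffusion model would denoise/enhance each sub-band here ...

# Inverse DWT reconstructs the image from the (possibly enhanced) sub-bands.
recon = pywt.idwt2((LL, (LH, HL, HH)), 'haar')
assert np.allclose(recon, raw, atol=1e-5)                   # Haar gives exact reconstruction
```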
Structured light imaging, particularly the phase-shifting method, has become a research hotspot in three-dimensional imaging due to its non-contact nature, high precision, and flexibility. Phase-shifting methods require Gray codes to resolve the phase order for 3D measurement. Given the low coding efficiency of binary patterns, this paper proposes a 3D imaging method that uses a quaternary Gray code, which has higher coding efficiency, to assist three-step phase shifting. This method requires fewer projected images than traditional binary Gray code methods, making it more suitable for dynamic scene measurement. However, in actual measurement, noise and threshold segmentation can cause significant distortion at the discontinuities of the truncated phase, leading to errors in the phase order and ultimately to measurement errors. In this work, we use a quaternary complementary Gray code to construct two complementary phase orders with a half-fringe-period difference to unwrap the truncated phase. This approach is proven effective in addressing errors in the phase truncation regions.
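As a sketch of the coding idea, the standard non-Boolean (base-4) Gray code, in which consecutive codewords differ in exactly one quaternary digit; the fringe patterns actually projected in the paper are derived from, but not identical to, such a sequence.

```python
def quaternary_gray(n, digits):
    """Standard base-4 Gray code of integer n, as a list of `digits` quaternary digits."""
    base4 = []
    for _ in range(digits):
        base4.append(n % 4)
        n //= 4
    base4 = base4[::-1]                      # most significant digit first
    gray, prev = [], 0
    for d in base4:
        gray.append((d - prev) % 4)          # g_i = (a_i - a_{i-1}) mod 4
        prev = d
    return gray

codes = [quaternary_gray(i, 3) for i in range(64)]
# Adjacent codewords differ in exactly one digit position (modular Gray property).
assert all(sum(a != b for a, b in zip(codes[i], codes[i + 1])) == 1 for i in range(63))
```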
AI based Image Generation and Restoration Technology
Restoring the various kinds of deterioration found in murals is urgently necessary, given the growing awareness of the need to protect cultural relics. Because the original information needed to guide the whole inpainting process is lacking, patch-based methods, which only use patches to search for information in the known parts of a mural, have been widely used in mural inpainting. Traditional patch-based methods consist of patch selection and similarity computation steps and often suffer from search mistakes such as discontinuities. To solve the problem of linear discontinuities arising in patch-based inpainting, this paper introduces structure-guided patch selection and an alternative similarity computation that considers more patch features. In the patch selection step, only the major structure of the image is considered during patch priority computation; as a result, the priority of patches at the edge of the damaged region is reconsidered, and the overall computation time is reduced because less unnecessary structure is processed. In the similarity computation step, both local and global patch features are considered, rather than only local features as in traditional methods, so that patch features are used more comprehensively in the inpainting process. The improved method was applied to simulated and actual murals. It performed better than diffusion-based methods and other traditional patch-based methods, not only improving linear continuity in some complex parts of the murals but also reducing the time consumed.
Reversible Data Hiding (RDH) is a special type of data hiding and is widely used in many applications. To also protect the cover image, RDH in Encrypted Images (RDHEI) has been proposed. With its reversibility and its ability to hide data in encrypted images, RDHEI has potential utility in fields such as military and medical image transmission and forensic analysis. Recently, Panchikkil et al. proposed an RDHEI scheme based on pixel swapping (RDHEI-PSW), which swaps some pixels in a block. Pixel swapping does not change pixel values, so RDHEI-PSW retains the entropy and histogram of the image. The smoothness of a block is used to extract the hidden bits and, at the same time, recover the cover image; however, RDHEI-PSW hides only two secret bits in each block. More recently, Kim et al. used Pixel Shifting (PSH), a circular right shift within a block, instead of pixel swapping to propose RDHEI-PSH. In this paper, we also adopt the PSH approach but use it in a different way (a modified PSH) to propose an enhanced RDHEI-PSH with a higher embedding capacity than Kim et al.'s RDHEI-PSH. Our enhanced RDHEI-PSH doubles the number of patterns per block, on which we can embed more secret bits. Experimental results demonstrate that our enhanced RDHEI-PSH surpasses Panchikkil et al.'s RDHEI-PSW and Kim et al.'s RDHEI-PSH.
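A hedged toy illustration of the basic pixel-shifting idea on one encrypted block (a circular shift whose amount encodes the secret bits); the modified PSH of this paper and the extraction/recovery rule based on block smoothness are not reproduced here.

```python
import numpy as np

def embed_bits_by_shift(block, bits):
    """Encode two secret bits in a block by circularly shifting its pixels (toy sketch)."""
    shift = int(bits, 2)                          # '00'..'11' -> 0..3 positions
    flat = block.flatten()
    return np.roll(flat, shift).reshape(block.shape)

block = np.array([[10, 20], [30, 40]], dtype=np.uint8)    # a 2x2 encrypted block
marked = embed_bits_by_shift(block, '10')                  # shift by 2 positions
# Pixel values are only permuted, so the block histogram (and entropy) is unchanged.
assert np.array_equal(np.sort(marked.ravel()), np.sort(block.ravel()))
```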
Image matting is a technique dedicated to the precise extraction of foreground opacity from a target image and is widely used in image compositing and video editing. Pixel pair optimization-based methods select the best pair of foreground and background pixels for each unknown pixel based on a pixel pair evaluation function, thereby estimating opacity. However, existing pixel pair evaluation functions are inaccurate, causing the alpha values of some foreground pixels to be erroneously estimated as 0. To address this issue, this paper proposes a Hybrid Feature Distortion pixel pair Evaluation function (HFDE). The method designs a Semantic Distance Feature vector Distortion (SDFD) term from the dimension of high-order features and combines it with existing low-order feature evaluation terms to measure the quality of pixel pairs from both dimensions. Experimental results show that, compared to existing pixel pair evaluation functions, the hybrid feature distortion evaluation function measures the quality of pixel pairs more accurately, improving image matting performance.
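For context, the standard way an alpha value is estimated from a candidate foreground/background pixel pair (F, B) for an observed color I (this is the classical estimate that pair evaluation functions score, not the HFDE term itself):
\[
\hat{\alpha} = \frac{(I - B)\cdot(F - B)}{\lVert F - B\rVert^{2}},
\]
clamped to [0, 1]; low-order evaluation terms then judge how well the pair explains I, for example via the chromatic distortion \(\lVert I - (\hat{\alpha}F + (1-\hat{\alpha})B)\rVert\).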
Guidelines play an important role in calligraphy study, since they allow beginners to see how calligraphic strokes are formed and how to write them. However, manually generating guidelines for calligraphic characters is tedious, so only a limited number of calligraphy books with guidelines are available. To automatically generate the guidelines, one needs to first decompose the strokes of a character and then generate a guideline for each stroke. This is very challenging, since existing guideline generation methods only mimic the shapes of the guidelines and generally cannot produce precise guidelines that amateurs can practically use to trace the calligraphic strokes. In this paper, we first adopt a state-of-the-art stroke segmentation method based on stroke category information. We then propose a novel guideline generation method that fits ellipse-shaped brushes to each stroke based on contour tangents. Extensive visual and quantitative experiments validate the effectiveness of our method, and the results and statistics show that it outperforms existing methods both visually and quantitatively.
This study introduces a novel 3D shape texture generation technique to enhance 3D art creation efficiency. It proposes an iterative strategy using a pre-trained depth-to-image diffusion model, reducing computational costs and time compared to traditional fractional distillation methods. The method incorporates a three-mapping partition formula to ensure texture consistency and detail richness. Comparative tests with Text2Mesh and Latent Paint show superior quality and text fidelity, with scores of 3.89 and 4.24 respectively, against Text2Mesh's 2.58 and 3.59, and Latent Paint's 2.99 and 3.99. The runtime is significantly reduced to an average of 6 minutes, versus 34 and 48 minutes for the other methods. This advancement offers 3D artists, game developers, and modelers a powerful tool for quickly generating high-quality, text-consistent textures, promising further optimizations for even more efficient and precise results.
In this paper, we propose an innovative framework built around a Latent Age Attribute Module (LAAM), which maps age attribute variables to a prior latent space that is easy to sample. By continuously sampling the high-dimensional age space, we accurately characterize the distribution of age attribute variables. Additionally, we introduce the Age-AdaIN Fusion Module (AFM), which fully integrates the age attribute features mapped by LAAM with the content features of the face, generating images that transition smoothly and continuously across different ages. Furthermore, owing to the continuous characterization of the latent age attribute, the proposed method can better capture and generate fine-grained aging details, especially for elderly individuals. Through quantitative and qualitative analyses on existing datasets, we validate the effectiveness of our proposed method for face aging and its performance gains, especially for generating aged facial images of the elderly.
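Since the AFM builds on Adaptive Instance Normalization, a minimal AdaIN sketch is given below: content features are re-normalized with statistics derived from a style (here, age) code. The interface with per-channel style statistics is an assumption; the abstract does not detail how the AFM derives them.

```python
import numpy as np

def adain(content, style_mean, style_std, eps=1e-5):
    """Adaptive Instance Normalization.

    content:    (C, H, W) face content feature map
    style_mean: (C,) per-channel mean predicted from the age latent (assumed interface)
    style_std:  (C,) per-channel std predicted from the age latent (assumed interface)
    """
    mu = content.mean(axis=(1, 2), keepdims=True)
    sigma = content.std(axis=(1, 2), keepdims=True) + eps
    normalized = (content - mu) / sigma
    return normalized * style_std[:, None, None] + style_mean[:, None, None]
```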
The Neural Radiance Field (NeRF) method has emerged as a groundbreaking technique for human reconstruction, enabling high-quality, photorealistic rendering of reconstructed subjects. Despite its promising results, several challenges persist, particularly in generalizing to unknown poses when generating virtual human animations. This paper proposes a method for generating and optimizing novel human poses based on NeRF. Given that existing pose-driven NeRF methods rely heavily on the accuracy of SMPL body parameters, we implement human pose optimization using motion capture data as prior knowledge, resulting in more accurate SMPL parameters and, in turn, a more precise canonical human model. Additionally, we propose a pose-driven deformation field based on linear blend skinning, combining blend weight fields with the 3D human skeleton to achieve precise mapping from observed poses to the canonical pose. By estimating the camera ray transmittance distribution during volume rendering, we propose an importance sampling method that avoids sampling points in free space, further improving rendering quality. Extensive experiments demonstrate that our method yields significant improvements in rendering quality for both training poses and novel poses, as evidenced by results on the ZJU-MoCap dataset and our custom dataset. This work not only advances the capabilities of NeRF in human reconstruction but also opens new avenues for realistic virtual human animation.
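The deformation field rests on the generic linear blend skinning (LBS) warp; a minimal sketch is below. It shows only the standard weighted combination of per-joint rigid transforms, not the paper's learned blend-weight field or ray sampling strategy.

```python
import numpy as np

def linear_blend_skinning(points, weights, transforms):
    """Warp points from the observed pose toward the canonical pose with LBS.

    points:     (N, 3) query points in observation space
    weights:    (N, J) blend weights over J joints (each row sums to 1)
    transforms: (J, 4, 4) per-joint rigid transforms (observed -> canonical)
    """
    homo = np.concatenate([points, np.ones((points.shape[0], 1))], axis=1)   # (N, 4)
    per_joint = np.einsum('jab,nb->nja', transforms, homo)                   # (N, J, 4)
    blended = np.einsum('nj,nja->na', weights, per_joint)                    # (N, 4)
    return blended[:, :3]
```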
Traditional methods for generating Synthetic Aperture Radar (SAR) images often face challenges such as complex modeling, large computational demands, and long processing times. When dealing with non-cooperative targets, it is impossible to obtain accurate models and omnidirectional SAR images of the target, which greatly hinders subsequent research. This paper optimizes the loss function of the original CWGAN-GP network and proposes a feature encoding method to describe the azimuth features of targets in SAR images. Experiments generate 360 simulated SAR images with an interval of 1°, expanding the SAR image dataset. Finally, the expanded image dataset is used for training the recognition network. The improvement in recognition accuracy validates the feasibility of generating simulated SAR image data.
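Since the method builds on CWGAN-GP, the sketch below shows the standard WGAN-GP gradient penalty with a conditional critic; the paper's optimized loss terms and azimuth feature encoding are not specified in the abstract and are not reproduced here. The critic's two-argument signature is an assumption.

```python
import torch

def gradient_penalty(critic, real, fake, labels, lam=10.0):
    """Standard WGAN-GP penalty on interpolates between real and generated SAR patches.

    critic(x, labels) is assumed to return one score per sample (conditional critic).
    real, fake: (N, C, H, W) image batches on the same device.
    """
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    x_hat = (eps * real + (1.0 - eps) * fake.detach()).requires_grad_(True)
    scores = critic(x_hat, labels)
    grads = torch.autograd.grad(outputs=scores.sum(), inputs=x_hat, create_graph=True)[0]
    grad_norm = grads.reshape(grads.size(0), -1).norm(2, dim=1)
    return lam * ((grad_norm - 1.0) ** 2).mean()
```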
This paper presents a method to simulate the dynamic handwriting process by constructing visually smooth stroke trajectories and using varied stroke width to highlight Chinese calligraphy characteristics in real time. Based on the input skeleton data, a piecewise cubic Bézier curve is first created on the fly to define the writing trajectory of each stroke. The continuity condition of the resulting curve is relaxed to G1 at joint points in exchange for the freedom to adjust control points so that the trajectory curve stays visually smooth. A practical and efficient stroke width model is also devised by analyzing handwriting data to enhance Chinese calligraphy features, with writing speed and acceleration taken into account. Finally, an image-based rendering technique is implemented to depict the resulting calligraphy in real time, which is applicable to various digital devices at a low computational cost. This approach has shown promising results in many experiments, with Chinese calligraphy characteristics demonstrated successfully.
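For reference, a minimal sketch of the two geometric ingredients mentioned above: evaluating a cubic Bézier segment, and a simple way to keep a joint G1 (parallel tangents, possibly different magnitudes) by projecting the next segment's first control point onto the incoming tangent direction. The projection rule is an illustrative assumption, not the paper's control-point adjustment.

```python
import numpy as np

def cubic_bezier(p0, p1, p2, p3, t):
    """Evaluate a cubic Bezier segment at parameter t in [0, 1] (t may be an array)."""
    t = np.asarray(t, dtype=float)[..., None]
    return ((1 - t) ** 3 * p0 + 3 * (1 - t) ** 2 * t * p1
            + 3 * (1 - t) * t ** 2 * p2 + t ** 3 * p3)

def enforce_g1(prev_p2, joint, next_p1):
    """Move the next segment's first control point onto the incoming tangent
    direction while preserving its distance from the joint (G1 continuity)."""
    tangent = joint - prev_p2
    n = np.linalg.norm(tangent)
    if n < 1e-12:
        return next_p1
    return joint + np.linalg.norm(next_p1 - joint) * (tangent / n)
```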
In dynamic scenes, SLAM systems must effectively identify and eliminate dynamic feature points to maintain localization accuracy and accurate map construction. Deep learning-based dynamic object detection, which provides richer semantic information, has shown clear advantages for this task. Traditional visual SLAM is robust in static scenes, but in challenging environments, improving the real-time performance, positioning accuracy, and adaptability of SLAM remains an open problem. In this paper, based on the ORB-SLAM3 framework, a YOLOv5 object detection and semantic segmentation module is fused to extract key point information of obvious and potentially dynamic objects. Then, combined with the optical flow method, a two-stage tracking strategy is proposed to track dynamic and static feature points. Depth key point information and geometric constraints are combined to predict potentially dynamic objects and eliminate dynamic feature points. A deep learning network is used to improve image feature extraction in dynamic environments, realizing stable feature point extraction and dynamic feature point elimination and thereby improving the robustness of the SLAM system. Finally, experimental evaluation on public datasets shows that the proposed method outperforms the ORB-SLAM3 algorithm in terms of accuracy and robustness in dynamic scenes.
The matching and fusion of visible and infrared images of power equipment are essential for real-time monitoring. Nevertheless, computing an accurate transform model between images for alignment remains a tough challenge due to notable changes in spectrum, resolution, and intensity between visible and infrared images. To address this issue, this paper proposes a hierarchical matching algorithm for visible and infrared images of power equipment based on multi-scale local normalized filtering. First, the input image is preprocessed by multi-scale local normalized filtering to enhance the image structures. Then, the contours of the filtered image are extracted by the Canny operator. A k-cosine curvature-based multi-scale corner detection algorithm is used to detect feature points on the contours, and the scale-invariant PIIFD is constructed for each feature point. Finally, a hierarchical matching strategy is proposed for feature matching, and a consistency check algorithm is applied to eliminate false matches. The experiment is carried out on four groups of visible and infrared images of power equipment. The results show that our method can achieve a large number of correct matches despite scale and viewpoint changes.
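As a small sketch of the k-cosine curvature measure used for corner detection (the classical Rosenfeld-Johnston-style formulation; the paper's multi-scale variant and thresholds are not given in the abstract):

```python
import numpy as np

def k_cosine(contour, k):
    """k-cosine curvature along a closed contour.

    contour: (N, 2) ordered contour points. For each point p_i, returns the cosine
    of the angle between (p_{i-k} - p_i) and (p_{i+k} - p_i); values near 1
    indicate sharp corners, values near -1 indicate nearly straight segments.
    """
    n = len(contour)
    idx = np.arange(n)
    a = contour[(idx - k) % n] - contour
    b = contour[(idx + k) % n] - contour
    num = (a * b).sum(axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-12
    return num / den
```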
Keypoint matching on multi-spectral images refers to identifying and establishing associations between corresponding objects or locations in two or more images of different spectral bands. Rigorous keypoint labels for real images are difficult to obtain from human annotators because of the loose definition of a keypoint. Supervised methods typically use the detection results of classical methods or pre-trained models as keypoint labels to train learning-based detectors. However, the non-linear appearance differences in multi-spectral images mean that the keypoint labels of a scene are not exactly the same across bands, which may limit the detector's ability to detect keypoints consistently. To relieve this issue, this paper presents a method to construct a fusion label for training keypoint detectors on multi-spectral images. The fusion label combines the keypoint information of the original labels of the different spectral bands. Specifically, we aggregate all points in the original labels of the multi-spectral images and remove the points that exist only in the labels of a few spectral bands. The fusion label is then composed of the preserved keypoints, which are more universal across the multi-spectral images. To train the detector, we use the same fusion label for all multi-spectral images of a scene, thereby fostering consistent keypoint detection across different spectral bands. Keypoint detection and matching experiments on multi-spectral images show that using the fusion label to train the detector increases the repeatability by 6.0% on average and also increases the matching score.
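A minimal sketch of the aggregate-then-filter idea described above, assuming a simple distance radius for deciding when per-band keypoints correspond and a minimum band count for keeping a point; the paper's exact merging rules may differ.

```python
import numpy as np

def build_fusion_label(band_keypoints, radius=3.0, min_bands=2):
    """Merge per-band keypoint labels into one fusion label.

    band_keypoints: list of (N_i, 2) arrays of (x, y) keypoints, one per spectral band
    radius:         points closer than this are treated as the same keypoint (assumed)
    min_bands:      keep a keypoint only if it appears in at least this many bands (assumed)
    """
    pts = np.concatenate(band_keypoints)
    band_ids = np.concatenate([np.full(len(p), b) for b, p in enumerate(band_keypoints)])

    used = np.zeros(len(pts), dtype=bool)
    fused = []
    for i in range(len(pts)):
        if used[i]:
            continue
        close = np.linalg.norm(pts - pts[i], axis=1) <= radius
        used |= close
        if len(np.unique(band_ids[close])) >= min_bands:
            fused.append(pts[close].mean(axis=0))   # keep the averaged location
    return np.array(fused)
```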
Monocular 3D object detection is a critical task in autonomous driving and robotic vision. Recently, center point-based monocular 3D object detection methods have achieved a balance between speed and accuracy. However, these methods face challenges in depth information acquisition and performance potential, often relying heavily on local information. To address these challenges, we propose a novel monocular 3D object detection model named MoPoSD (Monocular 3D Object Detector with Potential for Specialized Exploration in Depth). MoPoSD utilizes an attention mechanism tailored for object detection and incorporates an encoder-decoder architecture. Our model introduces an efficient encoder, TIE (Targeted Interaction Encoder), which updates features in an interlaced manner to achieve lightweight processing. Additionally, we propose a branch module for depth information estimation, the Depth Attention Fusion Module, which leverages multi-scale feature pyramid fusion and an attention mechanism to improve depth estimation accuracy and the model's robustness when handling objects at varying distances. Furthermore, our decoder introduces an advanced group-decision mechanism, MSMD (Multiple Sets of Matching Decoders), to enhance the supervision signal of the training targets. Extensive experiments demonstrate that MoPoSD achieves state-of-the-art performance on the KITTI benchmark using monocular images as input, without requiring additional dense depth annotations.
Currently, the registration of non-rigid 3D objects remains a challenging task. This paper proposes an unsupervised non-rigid point cloud registration network based on local correspondence relationships, namely the LCRNet. By learning the partial matching matrix to obtain more accurate point cloud correspondences and optimizing the matrix in combination with the Sinkhorn algorithm, we successfully eliminate the noise and outliers in the initial matching. In addition, through the adaptive learning of the transformer network and using complex geometric properties, we realize the deformation of non-rigid point clouds. The experiments on multiple datasets indicate that the LCRNet achieves accurate unsupervised non-rigid object registration, outperforming comparative methods.
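For reference, the Sinkhorn iteration used to turn a pairwise cost into a soft matching matrix is sketched below in its standard entropic form; LCRNet's partial matching variant, which additionally handles outliers, is more involved and is not reproduced here.

```python
import numpy as np

def sinkhorn(cost, n_iters=50, epsilon=0.1):
    """Sinkhorn normalization of a cost matrix into an (approximately) doubly stochastic
    soft correspondence matrix.

    cost: (N, M) pairwise matching costs between source and target points.
    """
    K = np.exp(-cost / epsilon)          # Gibbs kernel
    u = np.ones(K.shape[0])
    v = np.ones(K.shape[1])
    for _ in range(n_iters):             # alternate row/column scaling
        u = 1.0 / (K @ v + 1e-12)
        v = 1.0 / (K.T @ u + 1e-12)
    return np.diag(u) @ K @ np.diag(v)
```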
The manipulation of deformable objects for a manipulator poses significant challenges due to their high levels of uncertainty and complexity, making it difficult for traditional robotic arm control methods to achieve accurate and stable grasping. To address this issue, this paper proposes a Double Q-Map Proximal Policy Optimization (DQMPPO) method for manipulation of deformable objects. In DQMPPO, the manipulation policy is divided into two parts: a position evaluation policy and a pose grasping policy. The former is responsible for assessing the value of potential grasping locations on the deformable object to determine the optimal grasping point. The latter, on the other hand, generates appropriate robotic arm movements to achieve stable grasping and manipulation of the deformable object. The training methodology of DQMPPO is similar to that of the classical Proximal Policy Optimization (PPO). This paper builds a simulation environment based on Softgym and validates the proposed method. The experimental results show that compared with existing methods, the proposed method in this paper has higher accuracy and robustness.
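Since DQMPPO is trained like classical PPO, the clipped surrogate objective that PPO optimizes is sketched below; the double Q-map structure and the two-part policy are specific to the paper and are not shown.

```python
import numpy as np

def ppo_clip_loss(ratio, advantage, clip_eps=0.2):
    """Clipped surrogate objective of PPO (returned as a loss to minimize).

    ratio:     per-sample probability ratio pi_new(a|s) / pi_old(a|s)
    advantage: per-sample advantage estimates
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    return -np.mean(np.minimum(unclipped, clipped))
```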
The image rain removal task has drawn considerable attention because rain streaks can severely degrade image quality and affect the performance of outdoor vision tasks. Conventional model-driven or data-driven methods have achieved effective performance. However, when the image background is highly similar to the rain streaks, conventional methods cannot effectively distinguish background from rain streaks, resulting in excessive removal of background or ineffective rain streak removal. Inspired by the remarkable performance of diffusion models on image inverse problems, we convert the refinement of those results into an image inpainting task. Specifically, we separate the rain-degraded image into a rain-streak image and a background image using a conventional rain removal method. We then process the rain-streak image by thresholding and binarization, and mask the background with the processed image. Owing to the powerful generative ability of the diffusion model, our method can repair the excessively removed parts of the background while removing the residual rain streaks. Experiments demonstrate that our method achieves superior results in both quantitative evaluation metrics and visual quality compared to current methods.
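A minimal sketch of the thresholding-and-masking step described above is given below; the threshold value and the specific deraining method that produces the rain-streak layer are assumptions, and the diffusion-based inpainting itself is not shown.

```python
import numpy as np

def build_inpainting_mask(rain_layer, threshold=0.1):
    """Binarize the separated rain-streak layer into an inpainting mask (1 = regenerate).

    rain_layer: (H, W) rain component from a conventional deraining method, values in [0, 1].
    """
    return (rain_layer > threshold).astype(np.uint8)

def mask_background(background, mask):
    """Blank out masked pixels of the derained background so a diffusion model can repaint them.

    background: (H, W, 3) image in [0, 1]; mask: (H, W) from build_inpainting_mask.
    """
    return background * (1 - mask)[..., None]
```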
The accuracy of images is frequently compromised during the processes of pattern generation and transformation, presenting challenges for existing methodologies to achieve an optimal balance between efficiency and accuracy. To address this issue, a novel primitive fidelity method utilizing adaptive sampling is introduced. This method takes into account the constituent elements of the pattern, designating the primitive as the fundamental unit of vectorization. It initializes the path through adaptive sampling, dynamically regulates the number of segments, and employs a loss function for optimization, thereby enhancing both efficiency and accuracy. Notably, this method is model-free and does not require shape primitive labels, which allows the proposed approach to break through the limitations of specific application domains and escape the challenges of collecting and generalizing vector graphic datasets. The experimental results indicate that the proposed method demonstrates good performance in terms of generation quality and operational efficiency on both public datasets and constructed datasets.
Although Large Language Models (LLMs) have achieved significant success in various tasks, they often struggle with hallucination issues in scenarios requiring deep reasoning. Incorporating external knowledge into LLM reasoning can alleviate some of these issues. This paper proposes an efficient graph search method that integrates knowledge graphs. A knowledge graph is firstly constructed from the original document. This graph is then partitioned into multiple relevant communities using a community detection algorithm, and natural language summaries for each community are generated using an LLM. During the search phase, the LLM performs beam search on the knowledge graph, iteratively exploring multiple potential reasoning paths on the graph until it determines that it can answer the question based on the current reasoning path. Experiments demonstrate that our search method has lower computational costs and better generalizability, outperforming even large LLMs like GPT-4 in certain vertical domains.
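The search phase can be pictured with the beam-search skeleton below: at each hop, candidate reasoning paths are extended along graph edges, scored, and pruned to a fixed beam width until the LLM judges a path sufficient to answer the question. The `score_fn` and `answerable_fn` callables stand in for the LLM calls and are assumed interfaces; community construction and summarization are not shown.

```python
def beam_search_paths(graph, start_nodes, score_fn, answerable_fn,
                      beam_width=3, max_hops=4):
    """Beam search over a knowledge graph for reasoning paths.

    graph:         dict mapping node -> iterable of neighbor nodes
    start_nodes:   entry nodes linked to the question
    score_fn:      callable(path) -> float, e.g. LLM-based path relevance (assumed)
    answerable_fn: callable(path) -> bool, True when the path suffices to answer (assumed)
    """
    beams = [[n] for n in start_nodes]
    for _ in range(max_hops):
        candidates = []
        for path in beams:
            if answerable_fn(path):
                return path
            for nxt in graph.get(path[-1], []):
                if nxt not in path:               # avoid revisiting nodes
                    candidates.append(path + [nxt])
        if not candidates:
            break
        candidates.sort(key=score_fn, reverse=True)
        beams = candidates[:beam_width]           # keep the best paths only
    return max(beams, key=score_fn) if beams else []
```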
This paper presents a pose estimation algorithm based on an improved YOLOv8n, aimed at addressing the performance limitations of existing methods under resource-constrained conditions. Firstly, the NAM attention mechanism module is introduced, which combines channel and spatial attention to effectively capture multimodal information across different dimensions without increasing computational load. The C2f-GC module is designed to generate multi-scale features through grouped feature processing and pointwise convolution, optimizing parameter count and improving computational speed. The SPD layer is integrated to enhance the model's ability to handle small-sized features by compressing spatial information into the depth dimension. Finally, the improved backbone network is substituted for the original network in the YOLO-6D algorithm, with corresponding function adjustments made to achieve more accurate 6D pose estimation. Ablation experiments conducted on the Pascal VOC2012 dataset demonstrate that the parameter count of the improved model is reduced by 3% compared to the original model, while mAP0.5 and mAP0.5-0.95 are improved by 1.5% and 0.7%, respectively, validating the effectiveness of the optimizations. Comparative pose estimation experiments on the LINEMOD dataset show that the proposed algorithm achieves an average accuracy of 93.54% across 13 test objects, representing a 3.17 percentage point improvement over the original algorithm, thereby significantly enhancing performance.
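The SPD idea of compressing spatial information into the depth (channel) dimension can be sketched as a plain space-to-depth rearrangement, shown below; the surrounding convolutions and the rest of the C2f-GC design are not reproduced.

```python
import numpy as np

def space_to_depth(x, block=2):
    """Fold spatial blocks into channels (the core operation behind an SPD layer).

    x: (C, H, W) feature map with H and W divisible by `block`.
    Returns an array of shape (C * block * block, H // block, W // block).
    """
    c, h, w = x.shape
    x = x.reshape(c, h // block, block, w // block, block)
    x = x.transpose(0, 2, 4, 1, 3)                 # (C, block, block, H/b, W/b)
    return x.reshape(c * block * block, h // block, w // block)
```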
In image classification, an image often has multiple labels, but obtaining accurate label information is expensive. The main task of Partial Multi-label Learning (PML) is to learn under weakly supervised conditions, where only a subset of the provided labels is correct, and still make correct predictions. Unlike traditional multi-label learning, PML must train multi-label classification models in an imperfect environment and thus belongs to weakly supervised learning. To reduce the false guidance of noisy labels on the classifier and obtain correct predictions, this paper proposes a novel PML method via Dual Subspace Collaboration (PML-DSC). Specifically, it first uses samples and labels to learn two common subspaces. Fuzzy clustering is then used to learn label class centers and compute the confidence that each sample belongs to each class center, so that the first subspace can approximate the true label space and reduce the influence of noisy labels. The second subspace is used to learn latent semantic information and splits the classifier into two parts to improve the generalization ability of the model. Finally, the first subspace guides classifier learning, and the two subspaces work together to fully exploit the hidden information and improve the prediction accuracy of the model. Extensive experiments and analyses on both real-world and synthetic datasets demonstrate that the proposed PML-DSC is superior to state-of-the-art methods.
The onset and symptoms of Major Depressive Disorder (MDD) are associated with abnormalities in macroscopic brain structure. Revealing the relationship between macroscopic brain structure and microscopic transcriptional signatures could promote understanding of the pathogenesis of MDD. Here, case-control differences in regional Morphometric Inverse Divergence (MIND) were calculated to measure MDD-related morphometric alterations. The relationship between MIND differences and brain-wide gene expression was further examined. The expression of related genes was spatially correlated with MIND differences. Furthermore, genes correlated with MIND changes showed significant enrichment in inflammation- and metabolism-related processes. Dysregulation of astrocytes, microglia, and neuronal cells may be an important factor leading to MDD-related MIND alterations. Collectively, these findings link structural abnormalities with microscopic transcriptional signatures, promoting understanding of the pathogenesis of MDD.
Because liver tumors vary in size, shape, and location among individuals, surgical plans for liver tumor resection differ significantly. Traditional simulators use relatively uniform models in terms of size, position, and shape, which cannot replicate a patient's true anatomical structure. Through 3D reconstruction technology, a customized liver simulation scenario can be quickly established for each patient using a 3D surface model reconstructed from their medical imaging data, allowing different surgical plans to be simulated. Initially, a 3D surface model is reconstructed from the actual medical imaging data of each patient's liver. Since 3D reconstruction cannot recover the ligaments of the liver, virtual ligaments and virtual ligament envelopes are reconstructed on the liver surface according to its anatomical structure. Next, these components are assembled in a volume model generator, where boundary conditions are set on the virtual ligament envelopes to generate the volume model scenario. Finally, a simulator is used to perform surgical simulations on the patient's customized volume model scenario. Simulation results indicate that the customized virtual liver scenario can produce realistic deformation, collision, and cutting effects. The simulation frame rate meets real-time requirements, allowing surgical simulations based on the patient's pathological characteristics and thereby closely approximating real surgery.
In our pursuit of sustainable innovation for Intangible Cultural Heritage (ICH), we have adopted a methodology driven by the co-creation of humans and artificial intelligence (AI). Our inclusive community, consisting of diverse stakeholders such as folk inheritors, professional choreographers, cultural center staff, square dance enthusiasts, designers, and AI engineers, forms a robust foundation for innovative practices. The community's outputs, including a dataset, an experimental dance, a creative short film, digital 3D works, Non-Fungible Tokens (NFTs), and an app, are multi-modal and transformable across modalities. We underscore the communal utilization of resources across diverse practices and advocate for the transformation of outputs across various modalities. Importantly, each of these practices integrates AI technology into the workflow, positioning it as a pivotal enabler of sustainable innovation in the domain of ICH.
RGBT tracking aims to take full advantage of the complementary strengths of the RGB and Thermal Infrared (TIR) modalities to achieve robust tracking in complex scenes. However, current approaches face limitations when dealing with the quality-imbalance problem. In this paper, we introduce a novel augmentative fusion learning framework that aims to maximize the potential of existing fusion modules in modality quality-imbalanced scenarios. In particular, we design a modality stochastic degradation strategy to improve the robustness of the fusion module under modality quality imbalance. Meanwhile, to further enhance fusion performance with quality-imbalanced inputs, a self-supervised constraint is introduced to reconstruct the pre-degradation modality features by combining high-quality modality and degraded modality information. Finally, the effectiveness of the proposed method is verified by applying it to two state-of-the-art algorithms and evaluating on two standard RGBT datasets. The results indicate that our method achieves superior performance without adding any parameters or computational complexity.
RGBT tracking is an important research direction in the field of computer vision and has received increasing attention. It aims to exploit the complementary advantages between RGB and TIR modalities to achieve robust object tracking. However, existing studies usually focus on the fusion and interaction between modalities, ignoring the importance of learning modal representations. To address this issue, we propose a novel Dual-granularity Knowledge Distillation RGBT tracker named DKDTrack. In particular, the method introduces an adaptive distillation strategy to achieve representation enhancement between modalities by online measurement of modality strength and weakness relationships. In addition, we design a dual-granularity distillation module for jointly guiding the learning of weaker modalities at the feature level and the attention level. Extensive experiments on three publicly available RGBT datasets demonstrate the effectiveness of DKDTrack and highlight the importance of modal representation learning.
In recent years, entry-exit management technologies have gained attention due to the rise of smart cities and the proliferation of IoT devices. These technologies are used not only in indoor environments like offices and retail stores but also in outdoor settings such as public transportation. Particularly in public transport, there is growing demand for Origin Destination (OD) data collection to better understand passenger movement. In this study, we developed a system to estimate the number of passengers boarding and alighting from buses by installing cameras at the bus doors and capturing video footage of passengers. The system utilizes person detection and tracking algorithms to estimate passenger trajectories and applies a deep neural network (DNN) model to classify whether a detected individual is boarding or alighting. We introduced a DNN model for binary classification to determine whether individuals are inside or outside the bus, using bounding box and time information. By performing a grid search, we identified optimal model parameters. The system was tested using both YOLOR and the latest YOLOv10 object detectors, and the MPNTrack person tracking algorithm. The combination of YOLOv10 and the DNN model achieved an error rate of 11.7% for the entrance camera, while YOLOR and the DNN model achieved an error rate of 0.803% for the exit camera, significantly improving the accuracy over previous methods. This study demonstrates that integrating object detection, tracking, and DNN-based classification improves the accuracy of passenger counting systems, which is crucial for the effective management of public transportation systems.
As a potent greenhouse gas, methane (CH4) leakage not only intensifies global warming but also brings substantial safety risks. In recent years, with the rapid development of remote sensing technology, methane leakage detection using hyperspectral images has become a research hotspot. Among the available instruments, the Airborne Visible/Infrared Imaging Spectrometer - Next Generation (AVIRIS-NG) shows great potential for methane leakage detection thanks to its high spectral resolution and wide spectral coverage. However, due to the complexity of land-cover conditions and interference from confounders with spectral characteristics similar to methane, it is often difficult for existing analytical methods to obtain ideal detection results. Therefore, based on the DETR model, a novel end-to-end spectral absorption wavelength sensing transformer detection network is proposed to improve the accuracy and efficiency of methane gas leakage detection. The reference spectrum feature generator and query refiner modules together improve on the traditional transformer design. Building on the traditional linear filter, a new spectral linear filter is proposed that better whitens the background distribution and amplifies the methane signal by strategically selecting relevant pixels in the spectral domain. The method presented in this paper is of practical significance for improving the detection of methane gas leakage.
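The "traditional linear filter" that the proposed spectral linear filter builds on can be illustrated by the classical matched filter used for gas enhancement detection, sketched below; the paper's contribution (selecting which pixels contribute to the background statistics) is not shown.

```python
import numpy as np

def matched_filter(pixels, target_spectrum):
    """Classical linear matched filter for gas (e.g. CH4) enhancement detection.

    pixels:          (N, B) radiance spectra of the scene pixels
    target_spectrum: (B,) unit methane absorption signature
    Returns per-pixel filter scores; larger values indicate stronger methane signal.
    """
    mu = pixels.mean(axis=0)
    cov = np.cov(pixels, rowvar=False) + 1e-6 * np.eye(pixels.shape[1])  # regularized background covariance
    cov_inv = np.linalg.inv(cov)
    t = target_spectrum
    return ((pixels - mu) @ cov_inv @ t) / (t @ cov_inv @ t)
```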
Detecting predefined key points on the satellite body in monocular images is a key method for satellite pose estimation. However, existing methods typically design specialized predefined key points for a single satellite, which limits the universality of such vision-based methods. Moreover, under remote observation conditions, the surface details of the satellite are unclear, and such elaborately preset key points are not easily detected. In this study, a neural network capable of detecting a 3D bounding box with relative depth information is proposed as a means of adapting to different satellite targets using a more general and geometrically simple definition of key points. Furthermore, the network can estimate satellite rotation accurately from these predictions. The generalization ability of the proposed CNN-based model was enhanced by training and testing on a custom-built synthetic image dataset composed of 15 distinct satellite models captured with a virtual camera. The testing results demonstrate that the proposed pose estimation approach achieves accurate performance for non-cooperative targets with distinct shapes in the absence of predefined key point designs. Moreover, the observed pose estimation accuracy is consistent with that of existing advanced models trained on only a single satellite target.
Text-to-image methods have shown impressive results, allowing casual users to generate high-quality images from textual descriptions. Mainstream text-to-image methods, based on Generative Adversarial Networks and diffusion models, show strong results in terms of fidelity and novelty but lack customization and multi-attribute fusion. This paper presents a text-to-image customization method that generates customized images by fusing multiple attributes based on textual descriptions. Users guide image generation with natural language without changing the properties of the attributes, while the generated images vary drastically in appearance, pose, and context. We use the cross-attention mechanism to effectively capture the hierarchical relationships between text and images, guide image generation through a diffusion model during multi-attribute fusion, and present the resulting customized images with multiple fused attributes. Experiments show that the generated images exhibit good visual fidelity, novelty, and text alignment, and meet users' customization needs.
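The cross-attention mechanism referred to above is the standard scaled dot-product attention between image queries and text keys/values, sketched below; how the method arranges these layers for multi-attribute fusion is not specified in the abstract.

```python
import numpy as np

def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention: Attention(Q, K, V) = softmax(Q K^T / sqrt(d)) V.

    queries: (Nq, d)  image-feature tokens
    keys:    (Nk, d)  text-embedding tokens
    values:  (Nk, dv) text-embedding values
    """
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ values
```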
Density-based clustering algorithms, such as density peaks clustering (DPC), can identify clusters of arbitrary shape, automatically detect and exclude abnormal points, and accurately determine the number of clusters. Nevertheless, the sample allocation process is susceptible to incidental errors, and the density peaks clustering approach is ineffective at grouping data with varying densities. This research presents a density peaks clustering method that combines the ideas of inverse nearest neighbors and k-nearest neighbors. The algorithm categorizes the samples into non-boundary and boundary points by analyzing the characteristics of the inverse nearest neighbors, incorporates the concepts of k-nearest neighbors and inverse nearest neighbors to calculate the local density of the samples and identify density peaks, and devises a cluster weight formula to determine the optimal weights of the samples and complete the final clustering. Finally, the method is assessed by comparing it with other standard methods on both synthetic datasets with complex structures and real datasets. The results demonstrate the efficacy of our approach in effectively mitigating the "domino effect" and accurately selecting density maxima in sparsely populated regions.
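For illustration, a common k-nearest-neighbor local density used by KNN-style variants of DPC is sketched below; the paper's inverse-nearest-neighbor weighting and cluster weight formula are not reproduced, and the exponential form here is an assumption.

```python
import numpy as np

def knn_local_density(X, k=5):
    """k-nearest-neighbor local density, as used in KNN variants of DPC.

    X: (N, D) samples. rho_i = exp(-mean squared distance to the k nearest
    neighbors of sample i); denser neighborhoods give larger rho.
    """
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)                 # ignore self-distance
    knn_d = np.sort(dists, axis=1)[:, :k]
    return np.exp(-np.mean(knn_d ** 2, axis=1))
```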