Existing methods for detecting abnormal human behavior have large parameter counts and cannot extract spatial-temporal features effectively. They also suffer from two kinds of imbalance: between positive and negative samples, and between hard and easy samples. To address these problems, this paper improves SlowFast by introducing an attention mechanism and changing the loss function. First, grayscale video frame clips are used as input to the fast pathway, which effectively reduces GFLOPs. Second, the original Non-local modules are replaced with ANN modules, which improves the capture of spatial-temporal features while reducing the parameter count. Finally, Focal Loss is used to classify the fused feature map, addressing both the imbalance between positive and negative samples and the difficulty of classifying hard versus easy samples. The effectiveness and superiority of the method are verified on the AVA dataset and on videos of real scenes.
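The abstract does not give the Focal Loss hyperparameters or say how the loss is applied across AVA's action classes. The sketch below is the standard binary Focal Loss of Lin et al., FL(p_t) = -α_t (1 - p_t)^γ log(p_t), using the commonly cited defaults α = 0.25 and γ = 2 as assumptions. Applying it independently to each class's sigmoid output is one plausible multi-label setup, not necessarily the authors' choice.

```python
import math

def cross_entropy(p, y, eps=1e-7):
    """Binary cross-entropy for one prediction, used here as a baseline."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    p_t = p if y == 1 else 1.0 - p   # probability assigned to the true class
    return -math.log(p_t)

def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-7):
    """Binary focal loss for one prediction.

    p: predicted probability of the positive class (after a sigmoid).
    y: ground-truth label, 1 for positive, 0 for negative.
    alpha and gamma are assumed defaults, not values from the paper.
    """
    p = min(max(p, eps), 1.0 - eps)              # clamp to avoid log(0)
    p_t = p if y == 1 else 1.0 - p               # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha   # positive/negative balancing
    # (1 - p_t)^gamma shrinks the loss of easy (well-classified) samples
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

easy_negative = focal_loss(0.05, 0)  # confident, correct background prediction
hard_positive = focal_loss(0.10, 1)  # missed positive
```

For the easy negative the factor α_t (1 - p_t)^γ = 0.75 × 0.0025 cuts the loss by more than two orders of magnitude relative to cross-entropy, while the hard positive keeps 0.25 × 0.81 of its cross-entropy. Training is therefore dominated by hard samples rather than the many easy negatives.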
Pedestrian detection in fisheye scenes is a challenging task. Because fisheye images exhibit inherent cylindrical and radial distortion, conventional methods struggle to be both accurate and real-time. In this paper, the original feature pyramid network in YOLOv5 is replaced with an improved BiFPN, and information loss during feature map construction is reduced by adding residual links and stacking modules. Second, introducing a deformable convolution module as a replacement significantly improves the network's feature extraction ability. Extensive experiments on the WEPDTOF dataset show that the method is effective and outperforms various advanced approaches.
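The abstract does not detail the improved BiFPN. For reference, the sketch below shows the weighted "fast normalized fusion" used at each node of the standard BiFPN (EfficientDet): O = Σ_i w_i · I_i / (ε + Σ_j w_j), with each learnable w_i kept non-negative by a ReLU. Feature maps are flattened to plain lists and assumed to be already resized to a common resolution. The paper's added residual links and module stacking are not specified, so they are not reproduced here.

```python
def fast_normalized_fusion(features, weights, eps=1e-4):
    """Fuse same-shaped feature maps with learnable non-negative weights.

    features: list of feature maps, each a flat list of floats of equal length
              (assumed already resized to a common resolution).
    weights:  one raw learnable scalar per input feature map.
    """
    if len(features) != len(weights):
        raise ValueError("need one weight per input feature map")
    w = [max(0.0, x) for x in weights]  # ReLU keeps each weight non-negative
    norm = eps + sum(w)                 # eps avoids division by zero
    out = [0.0] * len(features[0])
    for wi, feat in zip(w, features):
        for k, v in enumerate(feat):
            out[k] += wi / norm * v     # normalized weighted sum, element-wise
    return out

fused = fast_normalized_fusion([[1.0, 2.0], [3.0, 4.0]], [1.0, 1.0])
```

With equal weights the node returns approximately the element-wise mean ([2.0, 3.0]). A negative raw weight is clipped to zero, which switches that input off. Normalizing by the weight sum instead of using a softmax is the design choice EfficientDet made to keep fusion cheap.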
As remote sensing imagery rapidly becomes more abundant, detecting tiny, densely distributed targets has become a challenge. This paper therefore proposes a multi-scale rotated object detection model based on an improved YOLOv5. First, an additional prediction feature layer is added to the network, which substantially increases detection precision for tiny targets. Second, a compressed loss function for computing the IoU loss of rotated boxes is proposed, allowing the loss between rotated anchors to be fitted with high precision. Finally, a hybrid prior bounding box strategy is applied in the feature prediction layers to suit targets of different sizes. Experiments on the DOTA dataset indicate that the method significantly outperforms the original YOLOv5 and performs strongly on object detection in optical remote sensing images.
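The abstract does not give the form of the compressed rotated-IoU loss. The sketch below only computes the quantity such a loss is built on: the IoU of two rotated boxes. Each box is assumed to be (cx, cy, w, h, theta) with theta in radians. The intersection is found by clipping one rectangle against the other with Sutherland-Hodgman, and its area comes from the shoelace formula. The plain 1 - IoU loss at the end is a baseline, not the paper's loss.

```python
import math

def box_corners(cx, cy, w, h, theta):
    """Corners of a rotated rectangle in counter-clockwise order (theta in radians)."""
    c, s = math.cos(theta), math.sin(theta)
    offsets = ((-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2))
    return [(cx + dx * c - dy * s, cy + dx * s + dy * c) for dx, dy in offsets]

def polygon_area(pts):
    """Shoelace formula; positive for counter-clockwise polygons."""
    area = 0.0
    for i in range(len(pts)):
        x1, y1 = pts[i]
        x2, y2 = pts[(i + 1) % len(pts)]
        area += x1 * y2 - x2 * y1
    return area / 2.0

def clip(subject, clipper):
    """Sutherland-Hodgman: clip a convex polygon by a convex counter-clockwise polygon."""
    def inside(p, a, b):
        # p lies on the left of (or on) the directed edge a -> b
        return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0]) >= 0

    def intersect(p, q, a, b):
        # intersection of line p-q with line a-b (never parallel when called)
        (x1, y1), (x2, y2), (x3, y3), (x4, y4) = p, q, a, b
        den = (x1 - x2) * (y3 - y4) - (y1 - y2) * (x3 - x4)
        t = ((x1 - x3) * (y3 - y4) - (y1 - y3) * (x3 - x4)) / den
        return (x1 + t * (x2 - x1), y1 + t * (y2 - y1))

    output = subject
    for i in range(len(clipper)):
        a, b = clipper[i], clipper[(i + 1) % len(clipper)]
        inp, output = output, []
        if not inp:
            break  # polygons do not overlap
        prev = inp[-1]
        for cur in inp:
            if inside(cur, a, b):
                if not inside(prev, a, b):
                    output.append(intersect(prev, cur, a, b))
                output.append(cur)
            elif inside(prev, a, b):
                output.append(intersect(prev, cur, a, b))
            prev = cur
    return output

def rotated_iou(box1, box2):
    """IoU of two rotated boxes given as (cx, cy, w, h, theta)."""
    inter_poly = clip(box_corners(*box1), box_corners(*box2))
    inter = abs(polygon_area(inter_poly)) if len(inter_poly) >= 3 else 0.0
    union = box1[2] * box1[3] + box2[2] * box2[3] - inter
    return inter / union if union > 0 else 0.0

def rotated_iou_loss(pred, target):
    """Baseline 1 - IoU loss; NOT the paper's compressed loss, whose form is not given."""
    return 1.0 - rotated_iou(pred, target)
```

For two 2 × 2 squares with the same center, one rotated by 45°, the overlap is a regular octagon and the IoU is 1/√2 ≈ 0.707. An axis-aligned IoU would report 1.0 for the enclosing boxes, which is why rotated detectors need a rotation-aware IoU.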
Road extraction from remote sensing images is a fundamental task. Although CNN-based methods have made great progress, the limited receptive field of CNNs makes further breakthroughs in performance difficult. Transformers, by contrast, are better at building a global receptive field. This paper proposes a novel network called TransLinkNet, which combines CNNs and Transformers to obtain robust feature representations. Specifically, a stack of Transformer blocks is interspersed between LinkNet layers: convolution operations are good at capturing local features, while the attention mechanism in the Transformer builds the global receptive field. Experiments show that the model achieves competitive performance on the Massachusetts dataset.
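To make the "global receptive field" claim concrete, the sketch below flattens a CNN feature map into tokens and applies single-head scaled dot-product self-attention. It uses identity query/key/value projections as a simplifying assumption. The abstract does not specify the heads, projections, or depth of TransLinkNet's Transformer blocks, so this is a generic illustration, not the paper's block. Every output token is a softmax-weighted mix of all input tokens, so one layer already connects every spatial position.

```python
import math

def feature_map_to_tokens(fmap):
    """Flatten an H x W x C feature map (nested lists) into H*W tokens of length C."""
    return [pixel for row in fmap for pixel in row]

def softmax(xs):
    m = max(xs)                           # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Single-head scaled dot-product self-attention with identity projections.

    Each output token attends to all N input tokens, i.e. a global receptive
    field, unlike a k x k convolution that only sees a local window.
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)  # one weight per token, summing to 1
        out.append([sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)])
    return out

fmap = [[[1.0, 0.0], [0.0, 1.0]],
        [[1.0, 1.0], [0.5, 0.5]]]        # 2 x 2 spatial, 2 channels
attended = self_attention(feature_map_to_tokens(fmap))
```

Changing the bottom-right pixel of `fmap` changes the output at the top-left position. A single 3 × 3 convolution would also see that pixel here, but on a realistically sized map only attention links positions that far apart in one layer.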
Object detection is a hot research topic, and with improvements in remote sensing technology, remote sensing target detection has become increasingly important. Aircraft in remote sensing images are typically small and densely distributed. Common small object detection methods are built on CNNs, but their convolutional structure loses information in small object detection tasks. The non-convolutional structure of the Transformer can enlarge the receptive field while making full use of image feature information. However, because attention is computed between tokens, the information inside each token is not modeled. We therefore improve the encoder and construct a scale fusion module to solve this problem. Experiments show that the proposed network performs well on the DIOR dataset.
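The abstract does not describe the scale fusion module, so the sketch below illustrates only the limitation it targets. In a ViT-style encoder the image is cut into non-overlapping patches, and each patch is flattened into one token. Attention then relates whole tokens to each other and never relates pixels within the same patch. The single-channel input and the patch size are illustrative assumptions.

```python
def patchify(image, patch):
    """Split an H x W single-channel image (list of rows) into non-overlapping
    patch x patch tokens, each flattened to a vector of length patch * patch."""
    h, w = len(image), len(image[0])
    if h % patch or w % patch:
        raise ValueError("image size must be divisible by the patch size")
    tokens = []
    for r in range(0, h, patch):
        for c in range(0, w, patch):
            # all pixels of this patch collapse into a single token vector
            tokens.append([image[r + i][c + j] for i in range(patch) for j in range(patch)])
    return tokens

image = [[4 * r + c for c in range(4)] for r in range(4)]  # pixel values 0..15
tokens = patchify(image, 2)  # 4 tokens of length 4
```

A 2 × 2 object aligned with a patch would occupy exactly one token here. Token-to-token attention treats that object as a single vector, so relationships among its own pixels are only captured by whatever embedding precedes the attention layer.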