In Hyperspectral Image (HSI) classification, each pixel is assigned to a land-cover category. Recently, HSI classification methods based on Convolutional Neural Networks (CNNs) have achieved substantial performance gains owing to their strong feature representation capabilities. However, these methods have a limited ability to capture deep semantic features, and their computational cost grows rapidly with network depth. The Vision Transformer (ViT), leveraging its self-attention mechanism, offers competitive classification performance relative to CNNs, and Transformer-based methods model global spectral features robustly. However, they struggle to extract local spatial features effectively. In this paper, we propose a novel Transformer-based method with efficient self-attention for HSI classification that fully aggregates both local and global spatial-spectral features. The proposed method combines spectral and spatial convolution operations with attention mechanisms to enhance structural and shape information. First, 3-D convolution with adaptive pooling and 3-D convolution with residual connections are employed to extract fused spatial-spectral features. Then, an interactive self-attention module is applied across the height, width, and spectral dimensions, achieving a deep fusion of spatial-spectral features. Experiments on three standard datasets confirm that the proposed Hybrid Spectral-Spatial ResNet and Transformer (HSSRT) outperforms existing methods and delivers state-of-the-art classification performance.
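The pipeline sketched in the abstract can be illustrated with a minimal PyTorch example. All module names (`SpectralSpatialResBlock`, `AxisSelfAttention`, `HSSRTSketch`, `attend_along`), channel widths, pooled patch sizes, and the sequential axis-by-axis attention ordering are illustrative assumptions, not the authors' HSSRT implementation; the sketch only mirrors the described stages of a 3-D convolutional stem with adaptive pooling, a residual 3-D convolution block, and self-attention applied along the spectral, height, and width dimensions.

```python
# Illustrative sketch of the described pipeline; layer sizes, names, and the
# attention formulation are assumptions, not the authors' HSSRT code.
import torch
import torch.nn as nn


class SpectralSpatialResBlock(nn.Module):
    """3-D convolution with a residual connection over a (band, H, W) cube."""

    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm3d(channels),
            nn.ReLU(inplace=True),
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm3d(channels),
        )

    def forward(self, x):
        return torch.relu(x + self.conv(x))  # residual connection


class AxisSelfAttention(nn.Module):
    """Multi-head self-attention over a token sequence taken along one axis."""

    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                 # x: (batch, seq_len, dim)
        out, _ = self.attn(x, x, x)
        return x + out


def attend_along(x, attn, axis):
    """Apply `attn` along one axis of a (B, C, D, H, W) feature cube.

    `axis` is 2 (spectral), 3 (height), or 4 (width). Channels serve as the
    token embedding; the remaining two axes fold into the batch.
    """
    x = x.movedim(1, 4)                   # (B, D, H, W, C)
    x = x.movedim(axis - 1, 3)            # chosen axis just before channels
    lead, seq, dim = x.shape[:3], x.shape[3], x.shape[4]
    out = attn(x.reshape(-1, seq, dim)).reshape(*lead, seq, dim)
    out = out.movedim(3, axis - 1)
    return out.movedim(4, 1)              # back to (B, C, D, H, W)


class HSSRTSketch(nn.Module):
    """Toy model: conv stem + adaptive pooling, residual 3-D conv block,
    then self-attention along spectral, height, and width axes."""

    def __init__(self, num_classes=16, dim=16):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv3d(1, dim, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.pool = nn.AdaptiveAvgPool3d((16, 9, 9))  # adaptive pooling step
        self.res = SpectralSpatialResBlock(dim)
        self.attn = nn.ModuleDict({
            "spectral": AxisSelfAttention(dim),
            "height": AxisSelfAttention(dim),
            "width": AxisSelfAttention(dim),
        })
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                 # x: (B, 1, bands, H, W) patch cube
        x = self.res(self.pool(self.stem(x)))
        x = attend_along(x, self.attn["spectral"], axis=2)
        x = attend_along(x, self.attn["height"], axis=3)
        x = attend_along(x, self.attn["width"], axis=4)
        return self.head(x.mean(dim=(2, 3, 4)))  # global average pool


if __name__ == "__main__":
    model = HSSRTSketch(num_classes=16)
    patch = torch.randn(2, 1, 30, 11, 11)  # two pixel-centred patches, 30 bands
    print(model(patch).shape)              # torch.Size([2, 16])
```

Folding the non-attended axes into the batch keeps each attention call cheap (sequence length equals one axis, not the full cube), which is one plausible reading of the "efficient self-attention" the abstract claims.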