Open Access
Tackling the over-smoothing problem of CNN-based hyperspectral image classification
26 November 2022
Shi He, Huazhu Xue, Jiehai Cheng, Lei Wang, Yaping Wang, Yongjuan Zhang
Abstract

Convolutional neural networks (CNNs) are very important deep neural networks for analyzing visual imagery. However, most CNN-based methods suffer from over-smoothing at boundaries, which is unfavorable for hyperspectral image classification. To address this problem, a spectral-spatial multiscale residual network (SSMRN) that fuses two separate deep spectral features and deep spatial features is proposed to significantly reduce over-smoothing and effectively learn the features of objects. In the implementation of the SSMRN, a multiscale residual convolutional neural network is proposed as the spatial feature extractor and a band grouping-based bi-directional gated recurrent unit is utilized as the spectral feature extractor. Considering that the importance of spectral and spatial features may vary with the spatial resolution of the images, we combine both features with two weighting factors whose initial values differ and which can be adaptively adjusted during network training. To evaluate the effectiveness of the SSMRN, extensive experiments are conducted on public benchmark data sets. The proposed method retains the detailed boundaries of different objects and yields competitive results compared with several state-of-the-art methods.

1.

Introduction

With the rapid development of remote sensing imaging spectroscopy, hyperspectral images (HSIs) have become increasingly important in Earth observation due to their rich spectral and spatial information. Classification is an important technique for HSI data exploitation. HSI classification (HSIC) is the task of assigning a proper land-cover label to each pixel,1 which is challenging because of the high dimensionality, spectral heterogeneity, and complex spatial distribution of the objects.2

To alleviate these problems, traditional HSIC methods involve two steps: (1) feature selection and extraction,3 which relies on feature engineering skills and domain expertise to design human-engineered features; and (2) classifier training, where a classifier is an algorithm that automatically orders or categorizes data into one or more of a set of classes. However, traditional HSIC approaches use handcrafted features to train the classifier, and these features may be unreliable for real data. It is therefore difficult to balance robustness and discriminability, as the set of optimal features varies considerably between different data sets.4

Deep neural networks (DNNs) can automatically learn features from data in a hierarchical manner, constructing a model with growing semantic layers until a suitable representation is achieved.5 To overcome the high intraclass variability and high interclass similarity in HSI, stacked autoencoders6–8 and deep belief networks9,10 have been introduced as accurate unsupervised methods to extract layerwise trained deep features. However, their standard fully connected (FC) architecture imposes a feature flattening process before classification, leading to the loss of spatial-contextual information.11 In contrast, convolutional neural networks (CNNs) can automatically extract spectral-spatial features from the raw input data. Recurrent neural networks (RNNs) process the spectral information of HSI data as a time sequence, treating the spectral bands as time steps. There are three basic RNN models: (1) the vanilla RNN, (2) long short-term memory (LSTM), and (3) the gated recurrent unit (GRU). Consequently, a large number of CNN- and RNN-based methods have been proposed for end-to-end modeling; they can handle HSI data in the spectral and spatial domains individually or in a coupled fashion.12

For instance, Yang et al.13 designed a CNN model with a two-branch architecture to learn spectral and spatial features jointly. Zhong et al.14 proposed an end-to-end three-dimensional (3D) residual CNN architecture for spectral-spatial feature learning and classification. Motivated by the attention mechanism of the human visual system, a residual spectral-spatial attention network (RSSAN)15 was proposed for HSI classification. To reduce computation, fully convolutional networks were proposed for HSIC.16 To correctly discover the contextual relations among pixels, the graph convolutional network, originally designed for arbitrarily structured non-Euclidean data, was adopted for HSIC.17 The morphological operations, i.e., erosion and dilation, are powerful nonlinear feature transformations; inspired by these, an end-to-end morphological CNN (MorphCNN)11 was introduced for HSIC by concatenating the outputs of spectral and spatial morphological blocks extracted in a dual-path fashion. To better represent high-level semantic features, a spectral-spatial feature tokenization transformer (SSFTT) method18 was proposed to capture spectral-spatial features and high-level semantic features. Exploiting the sequential property of HSI for determining class labels, an RNN-based HSIC framework with a novel activation function (parametric rectified tanh) and GRU was proposed.19 The work20 proposed a spectral-spatial LSTM-based network that learns spectral and spatial features of HSI with two separate LSTMs, each followed by a Softmax layer for classification, while a decision fusion strategy produces the joint spectral-spatial classification results. Several works have also combined CNN and RNN architectures for HSIC. The spatial-spectral unified network (SSUN)2 combined a band grouping-based LSTM model in the spectral dimension with a 2D CNN for spatial features and integrated spectral feature extraction (FE), spatial FE, and classifier training into a unified neural network. In the spectral-spatial attention network (SSAN),21 an RNN with attention learns inner spectral correlations within a continuous spectrum, while a CNN with attention focuses on salient features and the spatial relevance between neighboring pixels in the spatial dimension. The work22 integrated a CNN with a bidirectional convolutional LSTM (CLSTM), in which a 3D CNN captures low-level spectral-spatial features and the CLSTM recurrently analyzes this low-level spectral-spatial information.

CNNs are commonly applied to analyze visual imagery.23 Most of the above methods are based on the CNN backbone and its variants. However, most CNN-based methods suffer from over-smoothing at boundaries, which is unfavorable for HSIC. DNNs are also prone to overfitting24 and are sensitive to perturbations,25 and deep learning methods usually require a large number of training samples.26,27 To significantly reduce the over-smoothing effect and effectively learn the features of objects, a multi-task learning spectral-spatial multiscale residual network (SSMRN) is proposed for end-to-end HSIC. The contributions can be summarized as follows:

  • 1. An end-to-end SSMRN is designed by fusing two separate deep spectral features and deep spatial features to extract spectral-spatial features for HSIC. The model yields competitive results under different training sample conditions.

  • 2. The proposed framework takes the weight between spectral and spatial features into consideration, which increases the influence of the current pixel and reduces over-smoothing. Meanwhile, the multi-task learning technology is integrated into the framework, improving the stability of results.

The rest of the sections are organized as follows. First, Sec. 2 introduces the preliminary knowledge of CNN, residual networks, and RNN. The proposed architecture along with the design methodology is introduced in Sec. 3. Next, experimental data sets and results are given in Sec. 4. Then, the impact of the SSMRN architecture on classification results is analyzed in Sec. 5. Finally, Sec. 6 concludes the paper with a summary of the proposed method and the scope of future work.

2.

Preliminary

In this section, we mainly recall the background information on CNN, residual networks, and RNN.

2.1.

Convolutional Neural Network

A CNN28 is a class of DNNs most commonly applied to analyzing visual imagery. Three main types of layers are used to build CNN architectures: the convolutional layer, the pooling layer, and the FC layer. Compared with multilayer perceptron neural networks, CNNs are easier to train because of their parameter sharing scheme and local connectivity.
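As a concrete illustration of these three layer types, the following minimal Keras sketch stacks one of each; the filter count, kernel size, and class count are purely illustrative and are not taken from the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Minimal CNN built from the three basic layer types (illustrative sizes only).
model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(24, 24, 4)),  # convolutional layer
    layers.MaxPooling2D((2, 2)),                                            # pooling layer
    layers.Flatten(),
    layers.Dense(16, activation="softmax"),                                 # FC layer acting as classifier
])
model.summary()
```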

While CNN-based methods have achieved large improvements in HSIC, they usually suffer from severe over-smoothing at edge boundaries. There are two major reasons: (1) the scales of the supervised information and the spatial features do not match, since the supervised information of HSIC is pixel-level while the spatial features are extracted from the neighborhood of the current pixel; and (2) the parameter sharing scheme means the spatial features are extracted for the patch rather than for the current pixel. Both factors lead to an insufficient influence of the current pixel on the classification. Attention mechanisms can counteract the effects of parameter sharing15,21 but increase the amount of computation. A smaller patch size also decreases the possibility of over-smoothing2 but results in insufficient extraction of spatial information and lower classification accuracy (CA).29 Another approach is to utilize superpixel segmentation,17 but the segmentation algorithm then affects the classification results.

2.2.

Residual Networks

A residual network is an effective extension of CNNs that has been empirically shown to increase performance in ImageNet classification. It does this by utilizing skip connections to jump over some layers. As shown in Fig. 1, the typical residual block is implemented with a double-layer skip that contains nonlinearities. The skip connections add the outputs of previous layers to the outputs of the stacked layers. One motivation for skipping over layers is to avoid the vanishing gradient problem by reusing activations from a previous layer until the adjacent layer learns its weights.30 Skipping effectively simplifies the network, using fewer layers in the initial training stages. The residual block is easy to understand and optimize, can be stacked to any depth, and can be embedded in any existing CNN.

Fig. 1

Architecture of a typical residual block.

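A minimal sketch of the double-layer residual block in Fig. 1, written with the Keras functional API; the filter count, kernel size, and the 1 × 1 projection for mismatched channels are our own illustrative choices rather than the paper's exact configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters=64, kernel_size=3):
    """Two stacked convolutions with nonlinearities plus a skip connection
    that adds the block input to the stacked-layer output (cf. Fig. 1)."""
    shortcut = x
    y = layers.Conv2D(filters, kernel_size, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, kernel_size, padding="same")(y)
    if shortcut.shape[-1] != filters:
        # project the shortcut so the addition is dimensionally consistent
        shortcut = layers.Conv2D(filters, 1, padding="same")(shortcut)
    y = layers.Add()([y, shortcut])          # the skip connection
    return layers.Activation("relu")(y)
```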

2.3.

Recurrent Neural Network

RNNs can operate over sequences of inputs, outputs, or both at the same time, which makes them applicable to challenging tasks involving sequential data, such as speech recognition and language modeling. LSTM and GRU31 were introduced to learn long-term dependencies and alleviate the vanishing/exploding gradient problem. These two architectures have no fundamental differences from each other, but they use different functions to compute the hidden state. LSTM is strictly stronger than GRU, as it can easily perform unbounded counting, whereas GRU has fewer parameters and has been shown to exhibit better performance on certain smaller and less frequent data sets. Bi-directional RNNs (Bi-RNNs) use a finite sequence to predict or label each element of the sequence based on the element's past and future contexts, as shown in Fig. 2. A Bi-RNN concatenates the outputs of two RNNs, one processing the sequence from left to right and the other from right to left.

Fig. 2

Architecture of Bi-RNN.


Hyperspectral data usually have hundreds of bands, so pixel classification in HSI can be treated as a many-to-one task: given the sequence of bands of a pixel, we predict the class of that pixel. A natural idea is to consider each band as a time step. However, a very long RNN input sequence can lead to overfitting and consumes substantial computing and storage resources. In addition, the large number of spectral channels and limited training samples restrict the performance of HSIC.26

3.

Proposed Framework

The deep networks used for HSIC can be divided into spectral-feature networks, spatial-feature networks, and spectral-spatial-feature networks. To effectively learn the features of objects, we utilize spectral-spatial-feature networks to extract joint deep spectral-spatial features for HSIC. The joint deep spectral-spatial features are mainly obtained in the following three ways:32 (1) mapping low-level spectral-spatial features to high-level spectral-spatial features via deep networks; (2) directly extracting deep features from the original data or from several principal components of the original data; and (3) fusing two separate deep spectral features and deep spatial features. Considering that the importance of spectral and spatial features may vary with the spatial resolution of the images, we adopt the third way, fusing two separate deep features, so that the influence of the different features on the classification results can be adjusted conveniently.

Three components play crucial roles in our methodology: a multiscale residual CNN (MRCNN)-based spatial feature learner, a bi-directional GRU (Bi-GRU)-based spectral feature learner, and a multi-task learning model that combines both features with two weighting factors.

3.1.

Multiscale Residual CNN for Spatial Classification

The proposed MRCNN architecture is shown in Fig. 3. Let $X \in \mathbb{R}^{r \times c \times b}$ be the original HSI data, where $r$, $c$, and $b$ are the row number, column number, and band number, respectively. First, to suppress noise and reduce the computational cost, principal component analysis is applied to the original HSI data, and only the first $p$ principal components are retained. Denote the dimension-reduced data by $X_p \in \mathbb{R}^{r \times c \times p}$. Around each pixel, a neighborhood region of size $k \times k \times p$ is extracted as the input of the spatial branch.
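The dimension reduction and patch extraction described above could be sketched as follows; the use of scikit-learn for PCA, the reflect padding, and the function name are our own assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.decomposition import PCA

def reduce_and_extract_patch(X, row, col, p=4, k=24):
    """Keep the first p principal components of the HSI cube X (r x c x b)
    and return the k x k x p neighborhood around pixel (row, col) as the
    input of the spatial branch. Border handling is simplified."""
    r, c, b = X.shape
    Xp = PCA(n_components=p).fit_transform(X.reshape(-1, b)).reshape(r, c, p)
    half = k // 2
    Xpad = np.pad(Xp, ((half, half), (half, half), (0, 0)), mode="reflect")
    return Xpad[row:row + k, col:col + k, :]
```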

Fig. 3

Architecture of the proposed MRCNN.


Considering the complex environment of the HSI, where different objects tend to have different scales, we propose to extract both shallow and deep features by applying a convolution layer with rectified linear unit (ReLU) activation and two residual blocks in the classification. A local max pooling layer is adopted in the residual blocks. We add a flatten layer and an FC layer with the same number of neurons after each scale output, and these FC layers are then merged into a new FC layer. Let $h^{(j)} = f(W^{(j)} x^{(j)} + b^{(j)})$, $j = 1, 2, 3$, denote the $j$'th FC layer, where $x^{(j)}$ is the flattened feature of the $j$'th flatten layer, and $W^{(j)}$ and $b^{(j)}$ are the corresponding weight matrix and bias term, respectively. The fourth FC layer $h^{(4)}$ is calculated as $h^{(4)} = \mathrm{concat}[h^{(1)}, h^{(2)}, h^{(3)}]$. In this way, features from different layers are taken into consideration during the classification stage, and the network possesses the multiscale property.
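A rough functional-API sketch of this multiscale spatial branch, reusing the residual_block sketch above; the exact topology of Fig. 3 (filter counts, pooling positions) is simplified, so treat this as an illustration of the flatten/FC/concatenation scheme rather than the authors' network.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_mrcnn(input_shape=(24, 24, 4), n_classes=16, units=64):
    """Shallow and deep features from one Conv+ReLU stage and two residual
    blocks are each flattened into an FC layer h(1)..h(3); their
    concatenation h(4) feeds the softmax classifier."""
    inp = layers.Input(shape=input_shape)
    s1 = layers.Conv2D(64, 3, padding="same", activation="relu")(inp)   # shallow scale
    s2 = layers.MaxPooling2D(2)(residual_block(s1))                     # intermediate scale
    s3 = layers.MaxPooling2D(2)(residual_block(s2))                     # deep scale
    h = [layers.Dense(units, activation="relu")(layers.Flatten()(s)) for s in (s1, s2, s3)]
    h4 = layers.Concatenate()(h)                                        # h(4) = concat[h(1), h(2), h(3)]
    out = layers.Dense(n_classes, activation="softmax")(h4)
    return Model(inp, out)
```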

The loss function for cross entropy of MRCNN can be expressed as

Eq. (1)

$$L = -\frac{1}{M}\sum_{m=1}^{M}\sum_{n=1}^{N} y_{mn}\log\left(\hat{y}_{mn}\right),$$
where $y_{mn}$ and $\hat{y}_{mn}$ denote the true and predicted labels, respectively, $M$ is the number of training samples, and $N$ is the number of classes.

3.2.

Bi-GRU for Spectral Classification

GRU has fewer parameters than LSTM for modeling various sequential problems, and Bi-GRU allows the sequential vectors to be fed into the architecture one by one to learn continuous features in both the forward and backward directions. We therefore utilize Bi-GRU for spectral classification.

The complete spectral classification framework is shown in Fig. 4. To reduce computation, a suitable band grouping strategy2 is used in this paper. For each pixel $x$ in the HSI, let $x = (\lambda_1, \lambda_2, \ldots, \lambda_j, \ldots, \lambda_b)^T$ be the spectral vector, where $\lambda_j$ is the reflectance of the $j$'th band and $b$ is the number of bands. Let $r$ be the number of time steps (i.e., the number of groups). The transformed sequences are denoted by $(c_1, c_2, \ldots, c_t, \ldots, c_r)$, where $c_t$ is the sequence at the $t$'th time step. Specifically, the grouping strategy is

Eq. (2)

$$c_1 = (\lambda_1, \lambda_{1+r}, \ldots, \lambda_{1+(m-1)r})^T$$
$$c_2 = (\lambda_2, \lambda_{2+r}, \ldots, \lambda_{2+(m-1)r})^T$$
$$\vdots$$
$$c_t = (\lambda_t, \lambda_{t+r}, \ldots, \lambda_{t+(m-1)r})^T$$
$$\vdots$$
$$c_r = (\lambda_r, \lambda_{r+r}, \ldots, \lambda_{r+(m-1)r})^T,$$
where $m = \mathrm{floor}(b/r)$ is the sequence length of each time step and the $\mathrm{floor}(\cdot)$ function rounds numbers down. After grouping, the spectral vector $x$ is transformed into the sequences $(c_1, c_2, \ldots, c_t, \ldots, c_r)$.
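Eq. (2) is essentially a strided re-sampling of the spectral vector; a small NumPy sketch of our reading of it (not the authors' code) is given below.

```python
import numpy as np

def band_grouping(x, r):
    """Transform a spectral vector of length b into r sequences c_1..c_r
    following Eq. (2): c_t collects every r'th band starting from band t,
    keeping m = floor(b / r) values per time step."""
    b = x.shape[0]
    m = b // r
    return np.stack([x[t::r][:m] for t in range(r)])   # shape (r, m)
```

For example, with the b = 200 bands of the IP data set and r = 3 time steps (the setting used later in Sec. 4.2.3), each sequence c_t contains m = 66 bands.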

Fig. 4

Band grouping-based Bi-GRU model for spectral classification.


The input to our model is the sequences $(c_1, c_2, \ldots, c_t, \ldots, c_r)$, and the bi-directional hidden vectors are calculated as follows.

Forward hidden state:

Eq. (3)

$$h_t^{(1)} = \tanh\left(W^{(1)} \cdot c_t + U^{(1)} \cdot h_{t-1}^{(1)} + b^{(1)}\right).$$

Backward hidden state:

Eq. (4)

$$h_t^{(2)} = \tanh\left(W^{(2)} \cdot c_t + U^{(2)} \cdot h_{t+1}^{(2)} + b^{(2)}\right),$$
where the coefficient matrices $W^{(1)}$ and $W^{(2)}$ act on the input at the present step, $U^{(1)}$ acts on the hidden state $h_{t-1}^{(1)}$ at the previous step, $U^{(2)}$ acts on $h_{t+1}^{(2)}$ at the succeeding step, and $\tanh$ is the hyperbolic tangent. The memory of the input, which is the output of this encoder, is $g_t$:

Eq. (5)

$$g_t = \mathrm{concat}\left(h_t^{(1)}, h_t^{(2)}\right),$$
where $\mathrm{concat}(\cdot)$ denotes the concatenation of the forward hidden state and the backward hidden state.

The grouping strategy uses the original HSI spectral values as the features of the new sequences, and the RNN uses a parameter sharing scheme, so a one-dimensional convolutional residual block is added to reassign the feature weights based on a channel attention mechanism. The predicted label $y_i$ of pixel $x_i$ is then computed as

Eq. (6)

$$y_i = V\left(F_{1d}(g_1, \ldots, g_t, \ldots, g_r) + (g_1, \ldots, g_t, \ldots, g_r)\right),$$
where $F_{1d}(\cdot)$ is a one-dimensional convolutional layer with stride one and $V(\cdot)$ indicates the series of operations shown in Fig. 4, including a ReLU activation, a flatten function, an FC layer, and a Softmax activation function.
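A minimal Keras sketch of this spectral branch, assuming the grouped sequences are fed as a (time steps × features) tensor; the one-dimensional convolutional residual re-weighting of Eq. (6) is reduced to a single Conv1D with an additive skip, so this is an illustration rather than the exact network of Fig. 4.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_bigru_branch(time_steps=3, feat_len=66, n_classes=16, hidden=64):
    """Bi-GRU spectral branch: g_t concatenates the forward and backward
    hidden states, a stride-one Conv1D residual connection re-weights the
    features, and a ReLU / flatten / FC / softmax head predicts the label
    (cf. Eq. (6))."""
    inp = layers.Input(shape=(time_steps, feat_len))
    g = layers.Bidirectional(layers.GRU(hidden, return_sequences=True))(inp)
    f = layers.Conv1D(2 * hidden, kernel_size=3, strides=1, padding="same")(g)  # F_1d(g)
    y = layers.Activation("relu")(layers.Add()([f, g]))                         # F_1d(g) + g
    out = layers.Dense(n_classes, activation="softmax")(layers.Flatten()(y))
    return Model(inp, out)
```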

3.3.

SSMRN

The proposed SSMRN framework is shown in Fig. 5. It starts with two branches that learn the spatial and spectral features, respectively; these two branches are then concatenated into a single layer, with $\lambda_{\mathrm{spatial}}$ and $\lambda_{\mathrm{spectral}}$ as the corresponding weighting factors.

Fig. 5

Architecture of the proposed SSMRN.


To better train the whole network, two auxiliary tasks are added to the framework.2 The proposed SSMRN is thus a triple-task framework, comprising one main task (classification based on spectral-spatial information) and two auxiliary tasks (classification based on spectral information and classification based on spatial information). The complete cross-entropy loss function of the SSMRN is defined as

Eq. (7)

$$L = L_{\mathrm{joint}} + L_{\mathrm{spectral}} + L_{\mathrm{spatial}} = -\frac{1}{M}\sum_{m=1}^{M}\sum_{n=1}^{N} y_{mn}\log\left(\hat{y}_{mn}^{\mathrm{joint}}\right) - \frac{1}{M}\sum_{m=1}^{M}\sum_{n=1}^{N} y_{mn}\log\left(\hat{y}_{mn}^{\mathrm{spectral}}\right) - \frac{1}{M}\sum_{m=1}^{M}\sum_{n=1}^{N} y_{mn}\log\left(\hat{y}_{mn}^{\mathrm{spatial}}\right),$$
where $L_{\mathrm{joint}}$ is the main loss function, $L_{\mathrm{spectral}}$ and $L_{\mathrm{spatial}}$ are the two auxiliary loss functions, $\hat{y}_{mn}^{\mathrm{joint}}$, $\hat{y}_{mn}^{\mathrm{spectral}}$, and $\hat{y}_{mn}^{\mathrm{spatial}}$ are the corresponding predicted labels, $y_{mn}$ is the true label, $M$ is the number of training samples, and $N$ is the number of classes. The whole network is trained in an end-to-end manner, where all parameters are optimized simultaneously by the batch stochastic gradient descent algorithm. In this way, the complete loss function balances the convergence of the whole network and of the subnetworks.
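Eq. (7) is simply the sum of three categorical cross-entropy terms evaluated against the same labels. A small TensorFlow sketch of that composition is shown below; in practice one could equivalently compile a three-output Keras model with one cross-entropy loss per head and unit loss weights.

```python
import tensorflow as tf

def ssmrn_loss(y_true, y_joint, y_spectral, y_spatial):
    """Complete loss of Eq. (7): mean cross entropy of the main (joint) head
    plus the two auxiliary heads, all against the same one-hot labels
    y_true of shape (M, N)."""
    cce = tf.keras.losses.categorical_crossentropy
    return (tf.reduce_mean(cce(y_true, y_joint))
            + tf.reduce_mean(cce(y_true, y_spectral))
            + tf.reduce_mean(cce(y_true, y_spatial)))
```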

4.

Experiment

In this section, we introduce three public data sets used in our experiment and the configuration of the proposed SSMRN. In addition, classification performance based on the proposed method and other comparative methods is presented.

4.1.

Experimental Data

Three publicly available hyperspectral data sets are utilized to evaluate the performance of the proposed method, i.e., Indian Pines (IP) from the airborne visible/infrared imaging spectrometer (AVIRIS) sensor, Pavia University (PU) from the reflective optics systems imaging spectrometer (ROSIS) sensor, and Salinas (SA) from the AVIRIS sensor. Details of the data sets are given in Table 1.

Table 1

Summary of the HSI datasets used for experimental evaluation.

Data set | Source | Number of pixels | Number of spectral reflectance bands | Wavelength range (μm) | Spatial resolution (m/pixel) | Number of classes
IP | AVIRIS | 145 × 145 | 200 | 0.4–2.5 | 20 | 16
PU | ROSIS | 613 × 340 | 103 | 0.43–0.86 | 1.3 | 9
SA | AVIRIS | 512 × 217 | 204 | 0.4–2.5 | 3.7 | 16

4.2.

Experimental Setting

4.2.1.

Evaluation indicators

To quantitatively analyze the effectiveness of the proposed method and the methods used for comparison, three quantitative evaluation indexes are introduced: class-specific accuracy (CA), overall classification accuracy (OA), and the Kappa coefficient (Kappa). A larger value of each indicator represents a better classification result.
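For reference, the three indicators can be computed from the predicted and true test labels with scikit-learn, where the class-specific CA is taken as the per-class recall (producer's accuracy); this is an illustrative sketch, not the authors' evaluation code.

```python
import numpy as np
from sklearn.metrics import accuracy_score, cohen_kappa_score, recall_score

y_true = np.array([0, 0, 1, 1, 2, 2])   # toy ground-truth labels
y_pred = np.array([0, 1, 1, 1, 2, 0])   # toy predictions

oa = accuracy_score(y_true, y_pred)                 # overall accuracy (OA)
kappa = cohen_kappa_score(y_true, y_pred)           # Kappa coefficient
ca = recall_score(y_true, y_pred, average=None)     # class-specific accuracy (CA)
print(oa, kappa, ca)
```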

4.2.2.

Configuration

All experiments are implemented on an Intel(R) Xeon(R) Silver 4210 CPU @ 2.20 GHz with 64 GB of RAM and an NVIDIA RTX 2080 graphics card, using TensorFlow 2.3.1 and Keras 2.4.3 with Python 3.7.6. We use the Adam optimizer to train the networks with a learning rate of 0.001. The gradient of each weight is individually clipped so that its norm is no higher than 1. The number of training epochs is set to 1500 with a batch size of 1048.
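A sketch of this training configuration in Keras; `model`, `x_train`, and `y_train` are placeholders for a compiled network and its training data, and `clipnorm=1.0` clips each weight's gradient norm individually as described above.

```python
import tensorflow as tf

# Adam with learning rate 0.001 and per-variable gradient-norm clipping at 1
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001, clipnorm=1.0)
model.compile(optimizer=optimizer,
              loss="categorical_crossentropy",   # one such loss per output head for the SSMRN
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=1500, batch_size=1048)
```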

4.2.3.

Parameter setting

All experiments in this paper are randomly repeated 30 times. In each repetition, we first randomly generate the training set from the whole data set with a fixed number of labeled samples per class; the remaining samples make up the test set. Details are given in Tables 2–4, and one possible realization of this split is sketched after this paragraph.
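The per-class random split could be realized as follows; the function name and the dictionary interface are our own, and the per-class counts would follow Tables 2–4 (e.g., 30 per class for IP, 200 for PU and SA).

```python
import numpy as np

def per_class_split(labels, n_train, rng=None):
    """Randomly draw a fixed number of training pixels from every class and
    use the remaining labeled pixels as the test set. `labels` is a 1-D
    array of class ids and `n_train` maps class id -> training count."""
    if rng is None:
        rng = np.random.default_rng()
    train_idx, test_idx = [], []
    for cls in np.unique(labels):
        idx = rng.permutation(np.where(labels == cls)[0])
        train_idx.extend(idx[:n_train[cls]])
        test_idx.extend(idx[n_train[cls]:])
    return np.array(train_idx), np.array(test_idx)
```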

Table 2

Number of training and test samples used in the IP data set.

No. | Class name | Training | Test
1 | Alfalfa | 30 | 16
2 | Corn-notill | 30 | 1398
3 | Corn-mintill | 30 | 800
4 | Corn | 30 | 207
5 | Grass-pasture | 30 | 453
6 | Grass-trees | 30 | 700
7 | Grass-pasture-mowed | 15 | 13
8 | Hay-windrowed | 30 | 448
9 | Oats | 15 | 5
10 | Soybean-notill | 30 | 947
11 | Soybean-mintill | 30 | 2425
12 | Soybean-clean | 30 | 563
13 | Wheat | 30 | 175
14 | Woods | 30 | 1235
15 | Buildings-grass-trees-drives | 30 | 356
16 | Stone-steel-towers | 30 | 63

Table 3

Number of training and test samples used in the PU data set.

No. | Class name | Training | Test
1 | Asphalt | 200 | 6431
2 | Meadows | 200 | 18,449
3 | Gravel | 200 | 1899
4 | Trees | 200 | 2864
5 | Painted metal sheets | 200 | 1145
6 | Bare soil | 200 | 4829
7 | Bitumen | 200 | 1130
8 | Self-blocking bricks | 200 | 3482
9 | Shadows | 200 | 747

Table 4

Number of training and test samples used in the SA data set.

No. | Class name | Training | Test
1 | Brocoli_green_weeds_1 | 200 | 1809
2 | Brocoli_green_weeds_2 | 200 | 3526
3 | Fallow | 200 | 1776
4 | Fallow_rough_plow | 200 | 1194
5 | Fallow_smooth | 200 | 2478
6 | Stubble | 200 | 3759
7 | Celery | 200 | 3379
8 | Grapes_untrained | 200 | 11,071
9 | Soil_vinyard_develop | 200 | 6003
10 | Corn_senesced_green_weeds | 200 | 3078
11 | Lettuce_romaine_4wk | 200 | 868
12 | Lettuce_romaine_5wk | 200 | 1727
13 | Lettuce_romaine_6wk | 200 | 716
14 | Lettuce_romaine_7wk | 200 | 870
15 | Vinyard_untrained | 200 | 7068
16 | Vinyard_vertical_trellis | 200 | 1607

For the proposed MRCNN, the input is a 24 × 24 × 4 patch, where 4 is the number of reserved principal components. All convolutional layers have 64 filters. The kernel size of the first (leftmost) convolutional layer is 1 × 1, and the other kernel sizes are 3 × 3. The size of the max pooling layers is 2 × 2. The three FC layers after each scale output have 64 units each. For the proposed Bi-GRU, the number of time steps is set to 3. The hidden size of the GRU is 64, so the one-dimensional convolutional layers have 128 filters because of the bi-directional structure. For the proposed SSMRN, the inputs are the same as those of the Bi-GRU and MRCNN. The number of neurons of the FC layer in the spectral branch and in the spatial branch is 192, so the number of neurons in the joint FC layer is 384.

In our study, we adopt the strategy of fusing two separate deep spectral features and deep spatial features. Since the importance of spectral and spatial features may vary with spatial resolution, we weight these two parts and must specify initial values for these hyperparameters. The principle is that the higher the spatial resolution and the smaller the influence of the mixed pixel effect, the larger the initial spectral weight should be. We suppose that the sum of the two weights is 1 and that the weights of the two parts are close to each other. With the proposed strategy, the weights of the spectral and spatial parts can then be adjusted adaptively. The initial values of the weighting factors λspatial and λspectral are given in Table 5.

Table 5

Initial value of weighting factors.

Data set | Spatial resolution (m/pixel) | λspatial | λspectral
IP | 20 | 0.5 | 0.5
PU | 1.3 | 0.4 | 0.6
SA | 3.7 | 0.4 | 0.6
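One way to realize the adaptively adjusted weighting factors is a small custom layer with two trainable scalars initialized from Table 5; this is a sketch under our own assumptions about the implementation, not the authors' code.

```python
import tensorflow as tf
from tensorflow.keras import layers

class WeightedFusion(layers.Layer):
    """Scales the spatial and spectral branch features by the trainable
    weighting factors lambda_spatial and lambda_spectral (initialized as in
    Table 5) and concatenates them for the joint classification head."""
    def __init__(self, init_spatial=0.5, init_spectral=0.5, **kwargs):
        super().__init__(**kwargs)
        self.lam_spatial = tf.Variable(init_spatial, trainable=True, name="lambda_spatial")
        self.lam_spectral = tf.Variable(init_spectral, trainable=True, name="lambda_spectral")

    def call(self, inputs):
        spatial_feat, spectral_feat = inputs
        return tf.concat([self.lam_spatial * spatial_feat,
                          self.lam_spectral * spectral_feat], axis=-1)

# e.g., for PU or SA: fused = WeightedFusion(0.4, 0.6)([h_spatial, h_spectral])
```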

4.2.4.

Ablation study

In this section, we compare the SSMRN with the SSMRN without auxiliary tasks. As shown in Table 6, the SSMRN surpasses the SSMRN without auxiliary tasks, especially for the small-sample IP data set. These results indicate that multi-task learning helps the network exploit the available HSI data for feature learning.

Table 6

OA (%) of SSMRN with two modules.

Module | IP | PU | SA
SSMRN | 94.87 ± 1.21 | 99.90 ± 0.07 | 99.76 ± 0.14
SSMRN without auxiliary tasks | 92.57 ± 1.26 | 99.77 ± 0.17 | 99.71 ± 0.24

4.3.

Classification Results

To demonstrate the superiority and effectiveness of the proposed SSMRN model, it is compared with the proposed Bi-GRU and MRCNN and with advanced spectral-spatial DNN methods, namely SSUN,2 SSAN,21 RSSAN,15 MorphCNN,11 and SSFTT.18 Bi-GRU is the spectral FE branch of the SSMRN, and MRCNN is its spatial FE branch. SSUN, SSAN, RSSAN, MorphCNN, SSFTT, and SSMRN are all based on the CNN backbone and its variants and integrate spatial and spectral features. RSSAN and SSFTT directly extract the joint deep spectral-spatial features via CNNs. SSUN, SSAN, MorphCNN, SSFTT, and SSMRN obtain deep spectral features and deep spatial features via two deep networks, and the two kinds of features are then fused to generate the joint deep spectral-spatial features. The difference is that SSMRN considers the weight relationship between the spectral and spatial branches, depending on the spatial resolution of the images, and embeds multi-task learning technology at the same time.

For SSUN, SSAN, and MorphCNN, the input is a 24 × 24 × 4 patch, where 4 is the number of reserved principal components. Limited by our computer configuration, we could not run RSSAN with the original input size used in the corresponding reference, so the input of RSSAN is a 24 × 24 × 8 patch, where 8 is the number of reserved principal components instead of the number of spectral bands. Following its reference, the input of SSFTT is a 13 × 13 × 30 patch. For SSUN, SSAN, RSSAN, MorphCNN, and SSFTT, all network settings are as described in their corresponding references. For a fair comparison, the training and test sample sets of all methods are randomly selected as shown in Tables 2–4.

Quantitative evaluation: Tables 7–9 report the CA, OA, and Kappa of all the mentioned methods for the IP, PU, and SA data sets, respectively. All algorithms are executed 30 times, and the average results with their standard deviations are reported to reduce random selection effects. The optimal results are denoted in bold. The evaluation data clearly show that the proposed SSMRN performs the best: it obtains the highest OA and Kappa and also most of the highest class-specific accuracies, with only a few classes showing slightly lower precision than MRCNN, SSUN, and SSFTT. Particularly on the IP data set, the results of SSMRN are higher than those of the other methods, which shows that SSMRN can effectively learn the features of objects, especially with a small number of samples. The CA, OA, and Kappa of Bi-GRU are lower than those of the other methods, particularly on the IP data set (Table 7), because Bi-GRU only uses spectral features and the IP data set has lower spatial resolution and a stronger mixed pixel effect. MRCNN's results are second only to SSMRN, which shows that good results can be obtained using spatial features and a proper deep network structure. Especially on the SA data set, the results of the MRCNN and SSMRN models are almost identical. The likely reason is that the ground objects of interest in the image are homogeneous, regular, and large in area, so the pixel-level supervised information can reasonably be regarded as patch-level supervised information and the scales of the supervised information and the spatial features match. The structures of SSUN and SSAN are similar to that of SSMRN in that they also fuse two separate deep spectral features and deep spatial features; the reason why their results are not as good as those of SSMRN may be that the network depth of their spectral and spatial FE is insufficient. The structures of RSSAN, MorphCNN, and SSFTT belong to the approach of directly extracting deep features from the original data or from several principal components of the original data. RSSAN and SSFTT are powerful methods, but their main limitation is that a certain number of samples are required, which may result in poor performance with small samples, such as on the IP data set. The classification accuracy of MorphCNN is low and unstable in Tables 7 and 8 because, compared with the objects in PU and SA, the morphological features contained in the patches are not obvious.

Table 7

Classification results of different methods for the IP data set. Bold indicates the best result.

Label | Class name | Bi-GRU | MRCNN | SSUN | SSAN | RSSAN | MorphCNN | SSFTT | SSMRN
1 | Alfalfa | 95.62 | 100 | 100 | 100 | 99.79 | 96.25 | 100 | 99.79
2 | Corn-notill | 68.39 | 81.13 | 72.35 | 78.24 | 63.24 | 78.16 | 84.76 | 87.76
3 | Corn-mintill | 64.40 | 93.60 | 88.92 | 89.47 | 74.40 | 82.86 | 90.21 | 95.75
4 | Corn | 82.25 | 98.96 | 96.71 | 98.24 | 91.75 | 92.75 | 99.14 | 99.35
5 | Grass-pasture | 89.25 | 95.65 | 93.00 | 92.81 | 89.83 | 85.24 | 94.77 | 96.57
6 | Grass-trees | 94.90 | 97.24 | 92.85 | 94.90 | 94.14 | 87.23 | 99.10 | 99.47
7 | Grass-pasture-mowed | 96.41 | 100 | 99.74 | 100 | 100 | 93.33 | 100 | 100
8 | Hay-windrowed | 96.46 | 99.88 | 99.69 | 99.61 | 99.63 | 97.66 | 99.90 | 99.99
9 | Oats | 99.33 | 100 | 100 | 100 | 100 | 91.33 | 100 | 100
10 | Soybean-notill | 77.99 | 90.99 | 79.67 | 82.02 | 76.68 | 87.54 | 89.97 | 94.29
11 | Soybean-mintill | 59.77 | 88.91 | 75.08 | 82.37 | 70.79 | 88.70 | 84.42 | 92.11
12 | Soybean-clean | 78.65 | 91.36 | 88.08 | 88.88 | 77.82 | 81.52 | 87.75 | 96.52
13 | Wheat | 99.12 | 99.88 | 99.37 | 99.94 | 98.93 | 91.90 | 99.90 | 99.92
14 | Woods | 83.94 | 97.86 | 89.92 | 93.80 | 87.82 | 94.33 | 97.02 | 99.03
15 | Buildings-grass-trees-drives | 76.89 | 98.68 | 96.65 | 98.26 | 91.89 | 94.44 | 98.56 | 99.89
16 | Stone-steel-towers | 94.55 | 99.31 | 99.78 | 99.47 | 98.46 | 97.30 | 99.31 | 100
OA (%) | | 74.96 ± 1.22 | 91.93 ± 2.05 | 84.02 ± 1.55 | 87.70 ± 1.41 | 78.98 ± 2.29 | 87.48 ± 13.86 | 90.74 ± 1.41 | 94.87 ± 1.21
Kappa ×100 | | 71.73 ± 1.34 | 90.88 ± 2.23 | 81.89 ± 1.73 | 86.02 ± 1.57 | 76.22 ± 2.56 | 85.58 ± 16.22 | 89.46 ± 1.59 | 94.13 ± 1.38
Runtime (s) | | 59.09 ± 0.75 | 111.66 ± 0.57 | 50.75 ± 0.59 | 162.26 ± 1.28 | 89.97 ± 0.47 | 205.27 ± 1.84 | 58.68 ± 0.12 | 160.58 ± 0.84

Table 8

Classification results of different methods for the PU data set. Bold indicates the best result.

Label | Class name | Bi-GRU | MRCNN | SSUN | SSAN | RSSAN | MorphCNN | SSFTT | SSMRN
1 | Asphalt | 88.74 | 99.74 | 95.92 | 96.86 | 98.68 | 91.55 | 99.39 | 99.81
2 | Meadows | 92.26 | 99.88 | 97.29 | 96.53 | 99.52 | 83.82 | 99.76 | 99.94
3 | Gravel | 86.05 | 99.93 | 97.39 | 98.64 | 99.31 | 92.32 | 99.45 | 99.97
4 | Trees | 96.84 | 99.26 | 99.29 | 98.28 | 98.63 | 91.92 | 98.88 | 99.61
5 | Painted metal sheets | 99.73 | 99.82 | 99.98 | 99.97 | 99.82 | 96.98 | 99.90 | 99.96
6 | Bare soil | 93.68 | 98.03 | 99.00 | 99.63 | 99.87 | 90.45 | 99.99 | 99.99
7 | Bitumen | 93.79 | 100 | 99.09 | 99.77 | 99.74 | 89.42 | 99.99 | 99.99
8 | Self-blocking bricks | 85.45 | 99.64 | 98.39 | 98.61 | 98.47 | 92.58 | 98.39 | 99.85
9 | Shadows | 99.85 | 99.45 | 99.83 | 99.92 | 99.38 | 95.37 | 99.51 | 99.99
OA (%) | | 91.71 ± 0.64 | 99.57 ± 1.31 | 97.68 ± 0.67 | 97.59 ± 1.20 | 99.28 ± 0.37 | 88.25 ± 23.50 | 99.54 ± 0.17 | 99.90 ± 0.07
Kappa ×100 | | 89.00 ± 0.82 | 99.41 ± 1.82 | 96.90 ± 0.89 | 96.79 ± 1.59 | 99.03 ± 0.50 | 86.51 ± 24.85 | 99.38 ± 0.23 | 99.87 ± 0.10
Runtime (s) | | 107.75 ± 1.61 | 332.85 ± 4.81 | 98.55 ± 0.92 | 321.25 ± 2.00 | 335.54 ± 3.94 | 744.56 ± 5.14 | 204 ± 1.89 | 394.69 ± 5.15

Table 9

Classification results of different methods for the SA data set. Bold indicates the best result.

Label | Class name | Bi-GRU | MRCNN | SSUN | SSAN | RSSAN | MorphCNN | SSFTT | SSMRN
1 | Brocoli_green_weeds_1 | 99.45 | 100 | 99.84 | 100 | 99.97 | 92.84 | 100 | 99.99
2 | Brocoli_green_weeds_2 | 99.85 | 100 | 99.83 | 99.96 | 99.98 | 82.84 | 99.99 | 99.99
3 | Fallow | 99.70 | 100 | 99.83 | 99.99 | 99.94 | 72.61 | 100 | 99.99
4 | Fallow_rough_plow | 99.47 | 99.98 | 99.87 | 99.87 | 99.93 | 87.74 | 99.77 | 99.92
5 | Fallow_smooth | 98.90 | 99.91 | 99.53 | 99.67 | 99.90 | 76.90 | 99.63 | 99.54
6 | Stubble | 99.89 | 100 | 99.96 | 99.90 | 99.97 | 89.16 | 99.99 | 99.99
7 | Celery | 99.76 | 99.98 | 99.95 | 99.95 | 99.89 | 87.45 | 99.96 | 99.94
8 | Grapes_untrained | 79.35 | 99.11 | 90.17 | 96.44 | 96.36 | 59.83 | 98.33 | 99.39
9 | Soil_vinyard_develop | 99.66 | 100 | 99.85 | 99.70 | 99.78 | 60.71 | 99.99 | 99.98
10 | Corn_senesced_green_weeds | 95.51 | 99.97 | 99.08 | 99.70 | 99.87 | 70.71 | 99.78 | 99.64
11 | Lettuce_romaine_4wk | 99.02 | 99.93 | 99.92 | 99.61 | 99.90 | 73.04 | 100 | 99.97
12 | Lettuce_romaine_5wk | 99.96 | 100 | 99.96 | 99.82 | 99.95 | 48.45 | 100 | 100
13 | Lettuce_romaine_6wk | 99.32 | 100 | 99.96 | 99.86 | 99.92 | 53.79 | 99.99 | 99.92
14 | Lettuce_romaine_7wk | 98.26 | 100 | 99.93 | 99.93 | 99.96 | 65.59 | 99.93 | 99.83
15 | Vinyard_untrained | 76.38 | 99.41 | 95.68 | 98.32 | 98.53 | 46.15 | 99.18 | 99.70
16 | Vinyard_vertical_trellis | 99.16 | 100 | 99.90 | 99.89 | 99.82 | 88.73 | 99.93 | 99.95
OA (%) | | 91.71 ± 0.86 | 99.71 ± 0.28 | 97.13 ± 0.48 | 98.89 ± 0.46 | 98.94 ± 1.18 | 68.15 ± 16.64 | 99.48 ± 0.38 | 99.76 ± 0.14
Kappa ×100 | | 90.73 ± 0.94 | 99.68 ± 0.31 | 96.79 ± 0.54 | 98.76 ± 0.51 | 98.82 ± 1.32 | 64.31 ± 19.06 | 99.42 ± 0.42 | 99.74 ± 0.16
Runtime (s) | | 192.09 ± 1.85 | 595.64 ± 2.04 | 159.61 ± 1.09 | 1065.16 ± 8.65 | 577.14 ± 2.68 | 1333.84 ± 40.29 | 365.99 ± 0.99 | 740.52 ± 4.54

As shown in Tables 7–9, Bi-GRU, SSUN, and SSFTT generally cost less time than MRCNN and the other spectral-spatial feature methods. The likely reasons are the grouping strategy of Bi-GRU, the grouping strategy and limited network depth of SSUN, and the transformer encoder module of SSFTT. The runtime of MorphCNN is the longest because its network structure is more complex and deeper than those of the other networks.

Tables 10–12 show the OA of SSUN, SSAN, RSSAN, MorphCNN, SSFTT, and SSMRN with different numbers of training samples. To assess the stability and robustness of the proposed method under different training sample conditions, 5, 10, 15, and 30 labeled samples of each class are randomly selected as training data for IP, and 30, 50, 100, and 200 for PU and SA. As the sample size changes, the results of MorphCNN fluctuate sharply, which further indicates that the morphological features are unstable. As the number of samples increases, the results of SSUN, SSAN, RSSAN, SSFTT, and SSMRN improve, and SSMRN significantly outperforms the other methods under all training sample conditions. Even with a small number of samples, our method still performs well. In addition, when the number of samples per class is 100 for PU and SA, the OAs of all the other methods are below 99%, whereas the accuracy of the SSMRN reaches 99.5%. These results show that SSMRN can effectively learn the features of objects under different training sample conditions.

Table 10

OA(%) of different methods under different training sample numbers of each class on the IP data set. Bold indicates the best result.

Training sample numbers of each class | SSUN | SSAN | RSSAN | MorphCNN | SSFTT | SSMRN
5 | 57.93 | 57.29 | 50.59 | 62.22 | 67.87 | 67.80
10 | 69.88 | 70.87 | 62.72 | 77.73 | 78.13 | 83.91
15 | 75.13 | 77.17 | 68.37 | 68.20 | 83.88 | 89.23
30 | 84.02 | 87.70 | 78.98 | 87.48 | 90.74 | 94.87

Table 11

OAs (%) of different methods under different training sample numbers of each class on the PU data set. Bold indicates the best result.

Training sample numbers of each class | SSUN | SSAN | RSSAN | MorphCNN | SSFTT | SSMRN
30 | 87.27 | 84.57 | 87.84 | 82.40 | 94.90 | 95.15
50 | 90.46 | 88.46 | 91.98 | 80.84 | 97.00 | 97.71
100 | 95.09 | 94.10 | 97.75 | 90.10 | 98.76 | 99.55
200 | 97.68 | 97.59 | 99.28 | 88.25 | 99.54 | 99.90

Table 12

OAs (%) of different methods under different training sample numbers of each class on the SA data set. Bold indicates the best result.

Training sample numbers of each class | SSUN | SSAN | RSSAN | MorphCNN | SSFTT | SSMRN
30 | 93.84 | 93.80 | 95.79 | 80.05 | 96.42 | 97.63
50 | 95.03 | 95.46 | 96.59 | 88.95 | 97.65 | 98.48
100 | 96.35 | 97.38 | 96.56 | 77.96 | 98.86 | 99.51
200 | 97.13 | 98.89 | 98.94 | 68.15 | 99.48 | 99.74

Qualitative evaluation: the classification maps of the different methods are shown in Figs. 6–8. By visual comparison, the classification map obtained by SSMRN is the cleanest and the closest to the ground-truth map. Due to the lack of spatial features, the classification maps of Bi-GRU suffer from pepper noise and misclassification inside objects. Compared with spectral FE methods, spatial FE methods make full use of the continuity of ground objects and yield cleaner classification maps. The main problem of MRCNN lies in the over-smoothing phenomenon; RSSAN, MorphCNN, and SSFTT exhibit over-smoothing as well, because they directly extract joint deep spectral-spatial features from the original data or from several principal components of the original data, so their spectral features come from the patch scale. Meanwhile, SSMRN, SSUN, and SSAN better retain the detailed boundaries of different objects and acquire smoother and more homogeneous results, especially within the white dashed box. The most likely reason is that they have separate spatial and spectral FE branches, and their spectral features come from the pixel scale. However, SSUN and SSAN do not consider the weight relationship between the two branches depending on the spatial resolution of the images. The proposed SSMRN takes the weight between spectral and spatial features into consideration and can further reduce over-smoothing.

Fig. 6

IP data set and classification maps using different methods: (a) false-color image; (b) ground-truth map; (c) Bi-GRU; (d) MRCNN; (e) SSUN; (f) SSAN; (g) RSSAN; (h) MorphCNN; (i) SSFTT; and (j) SSMRN.


Fig. 7

PU data set and classification maps using different methods: (a) false-color image; (b) ground-truth map; (c) Bi-GRU; (d) MRCNN; (e) SSUN; (f) SSAN; (g) RSSAN; (h) MorphCNN; (i) SSFTT; and (j) SSMRN.


Fig. 8

SA data set and classification maps using different methods: (a) false-color image; (b) ground-truth map; (c) Bi-GRU; (d) MRCNN; (e) SSUN; (f) SSAN; (g) RSSAN; (h) MorphCNN; (i) SSFTT; and (j) SSMRN.


5.

Discussion

The experimental results on the three public data sets indicate that SSMRN performs more competitively than all the compared methods in terms of the three measurements (CA, OA, and Kappa) and the classification maps. This is due to the following:

  • 1. The SSMRN is designed with a spectral branch and a spatial branch to extract spectral-spatial features. These operations join spectral features and spatial information together sufficiently.

  • 2. The proposed framework takes the weight between spectral and spatial features into consideration and can reduce over-smoothing. Meanwhile, the multi-task learning technology is integrated into the framework, improving the stability of results.

6.

Conclusion

To significantly reduce the over-smoothing effect and effectively learn the features of objects, a multi-task learning SSMRN has been proposed to extract spectral-spatial features. The experimental results on three public data sets demonstrate that the method not only mitigates the over-smoothing phenomenon but also performs better than the other methods in terms of CA, OA, and Kappa. Our method significantly outperforms the other methods under different training sample conditions.

Although we utilize the proposed band grouping-based Bi-GRU and MRCNN as the spectral and spatial feature extractors in the implementation of the proposed SSMRN, other deep networks can also be introduced into our model, especially as spectral extractors. This deserves to be investigated in future work.

Acknowledgments

This work was supported in part by the Major Science and Technology Program of Henan Province (Grant Nos. 222102320341, 212102311149, and 212102310432), by the Key Scientific Research Projects of Colleges and Universities in Henan Province (Grant No. 22B420004), by the Fundamental Research Funds for the Universities of Henan Province (Grant No. NSFRF210401), by the Doctoral Foundation of Henan Polytechnic University (Grant Nos. B2017-09, B2017-14, and B2015-22), and by the National Natural Science Foundation of China (Grant No. 41801318). We declare that we do not have any commercial or associative interest that represents a conflict of interest in connection with the work submitted.

References

1. 

H. Sun et al., “Spectral–spatial attention network for hyperspectral image classification,” IEEE Trans. Geosci. Remote Sens., 58 (5), 3232 –3245 https://doi.org/10.1109/TGRS.2019.2951160 IGRSD2 0196-2892 (2020). Google Scholar

2. 

Y. Xu et al., “Spectral–spatial unified networks for hyperspectral image classification,” IEEE Trans. Geosci. Remote Sens., 56 (10), 5893 –5909 https://doi.org/10.1109/TGRS.2018.2827407 IGRSD2 0196-2892 (2018). Google Scholar

3. 

L. Zhang et al., “Simultaneous spectral-spatial feature selection and extraction for hyperspectral images,” IEEE Trans. Cybern., 48 (1), 16 –28 https://doi.org/10.1109/TCYB.2016.2605044 (2018). Google Scholar

4. 

M. Ahmad et al., “Hyperspectral image classification—traditional to deep models: a survey for future prospects,” IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens., 15 968 –999 https://doi.org/10.1109/JSTARS.2021.3133021 (2022). Google Scholar

5. 

W. Liu et al., “A survey of deep neural network architectures and their applications,” Neurocomputing, 234 11 –26 https://doi.org/10.1016/j.neucom.2016.12.038 NRCGEO 0925-2312 (2017). Google Scholar

6. 

P. Zhou et al., “Learning compact and discriminative stacked autoencoder for hyperspectral image classification,” IEEE Trans. Geosci. Remote Sens., 57 (7), 4823 –4833 https://doi.org/10.1109/TGRS.2019.2893180 IGRSD2 0196-2892 (2019). Google Scholar

7. 

M. Ahmad et al., “Multi-layer extreme learning machine-based autoencoder for hyperspectral image classification,” in VISIGRAPP (4: VISAPP), 75 –82 (2019). https://doi.org/10.5220/0007258000750082 Google Scholar

8. 

B. Liu et al., “Spatial–spectral jointed stacked auto-encoder-based deep learning for oil slick extraction from hyperspectral images,” J. Indian Soc. Remote Sens., 47 (12), 1989 –1997 https://doi.org/10.1007/s12524-019-01045-y (2019). Google Scholar

9. 

B. Ayhan and C. Kwan, “Application of deep belief network to land cover classification using hyperspectral images,” Lect. Notes Comput. Sci., 10261 269 –276 https://doi.org/10.1007/978-3-319-59072-1_32 LNCSD9 0302-9743 (2017). Google Scholar

10. 

G. E. Hinton, S. Osindero and Y.-W. Teh, “A fast learning algorithm for deep belief nets,” Neural Computation, 18 (7), 1527 –1554 https://doi.org/10.1162/neco.2006.18.7.1527 NEUCEB 0899-7667 (2006). Google Scholar

11. 

S. K. Roy et al., “Morphological convolutional neural networks for hyperspectral image classification,” IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens., 14 8689 –8702 https://doi.org/10.1109/JSTARS.2021.3088228 (2021). Google Scholar

12. 

L. Zhang, L. Zhang and B. Du, “Deep learning for remote sensing data: a technical tutorial on the state of the art,” IEEE Geosci. Remote Sens. Mag., 4 (2), 22 –40 https://doi.org/10.1109/MGRS.2016.2540798 (2016). Google Scholar

13. 

J. Yang, Y.-Q. Zhao and J.C.-W. Chan, “Learning and transferring deep joint spectral–spatial features for hyperspectral classification,” IEEE Trans. Geosci. Remote Sens., 55 (8), 4729 –4742 https://doi.org/10.1109/TGRS.2017.2698503 IGRSD2 0196-2892 (2017). Google Scholar

14. 

Z. Zhong et al., “Spectral–spatial residual network for hyperspectral image classification: a 3-D deep learning framework,” IEEE Trans. Geosci. Remote Sens., 56 (2), 847 –858 https://doi.org/10.1109/TGRS.2017.2755542 IGRSD2 0196-2892 (2018). Google Scholar

15. 

M. Zhu et al., “Residual spectral–spatial attention network for hyperspectral image classification,” IEEE Trans. Geosci. Remote Sens., 59 (1), 449 –462 https://doi.org/10.1109/TGRS.2020.2994057 IGRSD2 0196-2892 (2021). Google Scholar

16. 

Y. Xu, B. Du and L. Zhang, “Beyond the patchwise classification: spectral-spatial fully convolutional networks for hyperspectral image classification,” IEEE Trans. Big Data, 6 (3), 492 –506 https://doi.org/10.1109/TBDATA.2019.2923243 (2020). Google Scholar

17. 

S. Wan et al., “Hyperspectral image classification with context-aware dynamic graph convolutional network,” IEEE Trans. Geosci. Remote Sens., 59 (1), 597 –612 https://doi.org/10.1109/TGRS.2020.2994205 IGRSD2 0196-2892 (2021). Google Scholar

18. 

L. Sun et al., “Spectral–spatial feature tokenization transformer for hyperspectral image classification,” IEEE Trans. Geosci. Remote Sens., 60 1 –14 https://doi.org/10.1109/TGRS.2022.3144158 IGRSD2 0196-2892 (2022). Google Scholar

19. 

R. Hang et al., “Cascaded recurrent neural networks for hyperspectral image classification,” IEEE Trans. Geosci. Remote Sens., 57 (8), 5384 –5394 https://doi.org/10.1109/TGRS.2019.2899129 IGRSD2 0196-2892 (2019). Google Scholar

20. 

F. Zhou et al., “Hyperspectral image classification using spectral-spatial LSTMs,” Neurocomputing, 328 39 –47 https://doi.org/10.1016/j.neucom.2018.02.105 NRCGEO 0925-2312 (2019). Google Scholar

21. 

X. Mei et al., “Spectral-spatial attention networks for hyperspectral image classification,” Remote Sens., 11 (8), 963 https://doi.org/10.3390/rs11080963 (2019). Google Scholar

22. 

M. Seydgar et al., “3-D convolution-recurrent networks for spectral-spatial classification of hyperspectral images,” Remote Sens., 11 (7), 883 https://doi.org/10.3390/rs11070883 (2019). Google Scholar

23. 

M. V. Valueva et al., “Application of the residue number system to reduce hardware costs of the convolutional neural network implementation,” Math. Comput. Simul., 177 232 –243 https://doi.org/10.1016/j.matcom.2020.04.031 MCSIDR 0378-4754 (2020). Google Scholar

24. 

C. Zhang et al., “A study on overfitting in deep reinforcement learning,” (2018). Google Scholar

25. 

Y. Xu, B. Du and L. Zhang, “Assessing the threat of adversarial examples on deep neural networks for remote sensing scene classification: attacks and defenses,” IEEE Trans. Geosci. Remote Sens., 59 (2), 1604 –1617 https://doi.org/10.1109/TGRS.2020.2999962 IGRSD2 0196-2892 (2021). Google Scholar

26. 

C. Cheng et al., “Hyperspectral image classification via spectral-spatial random patches network,” IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens., 14 4753 –4764 https://doi.org/10.1109/JSTARS.2021.3075771 (2021). Google Scholar

27. 

Y. Xu et al., “Hyperspectral image classification via a random patches network,” ISPRS J. Photogramm. Remote Sens., 142 344 –357 https://doi.org/10.1016/j.isprsjprs.2018.05.014 IRSEE9 0924-2716 (2018). Google Scholar

28. 

H.-C. Shin et al., “Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning,” IEEE Trans. Med. Imaging, 35 (5), 1285 –1298 https://doi.org/10.1109/TMI.2016.2528162 ITMID4 0278-0062 (2016). Google Scholar

29. 

L. Ma et al., “Deep learning in remote sensing applications: a meta-analysis and review,” ISPRS J. Photogramm. Remote Sens., 152 166 –177 https://doi.org/10.1016/j.isprsjprs.2019.04.015 IRSEE9 0924-2716 (2019). Google Scholar

30. 

K. He et al., “Deep residual learning for image recognition,” in Proc. IEEE Conf. Comput. Vis. and Pattern Recognit., 770 –778 (2016). https://doi.org/10.1109/CVPR.2016.90 Google Scholar

31. 

L. Mou, P. Ghamisi and X. X. Zhu, “Deep recurrent neural networks for hyperspectral image classification,” IEEE Trans. Geosci. Remote Sens., 55 (7), 3639 –3655 https://doi.org/10.1109/TGRS.2016.2636241 IGRSD2 0196-2892 (2017). Google Scholar

32. 

S. Li et al., “Deep learning for hyperspectral image classification: an overview,” IEEE Trans. Geosci. Remote Sens., 57 (9), 6690 –6709 https://doi.org/10.1109/TGRS.2019.2907932 IGRSD2 0196-2892 (2019). Google Scholar

Biography

Shi He is an assistant professor at the Henan Polytechnic University. He received his BS degree from China Agricultural University in 2009 and his PhD in cartography and geographic information system from Beijing Normal University in 2016. His current research interests include optical remote sensing, machine learning, and image classification.

Huazhu Xue received his PhD in cartography and geographic information systems in the area of quantitative remote sensing from the School of Geography, Beijing Normal University, Beijing, China, in 2012. Since 2012, he has been an associate professor at the School of Surveying and Land Information Engineering, Henan Polytechnic University, Jiaozuo, China. His research interests include vegetation parameters inversion, satellite image processing, and GIS applications.

Jiehai Cheng is an associate professor at Henan Polytechnic University. He received his PhD in cartography and geographic information system from Beijing Normal University in 2013. His current research interests include high-resolution remote sensing, deep learning, image classification, and GIS intelligent analysis.

Lei Wang received his PhD in cartography and geographical information engineering from China University of Mining and Technology, Beijing, in 2016. Since 2016, he has been an associate professor at Henan Polytechnic University in Jiaozuo City. His current research interests include global GIS modeling, discrete global grid, and the application of remote sensing.

Yaping Wang is an associate professor at the Henan Polytechnic University. She received her PhD in cartography and geographic information engineering from China University of Mining and Technology, Beijing, in 2014. Her current research interests include image classification, remote sensing of water resources, and GIS applications.

Yongjuan Zhang received her bachelor’s degree from Henan Polytechnic University in June 2022. Currently, she is studying for a master’s degree in Nanjing Normal University. Her research direction is target extraction based on deep learning.

CC BY: © The Authors. Published by SPIE under a Creative Commons Attribution 4.0 International License. Distribution or reproduction of this work in whole or in part requires full attribution of the original publication, including its DOI.
Shi He, Huazhu Xue, Jiehai Cheng, Lei Wang, Yaping Wang, and Yongjuan Zhang "Tackling the over-smoothing problem of CNN-based hyperspectral image classification," Journal of Applied Remote Sensing 16(4), 048506 (26 November 2022). https://doi.org/10.1117/1.JRS.16.048506
Received: 28 April 2022; Accepted: 15 November 2022; Published: 26 November 2022
Keywords: Image classification, Education and training, Feature extraction, Hyperspectral imaging, Spatial resolution, Data modeling, Neural networks
