Accurately identifying the semantic information of complex objects is a challenging problem in the semantic segmentation of remote sensing images. We propose a bi-encoder network for semantic segmentation of complex targets, called SN-Unetformer. It combines ConvNeXt and the Swin Transformer into a bi-encoder and constructs a feature fusion module (FFM) that exploits channel dependence to fully integrate the semantic information from the two encoders. Moreover, an efficient attention mechanism is introduced to model the global–local relationship. To the best of our knowledge, this is the first method to combine two popular networks, ConvNeXt and the Swin Transformer, into a dual encoder. We tested SN-Unetformer on the large-scale Vaihingen and Potsdam datasets, as well as on the LoveDA dataset, which poses significant challenges. Compared with current advanced methods for semantic segmentation of remote sensing images, our method achieves significantly better accuracy. In particular, it achieves a mean intersection over union (mIoU) of 84.3% on the Vaihingen dataset, which is the best result currently available for this dataset.
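For readers unfamiliar with the reported metric, the following is a minimal sketch of how mean intersection over union is commonly computed from flattened per-pixel class labels. It is an illustration only, not the authors' evaluation code: the function name `mean_iou`, the flat label sequences, and the choice to skip classes absent from both prediction and ground truth are assumptions, and the exact protocol used for the Vaihingen benchmark (for example, which classes are counted) is not stated in the abstract.

```python
from typing import Sequence


def mean_iou(pred: Sequence[int], target: Sequence[int], num_classes: int) -> float:
    """Average the per-class IoU over classes present in pred or target.

    Hypothetical helper for illustration; labels are flattened per-pixel
    class indices in the range [0, num_classes).
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same number of pixels")

    ious = []
    for c in range(num_classes):
        # Pixels where both prediction and ground truth are class c.
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Pixels where either prediction or ground truth is class c.
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # assumption: skip classes absent from both
            ious.append(inter / union)

    return sum(ious) / len(ious) if ious else 0.0


# Toy example: four pixels, two classes.
score = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
```

In the toy example, class 0 has an IoU of 1/2 and class 1 has an IoU of 2/3, so the mean is 7/12.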
Transformers
Image segmentation
Semantics
Remote sensing
Buildings
Windows
Feature extraction