Evaluation of DeepLabV3+ with ResNet backbone for building segmentation using UAV images

Building segmentation from remote sensing, aerial, and UAV images using deep learning has gained significant attention. Buildings are key inputs for urban development, management, and population estimation, so the automatic extraction of buildings from UAV images matters for both research and practical applications. This paper presents a building dataset comprising 6,500 image samples, each measuring 512 x 512 pixels, derived from high-resolution UAV images that capture diverse building characteristics in various regions of Vietnam. The study evaluates building extraction from UAV images on this dataset using the DeepLabV3+ model with a ResNet backbone. The results show that building prediction reaches an Intersection over Union (IoU) of 0.774 with the ResNet101 backbone. However, accuracy depends strongly on the architectural characteristics and spatial distribution of the buildings. In newly developed urban areas and suburban areas, the IoU for predicted buildings reaches 0.874 and 0.857, respectively. In contrast, accuracy declines in industrial zones and older urban areas, with IoU values of 0.762 and 0.673, respectively. These results have practical applications for urban management, development, and smart city construction in Vietnam.
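
For readers unfamiliar with the metric reported above, the following is a minimal sketch of how IoU is commonly computed for a binary building mask. It is not the authors' evaluation code: the function name `binary_iou`, the toy masks, and the convention of returning 1.0 when both masks are empty are assumptions made for illustration.

```python
def binary_iou(pred, truth):
    """Intersection over Union for two equally sized 2D masks of 0/1 values.

    Hypothetical helper for illustration; 1 marks a building pixel.
    """
    intersection = 0
    union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            p, t = bool(p), bool(t)
            intersection += int(p and t)  # pixel is building in both masks
            union += int(p or t)          # pixel is building in at least one mask
    # Assumed convention: two empty masks count as perfect agreement.
    return intersection / union if union else 1.0


# Toy 3 x 4 masks: the prediction is shifted one column from the ground truth.
pred = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
]
truth = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]

# 2 shared pixels out of 6 pixels covered by either mask.
print(round(binary_iou(pred, truth), 3))  # → 0.333
```

Under this definition, an IoU of 0.774 means that the overlap between predicted and reference building pixels covers about 77% of the area marked as building by either one.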

How to Cite

Pham, D.Trung, Truong, H.Minh, Doan, P.Nam Thi, Ta, H.Thu Thi, Nguyen, H.Thi and Nguyen, M.Thi 2025. Evaluation of DeepLabV3+ with ResNet backbone for building segmentation using UAV images (in Vietnamese). Journal of Mining and Earth Sciences. 66, 3 (Jun, 2025), 14-28. DOI:https://doi.org/10.46326/JMES.2025.66(3).02.

