Vietnamese Visual Question Answering

Baseline deep learning model for Vietnamese VQA research, published at ICISN 2025

Overview

Designed and validated a baseline deep learning model for Vietnamese Visual Question Answering (VQA), combining computer vision and natural language processing to generate answers accompanied by natural-language explanations (VQA-NLE).
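The model internals are not spelled out above, so the following is a minimal PyTorch sketch of one plausible VQA-NLE baseline: pre-extracted image and question features are fused, one head classifies the answer over a fixed vocabulary, and a small decoder generates the explanation. All module names and dimensions are illustrative assumptions, not the authors' exact design.

```python
# Minimal sketch of a fusion-based VQA-NLE baseline (assumed design).
import torch
import torch.nn as nn

class VQABaseline(nn.Module):
    def __init__(self, img_dim=2048, txt_dim=768, hidden=512,
                 num_answers=1000, vocab_size=32000):
        super().__init__()
        # Project pre-extracted image and question features into a shared
        # space, then fuse them with a simple MLP.
        self.img_proj = nn.Linear(img_dim, hidden)
        self.txt_proj = nn.Linear(txt_dim, hidden)
        self.fusion = nn.Sequential(nn.Linear(hidden * 2, hidden), nn.ReLU())
        # Head 1: classify the answer over a fixed answer vocabulary.
        self.answer_head = nn.Linear(hidden, num_answers)
        # Head 2: decode a natural-language explanation token by token,
        # conditioned on the fused representation (GRU kept small for brevity).
        self.embed = nn.Embedding(vocab_size, hidden)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.expl_head = nn.Linear(hidden, vocab_size)

    def forward(self, img_feat, q_feat, expl_tokens):
        fused = self.fusion(torch.cat(
            [self.img_proj(img_feat), self.txt_proj(q_feat)], dim=-1))
        answer_logits = self.answer_head(fused)
        # Use the fused vector as the decoder's initial hidden state.
        h0 = fused.unsqueeze(0)
        out, _ = self.decoder(self.embed(expl_tokens), h0)
        expl_logits = self.expl_head(out)
        return answer_logits, expl_logits

# Smoke test with random tensors.
model = VQABaseline()
a, e = model(torch.randn(2, 2048), torch.randn(2, 768),
             torch.randint(0, 32000, (2, 12)))
print(a.shape, e.shape)  # torch.Size([2, 1000]) torch.Size([2, 12, 32000])
```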

Technical Details

Technologies Used

  • Computer Vision: Image feature extraction backbones for encoding input images
  • NLP: Pretrained Vietnamese language models for question encoding
  • Architecture: Multimodal deep learning with vision-language fusion (see the sketch after this list)
  • Dataset: 32,886 Vietnamese QA pairs
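A hedged sketch of the two feature extractors, assuming a torchvision ResNet-50 for images and PhoBERT ("vinai/phobert-base", a widely used Vietnamese language model) for questions. The exact backbones are not named above, so these choices are illustrative.

```python
import torch
from torchvision.models import resnet50, ResNet50_Weights
from transformers import AutoModel, AutoTokenizer

weights = ResNet50_Weights.DEFAULT
cnn = resnet50(weights=weights).eval()
cnn.fc = torch.nn.Identity()              # expose the 2048-d pooled feature
preprocess = weights.transforms()

tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base")
phobert = AutoModel.from_pretrained("vinai/phobert-base").eval()

question = "Trong ảnh có bao nhiêu con mèo?"  # "How many cats are in the image?"
with torch.no_grad():
    image = torch.rand(3, 480, 640)           # stand-in for a real photo
    img_feat = cnn(preprocess(image).unsqueeze(0))   # shape (1, 2048)
    enc = tokenizer(question, return_tensors="pt")
    q_feat = phobert(**enc).last_hidden_state[:, 0]  # shape (1, 768)
```

These two vectors are exactly what a fusion module like the one sketched in the Overview would consume.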

Research Contributions

  • First baseline model for Vietnamese VQA that produces natural-language explanations (a sample record is sketched after this list)
  • Systematic experiments on a large-scale Vietnamese dataset
  • Demonstrated visual understanding coupled with Vietnamese-language reasoning
  • Foundation for future Vietnamese multimodal AI research
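To make "answers with explanations" concrete, here is a hypothetical layout for one VQA-NLE training example; the field names are illustrative assumptions, not the dataset's published schema.

```python
# Hypothetical VQA-NLE record (field names are assumptions).
example = {
    "image_id": "img_000123",                        # source image identifier
    "question": "Trong ảnh có bao nhiêu con mèo?",   # "How many cats are in the image?"
    "answer": "hai",                                 # "two"
    "explanation": "Có hai con mèo đang nằm trên ghế sofa.",
    # "There are two cats lying on the sofa."
}
```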

Results & Impact

  • Published at ICISN 2025
  • Validated on 32,886 QA pairs
  • Model generates answers together with natural-language explanations
  • Establishes a foundation for Vietnamese multimodal AI research

Publication

Duong, T.-B., Tran, H.-M., Le-Nguyen, B.-N., & Duong, D.-T. (2026). An Automated Pipeline for Constructing a Vietnamese VQA-NLE Dataset. Proceedings of the Fifth International Conference on Intelligent Systems and Networks, 164–173. https://doi.org/10.1007/978-981-95-1746-6_18
