# IntelliRAG

Production-grade RAG system with a hybrid cloud/local LLM architecture achieving a 70-80% cost reduction.
## Overview
Designed and deployed a production-grade Retrieval-Augmented Generation (RAG) system with a cost-optimized hybrid architecture that combines local LLM serving with cloud infrastructure.
## Technical Details

### Technologies Used
- LLM Serving: vLLM for local inference
- Cloud: Google Cloud Platform (GCP)
- Infrastructure as Code: Terraform
- CI/CD: Automated deployment pipelines for infrastructure and application releases
- Monitoring: Prometheus, Grafana, Loki
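As a rough illustration of the local-serving side, the sketch below builds a request body for a vLLM instance running its OpenAI-compatible server (started with `vllm serve <model>`, which listens on port 8000 by default). The endpoint, model name, and prompt template are illustrative assumptions, not details from this project:

```python
import json

# Assumed endpoint of a locally running vLLM OpenAI-compatible server.
VLLM_ENDPOINT = "http://localhost:8000/v1/chat/completions"

def build_rag_payload(question: str, context_chunks: list[str],
                      model: str = "meta-llama/Llama-3.1-8B-Instruct") -> dict:
    """Assemble retrieved chunks and the user question into one chat request.

    The model name is a placeholder; any model served by the vLLM instance works.
    """
    context = "\n\n".join(context_chunks)
    return {
        "model": model,
        "messages": [
            {"role": "system",
             "content": "Answer using only the provided context:\n" + context},
            {"role": "user", "content": question},
        ],
        "temperature": 0.1,  # low temperature keeps answers grounded in context
    }

payload = build_rag_payload("What is vLLM?",
                            ["vLLM is a high-throughput LLM inference engine."])
body = json.dumps(payload)  # ready to POST to VLLM_ENDPOINT
```

Because vLLM exposes the same API shape as cloud providers, the same payload can be sent to either backend, which is what makes a hybrid strategy cheap to implement.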
### Architecture Highlights
- Hybrid local/cloud LLM serving strategy
- Complete document processing pipeline
- End-to-end response generation flow
- Production-ready observability stack
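One way a hybrid local/cloud serving strategy like the one above can work is a routing policy that prefers the cheap local backend and spills over to cloud only when necessary. This is a minimal sketch under assumed thresholds (context limit, queue depth); the actual routing logic of this project is not documented here:

```python
from dataclasses import dataclass

LOCAL_CONTEXT_LIMIT = 8_192  # tokens the local model can handle (assumed)
LOCAL_QUEUE_LIMIT = 32       # in-flight requests before spilling to cloud (assumed)

@dataclass
class RouteDecision:
    backend: str  # "local" or "cloud"
    reason: str

def route(prompt_tokens: int, local_queue_depth: int) -> RouteDecision:
    """Prefer the local vLLM node; use the cloud backend only when forced."""
    if prompt_tokens > LOCAL_CONTEXT_LIMIT:
        return RouteDecision("cloud", "context exceeds local window")
    if local_queue_depth >= LOCAL_QUEUE_LIMIT:
        return RouteDecision("cloud", "local queue saturated")
    return RouteDecision("local", "fits local capacity")
```

Routing the common case locally while paying cloud prices only for overflow is the basic mechanism behind cost reductions of the magnitude reported below.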
## Results & Impact
- 70-80% cost reduction compared to fully cloud-hosted GPU inference
- Complete end-to-end ownership from document processing to response generation
- Production-ready with comprehensive monitoring and alerting
- Scalable architecture for enterprise deployment
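The "document processing to response generation" ownership mentioned above starts with ingestion. A typical first step is splitting documents into overlapping chunks before embedding; the function below is a generic sketch with illustrative sizes, not the project's actual chunker:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows with overlap.

    The overlap means content straddling a chunk boundary appears in both
    neighbouring chunks, so retrieval does not miss boundary sentences.
    Sizes here are illustrative defaults, not tuned values.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - overlap, 1), step)]
```

Production pipelines usually chunk on token or sentence boundaries rather than raw characters, but the overlap idea carries over unchanged.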