
Fine-Tuning a Vision LLM: My Journey and Lessons Learned

8 min read · Sep 13, 2025

Extracting structured information from scanned documents — think invoices, medical forms, or utility bills — has always been tricky. Traditional OCR pipelines often struggle when layouts vary, and open-weight general-purpose large language models (LLMs) aren’t always reliable without domain-specific adaptation. Closed-source multimodal models such as Gemini 2.5 Flash or GPT-4.1 perform very well out of the box and can be cost-effective, but they’re often not an option for companies with strict data privacy or compliance requirements.

Recently, I experimented with fine-tuning the LLaMA 3.2 Vision 11B model on a vision-language task. The path was far from straight: over weeks of training, tuning, and deployment, I made mistakes, iterated, and uncovered optimizations that saved both time and cost.
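For context, here is a minimal sketch of what a setup like this can look like. It is not my exact training code: it assumes Hugging Face transformers and PEFT with LoRA adapters, and the hyperparameters and target modules are illustrative values you would tune for your own task and hardware.

```python
# Minimal sketch: load Llama 3.2 Vision 11B and attach LoRA adapters
# for parameter-efficient fine-tuning (illustrative, not my exact setup).
import torch
from transformers import MllamaForConditionalGeneration, AutoProcessor
from peft import LoraConfig, get_peft_model

MODEL_ID = "meta-llama/Llama-3.2-11B-Vision-Instruct"

# Load the base model in bfloat16 and let accelerate place it on available GPUs.
model = MllamaForConditionalGeneration.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# Attach LoRA adapters to the attention projections so only a small
# fraction of the weights are trained; r, alpha, and dropout are placeholders.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```

From here, the adapted model can be trained on image-plus-text pairs prepared with the processor, which is where most of the lessons in this post come from.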

While document parsing is my motivating example, the lessons I share here are not tied to OCR alone. The principles apply broadly to anyone fine-tuning vision-language models, whether for medical scans, satellite imagery, retail receipts, or product recognition.

This post captures my journey and the insights I believe can help others working on similar fine-tuning projects.

Training: What Worked and What Didn’t


Written by Mithun Das

Software Engineer | Designing & Building Software for 20+ Years
