Fine-tuning Florence-2 - Microsoft’s Cutting-edge Vision Language Models
What Happened
Fordel's Take
Microsoft just shipped native fine-tuning for Florence-2 in Azure ML Studio: the 232M-parameter vision-language model now trains end-to-end on your own image-text pairs in under 30 minutes on a single A100.
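The announcement doesn't spell out the training-data layout; as a rough sketch, assuming a JSONL file of image-text pairs keyed by a Florence-2-style task token (the field names here are illustrative, not Azure ML's documented schema):

```python
import json

def build_jsonl(pairs, out_path):
    """Write (image_path, caption) pairs as JSONL records.

    Florence-2 conditions generation on a task token such as
    "<CAPTION>"; the record schema below is an assumption for
    illustration, not Azure ML's documented format.
    """
    with open(out_path, "w") as f:
        for image_path, caption in pairs:
            record = {
                "image": str(image_path),
                "prompt": "<CAPTION>",  # task token the model was pre-trained on
                "target": caption,
            }
            f.write(json.dumps(record) + "\n")

# hypothetical product-catalog pairs
pairs = [("shoes/sku_001.jpg", "red leather ankle boot"),
         ("shoes/sku_002.jpg", "white canvas sneaker")]
build_jsonl(pairs, "train.jsonl")
```

At a few thousand SKUs this file stays small enough to regenerate on every catalog update rather than patch in place.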
Florence-2 beats CLIP on VQAv2 by 4.6 points and costs $0.0008 per 1k images versus GPT-4V's $0.03, yet most teams still pipe screenshots to GPT-4V for RAG because they fear managing another checkpoint. That's lazy architecture, not pragmatism.
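Assuming the per-1k-image prices quoted above hold at your volume (a back-of-envelope sketch, not a pricing quote; the monthly volume is hypothetical), the break-even arithmetic is simple:

```python
# per-1k-image inference costs, taken from the figures quoted above
GPT4V_PER_1K = 0.03
FLORENCE2_PER_1K = 0.0008

ratio = GPT4V_PER_1K / FLORENCE2_PER_1K          # ~37.5x cheaper
monthly_images = 10_000_000                      # hypothetical search volume
savings = (GPT4V_PER_1K - FLORENCE2_PER_1K) * monthly_images / 1000

print(f"{ratio:.1f}x cheaper, ${savings:,.0f}/month saved")
```

The gap widens with volume: the fine-tuning cost is a one-time expense, while the per-call spread is recurring.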
Teams with catalogs under 5k SKUs, or mobile apps that cache labels offline, can ignore this; everyone shipping real-time product search should swap GPT-4V calls for a fine-tuned Florence-2 today.
What To Do
Fine-tune Florence-2 on Azure with LoRA at rank 32 instead of calling GPT-4V: it cuts inference cost 37× and holds a 100 ms p99 on Nvidia T4 edge boxes.
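LoRA trains only a low-rank delta on top of a frozen projection weight, so a rank-32 adapter touches a tiny fraction of the 232M parameters. A minimal numpy sketch of the math (rank and alpha mirror the recommendation above; the layer dimensions are illustrative, not Florence-2's actual sizes):

```python
import numpy as np

def lora_linear(x, W, A, B, alpha=32):
    """Frozen base projection plus a trainable low-rank update,
    scaled by alpha / rank as in the LoRA paper."""
    r = A.shape[0]
    return x @ W.T + (x @ A.T @ B.T) * (alpha / r)

rng = np.random.default_rng(0)
d_in, d_out, r = 768, 768, 32               # dimensions are illustrative
W = rng.normal(size=(d_out, d_in))          # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01       # trainable down-projection
B = np.zeros((d_out, r))                    # trainable up-projection, zero-init
x = rng.normal(size=(4, d_in))

# with B zero-initialized, the adapted layer matches the frozen base exactly,
# so fine-tuning starts from the pretrained model's behavior
assert np.allclose(lora_linear(x, W, A, B), x @ W.T)
```

The zero-initialized up-projection is why LoRA fine-tunes are stable: training begins as a no-op and only gradually deviates from the checkpoint.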