Fine-tuning Florence-2 - Microsoft’s Cutting-edge Vision Language Models
What Happened
Fordel's Take
Microsoft just shipped native fine-tuning for Florence-2 in Azure ML Studio: the 232M-parameter vision-language model now trains end-to-end on your own image-text pairs in under 30 minutes on a single A100.
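The announcement doesn't spell out the training-data layout; as a rough sketch, assuming a JSONL file of image-text pairs keyed by a Florence-2-style task token (the field names here are illustrative, not Azure ML's documented schema):

```python
import json

def build_jsonl(pairs, out_path):
    """Write (image_path, caption) pairs as JSONL records.

    Florence-2 conditions generation on a task token such as
    "<CAPTION>"; the record schema below is an assumption for
    illustration, not Azure ML's documented format.
    """
    with open(out_path, "w") as f:
        for image_path, caption in pairs:
            record = {
                "image": str(image_path),
                "prompt": "<CAPTION>",  # task token the model was pre-trained on
                "target": caption,
            }
            f.write(json.dumps(record) + "\n")

# hypothetical product-catalog pairs
pairs = [("shoes/sku_001.jpg", "red leather ankle boot"),
         ("shoes/sku_002.jpg", "white canvas sneaker")]
build_jsonl(pairs, "train.jsonl")
```

At a few thousand SKUs this file stays small enough to regenerate on every catalog update rather than patch in place.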
Florence-2 beats CLIP on VQAv2 by 4.6 points and costs $0.0008 per 1k images versus GPT-4V's $0.03, yet most teams still pipe screenshots to GPT-4V for RAG because they fear managing another checkpoint. That's lazy architecture, not pragmatism.
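Assuming the per-1k-image prices quoted above hold at your volume (a back-of-envelope sketch, not a pricing quote; the monthly volume is hypothetical), the break-even arithmetic is simple:

```python
# per-1k-image inference costs, taken from the figures quoted above
GPT4V_PER_1K = 0.03
FLORENCE2_PER_1K = 0.0008

ratio = GPT4V_PER_1K / FLORENCE2_PER_1K          # ~37.5x cheaper
monthly_images = 10_000_000                      # hypothetical search volume
savings = (GPT4V_PER_1K - FLORENCE2_PER_1K) * monthly_images / 1000

print(f"{ratio:.1f}x cheaper, ${savings:,.0f}/month saved")
```

The gap widens with volume: the fine-tuning cost is a one-time expense, while the per-call spread is recurring.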
Teams with catalogs under 5k SKUs, or mobile apps that cache labels offline, can ignore this; everyone shipping real-time product search should swap GPT-4V calls for a fine-tuned Florence-2 today.
What To Do
Fine-tune Florence-2 on Azure with LoRA at rank 32 instead of calling GPT-4V: it cuts inference cost 37× and holds a 100 ms p99 on Nvidia T4 edge boxes.
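LoRA trains only a low-rank delta on top of a frozen projection weight, so a rank-32 adapter touches a tiny fraction of the 232M parameters. A minimal numpy sketch of the math (rank and alpha mirror the recommendation above; the layer dimensions are illustrative, not Florence-2's actual sizes):

```python
import numpy as np

def lora_linear(x, W, A, B, alpha=32):
    """Frozen base projection plus a trainable low-rank update,
    scaled by alpha / rank as in the LoRA paper."""
    r = A.shape[0]
    return x @ W.T + (x @ A.T @ B.T) * (alpha / r)

rng = np.random.default_rng(0)
d_in, d_out, r = 768, 768, 32               # dimensions are illustrative
W = rng.normal(size=(d_out, d_in))          # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01       # trainable down-projection
B = np.zeros((d_out, r))                    # trainable up-projection, zero-init
x = rng.normal(size=(4, d_in))

# with B zero-initialized, the adapted layer matches the frozen base exactly,
# so fine-tuning starts from the pretrained model's behavior
assert np.allclose(lora_linear(x, W, A, B), x @ W.T)
```

The zero-initialized up-projection is why LoRA fine-tunes are stable: training begins as a no-op and only gradually deviates from the checkpoint.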