Hugging Face
Very Large Language Models and How to Evaluate Them
Read the full article on Hugging Face ↗

What Happened
Hugging Face published a post on very large language models and how to evaluate them.
Fordel's Take
look, everyone's throwing around 'evaluation,' but it's a total mess. you can't just look at perplexity or simple accuracy; those metrics fall apart once you're dealing with emergent reasoning. we need more rigorous, human-in-the-loop evaluation methods, like adversarial testing, not just static benchmarks. the deeper problem: LLM outputs aren't deterministic under sampling; these are statistical parrots dressed up as thinkers. stop relying on public benchmarks that models have likely seen during training; build your own stress tests. one cheap place to start is a consistency check, sketched below.
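a minimal sketch of that consistency check, assuming a hypothetical `generate(prompt) -> str` callable that wraps your model's sampling endpoint (any API or local pipeline works; nothing here comes from the article itself):

```python
# consistency stress test: sample the same prompt repeatedly and measure
# how often the model agrees with itself. low agreement suggests the model
# is guessing under temperature sampling rather than reasoning.
from collections import Counter
from typing import Callable

def consistency_check(generate: Callable[[str], str], prompt: str, n: int = 10) -> float:
    """Return the agreement rate of the most common answer across n samples."""
    answers = [generate(prompt).strip() for _ in range(n)]
    _, count = Counter(answers).most_common(1)[0]
    return count / n

if __name__ == "__main__":
    import random

    def toy_generate(prompt: str) -> str:
        # hypothetical stand-in for a real model call: flips between two
        # answers to mimic an unsure model under sampling
        return random.choice(["42", "43"])

    print(f"agreement: {consistency_check(toy_generate, 'What is 6 * 7?'):.0%}")
```

swap `toy_generate` for a real sampling call; an agreement rate well below 100% on tasks with one right answer is exactly the kind of failure that perplexity and single-shot accuracy never surface.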
What To Do
implement custom stress-test evaluations built around domain-specific reasoning tasks; a minimal harness sketch follows below.
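a minimal sketch of such a harness, reusing the same hypothetical `generate` callable as above; the cases and paraphrases here are illustrative placeholders, not a published benchmark, and you would substitute your own domain's reasoning tasks:

```python
# paraphrase stress test: ask the same question several ways. a model that
# passes only some paraphrases is pattern-matching the surface form of the
# prompt, not solving the underlying task.
from typing import Callable

# each case: several paraphrases of one question plus the expected answer
STRESS_CASES = [
    {
        "paraphrases": [
            "A train leaves at 3pm and arrives at 5:30pm. How many minutes is the trip?",
            "How long, in minutes, is a journey from 3:00pm to 5:30pm?",
        ],
        "expected": "150",
    },
]

def run_stress_suite(generate: Callable[[str], str]) -> None:
    for i, case in enumerate(STRESS_CASES):
        answers = [generate(p).strip() for p in case["paraphrases"]]
        correct = all(a == case["expected"] for a in answers)
        stable = len(set(answers)) == 1
        print(f"case {i}: correct={correct} "
              f"stable_across_paraphrases={stable} answers={answers}")

if __name__ == "__main__":
    # trivial stand-in that always answers "150"; swap in a real model call
    run_stress_suite(lambda prompt: "150")
```

the point of the design is that correctness and stability are scored separately: a model can be right on one phrasing and wrong on another, and that gap is the adversarial signal public benchmarks hide.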