LLM Benchmark Gaming

10 articles, oldest first

Feb 3, 2026

H Company’s new Holo2 model takes the lead in UI Localization

Feb 9, 2026

Import AI 444: LLM societies; Huawei makes kernels with AI; ChipBench

Welcome to Import AI, a newsletter about AI research. Import AI runs on arXiv and feedback from readers. If you’d like to support this, please subscribe. Google paper suggests that LLMs simulate multiple personalities to answer questions:…The smarter we make language models, the more t

opinion
Feb 16, 2026

Import AI 445: Timing superintelligence; AIs solve frontier math proofs; a new ML research benchmark

Economist: Don’t worry about AI-driven unemployment, because people like paying for the ‘human touch’:…Even when you have the techno

research
Feb 23, 2026

Import AI 446: Nuclear LLMs; China’s big AI benchmark; measurement and AI policy

Want to make AI go better? Figure out how to measure it:…One simple policy intervention that works well…Jacob Steinhardt, an AI rese

research
Mar 13, 2026

Identifying Interactions at Scale for LLMs

research
Mar 16, 2026

ImportAI 449: LLMs training other LLMs; 72B distributed training run; computer vision is harder than generative text

Can LLMs autonomously refine other LLMs for new tasks? Somewhat.…PostTrainBench shows startling growth in AI capabilities at post-tr

opinion
Mar 18, 2026

The PhD students who became the judges of the AI industry

Artificial intelligence models are multiplying fast, and competition is stiff. With so many players crowding the space, which one will be the best — and who decides that? Arena, formerly LM Arena, has emerged as the de facto public leaderboard for frontier LLMs, influencing

research
Mar 29, 2026

Why Are Large Language Models so Terrible at Video Games?

Large language models (LLMs) have improved so quickly that the benchmarks themselves have evolved, adding more complex problems in an effort to challenge the latest models. Yet LLMs haven’t improved across all domains, and one task remains far outside their grasp: They have no idea how to play video

research
Apr 12, 2026

Researchers define what counts as a world model and text-to-video generators do not

An international research team wants to bring order to the fragmented world model research landscape with OpenWorldLib. Text-to-video models like Sora are explicitly left out of their definition.

research
Apr 14, 2026

Google AI Research Proposes Vantage: An LLM-Based Protocol for Measuring Collaboration, Creativity, and Critical Thinking

Standardized tests can tell you whether a student knows calculus or can parse a passage of text. What they cannot reliably tell you is whether that student can resolve a disagreement with a teammate, generate genuinely original ideas under pressure, or critically dismantle a flawed argument. These a