ongoing, 10 chapters, since Feb 2026

LLM Benchmark Gaming

10 articles, oldest first

1

H Company’s new Holo2 model takes the lead in UI Localization


2

Import AI 444: LLM societies; Huawei makes kernels with AI; ChipBench

Welcome to Import AI, a newsletter about AI research. Import AI runs on arXiv and feedback from readers. If you’d like to support this, please subscribe. Google paper suggests that LLMs simulate multiple personalities to answer questions: …The smarter we make language models, the more t

3
opinion

Import AI 445: Timing superintelligence; AIs solve frontier math proofs; a new ML research benchmark

Economist: Don’t worry about AI-driven unemployment, because people like paying for the ‘human touch’: …Even when you have the techno

4
research

Import AI 446: Nuclear LLMs; China’s big AI benchmark; measurement and AI policy

Want to make AI go better? Figure out how to measure it: …One simple policy intervention that works well… Jacob Steinhardt, an AI rese

5
research

Identifying Interactions at Scale for LLMs


6
research

ImportAI 449: LLMs training other LLMs; 72B distributed training run; computer vision is harder than generative text

Can LLMs autonomously refine other LLMs for new tasks? Somewhat. …PostTrainBench shows startling growth in AI capabilities at post-tr

7
opinion

The PhD students who became the judges of the AI industry

Artificial intelligence models are multiplying fast, and competition is stiff. With so many players crowding the space, which one will be the best, and who decides that? Arena, formerly LM Arena, has emerged as the de facto public leaderboard for frontier LLMs, influencing

8
research

Why Are Large Language Models so Terrible at Video Games?

Large language models (LLMs) have improved so quickly that the benchmarks themselves have evolved, adding more complex problems in an effort to challenge the latest models. Yet LLMs haven’t improved across all domains, and one task remains far outside their grasp: They have no idea how to play video

9
research

Researchers define what counts as a world model and text-to-video generators do not

An international research team wants to bring order to the fragmented world model research landscape with OpenWorldLib. Text-to-video models like Sora are explicitly left out of their definition.

10
research

Google AI Research Proposes Vantage: An LLM-Based Protocol for Measuring Collaboration, Creativity, and Critical Thinking

Standardized tests can tell you whether a student knows calculus or can parse a passage of text. What they cannot reliably tell you is whether that student can resolve a disagreement with a teammate, generate genuinely original ideas under pressure, or critically dismantle a flawed argument. These a