Meta Executive Denies Claims of Inflating Llama 4 Benchmark Scores

On Monday, Ahmad Al-Dahle, Meta’s vice president of generative AI, denied allegations that the company had tuned its AI models to excel on specific benchmarks while concealing their weaknesses. In a post on X, Al-Dahle said it is “simply not true” that Meta trained its new Llama 4 Maverick and Llama 4 Scout models on “test sets.” Test sets, which are held out to evaluate a model’s capabilities after training, can produce misleadingly high scores if the model has been trained on them. The rumor began circulating over the weekend, fueled largely by an unverified claim from a user on a Chinese social media platform who said they had quit Meta over the company’s benchmarking practices. The discussion gained momentum on X and Reddit as reports emerged that Maverick and Scout were underperforming on certain tasks.

Compounding these concerns was Meta’s use of an experimental version of Maverick to achieve higher scores on the LM Arena benchmark, which led observers to note significant behavioral differences between the publicly available Maverick and the version tested on LM Arena. Al-Dahle acknowledged that users are seeing “mixed quality” from the Maverick and Scout models across various cloud providers. He suggested that because the models were released as soon as they were ready, it may take some time for public implementations to stabilize. “We expect it’ll take several days for all the public implementations to get dialed in,” he wrote, adding that the company will continue to fix bugs and work with partners on improvements.
