Created Feb 10, 2025 by Alda Pastor (@aldapastor2596)

If There's Intelligent Life Out There


Optimizing LLMs to excel at specific tests backfires on Meta, Stability.



Hugging Face has released its second LLM leaderboard to rank the best language models it has evaluated. The new leaderboard seeks to be a more challenging, uniform benchmark for evaluating open large language model (LLM) performance across a range of tasks. Alibaba's Qwen models appear dominant in the leaderboard's inaugural rankings, taking three spots in the top 10.

"Pumped to announce the brand new open LLM leaderboard. We burned 300 H100s to re-run new evaluations like MMLU-Pro for all major open LLMs! Some learnings: Qwen 72B is the king and Chinese open models are dominating overall; previous evaluations have become too easy for recent ..." - Clem Delangue, June 26, 2024

Hugging Face's second leaderboard tests language models across four tasks: knowledge testing, reasoning on extremely long contexts, complex math abilities, and instruction following. Six benchmarks are used to test these qualities, with tests including solving 1,000-word murder mysteries, answering PhD-level questions in layman's terms, and, most daunting of all, high-school math equations. A full breakdown of the benchmarks used can be found on Hugging Face's blog.
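For readers who want to reproduce a score locally, the leaderboard's evaluations are built on EleutherAI's lm-evaluation-harness, which can also be driven from Python. Below is a minimal sketch, assuming a recent lm-eval install; the task name "leaderboard_mmlu_pro" and the Qwen checkpoint are assumptions to verify against the harness's own task list for your version:

```python
# Minimal sketch: re-running one leaderboard-style benchmark locally
# with EleutherAI's lm-evaluation-harness (pip install lm-eval).
# The task name "leaderboard_mmlu_pro" and the checkpoint are
# assumptions -- check the harness's task list for your version.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",  # Hugging Face transformers backend
    model_args="pretrained=Qwen/Qwen2-72B-Instruct",
    tasks=["leaderboard_mmlu_pro"],
    batch_size="auto",
)

# Aggregated per-task metrics live under results["results"].
for task, metrics in results["results"].items():
    print(task, metrics)
```

Running a 72B model this way of course needs serious GPU memory; the same call works with any smaller open checkpoint.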

The frontrunner of the new leaderboard is Qwen, Alibaba's LLM, which takes first, third, and tenth place with its handful of variants. Also appearing are Llama3-70B, Meta's LLM, and a handful of smaller open-source projects that managed to outperform the pack. Notably missing is any sign of ChatGPT; Hugging Face's leaderboard does not test closed-source models, to ensure reproducibility of results.

Tests to qualify for the leaderboard are run exclusively on Hugging Face's own computers, which, according to CEO Clem Delangue's Twitter, are powered by 300 Nvidia H100 GPUs. Because of Hugging Face's open-source and collaborative nature, anyone is free to submit new models for testing and admission to the leaderboard, with a new voting system prioritizing popular new entries for testing. The leaderboard can be filtered to show only a highlighted range of significant models, to avoid a confusing glut of small LLMs.
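To illustrate the kind of filtering the leaderboard UI offers, here is a hypothetical offline sketch assuming the table has been exported to CSV; the file name and column names are illustrative assumptions, not the leaderboard's actual schema:

```python
# Hypothetical sketch: filtering an exported leaderboard table offline.
# "leaderboard.csv" and the columns ("model", "params_b",
# "average_score") are illustrative assumptions, not the real schema.
import pandas as pd

df = pd.read_csv("leaderboard.csv")

# Hide the glut of small LLMs: keep 7B+ models, rank by average score.
highlighted = (
    df[df["params_b"] >= 7]
    .sort_values("average_score", ascending=False)
    .head(10)
)
print(highlighted[["model", "params_b", "average_score"]])
```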

As a pillar of the LLM space, Hugging Face has become a trusted source for LLM learning and community collaboration. After its first leaderboard was released last year as a way to compare and reproduce testing results from a number of established LLMs, the board quickly exploded in popularity. Getting high ranks on the board became the goal of many developers, small and large, and as models have become generally stronger, "smarter," and optimized for the specific tests of the first leaderboard, its results have become less and less meaningful, hence the creation of a second variant.

Some LLMs, including newer versions of Meta's Llama, significantly underperformed in the new leaderboard compared to their high marks in the first. This stemmed from a trend of over-training LLMs only on the first leaderboard's benchmarks, leading to regression in real-world performance. This regression of performance, thanks to hyperspecific and self-referential data, follows a trend of AI performance growing worse over time, proving once again, as Google's AI answers have shown, that LLM performance is only as good as its training data and that true artificial "intelligence" is still many, many years away.
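A common way to detect this sort of benchmark over-fitting is a contamination check: count how many test-set n-grams also appear verbatim in a model's training corpus. Here is a minimal sketch of the idea, with the n-gram size and the verbatim-overlap criterion as illustrative choices rather than a standard method:

```python
# Minimal sketch of an n-gram contamination check: if many test items
# share n-grams verbatim with the training corpus, benchmark scores on
# those items are suspect. Sizes and thresholds here are illustrative.
def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    toks = text.lower().split()
    return {tuple(toks[i : i + n]) for i in range(len(toks) - n + 1)}

def contamination_rate(train_docs: list[str], test_items: list[str], n: int = 8) -> float:
    train_grams: set[tuple[str, ...]] = set()
    for doc in train_docs:
        train_grams |= ngrams(doc, n)
    # An item counts as contaminated if any of its n-grams was seen in training.
    contaminated = sum(1 for item in test_items if ngrams(item, n) & train_grams)
    return contaminated / max(len(test_items), 1)

# Toy example: one test question appears verbatim in the training data.
train = ["solve for x: 3x + 4 = 19 the answer is x = 5 because ..."]
test = ["solve for x: 3x + 4 = 19", "a completely fresh question about geometry"]
print(contamination_rate(train, test, n=5))  # -> 0.5
```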


Dallin Grimm is a contributing writer for Tom's Hardware. He has been building and breaking computers since 2017, serving as the resident youngster at Tom's. From APUs to RGB, Dallin has a handle on all the latest tech news.


bit_user:

Quoting the article: "LLM performance is only as good as its training data and true artificial 'intelligence' is still many, many years away."

First, this statement discounts the role of network architecture.

Second, intelligence isn't a binary thing - it's more like a spectrum. There are various classes of cognitive tasks and abilities you may be familiar with, if you study child development or animal intelligence.

The definition of "intelligence" cannot be whether something processes information exactly like humans do; otherwise, the search for extraterrestrial intelligence would be entirely pointless. If there's intelligent life out there, it probably doesn't think quite like we do. Machines that act and behave intelligently also needn't necessarily do so, either.

jp7189:

I don't love the click-bait China vs. the world title. The reality is Qwen is open source, open weights, and can be run anywhere. It can (and already has been) fine-tuned to add/remove bias. I applaud Hugging Face's work to produce standardized tests for LLMs, and for putting the focus on open source, open weights first.

jp7189:

bit_user said: "First, this statement discounts the role of network architecture. Second, intelligence isn't a binary thing - it's more like a spectrum. There are various classes of cognitive tasks and abilities you may be familiar with, if you study child development or animal intelligence. The definition of 'intelligence' cannot be whether something processes information exactly like humans do, or else the search for extraterrestrial intelligence would be totally pointless. If there's intelligent life out there, it probably doesn't think quite like we do. Machines that act and behave intelligently likewise needn't necessarily do so, either."

We're creating tools to assist humans; therefore, I would argue LLMs are more helpful if we grade them by human intelligence standards.
