Which LLM is Better - Open LLM Leaderboard

Anson Park

8 min read


Jan 7, 2023

The Open LLM Leaderboard

The Open LLM Leaderboard is a significant initiative on Hugging Face, aimed at tracking, ranking, and evaluating open Large Language Models (LLMs) and chatbots. This leaderboard is an essential resource for the AI community as it provides a comprehensive and up-to-date comparison of various open-source LLMs. The platform allows for the submission of models for automated evaluation on a GPU cluster, ensuring a standardized and fair assessment of each model's capabilities.

Which LLM is better?

The Open LLM Leaderboard is part of a broader collection of benchmarks on Hugging Face, known as The Big Benchmarks Collection. This collection extends beyond the Open LLM Leaderboard, gathering benchmark spaces to provide a wider perspective on the performance of LLMs across different tasks and scenarios. These benchmarks are crucial for understanding the strengths and limitations of various models, and they contribute significantly to the field of AI by enabling more informed choices when selecting an LLM for specific applications.

The leaderboard is maintained by the Open LLM Leaderboard organization on Hugging Face. This organization oversees the dataset with detailed results and queries for the models on the leaderboard, ensuring transparency and accessibility for the AI community. The leaderboard regularly updates a list of models with the best evaluations, providing a snapshot of the most effective and efficient LLMs currently available.

Overall, the Open LLM Leaderboard is a valuable tool for AI researchers, developers, and enthusiasts. It offers insights into the performance of different open-source LLMs, fostering a competitive and collaborative environment that drives innovation and improvement in the field of AI.



The Language Model Evaluation Harness

The Language Model Evaluation Harness serves as the backend for Hugging Face's Open LLM Leaderboard. It is a comprehensive framework developed primarily by EleutherAI, with later expansions by BigScience. It's an open-source tool designed to put LLMs through a robust evaluation process. This tool is crucial for testing the accuracy and reliability of LLMs in a standardized way.

Key aspects of the Language Model Evaluation Harness include:

  1. Robust Evaluation Process: It allows researchers and developers to test LLMs against various benchmarks, ensuring they are evaluated for accuracy, precision, and reliability. The process includes tests such as question answering, multiple-choice questions, and tasks that probe for gender bias.

  2. Standardized Framework: Before its development, evaluating LLMs for efficacy was challenging due to the lack of a unified testing mechanism. The Language Model Evaluation Harness solves this by providing a single framework where models can be implemented and evaluated across numerous tasks.

  3. Reproducibility and Comparability: One of the primary goals of the Evaluation Harness is to enable users to replicate results mentioned in research papers and compare them with other results in the literature. This approach enhances transparency and reliability in LLM research.

  4. Wide Usage: The Evaluation Harness is used not only by EleutherAI and BigScience but also in research papers by major organizations like Google and Microsoft. It's integral in setting benchmarks for auditing LLMs and is backed by entities like the Mozilla Technology Fund.

  5. Focus on Multilingual Evaluation: Recognizing the predominance of English and Chinese language models, one of the goals is to improve tools for evaluating multilingual LLMs, addressing the nuances embedded in different language systems.

  6. Comprehensive LLM Evaluation and Benchmarking: It supports over 60 standard academic benchmarks for LLMs with hundreds of subtasks and variants implemented. It supports a wide range of models and offers features for custom prompts and evaluation metrics.

This tool represents a significant step towards more responsible and accountable AI development, ensuring that LLMs are not only powerful but also accurate and free from biases.
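To make the evaluation process concrete, here is a minimal, self-contained sketch of how harnesses like this typically score a multiple-choice item (as in ARC or HellaSwag): the model assigns a log-likelihood to each candidate answer, and the highest-scoring candidate is taken as the prediction. The scoring function below is a toy stand-in, not a real model call.

```python
def fake_log_likelihood(question: str, answer: str) -> float:
    # Stand-in for a real model call: favors word overlap with the question,
    # lightly penalized by answer length. A real harness would sum the model's
    # token log-probabilities for the answer conditioned on the question.
    overlap = len(set(question.lower().split()) & set(answer.lower().split()))
    return overlap - 0.1 * len(answer.split())

def score_multiple_choice(question: str, choices: list[str]) -> int:
    """Return the index of the highest log-likelihood choice."""
    scores = [fake_log_likelihood(question, c) for c in choices]
    return max(range(len(choices)), key=lambda i: scores[i])

question = "Which gas do plants absorb from the air for photosynthesis?"
choices = ["Plants absorb carbon dioxide from the air",
           "Plants absorb helium",
           "Rocks"]
print(score_multiple_choice(question, choices))  # index of the predicted answer
```

Accuracy on a benchmark is then just the fraction of items where the predicted index matches the labeled answer.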


Beginner's Guide to the Open LLM Leaderboard

The Open LLM Leaderboard is straightforward and user-friendly. However, for those unfamiliar with such leaderboards, it may seem a bit complex at first glance. Allow me to offer a simple explanation of how to navigate and understand the leaderboard.

  • You can see a list of LLM models.

  • Here, you're able to view 6 key benchmark scores for each individual LLM.

    • AI2 Reasoning Challenge (ARC): Created by Clark et al. in 2018, this challenge is a rigorous test for LLMs' question-answering capabilities. It consists of 7,787 multiple-choice science questions from 3rd to 9th-grade level exams, divided into an "Easy Set" and a more difficult "Challenge Set". The Challenge Set includes questions that require more complex reasoning beyond simple fact retrieval, thus testing the LLMs' deeper comprehension skills.

    • HellaSwag: This benchmark assesses common-sense reasoning in physical situations. It involves challenging incorrect answers generated through "Adversarial Filtering", making it difficult for LLMs that rely heavily on probabilities. It is a significant test for understanding an LLM's ability to apply commonsense reasoning.

    • Massive Multitask Language Understanding (MMLU): This benchmark evaluates LLMs across a broad range of language understanding tasks. It is designed to test the model's proficiency in various domains and its ability to adapt to different types of language tasks.

    • TruthfulQA: This benchmark focuses on evaluating the truthfulness of LLM responses. It's a critical measure in the age of information, where the accuracy of data provided by LLMs is paramount.

    • Winogrande: This benchmark tests LLMs on their ability to solve Winograd schema-style pronoun disambiguation problems, which are crucial for assessing an LLM's understanding of language and context.

    • GSM8k: This set comprises 8,500 grade-school math problems requiring basic to intermediate math operations. It tests LLMs' ability to work through multi-step math problems, which is valuable for assessing AI's capability in solving basic mathematical problems, especially in educational contexts.

  • For all benchmark scores, a higher number indicates better performance.

  • In the "Average" column, you can see the mean value of individual benchmark scores.

  • By default, LLMs with higher "Average" benchmark scores are displayed at the top.

  • 🟒 Pretrained Model: This icon represents new, base models that have been trained on a given corpus. These are foundational models created from the ground up.

  • πŸ”Ά Fine-Tuned Model: This category includes pretrained models that have been further refined and improved upon by training on additional data.

  • β­• Instruction-Tuned Model: These are models specifically fine-tuned on datasets of task instructions. They are tailored to better understand and respond to task-specific directions.

  • 🟦 RL-Tuned Model: Indicates models that have undergone reinforcement learning fine-tuning, such as RLHF. This process usually involves optimizing the model against a reward signal or policy objective rather than plain next-token prediction.
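As a quick illustration of the "Average" column described above, it is simply the arithmetic mean of the six benchmark scores. The scores below are made up for the example, not real leaderboard numbers:

```python
# Hypothetical benchmark scores for one model (percent); not real leaderboard data.
scores = {
    "ARC": 61.2,
    "HellaSwag": 83.5,
    "MMLU": 64.1,
    "TruthfulQA": 52.8,
    "Winogrande": 78.0,
    "GSM8k": 45.3,
}

# The "Average" column: the mean of the six benchmark scores.
average = sum(scores.values()) / len(scores)
print(round(average, 2))
```

Sorting models by this value in descending order reproduces the leaderboard's default ordering.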

In the context of LLMs, different precision types like float16, bfloat16, 8bit, 4bit, and GPTQ refer to the way numerical data is represented and processed within the model, impacting both the model's memory footprint and computational efficiency.

  1. float16 (Half-Precision Floating-Point Format): This format occupies 16 bits in computer memory. It's often used in applications where high precision is not essential. Using float16 can accelerate the training and inference processes of LLMs by reducing memory requirements and computational overheads, especially on GPUs that support this precision format. However, it may lead to issues like reduced numerical stability and potential loss in model accuracy.

  2. bfloat16: The bfloat16 format is a truncated version of the standard 32-bit floating-point format. It preserves the exponent bits while reducing the precision of the significand. This format is beneficial for neural networks, as it provides a balance between performance and precision. It allows for fast conversion to and from a 32-bit float, making it suitable for LLMs that need both performance and accuracy.

  3. 8bit and 4bit Quantization: These are techniques used to reduce the precision of the model's weights, typically from 16-bit to 8-bit or 4-bit, with minimal performance degradation. Lowering the bit precision of the weights significantly reduces the model's memory footprint, making it possible to train and deploy larger models on limited hardware resources. However, this might come with a tradeoff in terms of accuracy and numerical stability.

  4. GPTQ: This is a post-training quantization (PTQ) method designed for GPT-style transformer models, introduced in the paper "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers" and focused on efficient GPU inference. GPTQ lowers the precision of the model's weights (typically to 3 or 4 bits) while maintaining performance, allowing for efficient storage and computation of large models.
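To see the core idea behind weight quantization, here is a minimal sketch of symmetric round-to-nearest 8-bit quantization in plain Python. Real methods such as GPTQ are considerably more sophisticated, but the principle of storing low-precision integers plus a scale factor is the same:

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Symmetric 8-bit quantization: ints in [-127, 127] plus one scale factor."""
    scale = max(abs(w) for w in weights) / 127.0
    return [round(w / scale) for w in weights], scale

def dequantize(q: list[int], scale: float) -> list[float]:
    return [v * scale for v in q]

weights = [0.42, -1.0, 0.013, 0.77]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Each restored weight is within half a quantization step of the original.
assert all(abs(w - r) <= scale / 2 for w, r in zip(weights, restored))
```

The storage saving is direct: each weight drops from 4 bytes (float32) or 2 bytes (float16) to a single byte, at the cost of a bounded rounding error per weight.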

Each of these precision types has its unique advantages and tradeoffs. For instance, using lower precision formats like float16 or bfloat16 can significantly speed up the training and inference processes but may impact the model's accuracy and numerical stability. On the other hand, 8bit and 4bit quantization techniques enable the use of larger models on hardware with limited memory but require careful implementation to avoid significant performance degradation.
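The range-versus-precision tradeoff between float16 and bfloat16 can be demonstrated with nothing but the standard library. The helper below produces a bfloat16 value by truncating a float32 to its top 16 bits (sign, 8 exponent bits, 7 mantissa bits), while `struct`'s `"e"` format gives genuine float16 round-trips:

```python
import struct

def to_bfloat16(x: float) -> float:
    # bfloat16 = the top 16 bits of a float32 (sign, 8 exponent, 7 mantissa bits)
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]

def to_float16(x: float) -> float:
    # Round-trip through struct's IEEE half-precision ("e") format
    return struct.unpack(">e", struct.pack(">e", x))[0]

# bfloat16 keeps float32's wide exponent range: 1e30 survives truncation.
print(to_bfloat16(1e30))

# float16's largest finite value is only 65504.
print(to_float16(65504.0))

# But float16 has more mantissa bits (10 vs 7), so in-range it is *more* precise:
assert abs(to_float16(0.1) - 0.1) < abs(to_bfloat16(0.1) - 0.1)
```

This is why bfloat16 is popular for training (it rarely overflows, and converts trivially to and from float32), while float16 needs loss scaling or similar tricks to stay numerically stable.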

Model Size = Number of Parameters in LLMs (in Billions)

LLMs are distinguished by their size and the number of parameters they contain. The size of an LLM is directly related to its complexity and learning capabilities, where larger models can encapsulate and learn more complex patterns due to their expansive parameter space. This is particularly important in tasks involving intricate language structures, multi-modality, or subtle context dependencies such as long conversations.

The parameter count of an LLM is a crucial aspect that impacts its performance, generalization capabilities, and computational requirements. For instance, PaLM 2 reportedly has 340 billion parameters, and GPT-4 is estimated to have around 1.8 trillion parameters. These large-scale models require significant computational resources for training and inference, including high GPU/TPU requirements and substantial VRAM. Moreover, larger models tend to excel in a range of tasks, demonstrating their ability to understand and process diverse types of information more effectively.

However, the size of an LLM also poses challenges. Larger models may overfit to the training data, especially if the data lacks diversity or is not extensive enough. This necessitates the use of regularization techniques and careful data selection. Additionally, larger models entail higher environmental and economic costs due to their significant energy consumption for training and operation.

In terms of practical deployment, it is important to consider the balance between the size of the model and the specific needs of the application. While larger models offer advanced capabilities, they may not always be the most efficient choice, especially for simpler tasks. Smaller, more specialized models can provide a more cost-effective and equally effective solution for straightforward applications. Therefore, understanding the complexity of the task at hand and choosing an appropriately sized model is vital to ensure a balance between capability and efficiency.
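The practical consequence of parameter count and precision is easy to estimate: memory for the weights alone is roughly parameters times bytes per parameter. A back-of-the-envelope helper (actual usage is higher once activations, KV cache, and optimizer state are included; the 7B figure is just an example size):

```python
def weight_memory_gib(n_params: float, bytes_per_param: float) -> float:
    """Approximate memory for model weights alone, in GiB."""
    return n_params * bytes_per_param / 1024**3

# A hypothetical 7B-parameter model at different precisions:
for label, nbytes in [("float32", 4), ("float16/bfloat16", 2),
                      ("8-bit", 1), ("4-bit", 0.5)]:
    print(f"{label:>18}: {weight_memory_gib(7e9, nbytes):.1f} GiB")
```

This kind of estimate explains why quantization matters in practice: halving the bytes per parameter roughly halves the VRAM needed just to load the model.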

  • Architecture: Describes the underlying structure of the LLM.

  • Merged: Indicates whether different models have been combined.

  • Hub License: The type of license under which the model is released, which affects how it can be used.

  • Available on the Hub: Whether the model is available for use on Hugging Face's Hub.

  • Model SHA: The SHA (Secure Hash Algorithm) hash is a unique identifier for the exact version of the model.

  • Flagged: Indicates whether the model has been flagged for issues or concerns by users or moderators.

You can access additional information about the leaderboard in the 'About' tab.

Interested in participating in the leaderboard? You can submit your own LLM under the "Submit Here" tab.

As of now, the leaderboard has evaluated 3,051 LLMs. Currently, 5 LLMs are undergoing evaluation, and 4 are queued for assessment. It's a highly active and engaging platform!

Written by Anson Park

CEO of DeepNatural. MSc in Computer Science from KAIST & TU Berlin. Specialized in Machine Learning and Natural Language Processing.