import requests

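def get_public_ip():
    """Return this host's public IP address, or an error string if the lookup fails."""
    try:
        # A timeout keeps a stalled request from blocking module import indefinitely.
        response = requests.get('https://api.ipify.org', timeout=10)
        # Treat HTTP error statuses as failures instead of returning an error page body.
        response.raise_for_status()
        return response.text.strip()
    except requests.RequestException as e:
        return f"Error: {e}"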

public_ip = get_public_ip()

ABOUT = f"""
# ❓ About 

The Powered-by-Intel LLM Leaderboard runs the same benchmarks as the Open LLM Leaderboard, and we plan to add 
domain-specific benchmarks in the future. We use the <a href="https://github.com/EleutherAI/lm-evaluation-harness" target="_blank">
EleutherAI Language Model Evaluation Harness</a>, a unified framework for testing generative language models on a large number of 
evaluation tasks.

Our current benchmarks include:

- <a href="https://arxiv.org/abs/1803.05457" target="_blank"> AI2 Reasoning Challenge (25-shot)</a> - a set of grade-school science questions.
- <a href="https://arxiv.org/abs/1905.07830" target="_blank"> HellaSwag (10-shot)</a> - a test of commonsense inference, which is easy for humans (~95%) but challenging for state-of-the-art models.
- <a href="https://arxiv.org/abs/2009.03300" target="_blank"> MMLU (5-shot)</a> - a test measuring a text model's multitask accuracy, covering 57 tasks in fields like elementary mathematics, US history, computer science, law, and more.
- <a href="https://arxiv.org/abs/2109.07958" target="_blank"> TruthfulQA (0-shot)</a> - a test measuring a model's propensity to reproduce falsehoods commonly found online. Note: TruthfulQA is technically a 6-shot task in the Harness because each example is prepended with 6 Q/A pairs, even in the 0-shot setting.
- <a href="https://arxiv.org/abs/1907.10641" target="_blank"> WinoGrande (5-shot)</a> - a large-scale, adversarial Winograd-style benchmark for commonsense reasoning.
- <a href="https://arxiv.org/abs/2110.14168" target="_blank"> GSM8K (5-shot)</a> - diverse grade-school math word problems that measure a model's ability to solve multi-step mathematical reasoning problems.

For all these evaluations, a higher score is better. We chose these benchmarks because they test a range of reasoning skills and general knowledge across many fields in 0-shot and few-shot settings.

We run an adapted version of the benchmark code, designed to run the EleutherAI Harness benchmarks on Gaudi processors. 
This adapted evaluation harness is built into the Hugging Face Optimum Habana library. See the [Optimum Habana text-generation documentation](https://github.com/huggingface/optimum-habana/tree/main/examples/text-generation) for details.

## Support and Community 

Join 5000+ developers on the [Intel DevHub Discord](https://discord.gg/yNYNxK2k) to get support with your submission 
and discuss everything from GenAI and HPC to quantum computing.

## "Chat with Top Models on the Leaderboard Here 💬" Functionality

This on-leaderboard chat feature provides a quick way to try the top LLMs on the leaderboard. 
As the leaderboard matures and users submit models, we will rotate the models available for chat. Who knows? Your model
might be featured here soon! ⭐

### Chat Functionality Notice
- All the models in this demo run on 4th Generation Intel® Xeon® processors (Sapphire Rapids), using AMX operations and quantized inference optimizations.
- Terms of use: By using the chat functionality, users are required to agree to the following terms: The service is a research preview intended for non-commercial 
use only. It can produce factually incorrect output, and should not be relied on to produce factually accurate information. 
The service only provides limited safety measures and may generate lewd, biased or otherwise offensive content. It must not be 
used for any illegal, harmful, violent, racist, or sexual purposes. The service may collect user dialogue data for future research.
- License: The chat functionality is a research preview intended for non-commercial use only.

space ip: {public_ip}
"""