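import requests


def get_public_ip():
    """Return this host's public IP address, or an error string on failure."""
    try:
        # The timeout keeps a slow or unreachable service from blocking startup.
        response = requests.get('https://api.ipify.org', timeout=10)
        response.raise_for_status()
        return response.text
    except Exception as e:
        return f"Error: {e}"


public_ip = get_public_ip()
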
ABOUT = f"""
# ❓ About

The Powered-by-Intel LLM Leaderboard runs the same benchmarks as the Open LLM Leaderboard, and we plan to add
domain-specific benchmarks in the future. We use the <a href="https://github.com/EleutherAI/lm-evaluation-harness" target="_blank">EleutherAI Language Model Evaluation Harness</a>,
a unified framework for testing generative language models on a large number of
different evaluation tasks.

Our current benchmarks include:

- <a href="https://arxiv.org/abs/1803.05457" target="_blank">AI2 Reasoning Challenge (25-shot)</a> - a set of grade-school science questions.
- <a href="https://arxiv.org/abs/1905.07830" target="_blank">HellaSwag (10-shot)</a> - a test of commonsense inference, which is easy for humans (~95%) but challenging for state-of-the-art models.
- <a href="https://arxiv.org/abs/2009.03300" target="_blank">MMLU (5-shot)</a> - a test measuring a text model's multitask accuracy, covering 57 tasks in fields such as elementary mathematics, US history, computer science, and law.
- <a href="https://arxiv.org/abs/2109.07958" target="_blank">TruthfulQA (0-shot)</a> - a test measuring a model's propensity to reproduce falsehoods commonly found online. Note: TruthfulQA is technically a 6-shot task in the Harness because each example is prepended with 6 Q/A pairs, even in the 0-shot setting.
- <a href="https://arxiv.org/abs/1907.10641" target="_blank">WinoGrande (5-shot)</a> - an adversarial and difficult Winograd benchmark at scale, for commonsense reasoning.
- <a href="https://arxiv.org/abs/2110.14168" target="_blank">GSM8K (5-shot)</a> - a set of diverse grade-school math word problems that measure a model's ability to solve multi-step mathematical reasoning problems.

For all these evaluations, a higher score is better. We chose these benchmarks because they test reasoning and general knowledge across a wide variety of fields in 0-shot and few-shot settings. In the future, we plan to add domain-specific benchmarks to further evaluate our models.

We run an adapted version of the benchmark code, specifically designed to run the EleutherAI Harness benchmarks on Gaudi processors.
This adapted evaluation harness is built into the Hugging Face Optimum Habana library. Review the documentation [here](https://github.com/huggingface/optimum-habana/tree/main/examples/text-generation).

## Support and Community

Join 5000+ developers on the [Intel DevHub Discord](https://discord.gg/yNYNxK2k) to get support with your submission
and talk about everything from GenAI and HPC to quantum computing.

## "Chat with Top Models on the Leaderboard Here 💬" Functionality

This is a fun on-leaderboard chat feature that gives you a quick way to try the top LLMs on the leaderboard.
As the leaderboard matures and users submit models, we will rotate the models available for chat. Who knows!? You might find
your model featured here soon! ⭐

### Chat Functionality Notice
- All the models in this demo run on 4th Generation Intel® Xeon® (Sapphire Rapids) processors, using AMX operations and quantized inference optimizations.
- Terms of use: By using the chat functionality, users agree to the following terms: The service is a research preview intended for non-commercial
use only. It can produce factually incorrect output and should not be relied on to produce factually accurate information.
The service provides only limited safety measures and may generate lewd, biased, or otherwise offensive content. It must not be
used for any illegal, harmful, violent, racist, or sexual purposes. The service may collect user dialogue data for future research.
- License: The chat functionality is a research preview intended for non-commercial use only.

space ip: {public_ip}
"""