tweetsentbr is missing from the Hub

#2
by nicholasKluge - opened

Hello eduagarcia!

I tried running your fork on all benchmarks (congratulations on putting this together!), but I got a warning saying that eduagarcia/tweetsentbr is not available on the Hub. If you want to reproduce the error, just run these cells in Colab:

!git clone --branch main https://github.com/eduagarcia/lm-evaluation-harness-pt.git
!cd lm-evaluation-harness-pt && pip install -e . -q
!pip install cohere tiktoken sentencepiece -q

!cd lm-evaluation-harness-pt && python lm_eval \
    --model huggingface \
    --model_args pretrained="nicholasKluge/TeenyTinyLlama-160m",revision="main" \
    --tasks enem_challenge,bluex,oab_exams,assin2_rte,assin2_sts,faquad_nli,hatebr_offensive,portuguese_hate_speech,tweetsentbr \
    --device cuda:0 \
    --output_path "./"

Could you re-upload tweetsentbr? Or have you removed it from this evaluation harness?

The tweetsentbr dataset is currently private.
The authors have not made the text of the tweets publicly available due to Twitter/X policies. More information is available in their git repository: https://bitbucket.org/HBrum/tweetsentbr

To access the complete dataset, you have two options:

  • Recreate the dataset from the tweet IDs listed in the git repository using the paid Twitter/X API (a sketch of this step is shown after this list).
  • Ask the original authors for a copy of the dataset (contact info is available in the original git repository).
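
If you go the API route, the sketch below shows one way to hydrate the tweet IDs with tweepy. It is only an illustration: the file name ids.txt, the TWITTER_BEARER_TOKEN environment variable, and the batching are assumptions, not part of the original harness; the actual IDs come from the files in the Bitbucket repository.

# Minimal sketch (assumptions noted above): hydrate TweetSentBR tweet IDs via the Twitter/X API.
# Requires paid API access and a bearer token exported as TWITTER_BEARER_TOKEN.
import os
import tweepy

client = tweepy.Client(bearer_token=os.environ["TWITTER_BEARER_TOKEN"])

# Assumed input: one tweet ID per line in ids.txt (taken from the Bitbucket repository).
with open("ids.txt") as f:
    tweet_ids = [line.strip() for line in f if line.strip()]

texts = {}
# The tweet lookup endpoint accepts at most 100 IDs per request.
for i in range(0, len(tweet_ids), 100):
    batch = tweet_ids[i : i + 100]
    response = client.get_tweets(ids=batch, tweet_fields=["text"])
    for tweet in response.data or []:
        texts[str(tweet.id)] = tweet.text

Note that some tweets may have been deleted since the dataset was published, so expect fewer texts than IDs.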

Once you have the text of the tweets, you need to create a Hugging Face dataset with the following format:
[Screenshot showing the expected dataset format]
Then you just need to update the dataset_path field in the YAML configuration file in my lm-evaluation-harness fork: https://github.com/eduagarcia/lm-evaluation-harness-pt/blob/main/lm_eval/tasks/portuguese/tweetsentbr.yaml
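
For reference, here is a minimal sketch of packaging the recovered tweets with the datasets library and pushing them to the Hub. The column names ("text" and "label"), the example rows, and the repository id your-username/tweetsentbr are illustrative assumptions; match them to the format shown in the screenshot and to what the task YAML expects.

# Minimal sketch: build a Hugging Face dataset from the recovered tweets and push it to the Hub.
# Column names, example rows, and the repo id below are assumptions; adjust to the expected format.
from datasets import Dataset, DatasetDict

train = Dataset.from_dict({
    "text": ["exemplo de tweet positivo", "exemplo de tweet negativo"],
    "label": ["Positive", "Negative"],
})
test = Dataset.from_dict({
    "text": ["outro exemplo de tweet"],
    "label": ["Neutral"],
})

DatasetDict({"train": train, "test": test}).push_to_hub(
    "your-username/tweetsentbr",  # hypothetical repo id
    private=True,                 # keep the tweet texts private, per the original authors' terms
)

After pushing, point the dataset_path field in the tweetsentbr task YAML to the new repository id, keeping the split names consistent with what the task expects.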

I'm sorry about the hassle with the private dataset. We included tweetsentbr because we wanted to compare our work with others, such as the Poeta benchmark from the Sabiá model, but I understand it can be annoying. Just so you know, the benchmark rankings remain largely consistent when averaging the scores of the other 8 tasks, so you can evaluate your models without it and still get useful feedback.

Right now, our submission queue is jammed because we are re-evaluating all models on a new version of the benchmark; however, we are prioritizing new submissions. While it may take some time, feel free to submit to the leaderboard.

Thank you for the fast reply and clarifications! I understand how these things are, so no problem.

I already forked the harness and ran the evaluations on my own (but I'm looking forward to having the results of my models on the leaderboard!). For now, I simply did not run tweetsentbr, but everything else worked perfectly.

Thanks again for this. It is great that we have a standard benchmark for Portuguese LLMs now!

Thank you! Based on the current queue rate, your models should be evaluated sometime tomorrow.

I've added a new feature to the leaderboard: now, if you hide a column corresponding to a dataset, the averages and rankings will update accordingly. This should help with offline comparisons.

This is actually quite useful; thank you for the work!

eduagarcia changed discussion status to closed
