Pietro Lesci

pietrolesci

https://pietrolesci.github.io/

AI & ML interests

I like developing and applying causal methods to study the effect of training choices on models’ behaviour, including memorisation, shortcut learning, and tokenisation.

Recent Activity

updated a dataset 6 days ago

InfoTokenizers/finewebedu-20B

published a dataset 6 days ago

InfoTokenizers/finewebedu-20B

updated a model 7 days ago

InfoTokenizers/tokenizers

pietrolesci's activity

New activity in EleutherAI/pythia_deduped_pile_idxmaps 26 days ago

Availability of `*_shuffle_idx.npy`, `*_doc_idx.npy`, `*_sample_idx.npy` to reconstruct `EleutherAI/pile-deduped-pythia-preshuffled`

#1 opened 26 days ago by

pietrolesci

New activity in LLM360/AmberDatasets 30 days ago

🌟 Appreciation for providing seamless access to pre-processed pre-shuffled data

#2 opened 30 days ago by

pietrolesci

New activity in bigscience-data/README 30 days ago

Reconstructing pre-training data

#1 opened 30 days ago by

pietrolesci

New activity in JeanKaddour/minipile 6 months ago

Domain and provenance annotation

#1 opened over 1 year ago by

haukur

New activity in HuggingFaceTB/SmolLM-135M 7 months ago

Trapezoidal scheduler with cooldown phase

#4 opened 9 months ago by

maveriq

New activity in bias-amplified-splits/mnli 11 months ago

Bias annotation

#2 opened 11 months ago by

pietrolesci

New activity in EleutherAI/pythia-160m 11 months ago

Tokenizer `merges.txt` files

#5 opened 11 months ago by

pietrolesci

New activity in EleutherAI/pile-deduped-pythia-preshuffled about 1 year ago

Sequence "packing" logic

#2 opened over 1 year ago by

pietrolesci

New activity in EleutherAI/pile-deduped-pythia-preshuffled over 1 year ago

Pad-only sequences from mmap'ed dataset after a certain index

#1 opened over 1 year ago by

pietrolesci

New activity in EleutherAI/pile-duped-pythia-random-sampled over 1 year ago

Add full sequences (beyond the first 64 tokens)

#1 opened over 1 year ago by

pietrolesci

New activity in pfb30/multi_woz_v22 over 2 years ago

Fix swapped start and exclusive_end fields

#3 opened over 2 years ago by

pietrolesci

New activity in mrm8488/PromptSource over 2 years ago

App down

#1 opened over 2 years ago by

pietrolesci

New activity in pfb30/multi_woz_v22 over 2 years ago

`start` and `exclusive_end` seems swapped

#1 opened over 2 years ago by

pietrolesci
