8 contributors

History: 137 commits

Latest commit: 72f4884 by Pablogps, "Update README.md", over 3 years ago

| Name | Size | Last commit | Updated |
| --- | --- | --- | --- |
| configs | | Changed and added vocab and tokenizer | over 3 years ago |
| evaluation | | Adding Colab for token classification fine-tuning | over 3 years ago |
| images | | Merge branch 'main' of https://huggingface.co/bertin-project/bertin-roberta-base-spanish into main | over 3 years ago |
| mc4 | | Add eval scripts | over 3 years ago |
| utils | | Explanations | over 3 years ago |
| .gitattributes | 737 Bytes | initial commit | over 3 years ago |
| .gitignore | 1.84 kB | Initial test with BETO's corpus | over 3 years ago |
| README.md | 32.7 kB | Update README.md | over 3 years ago |
| config.json | 618 Bytes | Fix config for checkpoint | over 3 years ago |
| config.py | 256 Bytes | Preparing code for final runs | over 3 years ago |
| convert.py | 876 Bytes | Improved version of conversion script Flax → PyTorch | over 3 years ago |
| events.out.tfevents.1625704081.t1v-n-a4d97d44-w-0.212075.3.v2 | 40 Bytes (LFS) | Adding tfevents | over 3 years ago |
| events.out.tfevents.1625704245.t1v-n-a4d97d44-w-0.216676.3.v2 | 40 Bytes (LFS) | Adding tfevents | over 3 years ago |
| events.out.tfevents.1625705283.t1v-n-a4d97d44-w-0.234462.3.v2 | 40 Bytes (LFS) | Adding tfevents | over 3 years ago |
| flax_model.msgpack | 250 MB (LFS) | Model at 211k steps, mlm ac 0.6547 | over 3 years ago |
| get_embeddings_and_perplexity.py | 1.53 kB | Add script to generate dataset of embeddings and perplexities. Add script to generate t-SNE plot for embedding and perplexity visualization. | over 3 years ago |
| merges.txt | 505 kB | Changed and added vocab and tokenizer | over 3 years ago |
| perplexity.py | 751 Bytes | Adding checkpointing, wandb, and new mlm script | over 3 years ago |
| pytorch_model.bin | 499 MB (LFS) | Model at 211k steps, mlm ac 0.6547 | over 3 years ago |
| run.sh | 883 Bytes | Adding base config and organizing configs | over 3 years ago |
| run_mlm_flax.py | 30 kB | Adding sampling to mc4 | over 3 years ago |
| run_mlm_flax_stream.py | 35.1 kB | New logo | over 3 years ago |
| run_stream.sh | 932 Bytes | Preparing code for final runs | over 3 years ago |
| special_tokens_map.json | 239 Bytes | Changed and added vocab and tokenizer | over 3 years ago |
| tokenizer.json | 1.45 MB | Changed and added vocab and tokenizer | over 3 years ago |
| tokenizer_config.json | 292 Bytes | Changed and added vocab and tokenizer | over 3 years ago |
| tokens.py | 649 Bytes | Scripts for perplexity sampling and fixes | over 3 years ago |
| tokens.py.orig | 899 Bytes | Adjust batch size for extracting tokens | over 3 years ago |
| tsne_plot.py | 3.02 kB | Remove unused imports | over 3 years ago |
| vocab.json | 846 kB | Changed and added vocab and tokenizer | over 3 years ago |