BPE Gets Picky: Efficient Vocabulary Refinement During Tokenizer Training Paper • 2409.04599 • Published Sep 6, 2024
Structural Priming Demonstrates Abstract Grammatical Representations in Multilingual Language Models Paper • 2311.09194 • Published Nov 15, 2023
Toxicity of the Commons: Curating Open-Source Pre-Training Data Paper • 2410.22587 • Published Oct 29, 2024
Different Tokenization Schemes Lead to Comparable Performance in Spanish Number Agreement Paper • 2403.13754 • Published Mar 20, 2024
A Bit of a Problem: Measurement Disparities in Dataset Sizes Across Languages Paper • 2403.00686 • Published Mar 1, 2024
When Is Multilinguality a Curse? Language Modeling for 250 High- and Low-Resource Languages Paper • 2311.09205 • Published Nov 15, 2023