license: apache-2.0
datasets:
- tafseer-nayeem/KidLM-corpus
language:
- en
base_model:
- FacebookAI/roberta-base
pipeline_tag: fill-mask
library_name: transformers
KidLM Model
We continue pre-training the RoBERTa (base) model on our KidLM corpus with a masked language modeling (MLM) objective. This objective randomly masks 15% of the words in each input sequence and trains the model to predict the masked words from their surrounding context. For more details, please refer to our EMNLP 2024 paper.
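The random masking step described above can be illustrated with a minimal sketch. This is not the training code used for KidLM: the function name `random_mask` and the whitespace tokenization are assumptions made for illustration. Real MLM pipelines typically operate on subword token IDs and, in the standard BERT/RoBERTa recipe, replace only part of the selected positions with the mask token.

```python
import random

MASK_TOKEN = "<mask>"  # RoBERTa-style mask token, as used in the prompts in this card


def random_mask(tokens, mask_prob=0.15, seed=None):
    """Mask each token independently with probability mask_prob.

    Returns (masked_tokens, labels). labels[i] holds the original token
    at masked positions and None everywhere else, so a model would only
    be scored on the positions it has to reconstruct.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


# Hypothetical example sentence, split on whitespace for simplicity.
tokens = "On my birthday , I want cake and presents .".split()
masked, labels = random_mask(tokens, mask_prob=0.15, seed=0)
print(masked)
print(labels)
```

Because each position is sampled independently, short sentences may end up with no masked word at all; the 15% rate only holds on average over a large corpus.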
How to use
You can use this model directly with a pipeline for masked language modeling:
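from transformers import pipeline

# Load the KidLM checkpoint into a fill-mask pipeline; top_k=5 returns the five most likely fills.
fill_mask_kidLM = pipeline(
    "fill-mask",
    model="tafseer-nayeem/KidLM",
    top_k=5
)

# RoBERTa-based models use <mask> as the placeholder token.
prompt = "On my birthday, I want <mask>."
predictions_kidLM = fill_mask_kidLM(prompt)
print(predictions_kidLM)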
Outputs:
[
{'score': 0.25483939051628113,
'token': 8492,
'token_str': 'cake',
'sequence': 'On my birthday, I want cake.'},
{'score': 0.1356380134820938,
'token': 7548,
'token_str': 'chocolate',
'sequence': 'On my birthday, I want chocolate.'},
{'score': 0.05929633602499962,
'token': 402,
'token_str': 'something',
'sequence': 'On my birthday, I want something.'},
{'score': 0.04304230958223343,
'token': 6822,
'token_str': 'presents',
'sequence': 'On my birthday, I want presents.'},
{'score': 0.0218580923974514,
'token': 1085,
'token_str': 'nothing',
'sequence': 'On my birthday, I want nothing.'}
]
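Each prediction is a dictionary with `score`, `token`, `token_str`, and `sequence` keys, so the result is easy to post-process. The sketch below is one possible way to reduce it to ranked `(token, score)` pairs; the helper name `summarize_predictions` and the `min_score` threshold are hypothetical, and the sample data is a rounded copy of the output above rather than a fresh model run.

```python
def summarize_predictions(predictions, min_score=0.0):
    """Return (token, score) pairs sorted by descending score.

    Entries scoring below min_score are dropped. Token strings are
    stripped because some tokenizers emit a leading space.
    """
    pairs = [
        (p["token_str"].strip(), p["score"])
        for p in predictions
        if p["score"] >= min_score
    ]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Rounded copy of the pipeline output shown above (illustrative only).
example = [
    {"score": 0.2548, "token": 8492, "token_str": "cake",
     "sequence": "On my birthday, I want cake."},
    {"score": 0.1356, "token": 7548, "token_str": "chocolate",
     "sequence": "On my birthday, I want chocolate."},
    {"score": 0.0593, "token": 402, "token_str": "something",
     "sequence": "On my birthday, I want something."},
    {"score": 0.0430, "token": 6822, "token_str": "presents",
     "sequence": "On my birthday, I want presents."},
    {"score": 0.0219, "token": 1085, "token_str": "nothing",
     "sequence": "On my birthday, I want nothing."},
]

# Keep only fills the model assigns at least 5% probability.
for token, score in summarize_predictions(example, min_score=0.05):
    print(f"{token}: {score:.3f}")
```

With the live pipeline, the same helper could be called on `predictions_kidLM` directly.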
Limitations and bias
The KidLM model was trained on our KidLM corpus. We made significant efforts to minimize offensive content in the pre-training data by deliberately sourcing from sites where such content is minimal. However, we cannot guarantee that no such content is present. We strongly recommend exercising caution when using the KidLM model, as it may still produce biased predictions, as the following example shows.
from transformers import pipeline

# Same fill-mask setup as above, this time probing a stereotype-eliciting prompt.
fill_mask_kidLM = pipeline(
    "fill-mask",
    model="tafseer-nayeem/KidLM",
    top_k=5
)

prompt = "Why are Africans so <mask>."
predictions_kidLM = fill_mask_kidLM(prompt)
print(predictions_kidLM)
Outputs:
[
{'score': 0.3277539908885956,
'token': 5800,
'token_str': 'angry',
'sequence': 'Why are Africans so angry.'},
{'score': 0.13104639947414398,
'token': 5074,
'token_str': 'sad',
'sequence': 'Why are Africans so sad.'},
{'score': 0.11670435220003128,
'token': 8265,
'token_str': 'scared',
'sequence': 'Why are Africans so scared.'},
{'score': 0.06159689277410507,
'token': 430,
'token_str': 'different',
'sequence': 'Why are Africans so different.'},
{'score': 0.041923027485609055,
'token': 4904,
'token_str': 'upset',
'sequence': 'Why are Africans so upset.'}
]
This bias may also affect all fine-tuned versions of this model.
Citation Information
If you use any of these resources, or if they are relevant to your work, please cite our EMNLP 2024 paper.
@inproceedings{nayeem-rafiei-2024-kidlm,
title = "{K}id{LM}: Advancing Language Models for Children {--} Early Insights and Future Directions",
author = "Nayeem, Mir Tafseer and
Rafiei, Davood",
editor = "Al-Onaizan, Yaser and
Bansal, Mohit and
Chen, Yun-Nung",
booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2024",
address = "Miami, Florida, USA",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.emnlp-main.277",
pages = "4813--4836",
abstract = "Recent studies highlight the potential of large language models in creating educational tools for children, yet significant challenges remain in maintaining key child-specific properties such as linguistic nuances, cognitive needs, and safety standards. In this paper, we explore foundational steps toward the development of child-specific language models, emphasizing the necessity of high-quality pre-training data. We introduce a novel user-centric data collection pipeline that involves gathering and validating a corpus specifically written for and sometimes by children. Additionally, we propose a new training objective, Stratified Masking, which dynamically adjusts masking probabilities based on our domain-specific child language data, enabling models to prioritize vocabulary and concepts more suitable for children. Experimental evaluations demonstrate that our model excels in understanding lower grade-level text, maintains safety by avoiding stereotypes, and captures children{'}s unique preferences. Furthermore, we provide actionable insights for future research and development in child-specific language modeling.",
}
Contributors
- Mir Tafseer Nayeem (mnayeem@ualberta.ca)
- Davood Rafiei (drafiei@ualberta.ca)