|
--- |
|
license: apache-2.0 |
|
datasets: |
|
- tafseer-nayeem/KidLM-corpus |
|
language: |
|
- en |
|
base_model: |
|
- FacebookAI/roberta-base |
|
pipeline_tag: fill-mask |
|
library_name: transformers |
|
--- |
|
|
|
## KidLM Model |
|
|
|
We continued pre-training the [RoBERTa (base)](https://huggingface.co/FacebookAI/roberta-base) model on our [KidLM corpus](https://huggingface.co/datasets/tafseer-nayeem/KidLM-corpus) with a masked language modeling (MLM) objective. This objective randomly masks 15% of the tokens in each input sequence, and the model learns to predict the masked tokens from their surrounding context. For more details, please refer to our [EMNLP 2024 paper](https://aclanthology.org/2024.emnlp-main.277/).
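As a simplified sketch of the masking step in plain Python (this is an illustration, not the actual training code; RoBERTa's objective also sometimes keeps or replaces a selected token instead of always masking it):

```python
import random

def mask_tokens(tokens, mask_token="<mask>", mlm_probability=0.15, seed=0):
    """Randomly hide ~15% of tokens; the model must predict the hidden ones."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mlm_probability:
            masked.append(mask_token)   # hidden from the model
            labels.append(tok)          # target the model must recover
        else:
            masked.append(tok)
            labels.append(None)         # position ignored by the MLM loss
    return masked, labels

tokens = "on my birthday i want cake and presents".split()
masked, labels = mask_tokens(tokens)
print(masked)
```

During training, the loss is computed only at the masked positions, which is what allows the model to learn token predictions from bidirectional context.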
|
|
|
## How to use |
|
|
|
You can use this model directly with a pipeline for masked language modeling: |
|
|
|
```python
from transformers import pipeline

fill_mask_kidLM = pipeline(
    "fill-mask",
    model="tafseer-nayeem/KidLM",
    top_k=5,
)

prompt = "On my birthday, I want <mask>."

predictions_kidLM = fill_mask_kidLM(prompt)
print(predictions_kidLM)
```
|
|
|
**Outputs:** |
|
|
|
```python
|
[ |
|
{'score': 0.25483939051628113, |
|
'token': 8492, |
|
'token_str': 'cake', |
|
'sequence': 'On my birthday, I want cake.'}, |
|
{'score': 0.1356380134820938, |
|
'token': 7548, |
|
'token_str': 'chocolate', |
|
'sequence': 'On my birthday, I want chocolate.'}, |
|
{'score': 0.05929633602499962, |
|
'token': 402, |
|
'token_str': 'something', |
|
'sequence': 'On my birthday, I want something.'}, |
|
{'score': 0.04304230958223343, |
|
'token': 6822, |
|
'token_str': 'presents', |
|
'sequence': 'On my birthday, I want presents.'}, |
|
{'score': 0.0218580923974514, |
|
'token': 1085, |
|
'token_str': 'nothing', |
|
'sequence': 'On my birthday, I want nothing.'} |
|
] |
|
``` |
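The pipeline returns a list of dictionaries sorted by score. A small helper (hypothetical, shown here only for illustration) can keep just the candidates above a confidence threshold, using the example output above as data:

```python
def filter_predictions(predictions, min_score=0.05):
    """Keep only fill-mask candidates whose score meets the threshold."""
    return [p["token_str"] for p in predictions if p["score"] >= min_score]

# Scores abbreviated from the example output above.
predictions = [
    {"score": 0.2548, "token_str": "cake"},
    {"score": 0.1356, "token_str": "chocolate"},
    {"score": 0.0593, "token_str": "something"},
    {"score": 0.0430, "token_str": "presents"},
    {"score": 0.0219, "token_str": "nothing"},
]
print(filter_predictions(predictions))  # ['cake', 'chocolate', 'something']
```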
|
|
|
## Limitations and bias |
|
|
|
The training data used to build the KidLM model is our [KidLM corpus](https://huggingface.co/datasets/tafseer-nayeem/KidLM-corpus). We made significant efforts to minimize offensive content in the pre-training data by deliberately sourcing from sites where such content is minimal. However, we cannot provide an absolute guarantee that no such content is present. We strongly recommend exercising caution when using the KidLM model, as it may still produce biased predictions. |
|
|
|
```python
from transformers import pipeline

fill_mask_kidLM = pipeline(
    "fill-mask",
    model="tafseer-nayeem/KidLM",
    top_k=5,
)

prompt = "Why are Africans so <mask>."

predictions_kidLM = fill_mask_kidLM(prompt)
print(predictions_kidLM)
```

**Outputs:**

```python
[
{'score': 0.3277539908885956,
'token': 5800,
'token_str': 'angry',
'sequence': 'Why are Africans so angry.'},
{'score': 0.13104639947414398,
'token': 5074,
'token_str': 'sad',
'sequence': 'Why are Africans so sad.'},
{'score': 0.11670435220003128,
'token': 8265,
'token_str': 'scared',
'sequence': 'Why are Africans so scared.'},
{'score': 0.06159689277410507,
'token': 430,
'token_str': 'different',
'sequence': 'Why are Africans so different.'},
{'score': 0.041923027485609055,
'token': 4904,
'token_str': 'upset',
'sequence': 'Why are Africans so upset.'}
]
```
|
|
|
This bias may also affect all fine-tuned versions of this model. |
|
|
|
|
|
## Citation Information |
|
|
|
If you use any of these resources or find them relevant to your work, please cite our [EMNLP 2024 paper](https://aclanthology.org/2024.emnlp-main.277/).
|
|
|
```bibtex
|
@inproceedings{nayeem-rafiei-2024-kidlm, |
|
title = "{K}id{LM}: Advancing Language Models for Children {--} Early Insights and Future Directions", |
|
author = "Nayeem, Mir Tafseer and |
|
Rafiei, Davood", |
|
editor = "Al-Onaizan, Yaser and |
|
Bansal, Mohit and |
|
Chen, Yun-Nung", |
|
booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing", |
|
month = nov, |
|
year = "2024", |
|
address = "Miami, Florida, USA", |
|
publisher = "Association for Computational Linguistics", |
|
url = "https://aclanthology.org/2024.emnlp-main.277", |
|
pages = "4813--4836", |
|
abstract = "Recent studies highlight the potential of large language models in creating educational tools for children, yet significant challenges remain in maintaining key child-specific properties such as linguistic nuances, cognitive needs, and safety standards. In this paper, we explore foundational steps toward the development of child-specific language models, emphasizing the necessity of high-quality pre-training data. We introduce a novel user-centric data collection pipeline that involves gathering and validating a corpus specifically written for and sometimes by children. Additionally, we propose a new training objective, Stratified Masking, which dynamically adjusts masking probabilities based on our domain-specific child language data, enabling models to prioritize vocabulary and concepts more suitable for children. Experimental evaluations demonstrate that our model excels in understanding lower grade-level text, maintains safety by avoiding stereotypes, and captures children{'}s unique preferences. Furthermore, we provide actionable insights for future research and development in child-specific language modeling.", |
|
} |
|
``` |
|
|
|
## Contributors |
|
- Mir Tafseer Nayeem (mnayeem@ualberta.ca) |
|
- Davood Rafiei (drafiei@ualberta.ca) |