+---
+license: mit
+widget:
+language:
+- en
+datasets:
+- pytorrent
+---
+#  🔥 RoBERTa-MLM-based PyTorrent 1M 🔥
+Pretrained weights based on the [PyTorrent Dataset](https://github.com/fla-sil/PyTorrent), a curated dataset built from a large collection of official Python packages.
+We use the PyTorrent dataset to train a preliminary DistilBERT Masked Language Modeling (MLM) model from scratch. The trained model, together with the dataset, aims to help researchers work easily and efficiently with a large dataset of Python packages, using only 5 lines of code to load the transformer-based model. We train the model on 1M raw Python scripts from PyTorrent, comprising 12,350,000 lines of code (LOC). We also train a byte-level byte-pair encoding (BPE) tokenizer with a vocabulary of 56,000 tokens; each LOC is truncated to a length of 50 to save computational resources.
+## Training Objective
+This model is trained with a Masked Language Model (MLM) objective.
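To make the objective concrete, here is a minimal sketch of how MLM training examples are typically built: some tokens are hidden behind a mask symbol, and the model is asked to recover the originals. The mask symbol `<mask>`, the 15% default rate, the whitespace tokenization, and the helper name `mask_tokens` are all illustrative assumptions; the exact masking scheme used for this model is not described here.

```python
import random

MASK = "<mask>"  # assumed mask symbol, for illustration only


def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Return (masked_tokens, labels) for one MLM training example.

    labels[i] holds the original token where a mask was applied and
    None elsewhere, so a loss would only be computed on masked positions.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)   # hide the token from the model
            labels.append(tok)    # the model must predict this token
        else:
            masked.append(tok)    # token stays visible
            labels.append(None)   # position is ignored by the loss
    return masked, labels


# Whitespace-split tokens stand in for real BPE tokens here.
tokens = "def add ( a , b ) : return a + b".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3, seed=0)
print(masked)
print(labels)
```

Common MLM recipes also replace a fraction of the selected tokens with random tokens or leave them unchanged; that refinement is omitted to keep the sketch short.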
+## How to use the model?
+```python
+from transformers import AutoTokenizer, AutoModel
+# Download (or load from the local cache) the tokenizer and the pretrained weights
+tokenizer = AutoTokenizer.from_pretrained("Fujitsu/pytorrent")
+model = AutoModel.from_pretrained("Fujitsu/pytorrent")
+```
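Since the model is trained with an MLM objective, one natural way to try it is to predict a masked token in a line of Python. The sketch below uses the `fill-mask` pipeline from `transformers` and assumes the published checkpoint includes the MLM head. The `___` placeholder and the helper names `with_mask` and `predict_masked` are hypothetical conveniences, not part of the model or the library.

```python
PLACEHOLDER = "___"  # hypothetical marker for the position to predict


def with_mask(snippet, mask_token):
    """Replace the first placeholder with the tokenizer's mask token."""
    return snippet.replace(PLACEHOLDER, mask_token, 1)


def predict_masked(snippet, model_id="Fujitsu/pytorrent", top_k=5):
    """Return (token, score) candidates for the placeholder in `snippet`.

    Assumes the checkpoint ships an MLM head; downloads weights on first use.
    """
    # Imported lazily so the helper above works without transformers installed.
    from transformers import pipeline

    fill = pipeline("fill-mask", model=model_id)
    text = with_mask(snippet, fill.tokenizer.mask_token)
    return [(p["token_str"], p["score"]) for p in fill(text, top_k=top_k)]
```

Calling `predict_masked("import ___ as np")` would return up to five candidate tokens with their scores. The snippet must contain exactly one placeholder for the result to have this shape.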