---
license: mit
language:
- en
datasets:
- pytorrent
---

# 🔥 RoBERTa-MLM-based PyTorrent 1M 🔥
Pretrained weights based on the [PyTorrent Dataset](https://github.com/fla-sil/PyTorrent), a curated dataset drawn from a large set of official Python packages.
We use the PyTorrent dataset to train a preliminary DistilBERT Masked Language Modeling (MLM) model from scratch. The trained model, along with the dataset, aims to help researchers work easily and efficiently on a large dataset of Python packages, needing only 5 lines of code to load the transformer-based model. We use 1M raw Python scripts from PyTorrent, comprising 12,350,000 lines of code (LOC), to train the model. We also train a byte-level Byte-Pair Encoding (BPE) tokenizer with a vocabulary of 56,000 tokens; lines of code are truncated to a length of 50 to save computation resources.

### Training Objective
This model is trained with a Masked Language Modeling (MLM) objective.
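A minimal sketch of the masking step behind the MLM objective. The `<mask>` token and 15% masking rate follow common RoBERTa-style defaults (an assumption here, not stated on this card), and `mask_tokens` is a hypothetical helper, not the actual training code:

```python
import random

def mask_tokens(tokens, mask_token="<mask>", mask_prob=0.15, seed=0):
    """Randomly replace ~mask_prob of tokens; the model learns to predict them."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)  # hide the original token from the model
            labels.append(tok)         # loss is computed against the original
        else:
            masked.append(tok)
            labels.append(None)        # no loss at unmasked positions
    return masked, labels

masked, labels = mask_tokens("def add ( a , b ) : return a + b".split())
```

During pretraining, the model sees the masked sequence and is optimized to recover the hidden tokens, which is how it learns representations of Python code without labels.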

## How to use the model?
```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("Fujitsu/pytorrent")
model = AutoModel.from_pretrained("Fujitsu/pytorrent")
```