erythropygia committed
Commit 69bcbf3
1 Parent(s): 7440193

Update README.md

Files changed (1)
  1. README.md +60 -1
README.md CHANGED
@@ -5,4 +5,63 @@ tags:
 - '#Turkish '
 - '#turkish'
 - '#gpt2'
- ---
+
+
+ # Model Card for Model ID
+
+ gpt2 fine-tuned on Turkish corpus data.
+
+ ### Training Data
+
+ - Dataset size: ~2 million
+
+ ## Using the model
+
+ ```python
+ import torch
+ from transformers import GPT2TokenizerFast, GPT2LMHeadModel
+
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+ print(device)
+
+ # Load the fine-tuned model and its tokenizer
+ model_id = "<model repo id>"  # replace with this model's Hub ID or a local path
+ tokenizer = GPT2TokenizerFast.from_pretrained(model_id)
+ model = GPT2LMHeadModel.from_pretrained(model_id).to(device)
+
+ def generate_output(text):
+     # Tokenize the input text
+     input_ids = tokenizer.encode(text, return_tensors="pt").to(device)
+
+     # Generate a completion with the sampling parameters below
+     output_ids = model.generate(input_ids,
+                                 no_repeat_ngram_size=3,
+                                 max_length=50,
+                                 repetition_penalty=1.1,
+                                 top_k=100,
+                                 top_p=0.7,
+                                 temperature=0.8,
+                                 do_sample=True,
+                                 num_return_sequences=1)[0]
+
+     # Decode the generated token IDs back to text
+     return tokenizer.decode(output_ids, skip_special_tokens=False)
+
+ print(generate_output("Adım Mehmet."))
+ ```
+
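The `top_k`, `top_p`, and `temperature` values passed to `generate` above shape how the next token is sampled. As a rough illustration only (toy probabilities, not actual model outputs; `filter_top_p` is a hypothetical helper, not part of the model's API), temperature scaling followed by nucleus (top-p) filtering works like this:

```python
import math

def filter_top_p(probs, top_p=0.7, temperature=0.8):
    # Sharpen or flatten the distribution with temperature,
    # then keep the smallest set of tokens whose cumulative
    # probability reaches top_p (nucleus sampling).
    logits = [math.log(p) / temperature for p in probs]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    scaled = [e / total for e in exps]

    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += scaled[i]
        if cum >= top_p:
            break

    # Renormalize over the kept tokens; everything else gets probability 0
    mass = sum(scaled[i] for i in kept)
    return {i: scaled[i] / mass for i in kept}

# With these toy probabilities, only the two most likely tokens survive
print(filter_top_p([0.5, 0.3, 0.1, 0.1]))
```

`repetition_penalty` and `no_repeat_ngram_size` then further suppress tokens that would repeat earlier text.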
+ #### Training Hyperparameters
+
+ - **Epochs:** 5
+ - **Learning rate:** 4e-5
+
+ #### Training Results
+
+ **training_loss:** 4.06675440790132
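If the reported training loss is the usual mean token-level cross-entropy in nats (the card does not say, so this is an assumption), it corresponds to a perplexity of roughly 58:

```python
import math

# Perplexity of a causal LM is exp(cross-entropy loss), assuming the
# reported value is mean token-level cross-entropy in nats.
training_loss = 4.06675440790132
perplexity = math.exp(training_loss)
print(round(perplexity, 2))  # ≈ 58.37
```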