---
license: gpl-3.0
datasets:
- medalpaca/medical_meadow_medical_flashcards
pipeline_tag: question-answering
---
# Model Description
This is a fine-tuned version of the Minerva model, trained on the [Medical Meadow Flashcard Dataset](https://huggingface.co/datasets/medalpaca/medical_meadow_medical_flashcards) for question answering. The base model was developed by the Sapienza NLP team in collaboration with Future Artificial Intelligence Research (FAIR) and CINECA. I used the 350-million-parameter version due to computational limits, though 1-billion- and 3-billion-parameter versions also exist. For more details, please refer to their repositories: [Sapienza NLP on Hugging Face](https://huggingface.co/sapienzanlp) and [Minerva LLMs](https://nlp.uniroma1.it/minerva/).
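For reference, the flashcard dataset can be loaded directly with the `datasets` library:
```python
from datasets import load_dataset

# Question/answer flashcards from the Medical Meadow collection
dataset = load_dataset("medalpaca/medical_meadow_medical_flashcards")
```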
<br>
# Issues and Possible Solutions
- In the original fine-tuned version, the model tended to generate answers that continued unnecessarily, leading to repeated sentences and degrading quality over time. Parameters like '*max_length*' or '*max_new_tokens*' were ineffective on their own, as they merely truncate generation at a fixed point without properly concluding the sentence. To address this, I defined a custom stopping criterion that terminates generation at the first period ('.'), as demonstrated in the code below:
- ```python
  from transformers import StoppingCriteria, StoppingCriteriaList

  class NewStoppingCriteria(StoppingCriteria):

      def __init__(self, tokenizer, stop_word):
          self.tokenizer = tokenizer
          self.stop_word = stop_word

      def __call__(self, input_ids, scores, **kwargs):
          # Decode everything generated so far and stop once the stop word appears.
          # Note: input_ids also contains the prompt, so a '.' inside the question
          # itself would trigger an immediate stop.
          decoded_text = self.tokenizer.decode(input_ids[0], skip_special_tokens=True)
          return self.stop_word in decoded_text


  criteria = NewStoppingCriteria(tokenizer=tokenizer, stop_word=".")
  stoppingCriteriaList = StoppingCriteriaList([criteria])
  ```
<br>

- Since the preprocessed training text was formatted as "BoS token - Question - EoS token - BoS token - Answer - EoS token," the generated output included the question as well as the answer. To resolve this, I strip the question from the generated text, leaving only the answer (a sketch of the training format follows the snippet below):

- ```python
  # Decode both the full output and the original prompt, then slice the
  # prompt off the front so only the newly generated answer remains
  outputText = tokenizer.decode(output_ids[0], skip_special_tokens = True)
  inputText = tokenizer.decode(inputEncoding.input_ids[0], skip_special_tokens = True)
  answer = outputText[len(inputText):].strip()
  ```
<br>
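For clarity, this is roughly how each training example was assembled; the helper below is a minimal sketch of the "BoS - Question - EoS - BoS - Answer - EoS" layout described above (the function name is my own, not part of the training code):

```python
# Hypothetical helper illustrating the training-time prompt layout:
# BoS + question + EoS followed by BoS + answer + EoS
def formatExample(question, answer):
    return (
        f"{tokenizer.bos_token}{question}{tokenizer.eos_token}"
        f"{tokenizer.bos_token}{answer}{tokenizer.eos_token}"
    )
```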

# Use Example

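The snippet assumes `model` and `tokenizer` are already loaded, for example along these lines (the repository id below is a placeholder for this model's actual Hugging Face path):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

modelPath = "MedicalFlashcardsMinerva"  # placeholder repository id
tokenizer = AutoTokenizer.from_pretrained(modelPath)
model = AutoModelForCausalLM.from_pretrained(modelPath).to('cuda')
```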
```python
question = 'What causes Wernicke encephalopathy?'

inputEncoding = tokenizer(question, return_tensors = 'pt').to('cuda')
output_ids = model.generate(
    inputEncoding.input_ids,
    max_length = 128,
    do_sample = True,
    temperature = 0.7,
    top_p = 0.97,
    top_k = 2,
    pad_token_id = tokenizer.eos_token_id,
    repetition_penalty = 1.2,
    stopping_criteria = stoppingCriteriaList
)

outputText = tokenizer.decode(output_ids[0], skip_special_tokens = True)
inputText = tokenizer.decode(inputEncoding.input_ids[0], skip_special_tokens = True)
answer = outputText[len(inputText):].strip()

# Generated Answer: Wernicke encephalopathy is caused by a defect in the Wern-Herxheimer reaction, which leads to an accumulation of acid and alkaline phosphatase activity.
# Effective Answer: The underlying pathophysiologic cause of Wernicke encephalopathy is thiamine (B1) deficiency.
```
<br>

# Training Information
The model was fine-tuned for 3 epochs using the hyperparameters specified in the original Minerva repository:

```python
from transformers import TrainingArguments

trainingArgs = TrainingArguments(
    output_dir = "MedicalFlashcardsMinerva",
    evaluation_strategy = "steps",
    save_strategy = "steps",
    learning_rate = 2e-4,
    per_device_train_batch_size = 6,
    per_device_eval_batch_size = 6,
    gradient_accumulation_steps = 8,
    num_train_epochs = 3,
    lr_scheduler_type = "cosine",
    warmup_ratio = 0.1,
    adam_beta1 = 0.9,
    adam_beta2 = 0.95,
    adam_epsilon = 1e-8,
    weight_decay = 0.01,
    logging_steps = 100,
    report_to = "none",
)
```
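<br>

These arguments can then be handed to a standard `Trainer`; a minimal sketch, where `trainDataset` and `evalDataset` stand in for the tokenized splits of the flashcard dataset:

```python
from transformers import Trainer

trainer = Trainer(
    model = model,
    args = trainingArgs,
    train_dataset = trainDataset,  # hypothetical tokenized training split
    eval_dataset = evalDataset,    # hypothetical tokenized evaluation split
)
trainer.train()
```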