---
model-index:
- name: TaxoLLaMA
  results: []
license: cc-by-sa-4.0
datasets:
  - VityaVitalich/WordNet-TaxoLLaMA
language:
- en
- es
- it
base_model: meta-llama/Llama-2-7b-hf

---

<img src="https://huggingface.co/VityaVitalich/TaxoLLaMA/resolve/main/pipeline_final_final24.drawio-1.png?download=true" alt="TaxoLLaMA banner" width="800" style="margin-left:auto; margin-right:auto; display:block"/>


# Model Card for TaxoLLaMA

TaxoLLaMA is a lightweight fine-tune of the LLaMA2-7b model for lexical semantics tasks, with a focus on taxonomy-related tasks. It achieves SoTA results on multiple benchmarks.
It was instruction-tuned on a dataset collected from WordNet 3.0 to generate hypernyms for a given hyponym.
The model can also be used to identify hypernymy via perplexity, which is useful for Lexical Entailment and Taxonomy Construction.

For more details, see the paper: [TaxoLLaMA: WordNet-based Model for Solving Multiple Lexical Semantic Tasks](https://arxiv.org/abs/2403.09207)
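
The perplexity-based use mentioned above can be illustrated with a small, model-free sketch. It assumes you have already obtained per-token natural-log probabilities for each candidate hypernym (for example, by scoring the candidate's tokens after the prompt), and it assumes that a lower perplexity signals a more plausible hypernym. The function names and the numbers are hypothetical and only serve as illustration; see the paper for the authors' actual procedure.

```python
import math


def perplexity(token_logprobs):
    """Perplexity of a token sequence given its natural-log token probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def rank_candidates(candidate_logprobs):
    """Sort candidate hypernyms from lowest to highest perplexity."""
    scored = {cand: perplexity(lps) for cand, lps in candidate_logprobs.items()}
    return sorted(scored.items(), key=lambda item: item[1])


# Made-up log-probabilities for illustration only; real values would come
# from scoring each candidate's tokens with the model.
example = {
    "feline": [-0.2, -0.1],
    "vehicle": [-3.0, -2.5],
}
ranking = rank_candidates(example)
print(ranking[0][0])
```

Averaging over tokens keeps multi-token candidates comparable with single-token ones, which is one reason perplexity is often preferred over a raw sequence probability.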

## Model description

- **Finetuned from model:** [meta-llama/Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf)
- **Language(s) (NLP):** Primarily English, but easily extensible to other languages. Also achieves SoTA results for Spanish and Italian.

### Model Sources

- **Repository:** [https://github.com/VityaVitalich/TaxoLLaMA](https://github.com/VityaVitalich/TaxoLLaMA)
- **Instruction Set:** [WordNet-TaxoLLaMA](https://huggingface.co/datasets/VityaVitalich/WordNet-TaxoLLaMA)

## Performance

| Model | Hypernym Discovery (Eng., MRR) | Hypernym Discovery (Span., MRR) | Taxonomy Construction (Environment, F1) | Taxonomy Enrichment (WordNet Verb, MRR) |
|-------------|---------------|---------------|---------------|---------------|
| **TaxoLLaMA** | **54.39** | **58.61** | **45.13** | **52.4** |
| TaxoLLaMA-bench | 51.39 | 57.44 | 44.82 | 51.9 |
| Previous SoTA | 45.22 | 37.56 | 40.00 | 45.2 |
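
For readers unfamiliar with the MRR columns, the following is a minimal sketch of the standard mean reciprocal rank computation. The function name and the toy data are hypothetical, and the official benchmark scorers may differ in details such as cut-offs or tie handling. The table reports the value multiplied by 100.

```python
def mean_reciprocal_rank(predictions, gold):
    """Average of 1/rank of the first correct candidate per query (0 if none)."""
    total = 0.0
    for ranked, correct in zip(predictions, gold):
        for rank, candidate in enumerate(ranked, start=1):
            if candidate in correct:
                total += 1.0 / rank
                break
    return total / len(predictions)


# Toy ranked predictions and gold hypernym sets, for illustration only.
toy_predictions = [
    ["animal", "feline"],        # first hit at rank 2
    ["plant", "tool", "fruit"],  # first hit at rank 3
    ["city"],                    # no hit
]
toy_gold = [{"feline"}, {"fruit"}, {"country"}]

mrr = mean_reciprocal_rank(toy_predictions, toy_gold)
print(round(100 * mrr, 2))
```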

## Input Format

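The model was trained on prompts in the following format. Reproduce the prompt verbatim, including its spelling, since this is the string the model saw during training: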
```
<s>[INST] <<SYS>> You are a helpfull assistant. List all the possible words divided with a coma. Your answer should not include anything except the words divided by a coma<</SYS>>
hyponym: tiger (large feline of forests in most of Asia having a tawny coat with black stripes)| hypernyms: [/INST]
```
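
The format above can be assembled programmatically. The sketch below is one way to do it; `build_prompt` is a hypothetical helper, not part of the released code, and the exact whitespace (no space before `|`, one space before `[/INST]`) is an assumption copied from the example above.

```python
# System prompt copied verbatim from the format above, spelling included.
SYSTEM_PROMPT = (
    "<s>[INST] <<SYS>> You are a helpfull assistant. "
    "List all the possible words divided with a coma. "
    "Your answer should not include anything except the words divided by a coma<</SYS>>"
)


def build_prompt(term, definition):
    """Build a prompt for one hyponym and its definition (hypothetical helper)."""
    query = f"hyponym: {term} ({definition})| hypernyms: [/INST]"
    return SYSTEM_PROMPT + "\n" + query


prompt = build_prompt(
    "tiger",
    "large feline of forests in most of Asia having a tawny coat with black stripes",
)
print(prompt)
```
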

### Training hyperparameters

The following hyperparameters were used for instruction tuning:
- learning_rate: 3e-04
- total_train_batch_size: 32
- optimizer: Adam with betas=(0.9,0.98) and epsilon=1e-09
- lr_scheduler_type: CosineAnnealing
- num_epochs: 1.0
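
To give a feel for the cosine annealing schedule listed above, here is a minimal sketch of the usual formula, starting from the listed learning rate of 3e-04. The total number of steps, the minimum learning rate of 0, and the absence of warmup are assumptions made for illustration; the card does not specify them.

```python
import math

BASE_LR = 3e-4  # learning_rate from the list above


def cosine_annealing_lr(step, total_steps, base_lr=BASE_LR, min_lr=0.0):
    """Learning rate at `step` for a single cosine decay from base_lr to min_lr."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


TOTAL_STEPS = 1000  # hypothetical; depends on dataset size and batch size
for step in (0, 500, 1000):
    print(step, cosine_annealing_lr(step, TOTAL_STEPS))
```

The rate starts at the base value, falls to half of it at the midpoint, and reaches the minimum at the final step.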



## Usage Example

```python
import torch
from transformers import BitsAndBytesConfig, LlamaForCausalLM, LlamaTokenizer
from peft import PeftConfig, PeftModel

torch.set_default_device('cuda')
config = PeftConfig.from_pretrained('VityaVitalich/TaxoLLaMA')
# Do not forget your access token for the gated Llama 2 models
model = LlamaForCausalLM.from_pretrained(
    config.base_model_name_or_path,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    torch_dtype=torch.bfloat16,
)
tokenizer = LlamaTokenizer.from_pretrained(config.base_model_name_or_path)
# Attach the TaxoLLaMA PEFT adapter to the quantized base model
inference_model = PeftModel.from_pretrained(model, 'VityaVitalich/TaxoLLaMA')

processed_term = "hyponym: tiger | hypernyms:"

system_prompt = """<s>[INST] <<SYS>> You are a helpfull assistant. List all the possible words divided with a coma. Your answer should not include anything except the words divided by a coma<</SYS>>"""
prompt = system_prompt + '\n' + processed_term + '[/INST]'

inputs = tokenizer(prompt, return_tensors='pt')

# Example generation hyperparameters; adjust them to fit your task
gen_conf = {
    "no_repeat_ngram_size": 3,
    "do_sample": True,
    "num_beams": 8,
    "num_return_sequences": 2,
    "max_new_tokens": 32,
    "top_k": 20,
}

out = inference_model.generate(inputs=inputs['input_ids'].to('cuda'), **gen_conf)

# Decode only the newly generated tokens, dropping the prompt and special tokens
prompt_length = inputs['input_ids'].shape[1]
texts = tokenizer.batch_decode(out[:, prompt_length:], skip_special_tokens=True)
for text in texts:
    print(text)
```
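
Since the system prompt asks for comma-separated words, the generated text can be turned into a list with a small post-processing step. The sketch below is an assumption about how one might do this; `parse_hypernyms` is a hypothetical helper and the sample string is invented, not real model output.

```python
def parse_hypernyms(generated_text):
    """Split a comma-separated generation into a clean, de-duplicated list."""
    cleaned = generated_text.replace("</s>", "").replace("<s>", "")
    hypernyms = []
    for piece in cleaned.split(","):
        word = piece.strip()
        if word and word not in hypernyms:
            hypernyms.append(word)
    return hypernyms


# Invented sample string, for illustration only.
sample = " feline, big cat, feline, carnivore</s>"
print(parse_hypernyms(sample))
```

De-duplicating while preserving order keeps the model's ranking of candidates intact, which matters for rank-based metrics such as MRR.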

## Citation

If you find TaxoLLaMA useful in your work, please cite it with:

```
@misc{moskvoretskii2024taxollama,
      title={TaxoLLaMA: WordNet-based Model for Solving Multiple Lexical Semantic Tasks},
      author={Viktor Moskvoretskii and Ekaterina Neminova and Alina Lobanova and Alexander Panchenko and Irina Nikishina},
      year={2024},
      eprint={2403.09207},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```