Commit 19e2bc2 by crodri
1 Parent(s): ee34683

Update README.md

  1. README.md +99 -1
README.md CHANGED
@@ -8,4 +8,102 @@ tags:
8
  - RAG
9
  ---
10
 
11
- ## Model optimized for QA
+ # FLOR-6.3B Model optimized for QA
12
+
13
+
14
+ ## Table of Contents
15
+ <details>
16
+ <summary>Click to expand</summary>
17
+
18
+ - [Model description](#model-description)
19
+ - [Intended uses and limitations](#intended-uses-and-limitations)
20
+ - [How to use](#how-to-use)
21
+ - [Limitations and bias](#limitations-and-bias)
22
+ - [Training](#training)
23
+ - [Evaluation](#evaluation)
24
+ - [Additional information](#additional-information)
25
+
26
+ </details>
27
+
28
+ ## Model description
29
+
30
+ **FlorQARAG** is a 6.3B-parameter transformer-based causal language model for Catalan, Spanish, and English, trained on a custom QA dataset drawn from various sources, specifically for use in RAG (Retrieval-Augmented Generation) applications.
31
+ The dataset used to fine-tune the model is [PureInstructQA](https://huggingface.co/datasets/projecte-aina/PureInstructQA).
32
+ ## Intended uses and limitations
33
+
34
+ The **FlorQARAG** model is ready to use for RAG applications and is optimized for the Catalan language.
35
+ It can perform text-generation Question Answering in the context of RAG applications.
36
+
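In a RAG application, a retrieval step selects the passage that is then handed to the model as context. The following is only a minimal sketch of that step, assuming a toy word-overlap retriever; the `tokenize` and `retrieve` helpers and the sample passages are hypothetical and are not part of the model or its training pipeline. A real system would more likely use a dense or BM25 retriever.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def retrieve(question, passages):
    """Return the passage sharing the most word tokens with the question."""
    question_tokens = tokenize(question)
    return max(passages, key=lambda p: len(question_tokens & tokenize(p)))


# Hypothetical passage store, for illustration only.
passages = [
    "Girona és una ciutat del nord-est de Catalunya.",
    "Mataró és la capital del Maresme i compta amb prop de 130.000 habitants.",
]

# The selected passage would be passed to the model as the context.
context = retrieve("Quants habitants té Mataró?", passages)
print(context)
```

The retrieved `context` string plays the role of the `context` variable in the usage example of this README.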
37
+ ## How to use
38
+ ```python
39
+ import torch
40
+ from transformers import pipeline
41
+
42
+ pipe = pipeline("text-generation", model="projecte-aina/FlorQARAG")
43
+
44
+ instruction = "Quants habitants té Mataró?"
45
+
46
+ context = "Mataró és una ciutat de Catalunya, capital de la comarca del Maresme. Situada al litoral mediterrani, a uns 30 km al nord-est de Barcelona, ha estat tradicionalment un centre administratiu de rellevància territorial i un pol de dinamisme econòmic. Compta amb prop de 130.000 habitants, essent actualment la vuitena població del Principat i la tretzena dels Països Catalans. "
47
+
48
+ # We need to format the prompt and context using ### and \n
49
+
50
+ def givePrediction(instruction, context, max_new_tokens=50, repetition_penalty=1.2, top_k=50, top_p=0.95, do_sample=True, temperature=0.5):
51
+     text = "### Instruction\n{instruction}\n### Context\n{context}\n### Answer\n"  # prompt template, filled in below
52
+     response = pipe(text.format(instruction=instruction, context=context), temperature=temperature, repetition_penalty=repetition_penalty, max_new_tokens=max_new_tokens, top_k=top_k, top_p=top_p, do_sample=do_sample)[0]["generated_text"]
53
+     answer = response.split("### Answer\n")[-1].strip()  # keep only the text generated after the last answer header
54
+     return answer
55
+
56
+ answer = givePrediction(instruction, context)
57
+
58
+ print(answer)
59
+ # Example output (sampling is enabled, so results may vary): '130 000'
60
+
61
+ ```
62
+
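The prompt layout used above (`### Instruction`, `### Context`, `### Answer`, each followed by a newline) can be separated from the model call so that it can be checked without loading the model. This is a minimal sketch under that assumption; `build_prompt` and `extract_answer` are hypothetical helper names, not functions shipped with the model, and the "generated" text below is simulated rather than produced by FlorQARAG.

```python
PROMPT_TEMPLATE = "### Instruction\n{instruction}\n### Context\n{context}\n### Answer\n"
ANSWER_MARKER = "### Answer\n"


def build_prompt(instruction, context):
    """Fill the prompt template with a question and its supporting context."""
    return PROMPT_TEMPLATE.format(
        instruction=instruction.strip(), context=context.strip()
    )


def extract_answer(generated_text):
    """Return the text that follows the last answer header.

    If the model starts another '###' section after the answer, that part
    is cut off. If no answer header is present, the whole text is returned.
    """
    _, marker, tail = generated_text.rpartition(ANSWER_MARKER)
    if not marker:
        return generated_text.strip()
    return tail.split("###", 1)[0].strip()


prompt = build_prompt(
    "Quants habitants té Mataró?",
    "Mataró compta amb prop de 130.000 habitants.",
)

# Simulated pipeline output: the pipeline returns the prompt plus the
# generated continuation in "generated_text".
simulated_output = prompt + "130 000\n"
print(extract_answer(simulated_output))
```

Partitioning on the full answer header, rather than slicing by a fixed character count, keeps the extraction correct if the generated text does not end with a trailing newline.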
63
+ ## Limitations and bias
64
+ At the time of submission, no measures have been taken to estimate the bias and toxicity embedded in the model.
65
+ However, we are well aware that our models may be biased since the corpora have been collected using crawling techniques
66
+ on multiple web sources. We intend to conduct research in these areas in the future, and if completed, this model card will be updated.
67
+
68
+
69
+ ## Training
70
+
71
+
72
+ ### Instruction Data
73
+
74
+ The training corpus consists of 82,539 QA instruction-following examples. See the data card at [PureInstructQA](https://huggingface.co/datasets/projecte-aina/PureInstructQA).
75
+
76
+ ## Additional information
77
+
78
+ ### Author
79
+ The Language Technologies Unit from Barcelona Supercomputing Center.
80
+
81
+ ### Contact
82
+ For further information, please send an email to <langtech@bsc.es>.
83
+
84
+ ### Copyright
85
+ Copyright (c) 2023 by Language Technologies Unit, Barcelona Supercomputing Center.
86
+
87
+ ### License
88
+ [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)
89
+
90
+ ### Funding
91
+ This work was funded by the [Departament de la Vicepresidència i de Polítiques Digitals i Territori de la Generalitat de Catalunya](https://politiquesdigitals.gencat.cat/ca/inici/index.html#googtrans(ca|en)) within the framework of [Projecte AINA](https://politiquesdigitals.gencat.cat/ca/economia/catalonia-ai/aina).
92
+
93
+ ### Disclaimer
94
+
95
+ <details>
96
+ <summary>Click to expand</summary>
97
+
98
+ The model published in this repository is intended for a generalist purpose and is available to third parties under a permissive Apache License, Version 2.0.
99
+
100
+ Be aware that the model may have biases and/or any other undesirable distortions.
101
+
102
+ When third parties deploy or provide systems and/or services to other parties using this model (or any system based on it)
103
+ or become users of the model, they should note that it is their responsibility to mitigate the risks arising from its use and,
104
+ in any event, to comply with applicable regulations, including regulations regarding the use of Artificial Intelligence.
105
+
106
+ In no event shall the owner and creator of the model (Barcelona Supercomputing Center)
107
+ be liable for any results arising from the use made by third parties.
108
+
109
+ </details>