gonzalez-agirre committed
Commit cd6011a
Parent: ffb8213

Update README.md

Files changed (1)
  1. README.md +41 -6
README.md CHANGED
@@ -81,6 +81,47 @@ pipeline_tag: text-generation
 
  The **Cǒndor-7B** is a transformer-based causal language model for Catalan, Spanish, and English. It is based on the [Falcon-7B](https://huggingface.co/tiiuae/falcon-7b) model and has been trained on a 26B-token trilingual corpus collected from publicly available corpora and crawlers.
 
+ ## Intended uses & limitations
+
+ The **Cǒndor-7B** model is ready to use only for causal language modeling, i.e., text-generation tasks. However, it is intended to be fine-tuned on a generative downstream task.
+
+ ## How to use
+
+ Here is how to use this model:
+
+ ```python
+ import torch
+ import transformers
+ from transformers import AutoTokenizer
+
+ input_text = "Maria y Miguel no tienen ningún "
+ model = "BSC-LT/condor-7b"
+ tokenizer = AutoTokenizer.from_pretrained(model)
+
+ pipeline = transformers.pipeline(
+     "text-generation",
+     model=model,
+     tokenizer=tokenizer,
+     torch_dtype=torch.bfloat16,
+     trust_remote_code=True,
+     device_map="auto",
+ )
+ generation = pipeline(
+     input_text,
+     max_length=200,
+     do_sample=True,
+     top_k=10,
+     eos_token_id=tokenizer.eos_token_id,
+ )
+
+ print(f"Result: {generation[0]['generated_text']}")
+ ```
+
+ ## Limitations and biases
+
+ At the time of submission, no measures have been taken to estimate the bias and toxicity embedded in the model. However, we are well aware that our models may be biased, since the corpora have been collected using crawling techniques on multiple web sources. We intend to conduct research in these areas in the future, and if completed, this model card will be updated.
+
 ## Language adaptation
 
 We adapted the original Falcon-7B model to Spanish and Catalan by swapping the tokenizer and adjusting the embedding layer. The adaptation procedure is explained in this [blog post](https://medium.com/@mpamies247/ee1ebc70bc79).
@@ -133,13 +174,7 @@ The resulting dataset has the following language distribution:
 
 
 
- ## Intended uses & limitations
-
- The **Cǒndor-7B** model is ready-to-use only for causal language modeling to perform text-generation tasks. However, it is intended to be fine-tuned on a generative downstream task.
-
 
- ## Limitations and biases
- At the time of submission, no measures have been taken to estimate the bias and toxicity embedded in the model. However, we are well aware that our models may be biased since the corpora have been collected using crawling techniques on multiple web sources. We intend to conduct research in these areas in the future, and if completed, this model card will be updated.
 
 ## Training and evaluation data
 
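The "Language adaptation" section above says the adaptation was done by swapping the tokenizer and adjusting the embedding layer. Below is a minimal, non-authoritative sketch of what that step could look like with the `transformers` API; the retrained-tokenizer path and the copy-shared-rows heuristic are illustrative assumptions, not the exact procedure from the linked blog post.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model and both tokenizers. The retrained-tokenizer path is
# hypothetical; substitute a tokenizer trained on the trilingual corpus.
base = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-7b", torch_dtype=torch.bfloat16, trust_remote_code=True
)
old_tok = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")
new_tok = AutoTokenizer.from_pretrained("path/to/retrained-trilingual-tokenizer")

# Snapshot the original input embeddings before resizing.
old_emb = base.get_input_embeddings().weight.detach().clone()
old_vocab = old_tok.get_vocab()

# Resize the embedding matrix (and the tied output head) to the new vocabulary.
base.resize_token_embeddings(len(new_tok))
new_emb = base.get_input_embeddings().weight

# Copy pretrained rows for tokens shared by both vocabularies; genuinely new
# tokens keep the default initialization (an assumption, not the blog's method).
with torch.no_grad():
    for token, new_id in new_tok.get_vocab().items():
        old_id = old_vocab.get(token)
        if old_id is not None:
            new_emb[new_id] = old_emb[old_id]
```

After a swap like this, the adapted model would still need continued training on the target-language corpus so that the newly initialized embedding rows acquire useful representations; the blog post linked in the diff describes the full procedure.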