Jeronymous committed
Commit 3341f92
Parent: 0bd1d91

Add links to dataset and code

Files changed (1): README.md (+6, -4)
README.md CHANGED
@@ -40,6 +40,8 @@ inference:
 **Claire-7B-0.1 is a 7B parameter causal decoder-only model built by [LINAGORA](https://labs.linagora.com/) and [OpenLLM-France](https://github.com/OpenLLM-France)**
 **adapted from [Falcon-7b](https://huggingface.co/tiiuae/falcon-7b) on French conversational data.**
 
+Quantized versions in GGUF format can be found in [TheBloke/Claire-7B-0.1-GGUF](https://huggingface.co/TheBloke/Claire-7B-0.1-GGUF).
+
 Claire-7B-0.1 is a pretrained language model designed to be attuned to the dynamics of linguistic interactions in dialogue. Without further training, its expected use is to generate continuations of dialogues. Its main purpose is to serve as a base model for fine-tuning on dialogue generation (e.g., chat) and dialogue understanding (e.g., meeting summarization) tasks. Please note that due to its training, the model is prone to generate dialogues with disfluencies and other constructions common to spoken language.
 
 * [Typical usage](#typical-usage)
@@ -138,7 +140,7 @@ prompt = """\
 
 ### Training Data
 
-The training dataset will be made available soon.
+The training dataset is available at [OpenLLM-France/Claire-Dialogue-French-0.1](https://huggingface.co/datasets/OpenLLM-France/Claire-Dialogue-French-0.1).
 
 Claire-7B-0.1 was tuned from Falcon-7b on the following data distribution:
 
@@ -146,10 +148,10 @@ Claire-7B-0.1 was tuned from Falcon-7b on the following data distribution:
 |-------------------------------|------------|------------------------------|-----------------------------------------------------|
 | Parliamentary Proceedings | 135M | 35% | Assemblée Nationale |
 | Theatre | 16M | 18% | Théâtre Classique, Théâtre Gratuit |
-| Interviews | 6.4M | 29% | TCOF, CFPP, CFPB, ACSYNT, PFC, Valibel (ORFEO), ESLO|
+| Interviews | 6.4M | 29% | TCOF, CFPP, CFPB (ORFEO), ACSYNT, PFC, Valibel (ORFEO), ESLO|
 | Free Conversations | 2.2M | 10% | CRFP (ORFEO), OFROM (ORFEO), CID, Rhapsodie, ParisStories, PFC, CLAPI, C-ORAL-ROM (ORFEO), LinTO, ESLO |
 | Meetings | 1.2M | 5% | SUMM-RE, LinTO, Réunions de travail (ORFEO) |
-| Debates | 402k | <2% | FreDSum, ESLO |
+| Debates | 402k | <2% | FREDSum, ESLO |
 | Assistance | 159k | <1% | Fleuron (ORFEO), Accueil UBS, OTG, ESLO |
 | Presentation, Formal Address | 86k | <0.5% | Valibel (ORFEO), LinTO, ESLO |
 
@@ -165,7 +167,7 @@ While the model has been trained and evaluated only on French dialogues, it may
 
 ### Training Procedure
 
-The training code will be made available soon.
+The training code is available at [https://github.com/OpenLLM-France/Lit-Claire](https://github.com/OpenLLM-France/Lit-Claire).
 
 Claire-7B-0.1 is a causal decoder-only model trained on a causal language modeling task (i.e., predict the next token).
 See [Falcon-7b](https://huggingface.co/tiiuae/falcon-7b) for more details.
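For context on the GGUF link added above: those quantized files run locally through llama.cpp bindings rather than `transformers`. Below is a minimal sketch using llama-cpp-python; the `Q4_K_M` filename is an assumption to check against the GGUF repo's file list, and the prompt is a short dialogue opening in the style of the card's typical-usage example.

```python
# Minimal sketch: run a quantized Claire-7B-0.1 GGUF file locally.
# Requires `pip install llama-cpp-python huggingface_hub`.
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

model_path = hf_hub_download(
    repo_id="TheBloke/Claire-7B-0.1-GGUF",
    filename="claire-7b-0.1.Q4_K_M.gguf",  # assumed quant name; check the repo listing
)

llm = Llama(model_path=model_path, n_ctx=2048)

# The base model continues dialogue transcripts, so the prompt is a
# partial dialogue rather than an instruction.
prompt = "- Bonjour Dominique, qu'allez-vous nous cuisiner aujourd'hui ?\n- Bonjour Camille,"
output = llm(prompt, max_tokens=128, temperature=1.0)
print(output["choices"][0]["text"])
```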
 
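The newly linked dataset can be pulled directly with the Hugging Face `datasets` library. A minimal sketch that inspects it without assuming any particular split or column names:

```python
# Minimal sketch: inspect the Claire French dialogue dataset referenced
# in this commit. Requires `pip install datasets`.
from datasets import load_dataset

ds = load_dataset("OpenLLM-France/Claire-Dialogue-French-0.1")

print(ds)  # available splits and row counts
split = next(iter(ds.values()))
print(split.column_names)  # field names as defined by the dataset
print(split[0])            # first dialogue sample
```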
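A detail the table implies but does not state: the training sampling weights deliberately diverge from the raw word counts. Parliamentary Proceedings is roughly 84% of the words but only 35% of the training mix, while Interviews (about 4% of the words) is upsampled to 29%. The small sketch below makes the up/downsampling factors explicit; the `<2%`, `<1%` and `<0.5%` rows are taken at their upper bounds, which is an approximation.

```python
# Minimal sketch: compare each category's share of raw words with its
# training sampling weight from the table above.
words = {  # words per category, from the table
    "Parliamentary Proceedings": 135_000_000,
    "Theatre": 16_000_000,
    "Interviews": 6_400_000,
    "Free Conversations": 2_200_000,
    "Meetings": 1_200_000,
    "Debates": 402_000,
    "Assistance": 159_000,
    "Presentation, Formal Address": 86_000,
}
weights = {  # sampling weights; "<x%" rows taken at their upper bounds
    "Parliamentary Proceedings": 0.35,
    "Theatre": 0.18,
    "Interviews": 0.29,
    "Free Conversations": 0.10,
    "Meetings": 0.05,
    "Debates": 0.02,
    "Assistance": 0.01,
    "Presentation, Formal Address": 0.005,
}

total = sum(words.values())
for cat, n in words.items():
    raw_share = n / total
    factor = weights[cat] / raw_share  # >1 means upsampled during training
    print(f"{cat:30s} raw {raw_share:6.2%}  weight {weights[cat]:6.2%}  x{factor:5.1f}")
```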
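On the causal language-modeling objective mentioned in the Training Procedure section (predict the next token): in `transformers`, passing `labels=input_ids` makes the model shift the labels internally and return the next-token cross-entropy. A minimal sketch of that objective, for illustration only; the actual training setup lives in the Lit-Claire repository linked above.

```python
# Minimal sketch of the causal LM objective: given a dialogue transcript,
# the loss is the cross-entropy of predicting each next token.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "OpenLLM-France/Claire-7B-0.1"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(
    name,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # requires `pip install accelerate`
)

text = "- Bonjour Dominique, qu'allez-vous nous cuisiner aujourd'hui ?\n- Bonjour Camille,"
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# With labels == input_ids, transformers shifts the labels by one position
# internally and computes the next-token cross-entropy.
outputs = model(**inputs, labels=inputs["input_ids"])
print(float(outputs.loss))
```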