catherinearnett committed on
Commit f882683
1 Parent(s): d435f38

Update README.md

Files changed (1)
  1. README.md +12 -12
README.md CHANGED
@@ -18,36 +18,36 @@ Pleias-pico-350m-RAG 0.1** is a specialized language model designed by PleIAs fo
  Similarly to its base model, Pleias-pico-350m-Preview, Pleias-pico-350m-RAG 0.1 aims to be a fully open model (weights, code, data), only trained on content with a permissible license and fully compliant with the European AI Act.
 
  ## Description
- PleIAs-360m-RAG is continuous pretraining of Pleias-360m on a new dataset of 45,088,768,000 tokens modeling common retrieval tasks. All the content of the dataset is ultimately coming from Common Corpus.
+ Pleias-pico-350m-RAG is continuous pretraining of Pleias-pico-350m-Preview on a new dataset of 45,088,768,000 tokens modeling common retrieval tasks. All the content of the dataset is ultimately coming from [Common Corpus](https://huggingface.co/datasets/PleIAs/common_corpus).
 
- Pleias-360m-RAG includes the main features of the original base model:
+ Pleias-pico-350m-RAG includes the main features of the original base model:
  * Only trained on open data under a permissible license and in compliance with the European AI Act. By design, all Pleias models are unable to output copyrighted content.
  * Extensive multilingual support for main European languages: English, French, German, Spanish, Italian, Dutch, Latin, Portuguese and Polish.
  * Extremely low level of toxicity and problematic content.
 
- Pleias-360m-RAG supports retrieval-augmented generation with enhanced verifiability, source analysis and grounding on submitted sources. This includes:
+ Pleias-pico-350m-RAG supports retrieval-augmented generation with enhanced verifiability, source analysis and grounding on submitted sources. This includes:
  * Standardized structure and special tokens to include queries, sources, references.
  * Anticipation of various query forms in multiple languages, from actual drafted questions to unstructured lists of search keywords.
  * Source analysis/criticism which also acts as an integrated reranker step.
  * Generation of grounded answers with references and excerpts linked to the original sources.
 
- Given its small size, Pleias-360m-RAG 0.1 was originally conceived as an experimental model.
+ Given its small size, Pleias-pico-350m-RAG 0.1 was originally conceived as an experimental model.
 
  Initial tests have shown that the RAG design has significantly improved the factuality and verifiability of the model. Even when the grounding does not work perfectly, the information remains much closer to the original sources.
 
- As a result, Pleias-360m-RAG 0.1 has already been tested and integrated into multiple applied RAG projects, including Pleias flagship application Scholasticai.
+ As a result, Pleias-pico-350m-RAG 0.1 has already been tested and integrated into multiple applied RAG projects, including Pleias's flagship application Scholasticai.
 
  ## Training
- PleIAs-360m-RAG was trained at Jean-Zay with 16 H100s with Nanotron, the pretraining library from HuggingFace. We provide the complete settings as a yaml file as part of our release.
+ Pleias-pico-350m-RAG was trained at Jean-Zay with 16 H100s with Nanotron, the pretraining library from HuggingFace. We provide the complete settings as a yaml file as part of our release.
 
- PleIAs-360m-RAG derives from the last checkpoint of PleIAs-360m (518,000). The training schedule reused the last learning rate value (6e-5) without decay for 90,000 steps.
+ Pleias-pico-350m-RAG derives from the last checkpoint of Pleias-pico-350m-Preview (518,000). The training schedule reused the last learning rate value (6e-5) without decay for 90,000 steps.
 
  Training covers the entire RAG dataset we have been designing out of Common Corpus for 1 epoch.
 
  Further experiments were made with different learning rate values: none of these tests have provided a better convergence than the one obtained with the final learning rate from the base model.
 
  ## Inference
- PleIAs-360m-RAG relies on special tokens to encode the core RAG functionalities:
+ Pleias-pico-350m-RAG relies on special tokens to encode the core RAG functionalities:
 
  A typical example, with excerpts drawn from a Wikipedia article on Wikipedia
  ```bash
@@ -58,10 +58,10 @@ A typical example, with excerpts drawn from a Wikipedia article on Wikipedia
  <|source_analysis_start|>
  ```
 
- As a specialized language model, PleIAs-360m-RAG will be unable to work properly with prompts that detract from that design.
+ As a specialized language model, Pleias-pico-350m-RAG will be unable to work properly with prompts that detract from that design.
 
  ## Acceptable use
- Pleias-360m-RAG includes a much wider range of support for verifiability and grounding than most generalist models.
+ Pleias-pico-350m-RAG includes a much wider range of support for verifiability and grounding than most generalist models.
 
  The model is not a substitute for an integrated RAG application. Retrieval errors as well as challenging texts and questions can still create a range of issues. We especially encourage end users to take advantage of the citations and the references to provide better indicators of accuracy.
 
@@ -70,12 +70,12 @@ For best results we recommend the following setting:
  * Standardized hashes of 16 characters. While the model has been trained on many other patterns (including full bibliographic entries), this has proven the most convenient for systematic citation parsing.
 
  ## Future updates
- PleIAs-360m-RAG will be continuously improved through iterative retraining/adaptation.
+ Pleias-pico-350m-RAG will be continuously improved through iterative retraining/adaptation.
 
  The current roadmap includes the following features:
  * Longer training on the same dataset for more than one epoch.
  * Context length expansion.
- * Better handling of multilingual sources. In its current form, PleIAs-360m-RAG will generally switch language if a query is made to sources in a different language.
+ * Better handling of multilingual sources. In its current form, Pleias-pico-350m-RAG will generally switch language if a query is made to sources in a different language.
  * New sampling methods inspired by Entropix for a better combined support of text creativity and accuracy.
  * Interactive/conversational RAG.
 
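
The README recommends "standardized hashes of 16 characters" as source identifiers but does not say how they are computed. The following is a minimal sketch of one way to produce such identifiers, assuming a SHA-256 digest truncated to 16 hexadecimal characters. The helper name `source_id` and the sample source texts are hypothetical and are not part of the Pleias release.

```python
import hashlib


def source_id(text: str, length: int = 16) -> str:
    """Return a stable hexadecimal identifier of `length` characters for a source text."""
    # Assumption: SHA-256 truncated to 16 hex characters. The README only
    # asks for "hashes of 16 characters" and does not name a hash function.
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return digest[:length]


# Placeholder source excerpts, used only to illustrate the mapping.
sources = [
    "Wikipedia is a free online encyclopedia.",
    "It is written and maintained by volunteers.",
]

# Map each identifier back to its source so citations can be resolved later.
ids = {source_id(s): s for s in sources}
for sid, text in ids.items():
    print(sid, text)
```

Keeping the identifier a fixed-length hexadecimal string means citations in generated text can be located with a simple pattern match, which is in the spirit of the "systematic citation parsing" the README mentions.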