Pclanglais committed on
Commit
6429da8
1 Parent(s): 49379d6

Update README.md

  1. README.md +4 -1
README.md CHANGED
@@ -1,8 +1,11 @@
 
 
 
  **Segmentext** is a specialized language model for text segmentation. Segmentext has been trained to be resilient to broken and unstructured texts, including digitization artifacts and poorly recognized layout formats.
 
  In contrast with most text-segmentation approaches, Segmentext is based on token classification. Editorial structures are reconstructed from the raw text without any reference to the original layout.
 
- Segmentext was trained on 3,500 examples of manually annotated texts, mostly drawn from three large-scale datasets collected by PleIAs: Finance Commons (financial documents in open data), Common Corpus (cultural heritage texts) and the Science Pile (scientific publications under open licenses, to be released).
 
  Given the diversity of the training data, Segmentext should work correctly on diverse document formats in the main European languages.
 
+ ---
+ license: apache-2.0
+ ---
  **Segmentext** is a specialized language model for text segmentation. Segmentext has been trained to be resilient to broken and unstructured texts, including digitization artifacts and poorly recognized layout formats.
 
  In contrast with most text-segmentation approaches, Segmentext is based on token classification. Editorial structures are reconstructed from the raw text without any reference to the original layout.
 
+ Segmentext was trained using HPC resources from GENCI–IDRIS on Ad Astra, with 3,500 examples of manually annotated texts, mostly drawn from three large-scale datasets collected by PleIAs: Finance Commons (financial documents in open data), Common Corpus (cultural heritage texts) and the Science Pile (scientific publications under open licenses, to be released).
 
  Given the diversity of the training data, Segmentext should work correctly on diverse document formats in the main European languages.
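
The README describes Segmentext as a token classifier that rebuilds editorial structure from raw text alone. As a rough illustration of what that can mean downstream, here is a minimal sketch of how per-token labels might be merged into contiguous segments. The label names (`title`, `paragraph`, `page_number`), the whitespace tokenization, and the `group_segments` helper are all hypothetical and chosen for illustration; the README does not specify Segmentext's actual label set or output format.

```python
from itertools import groupby


def group_segments(tokens, labels):
    """Merge consecutive tokens sharing the same label into (label, text) pairs.

    Both the helper and the label names used below are illustrative
    assumptions, not Segmentext's documented output format.
    """
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    segments = []
    # groupby only merges *adjacent* items, so a label that reappears later
    # in the document starts a new segment instead of being merged back.
    for label, run in groupby(zip(labels, tokens), key=lambda pair: pair[0]):
        segments.append((label, " ".join(token for _, token in run)))
    return segments


# Hypothetical classifier output for a short, layout-free text.
tokens = ["ANNUAL", "REPORT", "The", "company", "grew", "in", "2020", ".", "12"]
labels = ["title", "title", "paragraph", "paragraph", "paragraph",
          "paragraph", "paragraph", "paragraph", "page_number"]

segments = group_segments(tokens, labels)
for label, text in segments:
    print(f"{label}: {text}")
```

Grouping on adjacency rather than on label identity keeps the document order intact, which matters when the same kind of segment (for example, several paragraphs separated by page numbers) occurs more than once.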