PleIAs/Segmentext

Pclanglais committed on Jul 18, 2024

Commit 6429da8

1 Parent(s): 49379d6

Update README.md

Files changed (1)

README.md

@@ -1,8 +1,11 @@
+---
+license: apache-2.0
+---
 **Segmentext** is a specialized language model for text segmentation. Segmentext has been trained to be resilient to broken and unstructured texts, including digitization artifacts and poorly recognized layout formats.
 In contrast with most text-segmentation approaches, Segmentext is based on token classification. Editorial structures are reconstructed from the raw text without any reference to the original layout.
-Segmentext was trained on 3,500 manually annotated examples, mostly drawn from three large-scale datasets collected by PleIAs: Finance Commons (financial documents in open data), Common Corpus (cultural heritage texts) and the Science Pile (scientific publications under open licenses, to be released).
+Segmentext was trained using HPC resources from GENCI–IDRIS on Ad Astra, on 3,500 manually annotated examples, mostly drawn from three large-scale datasets collected by PleIAs: Finance Commons (financial documents in open data), Common Corpus (cultural heritage texts) and the Science Pile (scientific publications under open licenses, to be released).
 Given the diversity of the training data, Segmentext should work correctly on diverse document formats in the main European languages.
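
The README describes segmentation as token classification: each token gets a label, and the editorial structure is rebuilt from the labeled sequence alone. As a minimal sketch of that last step, the snippet below merges consecutive tokens that share a label into segments. The function name `group_segments` and the labels `title` and `paragraph` are hypothetical, chosen for illustration; the README does not state Segmentext's actual label set or output format.

```python
from itertools import groupby


def group_segments(tokens, labels):
    """Merge consecutive tokens that share a label into (label, text) pairs."""
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    segments = []
    # groupby only merges *adjacent* items, so a label that reappears later
    # starts a new segment instead of being folded into the earlier one.
    for label, run in groupby(zip(tokens, labels), key=lambda pair: pair[1]):
        text = " ".join(token for token, _ in run)
        segments.append((label, text))
    return segments


# Hypothetical per-token labels, standing in for a classifier's output.
tokens = ["Annual", "Report", "Revenue", "grew", "in", "2023", "."]
labels = ["title", "title", "paragraph", "paragraph",
          "paragraph", "paragraph", "paragraph"]

segments = group_segments(tokens, labels)
for label, text in segments:
    print(f"{label}: {text}")
```

Joining tokens with a single space is a simplification; a real tokenizer would need its own detokenization step to restore the original spacing.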