Pclanglais committed on
Commit
eaaabbd
1 Parent(s): 3867942

Create README.md

  1. README.md +9 -0
README.md ADDED
@@ -0,0 +1,9 @@
+ **Estienne** is a text-segmentation model based on DeBERTa.
+
+ In contrast with most text-segmentation approaches, Estienne is based on token classification. Editorial structures are identified in the same way as entities in named-entity recognition.
+
+ Estienne was trained on 2,000 examples of manually annotated texts, excerpted at random from three very large datasets collected by Pleias: Common Corpus (cultural heritage texts in the public domain), Marianne-OpenData (French/English administrative documents) and OpenScientificPile (scientific publications under free licenses, indexed on OpenAlex). Given the diversity of the corpus, Estienne should work on diverse document formats in European languages.
+
+ Estienne supports the following segmentations:
+
+ The model is named after the humanist Henri Estienne, who introduced many text-segmentation practices still in use in scholarly editing today.
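
The README describes segmentation as token classification, handled like named-entity recognition. As a rough illustration of what that implies downstream, the sketch below merges per-token labels into typed segments. It assumes a BIO tagging scheme and uses hypothetical label names (`title`, `paragraph`); the README does not list Estienne's actual labels, and the function `group_segments` is not part of the model's code.

```python
from typing import List, Optional, Tuple


def group_segments(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Merge per-token BIO labels into (segment_type, text) pairs.

    Assumes labels of the form "B-<type>", "I-<type>" or "O".
    The label names themselves are hypothetical.
    """
    segments: List[Tuple[str, str]] = []
    current_type: Optional[str] = None
    current_tokens: List[str] = []

    for token, label in zip(tokens, labels):
        prefix, _, seg_type = label.partition("-")
        # A "B-" tag, or an "I-" tag whose type differs from the open
        # segment, starts a new segment.
        starts_new = prefix == "B" or (prefix == "I" and seg_type != current_type)
        if prefix == "O" or starts_new:
            # Close the segment that was open, if any.
            if current_tokens:
                segments.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
        if prefix in ("B", "I"):
            current_type = seg_type
            current_tokens.append(token)

    # Flush the last open segment.
    if current_tokens:
        segments.append((current_type, " ".join(current_tokens)))
    return segments


# Hand-written labels standing in for model predictions.
tokens = ["Chapter", "One", "It", "was", "a", "dark", "night", "."]
labels = ["B-title", "I-title", "B-paragraph", "I-paragraph",
          "I-paragraph", "I-paragraph", "I-paragraph", "I-paragraph"]
segments = group_segments(tokens, labels)
print(segments)
```

In practice the labels would come from the model's per-token predictions rather than a hand-written list, and subword tokens would need to be mapped back to character offsets before joining.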