AntoineBourgois
committed on
Upload 3 files
- JCLS_model_card.md +122 -0
- README.md +43 -42
JCLS_model_card.md
ADDED
@@ -0,0 +1,122 @@
1 |
+
|
2 |
+
---
|
3 |
+
language: fr
|
4 |
+
tags:
|
5 |
+
- NER
|
6 |
+
- camembert
|
7 |
+
- literary-texts
|
8 |
+
- nested-entities
|
9 |
+
- BookBLP-fr
|
10 |
+
license: apache-2.0
|
11 |
+
metrics:
|
12 |
+
- f1
|
13 |
+
- precision
|
14 |
+
- recall
|
15 |
+
base_model:
|
16 |
+
- almanach/camembertV2-base
|
17 |
+
pipeline_tag: token-classification
|
18 |
+
---
|
19 |
+
|
20 |
+
## INTRODUCTION:
|
21 |
+
This model, developed as part of the [BookNLP-fr project](https://github.com/lattice-8094/fr-litbank), is an NER model built on top of [camembertV2-base](https://huggingface.co/almanach/camembertV2-base) embeddings and trained to predict nested entities in French literary texts.
|
22 |
+
|
23 |
+
The predicted entities are:
|
24 |
+
- mentions of characters (PER): pronouns (je, tu, il, ...), possessive pronouns (mon, ton, son, ...), common nouns (le capitaine, la princesse, ...) and proper nouns (Indiana Delmare, Honoré de Pardaillan, ...)
|
25 |
+
- facilities (FAC): château, sentier, chambre, couloir, ...
|
26 |
+
- time (TIME): le règne de Louis XIV, ce matin, en juillet, ...
|
27 |
+
- geo-political entities (GPE): Montrouge, France, le petit hameau, ...
|
28 |
+
- locations (LOC): le sud, Mars, l'océan, le bois, ...
|
29 |
+
- vehicles (VEH): avion, voitures, calèche, vélos, ...
|
30 |
+
|
31 |
+
## MODEL PERFORMANCES (LOOCV):
|
32 |
+
| NER_tag | precision | recall | f1_score | support | support % |
|
33 |
+
|-----------|-------------|----------|------------|-----------|-------------|
|
34 |
+
| PER | 90.17% | 95.76% | 92.88% | 4,061 | 85.80% |
|
35 |
+
| FAC | 79.19% | 78.12% | 78.65% | 224 | 4.73% |
|
36 |
+
| TIME | 63.18% | 70.56% | 66.67% | 214 | 4.52% |
|
37 |
+
| LOC | 62.50% | 54.55% | 58.25% | 110 | 2.32% |
|
38 |
+
| GPE | 74.58% | 68.75% | 71.54% | 64 | 1.35% |
|
39 |
+
| VEH | 69.12% | 78.33% | 73.44% | 60 | 1.27% |
|
40 |
+
| micro_avg | 87.31% | 92.25% | 89.68% | 4,733 | 100.00% |
|
41 |
+
| macro_avg | 73.12% | 74.35% | 73.57% | 4,733 | 100.00% |
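The aggregate rows can be cross-checked against the per-class rows. The following is a minimal sketch, assuming that `macro_avg` is the unweighted mean of the six per-class scores and that `support %` is each class's share of the total support; the variable and function names are illustrative and do not come from the project's code.

```python
# Per-class scores copied from the table above: (precision, recall, f1, support).
scores = {
    "PER": (90.17, 95.76, 92.88, 4061),
    "FAC": (79.19, 78.12, 78.65, 224),
    "TIME": (63.18, 70.56, 66.67, 214),
    "LOC": (62.50, 54.55, 58.25, 110),
    "GPE": (74.58, 68.75, 71.54, 64),
    "VEH": (69.12, 78.33, 73.44, 60),
}


def macro_average(column):
    """Unweighted mean of one score column across all entity types."""
    values = [row[column] for row in scores.values()]
    return sum(values) / len(values)


total_support = sum(row[3] for row in scores.values())
macro_precision = macro_average(0)
macro_recall = macro_average(1)
macro_f1 = macro_average(2)

# Share of each entity type in the evaluated mentions, in percent.
support_share = {tag: 100 * row[3] / total_support for tag, row in scores.items()}

print(total_support, round(macro_precision, 2), round(macro_f1, 2))
```

Because PER accounts for most of the support, the micro average sits close to the PER scores, while the macro average is pulled down by the rarer classes.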
|
42 |
+
|
43 |
+
## TRAINING PARAMETERS:
|
44 |
+
- Entity types: ['PER', 'LOC', 'FAC', 'TIME', 'VEH', 'GPE']
|
45 |
+
- Tagging scheme: BIOES
|
46 |
+
- Nested entity levels: [0, 1]
|
47 |
+
- Split strategy: Leave-one-out cross-validation (28 files)
|
48 |
+
- Train/Validation split: 0.85 / 0.15
|
49 |
+
- Batch size: 16
|
50 |
+
- Initial learning rate: 0.00014
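The split strategy above can be illustrated with a short sketch. It assumes that each fold holds out one document for testing and that the 0.85 / 0.15 train/validation split is applied at document level to the remaining files; the card does not say at which granularity the split is made, so this is a hypothetical reading. The function name `leave_one_out_folds` and the placeholder document names are illustrative.

```python
import random


def leave_one_out_folds(documents, validation_ratio=0.15, seed=0):
    """Yield (held_out, train_docs, val_docs) for each document in turn.

    Assumption: the validation share is taken from the remaining documents,
    not from individual sentences or tokens.
    """
    rng = random.Random(seed)
    for held_out in documents:
        remaining = [doc for doc in documents if doc != held_out]
        rng.shuffle(remaining)
        n_val = round(len(remaining) * validation_ratio)
        yield held_out, remaining[n_val:], remaining[:n_val]


# Placeholder names standing in for the 28 corpus files.
documents = [f"doc_{i:02d}" for i in range(28)]
folds = list(leave_one_out_folds(documents))

held_out, train_docs, val_docs = folds[0]
print(held_out, len(train_docs), len(val_docs))
```

Under these assumptions, every document is used exactly once as the held-out test file, and the reported scores aggregate the predictions over all folds.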
|
51 |
+
|
52 |
+
## MODEL ARCHITECTURE:
|
53 |
+
Model Input: Maximum context camembertV2-base embeddings (768 dimensions)
|
54 |
+
|
55 |
+
- Locked Dropout: 0.5
|
56 |
+
|
57 |
+
- Projection layer:
|
58 |
+
- layer type: highway layer
|
59 |
+
- input: 768 dimensions
|
60 |
+
- output: 2048 dimensions
|
61 |
+
|
62 |
+
- BiLSTM layer:
|
63 |
+
- input: 2048 dimensions
|
64 |
+
- output: 256 dimensions (hidden state)
|
65 |
+
|
66 |
+
- Linear layer:
|
67 |
+
- input: 256 dimensions
|
68 |
+
- output: 25 dimensions (predicted labels with BIOES tagging scheme)
|
69 |
+
|
70 |
+
- CRF layer
|
71 |
+
|
72 |
+
Model Output: BIOES labels sequence
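The 25-dimensional output of the linear layer is consistent with a BIOES scheme over six entity types: four positional prefixes (B, I, E, S) per type, plus the single outside label O. The sketch below enumerates such a label set; the label spelling (`B-PER`, ...) and ordering are assumptions, since the card only gives the count.

```python
entity_types = ["PER", "LOC", "FAC", "TIME", "VEH", "GPE"]
prefixes = ["B", "I", "E", "S"]  # Begin, Inside, End, Single-token mention

# One label per (prefix, type) pair, plus "O" for tokens outside any mention.
labels = ["O"] + [f"{prefix}-{tag}" for tag in entity_types for prefix in prefixes]

print(len(labels))  # 4 prefixes * 6 types + 1 = 25
```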
|
73 |
+
|
74 |
+
## HOW TO USE:
|
75 |
+
*** IN CONSTRUCTION ***
|
76 |
+
|
77 |
+
## TRAINING CORPUS:
|
78 |
+
| | Document | Tokens Count | Is included in model eval |
|
79 |
+
|----|----------------------------------------------------------------|----------------|-----------------------------------|
|
80 |
+
| 0 | 1836_Gautier-Theophile_La-morte-amoureuse | 14,299 tokens | False |
|
81 |
+
| 1 | 1840_Sand-George_Pauline | 12,315 tokens | False |
|
82 |
+
| 2 | 1842_Balzac-Honore-de_La-Maison-du-chat-qui-pelote | 24,776 tokens | False |
|
83 |
+
| 3 | 1844_Balzac-Honore-de_La-Maison-Nucingen | 30,987 tokens | False |
|
84 |
+
| 4 | 1844_Balzac-Honore-de_Sarrasine | 15,408 tokens | False |
|
85 |
+
| 5 | 1856_Cousin-Victor_Madame-de-Hautefort | 11,768 tokens | False |
|
86 |
+
| 6 | 1863_Gautier-Theophile_Le-capitaine-Fracasse | 11,834 tokens | False |
|
87 |
+
| 7 | 1873_Zola-Emile_Le-ventre-de-Paris | 12,557 tokens | False |
|
88 |
+
| 8 | 1881_Flaubert-Gustave_Bouvard-et-Pecuchet | 12,281 tokens | False |
|
89 |
+
| 9 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_1-MADEMOISELLE-FIFI | 5,425 tokens | True |
|
90 |
+
| 10 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_2-MADAME-BAPTISTE | 2,554 tokens | True |
|
91 |
+
| 11 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_3-LA-ROUILLE | 2,929 tokens | True |
|
92 |
+
| 12 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_1-MARROCA | 4,067 tokens | False |
|
93 |
+
| 13 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_2-LA-BUCHE | 2,251 tokens | False |
|
94 |
+
| 14 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_3-LA-RELIQUE | 2,034 tokens | False |
|
95 |
+
| 15 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_1-FOU | 1,864 tokens | False |
|
96 |
+
| 16 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_2-REVEIL | 2,141 tokens | False |
|
97 |
+
| 17 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_3-UNE-RUSE | 2,441 tokens | False |
|
98 |
+
| 18 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_4-A-CHEVAL | 2,860 tokens | False |
|
99 |
+
| 19 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_5-UN-REVEILLON | 2,343 tokens | False |
|
100 |
+
| 20 | 1901_Lucie-Achard_Rosalie-de-Constant-sa-famille-et-ses-amis | 12,703 tokens | False |
|
101 |
+
| 21 | 1903_Conan-Laure_Elisabeth_Seton | 13,023 tokens | False |
|
102 |
+
| 22 | 1904_Rolland-Romain_Jean-Christophe_Tome-I-L-aube | 10,982 tokens | True |
|
103 |
+
| 23 | 1904_Rolland-Romain_Jean-Christophe_Tome-II-Le-matin | 10,305 tokens | False |
|
104 |
+
| 24 | 1917_Adèle-Bourgeois_Némoville | 12,389 tokens | False |
|
105 |
+
| 25 | 1923_Radiguet-Raymond_Le-diable-au-corps | 14,637 tokens | False |
|
106 |
+
| 26 | 1926_Audoux-Marguerite_De-la-ville-au-moulin | 11,902 tokens | True |
|
107 |
+
| 27 | 1937_Audoux-Marguerite_Douce-Lumiere | 12,285 tokens | False |
|
108 |
+
| 28 | TOTAL | 275,360 tokens | 5 files used for cross-validation |
|
109 |
+
|
110 |
+
## PREDICTIONS CONFUSION MATRIX:
|
111 |
+
| Gold Labels | PER | FAC | TIME | LOC | GPE | VEH | O | support |
|
112 |
+
|---------------|-------|-------|--------|-------|-------|-------|-----|-----------|
|
113 |
+
| PER | 3,889 | 3 | 2 | 2 | 1 | 1 | 163 | 4,061 |
|
114 |
+
| FAC | 6 | 175 | 0 | 2 | 0 | 1 | 40 | 224 |
|
115 |
+
| TIME | 0 | 0 | 151 | 0 | 0 | 0 | 63 | 214 |
|
116 |
+
| LOC | 1 | 0 | 0 | 60 | 9 | 0 | 40 | 110 |
|
117 |
+
| GPE | 2 | 0 | 0 | 8 | 44 | 0 | 10 | 64 |
|
118 |
+
| VEH | 1 | 0 | 0 | 0 | 0 | 47 | 12 | 60 |
|
119 |
+
| O | 411 | 43 | 85 | 24 | 5 | 19 | 0 | 587 |
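The recall column of the performance table can be recovered from this matrix by dividing each diagonal cell by the row support. The sketch below does that for the six entity types; the names are illustrative. It checks recall only, because precision also depends on how spurious predictions (the `O` row) were counted, which the card does not specify.

```python
tags = ["PER", "FAC", "TIME", "LOC", "GPE", "VEH"]

# Gold-label rows copied from the matrix above:
# counts predicted as PER, FAC, TIME, LOC, GPE, VEH, O.
confusion = {
    "PER": [3889, 3, 2, 2, 1, 1, 163],
    "FAC": [6, 175, 0, 2, 0, 1, 40],
    "TIME": [0, 0, 151, 0, 0, 0, 63],
    "LOC": [1, 0, 0, 60, 9, 0, 40],
    "GPE": [2, 0, 0, 8, 44, 0, 10],
    "VEH": [1, 0, 0, 0, 0, 47, 12],
}

recall = {}
for index, tag in enumerate(tags):
    row = confusion[tag]
    support = sum(row)  # every gold mention of this type
    recall[tag] = 100 * row[index] / support  # correctly recovered share

for tag in tags:
    print(f"{tag}: {recall[tag]:.2f}%")
```

The last column of each row shows that most errors are missed mentions (gold entities predicted as `O`) rather than confusions between entity types, with the exception of LOC and GPE, which are occasionally mistaken for each other.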
|
120 |
+
|
121 |
+
## CONTACT:
|
122 |
+
mail: antoine [dot] bourgois [at] protonmail [dot] com
|
README.md
CHANGED
@@ -16,6 +16,7 @@ base_model:
|
|
16 |
- almanach/camembertV2-base
|
17 |
pipeline_tag: token-classification
|
18 |
---
|
|
|
19 |
## INTRODUCTION:
|
20 |
This model, developed as part of the [BookNLP-fr project](https://github.com/lattice-8094/fr-litbank), is an NER model built on top of [camembertV2-base](https://huggingface.co/almanach/camembertV2-base) embeddings and trained to predict nested entities in French literary texts.
|
21 |
|
@@ -25,19 +26,19 @@ The predicted entities are:
|
|
25 |
- time (TIME): le règne de Louis XIV, ce matin, en juillet, ...
|
26 |
- geo-political entities (GPE): Montrouge, France, le petit hameau, ...
|
27 |
- locations (LOC): le sud, Mars, l'océan, le bois, ...
|
28 |
-
- vehicles (VEH):
|
29 |
|
30 |
## MODEL PERFORMANCES (LOOCV):
|
31 |
-
| NER_tag | precision | recall | f1_score | support |
|
32 |
-
|
33 |
-
| PER | 90.10% | 93.38% | 91.71% | 31,570 |
|
34 |
-
| FAC | 70.14% | 70.97% | 70.55% | 2,294 |
|
35 |
-
| TIME | 58.04% | 58.98% | 58.51% | 1,670 |
|
36 |
-
| GPE | 75.85% | 76.81% | 76.33% | 871 |
|
37 |
-
| LOC | 61.22% | 46.57% | 52.90% | 773 |
|
38 |
-
| VEH | 66.37% | 48.82% | 56.26% | 465 |
|
39 |
-
| micro_avg | 86.25% | 88.60% | 87.36% | 37,643 |
|
40 |
-
| macro_avg | 70.29% | 65.92% | 67.71% | 37,643 |
|
41 |
|
42 |
## TRAINING PARAMETERS:
|
43 |
- Entity types: ['PER', 'LOC', 'FAC', 'TIME', 'VEH', 'GPE']
|
@@ -74,37 +75,37 @@ Model Output: BIOES labels sequence
|
|
74 |
*** IN CONSTRUCTION ***
|
75 |
|
76 |
## TRAINING CORPUS:
|
77 |
-
| | Document | Tokens Count |
|
78 |
-
|
79 |
-
| 0 | 1836_Gautier-Theophile_La-morte-amoureuse | 14,299 tokens |
|
80 |
-
| 1 | 1840_Sand-George_Pauline | 12,315 tokens |
|
81 |
-
| 2 | 1842_Balzac-Honore-de_La-Maison-du-chat-qui-pelote | 24,776 tokens |
|
82 |
-
| 3 | 1844_Balzac-Honore-de_La-Maison-Nucingen | 30,987 tokens |
|
83 |
-
| 4 | 1844_Balzac-Honore-de_Sarrasine | 15,408 tokens |
|
84 |
-
| 5 | 1856_Cousin-Victor_Madame-de-Hautefort | 11,768 tokens |
|
85 |
-
| 6 | 1863_Gautier-Theophile_Le-capitaine-Fracasse | 11,834 tokens |
|
86 |
-
| 7 | 1873_Zola-Emile_Le-ventre-de-Paris | 12,557 tokens |
|
87 |
-
| 8 | 1881_Flaubert-Gustave_Bouvard-et-Pecuchet | 12,281 tokens |
|
88 |
-
| 9 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_1-MADEMOISELLE-FIFI | 5,425 tokens |
|
89 |
-
| 10 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_2-MADAME-BAPTISTE | 2,554 tokens |
|
90 |
-
| 11 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_3-LA-ROUILLE | 2,929 tokens |
|
91 |
-
| 12 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_1-MARROCA | 4,067 tokens |
|
92 |
-
| 13 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_2-LA-BUCHE | 2,251 tokens |
|
93 |
-
| 14 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_3-LA-RELIQUE | 2,034 tokens |
|
94 |
-
| 15 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_1-FOU | 1,864 tokens |
|
95 |
-
| 16 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_2-REVEIL | 2,141 tokens |
|
96 |
-
| 17 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_3-UNE-RUSE | 2,441 tokens |
|
97 |
-
| 18 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_4-A-CHEVAL | 2,860 tokens |
|
98 |
-
| 19 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_5-UN-REVEILLON | 2,343 tokens |
|
99 |
-
| 20 | 1901_Lucie-Achard_Rosalie-de-Constant-sa-famille-et-ses-amis | 12,703 tokens |
|
100 |
-
| 21 | 1903_Conan-Laure_Elisabeth_Seton | 13,023 tokens |
|
101 |
-
| 22 | 1904_Rolland-Romain_Jean-Christophe_Tome-I-L-aube | 10,982 tokens |
|
102 |
-
| 23 | 1904_Rolland-Romain_Jean-Christophe_Tome-II-Le-matin | 10,305 tokens |
|
103 |
-
| 24 | 1917_Adèle-Bourgeois_Némoville | 12,389 tokens |
|
104 |
-
| 25 | 1923_Radiguet-Raymond_Le-diable-au-corps | 14,637 tokens |
|
105 |
-
| 26 | 1926_Audoux-Marguerite_De-la-ville-au-moulin | 11,902 tokens |
|
106 |
-
| 27 | 1937_Audoux-Marguerite_Douce-Lumiere | 12,285 tokens |
|
107 |
-
| 28 | TOTAL | 275,360 tokens |
|
108 |
|
109 |
## PREDICTIONS CONFUSION MATRIX:
|
110 |
| Gold Labels | PER | FAC | TIME | GPE | LOC | VEH | O | support |
|
|
|
16 |
- almanach/camembertV2-base
|
17 |
pipeline_tag: token-classification
|
18 |
---
|
19 |
+
|
20 |
## INTRODUCTION:
|
21 |
This model, developed as part of the [BookNLP-fr project](https://github.com/lattice-8094/fr-litbank), is an NER model built on top of [camembertV2-base](https://huggingface.co/almanach/camembertV2-base) embeddings and trained to predict nested entities in French literary texts.
|
22 |
|
|
|
26 |
- time (TIME): le règne de Louis XIV, ce matin, en juillet, ...
|
27 |
- geo-political entities (GPE): Montrouge, France, le petit hameau, ...
|
28 |
- locations (LOC): le sud, Mars, l'océan, le bois, ...
|
29 |
+
- vehicles (VEH): avion, voitures, calèche, vélos, ...
|
30 |
|
31 |
## MODEL PERFORMANCES (LOOCV):
|
32 |
+
| NER_tag | precision | recall | f1_score | support | support % |
|
33 |
+
|-----------|-------------|----------|------------|-----------|-------------|
|
34 |
+
| PER | 90.10% | 93.38% | 91.71% | 31,570 | 83.87% |
|
35 |
+
| FAC | 70.14% | 70.97% | 70.55% | 2,294 | 6.09% |
|
36 |
+
| TIME | 58.04% | 58.98% | 58.51% | 1,670 | 4.44% |
|
37 |
+
| GPE | 75.85% | 76.81% | 76.33% | 871 | 2.31% |
|
38 |
+
| LOC | 61.22% | 46.57% | 52.90% | 773 | 2.05% |
|
39 |
+
| VEH | 66.37% | 48.82% | 56.26% | 465 | 1.24% |
|
40 |
+
| micro_avg | 86.25% | 88.60% | 87.36% | 37,643 | 100.00% |
|
41 |
+
| macro_avg | 70.29% | 65.92% | 67.71% | 37,643 | 100.00% |
|
42 |
|
43 |
## TRAINING PARAMETERS:
|
44 |
- Entity types: ['PER', 'LOC', 'FAC', 'TIME', 'VEH', 'GPE']
|
|
|
75 |
*** IN CONSTRUCTION ***
|
76 |
|
77 |
## TRAINING CORPUS:
|
78 |
+
| | Document | Tokens Count | Is included in model eval |
|
79 |
+
|----|----------------------------------------------------------------|----------------|------------------------------------|
|
80 |
+
| 0 | 1836_Gautier-Theophile_La-morte-amoureuse | 14,299 tokens | True |
|
81 |
+
| 1 | 1840_Sand-George_Pauline | 12,315 tokens | True |
|
82 |
+
| 2 | 1842_Balzac-Honore-de_La-Maison-du-chat-qui-pelote | 24,776 tokens | True |
|
83 |
+
| 3 | 1844_Balzac-Honore-de_La-Maison-Nucingen | 30,987 tokens | True |
|
84 |
+
| 4 | 1844_Balzac-Honore-de_Sarrasine | 15,408 tokens | True |
|
85 |
+
| 5 | 1856_Cousin-Victor_Madame-de-Hautefort | 11,768 tokens | True |
|
86 |
+
| 6 | 1863_Gautier-Theophile_Le-capitaine-Fracasse | 11,834 tokens | True |
|
87 |
+
| 7 | 1873_Zola-Emile_Le-ventre-de-Paris | 12,557 tokens | True |
|
88 |
+
| 8 | 1881_Flaubert-Gustave_Bouvard-et-Pecuchet | 12,281 tokens | True |
|
89 |
+
| 9 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_1-MADEMOISELLE-FIFI | 5,425 tokens | True |
|
90 |
+
| 10 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_2-MADAME-BAPTISTE | 2,554 tokens | True |
|
91 |
+
| 11 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_3-LA-ROUILLE | 2,929 tokens | True |
|
92 |
+
| 12 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_1-MARROCA | 4,067 tokens | True |
|
93 |
+
| 13 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_2-LA-BUCHE | 2,251 tokens | True |
|
94 |
+
| 14 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_3-LA-RELIQUE | 2,034 tokens | True |
|
95 |
+
| 15 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_1-FOU | 1,864 tokens | True |
|
96 |
+
| 16 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_2-REVEIL | 2,141 tokens | True |
|
97 |
+
| 17 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_3-UNE-RUSE | 2,441 tokens | True |
|
98 |
+
| 18 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_4-A-CHEVAL | 2,860 tokens | True |
|
99 |
+
| 19 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_5-UN-REVEILLON | 2,343 tokens | True |
|
100 |
+
| 20 | 1901_Lucie-Achard_Rosalie-de-Constant-sa-famille-et-ses-amis | 12,703 tokens | True |
|
101 |
+
| 21 | 1903_Conan-Laure_Elisabeth_Seton | 13,023 tokens | True |
|
102 |
+
| 22 | 1904_Rolland-Romain_Jean-Christophe_Tome-I-L-aube | 10,982 tokens | True |
|
103 |
+
| 23 | 1904_Rolland-Romain_Jean-Christophe_Tome-II-Le-matin | 10,305 tokens | True |
|
104 |
+
| 24 | 1917_Adèle-Bourgeois_Némoville | 12,389 tokens | True |
|
105 |
+
| 25 | 1923_Radiguet-Raymond_Le-diable-au-corps | 14,637 tokens | True |
|
106 |
+
| 26 | 1926_Audoux-Marguerite_De-la-ville-au-moulin | 11,902 tokens | True |
|
107 |
+
| 27 | 1937_Audoux-Marguerite_Douce-Lumiere | 12,285 tokens | True |
|
108 |
+
| 28 | TOTAL | 275,360 tokens | 28 files used for cross-validation |
|
109 |
|
110 |
## PREDICTIONS CONFUSION MATRIX:
|
111 |
| Gold Labels | PER | FAC | TIME | GPE | LOC | VEH | O | support |
|