AntoineBourgois committed on
Commit eab1e64 · verified · 1 Parent(s): e547c1e

Upload 3 files

Files changed (2)
  1. JCLS_model_card.md +122 -0
  2. README.md +43 -42
JCLS_model_card.md ADDED
@@ -0,0 +1,122 @@
+
+ ---
+ language: fr
+ tags:
+ - NER
+ - camembert
+ - literary-texts
+ - nested-entities
+ - BookBLP-fr
+ license: apache-2.0
+ metrics:
+ - f1
+ - precision
+ - recall
+ base_model:
+ - almanach/camembertV2-base
+ pipeline_tag: token-classification
+ ---
+
+ ## INTRODUCTION:
+ This model, developed as part of the [BookNLP-fr project](https://github.com/lattice-8094/fr-litbank), is a NER model built on top of [camembertV2-base](https://huggingface.co/almanach/camembertV2-base) embeddings, trained to predict nested entities in French, specifically for literary texts.
+
+ The predicted entities are:
+ - mentions of characters (PER): pronouns (je, tu, il, ...), possessive pronouns (mon, ton, son, ...), common nouns (le capitaine, la princesse, ...) and proper nouns (Indiana Delmare, Honoré de Pardaillan, ...)
+ - facilities (FAC): château, sentier, chambre, couloir, ...
+ - time (TIME): le règne de Louis XIV, ce matin, en juillet, ...
+ - geo-political entities (GPE): Montrouge, France, le petit hameau, ...
+ - locations (LOC): le sud, Mars, l'océan, le bois, ...
+ - vehicles (VEH): avion, voitures, calèche, vélos, ...
+
+ ## MODEL PERFORMANCES (LOOCV):
+ | NER_tag | precision | recall | f1_score | support | support % |
+ |-----------|-------------|----------|------------|-----------|-------------|
+ | PER | 90.17% | 95.76% | 92.88% | 4,061 | 85.80% |
+ | FAC | 79.19% | 78.12% | 78.65% | 224 | 4.73% |
+ | TIME | 63.18% | 70.56% | 66.67% | 214 | 4.52% |
+ | LOC | 62.50% | 54.55% | 58.25% | 110 | 2.32% |
+ | GPE | 74.58% | 68.75% | 71.54% | 64 | 1.35% |
+ | VEH | 69.12% | 78.33% | 73.44% | 60 | 1.27% |
+ | micro_avg | 87.31% | 92.25% | 89.68% | 4,733 | 100.00% |
+ | macro_avg | 73.12% | 74.35% | 73.57% | 4,733 | 100.00% |
+
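The `f1_score` and `macro_avg` columns of the JCLS performance table can be cross-checked from the precision and recall columns. The sketch below assumes the usual definitions (F1 as the harmonic mean of precision and recall, macro average as the unweighted mean over the six classes); the variable and function names are illustrative and do not come from the model's code. Because the inputs are already rounded to two decimals, the recomputed values agree with the table only to within about 0.01.

```python
# Per-class (precision, recall) in percent, copied from the JCLS table.
scores = {
    "PER": (90.17, 95.76),
    "FAC": (79.19, 78.12),
    "TIME": (63.18, 70.56),
    "LOC": (62.50, 54.55),
    "GPE": (74.58, 68.75),
    "VEH": (69.12, 78.33),
}


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (assumed definition)."""
    return 2 * precision * recall / (precision + recall)


# Recompute per-class F1 and the macro averages.
per_class_f1 = {tag: f1(p, r) for tag, (p, r) in scores.items()}
macro_precision = sum(p for p, _ in scores.values()) / len(scores)
macro_recall = sum(r for _, r in scores.values()) / len(scores)
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)

for tag, value in per_class_f1.items():
    print(f"{tag}: F1 = {value:.2f}%")
print(f"macro: P = {macro_precision:.2f}%, R = {macro_recall:.2f}%, F1 = {macro_f1:.2f}%")
```

The micro average cannot be recomputed this way, since it needs the raw true-positive, false-positive and false-negative counts rather than per-class percentages.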
+ ## TRAINING PARAMETERS:
+ - Entities types: ['PER', 'LOC', 'FAC', 'TIME', 'VEH', 'GPE']
+ - Tagging scheme: BIOES
+ - Nested entities levels: [0, 1]
+ - Split strategy: Leave-one-out cross-validation (28 files)
+ - Train/Validation split: 0.85 / 0.15
+ - Batch size: 16
+ - Initial learning rate: 0.00014
+
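The six entity types combined with the BIOES scheme are consistent with the 25 output labels listed in the model architecture: four positional tags (B, I, E, S) per type, plus a single `O` tag. The sketch below enumerates such a label set. The `B-PER` string format and the label ordering are assumptions made for illustration; the model's actual label vocabulary is not shown in this card.

```python
# Entity types as listed in the training parameters.
ENTITY_TYPES = ["PER", "LOC", "FAC", "TIME", "VEH", "GPE"]
# BIOES positions: Begin, Inside, End, Single. "O" (outside) is added once.
POSITIONS = ["B", "I", "E", "S"]


def build_label_set(entity_types):
    """Return ["O", "B-PER", "I-PER", ...] (hypothetical naming and order)."""
    labels = ["O"]
    for entity_type in entity_types:
        for position in POSITIONS:
            labels.append(f"{position}-{entity_type}")
    return labels


labels = build_label_set(ENTITY_TYPES)
print(len(labels))  # 6 types * 4 positions + 1 outside tag
print(labels[:5])
```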
+ ## MODEL ARCHITECTURE:
+ Model Input: Maximum context camembertV2-base embeddings (768 dimensions)
+
+ - Locked Dropout: 0.5
+
+ - Projection layer:
+   - layer type: highway layer
+   - input: 768 dimensions
+   - output: 2048 dimensions
+
+ - BiLSTM layer:
+   - input: 2048 dimensions
+   - output: 256 dimensions (hidden state)
+
+ - Linear layer:
+   - input: 256 dimensions
+   - output: 25 dimensions (predicted labels with BIOES tagging scheme)
+
+ - CRF layer
+
+ Model Output: BIOES labels sequence
+
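Since the model outputs a BIOES label sequence, downstream code has to turn labels back into entity spans. The following is a minimal decoding sketch, not the project's own post-processing: the function name, the `B-PER` label format and the lenient handling of malformed sequences are all assumptions. With two nesting levels, one plausible approach is to decode each level's sequence separately with the same function.

```python
from typing import List, Tuple


def bioes_to_spans(labels: List[str]) -> List[Tuple[int, int, str]]:
    """Convert BIOES labels into (start, end_inclusive, entity_type) spans.

    Malformed fragments (e.g. an "I-" or "E-" tag without a matching "B-")
    are silently dropped.
    """
    spans = []
    start, current = None, None  # start index and type of the open span
    for i, label in enumerate(labels):
        if label == "O":
            start, current = None, None
            continue
        prefix, _, entity_type = label.partition("-")
        if prefix == "S":  # single-token entity
            spans.append((i, i, entity_type))
            start, current = None, None
        elif prefix == "B":  # open a new multi-token entity
            start, current = i, entity_type
        elif prefix == "I":  # continue only if the type matches the open span
            if current != entity_type:
                start, current = None, None
        elif prefix == "E":  # close the open span if the type matches
            if start is not None and current == entity_type:
                spans.append((start, i, entity_type))
            start, current = None, None
    return spans


# Hand-written illustration, not actual model output.
tokens = ["le", "capitaine", "arrive", "à", "Montrouge"]
example_labels = ["B-PER", "E-PER", "O", "O", "S-GPE"]
spans = bioes_to_spans(example_labels)
for start, end, entity_type in spans:
    print(entity_type, " ".join(tokens[start:end + 1]))
```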
+ ## HOW TO USE:
+ *** IN CONSTRUCTION ***
+
+ ## TRAINING CORPUS:
+ | | Document | Tokens Count | Is included in model eval |
+ |----|----------------------------------------------------------------|----------------|-----------------------------------|
+ | 0 | 1836_Gautier-Theophile_La-morte-amoureuse | 14,299 tokens | False |
+ | 1 | 1840_Sand-George_Pauline | 12,315 tokens | False |
+ | 2 | 1842_Balzac-Honore-de_La-Maison-du-chat-qui-pelote | 24,776 tokens | False |
+ | 3 | 1844_Balzac-Honore-de_La-Maison-Nucingen | 30,987 tokens | False |
+ | 4 | 1844_Balzac-Honore-de_Sarrasine | 15,408 tokens | False |
+ | 5 | 1856_Cousin-Victor_Madame-de-Hautefort | 11,768 tokens | False |
+ | 6 | 1863_Gautier-Theophile_Le-capitaine-Fracasse | 11,834 tokens | False |
+ | 7 | 1873_Zola-Emile_Le-ventre-de-Paris | 12,557 tokens | False |
+ | 8 | 1881_Flaubert-Gustave_Bouvard-et-Pecuchet | 12,281 tokens | False |
+ | 9 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_1-MADEMOISELLE-FIFI | 5,425 tokens | True |
+ | 10 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_2-MADAME-BAPTISTE | 2,554 tokens | True |
+ | 11 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_3-LA-ROUILLE | 2,929 tokens | True |
+ | 12 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_1-MARROCA | 4,067 tokens | False |
+ | 13 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_2-LA-BUCHE | 2,251 tokens | False |
+ | 14 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_3-LA-RELIQUE | 2,034 tokens | False |
+ | 15 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_1-FOU | 1,864 tokens | False |
+ | 16 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_2-REVEIL | 2,141 tokens | False |
+ | 17 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_3-UNE-RUSE | 2,441 tokens | False |
+ | 18 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_4-A-CHEVAL | 2,860 tokens | False |
+ | 19 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_5-UN-REVEILLON | 2,343 tokens | False |
+ | 20 | 1901_Lucie-Achard_Rosalie-de-Constant-sa-famille-et-ses-amis | 12,703 tokens | False |
+ | 21 | 1903_Conan-Laure_Elisabeth_Seton | 13,023 tokens | False |
+ | 22 | 1904_Rolland-Romain_Jean-Christophe_Tome-I-L-aube | 10,982 tokens | True |
+ | 23 | 1904_Rolland-Romain_Jean-Christophe_Tome-II-Le-matin | 10,305 tokens | False |
+ | 24 | 1917_Adèle-Bourgeois_Némoville | 12,389 tokens | False |
+ | 25 | 1923_Radiguet-Raymond_Le-diable-au-corps | 14,637 tokens | False |
+ | 26 | 1926_Audoux-Marguerite_De-la-ville-au-moulin | 11,902 tokens | True |
+ | 27 | 1937_Audoux-Marguerite_Douce-Lumiere | 12,285 tokens | False |
+ | 28 | TOTAL | 275,360 tokens | 5 files used for cross-validation |
+
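The training parameters name leave-one-out cross-validation over files as the split strategy. As a generic illustration of that technique (not the project's training script), the sketch below yields one fold per document, holding that document out for evaluation and training on the rest. The function name is hypothetical, and the three file names are taken from the corpus table only to keep the example short.

```python
from typing import Iterator, List, Tuple


def leave_one_out(documents: List[str]) -> Iterator[Tuple[List[str], str]]:
    """Yield (training_documents, held_out_document) for each document."""
    for i, held_out in enumerate(documents):
        training = documents[:i] + documents[i + 1:]
        yield training, held_out


# A small subset of the corpus, for illustration only.
documents = [
    "1840_Sand-George_Pauline",
    "1873_Zola-Emile_Le-ventre-de-Paris",
    "1923_Radiguet-Raymond_Le-diable-au-corps",
]

folds = list(leave_one_out(documents))
for training, held_out in folds:
    print(f"evaluate on {held_out}, train on {len(training)} other files")
```

Splitting at the file level rather than the sentence level keeps every sentence of the held-out work out of training, which matters for literary texts where character names recur throughout a document.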
+ ## PREDICTIONS CONFUSION MATRIX:
+ | Gold Labels | PER | FAC | TIME | LOC | GPE | VEH | O | support |
+ |---------------|-------|-------|--------|-------|-------|-------|-----|-----------|
+ | PER | 3,889 | 3 | 2 | 2 | 1 | 1 | 163 | 4,061 |
+ | FAC | 6 | 175 | 0 | 2 | 0 | 1 | 40 | 224 |
+ | TIME | 0 | 0 | 151 | 0 | 0 | 0 | 63 | 214 |
+ | LOC | 1 | 0 | 0 | 60 | 9 | 0 | 40 | 110 |
+ | GPE | 2 | 0 | 0 | 8 | 44 | 0 | 10 | 64 |
+ | VEH | 1 | 0 | 0 | 0 | 0 | 47 | 12 | 60 |
+ | O | 411 | 43 | 85 | 24 | 5 | 19 | 0 | 587 |
+
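The per-class recall values in the JCLS performance table can be recovered from the rows of this confusion matrix: the diagonal cell divided by the row's support. The sketch below assumes that rows are gold labels and columns are predicted labels in the order PER, FAC, TIME, LOC, GPE, VEH, O (the header only names the rows explicitly). Names such as `matrix` and `recall` are illustrative.

```python
# Column order of the confusion matrix (assumed to be predicted labels).
LABELS = ["PER", "FAC", "TIME", "LOC", "GPE", "VEH", "O"]

# Gold-label rows copied from the table, without the "support" column.
matrix = {
    "PER": [3889, 3, 2, 2, 1, 1, 163],
    "FAC": [6, 175, 0, 2, 0, 1, 40],
    "TIME": [0, 0, 151, 0, 0, 0, 63],
    "LOC": [1, 0, 0, 60, 9, 0, 40],
    "GPE": [2, 0, 0, 8, 44, 0, 10],
    "VEH": [1, 0, 0, 0, 0, 47, 12],
}

support = {}
recall = {}
for gold, row in matrix.items():
    support[gold] = sum(row)              # all gold mentions of this type
    correct = row[LABELS.index(gold)]     # diagonal cell: predicted == gold
    recall[gold] = 100 * correct / support[gold]

for gold in matrix:
    print(f"{gold}: support = {support[gold]}, recall = {recall[gold]:.2f}%")
```

The `O` column of each row counts gold mentions that the model missed entirely, which is the dominant error type for every class in this table.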
+ ## CONTACT:
+ mail: antoine [dot] bourgois [at] protonmail [dot] com
README.md CHANGED
@@ -16,6 +16,7 @@ base_model:
  - almanach/camembertV2-base
  pipeline_tag: token-classification
  ---
+
  ## INTRODUCTION:
  This model, developed as part of the [BookNLP-fr project](https://github.com/lattice-8094/fr-litbank), is a NER model built on top of [camembertV2-base](https://huggingface.co/almanach/camembertV2-base) embeddings, trained to predict nested entities in French, specifically for literary texts.
 
@@ -25,19 +26,19 @@ The predicted entities are:
  - time (TIME): le règne de Louis XIV, ce matin, en juillet, ...
  - geo-political entities (GPE): Montrouge, France, le petit hameau, ...
  - locations (LOC): le sud, Mars, l'océan, le bois, ...
- - vehicles (VEH): avions, voitures, calèches, vélos, ...
+ - vehicles (VEH): avion, voitures, calèche, vélos, ...
 
  ## MODEL PERFORMANCES (LOOCV):
- | NER_tag | precision | recall | f1_score | support |
- |-----------|-------------|----------|------------|-----------|
- | PER | 90.10% | 93.38% | 91.71% | 31,570 |
- | FAC | 70.14% | 70.97% | 70.55% | 2,294 |
- | TIME | 58.04% | 58.98% | 58.51% | 1,670 |
- | GPE | 75.85% | 76.81% | 76.33% | 871 |
- | LOC | 61.22% | 46.57% | 52.90% | 773 |
- | VEH | 66.37% | 48.82% | 56.26% | 465 |
- | micro_avg | 86.25% | 88.60% | 87.36% | 37,643 |
- | macro_avg | 70.29% | 65.92% | 67.71% | 37,643 |
+ | NER_tag | precision | recall | f1_score | support | support % |
+ |-----------|-------------|----------|------------|-----------|-------------|
+ | PER | 90.10% | 93.38% | 91.71% | 31,570 | 83.87% |
+ | FAC | 70.14% | 70.97% | 70.55% | 2,294 | 6.09% |
+ | TIME | 58.04% | 58.98% | 58.51% | 1,670 | 4.44% |
+ | GPE | 75.85% | 76.81% | 76.33% | 871 | 2.31% |
+ | LOC | 61.22% | 46.57% | 52.90% | 773 | 2.05% |
+ | VEH | 66.37% | 48.82% | 56.26% | 465 | 1.24% |
+ | micro_avg | 86.25% | 88.60% | 87.36% | 37,643 | 100.00% |
+ | macro_avg | 70.29% | 65.92% | 67.71% | 37,643 | 100.00% |
 
  ## TRAINING PARAMETERS:
  - Entities types: ['PER', 'LOC', 'FAC', 'TIME', 'VEH', 'GPE']
@@ -74,37 +75,37 @@ Model Output: BIOES labels sequence
  *** IN CONSTRUCTION ***
 
  ## TRAINING CORPUS:
- | | Document | Tokens Count |
- |----|----------------------------------------------------------------|----------------|
- | 0 | 1836_Gautier-Theophile_La-morte-amoureuse | 14,299 tokens |
- | 1 | 1840_Sand-George_Pauline | 12,315 tokens |
- | 2 | 1842_Balzac-Honore-de_La-Maison-du-chat-qui-pelote | 24,776 tokens |
- | 3 | 1844_Balzac-Honore-de_La-Maison-Nucingen | 30,987 tokens |
- | 4 | 1844_Balzac-Honore-de_Sarrasine | 15,408 tokens |
- | 5 | 1856_Cousin-Victor_Madame-de-Hautefort | 11,768 tokens |
- | 6 | 1863_Gautier-Theophile_Le-capitaine-Fracasse | 11,834 tokens |
- | 7 | 1873_Zola-Emile_Le-ventre-de-Paris | 12,557 tokens |
- | 8 | 1881_Flaubert-Gustave_Bouvard-et-Pecuchet | 12,281 tokens |
- | 9 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_1-MADEMOISELLE-FIFI | 5,425 tokens |
- | 10 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_2-MADAME-BAPTISTE | 2,554 tokens |
- | 11 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_3-LA-ROUILLE | 2,929 tokens |
- | 12 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_1-MARROCA | 4,067 tokens |
- | 13 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_2-LA-BUCHE | 2,251 tokens |
- | 14 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_3-LA-RELIQUE | 2,034 tokens |
- | 15 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_1-FOU | 1,864 tokens |
- | 16 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_2-REVEIL | 2,141 tokens |
- | 17 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_3-UNE-RUSE | 2,441 tokens |
- | 18 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_4-A-CHEVAL | 2,860 tokens |
- | 19 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_5-UN-REVEILLON | 2,343 tokens |
- | 20 | 1901_Lucie-Achard_Rosalie-de-Constant-sa-famille-et-ses-amis | 12,703 tokens |
- | 21 | 1903_Conan-Laure_Elisabeth_Seton | 13,023 tokens |
- | 22 | 1904_Rolland-Romain_Jean-Christophe_Tome-I-L-aube | 10,982 tokens |
- | 23 | 1904_Rolland-Romain_Jean-Christophe_Tome-II-Le-matin | 10,305 tokens |
- | 24 | 1917_Adèle-Bourgeois_Némoville | 12,389 tokens |
- | 25 | 1923_Radiguet-Raymond_Le-diable-au-corps | 14,637 tokens |
- | 26 | 1926_Audoux-Marguerite_De-la-ville-au-moulin | 11,902 tokens |
- | 27 | 1937_Audoux-Marguerite_Douce-Lumiere | 12,285 tokens |
- | 28 | TOTAL | 275,360 tokens |
+ | | Document | Tokens Count | Is included in model eval |
+ |----|----------------------------------------------------------------|----------------|------------------------------------|
+ | 0 | 1836_Gautier-Theophile_La-morte-amoureuse | 14,299 tokens | True |
+ | 1 | 1840_Sand-George_Pauline | 12,315 tokens | True |
+ | 2 | 1842_Balzac-Honore-de_La-Maison-du-chat-qui-pelote | 24,776 tokens | True |
+ | 3 | 1844_Balzac-Honore-de_La-Maison-Nucingen | 30,987 tokens | True |
+ | 4 | 1844_Balzac-Honore-de_Sarrasine | 15,408 tokens | True |
+ | 5 | 1856_Cousin-Victor_Madame-de-Hautefort | 11,768 tokens | True |
+ | 6 | 1863_Gautier-Theophile_Le-capitaine-Fracasse | 11,834 tokens | True |
+ | 7 | 1873_Zola-Emile_Le-ventre-de-Paris | 12,557 tokens | True |
+ | 8 | 1881_Flaubert-Gustave_Bouvard-et-Pecuchet | 12,281 tokens | True |
+ | 9 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_1-MADEMOISELLE-FIFI | 5,425 tokens | True |
+ | 10 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_2-MADAME-BAPTISTE | 2,554 tokens | True |
+ | 11 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_3-LA-ROUILLE | 2,929 tokens | True |
+ | 12 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_1-MARROCA | 4,067 tokens | True |
+ | 13 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_2-LA-BUCHE | 2,251 tokens | True |
+ | 14 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_3-LA-RELIQUE | 2,034 tokens | True |
+ | 15 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_1-FOU | 1,864 tokens | True |
+ | 16 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_2-REVEIL | 2,141 tokens | True |
+ | 17 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_3-UNE-RUSE | 2,441 tokens | True |
+ | 18 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_4-A-CHEVAL | 2,860 tokens | True |
+ | 19 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_5-UN-REVEILLON | 2,343 tokens | True |
+ | 20 | 1901_Lucie-Achard_Rosalie-de-Constant-sa-famille-et-ses-amis | 12,703 tokens | True |
+ | 21 | 1903_Conan-Laure_Elisabeth_Seton | 13,023 tokens | True |
+ | 22 | 1904_Rolland-Romain_Jean-Christophe_Tome-I-L-aube | 10,982 tokens | True |
+ | 23 | 1904_Rolland-Romain_Jean-Christophe_Tome-II-Le-matin | 10,305 tokens | True |
+ | 24 | 1917_Adèle-Bourgeois_Némoville | 12,389 tokens | True |
+ | 25 | 1923_Radiguet-Raymond_Le-diable-au-corps | 14,637 tokens | True |
+ | 26 | 1926_Audoux-Marguerite_De-la-ville-au-moulin | 11,902 tokens | True |
+ | 27 | 1937_Audoux-Marguerite_Douce-Lumiere | 12,285 tokens | True |
+ | 28 | TOTAL | 275,360 tokens | 28 files used for cross-validation |
 
  ## PREDICTIONS CONFUSION MATRIX:
  | Gold Labels | PER | FAC | TIME | GPE | LOC | VEH | O | support |