AntoineBourgois
/

BookNLP-fr_coreference-resolution_camembert-large_PER

@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+final_model filter=lfs diff=lfs merge=lfs -text

JCLS_model_card.md ADDED

	@@ -0,0 +1,132 @@

+---
+language: fr
+tags:
+- coreference-resolution
+- anaphora-resolution
+- mentions-linking
+- literary-texts
+- camembert
+- literary-texts
+- nested-entities
+- BookNLP-fr
+license: apache-2.0
+metrics:
+- MUC
+- B3
+- CEAF
+- CoNLL-F1
+base_model:
+- almanach/camembert-large
+---
+## INTRODUCTION:
+This model, developed as part of the [BookNLP-fr project](https://github.com/lattice-8094/fr-litbank), is a **coreference resolution model** built on top of [camembert-large](https://huggingface.co/almanach/camembert-large) embeddings. It is trained to link mentions of the same entity across a text, focusing on literary works in French.
+This specific model has been trained to link mentions of the following entity type: PER.
+## MODEL PERFORMANCE (LOOCV):
+Overall coreference resolution performance on non-overlapping windows of different lengths:
+|    | Window width (tokens)   |   Document count |   Sample count | MUC F1   | B3 F1   | CEAFe F1   | CoNLL F1   |
+|----|-------------------------|------------------|----------------|----------|---------|------------|------------|
+|  0 | 500                     |                5 |             64 | 93.49%   | 86.27%  | 77.85%     | 85.87%     |
+|  1 | 1,000                   |                5 |             30 | 93.68%   | 81.32%  | 71.92%     | 82.31%     |
+|  2 | 2,000                   |                5 |             14 | 93.98%   | 76.90%  | 67.26%     | 79.38%     |
+|  3 | 5,000                   |                3 |              5 | 94.83%   | 68.34%  | 59.88%     | 74.35%     |
+|  4 | 10,000                  |                2 |              2 | 96.16%   | 62.22%  | 57.12%     | 71.84%     |
+Coreference resolution performance on the fully annotated sample of each document:
+|    | Token count   | Mention count   | MUC F1   | B3 F1   | CEAFe F1   | CoNLL F1   |
+|----|---------------|-----------------|----------|---------|------------|------------|
+|  0 | 2,554         | 330             | 90.24%   | 65.27%  | 72.36%     | 75.96%     |
+|  1 | 2,929         | 386             | 95.65%   | 78.21%  | 64.23%     | 79.37%     |
+|  2 | 5,425         | 558             | 90.46%   | 53.03%  | 59.52%     | 67.67%     |
+|  3 | 10,982        | 1,095           | 97.18%   | 65.30%  | 60.49%     | 74.32%     |
+|  4 | 11,902        | 1,692           | 95.03%   | 58.83%  | 45.59%     | 66.49%     |
+## TRAINING PARAMETERS:
+- Entity types: PER
+- Split strategy: Leave-one-out cross-validation (29 files)
+- Train/Validation split: 0.85 / 0.15
+- Batch size: 16,000
+- Initial learning rate: 0.0004
+- Focal loss gamma: 1
+- Focal loss alpha: 0.25
+- Pronoun lookup antecedents: 30
+- Common and proper noun lookup antecedents: 300
+## MODEL ARCHITECTURE:
+Model Input: 2,165-dimensional vector
+- Concatenated maximum-context camembert-large embeddings (2 * 1,024 = 2,048 dimensions)
+- Additional mention features (106 dimensions):
+  - Length of mentions
+  - Position of the mention's start token within the sentence
+  - Grammatical category of the mentions (pronoun, common noun, proper noun)
+  - Dependency relation of the mention's head (one-hot encoded)
+  - Gender of the mentions (one-hot encoded)
+  - Number (singular/plural) of the mentions (one-hot encoded)
+  - Grammatical person of the mentions (one-hot encoded)
+- Additional mention-pair features (11 dimensions):
+  - Distance between mention IDs
+  - Distance between start tokens of mentions
+  - Distance between end tokens of mentions
+  - Distance between sentences containing mentions
+  - Distance between paragraphs containing mentions
+  - Difference in nesting levels of mentions
+  - Ratio of shared tokens between mentions
+  - Exact text match between mentions (binary)
+  - Exact match of mention heads (binary)
+  - Match of syntactic heads between mentions (binary)
+  - Match of entity types between mentions (binary)
+- Hidden Layers:
+  - Number of layers: 3
+  - Units per layer: 1,900
+  - Activation function: ReLU
+  - Dropout rate: 0.6
+- Final Layer:
+  - Type: Linear
+  - Input: 1,900 dimensions
+  - Output: 1 dimension (mention-pair coreference score)
+Model Output: Continuous score between 0 (not coreferent) and 1 (coreferent), indicating the model's confidence that the two mentions corefer.
+## HOW TO USE:
+*** UNDER CONSTRUCTION ***
+## TRAINING CORPUS:
+|    | Document                                                       | Token count    | Included in model eval            |
+|----|----------------------------------------------------------------|----------------|-----------------------------------|
+|  0 | 1836_Gautier-Theophile_La-morte-amoureuse                      | 14,299 tokens  | False                             |
+|  1 | 1840_Sand-George_Pauline                                       | 12,315 tokens  | False                             |
+|  2 | 1842_Balzac-Honore-de_La-Maison-du-chat-qui-pelote             | 24,776 tokens  | False                             |
+|  3 | 1844_Balzac-Honore-de_La-Maison-Nucingen                       | 30,987 tokens  | False                             |
+|  4 | 1844_Balzac-Honore-de_Sarrasine                                | 15,408 tokens  | False                             |
+|  5 | 1856_Cousin-Victor_Madame-de-Hautefort                         | 11,768 tokens  | False                             |
+|  6 | 1863_Gautier-Theophile_Le-capitaine-Fracasse                   | 11,834 tokens  | False                             |
+|  7 | 1873_Zola-Emile_Le-ventre-de-Paris                             | 12,557 tokens  | False                             |
+|  8 | 1881_Flaubert-Gustave_Bouvard-et-Pecuchet                      | 12,281 tokens  | False                             |
+|  9 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_1-MADEMOISELLE-FIFI | 5,425 tokens   | **True**                          |
+| 10 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_2-MADAME-BAPTISTE   | 2,554 tokens   | **True**                          |
+| 11 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_3-LA-ROUILLE        | 2,929 tokens   | **True**                          |
+| 12 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_1-MARROCA           | 4,067 tokens   | False                             |
+| 13 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_2-LA-BUCHE          | 2,251 tokens   | False                             |
+| 14 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_3-LA-RELIQUE        | 2,034 tokens   | False                             |
+| 15 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_1-FOU               | 1,864 tokens   | False                             |
+| 16 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_2-REVEIL            | 2,141 tokens   | False                             |
+| 17 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_3-UNE-RUSE          | 2,441 tokens   | False                             |
+| 18 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_4-A-CHEVAL          | 2,860 tokens   | False                             |
+| 19 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_5-UN-REVEILLON      | 2,343 tokens   | False                             |
+| 20 | 1901_Lucie-Achard_Rosalie-de-Constant-sa-famille-et-ses-amis   | 12,703 tokens  | False                             |
+| 21 | 1903_Conan-Laure_Elisabeth_Seton                               | 13,023 tokens  | False                             |
+| 22 | 1904_Rolland-Romain_Jean-Christophe_Tome-I-L-aube              | 10,982 tokens  | **True**                          |
+| 23 | 1904_Rolland-Romain_Jean-Christophe_Tome-II-Le-matin           | 10,305 tokens  | False                             |
+| 24 | 1917_Adèle-Bourgeois_Némoville                                 | 12,389 tokens  | False                             |
+| 25 | 1923_Radiguet-Raymond_Le-diable-au-corps                       | 14,637 tokens  | False                             |
+| 26 | 1926_Audoux-Marguerite_De-la-ville-au-moulin                   | 11,902 tokens  | **True**                          |
+| 27 | 1937_Audoux-Marguerite_Douce-Lumiere                           | 12,285 tokens  | False                             |
+| 28 | Manon_Lescaut_PEDRO                                            | 71,219 tokens  | False                             |
+| 29 | TOTAL                                                          | 346,579 tokens | 5 files used for cross-validation |
+## CONTACT:
+mail: antoine [dot] bourgois [at] protonmail [dot] com
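The sample counts and CoNLL F1 values in the JCLS model card's first table can be reproduced from figures it reports elsewhere. The sketch below is a consistency check, not the author's evaluation code. It assumes each document contributes `floor(tokens / width)` full windows with the trailing partial window dropped, which is an inference from the numbers rather than something the card states, and it uses the standard definition of CoNLL F1 as the unweighted mean of MUC, B3 and CEAFe F1. The function names are hypothetical.

```python
# Token counts of the five evaluated documents, copied from the model card.
TOKEN_COUNTS = [2554, 2929, 5425, 10982, 11902]


def count_windows(token_counts, width):
    """Return (document count, sample count) for non-overlapping windows.

    Assumption: a document yields floor(tokens / width) full windows and
    any trailing partial window is discarded.
    """
    per_doc = [n // width for n in token_counts]
    docs = sum(1 for k in per_doc if k > 0)
    return docs, sum(per_doc)


def conll_f1(muc, b3, ceafe):
    """CoNLL F1: unweighted mean of the MUC, B3 and CEAFe F1 scores."""
    return (muc + b3 + ceafe) / 3


for width in (500, 1000, 2000, 5000, 10000):
    print(width, count_windows(TOKEN_COUNTS, width))

# First row of the table: MUC 93.49, B3 86.27, CEAFe 77.85.
print(round(conll_f1(93.49, 86.27, 77.85), 2))
```

Under these assumptions the computed document and sample counts match every row of the table (for example 5 documents and 64 samples at 500 tokens, 2 and 2 at 10,000), and the averaged score for the first row matches the reported 85.87%.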

README.md ADDED

	@@ -0,0 +1,158 @@

+---
+language: fr
+tags:
+- coreference-resolution
+- anaphora-resolution
+- mentions-linking
+- literary-texts
+- camembert
+- literary-texts
+- nested-entities
+- BookNLP-fr
+license: apache-2.0
+metrics:
+- MUC
+- B3
+- CEAF
+- CoNLL-F1
+base_model:
+- almanach/camembert-large
+---
+## INTRODUCTION:
+This model, developed as part of the [BookNLP-fr project](https://github.com/lattice-8094/fr-litbank), is a **coreference resolution model** built on top of [camembert-large](https://huggingface.co/almanach/camembert-large) embeddings. It is trained to link mentions of the same entity across a text, focusing on literary works in French.
+This specific model has been trained to link mentions of the following entity type: PER.
+## MODEL PERFORMANCE (LOOCV):
+Overall coreference resolution performance on non-overlapping windows of different lengths:
+|    | Window width (tokens)   |   Document count |   Sample count | MUC F1   | B3 F1   | CEAFe F1   | CoNLL F1   |
+|----|-------------------------|------------------|----------------|----------|---------|------------|------------|
+|  0 | 500                     |               29 |            677 | 92.18%   | 83.86%  | 76.86%     | 84.30%     |
+|  1 | 1,000                   |               29 |            332 | 92.65%   | 79.79%  | 71.77%     | 81.40%     |
+|  2 | 2,000                   |               28 |            162 | 93.29%   | 75.85%  | 67.34%     | 78.83%     |
+|  3 | 5,000                   |               19 |             56 | 93.76%   | 69.60%  | 61.16%     | 74.84%     |
+|  4 | 10,000                  |               18 |             27 | 94.28%   | 65.73%  | 58.59%     | 72.86%     |
+|  5 | 25,000                  |                2 |              3 | 94.76%   | 62.48%  | 53.33%     | 70.19%     |
+|  6 | 50,000                  |                1 |              1 | 97.39%   | 56.43%  | 47.40%     | 67.07%     |
+Coreference resolution performance on the fully annotated sample of each document:
+|    | Token count   | Mention count   | MUC F1   | B3 F1   | CEAFe F1   | CoNLL F1   |
+|----|---------------|-----------------|----------|---------|------------|------------|
+|  0 | 1,864         | 253             | 98.16%   | 95.39%  | 60.34%     | 84.63%     |
+|  1 | 2,034         | 321             | 97.47%   | 92.79%  | 80.04%     | 90.10%     |
+|  2 | 2,141         | 297             | 95.06%   | 77.99%  | 65.08%     | 79.38%     |
+|  3 | 2,251         | 235             | 91.95%   | 80.47%  | 46.56%     | 73.00%     |
+|  4 | 2,343         | 239             | 83.87%   | 61.95%  | 43.58%     | 63.13%     |
+|  5 | 2,441         | 314             | 91.85%   | 55.70%  | 60.82%     | 69.46%     |
+|  6 | 2,554         | 330             | 90.24%   | 65.27%  | 72.36%     | 75.96%     |
+|  7 | 2,860         | 369             | 93.65%   | 84.89%  | 74.93%     | 84.49%     |
+|  8 | 2,929         | 386             | 95.65%   | 78.21%  | 64.23%     | 79.37%     |
+|  9 | 4,067         | 429             | 97.46%   | 85.20%  | 62.52%     | 81.73%     |
+| 10 | 5,425         | 558             | 90.46%   | 53.03%  | 59.52%     | 67.67%     |
+| 11 | 10,305        | 1,436           | 96.37%   | 74.83%  | 59.91%     | 77.04%     |
+| 12 | 10,982        | 1,095           | 97.18%   | 65.30%  | 60.49%     | 74.32%     |
+| 13 | 11,768        | 1,734           | 93.30%   | 64.14%  | 64.12%     | 73.85%     |
+| 14 | 11,834        | 600             | 92.21%   | 67.51%  | 60.74%     | 73.49%     |
+| 15 | 11,902        | 1,692           | 95.03%   | 58.83%  | 45.59%     | 66.49%     |
+| 16 | 12,281        | 1,089           | 95.06%   | 62.05%  | 72.55%     | 76.55%     |
+| 17 | 12,285        | 1,489           | 95.28%   | 77.84%  | 57.43%     | 76.85%     |
+| 18 | 12,315        | 1,501           | 95.36%   | 57.07%  | 64.26%     | 72.23%     |
+| 19 | 12,389        | 1,654           | 93.19%   | 54.21%  | 51.84%     | 66.41%     |
+| 20 | 12,557        | 1,085           | 92.30%   | 66.97%  | 46.65%     | 68.64%     |
+| 21 | 12,703        | 1,731           | 90.40%   | 53.70%  | 61.37%     | 68.49%     |
+| 22 | 13,023        | 1,559           | 93.86%   | 61.71%  | 62.41%     | 72.66%     |
+| 23 | 14,299        | 1,582           | 97.23%   | 69.25%  | 67.04%     | 77.84%     |
+| 24 | 14,637        | 2,127           | 95.78%   | 71.34%  | 63.28%     | 76.80%     |
+| 25 | 15,408        | 1,769           | 92.85%   | 54.11%  | 56.12%     | 67.69%     |
+| 26 | 24,776        | 2,716           | 94.31%   | 63.51%  | 54.12%     | 70.65%     |
+| 27 | 30,987        | 2,980           | 89.55%   | 54.25%  | 59.68%     | 67.83%     |
+| 28 | 71,219        | 11,857          | 97.38%   | 50.85%  | 45.93%     | 64.72%     |
+## TRAINING PARAMETERS:
+- Entity types: PER
+- Split strategy: Leave-one-out cross-validation (29 files)
+- Train/Validation split: 0.85 / 0.15
+- Batch size: 16,000
+- Initial learning rate: 0.0004
+- Focal loss gamma: 1
+- Focal loss alpha: 0.25
+- Pronoun lookup antecedents: 30
+- Common and proper noun lookup antecedents: 300
+## MODEL ARCHITECTURE:
+Model Input: 2,165-dimensional vector
+- Concatenated maximum-context camembert-large embeddings (2 * 1,024 = 2,048 dimensions)
+- Additional mention features (106 dimensions):
+  - Length of mentions
+  - Position of the mention's start token within the sentence
+  - Grammatical category of the mentions (pronoun, common noun, proper noun)
+  - Dependency relation of the mention's head (one-hot encoded)
+  - Gender of the mentions (one-hot encoded)
+  - Number (singular/plural) of the mentions (one-hot encoded)
+  - Grammatical person of the mentions (one-hot encoded)
+- Additional mention-pair features (11 dimensions):
+  - Distance between mention IDs
+  - Distance between start tokens of mentions
+  - Distance between end tokens of mentions
+  - Distance between sentences containing mentions
+  - Distance between paragraphs containing mentions
+  - Difference in nesting levels of mentions
+  - Ratio of shared tokens between mentions
+  - Exact text match between mentions (binary)
+  - Exact match of mention heads (binary)
+  - Match of syntactic heads between mentions (binary)
+  - Match of entity types between mentions (binary)
+- Hidden Layers:
+  - Number of layers: 3
+  - Units per layer: 1,900
+  - Activation function: ReLU
+  - Dropout rate: 0.6
+- Final Layer:
+  - Type: Linear
+  - Input: 1,900 dimensions
+  - Output: 1 dimension (mention-pair coreference score)
+Model Output: Continuous score between 0 (not coreferent) and 1 (coreferent), indicating the model's confidence that the two mentions corefer.
+## HOW TO USE:
+*** UNDER CONSTRUCTION ***
+## TRAINING CORPUS:
+|    | Document                                                       | Token count    | Included in model eval             |
+|----|----------------------------------------------------------------|----------------|------------------------------------|
+|  0 | 1836_Gautier-Theophile_La-morte-amoureuse                      | 14,299 tokens  | **True**                           |
+|  1 | 1840_Sand-George_Pauline                                       | 12,315 tokens  | **True**                           |
+|  2 | 1842_Balzac-Honore-de_La-Maison-du-chat-qui-pelote             | 24,776 tokens  | **True**                           |
+|  3 | 1844_Balzac-Honore-de_La-Maison-Nucingen                       | 30,987 tokens  | **True**                           |
+|  4 | 1844_Balzac-Honore-de_Sarrasine                                | 15,408 tokens  | **True**                           |
+|  5 | 1856_Cousin-Victor_Madame-de-Hautefort                         | 11,768 tokens  | **True**                           |
+|  6 | 1863_Gautier-Theophile_Le-capitaine-Fracasse                   | 11,834 tokens  | **True**                           |
+|  7 | 1873_Zola-Emile_Le-ventre-de-Paris                             | 12,557 tokens  | **True**                           |
+|  8 | 1881_Flaubert-Gustave_Bouvard-et-Pecuchet                      | 12,281 tokens  | **True**                           |
+|  9 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_1-MADEMOISELLE-FIFI | 5,425 tokens   | **True**                           |
+| 10 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_2-MADAME-BAPTISTE   | 2,554 tokens   | **True**                           |
+| 11 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_3-LA-ROUILLE        | 2,929 tokens   | **True**                           |
+| 12 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_1-MARROCA           | 4,067 tokens   | **True**                           |
+| 13 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_2-LA-BUCHE          | 2,251 tokens   | **True**                           |
+| 14 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_3-LA-RELIQUE        | 2,034 tokens   | **True**                           |
+| 15 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_1-FOU               | 1,864 tokens   | **True**                           |
+| 16 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_2-REVEIL            | 2,141 tokens   | **True**                           |
+| 17 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_3-UNE-RUSE          | 2,441 tokens   | **True**                           |
+| 18 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_4-A-CHEVAL          | 2,860 tokens   | **True**                           |
+| 19 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_5-UN-REVEILLON      | 2,343 tokens   | **True**                           |
+| 20 | 1901_Lucie-Achard_Rosalie-de-Constant-sa-famille-et-ses-amis   | 12,703 tokens  | **True**                           |
+| 21 | 1903_Conan-Laure_Elisabeth_Seton                               | 13,023 tokens  | **True**                           |
+| 22 | 1904_Rolland-Romain_Jean-Christophe_Tome-I-L-aube              | 10,982 tokens  | **True**                           |
+| 23 | 1904_Rolland-Romain_Jean-Christophe_Tome-II-Le-matin           | 10,305 tokens  | **True**                           |
+| 24 | 1917_Adèle-Bourgeois_Némoville                                 | 12,389 tokens  | **True**                           |
+| 25 | 1923_Radiguet-Raymond_Le-diable-au-corps                       | 14,637 tokens  | **True**                           |
+| 26 | 1926_Audoux-Marguerite_De-la-ville-au-moulin                   | 11,902 tokens  | **True**                           |
+| 27 | 1937_Audoux-Marguerite_Douce-Lumiere                           | 12,285 tokens  | **True**                           |
+| 28 | Manon_Lescaut_PEDRO                                            | 71,219 tokens  | **True**                           |
+| 29 | TOTAL                                                          | 346,579 tokens | 29 files used for cross-validation |
+## CONTACT:
+mail: antoine [dot] bourgois [at] protonmail [dot] com
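The training parameters list a focal loss with gamma 1 and alpha 0.25 but do not show how it is computed. As a rough illustration, the sketch below implements the common binary focal loss formulation, `-alpha_t * (1 - p_t)^gamma * log(p_t)`, for a single mention pair. Whether the author's training code uses exactly this form, and whether alpha weights the coreferent class as assumed here, is not confirmed by the model card. The function name and the `eps` clamp are hypothetical.

```python
import math


def binary_focal_loss(p, y, gamma=1.0, alpha=0.25, eps=1e-7):
    """Focal loss for one predicted probability p and one binary label y.

    Assumptions: y == 1 means "coreferent"; alpha weights the positive
    class and (1 - alpha) the negative class.
    """
    p = min(max(p, eps), 1.0 - eps)        # clamp to avoid log(0)
    p_t = p if y == 1 else 1.0 - p         # probability given to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# A confident correct prediction is penalised far less than an uncertain one.
print(binary_focal_loss(0.9, 1))
print(binary_focal_loss(0.6, 1))
```

With gamma set to 0 the modulating factor disappears and the expression reduces to alpha-weighted binary cross-entropy, which is a quick way to sanity-check an implementation.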

final_model ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:e2d6959e7d303580f9343904eefb84cc7e2d4917abbe30b12e1d2c591ccc0230
+size 45374744
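The size in the LFS pointer can be compared with the architecture described in the model cards. The sketch below counts the parameters of a fully connected network with the listed layer widths and compares the result with the pointer's `size` field. It assumes plain dense layers with bias terms, weights stored as 32-bit floats, and no other learned parameters; none of this is stated in the source, and `mlp_parameter_count` is a hypothetical helper.

```python
def mlp_parameter_count(layer_sizes):
    """Weights plus biases of a fully connected network with these widths."""
    return sum(i * o + o for i, o in zip(layer_sizes, layer_sizes[1:]))


# Input: two camembert-large embeddings + mention features + pair features.
INPUT_DIM = 2 * 1024 + 106 + 11

# Three hidden layers of 1,900 units, then a linear layer with one output.
LAYER_SIZES = [INPUT_DIM, 1900, 1900, 1900, 1]

params = mlp_parameter_count(LAYER_SIZES)
weights_bytes = params * 4              # assumption: float32 storage
LFS_SIZE = 45374744                     # "size" field of the LFS pointer

print(INPUT_DIM, params, weights_bytes, LFS_SIZE - weights_bytes)
```

Under these assumptions the network has 11,341,101 parameters, or 45,364,404 bytes of weights, which is 10,340 bytes less than the stored file. A gap of that size would be consistent with serialization overhead, though the file format is not specified here.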