---
language: fr
tags:
- coreference-resolution
- anaphora-resolution
- mentions-linking
- literary-texts
- camembert
- nested-entities
- BookNLP-fr
license: apache-2.0
metrics:
- MUC
- B3
- CEAF
- CoNLL-F1
base_model:
- almanach/camembert-large
---
## INTRODUCTION:
This model, developed as part of the [BookNLP-fr project](https://github.com/lattice-8094/fr-litbank), is a **coreference resolution model** built on top of [camembert-large](https://huggingface.co/almanach/camembert-large) embeddings. It is trained to link mentions of the same entity across a text, focusing on literary works in French.
This specific model links mentions of a single entity type: PER (persons).
## MODEL PERFORMANCE (LOOCV):
Overall coreference resolution performance on non-overlapping windows of different lengths:
| | Window width (tokens) | Document count | Sample count | MUC F1 | B3 F1 | CEAFe F1 | CoNLL F1 |
|----|-------------------------|------------------|----------------|----------|---------|------------|------------|
| 0 | 500 | 5 | 64 | 93.49% | 86.27% | 77.85% | 85.87% |
| 1 | 1,000 | 5 | 30 | 93.68% | 81.32% | 71.92% | 82.31% |
| 2 | 2,000 | 5 | 14 | 93.98% | 76.90% | 67.26% | 79.38% |
| 3 | 5,000 | 3 | 5 | 94.83% | 68.34% | 59.88% | 74.35% |
| 4 | 10,000 | 2 | 2 | 96.16% | 62.22% | 57.12% | 71.84% |
Coreference resolution performance on the fully annotated sample of each evaluated document:
| | Token count | Mention count | MUC F1 | B3 F1 | CEAFe F1 | CoNLL F1 |
|----|---------------|-----------------|----------|---------|------------|------------|
| 0 | 2,554 | 330 | 90.24% | 65.27% | 72.36% | 75.96% |
| 1 | 2,929 | 386 | 95.65% | 78.21% | 64.23% | 79.37% |
| 2 | 5,425 | 558 | 90.46% | 53.03% | 59.52% | 67.67% |
| 3 | 10,982 | 1,095 | 97.18% | 65.30% | 60.49% | 74.32% |
| 4 | 11,902 | 1,692 | 95.03% | 58.83% | 45.59% | 66.49% |
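
The CoNLL F1 column in both tables is the unweighted mean of the MUC, B3 and CEAFe F1 scores. The short sketch below recomputes it for the first row of each table; the function name `conll_f1` is only illustrative. Because the tables report rounded values, a recomputed score can differ from the printed one by 0.01.

```python
def conll_f1(muc_f1: float, b3_f1: float, ceafe_f1: float) -> float:
    """Unweighted mean of the three coreference F1 scores (in percent)."""
    return (muc_f1 + b3_f1 + ceafe_f1) / 3.0


# First row of the window table (500-token windows).
window_score = conll_f1(93.49, 86.27, 77.85)
print(f"{window_score:.2f}%")  # 85.87%

# First row of the per-document table (2,554 tokens).
document_score = conll_f1(90.24, 65.27, 72.36)
print(f"{document_score:.2f}%")  # 75.96%
```
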
## TRAINING PARAMETERS:
- Entity types: PER
- Split strategy: Leave-one-out cross-validation (29 files)
- Train/Validation split: 0.85 / 0.15
- Batch size: 16,000
- Initial learning rate: 0.0004
- Focal loss gamma: 1
- Focal loss alpha: 0.25
- Pronoun lookup antecedents: 30
- Common and Proper nouns lookup antecedents: 300
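
To show what the two focal loss parameters do, here is a minimal sketch of a binary focal loss for a single mention pair. It assumes the standard formulation, in which `alpha` weights the positive (coreferent) class and `gamma` down-weights pairs the model already classifies well. This card does not state the exact variant used in training, so treat the function `binary_focal_loss` as an illustration rather than the training code.

```python
import math


def binary_focal_loss(p: float, y: int, gamma: float = 1.0, alpha: float = 0.25) -> float:
    """Focal loss for one mention pair (assumed standard formulation).

    p: predicted probability that the pair is coreferent
    y: gold label, 1 for coreferent and 0 for not coreferent
    """
    eps = 1e-7
    p = min(max(p, eps), 1.0 - eps)              # clamp to avoid log(0)
    p_t = p if y == 1 else 1.0 - p               # probability given to the gold class
    alpha_t = alpha if y == 1 else 1.0 - alpha   # class weight (assumption: alpha on positives)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# A confidently correct pair contributes far less than a badly scored one.
easy = binary_focal_loss(0.95, 1)
hard = binary_focal_loss(0.30, 1)
print(f"easy positive: {easy:.4f}, hard positive: {hard:.4f}")
```

With `gamma = 0` and `alpha = 0.5` the expression reduces to half of the usual binary cross-entropy, which is a quick way to sanity-check an implementation.
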
## MODEL ARCHITECTURE:
Model Input: 2,165-dimensional vector
- Concatenated maximum-context camembert-large embeddings (2 * 1,024 = 2,048 dimensions)
- Additional mention features (106 dimensions):
  - Length of mentions
  - Position of the mention's start token within the sentence
  - Grammatical category of the mentions (pronoun, common noun, proper noun)
  - Dependency relation of the mention's head (one-hot encoded)
  - Gender of the mentions (one-hot encoded)
  - Number (singular/plural) of the mentions (one-hot encoded)
  - Grammatical person of the mentions (one-hot encoded)
- Additional mention-pair features (11 dimensions):
  - Distance between mention IDs
  - Distance between start tokens of mentions
  - Distance between end tokens of mentions
  - Distance between sentences containing mentions
  - Distance between paragraphs containing mentions
  - Difference in nesting levels of mentions
  - Ratio of shared tokens between mentions
  - Exact text match between mentions (binary)
  - Exact match of mention heads (binary)
  - Match of syntactic heads between mentions (binary)
  - Match of entity types between mentions (binary)
- Hidden Layers:
  - Number of layers: 3
  - Units per layer: 1,900 nodes
  - Activation function: ReLU
  - Dropout rate: 0.6
- Final Layer:
  - Type: Linear
  - Input: 1,900 dimensions
  - Output: 1 dimension (mention pair coreference score)
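
The sizes in the list above can be checked with a few lines of arithmetic. The sketch below rebuilds the input width and the layer shapes. It assumes plain fully connected layers with biases, which the card does not state explicitly, so the parameter count it prints is an estimate under that assumption.

```python
# Sizes taken from the architecture list above.
EMBEDDING_DIM = 1024       # camembert-large embedding size per mention
MENTION_FEATURES = 106
PAIR_FEATURES = 11
HIDDEN_UNITS = 1900
HIDDEN_LAYERS = 3

# Two mention embeddings plus the hand-crafted features.
input_dim = 2 * EMBEDDING_DIM + MENTION_FEATURES + PAIR_FEATURES

# Assumption: input -> 3 hidden layers -> 1 score, all fully connected.
layer_sizes = [input_dim] + [HIDDEN_UNITS] * HIDDEN_LAYERS + [1]


def dense_parameter_count(sizes):
    """Weights plus biases for a stack of fully connected layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))


print(input_dim)    # 2165
print(layer_sizes)  # [2165, 1900, 1900, 1900, 1]
print(dense_parameter_count(layer_sizes))
```
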
Model Output: Continuous prediction between 0 (not coreferent) and 1 (coreferent) indicating the degree of confidence.
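
Pairwise scores still have to be turned into entity clusters. The card does not describe that step, so the following is only a sketch of one common strategy, best-antecedent linking. Each mention is compared with a limited number of preceding mentions and joins the cluster of the highest-scoring one if the score exceeds a threshold. The function `link_mentions`, the `score_pair` callback and the 0.5 threshold are hypothetical. Treating the lookup sizes from the training parameters (30 for pronouns, 300 for nouns) as candidate windows is an interpretation, not documented behaviour.

```python
from typing import Callable, List

# Assumed candidate window per grammatical category (number of preceding mentions).
LOOKUP = {"pronoun": 30, "common noun": 300, "proper noun": 300}


def link_mentions(
    categories: List[str],
    score_pair: Callable[[int, int], float],
    threshold: float = 0.5,
) -> List[int]:
    """Return one cluster id per mention using best-antecedent linking.

    categories: grammatical category of each mention, in document order
    score_pair: hypothetical stand-in for the model; maps the indices
                (antecedent, mention) to a score between 0 and 1
    """
    cluster_of: List[int] = []
    next_cluster = 0
    for i, category in enumerate(categories):
        first = max(0, i - LOOKUP[category])
        best = max(range(first, i), key=lambda j: score_pair(j, i), default=None)
        if best is not None and score_pair(best, i) > threshold:
            cluster_of.append(cluster_of[best])   # join the antecedent's cluster
        else:
            cluster_of.append(next_cluster)       # no good antecedent: new entity
            next_cluster += 1
    return cluster_of


# Toy scores for four mentions; every unlisted pair gets a low score.
toy_scores = {(0, 1): 0.9, (2, 3): 0.8, (0, 3): 0.2}
clusters = link_mentions(
    ["proper noun", "pronoun", "proper noun", "pronoun"],
    lambda j, i: toy_scores.get((j, i), 0.1),
)
print(clusters)  # [0, 0, 1, 1]
```

In the toy run, the first pronoun attaches to the first proper noun and the second pronoun to the second proper noun, giving two entities.
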
## HOW TO USE:
*** UNDER CONSTRUCTION ***
## TRAINING CORPUS:
| | Document | Token count | Included in model evaluation |
|----|----------------------------------------------------------------|----------------|-----------------------------------|
| 0 | 1836_Gautier-Theophile_La-morte-amoureuse | 14,299 tokens | False |
| 1 | 1840_Sand-George_Pauline | 12,315 tokens | False |
| 2 | 1842_Balzac-Honore-de_La-Maison-du-chat-qui-pelote | 24,776 tokens | False |
| 3 | 1844_Balzac-Honore-de_La-Maison-Nucingen | 30,987 tokens | False |
| 4 | 1844_Balzac-Honore-de_Sarrasine | 15,408 tokens | False |
| 5 | 1856_Cousin-Victor_Madame-de-Hautefort | 11,768 tokens | False |
| 6 | 1863_Gautier-Theophile_Le-capitaine-Fracasse | 11,834 tokens | False |
| 7 | 1873_Zola-Emile_Le-ventre-de-Paris | 12,557 tokens | False |
| 8 | 1881_Flaubert-Gustave_Bouvard-et-Pecuchet | 12,281 tokens | False |
| 9 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_1-MADEMOISELLE-FIFI | 5,425 tokens | **True** |
| 10 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_2-MADAME-BAPTISTE | 2,554 tokens | **True** |
| 11 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_3-LA-ROUILLE | 2,929 tokens | **True** |
| 12 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_1-MARROCA | 4,067 tokens | False |
| 13 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_2-LA-BUCHE | 2,251 tokens | False |
| 14 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_3-LA-RELIQUE | 2,034 tokens | False |
| 15 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_1-FOU | 1,864 tokens | False |
| 16 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_2-REVEIL | 2,141 tokens | False |
| 17 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_3-UNE-RUSE | 2,441 tokens | False |
| 18 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_4-A-CHEVAL | 2,860 tokens | False |
| 19 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_5-UN-REVEILLON | 2,343 tokens | False |
| 20 | 1901_Lucie-Achard_Rosalie-de-Constant-sa-famille-et-ses-amis | 12,703 tokens | False |
| 21 | 1903_Conan-Laure_Elisabeth_Seton | 13,023 tokens | False |
| 22 | 1904_Rolland-Romain_Jean-Christophe_Tome-I-L-aube | 10,982 tokens | **True** |
| 23 | 1904_Rolland-Romain_Jean-Christophe_Tome-II-Le-matin | 10,305 tokens | False |
| 24 | 1917_Adèle-Bourgeois_Némoville | 12,389 tokens | False |
| 25 | 1923_Radiguet-Raymond_Le-diable-au-corps | 14,637 tokens | False |
| 26 | 1926_Audoux-Marguerite_De-la-ville-au-moulin | 11,902 tokens | **True** |
| 27 | 1937_Audoux-Marguerite_Douce-Lumiere | 12,285 tokens | False |
| 28 | Manon_Lescaut_PEDRO | 71,219 tokens | False |
| 29 | TOTAL | 346,579 tokens | 5 files used for cross-validation |
## CONTACT:
mail: antoine [dot] bourgois [at] protonmail [dot] com