---
language: fr
tags:
- coreference-resolution
- anaphora-resolution
- mentions-linking
- literary-texts
- camembert
- nested-entities
- BookNLP-fr
license: apache-2.0
metrics:
- MUC
- B3
- CEAF
- CoNLL-F1
base_model:
- almanach/camembert-large
---

## INTRODUCTION:
This model, developed as part of the [BookNLP-fr project](https://github.com/lattice-8094/fr-litbank), is a **coreference resolution model** built on top of [camembert-large](https://huggingface.co/almanach/camembert-large) embeddings. It is trained to link mentions that refer to the same entity across a text, with a focus on French literary works.

This specific model is trained to link entities of the following type: PER.

## MODEL PERFORMANCE (LOOCV):
Overall coreference resolution performance on non-overlapping windows of different lengths:
|    | Window width (tokens)   |   Document count |   Sample count | MUC F1   | B3 F1   | CEAFe F1   | CoNLL F1   |
|----|-------------------------|------------------|----------------|----------|---------|------------|------------|
|  0 | 500                     |                5 |             64 | 93.49%   | 86.27%  | 77.85%     | 85.87%     |
|  1 | 1,000                   |                5 |             30 | 93.68%   | 81.32%  | 71.92%     | 82.31%     |
|  2 | 2,000                   |                5 |             14 | 93.98%   | 76.90%  | 67.26%     | 79.38%     |
|  3 | 5,000                   |                3 |              5 | 94.83%   | 68.34%  | 59.88%     | 74.35%     |
|  4 | 10,000                  |                2 |              2 | 96.16%   | 62.22%  | 57.12%     | 71.84%     |

Coreference resolution performance on the fully annotated sample of each evaluated document:
|    | Token count   | Mention count   | MUC F1   | B3 F1   | CEAFe F1   | CoNLL F1   |
|----|---------------|-----------------|----------|---------|------------|------------|
|  0 | 2,554         | 330             | 90.24%   | 65.27%  | 72.36%     | 75.96%     |
|  1 | 2,929         | 386             | 95.65%   | 78.21%  | 64.23%     | 79.37%     |
|  2 | 5,425         | 558             | 90.46%   | 53.03%  | 59.52%     | 67.67%     |
|  3 | 10,982        | 1,095           | 97.18%   | 65.30%  | 60.49%     | 74.32%     |
|  4 | 11,902        | 1,692           | 95.03%   | 58.83%  | 45.59%     | 66.49%     |
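
In both tables, the CoNLL F1 column is consistent with the usual definition of that score as the unweighted mean of the MUC, B3, and CEAFe F1 scores. A minimal sketch of that computation, using the 500-token row of the first table; the function name `conll_f1` is illustrative and is not part of BookNLP-fr:

```python
def conll_f1(muc_f1: float, b3_f1: float, ceafe_f1: float) -> float:
    """Unweighted mean of the three coreference F1 scores."""
    return (muc_f1 + b3_f1 + ceafe_f1) / 3.0


# Scores from the 500-token row of the first table, in percent.
score = conll_f1(93.49, 86.27, 77.85)
print(f"{score:.2f}%")
```

Small differences in the last digit can appear for other rows, since the table reports scores that were already rounded.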

## TRAINING PARAMETERS:
- Entity types: PER
- Split strategy: Leave-one-out cross-validation (29 files)
- Train/Validation split: 0.85 / 0.15
- Batch size: 16,000
- Initial learning rate: 0.0004
- Focal loss gamma: 1
- Focal loss alpha: 0.25
- Pronoun lookup antecedents: 30
- Common and proper noun lookup antecedents: 300
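
To illustrate the two focal loss parameters listed above, here is a minimal sketch of the standard binary focal loss for a single mention pair, with gamma = 1 and alpha = 0.25. It assumes the common convention in which alpha weights the positive (coreferent) class and 1 - alpha the negative class; the model card does not state which convention the training code uses, and the function name is hypothetical:

```python
import math


def binary_focal_loss(p: float, label: int, gamma: float = 1.0, alpha: float = 0.25) -> float:
    """Focal loss for one mention pair.

    p     : predicted probability that the pair is coreferent
    label : 1 if the pair is coreferent, 0 otherwise
    """
    eps = 1e-12  # guards against log(0)
    # Probability assigned to the true class, and its class weight
    # (assumption: alpha applies to the positive class).
    p_t = p if label == 1 else 1.0 - p
    alpha_t = alpha if label == 1 else 1.0 - alpha
    p_t = min(max(p_t, eps), 1.0)
    # The (1 - p_t) ** gamma factor shrinks the loss of well-classified pairs.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = binary_focal_loss(0.9, 1)  # confident and correct: small loss
hard = binary_focal_loss(0.1, 1)  # confident and wrong: large loss
print(easy, hard)
```

With gamma = 1, the loss of a confidently correct pair is scaled down by its own error margin, so training concentrates on the harder pairs.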

## MODEL ARCHITECTURE:
Model Input: 2,165-dimensional vector
- Concatenated maximum context camembert-large embeddings (2 * 1,024 = 2,048 dimensions)
- Additional mention features (106 dimensions):
  - Length of mentions
  - Position of the mention's start token within the sentence
  - Grammatical category of the mentions (pronoun, common noun, proper noun)
  - Dependency relation of the mention's head (one-hot encoded)
  - Gender of the mentions (one-hot encoded)
  - Number (singular/plural) of the mentions (one-hot encoded)
  - Grammatical person of the mentions (one-hot encoded)
- Additional mention-pair features (11 dimensions):
  - Distance between mention IDs
  - Distance between start tokens of mentions
  - Distance between end tokens of mentions
  - Distance between sentences containing mentions
  - Distance between paragraphs containing mentions
  - Difference in nesting levels of mentions
  - Ratio of shared tokens between mentions
  - Exact text match between mentions (binary)
  - Exact match of mention heads (binary)
  - Match of syntactic heads between mentions (binary)
  - Match of entity types between mentions (binary)
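
The three blocks above add up to the stated input size: 2,048 + 106 + 11 = 2,165. A minimal sketch of how one pair vector could be assembled follows. The concatenation order and the function name are assumptions made for illustration, and the zero-filled lists stand in for real embeddings and features:

```python
EMBEDDING_DIM = 1024       # camembert-large embedding size, one per mention
MENTION_FEATURE_DIM = 106  # additional mention features, as listed above
PAIR_FEATURE_DIM = 11      # additional mention-pair features, as listed above


def build_pair_input(emb_a, emb_b, mention_features, pair_features):
    """Concatenate the blocks describing one mention pair into a flat vector."""
    assert len(emb_a) == EMBEDDING_DIM and len(emb_b) == EMBEDDING_DIM
    assert len(mention_features) == MENTION_FEATURE_DIM
    assert len(pair_features) == PAIR_FEATURE_DIM
    # Assumed order: both embeddings, then mention features, then pair features.
    return list(emb_a) + list(emb_b) + list(mention_features) + list(pair_features)


# Placeholder values only, to check the resulting dimensionality.
vector = build_pair_input(
    [0.0] * EMBEDDING_DIM,
    [0.0] * EMBEDDING_DIM,
    [0.0] * MENTION_FEATURE_DIM,
    [0.0] * PAIR_FEATURE_DIM,
)
print(len(vector))
```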

- Hidden Layers:
  - Number of layers: 3
  - Units per layer: 1,900 nodes
  - Activation function: ReLU
  - Dropout rate: 0.6

- Final Layer:
  - Type: Linear
  - Input: 1900 dimensions
  - Output: 1 dimension (mention pair coreference score)

Model Output: Continuous prediction between 0 (not coreferent) and 1 (coreferent) indicating the degree of confidence.
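
The scoring path described above (three ReLU hidden layers followed by a linear layer with a single output) can be sketched in plain Python. This is an illustration under stated assumptions, not the BookNLP-fr implementation: the sizes are toy values so that it runs without a deep learning library (the real model uses 2,165 inputs and 1,900 units per layer), the weights are random rather than trained, and the final sigmoid is an assumption made to map the linear output into the 0 to 1 range. Dropout is omitted because it is only active during training.

```python
import math
import random


def make_layer(n_in, n_out, rng):
    """Random (untrained) weights and zero biases for one dense layer."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def linear(x, layer):
    """Affine transform: one output per weight row."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def score_pair(x, hidden_layers, final_layer):
    """Forward pass: ReLU hidden layers, linear output, assumed sigmoid."""
    h = x
    for layer in hidden_layers:
        h = [max(0.0, v) for v in linear(h, layer)]  # ReLU
    logit = linear(h, final_layer)[0]
    return 1.0 / (1.0 + math.exp(-logit))  # assumption: sigmoid squashing


rng = random.Random(0)
INPUT_DIM, HIDDEN_DIM, N_HIDDEN = 8, 6, 3  # toy sizes, not the real ones
dims = [INPUT_DIM] + [HIDDEN_DIM] * N_HIDDEN
hidden = [make_layer(dims[i], dims[i + 1], rng) for i in range(N_HIDDEN)]
final = make_layer(HIDDEN_DIM, 1, rng)

score = score_pair([rng.random() for _ in range(INPUT_DIM)], hidden, final)
print(score)
```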

## HOW TO USE:
*** UNDER CONSTRUCTION ***

## TRAINING CORPUS:
|    | Document                                                       | Token count    | Included in model evaluation      |
|----|----------------------------------------------------------------|----------------|-----------------------------------|
|  0 | 1836_Gautier-Theophile_La-morte-amoureuse                      | 14,299 tokens  | False                             |
|  1 | 1840_Sand-George_Pauline                                       | 12,315 tokens  | False                             |
|  2 | 1842_Balzac-Honore-de_La-Maison-du-chat-qui-pelote             | 24,776 tokens  | False                             |
|  3 | 1844_Balzac-Honore-de_La-Maison-Nucingen                       | 30,987 tokens  | False                             |
|  4 | 1844_Balzac-Honore-de_Sarrasine                                | 15,408 tokens  | False                             |
|  5 | 1856_Cousin-Victor_Madame-de-Hautefort                         | 11,768 tokens  | False                             |
|  6 | 1863_Gautier-Theophile_Le-capitaine-Fracasse                   | 11,834 tokens  | False                             |
|  7 | 1873_Zola-Emile_Le-ventre-de-Paris                             | 12,557 tokens  | False                             |
|  8 | 1881_Flaubert-Gustave_Bouvard-et-Pecuchet                      | 12,281 tokens  | False                             |
|  9 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_1-MADEMOISELLE-FIFI | 5,425 tokens   | **True**                          |
| 10 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_2-MADAME-BAPTISTE   | 2,554 tokens   | **True**                          |
| 11 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_3-LA-ROUILLE        | 2,929 tokens   | **True**                          |
| 12 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_1-MARROCA           | 4,067 tokens   | False                             |
| 13 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_2-LA-BUCHE          | 2,251 tokens   | False                             |
| 14 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_3-LA-RELIQUE        | 2,034 tokens   | False                             |
| 15 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_1-FOU               | 1,864 tokens   | False                             |
| 16 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_2-REVEIL            | 2,141 tokens   | False                             |
| 17 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_3-UNE-RUSE          | 2,441 tokens   | False                             |
| 18 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_4-A-CHEVAL          | 2,860 tokens   | False                             |
| 19 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_5-UN-REVEILLON      | 2,343 tokens   | False                             |
| 20 | 1901_Lucie-Achard_Rosalie-de-Constant-sa-famille-et-ses-amis   | 12,703 tokens  | False                             |
| 21 | 1903_Conan-Laure_Elisabeth_Seton                               | 13,023 tokens  | False                             |
| 22 | 1904_Rolland-Romain_Jean-Christophe_Tome-I-L-aube              | 10,982 tokens  | **True**                          |
| 23 | 1904_Rolland-Romain_Jean-Christophe_Tome-II-Le-matin           | 10,305 tokens  | False                             |
| 24 | 1917_Adèle-Bourgeois_Némoville                                 | 12,389 tokens  | False                             |
| 25 | 1923_Radiguet-Raymond_Le-diable-au-corps                       | 14,637 tokens  | False                             |
| 26 | 1926_Audoux-Marguerite_De-la-ville-au-moulin                   | 11,902 tokens  | **True**                          |
| 27 | 1937_Audoux-Marguerite_Douce-Lumiere                           | 12,285 tokens  | False                             |
| 28 | Manon_Lescaut_PEDRO                                            | 71,219 tokens  | False                             |
| 29 | TOTAL                                                          | 346,579 tokens | 5 files used for cross-validation |

## CONTACT:
mail: antoine [dot] bourgois [at] protonmail [dot] com