---
language: fr
tags:
- coreference-resolution
- anaphora-resolution
- mentions-linking
- literary-texts
- camembert
- nested-entities
- BookNLP-fr
license: apache-2.0
metrics:
- MUC
- B3
- CEAF
- CoNLL-F1
base_model:
- almanach/camembert-large
---

## INTRODUCTION:
This model, developed as part of the [BookNLP-fr project](https://github.com/lattice-8094/fr-litbank), is a **coreference resolution model** built on top of [camembert-large](https://huggingface.co/almanach/camembert-large) embeddings. It is trained to link mentions of the same entity across a text, with a focus on French literary works.

This specific model has been trained to link mentions of the following entity type: PER.

## MODEL PERFORMANCE (LOOCV):
Overall coreference resolution performance on non-overlapping windows of different lengths:
|    | Window width (tokens)   |   Document count |   Sample count | MUC F1   | B3 F1   | CEAFe F1   | CoNLL F1   |
|----|-------------------------|------------------|----------------|----------|---------|------------|------------|
|  0 | 500                     |                5 |             64 | 93.49%   | 86.27%  | 77.85%     | 85.87%     |
|  1 | 1,000                   |                5 |             30 | 93.68%   | 81.32%  | 71.92%     | 82.31%     |
|  2 | 2,000                   |                5 |             14 | 93.98%   | 76.90%  | 67.26%     | 79.38%     |
|  3 | 5,000                   |                3 |              5 | 94.83%   | 68.34%  | 59.88%     | 74.35%     |
|  4 | 10,000                  |                2 |              2 | 96.16%   | 62.22%  | 57.12%     | 71.84%     |
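
The document and sample counts in this table are consistent with cutting each evaluated document into complete, non-overlapping windows and discarding the trailing partial window. The sketch below reproduces the counts under that assumption, using the token counts of the five evaluated documents listed elsewhere in this card; the function names are illustrative and not part of BookNLP-fr.

```python
# Token counts of the five evaluated documents (taken from this model card).
EVAL_DOC_TOKENS = [2554, 2929, 5425, 10982, 11902]


def full_windows(token_count: int, width: int) -> int:
    """Number of complete, non-overlapping windows of `width` tokens.

    Assumption: a trailing window shorter than `width` is dropped.
    """
    return token_count // width


def window_stats(doc_tokens, width):
    """Return (documents with at least one full window, total window count)."""
    counts = [full_windows(n, width) for n in doc_tokens]
    documents = sum(1 for c in counts if c > 0)
    samples = sum(counts)
    return documents, samples


for width in (500, 1000, 2000, 5000, 10000):
    documents, samples = window_stats(EVAL_DOC_TOKENS, width)
    print(f"{width:>6} tokens: {documents} documents, {samples} samples")
```

Under this assumption, documents shorter than the window width contribute no sample, which is why the document count drops for the 5,000 and 10,000 token windows.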

Coreference resolution performance on the fully annotated sample of each document:
|    | Token count   | Mention count   | MUC F1   | B3 F1   | CEAFe F1   | CoNLL F1   |
|----|---------------|-----------------|----------|---------|------------|------------|
|  0 | 2,554         | 330             | 90.24%   | 65.27%  | 72.36%     | 75.96%     |
|  1 | 2,929         | 386             | 95.65%   | 78.21%  | 64.23%     | 79.37%     |
|  2 | 5,425         | 558             | 90.46%   | 53.03%  | 59.52%     | 67.67%     |
|  3 | 10,982        | 1,095           | 97.18%   | 65.30%  | 60.49%     | 74.32%     |
|  4 | 11,902        | 1,692           | 95.03%   | 58.83%  | 45.59%     | 66.49%     |
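
The CoNLL F1 score is conventionally the unweighted mean of the MUC, B3 and CEAFe F1 scores, and the values reported in this card agree with that definition. A minimal sketch that recomputes it from the per-document scores follows; because the inputs are already rounded to two decimals, the result can differ from the reported value by 0.01.

```python
def conll_f1(muc: float, b3: float, ceafe: float) -> float:
    """Unweighted mean of the three coreference F1 scores (in percent)."""
    return (muc + b3 + ceafe) / 3


# (MUC F1, B3 F1, CEAFe F1, reported CoNLL F1) per document, from this card.
PER_DOCUMENT_SCORES = [
    (90.24, 65.27, 72.36, 75.96),
    (95.65, 78.21, 64.23, 79.37),
    (90.46, 53.03, 59.52, 67.67),
    (97.18, 65.30, 60.49, 74.32),
    (95.03, 58.83, 45.59, 66.49),
]

for muc, b3, ceafe, reported in PER_DOCUMENT_SCORES:
    recomputed = conll_f1(muc, b3, ceafe)
    print(f"recomputed={recomputed:.2f}  reported={reported:.2f}")
```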

## TRAINING PARAMETERS:
- Entity types: PER
- Split strategy: Leave-one-out cross-validation (29 files)
- Train/Validation split: 0.85 / 0.15
- Batch size: 16,000
- Initial learning rate: 0.0004
- Focal loss gamma: 1
- Focal loss alpha: 0.25
- Pronoun lookup antecedents: 30
- Common and Proper nouns lookup antecedents: 300
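
For reference, the focal loss parameters above can be read through the usual binary focal loss formula, `FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t)`. The sketch below is a plain-Python illustration of that formula, not the training code of this model; it assumes that `alpha` weights the positive (coreferent) class and `1 - alpha` the negative class, which is the common convention but is not stated in this card.

```python
import math


def binary_focal_loss(p: float, y: int, gamma: float = 1.0,
                      alpha: float = 0.25, eps: float = 1e-12) -> float:
    """Focal loss for one mention pair.

    p: predicted probability that the pair is coreferent.
    y: gold label (1 = coreferent, 0 = not coreferent).
    Assumption: alpha applies to the positive class, 1 - alpha to the negative.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    p_t = min(max(p_t, eps), 1.0)  # avoid log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# A confident correct prediction is down-weighted relative to a wrong one.
print(binary_focal_loss(0.9, 1))
print(binary_focal_loss(0.1, 1))
```

With `gamma = 0` the modulating factor disappears and the expression reduces to an alpha-weighted cross-entropy; larger `gamma` values shift the weight further toward hard, misclassified pairs.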

## MODEL ARCHITECTURE:
Model Input: 2,165-dimensional vector
- Concatenated maximum context camembert-large embeddings (2 * 1,024 = 2,048 dimensions)
- Additional mention features (106 dimensions):
  - Length of mentions
  - Position of the mention's start token within the sentence
  - Grammatical category of the mentions (pronoun, common noun, proper noun)
  - Dependency relation of the mention's head (one-hot encoded)
  - Gender of the mentions (one-hot encoded)
  - Number (singular/plural) of the mentions (one-hot encoded)
  - Grammatical person of the mentions (one-hot encoded)
- Additional mention-pair features (11 dimensions):
  - Distance between mention IDs
  - Distance between start tokens of mentions
  - Distance between end tokens of mentions
  - Distance between sentences containing mentions
  - Distance between paragraphs containing mentions
  - Difference in nesting levels of mentions
  - Ratio of shared tokens between mentions
  - Exact text match between mentions (binary)
  - Exact match of mention heads (binary)
  - Match of syntactic heads between mentions (binary)
  - Match of entity types between mentions (binary)

- Hidden Layers:
  - Number of layers: 3
  - Units per layer: 1,900 nodes
  - Activation function: ReLU
  - Dropout rate: 0.6

- Final Layer:
  - Type: Linear
  - Input: 1,900 dimensions
  - Output: 1 dimension (mention pair coreference score)

Model Output: Continuous prediction between 0 (not coreferent) and 1 (coreferent) indicating the degree of confidence.
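
The dimensions listed above can be checked with a few lines of bookkeeping. The sketch below derives the input size and the layer shapes, and estimates the number of trainable parameters. It assumes plain fully connected layers with bias terms and a sigmoid applied to the final linear score to obtain a value between 0 and 1; neither detail is stated in this card, and all names are illustrative.

```python
import math

EMBEDDING_DIMS = 2 * 1024     # concatenated camembert-large embeddings
MENTION_FEATURE_DIMS = 106    # additional mention features
PAIR_FEATURE_DIMS = 11        # additional mention-pair features
INPUT_DIMS = EMBEDDING_DIMS + MENTION_FEATURE_DIMS + PAIR_FEATURE_DIMS

HIDDEN_LAYERS = 3
HIDDEN_UNITS = 1900


def layer_shapes():
    """(input, output) sizes of each fully connected layer, final layer included."""
    sizes = [INPUT_DIMS] + [HIDDEN_UNITS] * HIDDEN_LAYERS + [1]
    return list(zip(sizes[:-1], sizes[1:]))


def parameter_count(shapes, bias: bool = True) -> int:
    """Weights plus (assumed) bias terms over all layers."""
    return sum(i * o + (o if bias else 0) for i, o in shapes)


def sigmoid(x: float) -> float:
    """Numerically stable logistic function (assumed output squashing)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


print(INPUT_DIMS)                      # 2165
print(layer_shapes())
print(parameter_count(layer_shapes()))
```

Under these assumptions the scoring network has roughly 11.3 million parameters, excluding the camembert-large encoder itself.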

## HOW TO USE:
*** UNDER CONSTRUCTION ***

## TRAINING CORPUS:
|    | Document                                                       | Token count    | Included in model evaluation      |
|----|----------------------------------------------------------------|----------------|-----------------------------------|
|  0 | 1836_Gautier-Theophile_La-morte-amoureuse                      | 14,299 tokens  | False                             |
|  1 | 1840_Sand-George_Pauline                                       | 12,315 tokens  | False                             |
|  2 | 1842_Balzac-Honore-de_La-Maison-du-chat-qui-pelote             | 24,776 tokens  | False                             |
|  3 | 1844_Balzac-Honore-de_La-Maison-Nucingen                       | 30,987 tokens  | False                             |
|  4 | 1844_Balzac-Honore-de_Sarrasine                                | 15,408 tokens  | False                             |
|  5 | 1856_Cousin-Victor_Madame-de-Hautefort                         | 11,768 tokens  | False                             |
|  6 | 1863_Gautier-Theophile_Le-capitaine-Fracasse                   | 11,834 tokens  | False                             |
|  7 | 1873_Zola-Emile_Le-ventre-de-Paris                             | 12,557 tokens  | False                             |
|  8 | 1881_Flaubert-Gustave_Bouvard-et-Pecuchet                      | 12,281 tokens  | False                             |
|  9 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_1-MADEMOISELLE-FIFI | 5,425 tokens   | **True**                          |
| 10 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_2-MADAME-BAPTISTE   | 2,554 tokens   | **True**                          |
| 11 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_3-LA-ROUILLE        | 2,929 tokens   | **True**                          |
| 12 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_1-MARROCA           | 4,067 tokens   | False                             |
| 13 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_2-LA-BUCHE          | 2,251 tokens   | False                             |
| 14 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_3-LA-RELIQUE        | 2,034 tokens   | False                             |
| 15 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_1-FOU               | 1,864 tokens   | False                             |
| 16 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_2-REVEIL            | 2,141 tokens   | False                             |
| 17 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_3-UNE-RUSE          | 2,441 tokens   | False                             |
| 18 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_4-A-CHEVAL          | 2,860 tokens   | False                             |
| 19 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_5-UN-REVEILLON      | 2,343 tokens   | False                             |
| 20 | 1901_Lucie-Achard_Rosalie-de-Constant-sa-famille-et-ses-amis   | 12,703 tokens  | False                             |
| 21 | 1903_Conan-Laure_Elisabeth_Seton                               | 13,023 tokens  | False                             |
| 22 | 1904_Rolland-Romain_Jean-Christophe_Tome-I-L-aube              | 10,982 tokens  | **True**                          |
| 23 | 1904_Rolland-Romain_Jean-Christophe_Tome-II-Le-matin           | 10,305 tokens  | False                             |
| 24 | 1917_Adèle-Bourgeois_Némoville                                 | 12,389 tokens  | False                             |
| 25 | 1923_Radiguet-Raymond_Le-diable-au-corps                       | 14,637 tokens  | False                             |
| 26 | 1926_Audoux-Marguerite_De-la-ville-au-moulin                   | 11,902 tokens  | **True**                          |
| 27 | 1937_Audoux-Marguerite_Douce-Lumiere                           | 12,285 tokens  | False                             |
| 28 | Manon_Lescaut_PEDRO                                            | 71,219 tokens  | False                             |
| 29 | TOTAL                                                          | 346,579 tokens | 5 files used for cross-validation |
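
The leave-one-out strategy over the 29 files can be sketched as follows: each fold holds out one document for evaluation, and the remaining 28 are divided into training and validation parts with the 0.85 / 0.15 ratio given in the training parameters. This card does not say whether that ratio is applied to documents or to mention-pair samples, so the generic `train_validation_split` helper below is a hypothetical illustration that works on any list of items; the document names are placeholders.

```python
import random


def leave_one_out_folds(documents):
    """Yield (held_out_document, remaining_documents) for every document."""
    for i, held_out in enumerate(documents):
        yield held_out, documents[:i] + documents[i + 1:]


def train_validation_split(items, train_ratio: float = 0.85, seed: int = 0):
    """Shuffle `items` reproducibly and cut them at `train_ratio`.

    Assumption: the unit being split (documents or samples) is not
    specified in this card; any list of items can be passed in.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


documents = [f"doc_{i:02d}" for i in range(29)]  # placeholder names
folds = list(leave_one_out_folds(documents))

held_out, remaining = folds[0]
train, validation = train_validation_split(remaining)
print(held_out, len(train), len(validation))
```

Only the five folds marked **True** in the table have evaluation results reported in this card.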

## CONTACT:
mail: antoine [dot] bourgois [at] protonmail [dot] com