---
license: mit
datasets:
- wer_leitet
language:
- en
metrics:
- loss
- accuracy
- recall
- precision
- f1
tags:
- entity-matching
- similarity-comparison
- preprocessing
- neer-match
model-index:
- name: Wer Leitet Entity Matching Model
  results:
  - task:
      type: entity-matching
      name: Entity Matching
    dataset:
      type: wer_leitet
      name: Wer Leitet
      config: default
      split: test
    metrics:
    - type: loss
      value: 4.6261e-06
      name: Test Loss
    - type: accuracy
      value: 1.0
      name: Test Accuracy
    - type: recall
      value: 1.0
      name: Test Recall
    - type: precision
      value: 1.0
      name: Test Precision
    - type: f1
      value: 1.0
      name: Test F1 Score
---

## Preprocessing

Before training, the `wer_leitet` dataset was preprocessed using the `prepare.format` function from the `neer-match-utilities` library. The following preprocessing steps were applied:

1. **String Standardization**:
   - Missing string values were replaced with placeholders.
   - All string fields were capitalized to ensure consistency in text formatting.
2. **Identification of Common Names**:
   - Common names were defined as those falling within the 95th percentile of the distribution for first and last names.

These preprocessing steps ensured that the input data was harmonized and ready for training, improving the model's ability to compare and match records effectively.
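The two steps above can be illustrated with a minimal, library-free sketch. This is not the implementation of `prepare.format`. The placeholder value, the helper names, and the reading of "95th percentile" as a nearest-rank cut-off on name frequencies are all assumptions made for illustration, as is interpreting "capitalized" as upper-casing.

```python
from collections import Counter
import math

PLACEHOLDER = "MISSING"  # hypothetical placeholder; the real value may differ


def standardize(value):
    """Replace a missing value with a placeholder, otherwise upper-case it."""
    if value is None or value == "":
        return PLACEHOLDER
    return str(value).upper()


def common_values(values, percentile=95):
    """Return the values whose frequency reaches the given percentile
    of all observed frequencies (nearest-rank method, an assumption)."""
    counts = Counter(values)
    ordered = sorted(counts.values())
    rank = max(1, math.ceil(percentile / 100 * len(ordered)))
    threshold = ordered[rank - 1]
    return {value for value, count in counts.items() if count >= threshold}


raw_first_names = ["Anna", "anna", None, "Bert", "ANNA", "Carl"]
first_names = [standardize(name) for name in raw_first_names]

# Binary indicator comparable to a `common_name` column.
common = common_values(first_names)
common_name_flags = [int(name in common) for name in first_names]
```

In this toy input only the most frequent name is flagged as common; on real data the cut-off depends on how the percentile is defined in the library.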

---

## Similarity Map

The model uses a `SimilarityMap` to compute similarity scores between attributes of records. The following similarity metrics were applied:

```python
similarity_map = {
    "main_info": ["levenshtein", "jaro_winkler", "partial_ratio", "token_sort_ratio", "token_set_ratio", "partial_token_set_ratio"],
    "Vorstand": ["levenshtein", "jaro_winkler", "notmissing"],
    "StVdAR": ["levenshtein", "jaro_winkler", "notmissing"],
    "address": ["levenshtein", "jaro_winkler", "partial_ratio", "token_sort_ratio", "token_set_ratio", "partial_token_set_ratio", "notmissing"],
    "birth_date": ["discrete", "notmissing"],
    "raw_text": ["token_set_ratio", "partial_token_set_ratio", "notmissing"],
    "common_name": ["discrete", "notmissing"],
    "common_surname": ["discrete", "notmissing"],
}
```
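To make the map more concrete, the sketch below expands it into (field, metric) pairs and shows toy versions of two metrics. It assumes that each pair contributes one similarity score per record pair, and that `discrete` means exact equality and `levenshtein` means a normalized edit-distance similarity. The functions are illustrative stand-ins, not the library's own implementations.

```python
similarity_map = {
    "main_info": ["levenshtein", "jaro_winkler", "partial_ratio", "token_sort_ratio", "token_set_ratio", "partial_token_set_ratio"],
    "Vorstand": ["levenshtein", "jaro_winkler", "notmissing"],
    "StVdAR": ["levenshtein", "jaro_winkler", "notmissing"],
    "address": ["levenshtein", "jaro_winkler", "partial_ratio", "token_sort_ratio", "token_set_ratio", "partial_token_set_ratio", "notmissing"],
    "birth_date": ["discrete", "notmissing"],
    "raw_text": ["token_set_ratio", "partial_token_set_ratio", "notmissing"],
    "common_name": ["discrete", "notmissing"],
    "common_surname": ["discrete", "notmissing"],
}

# Flatten the map: one entry per (field, metric) combination.
pairs = [(field, metric) for field, metrics in similarity_map.items() for metric in metrics]


def discrete(a, b):
    """Toy exact-match similarity: 1.0 if equal, else 0.0."""
    return 1.0 if a == b else 0.0


def levenshtein_similarity(a, b):
    """Toy normalized Levenshtein similarity in [0, 1]."""
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return 1.0 - prev[-1] / max(len(a), len(b))


print(len(pairs))
print(levenshtein_similarity("KITTEN", "SITTING"))
```

Under the one-score-per-pair assumption, the map above yields 28 similarity features for every compared record pair.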

---

## Fitting the Model

The model was trained using the `fit` method and the binary cross-entropy (BCE) loss function.
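For reference, binary cross-entropy averages the negative log-likelihood of the true label under the predicted match probability. The sketch below is a plain-Python illustration of that formula, with a hypothetical function name and a small clipping constant added for numerical safety. It is not the loss implementation used during training.

```python
import math


def binary_cross_entropy(labels, probabilities, eps=1e-12):
    """Mean BCE over paired labels (0 or 1) and predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, probabilities):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


# A confident, correct prediction gives a loss close to zero.
loss = binary_cross_entropy([1, 0], [0.9, 0.1])
```

A very small test loss, such as the one reported in the metadata, therefore indicates predicted probabilities very close to the true labels on the test pairs.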

### Training Configuration
The training parameters deviated from the default values in the following ways:
- **Epochs**: 150
- **Mismatch Share**: 0.3

Before training, the labeled data was split into training and test sets using the `split_test_train` method of `neer_match_utilities` with a `test_ratio` of 0.3.
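
The effect of a 0.3 test ratio can be sketched with a simple random split over labeled matches. This is only an illustration of the idea, not the `split_test_train` method itself: the function name, the seed, and the representation of matches as index pairs are assumptions.

```python
import random


def split_matches(matches, test_ratio=0.3, seed=42):
    """Shuffle labeled matches and hold out a share of them for testing."""
    shuffled = list(matches)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)


# Hypothetical labeled matches as (left_index, right_index) pairs.
matches = [(i, i) for i in range(100)]
train_matches, test_matches = split_matches(matches, test_ratio=0.3)
```

With 100 labeled matches, 70 end up in the training set and 30 in the test set, and no match appears in both.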