---
tags:
- spacy
- token-classification
language:
- en
widget:
  - text: "Moses received the Torah from Mount Sinai (Avot Chapter 1 Mishnah 1)"
  - text: "Yerushalmi Peah 23d"
  - text: "Rashi argues with Tosafot about this. See Rashi Bava Kamma 40a s.v. Amar."
  - text: "Bet Shammai says he is obligated to pay while Bet Hillel says he is exempt."
model-index:
- name: en_torah_ner
  results:
  - task:
      name: NER
      type: token-classification
    metrics:
    - name: NER Precision
      type: precision
      value: 0.8413793103
    - name: NER Recall
      type: recall
      value: 0.87517934
    - name: NER F Score
      type: f_score
      value: 0.8579465541
---

See below for technical details about the model.

# Description

This is a named entity recognition (NER) model trained to run on English text that discusses Torah topics (e.g. dvar torahs, Torah blogs, and translations of classic Torah texts).

It detects the following types of entities:

| Label | Description |
|---|---|
| Person | Name of a person |
| Group | Name of a group of people. E.g. nations (Egypt), schools (Bet Hillel, Tosafot) |
| Citation | Citations to Torah texts. See notes below. |
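
As a minimal sketch of how these labels might be consumed outside of Sefaria-Project, the snippet below groups detected entities by label. The helper names `group_entities_by_label` and `tag_text` are hypothetical, the model path is a placeholder, and the input is assumed to have already been normalized as described in the normalization notes.

```python
from collections import defaultdict


def group_entities_by_label(ents):
    """Group entity texts by label.

    `ents` is any iterable of objects exposing `.text` and `.label_`,
    such as spaCy's `doc.ents`.
    """
    grouped = defaultdict(list)
    for ent in ents:
        grouped[ent.label_].append(ent.text)
    return dict(grouped)


def tag_text(model_path, text):
    """Load the model from a local path (placeholder) and tag `text`."""
    import spacy  # imported lazily so the helper above works without spaCy

    nlp = spacy.load(model_path)
    return group_entities_by_label(nlp(text).ents)
```

Calling `tag_text("/path/to/en_torah_ner", text)` would return a dictionary keyed by `Person`, `Group`, and `Citation`, containing only the labels that were actually found in the text.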

## Notes on normalization

All training text was first passed through [this normalizer](https://github.com/Sefaria/Machine-Learning/blob/main/util/helper.py#L43).
Results will be significantly worse if input text is not passed through the same normalizer.

## Notes on citation matches

- The final parenthesis is not included in the match. E.g. if the citation is `Genesis (1:1)`, the closing `)` is left out of the entity. We found that the model would get confused if the final parenthesis was part of the entity. It is fairly simple to add it back in via a deterministic check.
- Only the first word of a dibur hamatchil is included in the match. E.g. in `Tosafot s.v. Amar Rabbi Akiva`, the match extends only through the word `Amar`. We found the model had trouble determining the end of the dibur hamatchil.
- See the [subref model](https://huggingface.co/Sefaria/en_subref_ner) for a model that breaks citations down into parts so they are simpler to parse.
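
One way the deterministic parenthesis check could look is sketched below. This is an illustration under stated assumptions, not the check Sefaria-Project itself uses: the function name `extend_to_closing_paren` is hypothetical, and it works on plain character offsets of the tagged span.

```python
def extend_to_closing_paren(text, start, end):
    """Return an end offset that includes a trailing ')' when needed.

    The span text[start:end] is extended by one character only if it
    contains an unmatched '(' and the next character is ')'.
    """
    span = text[start:end]
    unmatched_open = span.count("(") - span.count(")")
    if unmatched_open > 0 and end < len(text) and text[end] == ")":
        return end + 1
    return end


text = "See Genesis (1:1) for details."
start = text.index("Genesis")
end = text.index(")")  # the model's match stops just before ')'
print(text[start:extend_to_closing_paren(text, start, end)])  # Genesis (1:1)
```

The check leaves a span untouched when the opening parenthesis lies outside the entity, as in `(Avot Chapter 1 Mishnah 1)`, where the parentheses wrap the citation rather than belong to it.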

## Using with Sefaria-Project

The [Sefaria-Project](https://github.com/Sefaria/Sefaria-Project) repo can use this model to return objects linked to objects in the Sefaria database. Non-citation entities are linked to `Topic` objects and citation entities are linked to `Ref` objects.

Note that this model is designed to be used in conjunction with the corresponding [subref model](https://huggingface.co/Sefaria/en_subref_ner). That model takes citations as input and tags the parts of each citation. The instructions below explain how to integrate both models into Sefaria-Project.

### Configuring Sefaria-Project to use this model

These instructions assume that Sefaria-Project is already set up in your environment following the instructions in our [README](https://github.com/Sefaria/Sefaria-Project/blob/master/README.mkd).

Download this repo and the [subref repo](https://huggingface.co/Sefaria/en_subref_ner).

In `local_settings.py`, modify the following lines:

```python
ENABLE_LINKER = True

RAW_REF_MODEL_BY_LANG_FILEPATH = {
    "en": "/path/to/en-ref-ner model",
}

RAW_REF_PART_MODEL_BY_LANG_FILEPATH = {
    "en": "/path/to/en-subref-ner model",
}
```

Make sure spaCy is installed.

```bash
pip install spacy==3.4.1
```
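
The Technical Details table lists the compatible spaCy range as `>=3.4.1,<3.5.0`. As a rough sketch, an installed version could be checked against that range as follows. The helper names are hypothetical and the bounds are simply copied from that table.

```python
import re

# Assumed bounds, copied from the spaCy range in the Technical Details table.
MIN_VERSION = (3, 4, 1)  # inclusive
MAX_VERSION = (3, 5, 0)  # exclusive


def parse_version(version):
    """Return the first three numeric components of a version string."""
    parts = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    parts += [0] * (3 - len(parts))  # pad short versions such as "3.4"
    return tuple(parts)


def is_supported_spacy(version):
    """True if `version` falls inside the assumed compatible range."""
    return MIN_VERSION <= parse_version(version) < MAX_VERSION
```

With spaCy installed, `is_supported_spacy(spacy.__version__)` reports whether the installed version falls in that range.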

### Running the model with Sefaria-Project

The following example instantiates the `Linker` object, which uses the ML models, and runs it on input text.

```python
import django
django.setup()  # initialize Django before importing Sefaria models
from sefaria.model.text import library

text = "Moses received the Torah from Har Sinai (Avot Chapter 1 Mishnah 1)"
linker = library.get_linker("en")
doc = linker.link(text)

print("Named entities")
for resolved_named_entity in doc.resolved_named_entities:
    print("---")
    print("Text:", resolved_named_entity.raw_entity.text)
    print("Topic Slug:", resolved_named_entity.topic.slug)

print("Citations")
for resolved_ref in doc.resolved_refs:
    print("---")
    print("Text:", resolved_ref.raw_entity.text)
    print("Ref:", resolved_ref.ref.normal())
```

# Technical Details

| Feature | Description |
| --- | --- |
| **Name** | `en_torah_ner` |
| **Version** | `1.0.0` |
| **spaCy** | `>=3.4.1,<3.5.0` |
| **Default Pipeline** | `tok2vec`, `ner` |
| **Components** | `tok2vec`, `ner` |
| **Vectors** | 218765 keys, 218765 unique vectors (50 dimensions) |
| **Sources** | n/a |
| **License** | GPLv3.0 |
| **Author** | Sefaria |

### Label Scheme

<details>

<summary>View label scheme (3 labels for 1 component)</summary>

| Component | Labels |
| --- | --- |
| **`ner`** | `Citation`, `Group`, `Person` |

</details>

### Accuracy

| Type | Score |
| --- | --- |
| `ENTS_F` | 85.79 |
| `ENTS_P` | 84.14 |
| `ENTS_R` | 87.52 |
| `TOK2VEC_LOSS` | 136797.07 |
| `NER_LOSS` | 95967.72 |
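
Assuming `ENTS_F` is the usual F1 score (the harmonic mean of precision and recall), it can be reproduced from the precision and recall values in the model card metadata. The function name `f_score` is illustrative.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values taken from the model-index metadata at the top of this card.
precision = 0.8413793103
recall = 0.87517934

print(round(f_score(precision, recall) * 100, 2))  # 85.79
```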