tags:
  - spacy
  - token-classification
language:
  - en
model-index:
  - name: en_torah_ner
    results:
      - task:
          name: NER
          type: token-classification
        metrics:
          - name: NER Precision
            type: precision
            value: 0.8413793103
          - name: NER Recall
            type: recall
            value: 0.87517934
          - name: NER F Score
            type: f_score
            value: 0.8579465541

See below for technical details about the model.

Description

This is a named entity recognition (NER) model trained to run on text that discusses Torah topics (e.g. divrei Torah, Torah blogs, and translations of classic Torah texts).

It detects the following types of entities:

| Label | Description |
| --- | --- |
| Person | Name of a person |
| Group | Name of a group of people. E.g. nations (Egypt), schools (Bet Hillel, Tosafot) |
| Citation | Citations to Torah texts. See notes below. |
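The labels above are what spaCy exposes as `ent.label_` on each detected entity. The following is a minimal sketch of running the pipeline directly with spaCy and grouping the results by label. The helper names `group_entities_by_label` and `load_pipeline` are hypothetical, and the model path is a placeholder for wherever your local copy of the pipeline lives.

```python
from collections import defaultdict


def group_entities_by_label(doc):
    """Map each entity label to the list of entity texts found in a spaCy Doc."""
    grouped = defaultdict(list)
    for ent in doc.ents:
        grouped[ent.label_].append(ent.text)
    return dict(grouped)


def load_pipeline(model_path):
    """Load the pipeline from disk. model_path is a placeholder, not a real path."""
    import spacy  # imported lazily so the helper above works on any Doc-like object

    return spacy.load(model_path)
```

Assuming the pipeline is available locally, `group_entities_by_label(load_pipeline("/path/to/en_torah_ner")(text))` returns a dict keyed by whichever of Person, Group, and Citation were found in `text`.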

Notes on citation matches

  • The final parenthesis is not included in the match. E.g. if the citation is Genesis (1:1), the closing parenthesis will not be part of the entity. We found that the model would get confused if the final parenthesis was part of the entity. It is fairly simple to add it back in via a deterministic check.
  • Only the first word of a dibur hamatchil is included in the match. E.g. for Tosafot s.v. Amar Rabbi Akiva, only the text up to and including the word Amar will be tagged. We found the model had trouble determining the end of the dibur hamatchil.
  • See Ref part model for a model that can break down citations into chunks so it is simpler to parse them.
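One possible form of the deterministic check for the final parenthesis is sketched below. It is an illustration, not necessarily the check Sefaria uses. The function name `restore_closing_paren` is hypothetical, and the offsets are assumed to be character offsets into the original text (for a spaCy entity, `ent.start_char` and `ent.end_char`).

```python
def restore_closing_paren(text, start, end):
    """Extend the character span [start, end) by one if a ")" was left out.

    The span is extended only when it contains more "(" than ")" and the
    character immediately after it is ")". Otherwise it is returned unchanged.
    """
    span = text[start:end]
    has_unclosed_open = span.count("(") > span.count(")")
    if has_unclosed_open and end < len(text) and text[end] == ")":
        return start, end + 1
    return start, end


text = "See Genesis (1:1) for details"
start = text.index("Genesis")
end = text.index(")")  # the tagged entity stops just before the ")"
new_start, new_end = restore_closing_paren(text, start, end)
print(text[new_start:new_end])  # → Genesis (1:1)
```

A citation that is wrapped entirely in parentheses, with neither parenthesis inside the entity, is left untouched because the span contains no unmatched opening parenthesis.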

Using with Sefaria-Project

The Sefaria-Project repo can use this model to link detected entities to objects in the Sefaria database. Non-citation entities are linked to Topic objects and citation entities are linked to Ref objects.

Configuring Sefaria-Project to use this model

These instructions assume that Sefaria-Project is set up in your environment following the instructions in our README.

In local_settings.py, modify the following lines:

RAW_REF_MODEL_BY_LANG_FILEPATH = {
   "en": "/path/to/torah-ner-english model"
}

Running the model with Sefaria-Project

The following code instantiates the Linker object, which uses the ML models, and runs it on a sample input.

import django

# Initialize Django before importing any Sefaria models.
django.setup()
from sefaria.model.text import library

text = "Moses received the Torah from Har Sinai (Avot Chapter 1 Mishnah 1)"
# Build the English linker, which relies on the model path set in local_settings.py.
linker = library.get_linker("en")
doc = linker.link(text)

# Non-citation entities are linked to Topic objects.
print("Named entities")
for resolved_named_entity in doc.resolved_named_entities:
    print("---")
    print("Text:", resolved_named_entity.raw_entity.text)
    print("Topic Slug:", resolved_named_entity.topic.slug)

# Citation entities are linked to Ref objects.
print("Citations")
for resolved_ref in doc.resolved_refs:
    print("---")
    print("Text:", resolved_ref.raw_entity.text)
    print("Ref:", resolved_ref.ref.normal())

Technical Details

| Feature | Description |
| --- | --- |
| Name | en_torah_ner |
| Version | 1.0.0 |
| spaCy | >=3.4.1,<3.5.0 |
| Default Pipeline | tok2vec, ner |
| Components | tok2vec, ner |
| Vectors | 218765 keys, 218765 unique vectors (50 dimensions) |
| Sources | n/a |
| License | GPLv3.0 |
| Author | Sefaria |

Label Scheme

View label scheme (3 labels for 1 components)
| Component | Labels |
| --- | --- |
| ner | Citation, Group, Person |

Accuracy

| Type | Score |
| --- | --- |
| ENTS_F | 85.79 |
| ENTS_P | 84.14 |
| ENTS_R | 87.52 |
| TOK2VEC_LOSS | 136797.07 |
| NER_LOSS | 95967.72 |
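ENTS_F is the harmonic mean of ENTS_P and ENTS_R (the F1 score), and the table reports the values from the model-index metadata as percentages. The short sketch below recomputes it from the precision and recall given in the metadata. The function name `f_score` is only illustrative.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (the F1 score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


precision = 0.8413793103  # NER Precision from the metadata
recall = 0.87517934       # NER Recall from the metadata
f1 = f_score(precision, recall)
print(round(f1 * 100, 2))  # → 85.79
```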