See below for technical details about the model.

Description

This model is a named entity recognition model that was trained to run on text that discusses Torah topics (e.g. dvar torahs, Torah blogs, translations of classic Torah texts etc.).

It detects the following types of entities:

Label	Description
מקור	Citations to Torah texts. See notes below.

Notes on normalization

All text the model was trained on was first passed through the following normalizer: link. Results will be significantly worse if this normalizer is not applied to the input.

Notes on citation matches

The final parenthesis is not included in the match. E.g. if the citation is בראשית (א:א), the closing parenthesis is not part of the tagged entity. We found that the model got confused when the final parenthesis was part of the entity. It is fairly simple to add it back in via a deterministic check.
Only the first word of a dibur hamatchil is included in the match. E.g. in תוספות ד״ה אמר רבי עקיבא, only the text up to and including the word אמר is tagged. We found the model had trouble determining the end of the dibur hamatchil.
See the Ref part model for a model that breaks citations down into chunks so they are simpler to parse.
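The deterministic check for the final parenthesis could look roughly like the sketch below. It is not part of the model or of Sefaria-Project: `restore_final_paren` is a hypothetical helper, and it assumes you have the entity as character offsets into the original text (for a spaCy span, `start_char` and `end_char`).

```python
from typing import Tuple


def restore_final_paren(text: str, start: int, end: int) -> Tuple[int, int]:
    """Extend the span [start, end) by one character if it dropped a closing parenthesis.

    Hypothetical helper: the span is extended only when it contains more "("
    than ")" and the character right after the span is ")".
    """
    span = text[start:end]
    unbalanced = span.count("(") > span.count(")")
    if unbalanced and end < len(text) and text[end] == ")":
        return start, end + 1
    return start, end


# Example using the citation from the note above, embedded in a longer string.
text = "ראה בראשית (א:א) שם"
start = text.index("בראשית")
end = text.index(")")  # the model's match stops just before the final parenthesis
new_start, new_end = restore_final_paren(text, start, end)
print(text[new_start:new_end])
```

The check only looks at the single character following the match, so it leaves spans untouched when the parentheses are already balanced or when the closing parenthesis is separated from the citation by other characters.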

Using with Sefaria-Project

The Sefaria-Project repo can use this model to return objects linked to objects in the Sefaria database. Non-citation entities are linked to Topic objects and citation entities are linked to Ref objects.

Note that this model is designed to be used in conjunction with the corresponding subref model. That model takes citations as input and tags the parts of each citation. The instructions below explain how to integrate both models into Sefaria-Project.

Configuring Sefaria-Project to use this model

These instructions assume that Sefaria-Project is set up in your environment following the instructions in our README.

Download this repo and the subref repo.

In local_settings.py, modify the following lines:

ENABLE_LINKER = True

RAW_REF_MODEL_BY_LANG_FILEPATH = {
    "he": "/path/to/he-ref-ner model",
}

RAW_REF_PART_MODEL_BY_LANG_FILEPATH = {
    "he": "/path/to/he-subref-ner model",
}
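Before starting the server it can help to confirm that the configured paths actually exist. The following is a minimal sketch, not something Sefaria-Project provides: `missing_model_paths` is a hypothetical helper, and the dictionaries simply repeat the placeholder values from the settings above.

```python
from pathlib import Path
from typing import Dict, List


def missing_model_paths(paths_by_lang: Dict[str, str]) -> List[str]:
    """Return the language codes whose configured model path does not exist."""
    return [lang for lang, path in paths_by_lang.items() if not Path(path).exists()]


# Placeholder values copied from the settings above; substitute your real paths.
RAW_REF_MODEL_BY_LANG_FILEPATH = {"he": "/path/to/he-ref-ner model"}
RAW_REF_PART_MODEL_BY_LANG_FILEPATH = {"he": "/path/to/he-subref-ner model"}

for name, setting in [
    ("RAW_REF_MODEL_BY_LANG_FILEPATH", RAW_REF_MODEL_BY_LANG_FILEPATH),
    ("RAW_REF_PART_MODEL_BY_LANG_FILEPATH", RAW_REF_PART_MODEL_BY_LANG_FILEPATH),
]:
    missing = missing_model_paths(setting)
    if missing:
        print(f"{name}: nothing found on disk for languages {missing}")
```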

Make sure spaCy is installed.

pip install spacy==3.4.1
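The Technical Details section lists the supported spaCy range as `>=3.4.1,<3.5.0`. If you already have spaCy installed, a check along the following lines can tell you whether your version falls in that range. `spacy_version_ok` is a hypothetical helper written for this illustration, and it only compares the leading `major.minor.patch` numbers of the version string.

```python
import re


def spacy_version_ok(version: str) -> bool:
    """Return True if `version` satisfies >=3.4.1,<3.5.0 (the range this model lists)."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        return False
    parts = tuple(int(p) for p in match.groups())
    return (3, 4, 1) <= parts < (3, 5, 0)


print(spacy_version_ok("3.4.1"))  # True
print(spacy_version_ok("3.5.0"))  # False
```

In an environment where spaCy is installed, you would pass the real version string, i.e. `spacy_version_ok(spacy.__version__)` after `import spacy`.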

Running the model with Sefaria-Project

The following code instantiates the Linker object, which uses the ML models, and runs it on sample input.

import django
django.setup()  # Django must be configured before importing Sefaria models
from sefaria.model.text import library

# Sample input: a sentence followed by a parenthesized citation
text = "משה קבל תורה מסיני (אבות פרק א משנה א)"
linker = library.get_linker("he")
doc = linker.link(text)

# Non-citation entities are linked to Topic objects
print("Named entities")
for resolved_named_entity in doc.resolved_named_entities:
    print("---")
    print("Text:", resolved_named_entity.raw_entity.text)
    print("Topic Slug:", resolved_named_entity.topic.slug)

# Citation entities are linked to Ref objects
print("Citations")
for resolved_ref in doc.resolved_refs:
    print("---")
    print("Text:", resolved_ref.raw_entity.text)
    print("Ref:", resolved_ref.ref.normal())

Technical Details

Feature	Description
Name	`he_ref_ner`
Version	`1.0.0`
spaCy	`>=3.4.1,<3.5.0`
Default Pipeline	`tok2vec`, `ner`
Components	`tok2vec`, `ner`
Vectors	391957 keys, 391957 unique vectors (50 dimensions)
Sources	n/a
License	n/a
Author	n/a

Label Scheme

Component	Labels
`ner`	`מקור`

Accuracy

Type	Score
`ENTS_F`	82.96
`ENTS_P`	86.42
`ENTS_R`	79.77
`TOK2VEC_LOSS`	44775.36
`NER_LOSS`	4561.19
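As a sanity check on the table above, and assuming `ENTS_F` is the usual balanced F-score (the harmonic mean of precision and recall), the reported value can be reproduced from `ENTS_P` and `ENTS_R`. The function name `f_score` is introduced here only for illustration.

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (balanced F-score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# ENTS_P and ENTS_R from the table above, on a 0-100 scale.
print(round(f_score(86.42, 79.77), 2))  # 82.96
```

The result agrees with the reported `ENTS_F` of 82.96, so the three scores are mutually consistent under that assumption.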
