metadata
license: cc-by-4.0
language:
  - he
inference: false

DictaBERT: A State-of-the-Art BERT Suite for Modern Hebrew

State-of-the-art language model for parsing Hebrew, released here.

This is the fine-tuned model for the joint parsing of the following tasks:

  • Prefix Segmentation
  • Morphological Disambiguation
  • Lexicographical Analysis (Lemmatization)
  • Syntactical Parsing (Dependency-Tree)
  • Named-Entity Recognition

This model was initialized from dictabert-large-joint and tuned on the Hebrew UD Treebank and NEMO corpora, to align the predictions of the model to the tagging methodology in those corpora.

A live demo of the dictabert-joint model with instant visualization of the syntax tree can be found here.

For a faster model, you can use the equivalent bert-tiny model for this task here.

For the bert-base models for other tasks, see here.


The model currently supports 3 types of output:

  1. JSON: The model returns a JSON object for each sentence in the input, where for each sentence we have the sentence text, the NER entities, and the list of tokens. For each token we include the output from each of the tasks.

    model.predict(..., output_style='json')
    
  2. UD: The model returns the full UD output for each sentence, according to the style of the Hebrew UD Treebank.

    model.predict(..., output_style='ud')
    
  3. UD, in the style of IAHLT: The model returns the full UD output, with slight modifications to match the style of IAHLT. The differences are mostly in the granularity of some dependency relations, how the suffix of a word is broken up, and implicit definite articles. The actual tagging behavior doesn't change.

    model.predict(..., output_style='iahlt_ud')
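
The three styles above differ only in the value passed as output_style. As a minimal sketch, a thin wrapper can reject an unknown style name before the model is called. The names predict_styled and VALID_STYLES are hypothetical and not part of the model's code; the sketch assumes predict accepts a list of sentences, the tokenizer, and output_style, as in the sample usage.

```python
# Style names are taken from the list above; the helper itself is illustrative only.
VALID_STYLES = ("json", "ud", "iahlt_ud")


def predict_styled(model, tokenizer, sentences, output_style="json"):
    """Call model.predict after checking that output_style is a documented value."""
    if output_style not in VALID_STYLES:
        raise ValueError(
            f"unknown output_style {output_style!r}; expected one of {VALID_STYLES}"
        )
    # predict is given a list, matching the sample usage in this README.
    return model.predict(list(sentences), tokenizer, output_style=output_style)
```

A typo such as output_style='conll' then fails with a ValueError that names the three accepted values.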
    

If you only need the output for one of the tasks, you can tell the model to not initialize some of the heads, for example:

model = AutoModel.from_pretrained('dicta-il/dictabert-large-parse', trust_remote_code=True, do_lex=False)

The available options are: do_lex, do_syntax, do_ner, do_prefix, do_morph.
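
When several heads are unneeded, the flags can be built programmatically. The sketch below is illustrative only: head_flags is a hypothetical helper, and it assumes that any flag left out of the call stays enabled and that several flags can be set to False at once. Neither is confirmed here beyond the single do_lex=False example.

```python
# Flag names as listed above; head_flags is a hypothetical helper for illustration.
ALL_FLAGS = ("do_lex", "do_syntax", "do_ner", "do_prefix", "do_morph")


def head_flags(*wanted):
    """Return keyword arguments that switch off every head not named in `wanted`."""
    unknown = set(wanted) - set(ALL_FLAGS)
    if unknown:
        raise ValueError(f"unknown flags: {sorted(unknown)}")
    return {flag: False for flag in ALL_FLAGS if flag not in wanted}


# Keep only NER and prefix segmentation; the other three heads are switched off.
kwargs = head_flags("do_ner", "do_prefix")

# Assumed usage, mirroring the from_pretrained call shown above:
# model = AutoModel.from_pretrained('dicta-il/dictabert-large-parse',
#                                   trust_remote_code=True, **kwargs)
```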


Sample usage:

from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('dicta-il/dictabert-large-parse')
model = AutoModel.from_pretrained('dicta-il/dictabert-large-parse', trust_remote_code=True)

model.eval()

sentence = 'בשנת 1948 השלים אפרים קישון את לימודיו בפיסול מתכת ובתולדות האמנות והחל לפרסם מאמרים הומוריסטיים'
print(model.predict([sentence], tokenizer, output_style='json')) # see below for other return formats

Output:

[
  {
    "text": "בשנת 1948 השלים אפרים קישון את לימודיו בפיסול מתכת ובתולדות האמנות והחל לפרסם מאמרים הומוריסטיים",
    "tokens": [
      {
        "token": "בשנת",
        "offsets": {
          "start": 0,
          "end": 4
        },
        "syntax": {
          "word": "בשנת",
          "dep_head_idx": 2,
          "dep_func": "obl",
          "dep_head": "השלים"
        },
        "seg": [
          "ב",
          "שנת"
        ],
        "lex": "שנה",
        "morph": {
          "token": "בשנת",
          "pos": "NOUN",
          "feats": {
            "Gender": "Fem",
            "Number": "Sing"
          },
          "prefixes": [
            "ADP"
          ],
          "suffix": false
        }
      },
      {
        "token": "1948",
        "offsets": {
          "start": 5,
          "end": 9
        },
        "syntax": {
          "word": "1948",
          "dep_head_idx": 0,
          "dep_func": "compound:smixut",
          "dep_head": "בשנת"
        },
        "seg": [
          "1948"
        ],
        "lex": "1948",
        "morph": {
          "token": "1948",
          "pos": "NUM",
          "feats": {},
          "prefixes": [],
          "suffix": false
        }
      },
      {
        "token": "השלים",
        "offsets": {
          "start": 10,
          "end": 15
        },
        "syntax": {
          "word": "השלים",
          "dep_head_idx": -1,
          "dep_func": "root",
          "dep_head": "הומוריסטיים"
        },
        "seg": [
          "השלים"
        ],
        "lex": "השלים",
        "morph": {
          "token": "השלים",
          "pos": "VERB",
          "feats": {
            "Gender": "Masc",
            "Number": "Sing",
            "Person": "3",
            "Tense": "Past"
          },
          "prefixes": [],
          "suffix": false
        }
      },
      {
        "token": "אפרים",
        "offsets": {
          "start": 16,
          "end": 21
        },
        "syntax": {
          "word": "אפרים",
          "dep_head_idx": 2,
          "dep_func": "nsubj",
          "dep_head": "השלים"
        },
        "seg": [
          "אפרים"
        ],
        "lex": "אפרים",
        "morph": {
          "token": "אפרים",
          "pos": "PROPN",
          "feats": {},
          "prefixes": [],
          "suffix": false
        }
      },
      {
        "token": "קישון",
        "offsets": {
          "start": 22,
          "end": 27
        },
        "syntax": {
          "word": "קישון",
          "dep_head_idx": 3,
          "dep_func": "flat:name",
          "dep_head": "אפרים"
        },
        "seg": [
          "קישון"
        ],
        "lex": "קישון",
        "morph": {
          "token": "קישון",
          "pos": "PROPN",
          "feats": {},
          "prefixes": [],
          "suffix": false
        }
      },
      {
        "token": "את",
        "offsets": {
          "start": 28,
          "end": 30
        },
        "syntax": {
          "word": "את",
          "dep_head_idx": 6,
          "dep_func": "case:acc",
          "dep_head": "לימודיו"
        },
        "seg": [
          "את"
        ],
        "lex": "את",
        "morph": {
          "token": "את",
          "pos": "ADP",
          "feats": {},
          "prefixes": [],
          "suffix": false
        }
      },
      {
        "token": "לימודיו",
        "offsets": {
          "start": 31,
          "end": 38
        },
        "syntax": {
          "word": "לימודיו",
          "dep_head_idx": 2,
          "dep_func": "obj",
          "dep_head": "השלים"
        },
        "seg": [
          "לימודיו"
        ],
        "lex": "לימוד",
        "morph": {
          "token": "לימודיו",
          "pos": "NOUN",
          "feats": {
            "Gender": "Masc",
            "Number": "Plur"
          },
          "prefixes": [],
          "suffix": "ADP_PRON",
          "suffix_feats": {
            "Gender": "Masc",
            "Number": "Sing",
            "Person": "3"
          }
        }
      },
      {
        "token": "בפיסול",
        "offsets": {
          "start": 39,
          "end": 45
        },
        "syntax": {
          "word": "בפיסול",
          "dep_head_idx": 6,
          "dep_func": "nmod",
          "dep_head": "לימודיו"
        },
        "seg": [
          "ב",
          "פיסול"
        ],
        "lex": "פיסול",
        "morph": {
          "token": "בפיסול",
          "pos": "NOUN",
          "feats": {
            "Gender": "Masc",
            "Number": "Sing"
          },
          "prefixes": [
            "ADP"
          ],
          "suffix": false
        }
      },
      {
        "token": "מתכת",
        "offsets": {
          "start": 46,
          "end": 50
        },
        "syntax": {
          "word": "מתכת",
          "dep_head_idx": 7,
          "dep_func": "compound:smixut",
          "dep_head": "בפיסול"
        },
        "seg": [
          "מתכת"
        ],
        "lex": "מתכת",
        "morph": {
          "token": "מתכת",
          "pos": "NOUN",
          "feats": {
            "Gender": "Fem",
            "Number": "Sing"
          },
          "prefixes": [],
          "suffix": false
        }
      },
      {
        "token": "ובתולדות",
        "offsets": {
          "start": 51,
          "end": 59
        },
        "syntax": {
          "word": "ובתולדות",
          "dep_head_idx": 7,
          "dep_func": "conj",
          "dep_head": "בפיסול"
        },
        "seg": [
          "וב",
          "תולדות"
        ],
        "lex": "תולדה",
        "morph": {
          "token": "ובתולדות",
          "pos": "NOUN",
          "feats": {
            "Gender": "Fem",
            "Number": "Plur"
          },
          "prefixes": [
            "CCONJ",
            "ADP"
          ],
          "suffix": false
        }
      },
      {
        "token": "האמנות",
        "offsets": {
          "start": 60,
          "end": 66
        },
        "syntax": {
          "word": "האמנות",
          "dep_head_idx": 9,
          "dep_func": "compound:smixut",
          "dep_head": "ובתולדות"
        },
        "seg": [
          "ה",
          "אמנות"
        ],
        "lex": "אומנות",
        "morph": {
          "token": "האמנות",
          "pos": "NOUN",
          "feats": {
            "Gender": "Fem",
            "Number": "Sing"
          },
          "prefixes": [
            "DET"
          ],
          "suffix": false
        }
      },
      {
        "token": "והחל",
        "offsets": {
          "start": 67,
          "end": 71
        },
        "syntax": {
          "word": "והחל",
          "dep_head_idx": 2,
          "dep_func": "conj",
          "dep_head": "השלים"
        },
        "seg": [
          "ו",
          "החל"
        ],
        "lex": "החל",
        "morph": {
          "token": "והחל",
          "pos": "VERB",
          "feats": {
            "Gender": "Masc",
            "Number": "Sing",
            "Person": "3",
            "Tense": "Past"
          },
          "prefixes": [
            "CCONJ"
          ],
          "suffix": false
        }
      },
      {
        "token": "לפרסם",
        "offsets": {
          "start": 72,
          "end": 77
        },
        "syntax": {
          "word": "לפרסם",
          "dep_head_idx": 11,
          "dep_func": "xcomp",
          "dep_head": "והחל"
        },
        "seg": [
          "לפרסם"
        ],
        "lex": "פרסם",
        "morph": {
          "token": "לפרסם",
          "pos": "VERB",
          "feats": {},
          "prefixes": [],
          "suffix": false
        }
      },
      {
        "token": "מאמרים",
        "offsets": {
          "start": 78,
          "end": 84
        },
        "syntax": {
          "word": "מאמרים",
          "dep_head_idx": 12,
          "dep_func": "obj",
          "dep_head": "לפרסם"
        },
        "seg": [
          "מאמרים"
        ],
        "lex": "מאמר",
        "morph": {
          "token": "מאמרים",
          "pos": "NOUN",
          "feats": {
            "Gender": "Masc",
            "Number": "Plur"
          },
          "prefixes": [],
          "suffix": false
        }
      },
      {
        "token": "הומוריסטיים",
        "offsets": {
          "start": 85,
          "end": 96
        },
        "syntax": {
          "word": "הומוריסטיים",
          "dep_head_idx": 13,
          "dep_func": "amod",
          "dep_head": "מאמרים"
        },
        "seg": [
          "הומוריסטיים"
        ],
        "lex": "הומוריסטי",
        "morph": {
          "token": "הומוריסטיים",
          "pos": "ADJ",
          "feats": {
            "Gender": "Masc",
            "Number": "Plur"
          },
          "prefixes": [],
          "suffix": false
        }
      }
    ],
    "root_idx": 2,
    "ner_entities": [
      {
        "phrase": "1948",
        "label": "TIMEX",
        "start": 5,
        "end": 9,
        "token_start": 1,
        "token_end": 1
      },
      {
        "phrase": "אפרים קישון",
        "label": "PER",
        "start": 16,
        "end": 27,
        "token_start": 3,
        "token_end": 4
      }
    ]
  }
]
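
The JSON structure above can be walked with plain dictionary access. The following sketch is not part of the model's API: token_rows, dependency_edges, and entities are hypothetical helpers, and the sample keeps only the first three tokens of the output above, with some fields omitted. The root token in the sample output carries dep_head_idx -1, so the sketch checks dep_func rather than indexing with that value.

```python
# First three tokens of the sample output above, with some fields omitted.
sample = {
    "text": "בשנת 1948 השלים",
    "tokens": [
        {"token": "בשנת", "offsets": {"start": 0, "end": 4},
         "syntax": {"dep_head_idx": 2, "dep_func": "obl"},
         "lex": "שנה", "morph": {"pos": "NOUN"}},
        {"token": "1948", "offsets": {"start": 5, "end": 9},
         "syntax": {"dep_head_idx": 0, "dep_func": "compound:smixut"},
         "lex": "1948", "morph": {"pos": "NUM"}},
        {"token": "השלים", "offsets": {"start": 10, "end": 15},
         "syntax": {"dep_head_idx": -1, "dep_func": "root"},
         "lex": "השלים", "morph": {"pos": "VERB"}},
    ],
    "root_idx": 2,
    "ner_entities": [
        {"phrase": "1948", "label": "TIMEX", "start": 5, "end": 9,
         "token_start": 1, "token_end": 1},
    ],
}


def token_rows(sentence):
    """Yield (surface form, lemma, POS tag) for every token."""
    for tok in sentence["tokens"]:
        yield tok["token"], tok["lex"], tok["morph"]["pos"]


def dependency_edges(sentence):
    """Yield (dependent, relation, head); the head is None for the root token."""
    tokens = sentence["tokens"]
    for tok in tokens:
        syntax = tok["syntax"]
        if syntax["dep_func"] == "root":
            head = None
        else:
            head = tokens[syntax["dep_head_idx"]]["token"]
        yield tok["token"], syntax["dep_func"], head


def entities(sentence):
    """Return (phrase, label) pairs from the NER output."""
    return [(ent["phrase"], ent["label"]) for ent in sentence["ner_entities"]]


for row in token_rows(sample):
    print(row)
for edge in dependency_edges(sample):
    print(edge)
print(entities(sample))
```

The character offsets index into the sentence text, so sample["text"][5:9] recovers the token "1948".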

You can also choose to get your response in UD format:

sentence = 'בשנת 1948 השלים אפרים קישון את לימודיו בפיסול מתכת ובתולדות האמנות והחל לפרסם מאמרים הומוריסטיים'
print(model.predict([sentence], tokenizer, output_style='ud'))

Results:

[
  [
    "# sent_id = 1",
    "# text = בשנת 1948 השלים אפרים קישון את לימודיו בפיסול מתכת ובתולדות האמנות והחל לפרסם מאמרים הומוריסטיים",
    "1-2\tבשנת\t_\t_\t_\t_\t_\t_\t_\t_",
    "1\tב\tב\tADP\tADP\t_\t2\tcase\t_\t_",
    "2\tשנת\tשנה\tNOUN\tNOUN\tGender=Fem|Number=Sing\t4\tobl\t_\t_",
    "3\t1948\t1948\tNUM\tNUM\t\t2\tcompound:smixut\t_\t_",
    "4\tהשלים\tהשלים\tVERB\tVERB\tGender=Masc|Number=Sing|Person=3|Tense=Past\t0\troot\t_\t_",
    "5\tאפרים\tאפרים\tPROPN\tPROPN\t\t4\tnsubj\t_\t_",
    "6\tקישון\tקישון\tPROPN\tPROPN\t\t5\tflat:name\t_\t_",
    "7\tאת\tאת\tADP\tADP\t\t8\tcase:acc\t_\t_",
    "8-10\tלימודיו\t_\t_\t_\t_\t_\t_\t_\t_",
    "8\tלימוד_\tלימוד\tNOUN\tNOUN\tGender=Masc|Number=Plur\t4\tobj\t_\t_",
    "9\t_של_\tשל\tADP\tADP\t_\t10\tcase\t_\t_",
    "10\t_הוא\tהוא\tPRON\tPRON\tGender=Masc|Number=Sing|Person=3\t8\tnmod:poss\t_\t_",
    "11-12\tבפיסול\t_\t_\t_\t_\t_\t_\t_\t_",
    "11\tב\tב\tADP\tADP\t_\t12\tcase\t_\t_",
    "12\tפיסול\tפיסול\tNOUN\tNOUN\tGender=Masc|Number=Sing\t8\tnmod\t_\t_",
    "13\tמתכת\tמתכת\tNOUN\tNOUN\tGender=Fem|Number=Sing\t12\tcompound:smixut\t_\t_",
    "14-16\tובתולדות\t_\t_\t_\t_\t_\t_\t_\t_",
    "14\tו\tו\tCCONJ\tCCONJ\t_\t16\tcc\t_\t_",
    "15\tב\tב\tADP\tADP\t_\t16\tcase\t_\t_",
    "16\tתולדות\tתולדה\tNOUN\tNOUN\tGender=Fem|Number=Plur\t12\tconj\t_\t_",
    "17-18\tהאמנות\t_\t_\t_\t_\t_\t_\t_\t_",
    "17\tה\tה\tDET\tDET\t_\t18\tdet\t_\t_",
    "18\tאמנות\tאומנות\tNOUN\tNOUN\tGender=Fem|Number=Sing\t16\tcompound:smixut\t_\t_",
    "19-20\tוהחל\t_\t_\t_\t_\t_\t_\t_\t_",
    "19\tו\tו\tCCONJ\tCCONJ\t_\t20\tcc\t_\t_",
    "20\tהחל\tהחל\tVERB\tVERB\tGender=Masc|Number=Sing|Person=3|Tense=Past\t4\tconj\t_\t_",
    "21\tלפרסם\tפרסם\tVERB\tVERB\t\t20\txcomp\t_\t_",
    "22\tמאמרים\tמאמר\tNOUN\tNOUN\tGender=Masc|Number=Plur\t21\tobj\t_\t_",
    "23\tהומוריסטיים\tהומוריסטי\tADJ\tADJ\tGender=Masc|Number=Plur\t22\tamod\t_\t_"
  ]
]
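
Each sentence in the UD output is a list of strings: comment lines starting with #, range lines such as 1-2 for a surface token that is split into several words, and tab-separated word lines. As a rough sketch, these can be separated as follows. parse_ud is a hypothetical helper, not part of the model's code, and the column positions are assumed to follow the ten-column layout visible in the output above.

```python
def parse_ud(lines):
    """Split one sentence of UD lines into comments, token ranges, and word rows."""
    comments, ranges, words = [], [], []
    for line in lines:
        if line.startswith("#"):
            comments.append(line[1:].strip())
            continue
        cols = line.split("\t")
        if "-" in cols[0]:
            # A range such as "1-2" marks a surface token split into several words.
            ranges.append((cols[0], cols[1]))
            continue
        words.append({
            "id": int(cols[0]),
            "form": cols[1],
            "lemma": cols[2],
            "upos": cols[3],
            "feats": cols[5],  # may be "_" or empty, as in the output above
            "head": int(cols[6]),  # 0 marks the root
            "deprel": cols[7],
        })
    return comments, ranges, words


# The first lines of the output above.
sample = [
    "# sent_id = 1",
    "1-2\tבשנת\t_\t_\t_\t_\t_\t_\t_\t_",
    "1\tב\tב\tADP\tADP\t_\t2\tcase\t_\t_",
    "2\tשנת\tשנה\tNOUN\tNOUN\tGender=Fem|Number=Sing\t4\tobl\t_\t_",
    "3\t1948\t1948\tNUM\tNUM\t\t2\tcompound:smixut\t_\t_",
]

comments, ranges, words = parse_ud(sample)
for word in words:
    print(word["id"], word["form"], word["upos"], word["head"], word["deprel"])
```

Unlike dep_head_idx in the JSON output, the head column here refers to word IDs within the sentence, which start at 1 and count the segmented words rather than the surface tokens.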

Citation

If you use DictaBERT-large-parse in your research, please cite "MRL Parsing Without Tears: The Case of Hebrew".

BibTeX:

@misc{shmidman2024mrl,
      title={MRL Parsing Without Tears: The Case of Hebrew}, 
      author={Shaltiel Shmidman and Avi Shmidman and Moshe Koppel and Reut Tsarfaty},
      year={2024},
      eprint={2403.06970},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

License

This work is licensed under a Creative Commons Attribution 4.0 International License.
