Updates incorrect tokenizer configuration file

#7
by lysandre HF staff - opened

This repository contains an incorrect tokenizer configuration file and instead relies on attributes set
directly within the transformers library to tokenize inputs correctly.

In order to ensure repositories don't depend on internal configuration changes, we're removing these attribute maps
in transformers#29112.

With those maps removed, the following attributes are missing from this repository's configuration and would be
misconfigured without this PR:

{'src_lang': None, 'tgt_lang': None, 'eos_token': '</s>', 'unk_token': '<unk>', 'sep_token': '</s>', 'cls_token': '<s>', 'pad_token': '<pad>', 'mask_token': {'content': '<mask>', 'single_word': False, 'lstrip': True, 'rstrip': False, 'normalized': True, '__type': 'AddedToken'}, 'bos_token': '<s>', 'tokenizer_file': None, 'language_codes': 'ML50', 'special_tokens_map_file': '/home/suraj/projects/mbart-50/mbart-50/special_tokens_map.json', 'name_or_path': '/home/suraj/projects/mbart-50/hf_models/mbart-50-large-one-to-many/', 'model_max_length': 1024}

This PR adds these attributes and their values to the tokenizer config file.
This makes the repository more robust by ensuring that:

  • the repository does not depend on code internal to the transformers library
  • clones of this repository continue working as expected even without the correct repository name
  • other libraries that would like to leverage this repository do not depend on code within the transformers library
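For readers applying the same fix to a clone, here is a minimal sketch of the merge this PR performs, using only the standard library. It is not the script used to produce this PR: the helper name `add_missing_attributes` is hypothetical, and the dictionary holds only a subset of the attributes listed above. Keys already present in the config are left untouched.

```python
import json
from pathlib import Path

# Subset of the attributes listed in this PR description (values copied as-is).
MISSING_ATTRIBUTES = {
    "eos_token": "</s>",
    "unk_token": "<unk>",
    "sep_token": "</s>",
    "cls_token": "<s>",
    "pad_token": "<pad>",
    "bos_token": "<s>",
    "language_codes": "ML50",
    "model_max_length": 1024,
}


def add_missing_attributes(config_path, missing):
    """Hypothetical helper: add keys absent from a JSON config file.

    Existing keys keep their current values. Returns the list of keys added.
    """
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))

    # Only fill in what is not there, so explicit repository settings win.
    added = [key for key in missing if key not in config]
    for key in added:
        config[key] = missing[key]

    path.write_text(
        json.dumps(config, indent=2, ensure_ascii=False) + "\n",
        encoding="utf-8",
    )
    return added
```

Calling `add_missing_attributes("tokenizer_config.json", MISSING_ATTRIBUTES)` inside a local clone would report which keys were written; the file name is assumed to be the tokenizer config file mentioned above.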

Thanks 🤗

lysandre changed pull request status to open
