---
language:
- he
- en
pipeline_tag: translation
license: apache-2.0
tags:
- transformer
- tokenizer
---

# Model Overview

**Model Name:** T5 Hebrew-to-English Translation Tokenizer
**Model Type:** Tokenizer for Transformer-based models
**Base Model:** T5 (Text-to-Text Transfer Transformer)
**Preprocessing:** Custom tokenizer built with `SentencePieceBPETokenizer`
**Training Data:** Custom Hebrew-English dataset curated for translation tasks
**Intended Use:** This tokenizer is intended for machine translation tasks, specifically Hebrew-to-English translation.

## Model Description

This tokenizer was trained on a Hebrew-to-English parallel dataset using `SentencePieceBPETokenizer`. It is optimized for Hebrew text tokenization and can be paired with a Transformer model, such as T5, for sequence-to-sequence translation tasks. It handles preprocessing steps such as tokenization, padding, and truncation. A sketch of how such a tokenizer can be trained is included at the end of this card.

## Performance

- **Task:** Hebrew-to-English translation (tokenizer only)
- **Dataset:** A custom dataset of parallel Hebrew-English sentences
- **Metrics:**
  - Vocabulary size: 30,000 tokens
  - Tokenization accuracy: not applicable (there is no standalone accuracy metric for a tokenizer)

## Usage

### How to Use the Tokenizer

Load the tokenizer with the Hugging Face Transformers library:

```python
from transformers import AutoTokenizer

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("tejagowda/t5-hebrew-translation", use_fast=False)

# Example: tokenizing a Hebrew sentence
hebrew_text = "אתהד על החומרה."
inputs = tokenizer(hebrew_text, return_tensors="pt")

print("Tokens:", inputs["input_ids"])
```

### Example Usage with a Pretrained Model

To perform translation, pair this tokenizer with a pretrained T5 model:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("tejagowda/t5-hebrew-translation", use_fast=False)
# Placeholder model: a stock t5-small does not share this tokenizer's vocabulary,
# so replace it with a model fine-tuned using this tokenizer for meaningful output.
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# Hebrew text to translate ("Describe the structure of an atom.")
hebrew_text = "תאר את מבנה של אטום."

# Tokenize and translate
inputs = tokenizer(hebrew_text, return_tensors="pt")
outputs = model.generate(inputs["input_ids"], max_length=100)

# Decode the output
english_translation = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("Translation:", english_translation)
```

## Limitations

- The tokenizer itself does not perform translation; it must be paired with a translation model.
- Translation quality depends on the paired model and the data it was trained on.

## License

This tokenizer is licensed under the Apache 2.0 License. See the LICENSE file for more details.
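
## Appendix: Training a Similar Tokenizer (Sketch)

The snippet below is a minimal sketch of how a `SentencePieceBPETokenizer` like this one could be trained and wrapped for use with Transformers. It is not the exact script used for this repository: the corpus path `he_en_parallel.txt`, the special tokens, and the `min_frequency` setting are illustrative assumptions; only the 30,000-token vocabulary size is taken from the card above.

```python
from tokenizers import SentencePieceBPETokenizer
from transformers import PreTrainedTokenizerFast

# Train a SentencePiece-BPE tokenizer on a plain-text corpus.
# "he_en_parallel.txt" is a hypothetical file with one sentence per line.
sp_tokenizer = SentencePieceBPETokenizer()
sp_tokenizer.train(
    files=["he_en_parallel.txt"],
    vocab_size=30_000,                          # matches the vocabulary size reported above
    min_frequency=2,                            # assumed: drop very rare merges
    special_tokens=["<pad>", "</s>", "<unk>"],  # assumed T5-style special tokens
)

# Save the trained tokenizer to a single JSON file.
sp_tokenizer.save("tokenizer.json")

# Wrap it as a Hugging Face tokenizer so it can be reloaded via AutoTokenizer.
hf_tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="tokenizer.json",
    pad_token="<pad>",
    eos_token="</s>",
    unk_token="<unk>",
)
hf_tokenizer.save_pretrained("t5-hebrew-translation-tokenizer")
```

Once saved, the output directory can be loaded with `AutoTokenizer.from_pretrained(...)` as in the Usage section; note that a tokenizer wrapped this way loads as a fast tokenizer, whereas the repository above is loaded with `use_fast=False`.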