---
license: apache-2.0
library_name: transformers
---
## Model Description

The transformer-eng-por model is a sequence-to-sequence transformer that translates English sentences into Portuguese.

The model was trained on English-Portuguese sentence pairs; the specific dataset is not documented in this card (more information needed).
## Details

- Size: 23,805,216 parameters
- Dataset: more information needed
- Languages: English, Portuguese
- Number of training steps: 30
- Batch size: 32
- Optimizer: rmsprop (see the training sketch below)
- Learning rate: 0.001
- GPU: T4
- This repository has the source code used to train this model.
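The hyperparameters above imply a Keras training setup roughly like the following. This is a minimal, hypothetical sketch, not the repository's actual training script: the loss, the metric, and the `train_ds`/`val_ds` dataset objects are assumptions, and the 30 training steps are interpreted here as epochs.

```python
import keras

# Hypothetical training configuration matching the listed hyperparameters.
# `transformer`, `train_ds`, and `val_ds` are assumed to be defined elsewhere.
transformer.compile(optimizer=keras.optimizers.RMSprop(learning_rate=0.001),
                    loss="sparse_categorical_crossentropy",  # assumption
                    metrics=["accuracy"])                    # assumption
transformer.fit(train_ds, validation_data=val_ds, epochs=30)
```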
## Usage

First, import the dependencies and define the text standardization used for the Portuguese (target) side:

```python
import re
import string

import keras
import numpy as np
import tensorflow as tf
from huggingface_hub import hf_hub_download

# Strip all punctuation except "[" and "]", which delimit the
# [start] and [end] control tokens.
strip_chars = string.punctuation
strip_chars = strip_chars.replace("[", "")
strip_chars = strip_chars.replace("]", "")

def custom_standardization(input_string):
    lowercase = tf.strings.lower(input_string)
    return tf.strings.regex_replace(lowercase, f"[{re.escape(strip_chars)}]", "")
```
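As a quick sanity check (not part of the original card), the standardizer lowercases text and strips punctuation while leaving the bracketed control tokens intact:

```python
# Punctuation is removed, but the [start]/[end] markers survive.
print(custom_standardization(tf.constant("[start] Don't do it! [end]")).numpy())
# b'[start] dont do it [end]'
```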
Next, download the custom transformer blocks and the trained model from the Hub:

```python
# Fetch the module with the custom layer definitions.
hf_hub_download(repo_id="AiresPucrs/transformer-eng-por",
                filename="keras_transformer_blocks.py",
                repo_type="model",
                local_dir="./")

from keras_transformer_blocks import TransformerEncoder, PositionalEmbedding, TransformerDecoder

# Fetch the trained model and load it with its custom layers.
model_path = hf_hub_download(repo_id="AiresPucrs/transformer-eng-por",
                             filename="transformer-eng-por.h5",
                             repo_type="model",
                             local_dir="./")

transformer = keras.models.load_model(model_path,
                                      custom_objects={"TransformerEncoder": TransformerEncoder,
                                                      "PositionalEmbedding": PositionalEmbedding,
                                                      "TransformerDecoder": TransformerDecoder})
```
Then download the saved vocabularies and rebuild the vectorization layers:

```python
for filename in ("portuguese_vocabulary.txt", "english_vocabulary.txt"):
    hf_hub_download(repo_id="AiresPucrs/transformer-eng-por",
                    filename=filename,
                    repo_type="model",
                    local_dir="./")

with open('portuguese_vocabulary.txt', encoding='utf-8', errors='backslashreplace') as fp:
    portuguese_vocab = [line.strip() for line in fp]

with open('english_vocabulary.txt', encoding='utf-8', errors='backslashreplace') as fp:
    english_vocab = [line.strip() for line in fp]

# Target (Portuguese) sequences are one step longer (21) because the
# decoder input is shifted by one token during training.
target_vectorization = tf.keras.layers.TextVectorization(max_tokens=20000,
                                                         output_mode="int",
                                                         output_sequence_length=21,
                                                         standardize=custom_standardization,
                                                         vocabulary=portuguese_vocab)

source_vectorization = tf.keras.layers.TextVectorization(max_tokens=20000,
                                                         output_mode="int",
                                                         output_sequence_length=20,
                                                         vocabulary=english_vocab)
```
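As an illustrative check (not part of the original card), each English sentence maps to a fixed-length tensor of 20 token ids:

```python
tokens = source_vectorization(["What is its name?"])
print(tokens.shape)  # (1, 20); unused positions are padded with 0
```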
Finally, define a greedy decoding loop and translate a few sentences:

```python
portuguese_index_lookup = dict(zip(range(len(portuguese_vocab)), portuguese_vocab))
max_decoded_sentence_length = 20

def decode_sequence(input_sentence):
    """Greedily decode a Portuguese translation, one token at a time."""
    tokenized_input_sentence = source_vectorization([input_sentence])
    decoded_sentence = "[start]"
    for i in range(max_decoded_sentence_length):
        tokenized_target_sentence = target_vectorization([decoded_sentence])[:, :-1]
        predictions = transformer([tokenized_input_sentence, tokenized_target_sentence])
        # Greedy search: take the most probable next token.
        sampled_token_index = np.argmax(predictions[0, i, :])
        sampled_token = portuguese_index_lookup[sampled_token_index]
        decoded_sentence += " " + sampled_token
        if sampled_token == "[end]":
            break
    return decoded_sentence

eng_sentences = ["What is its name?",
                 "How old are you?",
                 "I know you know where Mary is.",
                 "We will show Tom.",
                 "What do you all do?",
                 "Don't do it!"]

for sentence in eng_sentences:
    print(f"English sentence:\n{sentence}")
    print(f"Portuguese translation:\n{decode_sequence(sentence)}")
    print('-' * 50)
```
This will output the following:
```
English sentence:
What is its name?
Portuguese translation:
[start] qual é o nome dele [end]
--------------------------------------------------
English sentence:
How old are you?
Portuguese translation:
[start] quantos anos você tem [end]
--------------------------------------------------
English sentence:
I know you know where Mary is.
Portuguese translation:
[start] eu sei que você sabe onde mary está [end]
--------------------------------------------------
English sentence:
We will show Tom.
Portuguese translation:
[start] vamos ligar para o tom [end]
--------------------------------------------------
English sentence:
What do you all do?
Portuguese translation:
[start] o que vocês todos nós têm feito [end]
--------------------------------------------------
English sentence:
Don't do it!
Portuguese translation:
[start] não faça isso [end]
--------------------------------------------------
```
## Cite as 🤗

```bibtex
@misc{teenytinycastle,
  doi = {10.5281/zenodo.7112065},
  url = {https://huggingface.co/AiresPucrs/transformer-eng-por},
  author = {Nicholas Kluge Corr{\^e}a},
  title = {Teeny-Tiny Castle},
  year = {2023},
  publisher = {HuggingFace},
  journal = {HuggingFace repository},
}
```
## License

The transformer-eng-por model is licensed under the Apache License, Version 2.0. See the LICENSE file for more details.