sentencepiece

#1
by valamiasd - opened

Hi, is it possible to use this model without the transformers library and tokenize the text with sentencepiece or something similar?

Owner

Hi valamiasd,
I'm sorry, but I only know about the tokenizer method provided by CTranslate2.
If you have tested different tokenization methods and obtained alternative results, please feel free to share. Thank you.

Hi, I've used this model with CTranslate2 and SentencePiece from C++ without any problems (sorry, I don't speak Python):

Conversion:

ct2-transformers-converter --model . --quantization int8_float16 --output_dir ./ct2-int8_float16 --copy_files added_tokens.json generation_config.json model.safetensors.index.json special_tokens_map.json spiece.model tokenizer.json tokenizer_config.json

Loading model:

auto const model { models::Model::load ( ".../madlad400-3b/ct2-int8_float16", Device::CPU )};

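SentencePiece setup:

// Load the SentencePiece model that was copied next to the converted model.
SentencePieceProcessor sp_processor;
auto status { sp_processor.Load ( ".../madlad400-3b/ct2-int8_float16/spiece.model" )};
if ( !status.ok ())
{ throw invalid_argument ( "Unable to open SentencePiece model!" ); }

// Restrict SentencePiece to the tokens of the model's source vocabulary.
// get_vocabulary_tokens is a helper that is not shown in this post.
auto const * pSequenceToSequenceModel { dynamic_cast<models::SequenceToSequenceModel const *>( model.get ())};
if ( !pSequenceToSequenceModel )
{ throw runtime_error ( "The loaded model is not a sequence-to-sequence model!" ); }
status = sp_processor.SetVocabulary ( get_vocabulary_tokens ( pSequenceToSequenceModel->get_source_vocabulary ()));
if ( !status.ok ())
{ throw runtime_error ( "Failed to set the SentencePiece vocabulary!" ); }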
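
For anyone following along, here is a rough sketch of how the pieces returned by SentencePiece could be wrapped before they are passed to CTranslate2. It rests on two assumptions that should be checked against the model card: that MADLAD-400 expects a `<2xx>` target-language tag at the start of the source, and that `</s>` has to be appended by hand because the raw SentencePiece encoder does not add it. `build_source_tokens` is a hypothetical helper name, not part of either library.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not part of CTranslate2 or SentencePiece).
// Assumption: the model expects "<2xx>" as the first source token and "</s>" as the last.
std::vector<std::string> build_source_tokens(const std::vector<std::string>& pieces,
                                             const std::string& target_lang) {
    std::vector<std::string> tokens;
    tokens.reserve(pieces.size() + 2);
    tokens.push_back("<2" + target_lang + ">");                 // target-language tag, e.g. "<2en>"
    tokens.insert(tokens.end(), pieces.begin(), pieces.end());  // SentencePiece pieces, unchanged
    tokens.push_back("</s>");                                   // end-of-sequence marker
    return tokens;
}
```

The `pieces` argument would be the `std::vector<std::string>` filled by `sp_processor.Encode`, and the returned vector is what would go into the translation batch.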

Result (fr->en):

The model confused "building" and "ship", which are the same word in French ("bâtiment"). I still need to work on my sentence-splitting algorithm for long sentences...
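
A very naive splitter for this kind of preprocessing might look like the sketch below. It is not the algorithm used for the results in this post. It cuts after `.`, `!` or `?` when followed by whitespace, and additionally cuts after `;` once the current chunk has reached `soft_limit` bytes. `split_sentences`, `trim` and `soft_limit` are hypothetical names, and cutting inside a sentence may well hurt translation quality.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Remove leading and trailing ASCII whitespace.
std::string trim(const std::string& s) {
    const char* ws = " \t\r\n";
    const std::size_t first = s.find_first_not_of(ws);
    if (first == std::string::npos) return "";
    const std::size_t last = s.find_last_not_of(ws);
    return s.substr(first, last - first + 1);
}

// Hypothetical helper: split text into chunks to translate one at a time.
// Hard stops ('.', '!', '?') always end a chunk; ';' ends a chunk only when
// the chunk is already at least soft_limit bytes long (bytes, not characters).
std::vector<std::string> split_sentences(const std::string& text, std::size_t soft_limit) {
    std::vector<std::string> chunks;
    std::string current;
    for (std::size_t i = 0; i < text.size(); ++i) {
        const char c = text[i];
        current += c;
        const bool at_end = (i + 1 == text.size());
        const bool before_space = !at_end && (text[i + 1] == ' ' || text[i + 1] == '\n' ||
                                              text[i + 1] == '\t' || text[i + 1] == '\r');
        const bool hard_stop = (c == '.' || c == '!' || c == '?');
        const bool soft_stop = (c == ';') && current.size() >= soft_limit;
        if ((hard_stop || soft_stop) && (at_end || before_space)) {
            const std::string chunk = trim(current);
            if (!chunk.empty()) chunks.push_back(chunk);
            current.clear();
        }
    }
    const std::string tail = trim(current);  // text after the last terminator
    if (!tail.empty()) chunks.push_back(tail);
    return chunks;
}
```

Because the terminator must be followed by whitespace, something like "3.14" stays in one piece, but a heading such as "Marseille. — L’arrivée." would still be cut in two.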


Source text:
I

Translated text:
I


Source text:
Marseille. — L’arrivée.

Translated text:
Marseille. — The arrival


Source text:
Le 24 février 1815, la vigie de Notre-Dame de la Garde signala le trois-mâts le Pharaon, venant de Smyrne, Trieste et Naples.

Translated text:
On February 24, the watchman of Notre-Dame de la Garde signaled three mast Pharaon coming from Smyrna (Turkey), Trieste and Naples.


Source text:
Comme d’habitude, un pilote côtier partit aussitôt du port, rasa le château d’If, et alla aborder le navire entre le cap de Morgion et l’île de Rion.

Translated text:
As usual, a coast pilot left the port as soon from shore and raised out of and went to the ship between Cape Morgion, which is in a shore of Rion.


Source text:
Aussitôt, comme d’habitude encore, la plate-forme du fort Saint-Jean s’était couverte de curieux ; car c’est toujours une grande affaire à Marseille que l’arrivée d’un bâtiment, surtout quand ce bâtiment, comme le Pharaon, a été construit, gréé, arrimé sur les chantiers de la vieille Phocée, et appartient à un armateur de la ville.

Translated text:
Immediately, as usual again the platform of Fort Saint-Jean was covered with curious; because it is always a big affair in Marseille that the arrival of building, especially when this like the Pharaoh, was built and rigged in old Phocea shipyards; it belonged to a local shipping company.
