sasha-smirnov committed · Commit 00ba765 · verified · 1 Parent(s): 84eb4f8

Initial publish via teradata-opus-translate
README.md ADDED
---
language:
- en
- ca
license: apache-2.0
library_name: transformers
pipeline_tag: translation
tags:
- translation
- LiteRT
- safetensors
- Helsinki-NLP/tatoeba
- openlanguagedata/flores_plus
- marian
- onnx
- teradata
base_model: Helsinki-NLP/opus-mt_tiny_eng-cat
---

> ⚠️ **See [Disclaimer](#disclaimer) below before using.**

# opus-mt_tiny_eng-cat 🇬🇧 → 🇪🇸

## A Teradata-compatible Translation Model

A sequence-to-sequence translation model that translates text from **English 🇬🇧** to **Catalan 🇪🇸**. This repository hosts an ONNX-converted version of [Helsinki-NLP/opus-mt_tiny_eng-cat](https://huggingface.co/Helsinki-NLP/opus-mt_tiny_eng-cat), packaged for use with the Teradata `mldb.ONNXSeq2Seq` BYOM function.

**This repository does not redistribute the original model weights.** It contains only:

- `onnx/model-fp32.onnx` — full-precision ONNX graph
- `onnx/model-int8.onnx` — dynamically quantized ONNX graph
- `tokenizer.json` — repacked Marian tokenizer suitable for BYOM
- `config.json` — model architecture metadata, copied unchanged from the upstream repo
- `generation_config.json` — generation defaults, copied unchanged from the upstream repo

For the original PyTorch weights and training details, see the upstream model: **[Helsinki-NLP/opus-mt_tiny_eng-cat](https://huggingface.co/Helsinki-NLP/opus-mt_tiny_eng-cat)**.

**Specifications**

| Property | Value |
|---|---|
| Source language | English 🇬🇧 (`en`) |
| Target language | Catalan 🇪🇸 (`ca`) |
| Architecture | MarianMT (encoder-decoder) |
| Max input tokens | 256 |
| Max output tokens | 512 |
| ONNX file sizes | fp32 (177 MB), int8 (94 MB) |
| ONNX opset | 14 |
| ONNX IR version | 8 (BYOM 7.0+ compatible) |
| License | Apache-2.0 (from upstream) |
| Reference | https://huggingface.co/Helsinki-NLP/opus-mt_tiny_eng-cat |

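The file sizes above imply that dynamic int8 quantization roughly halves the artifact you ship; a quick check using the byte counts from this repo's LFS pointers:

```python
# Sizes taken from the Git LFS pointer files in this repo (bytes).
fp32_bytes = 177_301_730   # onnx/model-fp32.onnx
int8_bytes = 94_483_810    # onnx/model-int8.onnx

ratio = int8_bytes / fp32_bytes
print(f"int8 graph is {ratio:.0%} the size of fp32")  # ~53%
```

The saving is close to, but not exactly, 50%, because only weight tensors are quantized while graph structure and some tensors stay full precision.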
Generation parameters are configurable at SQL time via the `mldb.ONNXSeq2Seq` USING clause through `Const_*` keys: `Const_min_length`, `Const_max_length`, `Const_num_beams`, `Const_length_penalty`, `Const_repetition_penalty`. They are not fixed in the ONNX graph. (`num_return_sequences` is the exception — it's baked into the graph as 1.)

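If you tune these parameters often, it can be handy to build the `Const_*` lines from a plain dict; a minimal sketch (the helper name is ours, not part of `teradataml`):

```python
# Hypothetical helper: render a dict of generation parameters into
# Const_* lines for the mldb.ONNXSeq2Seq USING clause.
def render_const_params(params):
    return "\n".join(f"Const_{key}({value})" for key, value in params.items())

print(render_const_params({
    "min_length": 1,
    "max_length": 64,
    "num_beams": 4,
    "length_penalty": 1.0,
}))
```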
## Quickstart: Deploying this Model in Teradata

Requires Teradata 17.20+ with **BYOM 7.0.0.4** or newer (the conversion targets ONNX IR version 8, which BYOM 7.0.x requires).

```python
import getpass
import teradataml as tdml
from huggingface_hub import hf_hub_download

repo_id = "Teradata/opus-mt_tiny_eng-cat"
model_id = "opus-mt_tiny_eng-cat"  # used as BYOM model_id

# 1. Download artifacts from this repo
hf_hub_download(repo_id=repo_id, filename="onnx/model-int8.onnx", local_dir="./")
hf_hub_download(repo_id=repo_id, filename="tokenizer.json", local_dir="./")

# 2. Connect to Teradata
tdml.create_context(
    host=input("host: "),
    username=input("user: "),
    password=getpass.getpass("password: "),
)

# 3. Load model + tokenizer into BYOM tables
tdml.save_byom(model_id=model_id, model_file="onnx/model-int8.onnx",
               table_name="translation_models")
tdml.save_byom(model_id=model_id, model_file="tokenizer.json",
               table_name="translation_tokenizers")

# 4. Translate (note: single braces so the f-string interpolates model_id)
query = f"""
SELECT id, sequences
FROM mldb.ONNXSeq2Seq(
    ON (SELECT id, txt FROM your_input_table) AS InputTable
    ON (SELECT model_id, model FROM translation_models
        WHERE model_id = '{model_id}') AS ModelTable DIMENSION
    ON (SELECT model AS tokenizer FROM translation_tokenizers
        WHERE model_id = '{model_id}') AS TokenizerTable DIMENSION
    USING
        Accumulate('id')
        ModelOutputTensor('sequences')
        SkipSpecialTokens('true')
        OutputLength(512)
        OverwriteCachedModel('{model_id}')
        Const_min_length(1)
        Const_max_length(64)
        Const_num_beams(4)
        Const_length_penalty(1.0)
        Const_repetition_penalty(1.0)
) AS t
"""
print(tdml.DataFrame.from_query(query))
```

Use `model-int8.onnx` unless you have a measured accuracy reason to ship `fp32`.
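Because the ONNX files are stored via Git LFS, a downloaded copy can be verified against the `oid sha256:` recorded in the pointer files in this repo; a small sketch:

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Stream a file and return its hex SHA-256, as recorded in LFS pointers."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Compare against the oid in the onnx/model-int8.onnx LFS pointer:
# expected = "f2db650606d950865b8f3e6d668518d88f02fffd61313f1e37a4e438d69971a3"
# assert sha256_of("onnx/model-int8.onnx") == expected
```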

## How this model was converted

This model was produced with the open-source [`teradata-opus-translate`](https://pypi.org/project/teradata-opus-translate/) package, which exports the encoder/decoder, stitches in the BeamSearch op, applies dynamic int8 quantization, and verifies parity against PyTorch on a small sample set.

> **Note:** the same package can convert *any* Helsinki-NLP MarianMT model (including ones not in this collection) to a BYOM-ready ONNX bundle. If you have a translation pair that's not published here, install the package and run:
>
> ```python
> from teradata_opus_translate import convert_model, convert_tokenizer
>
> convert_model(
>     "Helsinki-NLP/<your-model>",
>     output_path="model-int8.onnx",
>     precision="int8",
> )
> convert_tokenizer(
>     "Helsinki-NLP/<your-model>",
>     output_path="tokenizer.json",
> )
> ```
>
> The resulting `model-int8.onnx` and `tokenizer.json` are ready to deploy with the Quickstart flow above.

## Disclaimer

DISCLAIMER: The content herein ("Content") is provided "AS IS" and is not covered by any Teradata Operations, Inc. and its affiliates ("Teradata") agreements. Its listing here does not constitute certification or endorsement by Teradata.

To the extent any of the Content contains or is related to any artificial intelligence ("AI") or other language learning models ("Models") that interoperate with the products and services of Teradata, by accessing, bringing, deploying or using such Models, you acknowledge and agree that you are solely responsible for ensuring compliance with all applicable laws, regulations, and restrictions governing the use, deployment, and distribution of AI technologies. This includes, but is not limited to, AI Diffusion Rules, European Union AI Act, AI-related laws and regulations, privacy laws, export controls, and financial or sector-specific regulations.

While Teradata may provide support, guidance, or assistance in the deployment or implementation of Models to interoperate with Teradata's products and/or services, you remain fully responsible for ensuring that your Models, data, and applications comply with all relevant legal and regulatory obligations. Our assistance does not constitute legal or regulatory approval, and Teradata disclaims any liability arising from non-compliance with applicable laws.

You must determine the suitability of the Models for any purpose. Given the probabilistic nature of machine learning and modeling, the use of the Models may in some situations result in incorrect output that does not accurately reflect the action generated. You should evaluate the accuracy of any output as appropriate for your use case, including by using human review of the output.
config.json ADDED
{
  "activation_dropout": 0.0,
  "activation_function": "relu",
  "architectures": [
    "MarianMTModel"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 0,
  "d_model": 256,
  "decoder_attention_heads": 8,
  "decoder_ffn_dim": 1536,
  "decoder_layerdrop": 0.0,
  "decoder_layers": 2,
  "decoder_start_token_id": 32000,
  "decoder_vocab_size": 32001,
  "dropout": 0.1,
  "dtype": "float16",
  "encoder_attention_heads": 8,
  "encoder_ffn_dim": 1536,
  "encoder_layerdrop": 0.0,
  "encoder_layers": 6,
  "eos_token_id": 0,
  "forced_eos_token_id": 0,
  "init_std": 0.02,
  "is_encoder_decoder": true,
  "max_length": null,
  "max_position_embeddings": 256,
  "model_type": "marian",
  "normalize_embedding": false,
  "num_hidden_layers": 6,
  "pad_token_id": 32000,
  "scale_embedding": true,
  "share_encoder_decoder_embeddings": true,
  "static_position_embeddings": true,
  "transformers_version": "4.57.6",
  "use_cache": true,
  "vocab_size": 32001
}
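A rough parameter count can be derived from these fields using standard transformer accounting (attention projections plus feed-forward weights, shared embeddings); this is our own back-of-envelope estimate, ignoring biases, layer norms, and the (parameter-free) static position embeddings:

```python
# Back-of-envelope parameter estimate from config.json fields.
# Assumes tied input/output embeddings (share_encoder_decoder_embeddings)
# and ignores biases and layer-norm weights, so treat it as approximate.
cfg = {"d_model": 256, "encoder_layers": 6, "decoder_layers": 2,
       "encoder_ffn_dim": 1536, "decoder_ffn_dim": 1536, "vocab_size": 32001}

d = cfg["d_model"]
embed = cfg["vocab_size"] * d                                       # shared embedding table
enc = cfg["encoder_layers"] * (4 * d * d + 2 * d * cfg["encoder_ffn_dim"])
dec = cfg["decoder_layers"] * (8 * d * d + 2 * d * cfg["decoder_ffn_dim"])  # self + cross attn
total = embed + enc + dec
print(f"~{total / 1e6:.1f}M parameters (weights only)")  # ~17.1M
```

The ONNX files are larger than this count suggests because the exported graph duplicates decoder weights for the with-past path and embeds beam-search machinery.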
generation_config.json ADDED
{
  "_from_model_config": true,
  "bad_words_ids": [
    [
      32000
    ]
  ],
  "bos_token_id": 0,
  "decoder_start_token_id": 32000,
  "eos_token_id": 0,
  "forced_eos_token_id": 0,
  "max_length": 512,
  "pad_token_id": 32000,
  "transformers_version": "4.57.6"
}
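The `bad_words_ids` entry bans the pad token (id 32000) from ever being emitted during generation, which in a greedy or beam decode amounts to masking that token's score before picking the next token. A toy sketch of the idea (not the ONNX BeamSearch op itself):

```python
import math

def mask_banned(logits, banned_ids):
    """Set banned token scores to -inf so they can never be selected."""
    masked = list(logits)
    for tid in banned_ids:
        masked[tid] = -math.inf
    return masked

logits = [0.1, 2.5, 0.3, 1.0]   # toy 4-token vocabulary
banned = [1]                     # pretend id 1 is the pad token
masked = mask_banned(logits, banned)
best = max(range(len(masked)), key=masked.__getitem__)
print(best)  # -> 3: id 1 is excluded, so the next-best token wins
```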
onnx/model-fp32.onnx ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:9aa55806a5ae9f0d002bced9a8b8684b454d1624b43768a5a54d25a0e3d1605c
size 177301730
onnx/model-int8.onnx ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:f2db650606d950865b8f3e6d668518d88f02fffd61313f1e37a4e438d69971a3
size 94483810
tokenizer.json ADDED
The diff for this file is too large to render.