Unrecognized configuration class <class 'transformers.models.mistral.configuration_mistral.MistralConfig'>

#50
by zeio - opened

Cannot use this model in a pipeline; the following error is raised:

ValueError: Unrecognized configuration class <class 'transformers.models.mistral.configuration_mistral.MistralConfig'> for this kind of AutoModel: TFAutoModelForCausalLM.

My code:

from transformers import pipeline

print(pipeline('text-generation', 'mistralai/Mistral-7B-v0.1')(text))

Version of the transformers library: 4.35.0.dev0 (installed from GitHub).

I am getting the same error

Mistral AI org

Hey! It seems like you only have TensorFlow installed and not PyTorch, so the pipeline is defaulting to the TensorFlow model.
Unfortunately, Mistral isn't implemented in TensorFlow. Could you try installing PyTorch?
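For reference, once PyTorch is available, a minimal sketch of the working path (the prompt is just a placeholder); framework="pt" makes the framework choice explicit:

# requires a PyTorch install, e.g. pip install torch
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-v0.1",
    framework="pt",  # force the PyTorch implementation; there is no TF port of Mistral
)
print(generator("Hello, my name is", max_new_tokens=20))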

In my case, when trying to use some other LLM functions (NER, summarization, translation), I'm getting the following:

I0609 02:28:45.316430 1 cache_manager.cc:480] Create CacheManager with cache_dir: '/opt/tritonserver/caches'
I0609 02:28:45.506355 1 pinned_memory_manager.cc:275] Pinned memory pool is created at '0x7ffb60000000' with size 268435456
I0609 02:28:45.506454 1 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864
I0609 02:28:45.507546 1 model_config_utils.cc:680] Server side auto-completed config: name: "Mistral-7B-v0.3"
input {
name: "input_text"
data_type: TYPE_STRING
dims: 1
dims: 1
}
input {
name: "function"
data_type: TYPE_STRING
dims: 1
}
output {
name: "output_text"
data_type: TYPE_STRING
dims: 1
}
output {
name: "output_embedding"
data_type: TYPE_FP32
dims: -1
}
output {
name: "output_ids"
data_type: TYPE_INT64
dims: -1
}
instance_group {
gpus: 0
kind: KIND_GPU
}
default_model_filename: "model.py"
backend: "python"

I0609 02:28:45.507592 1 model_lifecycle.cc:469] loading: Mistral-7B-v0.3:1
I0609 02:28:45.507721 1 backend_model.cc:502] Adding default backend config setting: default-max-batch-size,4
I0609 02:28:45.507734 1 shared_library.cc:112] OpenLibraryHandle: /opt/tritonserver/backends/python/libtriton_python.so
I0609 02:28:45.508520 1 python_be.cc:2099] 'python' TRITONBACKEND API version: 1.19
I0609 02:28:45.508527 1 python_be.cc:2121] backend configuration:
{"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}}
I0609 02:28:45.508539 1 python_be.cc:2259] Shared memory configuration is shm-default-byte-size=1048576,shm-growth-byte-size=1048576,stub-timeout-seconds=30
I0609 02:28:45.508904 1 python_be.cc:2582] TRITONBACKEND_GetBackendAttribute: setting attributes
I0609 02:28:45.509007 1 python_be.cc:2360] TRITONBACKEND_ModelInitialize: Mistral-7B-v0.3 (version 1)
I0609 02:28:45.509236 1 model_config_utils.cc:1902] ModelConfig 64-bit fields:
I0609 02:28:45.509239 1 model_config_utils.cc:1904] ModelConfig::dynamic_batching::default_priority_level
I0609 02:28:45.509241 1 model_config_utils.cc:1904] ModelConfig::dynamic_batching::default_queue_policy::default_timeout_microseconds
I0609 02:28:45.509243 1 model_config_utils.cc:1904] ModelConfig::dynamic_batching::max_queue_delay_microseconds
I0609 02:28:45.509245 1 model_config_utils.cc:1904] ModelConfig::dynamic_batching::priority_levels
I0609 02:28:45.509247 1 model_config_utils.cc:1904] ModelConfig::dynamic_batching::priority_queue_policy::key
I0609 02:28:45.509249 1 model_config_utils.cc:1904] ModelConfig::dynamic_batching::priority_queue_policy::value::default_timeout_microseconds
I0609 02:28:45.509251 1 model_config_utils.cc:1904] ModelConfig::ensemble_scheduling::step::model_version
I0609 02:28:45.509253 1 model_config_utils.cc:1904] ModelConfig::input::dims
I0609 02:28:45.509254 1 model_config_utils.cc:1904] ModelConfig::input::reshape::shape
I0609 02:28:45.509256 1 model_config_utils.cc:1904] ModelConfig::instance_group::secondary_devices::device_id
I0609 02:28:45.509258 1 model_config_utils.cc:1904] ModelConfig::model_warmup::inputs::value::dims
I0609 02:28:45.509260 1 model_config_utils.cc:1904] ModelConfig::optimization::cuda::graph_spec::graph_lower_bound::input::value::dim
I0609 02:28:45.509263 1 model_config_utils.cc:1904] ModelConfig::optimization::cuda::graph_spec::input::value::dim
I0609 02:28:45.509264 1 model_config_utils.cc:1904] ModelConfig::output::dims
I0609 02:28:45.509266 1 model_config_utils.cc:1904] ModelConfig::output::reshape::shape
I0609 02:28:45.509268 1 model_config_utils.cc:1904] ModelConfig::sequence_batching::direct::max_queue_delay_microseconds
I0609 02:28:45.509270 1 model_config_utils.cc:1904] ModelConfig::sequence_batching::max_sequence_idle_microseconds
I0609 02:28:45.509272 1 model_config_utils.cc:1904] ModelConfig::sequence_batching::oldest::max_queue_delay_microseconds
I0609 02:28:45.509274 1 model_config_utils.cc:1904] ModelConfig::sequence_batching::state::dims
I0609 02:28:45.509276 1 model_config_utils.cc:1904] ModelConfig::sequence_batching::state::initial_state::dims
I0609 02:28:45.509278 1 model_config_utils.cc:1904] ModelConfig::version_policy::specific::versions
I0609 02:28:45.509740 1 stub_launcher.cc:385] Starting Python backend stub: exec /opt/tritonserver/backends/python/triton_python_backend_stub /models/Mistral-7B-v0.3/1/model.py triton_python_backend_shm_region_cf2fe392-563d-4a2e-acd7-3a141c183094 1048576 1048576 1 /opt/tritonserver/backends/python 336 Mistral-7B-v0.3 DEFAULT
I0609 02:28:48.668810 1 python_be.cc:2055] model configuration:
{
"name": "Mistral-7B-v0.3",
"platform": "",
"backend": "python",
"runtime": "",
"version_policy": {
"latest": {
"num_versions": 1
}
},
"max_batch_size": 0,
"input": [
{
"name": "input_text",
"data_type": "TYPE_STRING",
"format": "FORMAT_NONE",
"dims": [
1,
1
],
"is_shape_tensor": false,
"allow_ragged_batch": false,
"optional": false
},
{
"name": "function",
"data_type": "TYPE_STRING",
"format": "FORMAT_NONE",
"dims": [
1
],
"is_shape_tensor": false,
"allow_ragged_batch": false,
"optional": false
}
],
"output": [
{
"name": "output_text",
"data_type": "TYPE_STRING",
"dims": [
1
],
"label_filename": "",
"is_shape_tensor": false
},
{
"name": "output_embedding",
"data_type": "TYPE_FP32",
"dims": [
-1
],
"label_filename": "",
"is_shape_tensor": false
},
{
"name": "output_ids",
"data_type": "TYPE_INT64",
"dims": [
-1
],
"label_filename": "",
"is_shape_tensor": false
}
],
"batch_input": [],
"batch_output": [],
"optimization": {
"priority": "PRIORITY_DEFAULT",
"input_pinned_memory": {
"enable": true
},
"output_pinned_memory": {
"enable": true
},
"gather_kernel_buffer_threshold": 0,
"eager_batching": false
},
"instance_group": [
{
"name": "Mistral-7B-v0.3_0",
"kind": "KIND_GPU",
"count": 1,
"gpus": [
0
],
"secondary_devices": [],
"profile": [],
"passive": false,
"host_policy": ""
}
],
"default_model_filename": "model.py",
"cc_model_filenames": {},
"metric_tags": {},
"parameters": {},
"model_warmup": []
}
I0609 02:28:48.668972 1 python_be.cc:2404] TRITONBACKEND_ModelInstanceInitialize: Mistral-7B-v0.3_0_0 (GPU device 0)
I0609 02:28:48.669047 1 backend_model_instance.cc:106] Creating instance Mistral-7B-v0.3_0_0 on GPU 0 (8.9) using artifact 'model.py'
I0609 02:28:48.669335 1 stub_launcher.cc:385] Starting Python backend stub: exec /opt/tritonserver/backends/python/triton_python_backend_stub /models/Mistral-7B-v0.3/1/model.py triton_python_backend_shm_region_36b77468-5d65-49c9-85a2-5846aa709a7a 1048576 1048576 1 /opt/tritonserver/backends/python 336 Mistral-7B-v0.3_0_0 DEFAULT
Loading checkpoint shards: 100%|#############################################################################################################################################################################| 3/3 [00:03<00:00, 1.00s/it]
Failed to load model for task ner: Unrecognized configuration class <class 'transformers.models.mistral.configuration_mistral.MistralConfig'> for this kind of AutoModel: AutoModelForTokenClassification.
Model type should be one of AlbertConfig, BertConfig, BigBirdConfig, BioGptConfig, BloomConfig, BrosConfig, CamembertConfig, CanineConfig, ConvBertConfig, Data2VecTextConfig, DebertaConfig, DebertaV2Config, DistilBertConfig, ElectraConfig, ErnieConfig, ErnieMConfig, EsmConfig, FalconConfig, FlaubertConfig, FNetConfig, FunnelConfig, GPT2Config, GPT2Config, GPTBigCodeConfig, GPTNeoConfig, GPTNeoXConfig, IBertConfig, LayoutLMConfig, LayoutLMv2Config, LayoutLMv3Config, LiltConfig, LongformerConfig, LukeConfig, MarkupLMConfig, MegaConfig, MegatronBertConfig, MobileBertConfig, MPNetConfig, MptConfig, MraConfig, MT5Config, NezhaConfig, NystromformerConfig, PhiConfig, Phi3Config, QDQBertConfig, RemBertConfig, RobertaConfig, RobertaPreLayerNormConfig, RoCBertConfig, RoFormerConfig, SqueezeBertConfig, T5Config, UMT5Config, XLMConfig, XLMRobertaConfig, XLMRobertaXLConfig, XLNetConfig, XmodConfig, YosoConfig.
Failed to load model for task summarization: Unrecognized configuration class <class 'transformers.models.mistral.configuration_mistral.MistralConfig'> for this kind of AutoModel: AutoModelForSeq2SeqLM.
Model type should be one of BartConfig, BigBirdPegasusConfig, BlenderbotConfig, BlenderbotSmallConfig, EncoderDecoderConfig, FSMTConfig, GPTSanJapaneseConfig, LEDConfig, LongT5Config, M2M100Config, MarianConfig, MBartConfig, MT5Config, MvpConfig, NllbMoeConfig, PegasusConfig, PegasusXConfig, PLBartConfig, ProphetNetConfig, SeamlessM4TConfig, SeamlessM4Tv2Config, SwitchTransformersConfig, T5Config, UMT5Config, XLMProphetNetConfig.
Loading checkpoint shards: 100%|#############################################################################################################################################################################| 3/3 [00:03<00:00, 1.26s/it]
Some weights of MistralForSequenceClassification were not initialized from the model checkpoint at /models/Mistral-7B-v0.3/1 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Failed to load model for task translation: Unrecognized configuration class <class 'transformers.models.mistral.configuration_mistral.MistralConfig'> for this kind of AutoModel: AutoModelForSeq2SeqLM.
Model type should be one of BartConfig, BigBirdPegasusConfig, BlenderbotConfig, BlenderbotSmallConfig, EncoderDecoderConfig, FSMTConfig, GPTSanJapaneseConfig, LEDConfig, LongT5Config, M2M100Config, MarianConfig, MBartConfig, MT5Config, MvpConfig, NllbMoeConfig, PegasusConfig, PegasusXConfig, PLBartConfig, ProphetNetConfig, SeamlessM4TConfig, SeamlessM4Tv2Config, SwitchTransformersConfig, T5Config, UMT5Config, XLMProphetNetConfig.
I0609 02:28:57.463372 1 python_be.cc:2425] TRITONBACKEND_ModelInstanceInitialize: instance initialization successful Mistral-7B-v0.3_0_0 (device 0)
I0609 02:28:57.473754 1 backend_model_instance.cc:772] Starting backend thread for Mistral-7B-v0.3_0_0 at nice 0 on device 0...
I0609 02:28:57.480430 1 model_lifecycle.cc:835] successfully loaded 'Mistral-7B-v0.3'
I0609 02:28:57.483055 1 server.cc:607]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I0609 02:28:57.484239 1 server.cc:634]
+---------+-------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Backend | Path | Config |
+---------+-------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
| python | /opt/tritonserver/backends/python/libtriton_python.so | {"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}} |
+---------+-------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+

I0609 02:28:57.485211 1 server.cc:677]
+-----------------+---------+--------+
| Model | Version | Status |
+-----------------+---------+--------+
| Mistral-7B-v0.3 | 1 | READY |
+-----------------+---------+--------+

I0609 02:28:57.639809 1 metrics.cc:877] Collecting metrics for GPU 0: NVIDIA RTX 4000 SFF Ada Generation
I0609 02:28:57.640830 1 metrics.cc:770] Collecting CPU metrics
I0609 02:28:57.641538 1 tritonserver.cc:2538]
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option | Value |
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id | triton |
| server_version | 2.45.0 |
| server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data parameters statistics tr |
| | ace logging |
| model_repository_path[0] | /models |
| model_control_mode | MODE_NONE |
| strict_model_config | 0 |
| rate_limit | OFF |
| pinned_memory_pool_byte_size | 268435456 |
| cuda_memory_pool_byte_size{0} | 67108864 |
| min_supported_compute_capability | 6.0 |
| strict_readiness | 1 |
| exit_timeout | 30 |
| cache_enabled | 0 |
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I0609 02:28:57.652122 1 grpc_server.cc:2370]
+----------------------------------------------+---------+
| GRPC KeepAlive Option | Value |
+----------------------------------------------+---------+
| keepalive_time_ms | 7200000 |
| keepalive_timeout_ms | 20000 |
| keepalive_permit_without_calls | 0 |
| http2_max_pings_without_data | 2 |
| http2_min_recv_ping_interval_without_data_ms | 300000 |
| http2_max_ping_strikes | 2 |
+----------------------------------------------+---------+

I0609 02:28:57.661526 1 grpc_server.cc:102] Ready for RPC 'Check', 0
I0609 02:28:57.662075 1 grpc_server.cc:102] Ready for RPC 'ServerLive', 0
I0609 02:28:57.662107 1 grpc_server.cc:102] Ready for RPC 'ServerReady', 0
I0609 02:28:57.662114 1 grpc_server.cc:102] Ready for RPC 'ModelReady', 0
I0609 02:28:57.662127 1 grpc_server.cc:102] Ready for RPC 'ServerMetadata', 0
I0609 02:28:57.662136 1 grpc_server.cc:102] Ready for RPC 'ModelMetadata', 0
I0609 02:28:57.662144 1 grpc_server.cc:102] Ready for RPC 'ModelConfig', 0
I0609 02:28:57.662322 1 grpc_server.cc:102] Ready for RPC 'SystemSharedMemoryStatus', 0
I0609 02:28:57.662339 1 grpc_server.cc:102] Ready for RPC 'SystemSharedMemoryRegister', 0
I0609 02:28:57.662363 1 grpc_server.cc:102] Ready for RPC 'SystemSharedMemoryUnregister', 0
I0609 02:28:57.662377 1 grpc_server.cc:102] Ready for RPC 'CudaSharedMemoryStatus', 0
I0609 02:28:57.662389 1 grpc_server.cc:102] Ready for RPC 'CudaSharedMemoryRegister', 0
I0609 02:28:57.662411 1 grpc_server.cc:102] Ready for RPC 'CudaSharedMemoryUnregister', 0
I0609 02:28:57.662425 1 grpc_server.cc:102] Ready for RPC 'RepositoryIndex', 0
I0609 02:28:57.662451 1 grpc_server.cc:102] Ready for RPC 'RepositoryModelLoad', 0
I0609 02:28:57.662460 1 grpc_server.cc:102] Ready for RPC 'RepositoryModelUnload', 0
I0609 02:28:57.662470 1 grpc_server.cc:102] Ready for RPC 'ModelStatistics', 0
I0609 02:28:57.662490 1 grpc_server.cc:102] Ready for RPC 'Trace', 0
I0609 02:28:57.662506 1 grpc_server.cc:102] Ready for RPC 'Logging', 0
I0609 02:28:57.662531 1 grpc_server.cc:366] Thread started for CommonHandler
I0609 02:28:57.663217 1 infer_handler.cc:680] New request handler for ModelInferHandler, 0
I0609 02:28:57.663269 1 infer_handler.h:1322] Thread started for ModelInferHandler
I0609 02:28:57.663410 1 infer_handler.cc:680] New request handler for ModelInferHandler, 0
I0609 02:28:57.663464 1 infer_handler.h:1322] Thread started for ModelInferHandler
I0609 02:28:57.663761 1 stream_infer_handler.cc:128] New request handler for ModelStreamInferHandler, 0
I0609 02:28:57.663800 1 infer_handler.h:1322] Thread started for ModelStreamInferHandler
I0609 02:28:57.663817 1 grpc_server.cc:2463] Started GRPCInferenceService at 0.0.0.0:8001
I0609 02:28:57.665315 1 http_server.cc:4692] Started HTTPService at 0.0.0.0:8000
I0609 02:28:57.707641 1 http_server.cc:362] Started Metrics Service at 0.0.0.0:8002
I0609 02:29:10.838304 1 infer_handler.cc:702] Process for ModelInferHandler, rpc_ok=1, 0 step START
I0609 02:29:10.838333 1 infer_handler.cc:680] New request handler for ModelInferHandler, 0
I0609 02:29:10.839025 1 infer_request.cc:131] [request id: ] Setting state from INITIALIZED to INITIALIZED
I0609 02:29:10.839034 1 infer_request.cc:900] [request id: ] prepared: [0x0x7ffa98007110] request id: , model: Mistral-7B-v0.3, requested version: -1, actual version: 1, flags: 0x0, correlation id: 0, batch size: 0, priority: 0, timeout (us): 0
original inputs:
[0x0x7ffa980077d8] input: function, type: BYTES, original shape: [1], batch + shape: [1], shape: [1]
[0x0x7ffa98007698] input: input_text, type: BYTES, original shape: [1,1], batch + shape: [1,1], shape: [1,1]
override inputs:
inputs:
[0x0x7ffa98007698] input: input_text, type: BYTES, original shape: [1,1], batch + shape: [1,1], shape: [1,1]
[0x0x7ffa980077d8] input: function, type: BYTES, original shape: [1], batch + shape: [1], shape: [1]
original requested outputs:
output_text
requested outputs:
output_text

I0609 02:29:10.839047 1 infer_request.cc:131] [request id: ] Setting state from INITIALIZED to PENDING
I0609 02:29:10.839070 1 infer_request.cc:131] [request id: ] Setting state from PENDING to EXECUTING
I0609 02:29:10.839104 1 python_be.cc:1395] model Mistral-7B-v0.3, instance Mistral-7B-v0.3_0_0, executing 1 requests
Both max_new_tokens (=10000) and max_length(=10000) seem to have been set. max_new_tokens will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)
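
The task-specific failures above are expected: Mistral is a decoder-only model, so there is no Mistral implementation of AutoModelForSeq2SeqLM or AutoModelForTokenClassification, and the sequence-classification head only loads with a freshly initialized score.weight (hence the warning about training on a downstream task). Tasks like summarization can instead be approximated by prompting the causal LM itself. A rough sketch of that approach, assuming the checkpoint path from the log (the prompt wording is illustrative):

from transformers import AutoTokenizer, AutoModelForCausalLM

model_dir = "/models/Mistral-7B-v0.3/1"  # path taken from the log above
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(model_dir)

def summarize(text, max_new_tokens=128):
    # Prompt the base model instead of loading a (non-existent) seq2seq head
    prompt = f"Summarize the following text:\n\n{text}\n\nSummary:"
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.eos_token_id,
    )
    # Decode only the newly generated tokens, not the prompt
    return tokenizer.decode(
        output_ids[0][inputs["input_ids"].shape[1]:],
        skip_special_tokens=True,
    )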


This is my model.py code (invoked by Triton from the model repository shown below):

import triton_python_backend_utils as pb_utils
from transformers import AutoTokenizer, AutoModelForCausalLM, AutoModelForSeq2SeqLM, AutoModelForTokenClassification, AutoModelForSequenceClassification, pipeline
import torch
import numpy as np

class TritonPythonModel:
    def initialize(self, args):
        # Triton passes the repository path and version; the HF checkpoint sits next to model.py
        model_dir = args['model_repository'] + '/' + args['model_version']
        self.tokenizer = AutoTokenizer.from_pretrained(model_dir)

        self.model_chatbot = AutoModelForCausalLM.from_pretrained(model_dir)
        self.model_ner = self.load_model_for_task(model_dir, "ner")
        self.model_summarization = self.load_model_for_task(model_dir, "summarization")
        self.model_sentiment = self.load_model_for_task(model_dir, "sentiment")
        self.model_translation = self.load_model_for_task(model_dir, "translation")
        self.model_embedding = pipeline("feature-extraction", model=self.model_chatbot, tokenizer=self.tokenizer)

    def load_model_for_task(self, model_dir, task):
        # Returns a task-specific pipeline, or None if the checkpoint has no head for that task
        try:
            if task == "ner":
                return pipeline("ner", model=AutoModelForTokenClassification.from_pretrained(model_dir), tokenizer=self.tokenizer)
            elif task == "summarization" or task == "translation":
                return pipeline(task, model=AutoModelForSeq2SeqLM.from_pretrained(model_dir), tokenizer=self.tokenizer)
            elif task == "sentiment":
                return pipeline("sentiment-analysis", model=AutoModelForSequenceClassification.from_pretrained(model_dir), tokenizer=self.tokenizer)
            else:
                raise ValueError(f"Task {task} is not supported by the model configuration.")
        except Exception as e:
            print(f"Failed to load model for task {task}: {str(e)}")
            return None

    def execute(self, requests):
        responses = []
        for request in requests:
            input_text_tensor = pb_utils.get_input_tensor_by_name(request, "input_text")
            function_tensor = pb_utils.get_input_tensor_by_name(request, "function")

            input_texts = input_text_tensor.as_numpy().tolist()
            functions = function_tensor.as_numpy().tolist()

            for input_text, function in zip(input_texts, functions):
                input_text = input_text[0].decode('utf-8')
                function = function.decode('utf-8')

                if function == 'chatbot':
                    responses.append(self.chatbot_response(input_text))
                elif function == 'ner' and self.model_ner:
                    responses.append(self.ner_response(input_text))
                elif function == 'summarization' and self.model_summarization:
                    responses.append(self.summarization_response(input_text))
                elif function == 'sentiment' and self.model_sentiment:
                    responses.append(self.sentiment_response(input_text))
                elif function == 'translation' and self.model_translation:
                    responses.append(self.translation_response(input_text))
                elif function == 'embedding':
                    responses.append(self.embedding_response(input_text))
                elif function == 'tokenization':
                    responses.append(self.tokenization_response(input_text))
                else:
                    responses.append(self.error_response(f"Function {function} not supported or model not available"))

        return responses

    def chatbot_response(self, input_text):
        input_ids = self.tokenizer.encode(input_text, return_tensors="pt")
        attention_mask = torch.ones_like(input_ids)
        # Note: max_new_tokens takes precedence over max_length when both are set,
        # which is what triggers the generation warning in the log above
        outputs = self.model_chatbot.generate(input_ids, attention_mask=attention_mask, pad_token_id=self.tokenizer.eos_token_id, max_new_tokens=10000, max_length=10000)
        generated_text = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
        output_tensor = pb_utils.Tensor("output_text", np.array([generated_text], dtype=object))
        return pb_utils.InferenceResponse(output_tensors=[output_tensor])

    def ner_response(self, input_text):
        ner_results = self.model_ner(input_text)
        output_tensor = pb_utils.Tensor("output_text", np.array([str(ner_results)], dtype=object))
        return pb_utils.InferenceResponse(output_tensors=[output_tensor])

    def summarization_response(self, input_text):
        summary = self.model_summarization(input_text)
        output_tensor = pb_utils.Tensor("output_text", np.array([summary[0]['summary_text']], dtype=object))
        return pb_utils.InferenceResponse(output_tensors=[output_tensor])

    def sentiment_response(self, input_text):
        sentiment = self.model_sentiment(input_text)
        output_tensor = pb_utils.Tensor("output_text", np.array([str(sentiment)], dtype=object))
        return pb_utils.InferenceResponse(output_tensors=[output_tensor])

    def translation_response(self, input_text):
        translation = self.model_translation(input_text)
        output_tensor = pb_utils.Tensor("output_text", np.array([translation[0]['translation_text']], dtype=object))
        return pb_utils.InferenceResponse(output_tensors=[output_tensor])

    def embedding_response(self, input_text):
        embeddings = self.model_embedding(input_text)
        output_tensor = pb_utils.Tensor("output_embedding", np.array(embeddings, dtype=np.float32))
        return pb_utils.InferenceResponse(output_tensors=[output_tensor])

    def tokenization_response(self, input_text):
        tokenized = self.tokenizer(input_text)
        output_tensor = pb_utils.Tensor("output_ids", np.array([tokenized.input_ids], dtype=np.int64))
        return pb_utils.InferenceResponse(output_tensors=[output_tensor])

    def error_response(self, message):
        output_tensor = pb_utils.Tensor("output_text", np.array([message], dtype=object))
        return pb_utils.InferenceResponse(output_tensors=[output_tensor])

    def finalize(self):
        pass

Model repository layout:

β”œβ”€β”€ Mistral-7B-v0.3
β”‚   β”œβ”€β”€ 1
β”‚   β”‚   β”œβ”€β”€ README.md
β”‚   β”‚   β”œβ”€β”€ __pycache__
β”‚   β”‚   β”‚   └── model.cpython-310.pyc
β”‚   β”‚   β”œβ”€β”€ config.json
β”‚   β”‚   β”œβ”€β”€ consolidated.safetensors
β”‚   β”‚   β”œβ”€β”€ generation_config.json
β”‚   β”‚   β”œβ”€β”€ model-00001-of-00003.safetensors
β”‚   β”‚   β”œβ”€β”€ model-00002-of-00003.safetensors
β”‚   β”‚   β”œβ”€β”€ model-00003-of-00003.safetensors
β”‚   β”‚   β”œβ”€β”€ model.py
β”‚   β”‚   β”œβ”€β”€ model.safetensors.index.json
β”‚   β”‚   β”œβ”€β”€ params.json
β”‚   β”‚   β”œβ”€β”€ special_tokens_map.json
β”‚   β”‚   β”œβ”€β”€ tokenizer.json
β”‚   β”‚   β”œβ”€β”€ tokenizer.model
β”‚   β”‚   β”œβ”€β”€ tokenizer.model.v3
β”‚   β”‚   └── tokenizer_config.json
β”‚   └── config.pbtxt
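
For testing the deployed model, a rough sketch of a client call matching the config above, assuming the HTTP endpoint from the log (localhost:8000) and placeholder prompt text:

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# input_text is declared with dims [1, 1] and function with dims [1], both TYPE_STRING
text = np.array([["Hello, who are you?"]], dtype=object)
func = np.array(["chatbot"], dtype=object)

inputs = [
    httpclient.InferInput("input_text", [1, 1], "BYTES"),
    httpclient.InferInput("function", [1], "BYTES"),
]
inputs[0].set_data_from_numpy(text)
inputs[1].set_data_from_numpy(func)

result = client.infer(
    model_name="Mistral-7B-v0.3",
    inputs=inputs,
    outputs=[httpclient.InferRequestedOutput("output_text")],
)
print(result.as_numpy("output_text"))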

Please HELP!

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.29.06 Driver Version: 545.29.06 CUDA Version: 12.3 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA RTX 4000 SFF Ada ... Off | 00000000:01:00.0 Off | Off |
| 30% 37C P8 11W / 70W | 229MiB / 20475MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 395864 C tritonserver 222MiB |
+---------------------------------------------------------------------------------------+
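One observation from the nvidia-smi output: tritonserver is only holding about 222 MiB on the GPU, which is far too small for a 7B-parameter model, so the checkpoints loaded in model.py are almost certainly sitting in CPU memory; from_pretrained does not move weights to the GPU by itself, even with a KIND_GPU instance group. A minimal sketch of explicit placement, assuming enough free GPU memory and that the accelerate package is installed (the path is the one from the repository layout above):

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "/models/Mistral-7B-v0.3/1",
    torch_dtype=torch.float16,   # half precision roughly halves the memory footprint
    device_map={"": 0},          # place all weights on GPU 0 (requires accelerate)
)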
