Deployment to Inference Endpoints fails (weight model.embed_tokens.weight does not exist)

#18
by ts314 - opened

It seems the model currently cannot be deployed to Inference Endpoints (as of commit 4111085adfbe2172075e92d6d2b9ef2a1080bc90) on any AWS GPU machine (I tried several, including A10G and T4 GPUs).

Here's the tail of the error log (the core issue seems to be "RuntimeError: weight model.embed_tokens.weight does not exist" followed by "Error: ShardCannotStart").

Any advice?

2024/01/20 23:45:36 ~ {"timestamp":"2024-01-20T22:45:36.195045Z","level":"INFO","fields":{"message":"Starting download process."},"target":"text_generation_launcher","span":{"name":"download"},"spans":[{"name":"download"}]}
2024/01/20 23:45:36 ~ {"timestamp":"2024-01-20T22:45:36.194959Z","level":"INFO","fields":{"message":"Sharding model on 4 processes"},"target":"text_generation_launcher"}
2024/01/20 23:45:41 ~ {"timestamp":"2024-01-20T22:45:41.015952Z","level":"INFO","fields":{"message":"Files are already present on the host. Skipping download.\n"},"target":"text_generation_launcher"}
2024/01/20 23:45:41 ~ {"timestamp":"2024-01-20T22:45:41.600361Z","level":"INFO","fields":{"message":"Successfully downloaded weights."},"target":"text_generation_launcher","span":{"name":"download"},"spans":[{"name":"download"}]}
2024/01/20 23:45:41 ~ {"timestamp":"2024-01-20T22:45:41.600642Z","level":"INFO","fields":{"message":"Starting shard"},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}
2024/01/20 23:45:41 ~ {"timestamp":"2024-01-20T22:45:41.600711Z","level":"INFO","fields":{"message":"Starting shard"},"target":"text_generation_launcher","span":{"rank":1,"name":"shard-manager"},"spans":[{"rank":1,"name":"shard-manager"}]}
2024/01/20 23:45:41 ~ {"timestamp":"2024-01-20T22:45:41.601003Z","level":"INFO","fields":{"message":"Starting shard"},"target":"text_generation_launcher","span":{"rank":3,"name":"shard-manager"},"spans":[{"rank":3,"name":"shard-manager"}]}
2024/01/20 23:45:41 ~ {"timestamp":"2024-01-20T22:45:41.600987Z","level":"INFO","fields":{"message":"Starting shard"},"target":"text_generation_launcher","span":{"rank":2,"name":"shard-manager"},"spans":[{"rank":2,"name":"shard-manager"}]}
2024/01/20 23:45:46 ~ {"timestamp":"2024-01-20T22:45:46.244286Z","level":"WARN","fields":{"message":"Disabling exllama v2 and using v1 instead because there are issues when sharding\n"},"target":"text_generation_launcher"}
2024/01/20 23:45:46 ~ {"timestamp":"2024-01-20T22:45:46.271154Z","level":"WARN","fields":{"message":"Disabling exllama v2 and using v1 instead because there are issues when sharding\n"},"target":"text_generation_launcher"}
2024/01/20 23:45:46 ~ {"timestamp":"2024-01-20T22:45:46.296080Z","level":"WARN","fields":{"message":"Disabling exllama v2 and using v1 instead because there are issues when sharding\n"},"target":"text_generation_launcher"}
2024/01/20 23:45:46 ~ {"timestamp":"2024-01-20T22:45:46.509018Z","level":"WARN","fields":{"message":"Disabling exllama v2 and using v1 instead because there are issues when sharding\n"},"target":"text_generation_launcher"}
2024/01/20 23:45:49 ~ {"timestamp":"2024-01-20T22:45:49.628474Z","level":"ERROR","fields":{"message":"Error when initializing model\nTraceback (most recent call last):\n File "/opt/conda/bin/text-generation-server", line 8, in \n sys.exit(app())\n File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__\n return get_command(self)(*args, **kwargs)\n File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__\n return self.main(*args, **kwargs)\n File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main\n return _main(\n File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main\n rv = self.invoke(ctx)\n File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke\n return _process_result(sub_ctx.command.invoke(sub_ctx))\n File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke\n return ctx.invoke(self.callback, **ctx.params)\n File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke\n return __callback(*args, **kwargs)\n File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper\n return callback(**use_params) # type: ignore\n File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 89, in serve\n server.serve(\n File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 235, in serve\n asyncio.run(\n File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run\n return loop.run_until_complete(main)\n File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete\n self.run_forever()\n File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever\n self._run_once()\n File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once\n handle._run()\n File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run\n self._context.run(self._callback, 
*self._args)\n> File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 196, in serve_inner\n model = get_model(\n File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/init.py", line 288, in get_model\n return FlashMistral(\n File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_mistral.py", line 430, in __init__\n super(FlashMistral, self).init(\n File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_mistral.py", line 333, in __init__\n model = model_cls(config, weights)\n File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mistral_modeling.py", line 421, in __init__\n self.model = MistralModel(config, weights)\n File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mistral_modeling.py", line 352, in __init__\n self.embed_tokens = TensorParallelEmbedding(\n File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/layers.py", line 502, in __init__\n weight = weights.get_partial_sharded(f"{prefix}.weight", dim=0)\n File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/weights.py", line 88, in get_partial_sharded\n filename, tensor_name = self.get_filename(tensor_name)\n File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/weights.py", line 64, in get_filename\n raise RuntimeError(f"weight {tensor_name} does not exist")\nRuntimeError: weight model.embed_tokens.weight does not exist\n"},"target":"text_generation_launcher"}
2024/01/20 23:45:49 ~ {"timestamp":"2024-01-20T22:45:49.628474Z","level":"ERROR","fields":{"message":"Error when initializing model\nTraceback (most recent call last):\n File "/opt/conda/bin/text-generation-server", line 8, in \n sys.exit(app())\n File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__\n return get_command(self)(*args, **kwargs)\n File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__\n return self.main(*args, **kwargs)\n File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main\n return _main(\n File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main\n rv = self.invoke(ctx)\n File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke\n return _process_result(sub_ctx.command.invoke(sub_ctx))\n File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke\n return ctx.invoke(self.callback, **ctx.params)\n File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke\n return __callback(*args, **kwargs)\n File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper\n return callback(**use_params) # type: ignore\n File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 89, in serve\n server.serve(\n File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 235, in serve\n asyncio.run(\n File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run\n return loop.run_until_complete(main)\n File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete\n self.run_forever()\n File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever\n self._run_once()\n File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once\n handle._run()\n File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run\n self._context.run(self._callback, 
*self._args)\n> File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 196, in serve_inner\n model = get_model(\n File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/init.py", line 288, in get_model\n return FlashMistral(\n File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_mistral.py", line 430, in __init__\n super(FlashMistral, self).init(\n File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_mistral.py", line 333, in __init__\n model = model_cls(config, weights)\n File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mistral_modeling.py", line 421, in __init__\n self.model = MistralModel(config, weights)\n File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mistral_modeling.py", line 352, in __init__\n self.embed_tokens = TensorParallelEmbedding(\n File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/layers.py", line 502, in __init__\n weight = weights.get_partial_sharded(f"{prefix}.weight", dim=0)\n File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/weights.py", line 88, in get_partial_sharded\n filename, tensor_name = self.get_filename(tensor_name)\n File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/weights.py", line 64, in get_filename\n raise RuntimeError(f"weight {tensor_name} does not exist")\nRuntimeError: weight model.embed_tokens.weight does not exist\n"},"target":"text_generation_launcher"}
2024/01/20 23:45:49 ~ {"timestamp":"2024-01-20T22:45:49.629309Z","level":"ERROR","fields":{"message":"Error when initializing model\nTraceback (most recent call last):\n File "/opt/conda/bin/text-generation-server", line 8, in \n sys.exit(app())\n File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__\n return get_command(self)(*args, **kwargs)\n File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__\n return self.main(*args, **kwargs)\n File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main\n return _main(\n File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main\n rv = self.invoke(ctx)\n File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke\n return _process_result(sub_ctx.command.invoke(sub_ctx))\n File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke\n return ctx.invoke(self.callback, **ctx.params)\n File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke\n return __callback(*args, **kwargs)\n File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper\n return callback(**use_params) # type: ignore\n File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 89, in serve\n server.serve(\n File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 235, in serve\n asyncio.run(\n File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run\n return loop.run_until_complete(main)\n File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete\n self.run_forever()\n File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever\n self._run_once()\n File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once\n handle._run()\n File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run\n self._context.run(self._callback, 
*self._args)\n> File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 196, in serve_inner\n model = get_model(\n File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/init.py", line 288, in get_model\n return FlashMistral(\n File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_mistral.py", line 430, in __init__\n super(FlashMistral, self).init(\n File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_mistral.py", line 333, in __init__\n model = model_cls(config, weights)\n File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mistral_modeling.py", line 421, in __init__\n self.model = MistralModel(config, weights)\n File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mistral_modeling.py", line 352, in __init__\n self.embed_tokens = TensorParallelEmbedding(\n File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/layers.py", line 502, in __init__\n weight = weights.get_partial_sharded(f"{prefix}.weight", dim=0)\n File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/weights.py", line 88, in get_partial_sharded\n filename, tensor_name = self.get_filename(tensor_name)\n File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/weights.py", line 64, in get_filename\n raise RuntimeError(f"weight {tensor_name} does not exist")\nRuntimeError: weight model.embed_tokens.weight does not exist\n"},"target":"text_generation_launcher"}
2024/01/20 23:45:49 ~ {"timestamp":"2024-01-20T22:45:49.629296Z","level":"ERROR","fields":{"message":"Error when initializing model\nTraceback (most recent call last):\n File "/opt/conda/bin/text-generation-server", line 8, in \n sys.exit(app())\n File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__\n return get_command(self)(*args, **kwargs)\n File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__\n return self.main(*args, **kwargs)\n File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main\n return _main(\n File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main\n rv = self.invoke(ctx)\n File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke\n return _process_result(sub_ctx.command.invoke(sub_ctx))\n File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke\n return ctx.invoke(self.callback, **ctx.params)\n File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke\n return __callback(*args, **kwargs)\n File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper\n return callback(**use_params) # type: ignore\n File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 89, in serve\n server.serve(\n File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 235, in serve\n asyncio.run(\n File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run\n return loop.run_until_complete(main)\n File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete\n self.run_forever()\n File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever\n self._run_once()\n File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once\n handle._run()\n File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run\n self._context.run(self._callback, 
*self._args)\n> File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 196, in serve_inner\n model = get_model(\n File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/init.py", line 288, in get_model\n return FlashMistral(\n File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_mistral.py", line 430, in __init__\n super(FlashMistral, self).init(\n File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_mistral.py", line 333, in __init__\n model = model_cls(config, weights)\n File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mistral_modeling.py", line 421, in __init__\n self.model = MistralModel(config, weights)\n File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mistral_modeling.py", line 352, in __init__\n self.embed_tokens = TensorParallelEmbedding(\n File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/layers.py", line 502, in __init__\n weight = weights.get_partial_sharded(f"{prefix}.weight", dim=0)\n File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/weights.py", line 88, in get_partial_sharded\n filename, tensor_name = self.get_filename(tensor_name)\n File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/weights.py", line 64, in get_filename\n raise RuntimeError(f"weight {tensor_name} does not exist")\nRuntimeError: weight model.embed_tokens.weight does not exist\n"},"target":"text_generation_launcher"}
2024/01/20 23:45:50 ~ {"timestamp":"2024-01-20T22:45:50.910360Z","level":"ERROR","fields":{"message":"Shard complete standard error output:\n\n[W socket.cpp:663] [c10d] The client socket has failed to connect to [localhost]:29500 (errno: 99 - Cannot assign requested address).\nTraceback (most recent call last):\n\n File "/opt/conda/bin/text-generation-server", line 8, in \n sys.exit(app())\n\n File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 89, in serve\n server.serve(\n\n File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 235, in serve\n asyncio.run(\n\n File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run\n return loop.run_until_complete(main)\n\n File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete\n return future.result()\n\n File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 196, in serve_inner\n model = get_model(\n\n File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/init.py", line 288, in get_model\n return FlashMistral(\n\n File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_mistral.py", line 430, in __init__\n super(FlashMistral, self).init(\n\n File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_mistral.py", line 333, in __init__\n model = model_cls(config, weights)\n\n File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mistral_modeling.py", line 421, in __init__\n self.model = MistralModel(config, weights)\n\n File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mistral_modeling.py", line 352, in __init__\n self.embed_tokens = TensorParallelEmbedding(\n\n File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/layers.py", line 502, in __init__\n weight = weights.get_partial_sharded(f"{prefix}.weight", 
dim=0)\n\n File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/weights.py", line 88, in get_partial_sharded\n filename, tensor_name = self.get_filename(tensor_name)\n\n File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/weights.py", line 64, in get_filename\n raise RuntimeError(f"weight {tensor_name} does not exist")\n\nRuntimeError: weight model.embed_tokens.weight does not exist\n"},"target":"text_generation_launcher","span":{"rank":2,"name":"shard-manager"},"spans":[{"rank":2,"name":"shard-manager"}]}
2024/01/20 23:45:51 ~ {"timestamp":"2024-01-20T22:45:51.009485Z","level":"INFO","fields":{"message":"Shutting down shards"},"target":"text_generation_launcher"}
2024/01/20 23:45:51 ~ {"timestamp":"2024-01-20T22:45:51.009448Z","level":"ERROR","fields":{"message":"Shard 2 failed to start"},"target":"text_generation_launcher"}
2024/01/20 23:45:51 ~ {"timestamp":"2024-01-20T22:45:51.009861Z","level":"ERROR","fields":{"message":"Shard complete standard error output:\n\nTraceback (most recent call last):\n\n File "/opt/conda/bin/text-generation-server", line 8, in \n sys.exit(app())\n\n File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 89, in serve\n server.serve(\n\n File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 235, in serve\n asyncio.run(\n\n File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run\n return loop.run_until_complete(main)\n\n File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete\n return future.result()\n\n File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 196, in serve_inner\n model = get_model(\n\n File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/init.py", line 288, in get_model\n return FlashMistral(\n\n File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_mistral.py", line 430, in __init__\n super(FlashMistral, self).init(\n\n File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_mistral.py", line 333, in __init__\n model = model_cls(config, weights)\n\n File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mistral_modeling.py", line 421, in __init__\n self.model = MistralModel(config, weights)\n\n File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mistral_modeling.py", line 352, in __init__\n self.embed_tokens = TensorParallelEmbedding(\n\n File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/layers.py", line 502, in __init__\n weight = weights.get_partial_sharded(f"{prefix}.weight", dim=0)\n\n File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/weights.py", line 88, in get_partial_sharded\n 
filename, tensor_name = self.get_filename(tensor_name)\n\n File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/weights.py", line 64, in get_filename\n raise RuntimeError(f"weight {tensor_name} does not exist")\n\nRuntimeError: weight model.embed_tokens.weight does not exist\n"},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}
2024/01/20 23:45:51 ~ {"timestamp":"2024-01-20T22:45:51.011705Z","level":"ERROR","fields":{"message":"Shard complete standard error output:\n\nTraceback (most recent call last):\n\n File "/opt/conda/bin/text-generation-server", line 8, in \n sys.exit(app())\n\n File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 89, in serve\n server.serve(\n\n File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 235, in serve\n asyncio.run(\n\n File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run\n return loop.run_until_complete(main)\n\n File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete\n return future.result()\n\n File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 196, in serve_inner\n model = get_model(\n\n File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/init.py", line 288, in get_model\n return FlashMistral(\n\n File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_mistral.py", line 430, in __init__\n super(FlashMistral, self).init(\n\n File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_mistral.py", line 333, in __init__\n model = model_cls(config, weights)\n\n File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mistral_modeling.py", line 421, in __init__\n self.model = MistralModel(config, weights)\n\n File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mistral_modeling.py", line 352, in __init__\n self.embed_tokens = TensorParallelEmbedding(\n\n File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/layers.py", line 502, in __init__\n weight = weights.get_partial_sharded(f"{prefix}.weight", dim=0)\n\n File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/weights.py", line 88, in get_partial_sharded\n 
filename, tensor_name = self.get_filename(tensor_name)\n\n File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/weights.py", line 64, in get_filename\n raise RuntimeError(f"weight {tensor_name} does not exist")\n\nRuntimeError: weight model.embed_tokens.weight does not exist\n"},"target":"text_generation_launcher","span":{"rank":3,"name":"shard-manager"},"spans":[{"rank":3,"name":"shard-manager"}]}
2024/01/20 23:45:51 ~ {"timestamp":"2024-01-20T22:45:51.083890Z","level":"INFO","fields":{"message":"Shard terminated"},"target":"text_generation_launcher","span":{"rank":1,"name":"shard-manager"},"spans":[{"rank":1,"name":"shard-manager"}]}
2024/01/20 23:45:51 ~ Error: ShardCannotStart
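The RuntimeError in the logs comes from TGI looking up model.embed_tokens.weight among the tensor names stored in the checkpoint's safetensors files; an encoder/embedding checkpoint typically uses different names (for example embeddings.word_embeddings.weight in BERT-style models), so the lookup fails. Below is a minimal, stdlib-only sketch of listing the tensor names in a safetensors file to check what a checkpoint actually contains. The dummy file and the tensor name in it are illustrative assumptions, not taken from this model.

```python
import json
import struct

def read_safetensors_keys(path):
    """List tensor names in a safetensors file.

    The format starts with an 8-byte little-endian length, followed by a
    JSON header mapping tensor names to dtype/shape/offset metadata.
    """
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    return [k for k in header if k != "__metadata__"]

# Build a tiny dummy checkpoint to demonstrate (hypothetical tensor name):
data = b"\x00" * 16  # four float32 values
header = {
    "embeddings.word_embeddings.weight": {
        "dtype": "F32", "shape": [2, 2], "data_offsets": [0, 16]},
}
blob = json.dumps(header).encode()
with open("dummy.safetensors", "wb") as f:
    f.write(struct.pack("<Q", len(blob)) + blob + data)

keys = read_safetensors_keys("dummy.safetensors")
print(keys)
print("model.embed_tokens.weight" in keys)  # the tensor TGI expects
```

Running the same key listing against the real checkpoint files would confirm whether the name TGI expects is actually present.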

It's not a text generation model; please do not load it with a generation pipeline.

Yes, I'm aware. I've tried to deploy it as a Text Embedding model (not with a generation pipeline) on an Inference Endpoint.

Same here. I changed the instance from the example (ml.m5.2xlarge) to ml.g4dn.4xlarge and still had the issue. I found some potentially related documentation below, but I'm still stuck. Any progress on this?

https://docs.aws.amazon.com/en_kr/sagemaker/latest/dg/model-parallel-core-features-v2-tensor-parallelism.html

I have the same problem as my peers above. If the problem is indeed the generation pipeline, then that means there is a problem with the default script in the Hugging Face container when it loads embedding models. Has anyone managed to figure out how to make this work?
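For what it's worth, serving this as an embedding model ultimately means running the encoder and pooling the per-token hidden states, rather than generating text. Here is a minimal sketch of masked mean pooling, the usual pooling step, with made-up numbers for illustration (not this model's actual outputs):

```python
import numpy as np

def mean_pool(token_states, attention_mask):
    """Masked mean pooling: average token vectors, ignoring padding."""
    mask = attention_mask[..., None].astype(token_states.dtype)  # (B, T, 1)
    summed = (token_states * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts

# Two sequences of three tokens, hidden size 3; the second sequence is
# padded after two tokens, so its last token must not affect the average.
states = np.array([
    [[1., 1., 1.], [3., 3., 3.], [5., 5., 5.]],
    [[2., 0., 4.], [4., 0., 8.], [9., 9., 9.]],  # last row is padding
])
mask = np.array([[1, 1, 1], [1, 1, 0]])
print(mean_pool(states, mask))
# first row -> [3, 3, 3]; second row -> [3, 0, 6]
```

A custom inference handler would run the encoder to get token_states and the tokenizer's attention mask, then return the pooled vectors as the embedding.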
