Trying to deploy this using Inference Endpoints

#96
by Kolibri753 - opened

I run into this problem every time, with every model I try:

2023/10/28 20:59:08 ~ {"timestamp":"2023-10-28T17:59:08.113711Z","level":"INFO","fields":{"message":"Starting shard"},"target":"text_generation_launcher","span":{"rank":2,"name":"shard-manager"},"spans":[{"rank":2,"name":"shard-manager"}]}
2023/10/28 20:59:08 ~ {"timestamp":"2023-10-28T17:59:08.113716Z","level":"INFO","fields":{"message":"Starting shard"},"target":"text_generation_launcher","span":{"rank":3,"name":"shard-manager"},"spans":[{"rank":3,"name":"shard-manager"}]}
2023/10/28 20:59:08 ~ {"timestamp":"2023-10-28T17:59:08.113195Z","level":"INFO","fields":{"message":"Starting shard"},"target":"text_generation_launcher","span":{"rank":1,"name":"shard-manager"},"spans":[{"rank":1,"name":"shard-manager"}]}
2023/10/28 20:59:08 ~ {"timestamp":"2023-10-28T17:59:08.113179Z","level":"INFO","fields":{"message":"Starting shard"},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}
2023/10/28 20:59:08 ~ {"timestamp":"2023-10-28T17:59:08.112899Z","level":"INFO","fields":{"message":"Successfully downloaded weights."},"target":"text_generation_launcher","span":{"name":"download"},"spans":[{"name":"download"}]}
2023/10/28 20:59:11 ~ {"timestamp":"2023-10-28T17:59:11.060909Z","level":"WARN","fields":{"message":"Unable to use Flash Attention V2: GPU with CUDA capability 7 5 is not supported for Flash Attention V2\n"},"target":"text_generation_launcher"}
2023/10/28 20:59:11 ~ {"timestamp":"2023-10-28T17:59:11.079997Z","level":"WARN","fields":{"message":"Could not import Mistral model: Mistral model requires flash attn v2\n"},"target":"text_generation_launcher"}
2023/10/28 20:59:11 ~ {"timestamp":"2023-10-28T17:59:11.108899Z","level":"WARN","fields":{"message":"Unable to use Flash Attention V2: GPU with CUDA capability 7 5 is not supported for Flash Attention V2\n"},"target":"text_generation_launcher"}
2023/10/28 20:59:11 ~ {"timestamp":"2023-10-28T17:59:11.108903Z","level":"WARN","fields":{"message":"Unable to use Flash Attention V2: GPU with CUDA capability 7 5 is not supported for Flash Attention V2\n"},"target":"text_generation_launcher"}
2023/10/28 20:59:11 ~ {"timestamp":"2023-10-28T17:59:11.111889Z","level":"WARN","fields":{"message":"Unable to use Flash Attention V2: GPU with CUDA capability 7 5 is not supported for Flash Attention V2\n"},"target":"text_generation_launcher"}
2023/10/28 20:59:11 ~ {"timestamp":"2023-10-28T17:59:11.127853Z","level":"WARN","fields":{"message":"Could not import Mistral model: Mistral model requires flash attn v2\n"},"target":"text_generation_launcher"}
2023/10/28 20:59:11 ~ {"timestamp":"2023-10-28T17:59:11.127742Z","level":"WARN","fields":{"message":"Could not import Mistral model: Mistral model requires flash attn v2\n"},"target":"text_generation_launcher"}
2023/10/28 20:59:11 ~ {"timestamp":"2023-10-28T17:59:11.130757Z","level":"WARN","fields":{"message":"Could not import Mistral model: Mistral model requires flash attn v2\n"},"target":"text_generation_launcher"}
2023/10/28 20:59:14 ~ {"timestamp":"2023-10-28T17:59:14.197784Z","level":"ERROR","fields":{"message":"Error when initializing model\nTraceback (most recent call last):\n File "/opt/conda/bin/text-generation-server", line 8, in \n sys.exit(app())\n File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 311, in __call__\n return get_command(self)(*args, **kwargs)\n File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1157, in __call__\n return self.main(*args, **kwargs)\n File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 778, in main\n return _main(\n File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 216, in _main\n rv = self.invoke(ctx)\n File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1688, in invoke\n return _process_result(sub_ctx.command.invoke(sub_ctx))\n File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1434, in invoke\n return ctx.invoke(self.callback, **ctx.params)\n File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 783, in invoke\n return __callback(*args, **kwargs)\n File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 683, in wrapper\n return callback(**use_params) # type: ignore\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 83, in serve\n server.serve(\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 207, in serve\n asyncio.run(\n File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run\n return loop.run_until_complete(main)\n File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 634, in run_until_complete\n self.run_forever()\n File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 601, in run_forever\n self._run_once()\n File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 1905, in _run_once\n handle._run()\n File "/opt/conda/lib/python3.9/asyncio/events.py", line 80, in _run\n self._context.run(self._callback, *self._args)\n> File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 159, in serve_inner\n model = get_model(\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/init.py", line 224, in get_model\n return FlashRWSharded(\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_rw.py", line 67, in __init__\n model = FlashRWForCausalLM(config, weights)\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 625, in __init__\n self.transformer = FlashRWModel(config, weights)\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 567, in __init__\n [\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 568, in \n FlashRWLayer(layer_id, config, weights)\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 396, in __init__\n self.self_attention = FlashRWAttention(\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 141, in __init__\n raise ValueError(\nValueError: num_heads must be divisible by num_shards (got num_heads: 71 and num_shards: 4\n"},"target":"text_generation_launcher"}
2023/10/28 20:59:14 ~ {"timestamp":"2023-10-28T17:59:14.202380Z","level":"ERROR","fields":{"message":"Error when initializing model\nTraceback (most recent call last):\n File "/opt/conda/bin/text-generation-server", line 8, in \n sys.exit(app())\n File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 311, in __call__\n return get_command(self)(*args, **kwargs)\n File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1157, in __call__\n return self.main(*args, **kwargs)\n File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 778, in main\n return _main(\n File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 216, in _main\n rv = self.invoke(ctx)\n File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1688, in invoke\n return _process_result(sub_ctx.command.invoke(sub_ctx))\n File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1434, in invoke\n return ctx.invoke(self.callback, **ctx.params)\n File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 783, in invoke\n return __callback(*args, **kwargs)\n File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 683, in wrapper\n return callback(**use_params) # type: ignore\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 83, in serve\n server.serve(\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 207, in serve\n asyncio.run(\n File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run\n return loop.run_until_complete(main)\n File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 634, in run_until_complete\n self.run_forever()\n File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 601, in run_forever\n self._run_once()\n File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 1905, in _run_once\n handle._run()\n File "/opt/conda/lib/python3.9/asyncio/events.py", line 80, in _run\n self._context.run(self._callback, *self._args)\n> File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 159, in serve_inner\n model = get_model(\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/init.py", line 224, in get_model\n return FlashRWSharded(\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_rw.py", line 67, in __init__\n model = FlashRWForCausalLM(config, weights)\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 625, in __init__\n self.transformer = FlashRWModel(config, weights)\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 567, in __init__\n [\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 568, in \n FlashRWLayer(layer_id, config, weights)\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 396, in __init__\n self.self_attention = FlashRWAttention(\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 141, in __init__\n raise ValueError(\nValueError: num_heads must be divisible by num_shards (got num_heads: 71 and num_shards: 4\n"},"target":"text_generation_launcher"}
2023/10/28 20:59:14 ~ {"timestamp":"2023-10-28T17:59:14.207864Z","level":"ERROR","fields":{"message":"Error when initializing model\nTraceback (most recent call last):\n File "/opt/conda/bin/text-generation-server", line 8, in \n sys.exit(app())\n File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 311, in __call__\n return get_command(self)(*args, **kwargs)\n File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1157, in __call__\n return self.main(*args, **kwargs)\n File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 778, in main\n return _main(\n File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 216, in _main\n rv = self.invoke(ctx)\n File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1688, in invoke\n return _process_result(sub_ctx.command.invoke(sub_ctx))\n File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1434, in invoke\n return ctx.invoke(self.callback, **ctx.params)\n File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 783, in invoke\n return __callback(*args, **kwargs)\n File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 683, in wrapper\n return callback(**use_params) # type: ignore\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 83, in serve\n server.serve(\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 207, in serve\n asyncio.run(\n File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run\n return loop.run_until_complete(main)\n File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 634, in run_until_complete\n self.run_forever()\n File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 601, in run_forever\n self._run_once()\n File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 1905, in _run_once\n handle._run()\n File "/opt/conda/lib/python3.9/asyncio/events.py", line 80, in _run\n self._context.run(self._callback, *self._args)\n> File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 159, in serve_inner\n model = get_model(\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/init.py", line 224, in get_model\n return FlashRWSharded(\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_rw.py", line 67, in __init__\n model = FlashRWForCausalLM(config, weights)\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 625, in __init__\n self.transformer = FlashRWModel(config, weights)\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 567, in __init__\n [\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 568, in \n FlashRWLayer(layer_id, config, weights)\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 396, in __init__\n self.self_attention = FlashRWAttention(\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 141, in __init__\n raise ValueError(\nValueError: num_heads must be divisible by num_shards (got num_heads: 71 and num_shards: 4\n"},"target":"text_generation_launcher"}
2023/10/28 20:59:14 ~ {"timestamp":"2023-10-28T17:59:14.218637Z","level":"ERROR","fields":{"message":"Error when initializing model\nTraceback (most recent call last):\n File "/opt/conda/bin/text-generation-server", line 8, in \n sys.exit(app())\n File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 311, in __call__\n return get_command(self)(*args, **kwargs)\n File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1157, in __call__\n return self.main(*args, **kwargs)\n File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 778, in main\n return _main(\n File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 216, in _main\n rv = self.invoke(ctx)\n File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1688, in invoke\n return _process_result(sub_ctx.command.invoke(sub_ctx))\n File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1434, in invoke\n return ctx.invoke(self.callback, **ctx.params)\n File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 783, in invoke\n return __callback(*args, **kwargs)\n File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 683, in wrapper\n return callback(**use_params) # type: ignore\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 83, in serve\n server.serve(\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 207, in serve\n asyncio.run(\n File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run\n return loop.run_until_complete(main)\n File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 634, in run_until_complete\n self.run_forever()\n File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 601, in run_forever\n self._run_once()\n File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 1905, in _run_once\n handle._run()\n File "/opt/conda/lib/python3.9/asyncio/events.py", line 80, in _run\n self._context.run(self._callback, *self._args)\n> File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 159, in serve_inner\n model = get_model(\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/init.py", line 224, in get_model\n return FlashRWSharded(\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_rw.py", line 67, in __init__\n model = FlashRWForCausalLM(config, weights)\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 625, in __init__\n self.transformer = FlashRWModel(config, weights)\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 567, in __init__\n [\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 568, in \n FlashRWLayer(layer_id, config, weights)\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 396, in __init__\n self.self_attention = FlashRWAttention(\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 141, in __init__\n raise ValueError(\nValueError: num_heads must be divisible by num_shards (got num_heads: 71 and num_shards: 4\n"},"target":"text_generation_launcher"}
2023/10/28 20:59:14 ~ {"timestamp":"2023-10-28T17:59:14.920730Z","level":"ERROR","fields":{"message":"Shard complete standard error output:\n\nYou are using a model of type falcon to instantiate a model of type . This is not supported for all configurations of models and can yield errors.\nTraceback (most recent call last):\n\n File "/opt/conda/bin/text-generation-server", line 8, in \n sys.exit(app())\n\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 83, in serve\n server.serve(\n\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 207, in serve\n asyncio.run(\n\n File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run\n return loop.run_until_complete(main)\n\n File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete\n return future.result()\n\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 159, in serve_inner\n model = get_model(\n\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/init.py", line 224, in get_model\n return FlashRWSharded(\n\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_rw.py", line 67, in __init__\n model = FlashRWForCausalLM(config, weights)\n\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 625, in __init__\n self.transformer = FlashRWModel(config, weights)\n\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 567, in __init__\n [\n\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 568, in \n FlashRWLayer(layer_id, config, weights)\n\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 396, in __init__\n self.self_attention = FlashRWAttention(\n\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 141, in __init__\n raise ValueError(\n\nValueError: num_heads must be divisible by num_shards (got num_heads: 71 and num_shards: 4\n"},"target":"text_generation_launcher","span":{"rank":1,"name":"shard-manager"},"spans":[{"rank":1,"name":"shard-manager"}]}
2023/10/28 20:59:15 ~ {"timestamp":"2023-10-28T17:59:15.020784Z","level":"ERROR","fields":{"message":"Shard complete standard error output:\n\nYou are using a model of type falcon to instantiate a model of type . This is not supported for all configurations of models and can yield errors.\nTraceback (most recent call last):\n\n File "/opt/conda/bin/text-generation-server", line 8, in \n sys.exit(app())\n\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 83, in serve\n server.serve(\n\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 207, in serve\n asyncio.run(\n\n File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run\n return loop.run_until_complete(main)\n\n File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete\n return future.result()\n\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 159, in serve_inner\n model = get_model(\n\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/init.py", line 224, in get_model\n return FlashRWSharded(\n\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_rw.py", line 67, in __init__\n model = FlashRWForCausalLM(config, weights)\n\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 625, in __init__\n self.transformer = FlashRWModel(config, weights)\n\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 567, in __init__\n [\n\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 568, in \n FlashRWLayer(layer_id, config, weights)\n\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 396, in __init__\n self.self_attention = FlashRWAttention(\n\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 141, in __init__\n raise ValueError(\n\nValueError: num_heads must be divisible by num_shards (got num_heads: 71 and num_shards: 4\n"},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}
2023/10/28 20:59:15 ~ {"timestamp":"2023-10-28T17:59:15.020317Z","level":"INFO","fields":{"message":"Shutting down shards"},"target":"text_generation_launcher"}
2023/10/28 20:59:15 ~ {"timestamp":"2023-10-28T17:59:15.020291Z","level":"ERROR","fields":{"message":"Shard 1 failed to start"},"target":"text_generation_launcher"}
2023/10/28 20:59:15 ~ {"timestamp":"2023-10-28T17:59:15.030818Z","level":"ERROR","fields":{"message":"Shard complete standard error output:\n\nYou are using a model of type falcon to instantiate a model of type . This is not supported for all configurations of models and can yield errors.\nTraceback (most recent call last):\n\n File "/opt/conda/bin/text-generation-server", line 8, in \n sys.exit(app())\n\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 83, in serve\n server.serve(\n\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 207, in serve\n asyncio.run(\n\n File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run\n return loop.run_until_complete(main)\n\n File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete\n return future.result()\n\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 159, in serve_inner\n model = get_model(\n\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/init.py", line 224, in get_model\n return FlashRWSharded(\n\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_rw.py", line 67, in __init__\n model = FlashRWForCausalLM(config, weights)\n\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 625, in __init__\n self.transformer = FlashRWModel(config, weights)\n\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 567, in __init__\n [\n\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 568, in \n FlashRWLayer(layer_id, config, weights)\n\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 396, in __init__\n self.self_attention = FlashRWAttention(\n\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 141, in __init__\n raise ValueError(\n\nValueError: num_heads must be divisible by num_shards (got num_heads: 71 and num_shards: 4\n"},"target":"text_generation_launcher","span":{"rank":2,"name":"shard-manager"},"spans":[{"rank":2,"name":"shard-manager"}]}
2023/10/28 20:59:15 ~ {"timestamp":"2023-10-28T17:59:15.063866Z","level":"INFO","fields":{"message":"Shard terminated"},"target":"text_generation_launcher","span":{"rank":3,"name":"shard-manager"},"spans":[{"rank":3,"name":"shard-manager"}]}
2023/10/28 20:59:15 ~ Error: ShardCannotStart
2023/10/28 21:00:02 ~ {"timestamp":"2023-10-28T18:00:02.927884Z","level":"INFO","fields":{"message":"Sharding model on 4 processes"},"target":"text_generation_launcher"}
2023/10/28 21:00:02 ~ {"timestamp":"2023-10-28T18:00:02.927835Z","level":"INFO","fields":{"message":"Args { model_id: "/repository", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: Some(Bitsandbytes), dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_length: 1024, max_total_tokens: 1512, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 2048, max_batch_total_tokens: None, max_waiting_tokens: 20, hostname: "olibri753a2ac-aws-falcon-7b-instruct-0923-5c5c58bd75-rqd6t", port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: true, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: false }"},"target":"text_generation_launcher"}
2023/10/28 21:00:02 ~ {"timestamp":"2023-10-28T18:00:02.928003Z","level":"INFO","fields":{"message":"Starting download process."},"target":"text_generation_launcher","span":{"name":"download"},"spans":[{"name":"download"}]}

I am also getting these errors in GKE on GCP. I have tried tags 1.1.0 and 1.0.3, and the webserver keeps shutting down.

GPU: Tesla T4
Trying a downloaded llama-2-7b-chat-hf model
Tag: 1.0.3
--quantize bitsandbytes

1.0.3 error:
Unable to use Flash Attention V2: GPU with CUDA capability 7 5 is not supported for Flash Attention V2

1.1.0 error:
Could not import Mistral model: Mistral model requires flash attn v2
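
Both messages come from the same hardware limit: Flash Attention V2 requires CUDA compute capability 8.0 or higher (Ampere or newer, e.g. A10G/A100), and a Tesla T4 reports capability 7.5, so TGI 1.1.0 cannot load its Mistral implementation on that GPU. For a Llama-2 model the Flash Attention V2 line is only a warning, so if the webserver still shuts down, the fatal error should appear further down in the log. A quick way to check what a GPU reports, as a sketch assuming PyTorch is installed:

```python
import torch

# Print each visible GPU's CUDA compute capability. Flash Attention V2
# needs (8, 0) or higher; a Tesla T4 reports (7, 5).
for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    name = torch.cuda.get_device_name(i)
    print(f"{name}: capability {major}.{minor} -> "
          f"Flash Attention V2 supported: {(major, minor) >= (8, 0)}")
```

For Mistral specifically, the options on this stack are to move to a capability-8.0+ GPU, or to serve the model outside TGI with an implementation that does not depend on flash attn v2.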
