OpenNMT/Mistral-7B-v0.2-instruct-onmt-awq-gemv

This is the OpenNMT-py converted version of Mistral 7B Instruct v0.2, 4-bit AWQ quantized.

The safetensors file is 4.2 GB, so the model runs smoothly on any RTX card.

The command line to run it is:

python onmt/bin/translate.py --config /pathto/mistral-instruct-inference-awq.yaml --src /pathto/input-vicuna.txt --output /pathto/mistral-output.txt

For instance, input-vicuna.txt contains:

USER:｟newline｠Show me some attractions in Boston.｟newline｠｟newline｠ASSISTANT:｟newline｠
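
Each prompt has to sit on a single line of the source file, with line breaks written as the ｟newline｠ placeholder. A minimal sketch of how such a line could be built is shown here; the helper name `to_onmt_line` is hypothetical and not part of OpenNMT-py, and the USER/ASSISTANT layout is simply copied from the example above.

```python
# Placeholder used in the example above to stand in for a real line break.
NEWLINE = "｟newline｠"


def to_onmt_line(user_message: str) -> str:
    """Build a one-line prompt in the USER/ASSISTANT layout shown above.

    Hypothetical helper: it only mirrors the format of the example line.
    """
    prompt = f"USER:\n{user_message}\n\nASSISTANT:\n"
    # Every real newline becomes the placeholder so the prompt stays on one line.
    return prompt.replace("\n", NEWLINE)


line = to_onmt_line("Show me some attractions in Boston.")
print(line)
```

Writing one such line per prompt to input-vicuna.txt (UTF-8 encoded) gives a file with the same shape as the example.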

Output will be:

Boston is a great city with many attractions to visit. Here are some popular ones:｟newline｠｟newline｠1. The Freedom Trail - This is a 2.5-mile-long path through downtown Boston that passes by 16 historically significant sites related to the American Revolution.｟newline｠2. The Massachusetts State House - The iconic red brick building that serves as the seat of the Massachusetts government and is home to the Massachusetts Legislature and the Governor.｟newline｠3. The Boston Tea Party Ships and Museum - This museum tells the story of the Boston Tea Party and the events leading up to the American Revolution through interactive exhibits and live reenactments.｟newline｠4. The Paul Revere House - The oldest house in the United States, it was home to the famous silversmith and patriot Paul Revere.｟newline｠5. The USS Constitution - A historic warship that played a key role in the American Revolution, the USS Constitution is now a museum and a popular tourist attraction.｟newline｠6. The Bunker Hill Memorial Park - A beautiful park located on the site of the first military engagement of the American Revolution, the Battle of Bunker Hill.｟newline｠7. The Museum of Fine Arts, Boston - One of the largest
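
The output file uses the same ｟newline｠ placeholder, so it is easier to read once the placeholders are turned back into real line breaks. The sketch below assumes one prediction per line; `from_onmt_line` is a hypothetical helper name, and the sample string is a shortened copy of the output above.

```python
# Placeholder used in the model output to stand in for a real line break.
NEWLINE = "｟newline｠"


def from_onmt_line(line: str) -> str:
    """Turn one line of the output file back into multi-line text."""
    # Drop the trailing line terminator, then restore the real newlines.
    return line.rstrip("\n").replace(NEWLINE, "\n")


sample = (
    "Boston is a great city with many attractions to visit. "
    "Here are some popular ones:｟newline｠｟newline｠1. The Freedom Trail"
)
text = from_onmt_line(sample)
print(text)
```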

If you run with a batch size of 60, you can get good throughput even with GEMV:

[2023-12-27 14:54:47,967 INFO] Loading checkpoint from /mnt/InternalCrucial4/dataAI/mistral-7B/mistral-instruct-v0.2/mistral-instruct-v0.2-onmt-awq-gemv.pt
[2023-12-27 14:54:48,063 INFO] awq_gemv compression of layer ['w_1', 'w_2', 'w_3', 'linear_values', 'linear_query', 'linear_keys', 'final_linear']
[2023-12-27 14:54:52,059 INFO] Loading data into the model
step0 time:  1.2714881896972656
[2023-12-27 14:54:59,180 INFO] PRED SCORE: -0.2316, PRED PPL: 1.26 NB SENTENCES: 59
[2023-12-27 14:54:59,180 INFO] Total translation time (s): 6.1
[2023-12-27 14:54:59,180 INFO] Average translation time (ms): 103.5
[2023-12-27 14:54:59,180 INFO] Tokens per second: 2183.8
Time w/o python interpreter load/terminate:  11.222625255584717
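
As a rough sanity check, the figures in this log can be related to each other with a few lines of arithmetic. The sketch below copies the numbers from the log by hand; it assumes the reported perplexity is the exponential of the negated average score and that throughput is computed over the total translation time. Both assumptions appear consistent with the values printed above, up to rounding.

```python
import math

# Values copied by hand from the log above.
pred_score = -0.2316        # PRED SCORE
nb_sentences = 59           # NB SENTENCES
total_time_s = 6.1          # Total translation time (s)
tokens_per_second = 2183.8  # Tokens per second

# Assumption: perplexity is exp(-score).
ppl = math.exp(-pred_score)

# Average time per sentence, in milliseconds.
avg_ms = 1000 * total_time_s / nb_sentences

# Approximate number of generated tokens, overall and per sentence.
approx_tokens = tokens_per_second * total_time_s
tokens_per_sentence = approx_tokens / nb_sentences

print(f"perplexity          ~ {ppl:.2f}")
print(f"avg time (ms)       ~ {avg_ms:.1f}")
print(f"total tokens        ~ {approx_tokens:.0f}")
print(f"tokens per sentence ~ {tokens_per_sentence:.0f}")
```

The small gap between the recomputed average time and the logged 103.5 ms comes from the total time being rounded to 6.1 s in the log.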