Awww yes!

#2 opened by BoscoTheDog

So much yes! Finally 128K context in the browser!

BoscoTheDog changed discussion title from Yes yes yes yes yes! to Awww yes!

It's possible to overdo it on the "yes".

Unsupported model type: phi3

I guess I need to be patient a little bit longer.

Hey! Indeed, we're still working on this, and we'll make an announcement once it's working 100%! To answer your questions:
1) Yes, this will be part of v3 w/ WebGPU acceleration
2) The model is split into two parts: ~830 MB + ~1.45 GB. Both need to be below 2 GB to be cacheable.
3) We're relying on MSFT's official ONNX export/implementation, which simplifies a lot for us! :)

Super cool on all fronts. Thanks for explaining!

If I can help test, let me know.

I couldn't resist and tried to get it running with the latest version of v3, but only got 404s no matter which dtype I tried.

What will be the correct incantation? Is this close?

const generator = await pipeline('text-generation', 'Xenova/Phi-3-mini-128k-instruct', {
    dtype: 'q4', // fp32, fp16, q8, int8, uint8, q4, bnb4
    progress_callback: (data) => {
        if (data.status !== 'progress') return;
        setLoadProgress(data);
    },
});

You need to use this revision: https://huggingface.co/Xenova/Phi-3-mini-128k-instruct/discussions/3

By setting revision: 'refs/pr/3'.

You also need to set use_external_data_format: true, which has been introduced by the latest commits.
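
For example, your snippet above would then look roughly like this (an untested sketch; the exact option handling may still change on the v3 dev branch):

const generator = await pipeline('text-generation', 'Xenova/Phi-3-mini-128k-instruct', {
    dtype: 'q4', // fp32, fp16, q8, int8, uint8, q4, bnb4
    revision: 'refs/pr/3', // pick up the ONNX files from the PR linked above
    use_external_data_format: true, // the weights are split into external data files
    progress_callback: (data) => {
        if (data.status !== 'progress') return;
        setLoadProgress(data);
    },
});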

I can share a more complete example in a few hours, but I'm still getting erroneous output. Hopefully I'll get it working soon :)

Here's some HIGHLY EXPERIMENTAL WORK IN PROGRESS code:

import { env, AutoModelForCausalLM, AutoTokenizer } from '@xenova/transformers';

// disable proxying for now (much slower)
env.backends.onnx.wasm.proxy = false;

const model_id = 'Xenova/Phi-3-mini-128k-instruct';
const tokenizer = await AutoTokenizer.from_pretrained(model_id, {
    legacy: true, // TODO: update config
});

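// Phi-3's instruct chat format: a <|user|> turn closed with <|end|>, followed by the
// <|assistant|> tag, after which the model's reply is generated.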
const prompt = `<|user|>
Tell me a joke<|end|>
<|assistant|>
`;

const inputs = tokenizer(prompt);

const model = await AutoModelForCausalLM.from_pretrained(model_id, {
    dtype: 'q4', // 4-bit quantized weights
    // device: 'webgpu', // NOTE: webgpu produces incorrect results
    use_external_data_format: true, // weights are split across external data files (each kept below 2 GB)
    revision: 'refs/pr/3', // ONNX export from the PR linked above
});

// { // warm up
//     const outputs = await model.generate({ ...inputs, max_new_tokens: 1 });
// }
{ // run + time execution
    const start = performance.now();
    const outputs = await model.generate({ ...inputs, max_new_tokens: 5 }); // TODO: increase max new tokens
    const end = performance.now();
    console.log(tokenizer.batch_decode(outputs, { skip_special_tokens: false }));
    console.log('Execution Time:', end - start);
}

NOTE: to get it working, you need to use the latest commit of transformers.js v3.

And just for now, you also need to replace this in src/models.js:

- const dtype = 'float32';
- const empty = [];
+ const dtype = 'float16';
+ const empty = new Uint16Array();

(Will not be necessary once we update the model).


WASM produces correct results, while WebGPU does not. Will continue to investigate.

Figured out the problem: the latest version of ONNXRuntime-web hadn't yet been published to NPM.

Here's a demo of phi-3-mini-128k-instruct running at ~20 tokens per second on an RTX 2080:

Yup, this model is a quanti... run on phone.

I trained this model:
NickyNicky/Phi-3-mini-4k-instruct_orpo_V2: https://huggingface.co/NickyNicky/Phi-3-mini-4k-instruct_orpo_V2

Then I quantized it with ONNX, but it gave me 10 GB. How did you compress it so much?

The latest versions of ONNXRuntime support two forms of 4-bit quantization (for certain weights): block-wise quantization (q4) and bitsandbytes-style 4-bit quantization (bnb4), matching the q4 and bnb4 dtypes listed above.

You should also be able to quantize the other weights to fp16 or q8.
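
On the transformers.js side, you then pick which of these variants to fetch via the dtype option. A minimal sketch (the repo id here is just a placeholder for your own ONNX export):

import { AutoModelForCausalLM } from '@xenova/transformers';

// Placeholder repo id; substitute your own ONNX export.
const model = await AutoModelForCausalLM.from_pretrained('your-username/your-onnx-model', {
    dtype: 'q4', // or 'bnb4', 'q8', 'fp16', ... depending on which variants you exported
    use_external_data_format: true, // needed if the weights are stored as external data files
});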

Hope that helps!

Awesome!

Is this still needed?

- const dtype = 'float32';
- const empty = [];
+ const dtype = 'float16';
+ const empty = new Uint16Array();

// yep :-)

Whoop!

<s><|user|> Why is the sky blue?<|end|><|assistant|> The sky appears blue to

@Xenova How did you split it into 2 pieces below 2 GB?

Would you say the 128K context version is now ready for implementation? Or are there still workarounds needed?

When can I use this model on the web?
I still get "Unsupported model type: phi3".

@webjjin You need to install transformers.js v3 from the dev branch:

npm install xenova/transformers.js#v3

See here for example code: https://github.com/xenova/transformers.js/blob/e32d4ebb6fe715e6634335123c07a96d0dc62ac8/examples/webgpu-chat/src/worker.js
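
If you just want something minimal to start from, here's a rough sketch based on the options used earlier in this thread (untested; the v3 API is still in flux, so details may differ by the time it's released):

import { pipeline } from '@xenova/transformers';

// Load Phi-3 with 4-bit weights on WebGPU.
const generator = await pipeline('text-generation', 'Xenova/Phi-3-mini-128k-instruct', {
    dtype: 'q4',
    device: 'webgpu',
    use_external_data_format: true,
});

// Phi-3 instruct prompt format.
const prompt = `<|user|>
Why is the sky blue?<|end|>
<|assistant|>
`;

const output = await generator(prompt, { max_new_tokens: 128 });
console.log(output);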

I remember Xenova saying he had early access to those files. Perhaps you can download them via/from the demo?

Thank you for the reply, but I'd better wait for the stable version of transformers.js#v3.
