Awww yes!

#2 opened by BoscoTheDog

So much yes! Finally 128K context in the browser!

BoscoTheDog changed discussion title from Yes yes yes yes yes! to Awww yes!

It's possible to overdo it on the "yes".

Unsupported model type: phi3

I guess I need to be patient a little bit longer.

Hey! Indeed, we're still working on this, and we'll make an announcement once it's working 100%! To answer your questions:
1) Yes, this will be part of v3 w/ WebGPU acceleration
2) The model is split into two parts: ~830 MB + ~1.45 GB. Both need to be below 2 GB to be cacheable.
3) We're relying on MSFT's official ONNX export/implementation, which simplifies a lot for us! :)

Super cool on all fronts. Thanks for explaining!

If I can help test, let me know.

I couldn't resist and tried to get it running with the latest version of v3, but only got 404s no matter which dtype I tried.

What will be the correct incantation? Is this close?

const generator = await pipeline('text-generation', 'Xenova/Phi-3-mini-128k-instruct', {
    dtype: 'q4', // fp32, fp16, q8, int8, uint8, q4, bnb4
    progress_callback: (data) => {
        if (data.status !== 'progress') return;
        setLoadProgress(data);
    },
});

You need to use this revision: https://huggingface.co/Xenova/Phi-3-mini-128k-instruct/discussions/3

By setting revision: 'refs/pr/3'.

You also need to set use_external_data_format: true, which has been introduced by the latest commits.
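
For example, your snippet above would then look roughly like this (an untested sketch; the exact option handling may still change on the v3 dev branch):

const generator = await pipeline('text-generation', 'Xenova/Phi-3-mini-128k-instruct', {
    dtype: 'q4', // fp32, fp16, q8, int8, uint8, q4, bnb4
    revision: 'refs/pr/3', // pick up the ONNX files from the PR linked above
    use_external_data_format: true, // the weights are split into external data files
    progress_callback: (data) => {
        if (data.status !== 'progress') return;
        setLoadProgress(data);
    },
});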

I can share a more complete example in a few hours, but I'm still getting erroneous output. Hopefully I'll get it working soon :)

Here's some HIGHLY EXPERIMENTAL WORK IN PROGRESS code:

import { env, AutoModelForCausalLM, AutoTokenizer } from '@xenova/transformers';

// disable proxying for now (much slower)
env.backends.onnx.wasm.proxy = false;

const model_id = 'Xenova/Phi-3-mini-128k-instruct';
const tokenizer = await AutoTokenizer.from_pretrained(model_id, {
    legacy: true, // TODO: update config
});

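// Phi-3's instruct chat format: a <|user|> turn closed with <|end|>, followed by the
// <|assistant|> tag, after which the model's reply is generated.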
const prompt = `<|user|>
Tell me a joke<|end|>
<|assistant|>
`;

const inputs = tokenizer(prompt);

const model = await AutoModelForCausalLM.from_pretrained(model_id, {
    dtype: 'q4', // 4-bit quantized weights
    // device: 'webgpu', // NOTE: webgpu produces incorrect results
    use_external_data_format: true, // weights are split across external data files (each kept below 2 GB)
    revision: 'refs/pr/3', // ONNX export from the PR linked above
});

// { // warm up
//     const outputs = await model.generate({ ...inputs, max_new_tokens: 1 });
// }
{ // run + time execution
    const start = performance.now();
    const outputs = await model.generate({ ...inputs, max_new_tokens: 5 }); // TODO: increase max new tokens
    const end = performance.now();
    console.log(tokenizer.batch_decode(outputs, { skip_special_tokens: false }));
    console.log('Execution Time:', end - start);
}

NOTE: to get it working, you need to use the latest commit of transformers.js v3.

And just for now, you also need to replace this in src/models.js:

- const dtype = 'float32';
- const empty = [];
+ const dtype = 'float16';
+ const empty = new Uint16Array();

(Will not be necessary once we update the model).


WASM produces correct results, while WebGPU does not. Will continue to investigate.

Figured out the problem: the latest version of ONNXRuntime-web hadn't yet been published to NPM.

Here's a demo of phi-3-mini-128k-instruct running at ~20 tokens per second on an RTX 2080:

Yup, this model is a quanti... run on phone.

I trained this model:
NickyNicky/Phi-3-mini-4k-instruct_orpo_V2: https://huggingface.co/NickyNicky/Phi-3-mini-4k-instruct_orpo_V2

Then I quantized it with ONNX, but it gave me 10 GB. How did you compress it so much?

The latest versions of ONNXRuntime support two forms of 4-bit quantization (for certain weights): block-wise quantization (q4) and bitsandbytes-style 4-bit quantization (bnb4), matching the q4 and bnb4 dtypes listed above.

You should also be able to quantize the other weights to fp16 or q8.
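
On the transformers.js side, you then pick which of these variants to fetch via the dtype option. A minimal sketch (the repo id here is just a placeholder for your own ONNX export):

import { AutoModelForCausalLM } from '@xenova/transformers';

// Placeholder repo id; substitute your own ONNX export.
const model = await AutoModelForCausalLM.from_pretrained('your-username/your-onnx-model', {
    dtype: 'q4', // or 'bnb4', 'q8', 'fp16', ... depending on which variants you exported
    use_external_data_format: true, // needed if the weights are stored as external data files
});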

Hope that helps!

Awesome!

Is this still needed?

- const dtype = 'float32';
- const empty = [];
+ const dtype = 'float16';
+ const empty = new Uint16Array();

// yep :-)

Whoop!

<s><|user|> Why is the sky blue?<|end|><|assistant|> The sky appears blue to

@Xenova How did you split it into 2 pieces below 2 GB?

Would you say the 128K context version is now ready for implementation? Or are there still workarounds needed?

When can I use this model on the web?
I still get "Unsupported model type: phi3".

@webjjin You need to install transformers.js v3 from the dev branch:

npm install xenova/transformers.js#v3

See here for example code: https://github.com/xenova/transformers.js/blob/e32d4ebb6fe715e6634335123c07a96d0dc62ac8/examples/webgpu-chat/src/worker.js
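
If you just want something minimal to start from, here's a rough sketch based on the options used earlier in this thread (untested; the v3 API is still in flux, so details may differ by the time it's released):

import { pipeline } from '@xenova/transformers';

// Load Phi-3 with 4-bit weights on WebGPU.
const generator = await pipeline('text-generation', 'Xenova/Phi-3-mini-128k-instruct', {
    dtype: 'q4',
    device: 'webgpu',
    use_external_data_format: true,
});

// Phi-3 instruct prompt format.
const prompt = `<|user|>
Why is the sky blue?<|end|>
<|assistant|>
`;

const output = await generator(prompt, { max_new_tokens: 128 });
console.log(output);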

I remember Xenova saying he had early access to those files. Perhaps you can download them via/from the demo?

Thank you for the reply, but I'd better wait for the stable version of transformers.js#v3.
