Taking too much time on 4 GB RAM

#9
by rajeev148711 - opened

Hi, when I use this, converting audio to text takes too much time on a system with 4 GB of RAM: around 4 to 5 minutes per conversion. Please provide the recommended system configuration for this.

rajeev148711 changed discussion status to closed
rajeev148711 changed discussion status to open

Hi there. Could you perhaps give more information about your system? Which OS, browser, CPU, etc. are you using? Also, how long is the audio file you are trying to transcribe?

OS: Windows 7 Home Basic
Browser: Google Chrome 109
CPU: Intel Core i3-3110M CPU @ 2.40GHz
System type: 64-bit
RAM: 4 GB
Audio length: 60 seconds

Considering that you’re using a very old computer, I don’t think it’s a problem with the app/library.

You could also try updating the settings to use the quantized version of the model. By default, the tiny unquantized model is used, which is ~160MB.

But the conversion accuracy is not as good as with the small model.

The small model is ~250 million parameters, which would explain the slow execution time on your system.

Unfortunately, I’d say you’ll need to make some compromises between speed and accuracy on your hardware.

You could try the “base” model, which is ~75 million parameters.
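As a rough back-of-the-envelope check (a sketch, not measured numbers): the memory needed just for the weights is about parameter count × bytes per parameter. The ~39M figure for tiny is the published Whisper parameter count; base and small use the approximations quoted in this thread. Treating quantized weights as 1 byte (8-bit) each is an assumption that ignores quantization scale factors, activations and runtime overhead.

```python
# Approximate parameter counts: tiny is the published Whisper figure,
# base and small are the approximations quoted in this thread.
PARAMS = {"tiny": 39e6, "base": 75e6, "small": 250e6}

# Assumption: fp32 = 4 bytes per weight, 8-bit quantized = 1 byte per weight
BYTES_PER_PARAM = {"fp32": 4, "int8": 1}


def weight_size_mb(params: float, dtype: str) -> float:
    """Rough weight-only size in megabytes (1 MB = 1e6 bytes)."""
    return params * BYTES_PER_PARAM[dtype] / 1e6


if __name__ == "__main__":
    for name, p in PARAMS.items():
        print(f"{name}: ~{weight_size_mb(p, 'fp32'):.0f} MB fp32, "
              f"~{weight_size_mb(p, 'int8'):.0f} MB int8")
```

Tiny comes out at ~156 MB in fp32, which lines up with the ~160MB figure above. If per-step compute is assumed to scale roughly with parameter count, small (~250M) does about 6x the work of tiny (~39M), which is consistent with the slowdown reported on a dual-core laptop CPU.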

Also, in some cases, the unquantized versions of models are faster, so you can play around with the model settings.

Kindly tell me how the accuracy of the unquantized base model compares to that of the quantized small model.

Hi, as you suggested, I am now using the unquantized base model. One issue I face with this model is that a word or number gets repeated many times. Kindly check the text below against the given audio file.

I can see a table in front of me which shows the statistics of victims in Africa before the introduction of their respective anti-roads. So in years I can see 19,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000 arthritis, syphilis and the maximum numbers is for diaphragm in 1960 which goes up to 500 and the minimum values are of AIDS in 1900. It is showing the victims in Africa.

One issue I face with this model is that a word or number gets repeated many times

If you use the unquantized model, you should get exactly the same output as the Python version (HF transformers), so you will see the same artifacts there too.

Unfortunately, this is a limitation of the model itself. There are some ways to try to fix this (with logits processors), but those are not available in this user interface.
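For context, a logits processor is a step that edits the model's next-token scores before a token is picked. Below is a minimal sketch of one such processor, a repetition penalty in the style of the `repetition_penalty` option in HF transformers: scores of tokens that were already generated are divided by the penalty if positive and multiplied by it if negative. It works on plain Python lists instead of tensors, and the function name is hypothetical; this is not code from the app.

```python
def apply_repetition_penalty(logits: list[float], generated_ids: list[int],
                             penalty: float) -> list[float]:
    """Return a copy of `logits` where already-generated tokens score lower."""
    if penalty <= 0:
        raise ValueError("penalty must be positive")
    out = list(logits)
    for tok in set(generated_ids):
        score = out[tok]
        # With penalty > 1, dividing a positive score or multiplying a
        # negative one both push the token's score down.
        out[tok] = score * penalty if score < 0 else score / penalty
    return out


# Hypothetical 3-token vocabulary; tokens 0 and 1 were already generated
print(apply_repetition_penalty([2.0, -1.0, 0.5], [0, 1, 0], penalty=2.0))
# → [1.0, -2.0, 0.5]
```

A penalty of 1.0 leaves the scores unchanged; values above 1.0 make repeats progressively less likely.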

The repetition pattern above does not occur in Python Whisper. Please improve it, or add some text-manipulation logic for the base model; it would add a lot of quality to all the hard work you've put into this marathon project. Also, one-word or one-sentence audio, such as short voice commands, is not transcribed accurately.
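If a post-processing step on the transcript is acceptable, one crude option is to collapse any short chunk of text that repeats back-to-back many times. The sketch below is hypothetical (it is not part of the app or of transformers.js), and its thresholds (units of up to 20 characters, 5 or more consecutive copies) are arbitrary choices. It is also lossy: a legitimate run such as "....." would be collapsed as well.

```python
import re

# Hypothetical thresholds: a unit of 1-20 characters followed by 4+ more
# identical copies (5+ in a row) is reduced to a single copy.
_REPEAT_RE = re.compile(r"(.{1,20}?)\1{4,}", re.DOTALL)


def collapse_repeats(text: str) -> str:
    """Replace long back-to-back runs of the same short unit with one copy."""
    return _REPEAT_RE.sub(r"\1", text)


# A run like "19,000,000,...,000 arthritis" shrinks to "19,000 arthritis"
print(collapse_repeats("I can see 19" + ",000" * 50 + " arthritis"))
```

This only hides the symptom; the model still spent time generating the repeated tokens, which the generation-time options discussed in this thread avoid.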

You can also use some of the generation parameters (like no_repeat_ngram_size or repetition_penalty), but these are not available in the user interface at the moment. See the docs for the full list of parameters.
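To illustrate what `no_repeat_ngram_size` does, here is a minimal sketch of the idea in plain Python (the function names and token ids are hypothetical, and this is not the transformers.js source): once an n-gram of tokens has appeared, any token that would complete that same n-gram again is banned by setting its score to -inf.

```python
import math
from collections import defaultdict


def banned_next_tokens(generated_ids: list[int], n: int) -> set[int]:
    """Tokens that would complete an n-gram already present in generated_ids."""
    if n <= 0 or len(generated_ids) + 1 < n:
        return set()
    # Map each (n-1)-token prefix to the tokens that have followed it
    seen: dict[tuple[int, ...], set[int]] = defaultdict(set)
    for i in range(len(generated_ids) - n + 1):
        ngram = tuple(generated_ids[i:i + n])
        seen[ngram[:-1]].add(ngram[-1])
    # The last n-1 generated tokens form the prefix of the next n-gram
    prefix = tuple(generated_ids[len(generated_ids) - (n - 1):])
    return set(seen.get(prefix, set()))


def apply_no_repeat_ngram(logits: list[float], generated_ids: list[int],
                          n: int) -> list[float]:
    """Return a copy of `logits` with n-gram-repeating tokens set to -inf."""
    out = list(logits)
    for tok in banned_next_tokens(generated_ids, n):
        out[tok] = -math.inf
    return out


# Hypothetical ids: 19 = "19", 11 = ",000". After "19 ,000 ,000" with n = 2,
# the pair (",000", ",000") already exists, so ",000" is banned next.
print(banned_next_tokens([19, 11, 11], 2))
```

With n = 2 a loop like the `,000,000,…` output above would be cut off almost immediately, but so would any legitimate repeated pair, so a larger n is less aggressive.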

If you can provide an audio clip where this problem happens with our unquantized model but not with the Python version, please file an issue at https://github.com/xenova/transformers.js.
