Pure Python version for local Inference operation on PC

#2
by MartialTerran - opened

I first heard about this model today in this article: https://venturebeat.com/ai/ai-on-your-smartphone-hugging-faces-smollm2-brings-powerful-models-to-the-palm-of-your-hand/?_bhlid=071034f893836a3364663dcc52fbea6fd14a2f15
I am disappointed that the full edge-optimal "model" is not disclosed in a portable standalone Python format that can be ported to any edge device capable of running Python.
Can Hugging Face please publish a Python script that defines the "model" explicitly, without going through the "AutoModelForCausalLM" class of the transformers library?

If you remove from the vocabulary and the output layer all the tokens that your specific language/application does not need, the local model will use fewer resources during inference, and you may get more reliable results, because the output probability is no longer spread over as many extraneous tokens. You need a standalone model.py and tokenizer.py to make these modifications.
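To make the pruning idea concrete, here is a minimal sketch in pure Python (standard library only). It is not SmolLM2 code: the toy vocabulary, the 2-dimensional output rows, and the helper names `prune_vocab`, `logits_from_hidden` and `softmax` are all assumptions made up for illustration. It only shows that dropping rows from the output layer shrinks the work per token and concentrates probability on the tokens that remain.

```python
import math

def prune_vocab(vocab, output_rows, keep_tokens):
    """Keep only keep_tokens; renumber ids from 0 and drop the other output rows."""
    kept = sorted((vocab[t], t) for t in keep_tokens if t in vocab)
    new_vocab = {tok: new_id for new_id, (_, tok) in enumerate(kept)}
    new_rows = [output_rows[old_id] for old_id, _ in kept]
    return new_vocab, new_rows

def logits_from_hidden(hidden, rows):
    """One dot product per remaining vocabulary row."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in rows]

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy data (hypothetical): four tokens, two of them irrelevant to Python code.
vocab = {"def": 0, "return": 1, "<?php": 2, "console.log": 3}
output_rows = [[0.5, 0.1], [0.2, 0.4], [0.3, 0.3], [0.1, 0.2]]
hidden = [1.0, 2.0]

full_probs = softmax(logits_from_hidden(hidden, output_rows))
small_vocab, small_rows = prune_vocab(vocab, output_rows, {"def", "return"})
small_probs = softmax(logits_from_hidden(hidden, small_rows))

print(len(output_rows), "rows before,", len(small_rows), "rows after")
print("P('def') full vs pruned:", full_probs[0], small_probs[0])
```

If a model shares one matrix between the input embedding and the output layer, the same rows would have to be dropped from both, and the tokenizer would need the renumbered ids.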

Or, modify the model.py and tokenizer.py to dynamically deactivate the extraneous tokens during inference (and finetuning). If you are currently coding only in Python, why would you need to compute logits in the output layer for tokens that are used only in a different programming language? If you want to patent and implement this modification, include me as inventor, or at least as co-inventor, in the patent application. Do you want me to write the patent application? I can do that. I can also write the sample code for inclusion in the patent application, at least once I obtain a standalone model.py that runs these SmolLM2 weights. See for example:
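As a rough illustration of the dynamic variant, the sketch below computes logits only for an "active" set of token ids chosen at run time and skips every other output row. It is an assumption-laden toy in pure Python, not the method described at the link that follows and not SmolLM2 code: the helper names `active_logits`, `softmax_over` and `next_token`, the toy output rows, and the two id sets are all hypothetical.

```python
import math

def active_logits(hidden, output_rows, active_ids):
    """Compute one dot product per active token id; inactive rows are never touched."""
    return {i: sum(h * w for h, w in zip(hidden, output_rows[i])) for i in active_ids}

def softmax_over(logits_by_id):
    """Softmax restricted to the ids that were actually computed."""
    m = max(logits_by_id.values())
    exps = {i: math.exp(x - m) for i, x in logits_by_id.items()}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

def next_token(hidden, output_rows, active_ids):
    """Greedy choice among the active tokens only."""
    probs = softmax_over(active_logits(hidden, output_rows, active_ids))
    return max(probs, key=probs.get), probs

# Toy data (hypothetical): ids 0-1 stand for Python tokens, ids 2-3 for other languages.
output_rows = [[0.5, 0.1], [0.2, 0.4], [0.3, 0.3], [0.1, 0.2]]
hidden = [1.0, 2.0]
python_ids = [0, 1]
other_ids = [2, 3]

best_id, probs = next_token(hidden, output_rows, python_ids)
print("active set:", python_ids, "-> chosen id", best_id)
```

Because the full output matrix stays in memory, the active set can be swapped between requests without reloading or modifying the weights.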
https://huggingface.co/MartialTerran/Method_for_Dynamically_Reducing_Logit_Computation_in_LLMs

If SmolLM2 (said to be better at coding than GPT4) can be finetuned to code optimally with our own specific code "syntax and program logic", then we can "own" the resulting model, control it, protect it from revision/upgrade, and protect our investment in the finetuning.

Please build a Python/PyTorch-only version of SmolLM2_model.py and tokenizer.py that does not use the Hugging Face "transformers" library. See https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B/discussions

For example, for PC operation:
SmolLM2.py
SmolLM2_tokenizer.py
So that:
model = SmolLM2(SmolLM_135M_checkpoint_file).to(device)
inputs = SmolLM2_tokenizer.encode("Gravity is", return_tensors="pt").to(device)

not:

from transformers import AutoModelForCausalLM, AutoTokenizer
checkpoint = "HuggingFaceTB/SmolLM2-1.7B"
device = "cuda" # for GPU usage or "cpu" for CPU usage
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# for multiple GPUs install accelerate and do `model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")`

model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)
inputs = tokenizer.encode("Gravity is", return_tensors="pt").to(device)
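One piece that a standalone script would need is a way to read the checkpoint without the transformers library. As a minimal sketch, assuming the weights are stored in the safetensors file format (an 8-byte little-endian header length, a JSON header, then raw tensor bytes), the standard library is enough to list the tensors and decode a float32 one. The function names `read_safetensors_header` and `read_f32_tensor` are made up here; real checkpoints may store BF16 or F16 values, which would need extra decoding that this sketch does not attempt.

```python
import json
import struct

def read_safetensors_header(path):
    """Return (header dict, byte offset where tensor data starts)."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # u64, little-endian
        header = json.loads(f.read(header_len).decode("utf-8"))
    header.pop("__metadata__", None)  # optional free-form metadata entry
    return header, 8 + header_len

def read_f32_tensor(path, name):
    """Return (shape, flat list of floats) for one F32 tensor."""
    header, data_start = read_safetensors_header(path)
    info = header[name]
    if info["dtype"] != "F32":
        raise ValueError("this sketch only decodes F32, got " + info["dtype"])
    begin, end = info["data_offsets"]  # offsets are relative to data_start
    with open(path, "rb") as f:
        f.seek(data_start + begin)
        raw = f.read(end - begin)
    count = len(raw) // 4
    return info["shape"], list(struct.unpack("<%df" % count, raw))

def list_tensors(path):
    """Print name, dtype and shape of every tensor in the file."""
    header, _ = read_safetensors_header(path)
    for name, info in sorted(header.items()):
        print(name, info["dtype"], info["shape"])
```

Calling `list_tensors` on a downloaded checkpoint file (path chosen by the user) would show the tensor names and shapes that an explicit model definition has to match.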

Have there already been initiatives in this direction? I also need a standalone version for Python, to train on one specific programming language.
