Can Huggingface facilitate experimentation with Tiny LLMs?

#2
by MartialTerran - opened

I had a sudden opportunity to write to Clem of Huggingface, and I asked him to help you develop Tiny GPTs using a custom vocabulary:

He wrote:
Great first three days in the Bay Area, so much energy here! My current messages to the ecosystem:

  • Everyone (including software engineers, small startups, academia, indie developers,...) should train, optimize and run their own models. Time for AI builders, not AI users!

  • LLMs are boring and progress is slowing down there. Let's focus on smaller/specialized/on-device models, datasets & other modalities like audio, video, time-series, multi-modal, robotics, biology, chemistry,...!

  • Open-source AI for the win!

Let's go Silicon Valley! 🚀🚀🚀
https://www.linkedin.com/mwlite/feed/update/activity:7194764186524995585?trk=viral_share

I wrote:
Dear Clem,
Thank you,
Please help/motivate the Microsoft team of Ronen Eldan to develop a clean TinyStories V3 dataset (reduced vocabulary, free of misspellings, Chinese characters, and random Unicode) for tiny GPT model training and edge experimentation. Also, please change Huggingface strictures that hinder experimenters (e.g., Corianas) who want to use customized (reduced) vocabularies or novel tokenizers in tiny GPT models, e.g., Microllama_Char_100k_step. Also, please invite Mythic, the Texas analog matmul IC company, to demonstrate that their analog matrix-multiplication ICs can already run tiny GPTs in inference mode at edge power levels. And ask Mythic to develop a USB adapter and simple Python libraries for using their current M.2-format development boards connected to Windows, Arduino, or Android devices.

Hi hi, honestly work has been kicking my butt, and I have been unable to look at this as much as I would like. Regarding the HF issue with my original models, I think I have an idea of how to sort it out, as it's more an issue on my end than HF's, I think; they are just using the standards, and I am the one doing things oddly.

Thank you for prodding me again, as it does help keep me on this. I am really trying to both get my notes together properly and work out how to actually keep getting better results. While I can generate stories well, I really want to next implant some... behaviors, to try and steer towards being able to 'live' in a text world (probably a MUD/MUSH text game, as I remember them being extendable, and I started making my own engine at one point).

:-) I'm off to bed, but thank you again.

Hi. You are welcome. I think that Huggingface should better accommodate small language model (SLM) experimenters working below 1B parameters, and give greater flexibility in tokenizers and GPU access for unique small models (e.g., support inference-mode evaluations based on pure-Python scripts).
I have some novel ideas to further "improve" the performance of tiny GPTs along the lines of your experiments, but I would have to get deep into the low-level Python code of GPT-2 to test them out, then pretrain the modded model from scratch. With work-related burdens, though, that will not be this month. I can't use Huggingface "transformers" library models because they are too strictured and I cannot modify what I need to modify. I have collected examples of pure-Python GPTs that will need some tinkering (a minimal custom-vocabulary sketch along those lines follows below). The question of patents has also come up: why is no one claiming to hold a patent on the GPT-2 model architecture? Take a look at the discussion on Reddit:
https://www.reddit.com/r/MachineLearning/comments/10zzm18/comment/l3wbp55/
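
As a purely illustrative example of the kind of customized (reduced) vocabulary mentioned above, here is a minimal character-level tokenizer in pure Python, with no Huggingface dependencies. The class name and corpus are hypothetical; this is not the actual tokenizer of Microllama_Char_100k_step.

```python
# Minimal, purely illustrative sketch of a customized (reduced) vocabulary:
# a character-level tokenizer built from a small corpus, in pure Python with
# no Huggingface dependencies. Class and variable names are hypothetical.
class CharTokenizer:
    def __init__(self, corpus: str):
        # The vocabulary is just the sorted set of characters seen in the corpus.
        chars = sorted(set(corpus))
        self.stoi = {ch: i for i, ch in enumerate(chars)}   # char -> id
        self.itos = {i: ch for ch, i in self.stoi.items()}  # id -> char
        self.vocab_size = len(chars)

    def encode(self, text: str) -> list:
        return [self.stoi[ch] for ch in text]

    def decode(self, ids: list) -> str:
        return "".join(self.itos[i] for i in ids)

tok = CharTokenizer("Once upon a time there was a tiny model.")
ids = tok.encode("a tiny model")
print(tok.vocab_size, ids, tok.decode(ids))
```

A vocabulary this small keeps the embedding and output layers tiny, which is exactly what makes the reduced-vocabulary approach attractive for sub-1B-parameter models.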

Hi. You might derive something from this study and discussion of tokenizers and vocabularies of various sizes (I am not sure what the term "capcode" means):

https://www.reddit.com/r/MachineLearning/comments/168wc1o/i_pretrained_16_language_models_from_scratch_with/
Quote: "Validation Loss is very strongly correlated (0.97 Pearson correlation) with the compression ratio (average number of characters per token) associated with a given tokenizer. "
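
For reference, the "compression ratio" in that quote can be measured directly. Here is a minimal sketch, assuming the Hugging Face `transformers` library and the stock GPT-2 tokenizer purely as an example:

```python
# Minimal sketch of the "compression ratio" (average characters per token)
# mentioned in the quote, assuming the Hugging Face `transformers` library
# and the stock GPT-2 tokenizer purely as an example.
from transformers import AutoTokenizer

def chars_per_token(tokenizer, texts):
    total_chars = sum(len(t) for t in texts)
    total_tokens = sum(len(tokenizer.encode(t)) for t in texts)
    return total_chars / total_tokens

tok = AutoTokenizer.from_pretrained("gpt2")
sample = [
    "Once upon a time there was a tiny robot.",
    "The robot liked to read small stories.",
]
print(f"compression ratio: {chars_per_token(tok, sample):.2f} chars/token")
```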
