Storing Spelling information in LLMs

#2
by MartialTerran - opened

Hi again.
I'm not sure how to direct message you on hf. So, this is just a comment on a topic that might be of interest to you.
Because you were building smallish LLMs with a single-letter token vocabulary, you forced the LLM to encode the spelling of each word using a small set of tokens (e.g., 27 tokens). The largest LLMs (Google Gemini 1.5) also evidently store extensive word-spelling information for each token/word in their vocabulary (or in a portion of their vocabulary). See the evidence in the ReadMeToo.md at my new post:

https://huggingface.co/datasets/MartialTerran/Eval_Counting_Letters_in_Words/tree/main
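To make the single-letter setup concrete, here is a minimal sketch of a character-level tokenizer. The 27-symbol vocabulary (lowercase letters plus a space) is my assumption of what such a setup looks like, not your actual code:

```python
# Minimal character-level tokenizer: 26 lowercase letters + space = 27 tokens.
# With a vocabulary this small, "spelling" is not stored in the tokenizer at all;
# the model itself must learn which letter sequences form which words.
VOCAB = list("abcdefghijklmnopqrstuvwxyz ")          # 27 symbols
stoi = {ch: i for i, ch in enumerate(VOCAB)}         # char -> token id
itos = {i: ch for ch, i in stoi.items()}             # token id -> char

def encode(text: str) -> list[int]:
    return [stoi[ch] for ch in text.lower() if ch in stoi]

def decode(ids: list[int]) -> str:
    return "".join(itos[i] for i in ids)

print(encode("cat"))        # [2, 0, 19] -- the model must learn that c-a-t spells "cat"
print(decode([2, 0, 19]))   # "cat"
```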

P.S. A new open-sourced Llama-type LLM called SmolLM2 has been mega-trained on trillions of tokens, yet has fewer than 2B parameters. It is said to have high language coherence. Maybe check it out and evaluate it on HF, then download it for local operation, or fire up a PEFT finetuning setup on your PC to see if you can get it configured to do what you want it to do.

Thank you. I have played with SmolLM but not finetuned it, and really should, but work, work, work really gets me down at times.

Hope your projects are leading to results, or satisfaction.

Hi. 
This Dynamically_Reducing_Logit_Computation (comparable to your logit reduction by hardcoding a reduced vocabulary before pretraining) is a serious idea that might be compatible with, or relevant to, your small-token-set research: https://huggingface.co/MartialTerran/Method_for_Dynamically_Reducing_Logit_Computation_in_LLMs
[As currently described, the method has no impact on input tokens or on the pretrained model's vocabulary size.] It might actually be patentable.
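As a rough sketch of the general direction (this is not the published method itself; the candidate-selection rule below is a placeholder assumption), restricting the output projection to a dynamically chosen subset of vocabulary rows avoids computing the full logit matrix at each decoding step, while leaving the input tokens and the pretrained vocabulary untouched:

```python
import torch

def reduced_logits(hidden, unembed, candidate_ids):
    """Compute logits only for a dynamically selected candidate subset.

    hidden:        (d_model,) final hidden state for the current position
    unembed:       (vocab_size, d_model) output-projection / unembedding matrix
    candidate_ids: (k,) indices of tokens kept for this step, with k << vocab_size
    """
    sub_matrix = unembed[candidate_ids]      # gather only k rows: (k, d_model)
    return sub_matrix @ hidden               # (k,) logits for the candidates only

# Toy usage: a 50k-token vocabulary, but only 512 candidates scored this step.
d_model, vocab_size, k = 768, 50_000, 512
hidden = torch.randn(d_model)
unembed = torch.randn(vocab_size, d_model)
candidate_ids = torch.randint(0, vocab_size, (k,))   # placeholder selection rule
logits_k = reduced_logits(hidden, unembed, candidate_ids)
next_token = candidate_ids[logits_k.argmax()]        # map back to a full-vocab id
```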

This Self-Aware_LLM_bootup is currently a "thought experiment" (SciFi) that I had, and I think it could have some practical applications: https://huggingface.co/MartialTerran/Self-Aware_LLM_bootup_sequence

This is an AI-enhanced thought salad published to inspire (or hinder) those who have the capacity/resources to undertake such a massive AGI build: https://huggingface.co/MartialTerran/Artificial_General_Super_Intelligence_LLM

A spin-off derived from the above post, which has generated plausible Python code (modifying state-of-the-art GPTs), has not yet been published.

MOST OF MY TIME is spent dealing with certain emergent real-world problems, as indicated in this hackathon entry: https://devpost.com/software/ai-decision-clerk1

In terms of developing API LLM Apps, I am favoring Google Gemini 1.5 Pro API models and focusing on solving real-world problems, as illustrated in: https://devpost.com/software/ai-decision-clerk1 [But I am also keeping an eye out for downloadable models having sufficient capacity for at least local inference to support my Apps.]

Compare:  AI Legal Assistant (India)   https://devpost.com/software/ai-legal-assistant
See also https://kowallawgroup.com/should-ai-replace-law-clerks-yes-says-adam-unikowsky/ 
https://adamunikowsky.substack.com/p/in-ai-we-trust-part-ii

AI lawyers could wind up democratizing law and making legal services available to people who otherwise wouldn't have access.
https://www.themarshallproject.org/2024/02/10/ai-artificial-intelligence-attorney-court

Versus commercialized products marketed to attorneys: The world’s first generative AI legal assistant is a year old!  https://casetext.com/blog/cocounsel-first-generative-ai-legal-assistant-one-year/ 

https://www.kingselab.org/blog/hackathon-precedent-ai 

I would like to download and experiment/tinker with SmolLM2 (1.7B), since it has a small parameter set, can be trained/tuned on a local PC, and has fairly high coherence. But since it is not easily modified and there has been no publication of a SmolLM2_model.py and SmolLM2_tokenizer.py, it is practically inaccessible to me. The deficiencies of SmolLM2 (1.7B) include: it has an over-150,000-token vocabulary (see my proposal
at https://huggingface.co/MartialTerran/Method_for_Dynamically_Reducing_Logit_Computation_in_LLMs ); it is unnecessarily trained on multiple coding languages (diluted parameters); and the Huggingface makers have not published a standalone SmolLM2_model.py and SmolLM2_tokenizer.py that operate independently of Huggingface "transformers.py" and its inflexible "autotokenizer.py" (which you struggled with overcoming). Thus actual experimentation/tinkering/development is hindered and frustrated. See all of my remarks at https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B/discussions
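For what it's worth, the only documented loading route I know of goes through the transformers auto classes, which is exactly the dependency at issue. A minimal sketch, assuming the standard AutoModelForCausalLM/AutoTokenizer interfaces:

```python
# Standard transformers-dependent loading path for SmolLM2-1.7B; there is no
# published standalone model.py/tokenizer.py, so this pulls in the full library.
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "HuggingFaceTB/SmolLM2-1.7B"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

inputs = tokenizer("Gravity is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))
```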

I hadn't known about the SmolLM2 training (or lack thereof), but I was inspired to release my little stories dataset this weekend, after some final cleaning. It's 5.4 million short stories, like the TinyStories dataset, but without imposing its limitations on paragraphs or vocabulary.

The goal is for a model pre-trained on it to learn how things react in the world.

Tomorrow, I hope to finish my MUD/MUSH client for LLMs and start their journey into simulations, living lives of a sort.

I will take a look at/evaluate your littlestories dataset ASAP. I am trying to build a multi-token LLM from scratch, as a learning exercise about training methods and the internal operation of LLMs. I am currently stuck at: a hybrid NN/GPT-2 architecture having one hidden layer with ReLUs and only 2 output tokens (an XOR implementation with synthetic data input and a d=2-vector detokenizer). The Python script runs, but there is an architecture problem causing it to not effectively reduce the individual-token "one" and "zero" error: MSE converges only down to 0.250 and flatlines there within 200 epochs. The output is logically incorrect for XOR because the two computed logits both hover around 0.5 regardless of the inputs. But my simpler NN implementation, without ReLU in the hidden layer and without a detokenizer output, already accomplished logically correct XOR output (with a fair margin between the two output neurons, one for "zero" and one for "one"). So, I guess I am troubleshooting and learning "error" definition/computation methods and methods of backpropagation in the realm of using a "detokenizer" output head. I am not currently using softmax and cross-entropy, nor dropout during pretraining (as they were not needed in the pure-NN version). I was thinking of adding more tokens (increasing the vocabulary beyond two), but this is supposed to be a "binary" (only "one" and "zero" as tokens) vocabulary so that I can generate synthetic data and experiment on a CPU-only machine. (Maybe I will add an "undefined" token in the future, for implementing larger logic gates.)
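For comparison, here is a minimal stand-alone sketch (plain PyTorch, not my hybrid script; the 2-neuron output head stands in for the d=2 detokenizer) of an XOR network with one ReLU hidden layer and MSE loss that does escape the 0.25 plateau. MSE stuck at exactly 0.25 with both outputs near 0.5 is the classic "predicting the mean" symptom, so the usual suspects are a missing nonlinearity, gradients that never reach the hidden layer, or targets the output head cannot represent:

```python
import torch
import torch.nn as nn

# XOR inputs and 2-way one-hot targets: column 0 = "zero" neuron, column 1 = "one" neuron.
X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
Y = torch.tensor([[1., 0.], [0., 1.], [0., 1.], [1., 0.]])

model = nn.Sequential(
    nn.Linear(2, 8),   # one hidden layer
    nn.ReLU(),
    nn.Linear(8, 2),   # 2-neuron "detokenizer" head: one output per token
)
optimizer = torch.optim.Adam(model.parameters(), lr=0.05)
loss_fn = nn.MSELoss()

for epoch in range(200):
    optimizer.zero_grad()
    loss = loss_fn(model(X), Y)
    loss.backward()
    optimizer.step()

print(loss.item())                 # typically falls well below 0.25 within 200 epochs
print(model(X).argmax(dim=1))      # expected: tensor([0, 1, 1, 0])
```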

I did not know what MUD/MUSH means. Google Gemini LearnLM 1.5 tells me the information below. So, you are developing an LLM-based personality to interact with humans within a "Shared Hallucination"? Is that correct? Will your LLM run inference off of your "home computer"? What hyperparameters can your home computer support? Is there already a known example of people using an LLM to do this MUD/MUSH? Will your LLM be "finetuning" based on its interactions with MUSH people online? Or is the LLM always using a static parameter set? What if someone reacts negatively to your LLM output? Can you use the free-tier Google Gemini 1.5 model API to "do sentiment analysis" on the responses you get from people online, and alert you to a possible problem?
GenAI:
Based on the context, a MUD/MUSH client for LLMs is a program that allows a Large Language Model (LLM) to interact with a MUD (Multi-User Dungeon) or MUSH (Multi-User Shared Hallucination). These are text-based online role-playing environments. The client acts as an intermediary, translating the text-based commands and descriptions from the MUD/MUSH into a format the LLM can understand, and then converting the LLM's responses back into commands and actions within the MUD/MUSH world. Essentially, it lets the LLM "play" in the MUD/MUSH. Corianas's goal in creating this client is likely to:

- Embody the LLM: Give the LLM a persistent presence in a simulated world, allowing it to experience events and interact with other players (potentially other LLMs or humans).
- Train through Interaction: Use the MUD/MUSH as a training ground for the LLM, allowing it to learn from its interactions and develop a better understanding of language, narrative, and social dynamics. This aligns with their previously mentioned goal of creating a dataset for LLMs to learn how things react in the world. A MUD/MUSH provides a dynamic, interactive world for this learning.
- Explore emergent behavior: Observe how the LLM behaves in a complex environment and potentially discover unexpected capabilities or limitations.

In essence, the MUD/MUSH acts as a virtual "life" for the LLM, offering a rich and interactive platform for learning and experimentation.
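On the sentiment-analysis idea, a minimal sketch with the google-generativeai SDK might look like the following (the model name, prompt wording, and alert rule here are my assumptions, and free-tier rate limits would apply):

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")            # free-tier key from Google AI Studio
model = genai.GenerativeModel("gemini-1.5-flash")  # assumed model choice

def flag_negative_reply(player_message: str) -> bool:
    """Ask Gemini to classify a received MUD/MUSH reply; return True if it looks negative."""
    prompt = (
        "Classify the sentiment of this message as POSITIVE, NEUTRAL, or NEGATIVE. "
        f"Answer with one word only.\n\nMessage: {player_message}"
    )
    response = model.generate_content(prompt)
    return "NEGATIVE" in response.text.upper()

if flag_negative_reply("Your bot keeps spamming the room, knock it off."):
    print("Possible problem: negative reaction detected.")
```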

MartialTerran changed discussion status to closed

At 5 gigabytes, the smallstories training dataset is a little too large for my home computer setup. I would have to borrow my friend's gaming PC. In the meantime, can you generate the analysis of your smallstories dataset using my scripts at https://huggingface.co/datasets/MartialTerran/Scripts_for_checking_Train.py ?

Yeah, MUDs are online text adventures where multiple people are in the same world. I am trying both to make ones that fill in the world and items, but also to just... live in them, making memories, and be a world.

Will run the script later and let you know.

Have to run right now; I'll be back later and let you know. But the goal is a ReAct looping agent that starts with a static prompt and memory stores, and... the rest is interacting and making memories. I hope to do a challenge of what prompt (or set of prompts, in a loop) would make the most... real player. A rough sketch of the loop I mean is below.
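This is only a minimal sketch of that shape (plain socket connection, static system prompt, a simple append-only memory list); every name here is a placeholder, and `call_llm` stands in for whatever model actually ends up running:

```python
import socket

SYSTEM_PROMPT = "You are a character living in a text world. Reply with one MUD command."
memory: list[str] = []            # append-only memory store (placeholder)

def call_llm(prompt: str) -> str:
    """Placeholder for whatever model backs the agent (local checkpoint or API call)."""
    raise NotImplementedError

def react_loop(host: str = "localhost", port: int = 4000) -> None:
    """Observe world text, think with the LLM, act, remember -- and repeat."""
    with socket.create_connection((host, port)) as conn:
        while True:
            observation = conn.recv(4096).decode(errors="ignore")     # text from the MUD
            if observation:
                memory.append(observation)
            context = SYSTEM_PROMPT + "\n" + "\n".join(memory[-20:])   # recent memories only
            action = call_llm(context)                                 # reason -> act
            memory.append(f"> {action}")                               # remember own action
            conn.sendall((action + "\n").encode())
```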
