Looking at what makes ChatGPT special...
Outside of its refinement using human feedback, it's quite interesting just how much context it can maintain between requests.
It doesn't look like it's rereading the actual text that came before, because that can easily exceed the roughly 3K-word (~4K-token) window it can handle by a lot.
I feel that there has to be some kind of compressed (neural-net-specific) representation of what it has learned so far, ideally one that's easy to store in a file. You can't be sure when a user will continue a past discussion where you still need to maintain context, so you obviously can't keep state in memory.
Kinda makes me think of Neo learning Kung Fu in the Matrix... Human learning is beyond words, and you'd need some representation of everything down to muscle memory (sorta) to transfer that knowledge of Kung Fu in one go.
I think they haven't publicly shared the key advancements that are allowing ChatGPT to perform as well as it does.
They feed in the whole input from the current session; just refresh the window and you will see it "forget" everything.
There is gpt_index; perhaps they are using something similar to bypass the current limitations. Have you ever tried counting tokens, then asking questions related to the beginning of the conversation after passing some threshold?
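For anyone who wants to run that experiment, here's a minimal sketch using OpenAI's tiktoken library. I'm assuming the cl100k_base encoding is close enough to whatever ChatGPT actually uses, and conversation.txt is a hypothetical saved transcript:

```python
# Count tokens in a saved transcript to see when a conversation
# crosses the model's context window (~4096 tokens for GPT-3).
import tiktoken

# Assumption: cl100k_base is close to ChatGPT's actual tokenizer.
enc = tiktoken.get_encoding("cl100k_base")

with open("conversation.txt") as f:  # hypothetical transcript file
    transcript = f.read()

n_tokens = len(enc.encode(transcript))
print(f"{n_tokens} tokens", "(past the 4096 window)" if n_tokens > 4096 else "")
```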
I've tried it with up to 20K words in the current context, and it still remembered everything fine.
I've also tried it with other kinds of data that would probably have been closer to 100K words' worth, and it still had a good grasp of it.
But on the other hand, the limit for GPT-3 is 4096 tokens, and I think that applies to GPT-3.5 as well. So some workaround is obviously needed.
gpt_index does sound interesting. The goal seems to be to make a large database accessible and then use GPT-3 to answer questions in natural language, based on a queried subset of relevant information. That way, GPT-3 wouldn't need to be trained with additional data, and could answer questions related to your knowledge base.
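I haven't looked at gpt_index's actual API, so this isn't its interface, but the general pattern seems to be: embed your chunks, retrieve the ones nearest the question, and prepend only that subset to the prompt. A toy sketch, where the hash "embedding" is a stand-in for a real embedding model and the final prompt would go to the completion API:

```python
# Hedged sketch of retrieval-augmented answering: embed document
# chunks, rank them against the question, keep only the top-k.
import hashlib
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    """Toy bag-of-words hash embedding (stand-in for a real model)."""
    v = np.zeros(dim)
    for word in text.lower().split():
        v[int(hashlib.md5(word.encode()).hexdigest(), 16) % dim] += 1.0
    return v / (np.linalg.norm(v) or 1.0)

def retrieve(question: str, chunks: list[str], k: int = 3) -> list[str]:
    q = embed(question)
    # Higher dot product with the question = more relevant chunk.
    return sorted(chunks, key=lambda c: -float(embed(c) @ q))[:k]

chunks = ["The cat sat on the mat.",
          "GPT-3 has a 4096-token limit.",
          "Tokyo is the capital of Japan."]
context = "\n".join(retrieve("How many tokens can GPT-3 handle?", chunks, k=1))
prompt = f"Context:\n{context}\n\nQuestion: How many tokens can GPT-3 handle?"
print(prompt)  # this prompt would then be sent to the completion API
```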
I never tried it myself with that many tokens, but if it works, it is impressive indeed.
I have a chat that is over 46K words. Whenever I ask it to remember/recall, it says it cannot, but if I just write something along the lines of "read this entire chat and...", it has no problem recounting any part of the chat or summarizing anything about it.
You could get around the input token limitation with a bit of engineering:
- Store conversation so far as a compressed knowledge graph (setup for semantic search)
- When you receive a prompt, generate search terms for the knowledge graph and identify relevant parts
- If too many relevant parts of the conversation are identified, constrain the set by (1) ranking by relevance, (2) compressing the text, or (3) random sampling [optionally weighted by rank]
- Feed this compressed representation in as a header to the prompt.
I have no idea if ChatGPT is doing any of these things, but it's one possible approach that I am assessing as I continue using the tool and re-evaluating my mental model of what it can do. A rough sketch of the idea follows.
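To make the steps concrete: the keyword-overlap "relevance" below is a crude stand-in for real semantic search over a knowledge graph, and the word budget stands in for a token budget. All names are illustrative:

```python
# Hedged sketch of the pipeline above: score stored conversation
# turns against the new prompt, then pack the best ones into a
# fixed budget and use the result as a prompt header.
def relevance(turn: str, prompt: str) -> float:
    """Crude keyword-overlap score (stand-in for semantic search)."""
    a, b = set(turn.lower().split()), set(prompt.lower().split())
    return len(a & b) / (len(a | b) or 1)

def build_header(history: list[str], prompt: str, budget_words: int = 500) -> str:
    ranked = sorted(history, key=lambda t: relevance(t, prompt), reverse=True)
    header, used = [], 0
    for turn in ranked:
        n = len(turn.split())
        if used + n > budget_words:
            continue  # alternatively: compress the turn, or sample by rank
        header.append(turn)
        used += n
    return "\n".join(header)

history = ["User asked about token limits.",
           "We discussed the Matrix scene.",
           "Assistant explained GPT-3 caps out at 4096 tokens."]
print(build_header(history, "what was the token limit again?"))
```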
We have had interesting findings on context retention in a recent trial (20,310 lines; 1,186,413 characters, per Notepad++). How does that size compare to typical single-conversation lengths people are exploring?
I think knowledge graph + summarization is part of it, probably along with other data structures. I also think it does more than a single pass when the situation is complicated, breaking things down and processing them further before it starts on the current generation (possibly using different, faster models for specific sub-tasks). The summarization half alone gets you a long way; see the sketch below.
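A minimal sketch of summarization as a context compressor, under the assumption that `summarize` would really be a call to a smaller, faster model rather than the truncation stub shown here:

```python
# Hedged sketch: cap each conversation turn, then fold the joined
# result once more if it still exceeds the budget.
def summarize(text: str, max_words: int) -> str:
    # Stand-in: a real system would call a model here, not truncate.
    words = text.split()
    return " ".join(words[:max_words]) + (" ..." if len(words) > max_words else "")

def compress_history(turns: list[str], budget_words: int = 200) -> str:
    # First pass: cap every turn individually.
    per_turn = max(budget_words // max(len(turns), 1), 8)
    turns = [summarize(t, per_turn) for t in turns]
    blob = "\n".join(turns)
    # Second pass: if the joined history is still too long, fold it again.
    if len(blob.split()) > budget_words:
        blob = summarize(blob, budget_words)
    return blob

history = ["User: explain transformers ...", "Assistant: a transformer is ..."]
print(compress_history(history, budget_words=50))
```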