I like the approach you've taken

#1
by tachyphylaxis - opened

I'm no expert in any relevant field, but many months ago I became convinced that the models people were making were becoming inbred. I can't prove it, of course, and I don't know whether merging even has that effect. I could be totally wrong and am simply spending too much time talking to these things. As I see it, though, this is what the "genetics" of a lot of the models looks like as time goes on, lol:

image.png

I started using KoboldAI before LLaMA was released, and although their models based on the likes of gpt-neox-20b were "dumber" and could only handle 2048 tokens of context, they didn't seem to have these stubborn preferences and "speech patterns".

I suggested to people that they do some training on novel datasets, but no one seemed to care, and I couldn't pay for it so ... lol. If I understand you properly, you're trying to address the same thing, albeit with merges.

If you ever are considering doing any training and are interested, I have a parquet I made from the raw data from The Mind is a Metaphor. This includes:

Provenance (last updated July, 2013)
More than 5,980 of the metaphors were found keyword searching Chadwyck-Healey through the Stanford Humanities Digital Information Service SULAIR search interface. The search interface, named HUGO, has now been retired.
Over 900 more metaphors were discovered searching ProQuest's Literature Online collections (LION), which expanded and have now replaced the original Chadwyck-Healey collections.
783 metaphors are from my Orals reading or date from my first six months of collection
Over 3,000 I've encountered while reading since then
More than 450 metaphors were discovered searching in Google Books
338 were found browsing in Eighteenth-Century Collections Online (ECCO)
218 were found keyword-searching texts in the Liberty Fund's Online Library of Liberty (OLL)
188 were found keyword searching the Intelex Past Masters database
180 are from Roger Lonsdale's Eighteenth-Century Women Poets. Oxford: OUP, 1989.
150 are from the King James Bible (UVA edition)
110 were found browsing in Early English Books Online (EEBO)
Over 100 were found searching Project Gutenberg texts
67 were taken from Johnson's Dictionary
27 are from the Oxford English Dictionary (OED)
21 are from Ad Fontes Digital Library of Classic Protestant Texts

Some Rubrics (last updated April, 2015)
721 Animal metaphors (counted as entries)
986 Architecture metaphors
1,365 Body metaphors
440 Fetters metaphors*
509 Plant metaphors
1,827 Government metaphors*
882 Impression metaphors
738 Light metaphors
689 Liquid metaphors
273 Machine metaphors
1,015 Mineral metaphors*
444 Optics metaphors
1,055 Population metaphors
171 Vehicle metaphors
268 Visual Arts metaphors
667 War metaphors*
524 Weather metaphors
817 Writing metaphors*

I have a short list of linguistic corpuses I found a while ago ... which I put ... somewhere lol. I could probably find a lot of them again. There was one which was a collection of erotic chat logs from online services or something. There are linguists out there who do corpus linguistics as a full-time job, and so there's an absolutely massive amount of data out there. Most of the journal articles I found required you to contact the authors for it, but in my experience academics are usually happy to fulfill such requests so long as it's for non-commercial purposes.

If you ever need some data, lemme know, heh.
