Update README.md
---
license: unknown
---
* Base model: [LongLORA 70B](https://huggingface.co/Yukang/Llama-2-70b-longlora-32k) (no base instruction tuning)
* Fine-tuned with the Llama-2 chat format
* System prompt: `An interaction between a user providing instructions, and an imaginative assistant providing responses.`
* 32K context length; use **linear RoPE scaling = 8** (IMPORTANT: use a factor of 8 even if you are not using the full 32K context length)
* Unclear how well the model performs in notebook mode/completions. At least for the initial prompts of a conversation, the chat format & system message seem to matter.
* **This model is not censored, and is capable of producing offensive and NSFW content. Please use this model with caution, and do not use it if you are offended by such content.**
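The chat format above can be sketched as follows. The template string is the standard Llama-2 chat convention; how BOS/EOS tokens are attached is left to your tokenizer/loader and is not specified on this page:

```python
# Minimal sketch of a first-turn Llama-2 chat prompt using the system
# message this model was tuned with. Only the text template is built here;
# BOS/EOS handling is assumed to be done by the tokenizer/loader.
SYSTEM = ("An interaction between a user providing instructions, "
          "and an imaginative assistant providing responses.")

def build_first_prompt(user_message: str, system: str = SYSTEM) -> str:
    return f"[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user_message} [/INST]"

prompt = build_first_prompt(
    "We are co-writing a story scene by scene. Here is the background..."
)
```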
## Available Quantizations

* [bfloat16](https://huggingface.co/grimulkan/aurelian-alpha0.1-70b-rope8-32K-fp16)
* [EXL2 2.4bit](https://huggingface.co/grimulkan/aurelian-alpha0.1-70b-rope8-32K-2.4bpw_h6_exl2) (experimental new quant): fits in 1x24GB using Exllamav2 & 8-bit cache @ 10K context
* [EXL2 4bit](https://huggingface.co/grimulkan/aurelian-alpha0.1-70b-rope8-32K-4.65bpw_h6_exl2): fits in 2x24GB (19/24 split) using Exllamav2 @ 16K context
* [EXL2 6bit](https://huggingface.co/grimulkan/aurelian-alpha0.1-70b-rope8-32K-6bpw_h8_exl2): fits in 48GB+24GB (36/24 split) or 3x24GB (16/17/20 split) using Exllamav2 @ 32K context
* [GGUFs here](https://huggingface.co/Noeda/aurelian-alpha0.1-70b-rope8-32K-GGUF), thanks to [Noeda](https://huggingface.co/Noeda)!
### Main functions

* **Story Co-writing**: Co-write a story over multiple guided prompts across a 32K context, staying consistent with prior story details and capable of writing both long and short scenes. Start by explaining that you are writing a story scene by scene, provide some background and themes/tags, and describe what you want in the first scene. After that, continue directing the story one piece at a time. You can give the model more creative control by asking it to add imaginative details, or have it precisely follow your scene outline.
* **Brainstorming/Speculation/Analysis**: Pause in the midst of co-writing a story to analyze the story so far, bounce ideas about future directions, etc.
* **Oneshot Story-writing**: Write a complete story in one go, based on an outline, themes/tags, etc. Make sure you explain that this is not scene-by-scene writing and is meant to be written in a single pass. You can specify a word count to shoot for (though the model may not respect it). [Example](https://files.catbox.moe/scu11c.txt)
* **Document Search/Analysis**: Reading comprehension and finding information in a long document, or sets of documents (up to 32K tokens)
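A first prompt for scene-by-scene co-writing might be structured like this. All story details below are hypothetical examples, not content from this model card:

```python
# Illustrative structure for a co-writing session: one system message,
# then a first user turn that states the task, background, tags, and the
# first scene request. The story content here is purely made up.
conversation = [
    {"role": "system",
     "content": "An interaction between a user providing instructions, "
                "and an imaginative assistant providing responses."},
    {"role": "user",
     "content": ("We are writing a story together, scene by scene. "
                 "Background: a lighthouse keeper on a remote island. "
                 "Themes/tags: mystery, slow burn. "
                 "Scene 1: describe the keeper's morning routine, "
                 "ending on an unexplained signal from the sea. "
                 "Make this a long response.")},
]
```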
### Secondary Functions (limited training so far; the model may get confused between task types)

* **Roleplaying (RP)**: Explain what RP is, set up a scenario and characters, and start the RP. You can specify any rules, like OOC conventions or the use of emojis.
* **Interactive Fiction (IF) Emulation**: Adventure game/interactive fiction emulation like Zork, Anchorhead, etc. Explain what it is and how the AI should respond, and specify the kind of game, tags, and so on. You can interact with the usual commands like 'north', 'examine book', etc.
* **Choose Your Own Adventure (CYOA) Emulation**: Explain what you're looking for and how you want the AI to respond (e.g., with a numbered list of prompts at the end of each turn), and you can pick which option the story/game should follow. Most such human-written games tend to have 1-2 prompts, so I had a hard time getting the AI to give more options. Fine-tuning is helping, but the model is currently only half-baked here.
* **Document Summary/Editing**: Brief or comprehensive summaries of a long document, or sets of documents, in various formats (prose, bulleted list, table). Can also do some limited re-writing, conversion between formats, and grammar checking.
* **General Chatting**: Explain that it is a general chat, or provide some preamble to your interaction before starting. Otherwise the model might not know whether you want to RP, story-write, or something else.
* **General Logic/Reasoning**: Same guidelines as above.
## Prompting Guidelines

* Treat the first prompt like you normally would a system prompt
* The system prompt itself does not change
* Describe what you want the AI to do in detail in the first prompt, even if you feel it is obvious. This is how the AI can tell what sort of task it is supposed to perform (story-writing, RP, adventure game emulation, summarization, and so on).
* After that, specify anything else you want in the first prompt (your instructions for the next response, for instance).
* Bias the length of the output with your prompt. This is not a guarantee, so you may need to regenerate if you don't get your preferred length. The model will easily produce 2000+ tokens (e.g., for a story scene), so make sure your response limit can handle that.
  * Statements like `Make this a long response` would bias the response longer
  * Statements like `Respond briefly` would bias it shorter
* Explain clearly in the first prompt whether you want the content to be SFW or NSFW. However, **there is no guarantee that the model won't generate NSFW content** if you force it to in a later prompt, even if you specify the content should be SFW at the start. It's just a statistical bias (which should get better with more training).
* Give the model details to go on. The more you flesh out what you want, the better and more consistently it will write. Tiny prompt = tiny story, and more ChatGPTisms.
## Known Issues

* **Blank Outputs**: When you have many short prompts, sometimes the model just produces the EOS token, especially with RP and adventure game emulation. I believe this is due to [this issue](https://huggingface.co/jondurbin/airoboros-l2-70b-3.1.2/discussions/2). Fixing in the next iteration; meanwhile, workarounds:
  * If you're a few prompts into your conversation, change the prompt format away from Llama-chat to break the model out of it. The model seems to adapt just fine to the new format mid-conversation.
  * Use `Start reply with` in Oobabooga to force the first token to be something other than `</s>`
  * Ban the EOS token (though you will need to stop the generation manually in that case)
  * Strip the space after the final `[/INST]`, though I don't know of an easy way to do that without writing code in Oobabooga
  * Ban the EOS token only for the first generated token, though I'm not sure how you'd do that without some code (this feature seems like a good idea to always have enabled, actually)
  * Wait for the next iteration, where I think I have it fixed! Airoboros went through the same issue when it switched to Llama-chat.
* **Lack of Diversity for NSFW Content**: Some common phrases and descriptions are over-used. I believe I know why this is, and believe it can be fixed with more training/diverse content (which is currently being done).
* **ChatGPTisms**: Not refusals, but canned responses, happy endings, that sort of thing. Thankfully this does not happen often, but it shouldn't happen _at all_, as it was _not_ in the training data. It shows up anyhow, possibly because base Llama-2 has it baked in.
  * The first prompt is the worst, because Llama is probably remembering the ChatGPT conversations it saw in pre-training. This gets less likely at longer contexts, which look less like ChatGPT samples. Any reference to a 'chat' or a 'helpful assistant' seems to trigger these 'memories'.
  * I will eventually fight this with DPO (using prompt-biased GPT-4-generated responses as the *rejected* option).
  * For now, regenerate or prompt-engineer around it. The model usually *can* regenerate diverse responses based on `temperature`, `top_p` and `top_k` (many models don't have a diverse distribution outside the top greedy tokens, but you *don't* want that in a creative model like this one).
  * If your client allows it, *edit* the model output to make it how you want. This will serve as an example in the history to steer the model away from its pre-training in future rounds. Editing the model response is more effective than negative instructions (e.g., "don't do X").
* **Repetition**: The usual thing; not sure why it happens. I avoid it by setting `repetition penalty = 1.16` (or higher) and `repetition range = 4096`.
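The "ban EOS only for the first generated token" workaround can be sketched framework-agnostically. In a real stack this would live in a sampler or logits-processor hook; the function name and list-based logits below are illustrative assumptions, not an existing API:

```python
def mask_eos_on_first_step(logits, step, eos_token_id):
    """Set the EOS logit to -inf on the first decoding step only,
    so short prompts can't immediately produce a blank output."""
    if step == 0:
        logits = list(logits)  # copy; don't mutate the caller's buffer
        logits[eos_token_id] = float("-inf")
    return logits

# Sampling settings suggested under "Known Issues" (values from this page):
sampling = {"repetition_penalty": 1.16, "repetition_range": 4096}
```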
### Training Data

For 70% of the training data, the outputs were written by humans, though some of the _inputs_ may have been originally seeded by GPT-4 (and expanded using other LORAs).

For 30% of the training data, the outputs were written by GPT-3.5/GPT-4, but this was mostly logical reasoning, summarization and other non-creative content (with no issue of refusals or alignment). Some GPT-4-generated creative content was present in the RP proxy logs, however.

Partial list of training data (there are other tiny datasets I have yet to sort out, but this is the bulk):
* Main dataset: Human-written stories from forums, The Pile and other sources
  * Raw text was chunked and converted into Q&A (analysis, reading comprehension) and long-form, multi-round, scene-by-scene writing interactions. This was accomplished using other LORAs trained for this purpose.
  * Each interaction was truncated at 32K (usually the first third of a normal-sized novel).
  * In a small fraction of cases, the story-writing continued beyond 32K, but with the dropped initial portions either summarized or re-constituted using RAG, to fit in 32K and still provide context.
  * I will publish the relevant LORAs and data sources as best I can, once I sort everything out. I am not sure what the current legal status of The Pile is (especially the books section).
* Summaries of Wikipedia articles in various formats (generated by `gpt-4-1106-preview`; will publish as soon as I can).
* Double-checked (by GPT-4) spatial reasoning (Piaget-style, line of sight, physical deduction) and theory-of-mind (who knows what about what) problems generated by `gpt-4-1106-preview`. Only limited size so far. Llama-2 is weak at spatial reasoning, and so is one-shot GPT-4, but with double-checking and prompting, GPT-4 gets good enough to generate this synthetic dataset.
* Document Error Correction: GPT-4-generated passages with errors/typos introduced using the Python `typo` library as input, with the original (correct) passages as output.
* Sections of [Airoboros 2.2.1/3.1](https://huggingface.co/datasets/jondurbin/airoboros-3.1) (RP, chain-of-thought, rules-based chats, theory of mind, dealigned writing, jokes/riddles).
* Sections of [Surge Instruct](https://huggingface.co/datasets/sachith-surge/evol-instruct) (for extraction, summarization, re-writing, classification).
* Proxy RP logs (GPT-4 outputs only): [Jannie](https://huggingface.co/datasets/v2ray/jannie-log), [Teatime](https://huggingface.co/datasets/OpenLeecher/Teatime) & AICG were all re-stitched together to create a single seamless conversation (sometimes from the original source) to undo the 2K or 4K divisions, and augmented with more context and rules about the conversation in the first prompt. Will publish the stitched-up versions when I can.
* A fully re-generated version of [Floyd Text Adventures](https://huggingface.co/datasets/PocketDoc/Floyd-Text-Adventures) with better context and AI interaction format. The link points to the original until I upload the modified version.
* A fully re-generated version of the CYS CYOA dataset (re-generated from source by 'dungeon crawling' the space automatically, maximizing visits to unique 'rooms', then converting the output logs into a chat format).
* [NART synthetic therapy logs](https://huggingface.co/datasets/jerryjalapeno/nart-100k-synthetic), heavily filtered and used cautiously (lots of GPTisms, but actually relevant in this context, where the AI is playing a supportive role).
* [Augmental Stenisgate RP](https://huggingface.co/datasets/Heralax/Augmental-Dataset), modified to add more context and make the AI play only a single character (I'll publish the modded version as soon as I can).
* [Bluemoon RP](https://huggingface.co/datasets/Squish42/bluemoon-fandom-1-1-rp-cleaned), fully re-generated using [Karen The Editor](https://huggingface.co/FPHam/Karen_theEditor_13b_HF) to clean it up. Until I publish the modified data on HF, you can get it from here: [Part 1](https://files.catbox.moe/vxqjqg.json) [Part 2](https://files.catbox.moe/1o44yc.json)
* [PIPPA RP](https://huggingface.co/datasets/PygmalionAI/PIPPA), augmented to add more context and rules (derived from the content of the conversation). Will update with a link to the modified version.
* [LimaRP](https://huggingface.co/datasets/lemonilia/LimaRP), slightly augmented to add more context, with the divided conversations stitched together. No conversations were ever split up.
* [Erotic Analysis](https://huggingface.co/datasets/openerotica/erotica-analysis), used in reverse for one-shot NSFW story generation.
* [Reading Comprehension](https://huggingface.co/datasets/jmartin233/reading_comprehension_exercise_dataset)
* [Unnatural Instructions](https://huggingface.co/datasets/mrm8488/unnatural-instructions-full) for word-constrained generation.
* [Long Instructions](https://huggingface.co/datasets/nRuaif/Long-instructions) for relevant document finding/retrieval up to 32K.
* [OpenORCA](https://huggingface.co/datasets/Open-Orca/OpenOrca), GPT-4 outputs only.
* [Ultrachat Uncensored](https://huggingface.co/datasets/ehartford/ultrachat-uncensored), with capitalization errors fixed and further scrubbed for GPTisms (not just refusals; sentiment as well).
* [ShareGPT Hyper Filtered](https://huggingface.co/datasets/totally-not-an-llm/sharegpt-hyperfiltered-3k), further scrubbed for GPTisms (not just refusals; sentiment as well).
* [Claude Multiround](https://huggingface.co/datasets/Norquinal/claude_multiround_chat_30k), also further scrubbed, but since Claude is a different model than GPT-4, I may not have caught all the gushing positivity.
* [Wizard Vicuna Unfiltered](https://huggingface.co/datasets/ehartford/wizard_vicuna_70k_unfiltered), further scrubbed like the others.
* [TinyStories GPT4](https://huggingface.co/datasets/skeskinen/TinyStories-GPT4); I may not include this in the future.
* [SODA Synthetic Dialogue](https://huggingface.co/datasets/emozilla/soda_synthetic_dialogue), used with caution (mostly for title suggestions).
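The document error-correction data described above (corrupted passage in, clean passage out) can be sketched with a minimal stand-in corruptor. The actual dataset used the `typo` PyPI library; its API is not reproduced here, and this helper is only an illustrative substitute:

```python
import random

def introduce_typos(text: str, rate: float = 0.05, seed: int = 0) -> str:
    """Swap adjacent letters at random to simulate typos (a stand-in
    for the `typo` library used to build the error-correction data)."""
    rng = random.Random(seed)
    chars = list(text)
    for i in range(len(chars) - 1):
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

# Training pair: corrupted text as input, original passage as target output.
clean = "The lighthouse keeper checked the lamp at dawn."
pair = {"input": introduce_typos(clean, rate=0.3), "output": clean}
```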
### License

Unsure. This model uses some datasets that were generated from GPT-4 outputs, so OpenAI's terms may apply. I personally have no objection to this model being used for any commercial or non-commercial purpose, but please respect the license agreements of Meta, OpenAI and any other parties involved.
This is a 6-bit EXL2 quantization of [Aurelian v0.1alpha 70B 32K](https://huggingface.co/grimulkan/aurelian-alpha0.1-70b-rope8-32K-fp16) for testing & feedback. See that page for more details.

This quantization fits in 48GB+24GB (36/24 split) or 3x24GB (16/17/20 split) using Exllamav2 @ 32K context.