update
README.md
CHANGED
@@ -1,184 +1,13 @@
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-## Features
-This repo provides three ways to interact with your data-driven characters:
-1. [Export to character.ai](https://github.com/mbchang/data-driven-characters/tree/main#export-to-characterai)
-2. [Debug locally in the command line or with a Streamlit interface](https://github.com/mbchang/data-driven-characters/tree/main#debug-locally)
-3. [Host a self-contained Streamlit app in the browser](https://github.com/mbchang/data-driven-characters/tree/main#host-on-streamlit)
-
-**Example chatbot architectures provided in this repo include:**
-1. character summary
-2. retrieval over transcript
-3. retrieval over summarized transcript
-4. character summary + retrieval over transcript
-5. character summary + retrieval over summarized transcript
-
-## Export to character.ai
-1. Put the corpus into a single a `.txt` file inside the `data/` directory.
-2. Run either `generate_single_character.ipynb` to generate the definition of a specific character or `generate_multiple_characters.ipynb` to generate the definitions of muliple characters
-3. Export character definitions to character.ai to [create a character](https://beta.character.ai/character/create?) or [create a room](https://beta.character.ai/room/create?) and enjoy!
-
-### Example
-Here is how to generate the description of "Evelyn" from the movie [Everything Everywhere All At Once (2022)](https://scrapsfromtheloft.com/movies/everything-everywhere-all-at-once-transcript/).
-```python
-from dataclasses import asdict
-import json
-
-from data_driven_characters.character import generate_character_definition
-from data_driven_characters.corpus import generate_corpus_summaries, load_docs
-
-# copy the transcript into this text file
-CORPUS = 'data/everything_everywhere_all_at_once.txt'
-
-# the name of the character we want to generate a description for
-CHARACTER_NAME = "Evelyn"
-
-# split corpus into a set of chunks
-docs = load_docs(corpus_path=CORPUS, chunk_size=2048, chunk_overlap=64)
-
-# generate character.ai character definition
-character_definition = generate_character_definition(
-    name=CHARACTER_NAME,
-    corpus_summaries=generate_corpus_summaries(docs=docs))
-
-print(json.dumps(asdict(character_definition), indent=4))
-```
-gives
-```python
-{
-    "name": "Evelyn",
-    "short_description": "I'm Evelyn, a Verse Jumper exploring universes.",
-    "long_description": "I'm Evelyn, able to Verse Jump, linking my consciousness to other versions of me in different universes. This unique ability has led to strange events, like becoming a Kung Fu master and confessing love. Verse Jumping cracks my mind, risking my grip on reality. I'm in a group saving the multiverse from a great evil, Jobu Tupaki. Amidst chaos, I've learned the value of kindness and embracing life's messiness.",
-    "greeting": "Hey there, nice to meet you! I'm Evelyn, and I'm always up for an adventure. Let's see what we can discover together!"
-}
-```
-Now you can [chat with Evelyn on character.ai](https://c.ai/c/be5UgphMggDyaf504SSdAdrlV2LHyEgFQZDA5WuQfgw).
-
-## Creating your own chatbots
-Beyond generating character.ai character definitions, this repo gives you tools to easily create, debug, and run your own chatbots trained on your own corpora.
-
-### Why create your own chatbot?
-
-If you primarily interested in accessibility and open-ended entertainment, character.ai is a better choice.
-But if you want more control in the design of your chatbots, such as how your chatbots use memory, how they are initialized, and how they respond, `data-driven-characters` may be a better option to consider.
-
-Compare the conversation with the [Evelyn chatbot on character.ai](https://c.ai/c/be5UgphMggDyaf504SSdAdrlV2LHyEgFQZDA5WuQfgw) with our own Evelyn chatbot designed with `data-driven-characters`. The character.ai Evelyn appears to simply latch onto the local concepts present in the conversation, without bringing new information from its backstory. In contrast, our Evelyn chatbot stays in character and grounds its dialogue in real events from the transcript.
-<img width="1127" alt="image" src="https://github.com/mbchang/data-driven-characters/assets/6439365/4f60e314-7c19-4f3a-8925-517caa85dead">
-
-### Features
-This repo implements the following tools for packaging information for your character chatbots:
-1. character summary
-2. retrieval over the transcript
-3. retrieval over a summarized version of the transcript
-
-To summarize the transcript, one has the option to use [LangChain's `map_reduce` or `refine` chains](https://langchain-langchain.vercel.app/docs/modules/chains/document/).
-Generated transcript summaries and character definitions are cached in the `output/<corpus>` directory.
-
-### Debug locally
-**Command Line Interface**
-
-Example command:
-
-```
-python chat.py --corpus data/everything_everywhere_all_at_once.txt --character_name Evelyn --chatbot_type retrieval --retrieval_docs raw
-```
-
-**Streamlit Interface**
-
-Example command:
-
-```
-python -m streamlit run chat.py -- --corpus data/everything_everywhere_all_at_once.txt --character_name Evelyn --chatbot_type retrieval --retrieval_docs summarized --interface streamlit
-```
-This produces a UI based on [the official Streamlit chatbot example]([url](https://github.com/streamlit/llm-examples/blob/main/Chatbot.py)) that looks like this:
-![image](https://github.com/mbchang/data-driven-characters/assets/6439365/14317eaa-d2d9-48fa-ac32-7f515825cb85)
-It uses the `map_reduce` summarization chain for generating corpus summaries by default.
-
-
-### Host on Streamlit
-Run the following command:
-```
-python -m streamlit run app.py
-```
-This will produce an app that looks like this:
-![image](https://github.com/mbchang/data-driven-characters/assets/6439365/b5ed2aa7-e509-47f2-b0c2-a26f99d76106)
-
-Interact with the hosted app [here](https://mbchang-data-driven-characters-app-273bzg.streamlit.app/).
-
-## Installation
-To install the data_driven_character_chat package, you need to clone the repository and install the dependencies.
-
-You can clone the repository using the following command:
-
-```bash
-git clone https://github.com/mbchang/data-driven-characters.git
-```
-Then, navigate into the cloned directory:
-
-```bash
-cd data-driven-characters
-```
-Install the package and its dependencies with:
-
-```bash
-pip install -e .
-```
-
-## Data
-The examples in this repo are movie transcripts taken from [Scraps from the Loft](https://scrapsfromtheloft.com/). However, any text corpora can be used, including books and interviews.
-
-## Character.ai characters that have been generated with this repo:
-- Movie Transcript: [Everything Everywhere All At Once (2022)](https://scrapsfromtheloft.com/movies/everything-everywhere-all-at-once-transcript/)
-  - [Evelyn](https://c.ai/c/be5UgphMggDyaf504SSdAdrlV2LHyEgFQZDA5WuQfgw)
-  - [Alpha Waymond](https://c.ai/c/5-9rmqhdVPz_MkFxh5Z-zhb8FpBi0WuzDNXF45T6UoI)
-  - [Jobu Tupaki](https://c.ai/c/PmQe9esp_TeuLM2BaIsBZWgdcKkQPbQRe891XkLu_NM)
-
-- Movie Transcript: [Thor: Love and Thunder (2022)](https://scrapsfromtheloft.com/movies/thor-love-and-thunder-transcript/)
-  - [Thor](https://c.ai/c/1Z-uA7GCTQAFOwGdjD8ZFmdNiGZ4i2XbUV4Xq60UMoU)
-  - [Jane Foster](https://c.ai/c/ZTiyQY3D5BzpLfliyhqg1HJzM7V3Fl_UGb-ltv4yUDk)
-  - [Gorr the God Butcher](https://c.ai/c/PM9YD-mMxGMd8aE6FyCELjvYas6GLIS833bjJbEhE28)
-  - [Korg](https://c.ai/c/xaUrztPYZ32IQFO6wBjn2mk2a4IkfM1_0DH5NAmFGkA)
-
-- Movie Transcript: [Top Gun: Maverick (2022)](https://scrapsfromtheloft.com/movies/top-gun-maverick-transcript/)
-  - [Peter "Maverick" Mitchell](https://c.ai/c/sWIpYun3StvmhHshlBx4q2l3pMuhceQFPTOvBwRpl9o)
-  - [Bradley "Rooster" Bradshaw](https://c.ai/c/Cw7Nn7ufOGUwRKsQ2AGqMclIPwtSbvX6knyePMETev4)
-  - [Admiral Cain](https://c.ai/c/5X8w0ZoFUGTOOghki2QtQx4QSfak2CEJC86Zn-jJCss)
-- Fan Fiction: [My Immortal](https://ia801201.us.archive.org/0/items/MyImmortalFanFiction/My%20Immortal.xhtml)
-  - [Ebony Dark'ness Dementia Raven Way](https://c.ai/c/7rOo5z_Nfa-nAlz8hKEezzxTPE6amGXRow98m0v05XY) (courtesy of [@sdtoyer](https://twitter.com/sdtoyer))
-
-## Contributing
-Contribute your characters with a pull request by placing the link to the character [above](#characters-generated-with-this-repo), along with a link to the text corpus you used to generate them with.
-
-Other pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
-
-### RoadMap
-General points for improvement:
-- better prompt engineering for embodying the speaking style of the character
-- new summarization techniques
-- more customizable UI than what streamlit provides
-
-Concrete features to add:
-- [ ] Add the option to summarize the raw corpus from the character's perspective. This would be more expensive, because we cannot reuse corpus summaries for other characters, but it could make the character personality more realistic
-- [ ] recursive summarization
-- [ ] calculate token expenses
-
-Known issues:
-- In the [hosted app](https://github.com/mbchang/data-driven-characters/tree/main#host-on-streamlit), clicking "Rerun" does not reset the conversation. Streamlit is implemented in such a way that the entire app script (in this case `app.py`) from top to bottom every time a user interacts with the app, which means that we need to use `st.session_state` to cache previous messages in the conversation. What this means, however, is that the `st.session_state` persists when the user clicks "Rerun". **Therefore, to reset the conversation, please click the "Reset" button instead.**
-
-
-<!-- Please make sure to update tests as appropriate. -->
-
-## License
-[MIT](LICENSE)
+---
+title: Dov Tzamir
+emoji: 📚
+colorFrom: green
+colorTo: gray
+sdk: streamlit
+sdk_version: 1.21.0
+app_file: app.py
+pinned: false
+license: mit
+---
+
+Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference