rubensmau committed on
Commit
1f363c0
1 Parent(s): b092c58
Files changed (1)
  1. README.md +13 -184
README.md CHANGED
@@ -1,184 +1,13 @@
1
- # Data-Driven Characters
2
-
3
- Generate character chatbots from existing corpora with [LangChain](https://docs.langchain.com/docs/).
4
-
5
- ![image](assets/teaser_chatbot.jpg)
6
-
7
- **TLDR: This repo enables you to create data-driven characters in three steps:**
8
- 1. Upload a corpus
9
- 2. Name a character
10
- 3. Enjoy
11
-
12
- ## About
13
- The purpose of `data-driven-characters` is to serve as a minimal hackable starting point for creating your own data-driven character chatbots. It provides a simple library built on top of LangChain for processing any text corpus, creating character definitions, and managing memory, with various examples and interfaces that make it easy to spin up and debug your own character chatbots.
14
-
15
- ## Features
16
- This repo provides three ways to interact with your data-driven characters:
17
- 1. [Export to character.ai](https://github.com/mbchang/data-driven-characters/tree/main#export-to-characterai)
18
- 2. [Debug locally in the command line or with a Streamlit interface](https://github.com/mbchang/data-driven-characters/tree/main#debug-locally)
19
- 3. [Host a self-contained Streamlit app in the browser](https://github.com/mbchang/data-driven-characters/tree/main#host-on-streamlit)
20
-
21
- **Example chatbot architectures provided in this repo include:**
22
- 1. character summary
23
- 2. retrieval over transcript
24
- 3. retrieval over summarized transcript
25
- 4. character summary + retrieval over transcript
26
- 5. character summary + retrieval over summarized transcript
27
-
28
- ## Export to character.ai
29
- 1. Put the corpus into a single `.txt` file inside the `data/` directory.
30
- 2. Run either `generate_single_character.ipynb` to generate the definition of a specific character or `generate_multiple_characters.ipynb` to generate the definitions of multiple characters.
31
- 3. Export character definitions to character.ai to [create a character](https://beta.character.ai/character/create?) or [create a room](https://beta.character.ai/room/create?) and enjoy!
32
-
33
- ### Example
34
- Here is how to generate the description of "Evelyn" from the movie [Everything Everywhere All At Once (2022)](https://scrapsfromtheloft.com/movies/everything-everywhere-all-at-once-transcript/).
35
- ```python
36
- from dataclasses import asdict
37
- import json
38
-
39
- from data_driven_characters.character import generate_character_definition
40
- from data_driven_characters.corpus import generate_corpus_summaries, load_docs
41
-
42
- # copy the transcript into this text file
43
- CORPUS = 'data/everything_everywhere_all_at_once.txt'
44
-
45
- # the name of the character we want to generate a description for
46
- CHARACTER_NAME = "Evelyn"
47
-
48
- # split corpus into a set of chunks
49
- docs = load_docs(corpus_path=CORPUS, chunk_size=2048, chunk_overlap=64)
50
-
51
- # generate character.ai character definition
52
- character_definition = generate_character_definition(
53
- name=CHARACTER_NAME,
54
- corpus_summaries=generate_corpus_summaries(docs=docs))
55
-
56
- print(json.dumps(asdict(character_definition), indent=4))
57
- ```
58
- gives
59
- ```json
60
- {
61
- "name": "Evelyn",
62
- "short_description": "I'm Evelyn, a Verse Jumper exploring universes.",
63
- "long_description": "I'm Evelyn, able to Verse Jump, linking my consciousness to other versions of me in different universes. This unique ability has led to strange events, like becoming a Kung Fu master and confessing love. Verse Jumping cracks my mind, risking my grip on reality. I'm in a group saving the multiverse from a great evil, Jobu Tupaki. Amidst chaos, I've learned the value of kindness and embracing life's messiness.",
64
- "greeting": "Hey there, nice to meet you! I'm Evelyn, and I'm always up for an adventure. Let's see what we can discover together!"
65
- }
66
- ```
67
- Now you can [chat with Evelyn on character.ai](https://c.ai/c/be5UgphMggDyaf504SSdAdrlV2LHyEgFQZDA5WuQfgw).
68
-
69
- ## Creating your own chatbots
70
- Beyond generating character.ai character definitions, this repo gives you tools to easily create, debug, and run your own chatbots trained on your own corpora.
71
-
72
- ### Why create your own chatbot?
73
-
74
- If you are primarily interested in accessibility and open-ended entertainment, character.ai is a better choice.
75
- But if you want more control in the design of your chatbots, such as how your chatbots use memory, how they are initialized, and how they respond, `data-driven-characters` may be a better option to consider.
76
-
77
- Compare the conversation with the [Evelyn chatbot on character.ai](https://c.ai/c/be5UgphMggDyaf504SSdAdrlV2LHyEgFQZDA5WuQfgw) with our own Evelyn chatbot designed with `data-driven-characters`. The character.ai Evelyn appears to simply latch onto the local concepts present in the conversation, without bringing new information from its backstory. In contrast, our Evelyn chatbot stays in character and grounds its dialogue in real events from the transcript.
78
- <img width="1127" alt="image" src="https://github.com/mbchang/data-driven-characters/assets/6439365/4f60e314-7c19-4f3a-8925-517caa85dead">
79
-
80
- ### Features
81
- This repo implements the following tools for packaging information for your character chatbots:
82
- 1. character summary
83
- 2. retrieval over the transcript
84
- 3. retrieval over a summarized version of the transcript
85
-
86
- To summarize the transcript, one has the option to use [LangChain's `map_reduce` or `refine` chains](https://langchain-langchain.vercel.app/docs/modules/chains/document/).
87
- Generated transcript summaries and character definitions are cached in the `output/<corpus>` directory.
88
-
89
- ### Debug locally
90
- **Command Line Interface**
91
-
92
- Example command:
93
-
94
- ```
95
- python chat.py --corpus data/everything_everywhere_all_at_once.txt --character_name Evelyn --chatbot_type retrieval --retrieval_docs raw
96
- ```
97
-
98
- **Streamlit Interface**
99
-
100
- Example command:
101
-
102
- ```
103
- python -m streamlit run chat.py -- --corpus data/everything_everywhere_all_at_once.txt --character_name Evelyn --chatbot_type retrieval --retrieval_docs summarized --interface streamlit
104
- ```
105
- This produces a UI based on [the official Streamlit chatbot example](https://github.com/streamlit/llm-examples/blob/main/Chatbot.py) that looks like this:
106
- ![image](https://github.com/mbchang/data-driven-characters/assets/6439365/14317eaa-d2d9-48fa-ac32-7f515825cb85)
107
- It uses the `map_reduce` summarization chain for generating corpus summaries by default.
108
-
109
-
110
- ### Host on Streamlit
111
- Run the following command:
112
- ```
113
- python -m streamlit run app.py
114
- ```
115
- This will produce an app that looks like this:
116
- ![image](https://github.com/mbchang/data-driven-characters/assets/6439365/b5ed2aa7-e509-47f2-b0c2-a26f99d76106)
117
-
118
- Interact with the hosted app [here](https://mbchang-data-driven-characters-app-273bzg.streamlit.app/).
119
-
120
- ## Installation
121
- To install the `data_driven_characters` package, clone the repository and install the dependencies.
122
-
123
- You can clone the repository using the following command:
124
-
125
- ```bash
126
- git clone https://github.com/mbchang/data-driven-characters.git
127
- ```
128
- Then, navigate into the cloned directory:
129
-
130
- ```bash
131
- cd data-driven-characters
132
- ```
133
- Install the package and its dependencies with:
134
-
135
- ```bash
136
- pip install -e .
137
- ```
138
-
139
- ## Data
140
- The examples in this repo are movie transcripts taken from [Scraps from the Loft](https://scrapsfromtheloft.com/). However, any text corpora can be used, including books and interviews.
141
-
142
- ## Character.ai characters that have been generated with this repo:
143
- - Movie Transcript: [Everything Everywhere All At Once (2022)](https://scrapsfromtheloft.com/movies/everything-everywhere-all-at-once-transcript/)
144
- - [Evelyn](https://c.ai/c/be5UgphMggDyaf504SSdAdrlV2LHyEgFQZDA5WuQfgw)
145
- - [Alpha Waymond](https://c.ai/c/5-9rmqhdVPz_MkFxh5Z-zhb8FpBi0WuzDNXF45T6UoI)
146
- - [Jobu Tupaki](https://c.ai/c/PmQe9esp_TeuLM2BaIsBZWgdcKkQPbQRe891XkLu_NM)
147
-
148
- - Movie Transcript: [Thor: Love and Thunder (2022)](https://scrapsfromtheloft.com/movies/thor-love-and-thunder-transcript/)
149
- - [Thor](https://c.ai/c/1Z-uA7GCTQAFOwGdjD8ZFmdNiGZ4i2XbUV4Xq60UMoU)
150
- - [Jane Foster](https://c.ai/c/ZTiyQY3D5BzpLfliyhqg1HJzM7V3Fl_UGb-ltv4yUDk)
151
- - [Gorr the God Butcher](https://c.ai/c/PM9YD-mMxGMd8aE6FyCELjvYas6GLIS833bjJbEhE28)
152
- - [Korg](https://c.ai/c/xaUrztPYZ32IQFO6wBjn2mk2a4IkfM1_0DH5NAmFGkA)
153
-
154
- - Movie Transcript: [Top Gun: Maverick (2022)](https://scrapsfromtheloft.com/movies/top-gun-maverick-transcript/)
155
- - [Peter "Maverick" Mitchell](https://c.ai/c/sWIpYun3StvmhHshlBx4q2l3pMuhceQFPTOvBwRpl9o)
156
- - [Bradley "Rooster" Bradshaw](https://c.ai/c/Cw7Nn7ufOGUwRKsQ2AGqMclIPwtSbvX6knyePMETev4)
157
- - [Admiral Cain](https://c.ai/c/5X8w0ZoFUGTOOghki2QtQx4QSfak2CEJC86Zn-jJCss)
158
- - Fan Fiction: [My Immortal](https://ia801201.us.archive.org/0/items/MyImmortalFanFiction/My%20Immortal.xhtml)
159
- - [Ebony Dark'ness Dementia Raven Way](https://c.ai/c/7rOo5z_Nfa-nAlz8hKEezzxTPE6amGXRow98m0v05XY) (courtesy of [@sdtoyer](https://twitter.com/sdtoyer))
160
-
161
- ## Contributing
162
- Contribute your characters with a pull request by adding the character link [above](#characters-generated-with-this-repo), along with a link to the text corpus you used to generate it.
163
-
164
- Other pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
165
-
166
- ### Roadmap
167
- General points for improvement:
168
- - better prompt engineering for embodying the speaking style of the character
169
- - new summarization techniques
170
- - more customizable UI than what streamlit provides
171
-
172
- Concrete features to add:
173
- - [ ] Add the option to summarize the raw corpus from the character's perspective. This would be more expensive, because we cannot reuse corpus summaries for other characters, but it could make the character personality more realistic
174
- - [ ] recursive summarization
175
- - [ ] calculate token expenses
176
-
177
- Known issues:
178
- In the [hosted app](https://github.com/mbchang/data-driven-characters/tree/main#host-on-streamlit), clicking "Rerun" does not reset the conversation. Streamlit reruns the entire app script (in this case `app.py`) from top to bottom every time a user interacts with the app, so we use `st.session_state` to cache previous messages in the conversation. As a result, `st.session_state` persists when the user clicks "Rerun". **Therefore, to reset the conversation, please click the "Reset" button instead.**
179
-
180
-
181
- <!-- Please make sure to update tests as appropriate. -->
182
-
183
- ## License
184
- [MIT](LICENSE)
 
1
+ ---
2
+ title: Dov Tzamir
3
+ emoji: 📚
4
+ colorFrom: green
5
+ colorTo: gray
6
+ sdk: streamlit
7
+ sdk_version: 1.21.0
8
+ app_file: app.py
9
+ pinned: false
10
+ license: mit
11
+ ---
12
+
13
+ Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
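
The added lines replace the project README with a front matter block that configures the Space (SDK, SDK version, entry file, license). As a rough illustration of what that block carries, the sketch below reads it into a dictionary. `parse_front_matter` is a hypothetical helper written for this note, not the Hub's own parser, and it assumes only flat `key: value` lines like the ones in this commit; real front matter is YAML and can be richer.

```python
# Hypothetical helper (not part of the Hub or of this repo): reads a flat
# `key: value` front matter block such as the one this commit adds.

# The new README.md content, reconstructed from the "+" lines of the diff.
README = """---
title: Dov Tzamir
emoji: 📚
colorFrom: green
colorTo: gray
sdk: streamlit
sdk_version: 1.21.0
app_file: app.py
pinned: false
license: mit
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
"""


def parse_front_matter(text: str) -> dict[str, str]:
    """Return the fields between the opening and closing `---` lines.

    Values are kept as plain strings. Returns an empty dict when the text
    does not start with `---` or the block is never closed.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}

    fields: dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return fields  # closing delimiter reached
        key, sep, value = line.partition(":")
        if sep:  # skip lines that are not `key: value`
            fields[key.strip()] = value.strip()
    return {}  # no closing delimiter found


config = parse_front_matter(README)
print(config["sdk"], config["sdk_version"], config["app_file"])
```

With this configuration the Space is declared as a Streamlit app whose entry point is `app.py`, which matches the `python -m streamlit run app.py` command in the removed README.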