ChatData 🔍 📖

We are constantly improving LangChain's self-query retriever. Some of these features are not yet merged.

Yet another chat-with-documents app, but one that supports queries over millions of files with MyScale and LangChain.

News 🔥

🔧 Our contribution to LangChain that helps self-query retrievers filter with more types and functions
🌟 We just opened a FREE pod hosting data for ArXiv papers. Anyone can try their own SQL with vector search! Feel the power when SQL meets vector search! See how to access the pod here.
📚 We collected 1.67 million papers on arxiv! We are collecting more and we need your advice!
More coming...

Quickstart

Create a virtual environment

python3 -m venv .venv
source .venv/bin/activate

Install dependencies

This app currently uses MyScale's fork of LangChain, which contains improved prompts for the comparators LIKE and CONTAIN in the MyScale self-query retriever.

python3 -m pip install -r requirements.txt

Run the app!

# copy the example secrets file, then fill in your OpenAI key in .streamlit/secrets.toml
cp .streamlit/secrets.example.toml .streamlit/secrets.toml
# start the app
python3 -m streamlit run app.py

Quick Navigator 🧭

How can I run this app?
How is this app built?
What is the overall pipeline?
How did LangChain and MyScale convert natural language to structured filters?
How to make chain execution more responsive in LangChain?

Where can I get the arXiv data?

From parquet files on S3

Or use the MyScale database as a service directly... for FREE ✨

import clickhouse_connect

client = clickhouse_connect.get_client(
    host='msc-1decbcc9.us-east-1.aws.staging.myscale.cloud',
    port=443,
    username='chatdata',
    password='myscale_rocks'
)

Or put these settings in .streamlit/secrets.toml

MYSCALE_HOST = "msc-1decbcc9.us-east-1.aws.staging.myscale.cloud"
MYSCALE_PORT = 443
MYSCALE_USER = "chatdata"
MYSCALE_PASSWORD = "myscale_rocks"
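
As a minimal sketch of how these settings could be turned into a client, the helper below maps the MYSCALE_* keys onto the keyword arguments of clickhouse_connect.get_client shown earlier. The helper name myscale_client_kwargs is hypothetical, and how app.py actually reads its secrets is not shown here; this only assumes the settings are available as a mapping such as st.secrets.

```python
from typing import Any, Dict, Mapping


def myscale_client_kwargs(secrets: Mapping[str, Any]) -> Dict[str, Any]:
    """Map MYSCALE_* settings to clickhouse_connect.get_client keyword arguments.

    Hypothetical helper for illustration; not part of the app.
    """
    return {
        "host": secrets["MYSCALE_HOST"],
        "port": int(secrets["MYSCALE_PORT"]),  # accept 443 or "443"
        "username": secrets["MYSCALE_USER"],
        "password": secrets["MYSCALE_PASSWORD"],
    }


# Inside the Streamlit app this might be used as follows (not executed here):
#   import streamlit as st
#   import clickhouse_connect
#   client = clickhouse_connect.get_client(**myscale_client_kwargs(st.secrets))
```

Keeping the mapping in one small function means the connection settings live only in .streamlit/secrets.toml and never need to be hard-coded in the app.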

Introduction

ChatData brings millions of papers into your knowledge base. We have imported 1.67 million papers (continuously updated), each with the following metadata:

metadata.authors: the paper's authors, as a list of strings
metadata.abstract: the paper's abstract, used as the ranking criterion (with InstructorXL)
metadata.titles: the paper's title
metadata.categories: the paper's categories, as a list of strings like ["cs.CV"]
metadata.pubdate: the paper's date of publication, as an ISO 8601 formatted string
metadata.primary_category: the paper's primary category, as a string defined by ArXiv
metadata.comment: additional comments on the paper

For the overall table schema, please refer to the table creation section in docs/self-query.md.
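
To make the shape of these fields concrete, here is a rough sketch of one metadata record as a Python TypedDict. The class name PaperMetadata, the helper published_in_year, and the placeholder values are all hypothetical and only mirror the field list above; the authoritative schema is the one in docs/self-query.md.

```python
from datetime import datetime
from typing import List, TypedDict


class PaperMetadata(TypedDict):
    """Hypothetical Python view of the metadata fields listed above."""
    authors: List[str]       # list of author names
    abstract: str            # text used for ranking
    titles: str              # the paper's title
    categories: List[str]    # e.g. ["cs.CV"]
    pubdate: str             # ISO 8601 formatted string
    primary_category: str    # category defined by ArXiv
    comment: str             # additional comments


def published_in_year(meta: PaperMetadata, year: int) -> bool:
    """Return True if the ISO 8601 pubdate falls in the given year."""
    return datetime.fromisoformat(meta["pubdate"]).year == year


# Placeholder record for illustration only; not a real paper.
example = PaperMetadata(
    authors=["A. Author", "B. Author"],
    abstract="Placeholder abstract.",
    titles="Placeholder title",
    categories=["cs.CV"],
    pubdate="2023-05-01T00:00:00",
    primary_category="cs.CV",
    comment="",
)
```

A filter such as published_in_year(example, 2023) is the kind of structured condition that a self-query retriever derives from a natural-language question, here written by hand.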

How to run 🏃

python3 -m pip install -r requirements.txt
python3 -m streamlit run app.py

How to build? 🧱

See docs/self-query.md

Special Thanks 👏 (Ordered Alphabetically)

ArXiv API for its open-access interoperability for preprint papers.
InstructorXL for its promptable embeddings that improve retrieval performance.
LangChain🦜️🔗 for its easy-to-use and composable API designs and prompts.
The Alexandria Index for providing an arXiv data index to the public.