A template for a RAG system with a Gradio UI

Deliberately stripped down to leave some room for experimenting

Setting it up

Clone https://github.com/huggingface/transformers to a local machine
Use the prep_scrips/markdown_to_text.py script to extract raw text from the markdown files in transformers/docs/source/en/
Break the resulting texts down into semantically meaningful pieces. Experiment with different chunking mechanisms to make sure the semantic meaning is captured.
Use prep_scrips/lancedb_setup.py to embed and store chunks in a lancedb instance. It also creates an index for fast ANN retrieval (not really needed for this exercise but necessary at scale). You'll need to put your own values into VECTOR_COLUMN_NAME, TEXT_COLUMN_NAME, DB_TABLE_NAME.
Move the database directory (.lancedb by default) to gradio_app/
Use the template given in gradio_app to wrap everything into the Gradio app and run it on HF spaces. Make sure to adjust VECTOR_COLUMN_NAME, TEXT_COLUMN_NAME, DB_TABLE_NAME according to your DB setup.
In your space, set up the secrets OPENAI_API_KEY and HUGGING_FACE_HUB_TOKEN to use OpenAI and open-source models, respectively
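
As a starting point for the chunking step above, here is a minimal sketch of one possible strategy: packing paragraphs into chunks with a soft size limit and an optional overlap. The function name `chunk_by_paragraph` and its parameters are hypothetical and not part of this template; a paragraph longer than `max_chars` is kept whole, and overlapping chunks may exceed the limit.

```python
import re


def chunk_by_paragraph(text: str, max_chars: int = 500, overlap: int = 1) -> list[str]:
    """Pack blank-line-separated paragraphs into chunks of roughly max_chars.

    The last `overlap` paragraphs of a chunk are repeated at the start of the
    next one, so context is not cut off abruptly at chunk boundaries.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    for para in paragraphs:
        candidate = "\n\n".join(current + [para])
        # Flush the current chunk if adding this paragraph would exceed the limit.
        if current and len(candidate) > max_chars:
            chunks.append("\n\n".join(current))
            current = current[-overlap:] if overlap > 0 else []
        current.append(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks


if __name__ == "__main__":
    sample = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph."
    for i, chunk in enumerate(chunk_by_paragraph(sample, max_chars=40, overlap=1)):
        print(i, repr(chunk))
```

Varying `max_chars` and `overlap` is one simple way to produce the different chunking variants that the TODOs below ask you to compare.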
TODOs:
- Experiment with chunking, see how it affects the results. When deciding how to chunk it helps to think about what kind of chunks you'd like to see as context to your queries.
  - Deliverables: Demonstrate how retrieved documents differ with different chunking strategies and how it affects the output.
- Try out different embedding models (EMBED_NAME). Good models to start with are sentence-transformers/all-MiniLM-L6-v2 (lightweight) and thenlper/gte-large (relatively heavy but more powerful).
  - Deliverables: Demonstrate how retrieved documents differ with different embedding models and how they affect the output. Estimate how the time to embed the chunks and the DB ingestion time differ between models (both happen in prep_scrips/lancedb_setup.py).
- Add a re-ranker (cross-encoder) to the pipeline. Start with the sentence-transformers documentation pages on cross-encoders, then pick a pretrained cross-encoder, e.g. cross-encoder/ms-marco-MiniLM-L-12-v2. Don't forget to increase the number of retrieved documents when using a re-ranker. The number of documents used as context should stay the same.
  - Deliverables: Demonstrate how retrieved documents differ after adding a re-ranker and how it affects the output. Provide an estimate of how latency changes.
- Try another LLM (e.g. LLaMA-2-70b, falcon-180b).
  - Deliverables: Demonstrate how LLMs affect the output and how latency changes with the model size.
- Add more documents (e.g. diffusers, tokenizers, optimum, etc) to see how the system scales.
  - Deliverables: Demonstrate how latency changes, and how it differs with and without an index (the index is added in prep_scrips/lancedb_setup.py).
- (Bonus) Use an LLM to quantitatively compare outputs of different variants of the system (LLM as a Judge)
  - Deliverables: Describe the experimental setup and evaluation results
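
For the re-ranker TODO above, the retrieve-more-then-keep-fewer flow can be sketched roughly as follows. This is an illustration under assumptions, not the template's implementation: `rerank` and `toy_overlap_scores` are hypothetical names, and the toy word-overlap scorer only stands in for a real cross-encoder so that the example runs without downloading a model.

```python
from typing import Callable, Sequence


def rerank(
    query: str,
    docs: Sequence[str],
    score_pairs: Callable[[list[tuple[str, str]]], Sequence[float]],
    top_k: int,
) -> list[str]:
    """Score every (query, doc) pair and keep the top_k highest-scoring docs."""
    pairs = [(query, doc) for doc in docs]
    scores = score_pairs(pairs)
    ranked = sorted(zip(docs, scores), key=lambda item: item[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]


def toy_overlap_scores(pairs: list[tuple[str, str]]) -> list[float]:
    """Stand-in scorer (assumption): number of query words found in the document."""
    return [
        float(len(set(q.lower().split()) & set(d.lower().split())))
        for q, d in pairs
    ]


if __name__ == "__main__":
    # Pretend these came back from the vector search with a larger-than-usual limit.
    retrieved = [
        "how to train a model",
        "install the library",
        "train a tokenizer from scratch",
    ]
    # Keep the number of context documents the same as before adding the re-ranker.
    print(rerank("train a tokenizer", retrieved, toy_overlap_scores, top_k=2))
```

With sentence-transformers installed, the `predict` method of a `CrossEncoder` loaded with cross-encoder/ms-marco-MiniLM-L-12-v2 accepts a list of (query, document) pairs and could be passed as `score_pairs` in place of the toy scorer. Wrapping the call in `time.perf_counter()` measurements is one way to estimate the added latency.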