Introduction

Welcome to PoliticsToYou - your playground for investigating the heart of politics in Germany

Would you like to gain insights into political debates or reveal party positions on specific topics from any legislature? You can use the ChatBot to ask all your questions or search for related speech content in the Keyword Search section.

You can try out the application on Hugging Face Spaces.

You can continue reading for the technical details and an explanation of each tab in the application.

Technical documentation

Overview

  1. Execute query.py to retrieve all speeches.
  2. Execute FAISS.py to create the vectorstores and store them under the FAISS folder (takes about 24 hours).
  3. Execute Home.py to run the app.

Extract Speeches data

The speech data is retrieved from a locally running Open Discourse database. Refer to the Open Discourse documentation for instructions on setting up the database locally.

Execute query.py to extract all speeches since December 1949.

Transform speeches and load into FAISS vectorstore

As a vectorstore I decided to use FAISS because it supports large-scale datasets with millions of vectors and is optimized for fast retrieval. FAISS (Facebook AI Similarity Search) is an open-source library for vector similarity search developed by Facebook AI Research. It is designed to quickly find similar items, such as images, text, or audio, in large-scale datasets.

In Retrieval-Augmented Generation (RAG) systems, documents are typically split into smaller chunks. This is essential because smaller chunks fit more easily into the LLM's context window and provide more focused information, which improves overall response accuracy. In this project, I used the well-established RecursiveCharacterTextSplitter from LangChain. The splitter recursively splits text along common structural boundaries such as double newlines ("\n\n"), single newlines ("\n"), and spaces (" "). Chunks are created by merging these small segments until a size threshold is met. To preserve context, a certain amount of overlap is introduced between adjacent chunks. However, this process also produces some tiny chunks; chunks with fewer than 100 characters are discarded.
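The splitting, merging, overlap, and filtering steps described above can be sketched in plain Python. This is a simplified illustration of the idea, not LangChain's actual implementation: the function names and the `chunk_size` and `overlap` parameters are assumptions, and segments are re-joined with single spaces rather than their original separators. Only the 100-character minimum comes from the text above.

```python
from typing import List

SEPARATORS = ["\n\n", "\n", " "]  # tried in this order, from coarse to fine


def split_recursive(text: str, chunk_size: int,
                    separators: List[str] = SEPARATORS) -> List[str]:
    """Break text into segments of at most chunk_size characters."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separator left to try: fall back to a hard cut.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    segments: List[str] = []
    for part in text.split(sep):
        segments.extend(split_recursive(part, chunk_size, rest))
    return segments


def merge_with_overlap(segments: List[str], chunk_size: int,
                       overlap: int) -> List[str]:
    """Merge segments into chunks, repeating trailing segments as overlap."""
    chunks: List[str] = []
    current: List[str] = []
    for segment in segments:
        if current and len(" ".join(current + [segment])) > chunk_size:
            chunks.append(" ".join(current))
            # Carry trailing segments (up to `overlap` characters) forward.
            carried: List[str] = []
            for prev in reversed(current):
                if len(" ".join([prev] + carried)) > overlap:
                    break
                carried.insert(0, prev)
            # Make sure the carried text plus the new segment still fits.
            while carried and len(" ".join(carried + [segment])) > chunk_size:
                carried.pop(0)
            current = carried
        current.append(segment)
    if current:
        chunks.append(" ".join(current))
    return chunks


def chunk_text(text: str, chunk_size: int = 300, overlap: int = 50,
               min_length: int = 100) -> List[str]:
    """Split, merge with overlap, then discard chunks below min_length."""
    segments = split_recursive(text, chunk_size)
    chunks = merge_with_overlap(segments, chunk_size, overlap)
    return [chunk for chunk in chunks if len(chunk) >= min_length]
```

Calling `chunk_text(speech)` on a long speech returns overlapping chunks of at most `chunk_size` characters; very short speeches yield no chunks at all because of the minimum-length filter.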

For the retrieval step, documents are converted into vector representations using a sentence-transformer model. In this project, I used the paraphrase-multilingual-MiniLM-L12-v2 model from Hugging Face (https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2), which has demonstrated strong performance in similarity-based retrieval tasks across multiple languages.

A FAISS vectorstore is created for each legislature period, as well as one vectorstore covering all speeches. This enables users of the app to retrieve information from different periods of time.
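To make the one-store-per-legislature idea concrete, here is a toy sketch that uses only the standard library. It does not use FAISS or the sentence-transformer model: `toy_embed` is a bag-of-words stand-in for the real embeddings, the stores are plain lists, and all names (`build_stores`, `search`) and the sample data are assumptions for illustration.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

Vector = Dict[str, float]
Store = List[Tuple[str, Vector]]  # (speech text, embedding) pairs


def toy_embed(text: str) -> Vector:
    """Stand-in for the sentence-transformer: a word-count vector."""
    return dict(Counter(text.lower().split()))


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_stores(speeches: List[Tuple[int, str]]) -> Dict[int, Store]:
    """Group embedded speeches into one store per legislature period."""
    stores: Dict[int, Store] = {}
    for period, text in speeches:
        stores.setdefault(period, []).append((text, toy_embed(text)))
    return stores


def search(stores: Dict[int, Store], periods: List[int],
           query: str, k: int = 2) -> List[str]:
    """Merge the selected period stores and return the k most similar texts."""
    merged = [entry for period in periods for entry in stores.get(period, [])]
    query_vector = toy_embed(query)
    ranked = sorted(merged, key=lambda e: cosine(query_vector, e[1]),
                    reverse=True)
    return [text for text, _ in ranked[:k]]


# Made-up sample data: (legislature period, speech text)
speeches = [
    (1, "pension reform debate"),
    (2, "climate policy and energy"),
    (2, "pension levels for retirees"),
]
stores = build_stores(speeches)
print(search(stores, [2], "pension", k=1))
print(search(stores, [1, 2], "pension reform", k=1))
```

Restricting `periods` to a single legislature limits the search to that store, while passing several periods merges them before ranking.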

App

The app contains two sections: a chatbot interface and a keyword search.

Chatbot

The chatbot is based on the multilingual mistralai/Mixtral-8x7B-Instruct-v0.1 model, which is tuned for instructions. Larger models are not available due to resource restrictions. As in every RAG implementation, documents similar to the user's input are retrieved and provided to the LLM, which generates an answer based on these speeches. The app also supports templates for German and English, allowing users to interact with the LLM in both languages. The templates for each scenario are implemented in chatbot.py. Furthermore, users can change the underlying vectorstore that provides context documents to the LLM by selecting all speeches, speeches from one legislature, or speeches from multiple legislatures. The last option is implemented by merging vectorstores, which adds a small amount of latency.
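As a rough illustration of how language-specific templates can combine retrieved speeches with the user's question, consider the sketch below. The template wording, the `TEMPLATES` dictionary, and the `build_prompt` function are hypothetical; the actual templates live in chatbot.py and may look quite different.

```python
from typing import List

# Hypothetical templates; the real ones are defined in chatbot.py.
TEMPLATES = {
    "en": (
        "Answer the question using only the speech excerpts below.\n\n"
        "Excerpts:\n{context}\n\n"
        "Question: {question}\nAnswer:"
    ),
    "de": (
        "Beantworte die Frage ausschließlich anhand der folgenden "
        "Redeauszüge.\n\n"
        "Auszüge:\n{context}\n\n"
        "Frage: {question}\nAntwort:"
    ),
}


def build_prompt(question: str, documents: List[str],
                 language: str = "en") -> str:
    """Fill the template for the chosen language with retrieved documents."""
    if language not in TEMPLATES:
        raise ValueError(f"Unsupported language: {language!r}")
    context = "\n---\n".join(documents)
    return TEMPLATES[language].format(context=context, question=question)


prompt = build_prompt(
    "Wie stehen die Parteien zur Rente?",
    ["Erster Redeauszug.", "Zweiter Redeauszug."],
    language="de",
)
print(prompt)
```

The resulting string would then be sent to the LLM; keeping one template per language means the retrieval step stays identical regardless of the language the user writes in.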

KeywordSearch

In the keyword search tab, users can enter any word or phrase to find related speeches from all parties or from selected ones. Users can also download their findings as a JSON, Excel, or CSV file for further analysis. In the background, a similarity search is performed to present the documents most similar to the input.
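The download step could look roughly like the following sketch, which serializes search results to JSON and CSV with the standard library. The field names (`party`, `date`, `speech`), the sample record, and the helper names are assumptions for illustration; the Excel export is left out because it needs a third-party package.

```python
import csv
import io
import json
from typing import Dict, List


def to_json(results: List[Dict[str, str]]) -> str:
    """Serialize results to a JSON string, keeping umlauts readable."""
    return json.dumps(results, ensure_ascii=False, indent=2)


def to_csv(results: List[Dict[str, str]]) -> str:
    """Serialize results to CSV text with a header row."""
    if not results:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(results[0].keys()))
    writer.writeheader()
    writer.writerows(results)
    return buffer.getvalue()


# Made-up example record with hypothetical field names
results = [
    {"party": "Party A", "date": "1998-11-10", "speech": "Text, with a comma"},
]
print(to_json(results))
print(to_csv(results))
```

Both helpers return strings, so the output can be handed to a download widget or written to a file as needed.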

Further improvements & ideas

Improvements:

  • Experiment with different LLMs and templates
  • Include chat history in RAG
  • Add a date or legislature filter to KeywordSearch
  • Improve inference time

Ideas:

  • Add a RAG-based chatbot for party manifestos
  • Expand the scope to speeches held in the parliaments of different countries
  • Implement a pipeline to continuously update the underlying data with the most recent speeches

Acknowledgement

A big thank you to the Open Discourse team for creating the underlying speech corpus. Check out their website here.