More_Advanced_Embeddings_Comparator

Chris4K committed on Oct 20, 2024

Commit

f0f9414

1 Parent(s): 586cc45

Update app.py

Files changed (1)

app.py +315 -15

@@ -686,7 +686,7 @@ def launch_interface(share=True):
             use_reranking_input = gr.Checkbox(label="Use Reranking", value=False)
         ####
-        with gr.Tab("Automated"):
             auto_file_input = gr.File(label="Upload File (Optional)")
             auto_query_input = gr.Textbox(label="Search Query")
             auto_model_types = gr.CheckboxGroup(
@@ -750,27 +750,327 @@ def launch_interface(share=True):
         ###
-    tutorial_md = """
-    # Advanced Embedding Comparison Tool Tutorial
-    This tool allows you to compare different embedding models and retrieval strategies for document search and similarity matching.
-    ## How to use:
-    1. Upload a file (optional) or use the default files in the system.
-    2. Enter a search query.
-    3. Enter embedding models as a comma-separated list (e.g., HuggingFace:paraphrase-miniLM,OpenAI:text-embedding-ada-002).
-    4. Set the number of top results to retrieve.
-    5. Optionally, specify advanced settings such as custom embedding models, text splitting strategies, and vector store types.
-    6. Choose whether to use optional features like vocabulary optimization, query optimization, or result reranking.
-    7. If you have a custom tokenizer, upload the file and specify its attributes.
-    The tool will process your query and display results, statistics, and visualizations to help you compare the performance of different models and strategies.
     """
     iface = gr.TabbedInterface(
-        [iface, gr.Markdown(tutorial_md)],
-        ["Embedding Comparison", "Tutorial"]
     )
     iface.launch(share=share)

             use_reranking_input = gr.Checkbox(label="Use Reranking", value=False)
         ####
+        with gr.Tab("Automation"):
             auto_file_input = gr.File(label="Upload File (Optional)")
             auto_query_input = gr.Textbox(label="Search Query")
             auto_model_types = gr.CheckboxGroup(
         ###
+    use_case_md = """
+# 🚀 AI Act Embedding Use Case Guide
+## 📚 Use Case: Embedding the German AI Act for Local Chat Retrieval
+In this guide, we walk through embedding the German version of the AI Act with the Embedding Comparison Tool and MTEB. We then use the resulting embeddings as the retriever that supplies context to a local chat application.
+### Step 1: Prepare the Document 📄
+1. Download the German version of the AI Act (let's call it `ai_act_de.txt`).
+2. Place the file in your project directory.
+### Step 2: Set Up the Embedding Tool 🛠️
+1. Open the Embedding Comparison Tool.
+2. Navigate to the new "Automation" tab.
+### Step 3: Configure the Automated Test 🔧
+In the "Automation" tab, set up the following configuration:
+```markdown
+- File: ai_act_de.txt
+- Query: "Wie definiert das Gesetz KI-Systeme?"
+- Model Types: ✅ HuggingFace, ✅ Sentence Transformers
+- Model Names: paraphrase-multilingual-MiniLM-L12-v2, distiluse-base-multilingual-cased-v2
+- Split Strategies: ✅ recursive, ✅ token
+- Chunk Sizes: 256, 512, 1024
+- Overlap Sizes: 32, 64, 128
+- Vector Store Types: ✅ FAISS
+- Search Types: ✅ similarity, ✅ mmr
+- Top K Values: 3, 5, 7
+- Test Vocabulary Optimization: ✅
+- Test Query Optimization: ✅
+- Test Reranking: ✅
+```
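Before running the test, it can help to estimate how many configurations the settings above imply. The sketch below assumes the tool evaluates the full Cartesian product of the listed values (an assumption; the tool may prune or extend the grid, for example by also crossing model types or the three optimization toggles). The dictionary keys are illustrative names, not the tool's internal parameter names.

```python
from itertools import product

# Values copied from the configuration above; keys are illustrative names only.
grid = {
    "model_name": [
        "paraphrase-multilingual-MiniLM-L12-v2",
        "distiluse-base-multilingual-cased-v2",
    ],
    "split_strategy": ["recursive", "token"],
    "chunk_size": [256, 512, 1024],
    "overlap_size": [32, 64, 128],
    "vector_store_type": ["FAISS"],
    "search_type": ["similarity", "mmr"],
    "top_k": [3, 5, 7],
}

# One dict per combination, assuming a full Cartesian product of the values.
configurations = [dict(zip(grid, values)) for values in product(*grid.values())]

print(len(configurations))  # 2 * 2 * 3 * 3 * 1 * 2 * 3 = 216
```

Under this assumption a single run covers 216 combinations, so trimming the chunk or overlap lists is the quickest way to shorten a test.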
+### Step 4: Run the Automated Test 🏃‍♂️
+Click the "Run Automated Tests" button and wait for the results.
+### Step 5: Analyze the Results 📊
+Suppose the run returns the following results (simulated for illustration):
+```markdown
+Best Model: Sentence Transformers - paraphrase-multilingual-MiniLM-L12-v2
+Best Settings:
+- Split Strategy: recursive
+- Chunk Size: 512
+- Overlap Size: 64
+- Vector Store Type: FAISS
+- Search Type: mmr
+- Top K: 5
+- Optimize Vocabulary: True
+- Use Query Optimization: True
+- Use Reranking: True
+Performance Summary:
+- Search Time: 0.15s
+- Result Diversity: 0.82
+- Rank Correlation: 0.91
+- Silhouette Score: 0.76
+```
+### Step 6: Understand the Results 🧠
+1. **Model**: `paraphrase-multilingual-MiniLM-L12-v2`, loaded through Sentence Transformers, ranked first, likely because of its multilingual training and fine-tuning for paraphrase tasks.
+2. **Split Strategy**: Recursive splitting worked best, probably because it respects the document's structure better than fixed-length token splitting.
+3. **Chunk Size**: A chunk size of 512 gave a good balance between context and specificity.
+4. **Search Type**: MMR (Maximal Marginal Relevance) outperformed plain similarity search, likely because it balances relevance against diversity in the results.
+5. **Optimizations**: All three optimizations (vocabulary, query, and reranking) improved the results, which suggests the extra processing time is worthwhile for this document.
+### Step 7: Implement in Local Chat 💬
+Now that we have the optimal settings, let's implement this in a local chat application:
+1. Use the `paraphrase-multilingual-MiniLM-L12-v2` model for embeddings.
+2. Set up a FAISS vector store with the embedded chunks.
+3. Implement MMR search with a top-k of 5.
+4. Include the optimization steps in your pipeline.
+### Step 8: Test the Implementation 🧪
+Create a simple chat interface and test with various queries about the AI Act. For example:
+User: "Was sind die Hauptziele des KI-Gesetzes?"
     """
+    tutorial_md = r"""
+# Advanced Embedding Comparison Tool Tutorial
+Welcome to the **Advanced Embedding Comparison Tool**! This comprehensive guide will help you understand and utilize the tool's features to optimize your **Retrieval-Augmented Generation (RAG)** systems.
+## Table of Contents
+1. [Introduction to RAG](#introduction-to-rag)
+2. [Key Components of RAG](#key-components-of-rag)
+3. [Impact of Parameter Changes](#impact-of-parameter-changes)
+4. [Advanced Features](#advanced-features)
+5. [Using the Embedding Comparison Tool](#using-the-embedding-comparison-tool)
+6. [Automated Testing and Analysis](#automated-testing-and-analysis)
+7. [Mathematical Concepts and Metrics](#mathematical-concepts-and-metrics)
+8. [Code Examples](#code-examples)
+9. [Best Practices and Tips](#best-practices-and-tips)
+10. [Resources and Further Reading](#resources-and-further-reading)
+---
+## Introduction to RAG
+**Retrieval-Augmented Generation (RAG)** is a powerful technique that combines the strengths of large language models (LLMs) with the ability to access and use external knowledge. RAG is particularly useful for:
+- Providing up-to-date information
+- Answering questions based on specific documents or data sources
+- Reducing hallucinations in AI responses
+- Customizing AI outputs for specific domains or use cases
+RAG is ideal for applications requiring accurate, context-specific information retrieval combined with natural language generation, such as chatbots, question-answering systems, and document analysis tools.
+---
+## Key Components of RAG
+### 1. Document Loading
+Ingests documents from various sources (PDFs, web pages, databases, etc.) into a format that can be processed by the RAG system. The tool supports multiple file formats, including PDF, DOCX, and TXT.
+### 2. Document Splitting
+Splits large documents into smaller chunks for more efficient processing and retrieval. Available strategies include:
+- **Token-based splitting**
+- **Recursive splitting**
+### 3. Vector Store and Embeddings
+Embeddings are dense vector representations of text that capture semantic meaning. The tool supports multiple embedding models and vector stores:
+- **Embedding models**: HuggingFace, OpenAI, Cohere, and custom models.
+- **Vector stores**: FAISS and Chroma.
+### 4. Retrieval
+Finds the most relevant documents or chunks based on a query. Available retrieval methods include:
+- **Similarity search**
+- **Maximal Marginal Relevance (MMR)**
+- **Custom search methods**
+---
+## Impact of Parameter Changes
+Understanding how different parameters affect your RAG system is crucial for optimization:
+- **Chunk Size**: Larger chunks provide more context but may reduce precision. Smaller chunks increase precision but may lose context.
+- **Overlap**: More overlap helps maintain context between chunks but increases computational load.
+- **Embedding Model**: Performance varies across languages and domains.
+- **Vector Store**: Affects query speed and which search types are available.
+- **Retrieval Method**: Influences the diversity and relevance of retrieved documents.
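The interaction between chunk size and overlap can be seen in a minimal sliding-window sketch. The function name `split_with_overlap` is hypothetical and operates on a plain list of tokens; it is not the splitter the tool uses, only an illustration of how overlap repeats content across neighbouring chunks.

```python
def split_with_overlap(tokens, chunk_size, overlap):
    # Each chunk starts (chunk_size - overlap) items after the previous one,
    # so the last `overlap` items of a chunk reappear at the start of the next.
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the current chunk already reaches the end of the input
    return chunks


words = "a b c d e f g h i j".split()
chunks = split_with_overlap(words, chunk_size=4, overlap=2)
for chunk in chunks:
    print(chunk)
```

With ten tokens, a chunk size of 4 and an overlap of 2 yield four chunks, while an overlap of 0 yields three. More overlap means more chunks to embed and store for the same document.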
+---
+## Advanced Features
+### 1. Custom Tokenization
+Upload a custom tokenizer file and specify the tokenizer model, vocabulary size, and special tokens for domain or language-specific tokenization.
+### 2. Query Optimization
+Improve search results by generating multiple variations of the input query using a language model to capture different phrasings.
+### 3. Reranking
+Further refine search results by using a separate model to re-score and reorder the initial retrieval results.
+### 4. Phonetic Matching
+For languages like German, phonetic matching with adjustable weighting is available.
+### 5. Vocabulary Optimization
+Optimize vocabulary for domain-specific applications during the embedding process.
+---
+## Using the Embedding Comparison Tool
+The tool is divided into several tabs for ease of use:
+### Simple Tab
+1. **File Upload**: Upload a file (PDF, DOCX, or TXT) or use files from the `./files` directory.
+2. **Search Query**: Enter the search query.
+3. **Embedding Models**: Select one or more embedding models to compare.
+4. **Top K**: Set the number of top results to retrieve (1-10).
+### Advanced Tab
+5. **Custom Embedding Model**: Specify a custom embedding model.
+6. **Split Strategy**: Choose between 'token' and 'recursive' splitting.
+7. **Chunk Size**: Set chunk size (100-1000).
+8. **Overlap Size**: Set overlap between chunks (0-100).
+9. **Custom Split Separators**: Enter custom separators for text splitting.
+10. **Vector Store Type**: Choose between FAISS and Chroma.
+11. **Search Type**: Select 'similarity', 'mmr', or 'custom'.
+12. **Language**: Specify the document's primary language.
+### Optional Tab
+13. **Text Preprocessing**: Toggle text preprocessing.
+14. **Vocabulary Optimization**: Enable vocabulary optimization.
+15. **Phonetic Matching**: Enable phonetic matching and set its weight.
+16. **Custom Tokenizer**: Upload a custom tokenizer and specify parameters.
+17. **Query Optimization**: Enable query optimization and specify the model.
+18. **Reranking**: Enable result reranking.
+---
+## Automated Testing and Analysis
+The **Automation tab** allows you to run comprehensive tests across multiple configurations:
+1. Set up test parameters like model types, split strategies, chunk sizes, etc.
+2. Click "Run Automated Tests."
+3. View results, statistics, and recommendations to find optimal configurations for your use case.
+---
+## Mathematical Concepts and Metrics
+### Cosine Similarity
+Measures the cosine of the angle between two vectors, used in similarity search:
+$$\text{cosine similarity} = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \|\mathbf{B}\|}$$
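As a minimal illustration of the formula above, the following standard-library sketch computes cosine similarity for two equal-length vectors. The function name is chosen here for illustration; vector stores typically use optimized implementations instead.

```python
import math


def cosine_similarity(a, b):
    # Dot product divided by the product of the Euclidean norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # parallel vectors
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))            # orthogonal vectors
```

Parallel vectors score close to 1 and orthogonal vectors score 0, regardless of vector length.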
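+### Maximal Marginal Relevance (MMR)
+Balances relevance and diversity in search results. Here $Q$ is the query, $R$ the set of retrieved candidates, $S$ the set of results already selected, and $\lambda \in [0, 1]$ the weight that trades relevance against diversity: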
+$$\text{MMR} = \arg\max_{D_i \in R \setminus S} [\lambda \text{Sim}_1(D_i, Q) - (1-\lambda) \max_{D_j \in S} \text{Sim}_2(D_i, D_j)]$$
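The selection rule above can be sketched as a greedy loop. This is a simplified illustration that uses cosine similarity for both similarity terms and works on plain Python lists; the names `mmr_select` and `cosine` are hypothetical, and the vector store's own MMR implementation may differ in details.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr_select(query_vec, doc_vecs, k, lam=0.5):
    # Greedily pick up to k document indices, each time maximising
    # lam * relevance - (1 - lam) * redundancy with what is already selected.
    candidates = list(range(len(doc_vecs)))
    selected = []
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(doc_vecs[i], query_vec)
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lam * relevance - (1 - lam) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


query = [1.0, 1.0]
docs = [[1.0, 0.9], [1.0, 0.8], [0.2, 1.0]]  # docs 0 and 1 are near-duplicates
print(mmr_select(query, docs, k=2, lam=0.5))  # diversity-aware selection
print(mmr_select(query, docs, k=2, lam=1.0))  # pure relevance ranking
```

With `lam=0.5` the second pick skips the near-duplicate of the first result in favour of the more distinct document; with `lam=1.0` the rule reduces to plain similarity ranking.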
+### Silhouette Score
+Measures how well an object fits within its own cluster compared to others. Scores range from -1 to 1, where higher values indicate better-defined clusters.
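For intuition, the sketch below computes the mean silhouette score with Euclidean distance using only the standard library. It follows the usual definition, s = (b - a) / max(a, b), where a is the mean distance to the other points in the same cluster and b is the smallest mean distance to any other cluster. Whether the tool computes its score this way or through a library is not stated here, so treat this as an illustration only.

```python
import math


def silhouette_score(points, labels):
    # Group point indices by cluster label.
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("at least two clusters are required")

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            scores.append(0.0)  # convention: a singleton cluster scores 0
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
print(silhouette_score(points, [0, 0, 1, 1]))  # well-separated clusters
print(silhouette_score(points, [0, 1, 0, 1]))  # poorly assigned clusters
```

The first labelling groups nearby points together and scores close to 1; the second mixes distant points in the same cluster and scores below 0.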
+---
+## Code Examples
+### Custom Tokenization
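+```python
+from tokenizers import Tokenizer, models, trainers
+from tokenizers.pre_tokenizers import Whitespace
+def create_custom_tokenizer(file_path, model_type='WordLevel', vocab_size=10000, special_tokens=None):
+    special_tokens = special_tokens or ["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]
+    with open(file_path, 'r', encoding='utf-8') as f:
+        text = f.read()
+    # Pair each model with its matching trainer (BpeTrainer for BPE).
+    if model_type == 'WordLevel':
+        tokenizer = Tokenizer(models.WordLevel(unk_token="[UNK]"))
+        trainer = trainers.WordLevelTrainer(vocab_size=vocab_size, special_tokens=special_tokens)
+    else:
+        tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
+        trainer = trainers.BpeTrainer(vocab_size=vocab_size, special_tokens=special_tokens)
+    tokenizer.pre_tokenizer = Whitespace()
+    tokenizer.train_from_iterator([text], trainer)
+    return tokenizer
+```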
+### Query Optimization
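+```python
+from langchain.retrievers.multi_query import MultiQueryRetriever
+from langchain_core.callbacks.manager import CallbackManagerForRetrieverRun
+
+def optimize_query(query, llm, retriever):
+    # `retriever` is the base retriever built from the vector store (e.g. via get_retriever).
+    multi_query_retriever = MultiQueryRetriever.from_llm(retriever=retriever, llm=llm)
+    # generate_queries expects a run manager; a no-op manager suffices outside a chain run.
+    run_manager = CallbackManagerForRetrieverRun.get_noop_manager()
+    optimized_queries = multi_query_retriever.generate_queries(query, run_manager)
+    return optimized_queries
+```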
+### Reranking
+```python
+def rerank_results(results, query, reranker):
+    reranked_results = reranker.rerank(query, [doc.page_content for doc in results])
+    return reranked_results
+```
+---
+## Best Practices and Tips
+- **Start Simple**: Begin with basic configurations, then gradually add complexity.
+- **Benchmark**: Use automated testing to benchmark different setups.
+- **Domain-Specific Tuning**: Consider custom tokenizers and embeddings for specialized domains.
+- **Balance Performance and Cost**: Use advanced features like query optimization and reranking judiciously.
+- **Iterate**: Optimization is an iterative process; refine your approach based on the tool's insights.
+---
+## Resources and Further Reading
+Here are some valuable resources to help you better understand and work with embeddings, retrieval systems, and natural language processing:
+### Embeddings and Vector Databases
+- [Understanding Embeddings](https://www.tensorflow.org/text/guide/word_embeddings): A guide by TensorFlow on word embeddings
+- [FAISS: A Library for Efficient Similarity Search](https://github.com/facebookresearch/faiss): Facebook AI's vector similarity search library
+- [Chroma: The AI-native open-source embedding database](https://www.trychroma.com/): An embedding database designed for AI applications
+### Natural Language Processing
+- [NLTK (Natural Language Toolkit)](https://www.nltk.org/): A leading platform for building Python programs to work with human language data
+- [spaCy](https://spacy.io/): Industrial-strength Natural Language Processing in Python
+- [Hugging Face Transformers](https://huggingface.co/transformers/): State-of-the-art Natural Language Processing for PyTorch and TensorFlow 2.0
+### Retrieval-Augmented Generation (RAG)
+- [LangChain](https://python.langchain.com/docs/get_started/introduction): A framework for developing applications powered by language models
+- [OpenAI's RAG Tutorial](https://platform.openai.com/docs/tutorials/web-qa-embeddings): A guide on building a QA system with embeddings
+### German Language Processing
+- [Kölner Phonetik](https://en.wikipedia.org/wiki/Cologne_phonetics): Information about the Kölner Phonetik algorithm
+- [German NLP Resources](https://github.com/adbar/German-NLP): A curated list of open-access resources for German NLP
+### Benchmarks and Evaluation
+- [MTEB Leaderboard](https://huggingface.co/spaces/mteb/leaderboard): Massive Text Embedding Benchmark leaderboard
+- [GLUE Benchmark](https://gluebenchmark.com/): General Language Understanding Evaluation benchmark
+### Tools and Libraries
+- [Gensim](https://radimrehurek.com/gensim/): Topic modelling for humans
+- [Sentence-Transformers](https://www.sbert.net/): A Python framework for state-of-the-art sentence, text and image embeddings
+This tool empowers you to fine-tune your RAG system for optimal performance. Experiment with different settings, run automated tests, and use insights to create an efficient information retrieval and generation system.
+        """
     iface = gr.TabbedInterface(
+        [iface, gr.Markdown(tutorial_md), gr.Markdown(use_case_md)],
+        ["Embedding Comparison", "Tutorial", "Use Case"]
     )
     iface.launch(share=share)