Chris4K committed on
Commit
f0f9414
1 Parent(s): 586cc45

Update app.py

Files changed (1)
  1. app.py +315 -15
app.py CHANGED
@@ -686,7 +686,7 @@ def launch_interface(share=True):
686
  use_reranking_input = gr.Checkbox(label="Use Reranking", value=False)
687
 
688
  ####
689
- with gr.Tab("Automated"):
690
  auto_file_input = gr.File(label="Upload File (Optional)")
691
  auto_query_input = gr.Textbox(label="Search Query")
692
  auto_model_types = gr.CheckboxGroup(
@@ -750,27 +750,327 @@ def launch_interface(share=True):
750
  ###
751
 
752
 
753
- tutorial_md = """
754
- # Advanced Embedding Comparison Tool Tutorial
755
 
756
- This tool allows you to compare different embedding models and retrieval strategies for document search and similarity matching.
757
 
758
- ## How to use:
759
 
760
- 1. Upload a file (optional) or use the default files in the system.
761
- 2. Enter a search query.
762
- 3. Enter embedding models as a comma-separated list (e.g., HuggingFace:paraphrase-miniLM,OpenAI:text-embedding-ada-002).
763
- 4. Set the number of top results to retrieve.
764
- 5. Optionally, specify advanced settings such as custom embedding models, text splitting strategies, and vector store types.
765
- 6. Choose whether to use optional features like vocabulary optimization, query optimization, or result reranking.
766
- 7. If you have a custom tokenizer, upload the file and specify its attributes.
767
 
768
- The tool will process your query and display results, statistics, and visualizations to help you compare the performance of different models and strategies.
769
  """
770
 
771
  iface = gr.TabbedInterface(
772
- [iface, gr.Markdown(tutorial_md)],
773
- ["Embedding Comparison", "Tutorial"]
774
  )
775
 
776
  iface.launch(share=share)
 
686
  use_reranking_input = gr.Checkbox(label="Use Reranking", value=False)
687
 
688
  ####
689
+ with gr.Tab("Automation"):
690
  auto_file_input = gr.File(label="Upload File (Optional)")
691
  auto_query_input = gr.Textbox(label="Search Query")
692
  auto_model_types = gr.CheckboxGroup(
 
750
  ###
751
 
752
 
753
+ use_case_md = """
754
+ # 🚀 AI Act Embedding Use Case Guide
755
+
756
+ ## 📚 Use Case: Embedding the German AI Act for Local Chat Retrieval
757
+
758
+ In this guide, we'll walk through the process of embedding the German version of the AI Act using our advanced embedding tool and MTEB. We'll then use these embeddings in a local chat application as a retriever/context.
759
+
760
+ ### Step 1: Prepare the Document 📄
761
+
762
+ 1. Download the German version of the AI Act (let's call it `ai_act_de.txt`).
763
+ 2. Place the file in your project directory.
764
+
765
+ ### Step 2: Set Up the Embedding Tool 🛠️
766
+
767
+ 1. Open the Embedding Comparison Tool.
768
+ 2. Navigate to the new "Automation" tab.
769
+
770
+ ### Step 3: Configure the Automated Test 🔧
771
+
772
+ In the "Automation" tab, set up the following configuration:
773
+
774
+ ```markdown
775
+ - File: ai_act_de.txt
776
+ - Query: "Wie definiert das Gesetz KI-Systeme?"
777
+ - Model Types: ✅ HuggingFace, ✅ Sentence Transformers
778
+ - Model Names: paraphrase-multilingual-MiniLM-L12-v2, distiluse-base-multilingual-cased-v2
779
+ - Split Strategies: ✅ recursive, ✅ token
780
+ - Chunk Sizes: 256, 512, 1024
781
+ - Overlap Sizes: 32, 64, 128
782
+ - Vector Store Types: ✅ FAISS
783
+ - Search Types: ✅ similarity, ✅ mmr
784
+ - Top K Values: 3, 5, 7
785
+ - Test Vocabulary Optimization: ✅
786
+ - Test Query Optimization: ✅
787
+ - Test Reranking: ✅
788
+ ```
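To get a feel for how many runs this configuration implies, the sketch below expands the listed values into individual test configurations. It assumes the automated test crosses every parameter with every other one, which the guide does not confirm; `grid` and `expand_grid` are hypothetical names used only for illustration, and the three optimization toggles are left out.

```python
from itertools import product

# Values copied from the Step 3 configuration; the full cross product
# is an assumption about how the automated test enumerates runs.
grid = {
    "model_name": [
        "paraphrase-multilingual-MiniLM-L12-v2",
        "distiluse-base-multilingual-cased-v2",
    ],
    "split_strategy": ["recursive", "token"],
    "chunk_size": [256, 512, 1024],
    "overlap_size": [32, 64, 128],
    "vector_store_type": ["FAISS"],
    "search_type": ["similarity", "mmr"],
    "top_k": [3, 5, 7],
}


def expand_grid(grid):
    # Build one dict per combination of parameter values.
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*grid.values())]


configs = expand_grid(grid)
print(len(configs))  # 2 * 2 * 3 * 3 * 1 * 2 * 3 = 216
```

Under that assumption a single query already produces 216 runs, so trimming the chunk and overlap lists is the quickest way to shorten a test.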
789
+
790
+ ### Step 4: Run the Automated Test 🏃‍♂️
791
+
792
+ Click the "Run Automated Tests" button and wait for the results.
793
+
794
+ ### Step 5: Analyze the Results 📊
795
+
796
+ Let's say we got the following simulated results:
797
 
798
+ ```markdown
799
+ Best Model: Sentence Transformers - paraphrase-multilingual-MiniLM-L12-v2
800
+ Best Settings:
801
+ - Split Strategy: recursive
802
+ - Chunk Size: 512
803
+ - Overlap Size: 64
804
+ - Vector Store Type: FAISS
805
+ - Search Type: mmr
806
+ - Top K: 5
807
+ - Optimize Vocabulary: True
808
+ - Use Query Optimization: True
809
+ - Use Reranking: True
810
 
811
+ Performance Summary:
812
+ - Search Time: 0.15s
813
+ - Result Diversity: 0.82
814
+ - Rank Correlation: 0.91
815
+ - Silhouette Score: 0.76
816
+ ```
817
 
818
+ ### Step 6: Understand the Results 🧠
819
 
820
+ 1. **Model**: The Sentence Transformers model performed better, likely due to its multilingual capabilities and fine-tuning for paraphrasing tasks.
821
+
822
+ 2. **Split Strategy**: Recursive splitting worked best, probably because it respects the document's structure better than fixed-length token splitting.
823
+
824
+ 3. **Chunk Size**: 512 tokens provide a good balance between context and specificity.
825
+
826
+ 4. **Search Type**: MMR (Maximum Marginal Relevance) outperformed simple similarity search, likely due to its ability to balance relevance and diversity in results.
827
+
828
+ 5. **Optimizations**: All optimizations (vocabulary, query, and reranking) proved beneficial, indicating that the extra processing time is worth the improved results.
829
+
830
+ ### Step 7: Implement in Local Chat 💬
831
+
832
+ Now that we have the optimal settings, let's implement this in a local chat application:
833
+
834
+ 1. Use the `paraphrase-multilingual-MiniLM-L12-v2` model for embeddings.
835
+ 2. Set up a FAISS vector store with the embedded chunks.
836
+ 3. Implement MMR search with a top-k of 5.
837
+ 4. Include the optimization steps in your pipeline.
838
+
839
+ ### Step 8: Test the Implementation 🧪
840
+
841
+ Create a simple chat interface and test with various queries about the AI Act. For example:
842
+
843
+ User: "Was sind die Hauptziele des KI-Gesetzes?"
844
  """
845
 
846
+
847
+ tutorial_md = """
848
+ # Advanced Embedding Comparison Tool Tutorial
849
+
850
+ Welcome to the **Advanced Embedding Comparison Tool**! This comprehensive guide will help you understand and utilize the tool's features to optimize your **Retrieval-Augmented Generation (RAG)** systems.
851
+
852
+ ## Table of Contents
853
+ 1. [Introduction to RAG](#introduction-to-rag)
854
+ 2. [Key Components of RAG](#key-components-of-rag)
855
+ 3. [Impact of Parameter Changes](#impact-of-parameter-changes)
856
+ 4. [Advanced Features](#advanced-features)
857
+ 5. [Using the Embedding Comparison Tool](#using-the-embedding-comparison-tool)
858
+ 6. [Automated Testing and Analysis](#automated-testing-and-analysis)
859
+ 7. [Mathematical Concepts and Metrics](#mathematical-concepts-and-metrics)
860
+ 8. [Code Examples](#code-examples)
861
+ 9. [Best Practices and Tips](#best-practices-and-tips)
862
+ 10. [Resources and Further Reading](#resources-and-further-reading)
863
+
864
+ ---
865
+
866
+ ## Introduction to RAG
867
+
868
+ **Retrieval-Augmented Generation (RAG)** is a powerful technique that combines the strengths of large language models (LLMs) with the ability to access and use external knowledge. RAG is particularly useful for:
869
+
870
+ - Providing up-to-date information
871
+ - Answering questions based on specific documents or data sources
872
+ - Reducing hallucinations in AI responses
873
+ - Customizing AI outputs for specific domains or use cases
874
+
875
+ RAG is ideal for applications requiring accurate, context-specific information retrieval combined with natural language generation, such as chatbots, question-answering systems, and document analysis tools.
876
+
877
+ ---
878
+
879
+ ## Key Components of RAG
880
+
881
+ ### 1. Document Loading
882
+ Ingests documents from various sources (PDFs, web pages, databases, etc.) into a format that can be processed by the RAG system. The tool supports multiple file formats, including PDF, DOCX, and TXT.
883
+
884
+ ### 2. Document Splitting
885
+ Splits large documents into smaller chunks for more efficient processing and retrieval. Available strategies include:
886
+ - **Token-based splitting**
887
+ - **Recursive splitting**
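The idea behind recursive splitting can be sketched in a few lines: try the coarsest separator first and only fall back to finer ones for pieces that are still too long. This is a simplified illustration, not the tool's splitter; `recursive_split` is a hypothetical helper, it measures size in characters rather than tokens, and it ignores overlap.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ")):
    # Base case: the text already fits into one chunk.
    if len(text) <= chunk_size:
        return [text] if text else []
    # No separators left: fall back to hard cuts of chunk_size characters.
    if not separators:
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            # Keep merging pieces while they fit.
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: split it with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


sample = "Artikel 1\n\nDieses Gesetz gilt.\n\nArtikel 2"
print(recursive_split(sample, 20))
```

With a chunk size of 20 characters the sample splits on the paragraph breaks, so each heading and sentence stays whole.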
888
+
889
+ ### 3. Vector Store and Embeddings
890
+ Embeddings are dense vector representations of text that capture semantic meaning. The tool supports multiple embedding models and vector stores:
891
+ - **Embedding models**: HuggingFace, OpenAI, Cohere, and custom models.
892
+ - **Vector stores**: FAISS and Chroma.
893
+
894
+ ### 4. Retrieval
895
+ Finds the most relevant documents or chunks based on a query. Available retrieval methods include:
896
+ - **Similarity search**
897
+ - **Maximum Marginal Relevance (MMR)**
898
+ - **Custom search methods**
899
+
900
+ ---
901
+
902
+ ## Impact of Parameter Changes
903
+
904
+ Understanding how different parameters affect your RAG system is crucial for optimization:
905
+
906
+ - **Chunk Size**: Larger chunks provide more context but may reduce precision. Smaller chunks increase precision but may lose context.
907
+ - **Overlap**: More overlap helps maintain context between chunks but increases computational load.
908
+ - **Embedding Model**: Performance varies across languages and domains.
909
+ - **Vector Store**: Affects query speed and the types of searches.
910
+ - **Retrieval Method**: Influences the diversity and relevance of retrieved documents.
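The chunk size and overlap trade-off is easy to see with a sliding window over a token list. The sketch below is only an illustration under simple assumptions: `chunk_with_overlap` is a hypothetical helper, and the tokens are plain list items rather than the output of a real tokenizer.

```python
def chunk_with_overlap(tokens, chunk_size, overlap):
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    # Each new chunk starts (chunk_size - overlap) tokens after the previous one.
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a chunk reaches the end of the input.
        if start + chunk_size >= len(tokens):
            break
    return chunks


tokens = list(range(10))
print(chunk_with_overlap(tokens, 4, 1))       # 3 chunks
print(len(chunk_with_overlap(tokens, 4, 2)))  # 4 chunks
```

Raising the overlap from 1 to 2 turns three chunks into four for the same input, which is the extra embedding and storage cost mentioned in the list.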
911
+
912
+ ---
913
+
914
+ ## Advanced Features
915
+
916
+ ### 1. Custom Tokenization
917
+ Upload a custom tokenizer file and specify the tokenizer model, vocabulary size, and special tokens for domain or language-specific tokenization.
918
+
919
+ ### 2. Query Optimization
920
+ Improve search results by generating multiple variations of the input query using a language model to capture different phrasings.
921
+
922
+ ### 3. Reranking
923
+ Further refine search results by using a separate model to re-score and reorder the initial retrieval results.
924
+
925
+ ### 4. Phonetic Matching
926
+ For languages like German, phonetic matching with adjustable weighting is available.
927
+
928
+ ### 5. Vocabulary Optimization
929
+ Optimize vocabulary for domain-specific applications during the embedding process.
930
+
931
+ ---
932
+
933
+ ## Using the Embedding Comparison Tool
934
+
935
+ The tool is divided into several tabs for ease of use:
936
+
937
+ ### Simple Tab
938
+ 1. **File Upload**: Upload a file (PDF, DOCX, or TXT) or use files from the `./files` directory.
939
+ 2. **Search Query**: Enter the search query.
940
+ 3. **Embedding Models**: Select one or more embedding models to compare.
941
+ 4. **Top K**: Set the number of top results to retrieve (1-10).
942
+
943
+ ### Advanced Tab
944
+ 5. **Custom Embedding Model**: Specify a custom embedding model.
945
+ 6. **Split Strategy**: Choose between 'token' and 'recursive' splitting.
946
+ 7. **Chunk Size**: Set chunk size (100-1000).
947
+ 8. **Overlap Size**: Set overlap between chunks (0-100).
948
+ 9. **Custom Split Separators**: Enter custom separators for text splitting.
949
+ 10. **Vector Store Type**: Choose between FAISS and Chroma.
950
+ 11. **Search Type**: Select 'similarity', 'mmr', or 'custom'.
951
+ 12. **Language**: Specify the document's primary language.
952
+
953
+ ### Optional Tab
954
+ 13. **Text Preprocessing**: Toggle text preprocessing.
955
+ 14. **Vocabulary Optimization**: Enable vocabulary optimization.
956
+ 15. **Phonetic Matching**: Enable phonetic matching and set its weight.
957
+ 16. **Custom Tokenizer**: Upload a custom tokenizer and specify parameters.
958
+ 17. **Query Optimization**: Enable query optimization and specify the model.
959
+ 18. **Reranking**: Enable result reranking.
960
+
961
+ ---
962
+
963
+ ## Automated Testing and Analysis
964
+
965
+ The **Automation tab** allows you to run comprehensive tests across multiple configurations:
966
+
967
+ 1. Set up test parameters like model types, split strategies, chunk sizes, etc.
968
+ 2. Click "Run Automated Tests."
969
+ 3. View results, statistics, and recommendations to find optimal configurations for your use case.
970
+
971
+ ---
972
+
973
+ ## Mathematical Concepts and Metrics
974
+
975
+ ### Cosine Similarity
976
+ Measures the cosine of the angle between two vectors, used in similarity search:
977
+ $$\text{cosine similarity} = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \|\mathbf{B}\|}$$
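As a minimal sketch, the formula translates directly into plain Python. The function name `cosine_similarity` is chosen here for illustration; a real pipeline would more likely rely on the vector store's built-in implementation.

```python
import math


def cosine_similarity(a, b):
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


print(cosine_similarity([1, 2, 3], [2, 4, 6]))  # parallel vectors
print(cosine_similarity([1, 0], [0, 1]))        # orthogonal vectors
```

Parallel vectors score close to 1 regardless of their length, and orthogonal vectors score 0.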
978
+
979
+ ### Maximum Marginal Relevance (MMR)
980
+ Balances relevance and diversity in search results:
981
+ $$\text{MMR} = \arg\max_{D_i \in R \setminus S} [\lambda \text{Sim}_1(D_i, Q) - (1-\lambda) \max_{D_j \in S} \text{Sim}_2(D_i, D_j)]$$
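The selection rule can be sketched as a greedy loop that repeatedly picks the candidate with the highest MMR value. This is an illustration under stated assumptions: both Sim_1 and Sim_2 are taken to be cosine similarity, `lam` stands for λ, and `mmr_select` is a hypothetical name, not the tool's or the vector store's API.

```python
import math


def _cos(a, b):
    # Cosine similarity for non-zero vectors of equal length.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr_select(query_vec, doc_vecs, k, lam=0.5):
    selected = []                          # S: indices already chosen
    remaining = list(range(len(doc_vecs)))  # R \ S: candidates still available

    while remaining and len(selected) < k:
        def score(i):
            relevance = _cos(doc_vecs[i], query_vec)
            # Redundancy is the highest similarity to any document already selected.
            redundancy = max((_cos(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 0.0]
docs = [[1.0, 0.1], [1.0, 0.12], [0.7, -0.7]]
print(mmr_select(query, docs, k=2, lam=0.5))
print(mmr_select(query, docs, k=2, lam=1.0))
```

With `lam=0.5` the second pick is the less similar third document rather than the near-duplicate of the first, while `lam=1.0` reduces to plain similarity ranking.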
982
+
983
+ ### Silhouette Score
984
+ Measures how well an object fits within its own cluster compared to others. Scores range from -1 to 1, where higher values indicate better-defined clusters.
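For a single point, the score is (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the lowest mean distance to the points of any other cluster. The sketch below averages this over all points using Euclidean distance; `silhouette_score` is written here for illustration only and is an assumption about the metric, not a copy of how the tool computes it.

```python
import math


def silhouette_score(points, labels):
    # Group points by cluster label.
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("silhouette score needs at least two clusters")

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            # By convention a singleton cluster contributes 0.
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(math.dist(point, q) for q in own) / (len(own) - 1)
        # b: mean distance to the nearest other cluster.
        b = min(
            sum(math.dist(point, q) for q in other) / len(other)
            for other_label, other in clusters.items()
            if other_label != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


points = [(0, 0), (0, 1), (10, 0), (10, 1)]
print(silhouette_score(points, [0, 0, 1, 1]))  # well-separated clusters
print(silhouette_score(points, [0, 1, 0, 1]))  # poorly chosen clusters
```

The first labeling groups nearby points and scores close to 1; the second mixes the two groups and scores below 0.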
985
+
986
+ ---
987
+
988
+ ## Code Examples
989
+
990
+ ### Custom Tokenization
+ ```python
+ from tokenizers import Tokenizer, models, trainers
+ from tokenizers.pre_tokenizers import Whitespace
+
+ def create_custom_tokenizer(file_path, model_type='WordLevel', vocab_size=10000, special_tokens=None):
+     # Read the training corpus from disk.
+     with open(file_path, 'r', encoding='utf-8') as f:
+         text = f.read()
+
+     special_tokens = special_tokens or ["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]
+
+     # Pair each model with its matching trainer (WordLevel by default, BPE otherwise).
+     if model_type == 'WordLevel':
+         tokenizer = Tokenizer(models.WordLevel(unk_token="[UNK]"))
+         trainer = trainers.WordLevelTrainer(special_tokens=special_tokens, vocab_size=vocab_size)
+     else:
+         tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
+         trainer = trainers.BpeTrainer(special_tokens=special_tokens, vocab_size=vocab_size)
+     tokenizer.pre_tokenizer = Whitespace()
+
+     tokenizer.train_from_iterator([text], trainer)
+     return tokenizer
+ ```
1004
+
1005
+ ### Query Optimization
+ ```python
+ from langchain.retrievers.multi_query import MultiQueryRetriever
+
+ def optimize_query(query, llm):
+     # vector_store, search_type, search_kwargs, and get_retriever are
+     # assumed to be available in the enclosing scope.
+     multi_query_retriever = MultiQueryRetriever.from_llm(
+         retriever=get_retriever(vector_store, search_type, search_kwargs),
+         llm=llm
+     )
+     optimized_queries = multi_query_retriever.generate_queries(query)
+     return optimized_queries
+ ```
1015
+
1016
+ ### Reranking
+ ```python
+ def rerank_results(results, query, reranker):
+     reranked_results = reranker.rerank(query, [doc.page_content for doc in results])
+     return reranked_results
+ ```
1022
+
1023
+ ## Best Practices and Tips
1024
+
1025
+ - Start Simple: Begin with basic configurations, then gradually add complexity.
1026
+ - Benchmark: Use automated testing to benchmark different setups.
1027
+ - Domain-Specific Tuning: Consider custom tokenizers and embeddings for specialized domains.
1028
+ - Balance Performance and Cost: Use advanced features like query optimization and reranking judiciously.
1029
+ - Iterate: Optimization is an iterative process—refine your approach based on tool insights.
1030
+
1031
+
1032
+ ## Resources and Further Reading
1033
+
1034
+ Here are some valuable resources to help you better understand and work with embeddings, retrieval systems, and natural language processing:
1035
+
1036
+ ### Embeddings and Vector Databases
1037
+ - [Understanding Embeddings](https://www.tensorflow.org/text/guide/word_embeddings): A guide by TensorFlow on word embeddings
1038
+ - [FAISS: A Library for Efficient Similarity Search](https://github.com/facebookresearch/faiss): Facebook AI's vector similarity search library
1039
+ - [Chroma: The AI-native open-source embedding database](https://www.trychroma.com/): An embedding database designed for AI applications
1040
+
1041
+ ### Natural Language Processing
1042
+ - [NLTK (Natural Language Toolkit)](https://www.nltk.org/): A leading platform for building Python programs to work with human language data
1043
+ - [spaCy](https://spacy.io/): Industrial-strength Natural Language Processing in Python
1044
+ - [Hugging Face Transformers](https://huggingface.co/transformers/): State-of-the-art Natural Language Processing for PyTorch and TensorFlow 2.0
1045
+
1046
+ ### Retrieval-Augmented Generation (RAG)
1047
+ - [LangChain](https://python.langchain.com/docs/get_started/introduction): A framework for developing applications powered by language models
1048
+ - [OpenAI's RAG Tutorial](https://platform.openai.com/docs/tutorials/web-qa-embeddings): A guide on building a QA system with embeddings
1049
+
1050
+ ### German Language Processing
1051
+ - [Kölner Phonetik](https://en.wikipedia.org/wiki/Cologne_phonetics): Information about the Kölner Phonetik algorithm
1052
+ - [German NLP Resources](https://github.com/adbar/German-NLP): A curated list of open-access resources for German NLP
1053
+
1054
+ ### Benchmarks and Evaluation
1055
+ - [MTEB Leaderboard](https://huggingface.co/spaces/mteb/leaderboard): Massive Text Embedding Benchmark leaderboard
1056
+ - [GLUE Benchmark](https://gluebenchmark.com/): General Language Understanding Evaluation benchmark
1057
+
1058
+ ### Tools and Libraries
1059
+ - [Gensim](https://radimrehurek.com/gensim/): Topic modelling for humans
1060
+ - [Sentence-Transformers](https://www.sbert.net/): A Python framework for state-of-the-art sentence, text and image embeddings
1061
+
1062
+
1063
+
1064
+ This tool empowers you to fine-tune your RAG system for optimal performance. Experiment with different settings, run automated tests, and use insights to create an efficient information retrieval and generation system.
1065
+
1066
+
1067
+
1068
+ """
1069
+
1070
+
1071
  iface = gr.TabbedInterface(
1072
+ [iface, gr.Markdown(tutorial_md), gr.Markdown(use_case_md)],
1073
+ ["Embedding Comparison", "Tutorial", "Use Case"]
1074
  )
1075
 
1076
  iface.launch(share=share)