sethuiyer committed
Commit
330205e
1 Parent(s): 3ff8121

Create README.md

Files changed (1): README.md (+56 -0)
README.md ADDED
---
inference: false
language:
- en
- zh
license:
- cc-by-sa-3.0
- gfdl
library_name: txtai
tags:
- sentence-similarity
---

# Medical txtai embeddings index

This is a [txtai](https://github.com/neuml/txtai) embeddings index built specifically for medical texts, covering a diverse corpus in both English and Chinese.
The index is designed for integration into medical information systems, enabling quick retrieval of relevant clinical information.

## Data Sources

The index covers 434,411 entries from a bilingual (English and Chinese) corpus of clinical texts. The sources are:

- `shibing624/medical`, a dataset featuring a variety of medical scenarios and questions in both English and Chinese, suitable for text generation and medical question-answering systems. It is licensed under Apache 2.0.
- `keivalya/MedQuad-MedicalQnADataset`, offering detailed information on a wide range of health conditions, covering prevention, diagnosis, treatment, and susceptibility.
- `GBaker/MedQA-USMLE-4-options`, a collection of multiple-choice questions based on the USMLE, spanning a broad range of medical topics and scenarios.
- `medalpaca/medical_meadow_medqa`, a question-answering dataset in English and Chinese, containing clinical scenarios and medical queries with multiple-choice answers.
- `medalpaca/medical_meadow_medical_flashcards`, featuring over 34,000 question-and-answer pairs derived from medical flashcards, covering a wide range of medical subjects.

Each of these datasets contributes to the depth and diversity of the medical knowledge captured in the index, making it an effective tool for medical information retrieval and analysis.

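
If you want to inspect the source corpora themselves, they can be pulled from the Hugging Face Hub with the `datasets` library. The sketch below is illustrative only and is not part of this index's build pipeline; split and column names differ per dataset, so check them before use.

```python
from datasets import load_dataset

# Pull two of the source corpora from the Hugging Face Hub for inspection.
# Each returns a DatasetDict; print it to see available splits and columns.
medquad = load_dataset("keivalya/MedQuad-MedicalQnADataset")
medqa = load_dataset("GBaker/MedQA-USMLE-4-options")

print(medquad)
print(medqa)
```
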
## Indexing

The index uses `efederici/multilingual-e5-small-4096` as its embedding model, a transformer with 12 layers and an embedding size of 384 that supports 94 languages.

### Configuration

The embedding model is quantized to 4 bits for size efficiency and encodes texts in batches of 15 for throughput.
The index itself uses a simple NumPy cosine-similarity backend, keeping retrieval straightforward and efficient.

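
A minimal sketch of what a build configuration along these lines could look like in txtai is shown below. The actual settings used to produce `index.tar.gz` are not published in this repository, so the option values here, and especially the quantization setting, are assumptions.

```python
from txtai import Embeddings

# Hypothetical build-time configuration approximating the description above;
# the real settings used for this index are assumptions.
embeddings = Embeddings(
    path="efederici/multilingual-e5-small-4096",  # 12-layer encoder, 384-dim embeddings
    backend="numpy",    # plain NumPy cosine-similarity backend
    encodebatch=15,     # encode 15 texts per batch
    quantize=4,         # 4-bit quantization (assumed option value)
)

# docs would be an iterable of (id, text, tags) tuples or plain strings:
# embeddings.index(docs)
# embeddings.save("index.tar.gz")
```
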
## Usage

1. Load the dataset using the provided JSON file.
2. Initialize and load the embeddings index with txtai:

   ```python
   from txtai import Embeddings

   # Load the prebuilt index from the archive in this repository
   embeddings = Embeddings()
   embeddings.load('index.tar.gz')
   ```
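
Once the index is loaded, it can be queried directly with txtai's `search` API. For example (the query string and result count below are placeholders):

```python
# Return the top 3 matches for a natural-language query.
# Results are (id, score) pairs unless content storage was enabled at build time.
results = embeddings.search("What are the early symptoms of diabetes?", 3)
print(results)
```
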
## Next Steps

1. Document more detailed usage, including using txtai for cross-lingual retrieval between English and Chinese.
2. Build a use case with [CrewAI](https://github.com/joaomdmoura/crewAI) and [Dr.Samantha](https://huggingface.co/sethuiyer/Dr_Samantha_7b_mistral).

## License

This index is available under the Creative Commons Attribution-ShareAlike 3.0 Unported License and the GNU Free Documentation License.