sethuiyer committed
Commit
330205e
1 Parent(s): 3ff8121

Create README.md

Files changed (1): README.md (+56 -0)
README.md ADDED
---
inference: false
language:
- en
- zh
license:
- cc-by-sa-3.0
- gfdl
library_name: txtai
tags:
- sentence-similarity
---

# Medical txtai embeddings index

This is a [txtai](https://github.com/neuml/txtai) embeddings index built specifically for medical texts, covering a diverse corpus in both English and Chinese.
The index is designed for integration into medical information systems, enabling quick retrieval of relevant clinical information.

## Data Sources

The index covers 434,411 entries from a bilingual (English and Chinese) corpus of clinical texts. The sources are:

- `shibing624/medical`, a dataset featuring a variety of medical scenarios and questions in both English and Chinese, suitable for text generation and medical question-answering systems. It is licensed under Apache 2.0.
- `keivalya/MedQuad-MedicalQnADataset`, offering detailed information on a wide range of health conditions, covering prevention, diagnosis, treatment, and susceptibility.
- `GBaker/MedQA-USMLE-4-options`, a collection of multiple-choice questions based on the USMLE, spanning a broad range of medical topics and scenarios.
- `medalpaca/medical_meadow_medqa`, a question-answering dataset in English and Chinese, containing clinical scenarios and medical queries with multiple-choice answers.
- `medalpaca/medical_meadow_medical_flashcards`, featuring over 34,000 question-and-answer pairs derived from medical flashcards, covering a wide range of medical subjects.

Each of these datasets contributes to the depth and diversity of the medical knowledge captured in the index, making it an effective tool for medical information retrieval and analysis.

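
If you want to inspect the source corpora themselves, they can be pulled from the Hugging Face Hub with the `datasets` library. The sketch below is illustrative only and is not part of this index's build pipeline; split and column names differ per dataset, so check them before use.

```python
from datasets import load_dataset

# Pull two of the source corpora from the Hugging Face Hub for inspection.
# Each returns a DatasetDict; print it to see available splits and columns.
medquad = load_dataset("keivalya/MedQuad-MedicalQnADataset")
medqa = load_dataset("GBaker/MedQA-USMLE-4-options")

print(medquad)
print(medqa)
```
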
## Indexing

The index uses `efederici/multilingual-e5-small-4096` as its embedding model, a transformer with 12 layers and an embedding size of 384 that supports 94 languages.

### Configuration

The embedding model is quantized to 4 bits for size efficiency and encodes texts in batches of 15 for throughput.
The index itself uses a simple NumPy cosine-similarity backend, keeping retrieval straightforward and efficient.

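
A minimal sketch of what a build configuration along these lines could look like in txtai is shown below. The actual settings used to produce `index.tar.gz` are not published in this repository, so the option values here, and especially the quantization setting, are assumptions.

```python
from txtai import Embeddings

# Hypothetical build-time configuration approximating the description above;
# the real settings used for this index are assumptions.
embeddings = Embeddings(
    path="efederici/multilingual-e5-small-4096",  # 12-layer encoder, 384-dim embeddings
    backend="numpy",    # plain NumPy cosine-similarity backend
    encodebatch=15,     # encode 15 texts per batch
    quantize=4,         # 4-bit quantization (assumed option value)
)

# docs would be an iterable of (id, text, tags) tuples or plain strings:
# embeddings.index(docs)
# embeddings.save("index.tar.gz")
```
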
## Usage

1. Load the dataset using the provided JSON file.
2. Initialize and load the embeddings index with txtai:

   ```python
   from txtai import Embeddings

   # Load the prebuilt index from the archive in this repository
   embeddings = Embeddings()
   embeddings.load('index.tar.gz')
   ```
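
Once the index is loaded, it can be queried directly with txtai's `search` API. For example (the query string and result count below are placeholders):

```python
# Return the top 3 matches for a natural-language query.
# Results are (id, score) pairs unless content storage was enabled at build time.
results = embeddings.search("What are the early symptoms of diabetes?", 3)
print(results)
```
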
## Next Steps

1. Document more detailed usage, including using txtai for cross-lingual retrieval between English and Chinese.
2. Build a use case with [CrewAI](https://github.com/joaomdmoura/crewAI) and [Dr.Samantha](https://huggingface.co/sethuiyer/Dr_Samantha_7b_mistral).

## License

This index is available under the Creative Commons Attribution-ShareAlike 3.0 Unported License and the GNU Free Documentation License.