Dhrumit1314 committed on
Commit
aba8919
1 Parent(s): ed00d9b

Update README.md

Files changed (1)
  1. README.md +81 -0
README.md CHANGED
@@ -1,3 +1,84 @@
  ---
  license: mit
  ---
+
+ # SkimLit: NLP Model for Medical Abstracts
+ SkimLit is a natural language processing (NLP) project that aims to make medical abstracts easier to read by classifying each sentence according to its role in the abstract. It replicates the methodology of the paper "PubMed 200k RCT: a Dataset for Sequential Sentence Classification in Medical Abstracts," using TensorFlow and several deep learning techniques.
+
+ # Project Overview
+
+ # **`Section 1`**
+
+ ## Data Collection
+ - The PubMed RCT dataset is cloned from the author's GitHub repository, and the 20k subset (numbers replaced with `@`) is selected:
+ ```
+ git clone https://github.com/Franck-Dernoncourt/pubmed-rct
+ cd pubmed-rct/PubMed_20k_RCT_numbers_replaced_with_at_sign
+ ```
+
+ ## Data Preprocessing
+ - Sentences are extracted from the dataset, and class labels are encoded numerically for use by the machine learning models.
+ - Three baseline models are established to set the foundation for more complex models.
+
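The extraction step can be sketched in plain Python. This is a minimal illustration, not the project's actual code: it assumes the raw files mark each abstract with a `###<id>` line, put one tab-separated `LABEL<TAB>sentence` pair per line, and separate abstracts with a blank line. The function name `parse_abstracts` and the dictionary keys are hypothetical, and `total_lines` is taken here to be the number of sentences in the abstract.

```python
def parse_abstracts(lines):
    """Turn raw dataset lines into one dict per sentence (assumed file layout)."""
    samples, current = [], []  # current holds (label, text) pairs of one abstract

    def flush():
        # Emit every sentence of the abstract read so far, with its position.
        for i, (label, text) in enumerate(current):
            samples.append({
                "target": label,
                "text": text,
                "line_number": i,
                "total_lines": len(current),
            })
        current.clear()

    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("###"):
            flush()  # a blank line or a new abstract id closes the previous abstract
        else:
            label, text = line.split("\t", 1)
            current.append((label, text))
    flush()  # handle a final abstract with no trailing blank line
    return samples


# Tiny made-up input in the assumed format (not real dataset content).
example_lines = [
    "###1\n",
    "OBJECTIVE\tTo test a drug .\n",
    "METHODS\tWe enrolled @ patients .\n",
    "\n",
    "###2\n",
    "RESULTS\tPain decreased .\n",
]
samples = parse_abstracts(example_lines)

# Map each class name to an integer id, in alphabetical order.
label_to_id = {name: i for i, name in enumerate(sorted({s["target"] for s in samples}))}
numeric_labels = [label_to_id[s["target"]] for s in samples]
```

Keeping `line_number` and `total_lines` at this stage is convenient because the later positional-embedding model needs both.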
+ ## Baseline Model (Model 0)
+ - A TF-IDF Multinomial Naive Bayes classifier is implemented.
+ - It is evaluated with standard classification metrics: accuracy, precision, recall, and F1-score.
+
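For reference, the four metrics can be computed from scratch as below. This is a sketch under one assumption the README does not state: precision, recall, and F1 are averaged over classes weighted by class support. The helper name `classification_scores` is hypothetical; in practice a library routine would normally be used instead.

```python
from collections import Counter


def classification_scores(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall and F1 (assumed averaging)."""
    total = len(y_true)
    support = Counter(y_true)      # true samples per class
    predicted = Counter(y_pred)    # predictions per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    precision = recall = f1 = 0.0
    for label, n in support.items():
        p = correct[label] / predicted[label] if predicted[label] else 0.0
        r = correct[label] / n
        f = 2 * p * r / (p + r) if p + r else 0.0
        weight = n / total         # weight each class by its share of the data
        precision += weight * p
        recall += weight * r
        f1 += weight * f

    return {
        "accuracy": sum(correct.values()) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up labels for illustration only.
scores = classification_scores([0, 0, 1, 1], [0, 1, 1, 1])
```

With support weighting, recall always equals accuracy, which is a quick sanity check on the output.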
+ ## Deep Sequence Models
+ ### Model 1: Conv1D with Token Embeddings
+ - Custom text vectorization and token embedding layers are created.
+ - Input pipelines are built with the TensorFlow `tf.data` API for faster data loading.
+
+ ### Model 2: Pretrained Token Embeddings
+ - The Universal Sentence Encoder (USE) from TensorFlow Hub is used for feature extraction.
+
+ ### Model 3: Conv1D with Character Embeddings
+ - A character-level tokenizer and embedding layer are implemented.
+ - A Conv1D model is built on top of the character embeddings.
+
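What character-level tokenization does can be shown without any framework. The sketch below is a plain-Python stand-in for a character vectorization layer, not the project's code: the helper names, the reserved ids (0 for padding, 1 for unknown characters), the lowercasing, and the dropping of whitespace are all assumptions made for illustration.

```python
from collections import Counter


def build_char_vocab(sentences, oov_token="[UNK]"):
    """Map each character to an integer id, most frequent characters first."""
    counts = Counter(ch for s in sentences for ch in s.lower() if not ch.isspace())
    # Id 0 is reserved for padding and id 1 for out-of-vocabulary characters.
    vocab = ["", oov_token] + [ch for ch, _ in counts.most_common()]
    return {ch: i for i, ch in enumerate(vocab)}


def encode_chars(text, char_to_id, max_len):
    """Convert text to a fixed-length list of character ids."""
    ids = [char_to_id.get(ch, 1) for ch in text.lower() if not ch.isspace()]
    ids = ids[:max_len]                        # truncate long sentences
    return ids + [0] * (max_len - len(ids))    # pad short ones with zeros


char_to_id = build_char_vocab(["aab"])
encoded = encode_chars("abz", char_to_id, max_len=5)
```

The resulting id sequences are what an embedding layer would turn into dense vectors before the Conv1D layer.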
+ ### Model 4: Hybrid Embedding Layer
+ - Token and character-level embeddings are combined using `layers.Concatenate`.
+ - A model is developed to process both types of embeddings and output label probabilities.
+
+ ### Model 5: Transfer Learning with Positional Embeddings
+ - Positional embeddings are introduced so that the model can use each sentence's position within the abstract.
+ - A tribrid embedding model is created, combining token, character, `line_number`, and `total_lines` features.
+
+ ## Model Evaluation and Comparison
+ - Models are evaluated on various datasets to compare their performance.
+
+ ## Save and Load Models
+ - Models are saved and loaded for future use.
+
+ ## Model Loading and Evaluation
+ - Pre-trained models are loaded and evaluated on validation datasets.
+
+ ## Test Dataset Processing and Prediction
+ - A test dataset is created, preprocessed, and used for making predictions with the loaded model.
+
+ ## Enriching Test Dataframe with Predictions
+ - Predictions and additional columns are added to the test dataframe for analysis.
+
+ ## Finding Top Wrong Predictions
+ - The top 100 most inaccurately predicted samples are identified.
+
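One common reading of "most inaccurately predicted" is: predictions that are wrong and were made with the highest confidence. The sketch below follows that reading, which is an assumption, and uses plain dictionaries in place of dataframe rows. The keys `target`, `prediction`, and `pred_prob` and the function name are hypothetical.

```python
def top_wrong_predictions(rows, n=100):
    """Return the n incorrect rows with the highest predicted probability."""
    wrong = [row for row in rows if row["target"] != row["prediction"]]
    # Most confident mistakes first.
    return sorted(wrong, key=lambda row: row["pred_prob"], reverse=True)[:n]


# Made-up rows for illustration only.
rows = [
    {"text": "s1", "target": "METHODS", "prediction": "METHODS", "pred_prob": 0.99},
    {"text": "s2", "target": "RESULTS", "prediction": "METHODS", "pred_prob": 0.91},
    {"text": "s3", "target": "BACKGROUND", "prediction": "OBJECTIVE", "pred_prob": 0.97},
    {"text": "s4", "target": "OBJECTIVE", "prediction": "BACKGROUND", "pred_prob": 0.55},
]
worst = top_wrong_predictions(rows, n=2)
```

Confident mistakes are often the most informative to inspect, because they tend to point at mislabeled samples or systematic confusion between classes.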
+ ## Investigating Top Wrong Predictions
+ - Detailed information on the top 10 wrong predictions is displayed.
+
+ # **`Section 2`**
+ ## Example Abstracts
+ - Example abstracts are downloaded from a GitHub repository.
+
+ ## Processing Example Abstracts with spaCy
+ - spaCy is used to parse sentences from example abstracts.
+
+ ## One-Hot Encoding and Prediction on Example Abstracts
+ - Line numbers and total lines are one-hot encoded, and predictions are made using the loaded model.
+
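The positional encoding can be sketched in plain Python as follows. The depths 15 and 20 are hypothetical placeholders; the real values must match whatever the loaded model was trained with, as must the convention for `total_lines` (taken here to be the sentence count). Indices outside the depth produce an all-zero vector, similar to how `tf.one_hot` treats out-of-range indices.

```python
def one_hot(index, depth):
    """Return a one-hot list; an index outside [0, depth) yields all zeros."""
    return [1.0 if i == index else 0.0 for i in range(depth)]


def encode_positions(sentences, line_depth=15, total_depth=20):
    """One-hot encode each sentence's line number and the abstract's length."""
    total = len(sentences)  # assumed convention: total_lines = number of sentences
    line_numbers = [one_hot(i, line_depth) for i in range(total)]
    total_lines = [one_hot(total, total_depth) for _ in sentences]
    return line_numbers, total_lines


# Made-up sentences standing in for a parsed example abstract.
example = ["Background sentence .", "Methods sentence .", "Results sentence ."]
line_numbers_one_hot, total_lines_one_hot = encode_positions(example)
```

Both outputs have one row per sentence, so they can be passed to the model alongside the token and character inputs.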
+ ## Visualizing Predictions on Example Abstracts
+ - Predicted sequence labels for each line in the abstract are displayed.
+
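A display step along these lines turns per-sentence probabilities into readable output. It is a sketch only: the class names are the five labels used by the PubMed RCT dataset, but their order here is an assumption and must match the label encoding used during training. The function name and the probability values are made up.

```python
# Assumed class order (alphabetical); it must match the training label encoding.
CLASS_NAMES = ["BACKGROUND", "CONCLUSIONS", "METHODS", "OBJECTIVE", "RESULTS"]


def label_lines(sentences, probabilities, class_names=CLASS_NAMES):
    """Prefix each sentence with the name of its most probable class."""
    labeled = []
    for sentence, probs in zip(sentences, probabilities):
        best = max(range(len(probs)), key=probs.__getitem__)  # argmax over classes
        labeled.append(f"{class_names[best]}: {sentence}")
    return labeled


# Hypothetical model outputs for two sentences.
sentences = ["We enrolled @ patients .", "Pain decreased ."]
probabilities = [
    [0.05, 0.05, 0.80, 0.05, 0.05],
    [0.02, 0.10, 0.08, 0.00, 0.80],
]
for line in label_lines(sentences, probabilities):
    print(line)
```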
+ # Conclusion
+ - SkimLit explores NLP techniques for medical abstracts, from baseline models to deeper learning architectures. The models are evaluated, compared, and applied to real-world examples, offering insights into their strengths and limitations.
+
+ - Feel free to explore the code, experiment with different models, and contribute to the advancement of SkimLit.