---
license: mit
language:
- en
---

# SkimLit: NLP Model for Medical Abstracts

SkimLit is a natural language processing (NLP) project that aims to make medical abstracts easier to read. It replicates the methodology outlined in the paper "PubMed 200K RCT: a Dataset for Sequential Sentence Classification in Medical Abstracts," using TensorFlow and various deep learning techniques.

# Project Overview

# **`Section 1`**

## Data Collection

- The PubMed 200K RCT dataset is obtained from the author's GitHub repository using the following commands:

```
git clone https://github.com/Franck-Dernoncourt/pubmed-rct
cd pubmed-rct/PubMed_20k_RCT_numbers_replaced_with_at_sign
```

## Data Preprocessing

- Sentences are extracted from the dataset, and numeric labels are assigned for machine learning models.
- Three baseline models are established to set the foundation for more complex models.

## Baseline Model (Model 0)

- A TF-IDF Multinomial Naive Bayes classifier is implemented.
- Classification metrics (accuracy, precision, recall, and F1-score) are used for evaluation.

## Deep Sequence Models

### Model 1: Conv1D with Token Embeddings

- Custom text vectorization and text embedding layers are created.
- Data loading is optimized for efficiency using the TensorFlow `tf.data` API.

### Model 2: Pretrained Token Embeddings

- The Universal Sentence Encoder (USE) from TensorFlow Hub is used for feature extraction.

### Model 3: Conv1D with Character Embeddings

- A character-level tokenizer and embedding are implemented.
- A Conv1D model is constructed using character embeddings.

### Model 4: Hybrid Embedding Layer

- Token and character-level embeddings are combined using `layers.Concatenate`.
- A model is developed to process both types of embeddings and output label probabilities.

### Model 5: Transfer Learning with Positional Embeddings

- Positional embeddings are introduced to give the model information about where each sentence sits in the abstract.
- A tribrid embedding model is created, combining token embeddings, character embeddings, and the `line_number` and `total_lines` positional features.

## Model Evaluation and Comparison

- Models are evaluated on various datasets to compare their performance.

## Save and Load Models

- Models are saved and loaded for future use.

## Model Loading and Evaluation

- Pre-trained models are loaded and evaluated on validation datasets.

## Test Dataset Processing and Prediction

- A test dataset is created, preprocessed, and used for making predictions with the loaded model.

## Enriching Test Dataframe with Predictions

- Predictions and additional columns are added to the test dataframe for analysis.

## Finding Top Wrong Predictions

- The top 100 most inaccurately predicted samples are identified.

## Investigating Top Wrong Predictions

- Detailed information on the top 10 wrong predictions is displayed.

# **`Section 2`**

## Example Abstracts

- Example abstracts are downloaded from a GitHub repository.

## Processing Example Abstracts with spaCy

- spaCy is used to parse sentences from example abstracts.

## One-Hot Encoding and Prediction on Example Abstracts

- Line numbers and total lines are one-hot encoded, and predictions are made using the loaded model.

## Visualizing Predictions on Example Abstracts

- Predicted sequence labels for each line in the abstract are displayed.

# Conclusion

- SkimLit explores NLP techniques for medical abstracts, from baseline models to more sophisticated deep learning architectures. The models are evaluated, compared, and applied to real-world examples, offering insights into their strengths and limitations.
- Feel free to explore the code, experiment with different models, and contribute to the advancement of SkimLit.
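
As a starting point for such experiments, the sentence extraction and numeric labelling described under Data Preprocessing could be sketched roughly as below. This is a minimal, standard-library-only sketch, not the project's actual code. It assumes the dataset's plain-text layout (an `###<id>` line opening each abstract, one `LABEL<TAB>sentence` line per sentence, and a blank line closing the abstract). The function names `parse_abstracts` and `encode_labels`, the `EXAMPLE` text with its made-up IDs and sentences, and the choice to store `total_lines` as the sentence count of the abstract are all illustrative assumptions.

```python
from typing import Dict, List, Tuple


def parse_abstracts(raw_text: str) -> List[Dict[str, object]]:
    """Turn PubMed RCT-style text into one record per sentence.

    Each record keeps the label, the sentence, its position in the
    abstract (line_number) and the abstract's sentence count (total_lines).
    """
    samples: List[Dict[str, object]] = []
    pending: List[str] = []  # sentence lines of the abstract being read

    def flush() -> None:
        # Emit every buffered sentence, then reset the buffer.
        total = len(pending)
        for number, entry in enumerate(pending):
            # Assumes every sentence line contains a tab separator.
            label, text = entry.split("\t", 1)
            samples.append({
                "target": label,
                "text": text,
                "line_number": number,
                "total_lines": total,
            })
        pending.clear()

    for line in raw_text.splitlines():
        line = line.strip()
        if line.startswith("###") or not line:
            # An ID line or a blank line marks an abstract boundary.
            flush()
        else:
            pending.append(line)
    flush()  # handle a final abstract with no trailing blank line
    return samples


def encode_labels(samples: List[Dict[str, object]]) -> Tuple[List[int], List[str]]:
    """Map string labels to integers using alphabetical class order."""
    classes = sorted({str(s["target"]) for s in samples})
    index = {name: i for i, name in enumerate(classes)}
    return [index[str(s["target"])] for s in samples], classes


# Hypothetical two-abstract example; IDs and sentences are made up.
EXAMPLE = (
    "###00000001\n"
    "OBJECTIVE\tTo investigate the efficacy of @ weeks of treatment .\n"
    "METHODS\tA total of @ patients were randomized .\n"
    "RESULTS\tPain scores improved at @ weeks .\n"
    "\n"
    "###00000002\n"
    "BACKGROUND\tLittle is known about this outcome .\n"
    "CONCLUSIONS\tFurther trials are needed .\n"
)

samples = parse_abstracts(EXAMPLE)
label_ids, class_names = encode_labels(samples)
for sample, label_id in zip(samples, label_ids):
    print(label_id, sample["line_number"], sample["total_lines"], sample["text"])
```

The `line_number` and `total_lines` fields produced here are the kind of positional features that Model 5 consumes; in a TensorFlow pipeline they would typically be one-hot encoded before being fed to the model.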