---
title: NutriGenMe PaperExtractor
emoji: 📄
colorFrom: green
colorTo: blue
sdk: docker
pinned: false
license: apache-2.0
app_port: 8501
---

# NutriGenMe Paper Extractor

## Overview
The NutriGenMe Paper Extractor extracts relevant information from genomic papers related to the NutriGenMe project. It uses natural language processing techniques to parse documents and pull out key data points, so that researchers and practitioners can gather insights efficiently from a large corpus of literature.

## Features
- **Automated Extraction**: Automatically extracts entities such as the title, authors, and conclusion of the study from academic papers.
- **Fast Extraction**: Extracts information from complex papers in under 10 minutes.
- **Table Extraction**: Extracts values from tables, focusing on gene names, SNPs, and associated diseases.
- **Export to Excel**: Exports extraction results to Excel format for easy integration and further analysis.

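To make the table-extraction feature more concrete, here is a minimal sketch of the kind of row filtering it implies. It is not the project's actual code: the `gene`/`snp`/`disease` column names and the helpers `filter_snp_rows` and `rows_to_csv` are hypothetical, the only domain fact used is that dbSNP identifiers have the form `rs` followed by digits, and the sketch writes CSV text with the standard library rather than an Excel file.

```python
import csv
import io
import re

# dbSNP reference SNP IDs are "rs" followed by digits.
RSID_PATTERN = re.compile(r"^rs\d+$", re.IGNORECASE)

# Hypothetical column names; the real extractor's schema may differ.
FIELDS = ["gene", "snp", "disease"]


def filter_snp_rows(rows):
    """Keep only rows whose 'snp' value looks like an rsID, with fields trimmed."""
    kept = []
    for row in rows:
        snp = row.get("snp", "").strip()
        if RSID_PATTERN.match(snp):
            kept.append({
                "gene": row.get("gene", "").strip(),
                "snp": snp.lower(),
                "disease": row.get("disease", "").strip(),
            })
    return kept


def rows_to_csv(rows):
    """Serialize the cleaned rows as CSV text (a stand-in for the Excel export)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    # Placeholder rows for illustration only, not real extraction output.
    raw = [
        {"gene": " GENE_A ", "snp": "RS12345", "disease": "condition X"},
        {"gene": "GENE_B", "snp": "n/a", "disease": "condition Y"},
    ]
    print(rows_to_csv(filter_snp_rows(raw)))
```

Rows without a plausible rsID are dropped here for simplicity; a real pipeline might instead flag them for manual review.
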
## Usage
1. Clone this repository:
```bash
git clone https://github.com/KalbeDigitalLab/nutrigenme-paper-extractor
```

2. Install dependencies:
```bash
pip install -r requirements.txt
```

3. Prepare environment keys:
```dosini
# Credentials for LLM Models
OPENAI_API_KEY=<api_key>
GOOGLE_API_KEY=<api_key>
PERPLEXITY_API_KEY=<api_key>

# (Optional) Tracking your extraction process with LangSmith
LANGCHAIN_TRACING_V2='true'
LANGCHAIN_API_KEY=<langchain_api_key>
LANGCHAIN_ENDPOINT='https://api.smith.langchain.com'
LANGCHAIN_PROJECT=<project_name>
```

4. Run the application with `streamlit`:
```bash
streamlit run app.py
```

The application is also deployed as a 🤗 Hugging Face [Space](https://huggingface.co/spaces/KalbeDigitalLab/nutrigenme-paper-extractor/).

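Before launching the app, it can help to confirm that the credentials from step 3 are visible to the process. The following is a minimal sketch, assuming all three LLM keys are required (the steps above list them but do not say whether any is optional); `REQUIRED_KEYS` and `missing_keys` are illustrative names, not part of the project.

```python
import os

# Key names taken from step 3; treating all three as required is an assumption.
REQUIRED_KEYS = ("OPENAI_API_KEY", "GOOGLE_API_KEY", "PERPLEXITY_API_KEY")


def missing_keys(env=None, required=REQUIRED_KEYS):
    """Return the names of required keys that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_keys()
    if missing:
        print("Missing environment keys: " + ", ".join(missing))
    else:
        print("All LLM credentials are set.")
```

The check reads `os.environ` only, so keys kept in a file need to be exported or loaded into the environment before it runs.
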
## Documentation
**app.py**: Defines the user interface and drives the application flow, calling the other scripts for specific tasks.

**process.py**: Orchestrates information extraction by delegating tasks to the other scripts and handling the overall workflow.

**prompt.py**: Stores the prompts crafted for Large Language Models (LLMs) to target specific information during extraction.

**table_detector.py**: Extracts information from tables recognized via Optical Character Recognition (OCR), with functions to detect and process them.

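The division of labor between a prompt store and an orchestrator can be pictured with a short sketch. This is an illustration under assumptions, not the contents of `prompt.py` or `process.py`: the `PROMPTS` dictionary, the prompt wording, the `extract_fields` function, and the `ask_llm` callable are all hypothetical.

```python
# Hypothetical stand-in for a prompt store: one template per target field.
PROMPTS = {
    "title": "Return only the title of this paper:\n\n{text}",
    "authors": "List the authors of this paper, separated by commas:\n\n{text}",
    "conclusion": "Summarize the conclusion of this study in one paragraph:\n\n{text}",
}


def extract_fields(text, ask_llm, prompts=PROMPTS):
    """Hypothetical orchestrator: send one prompt per field, collect the answers.

    `ask_llm` is any callable that takes a prompt string and returns a string,
    so a real model client or a test stub can be plugged in.
    """
    results = {}
    for field, template in prompts.items():
        results[field] = ask_llm(template.format(text=text)).strip()
    return results


if __name__ == "__main__":
    # Stub model for demonstration; it reports the prompt length instead of calling an LLM.
    stub = lambda prompt: f"(stub answer, prompt was {len(prompt)} characters)"
    for field, answer in extract_fields("Example paper text.", stub).items():
        print(field, "->", answer)
```

Passing the model as a callable keeps the orchestration logic testable without network access or API keys.
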
## Contributing
Contributions are welcome! To contribute to this project, open a pull request.