root committed on
Commit 5c5c629
1 parent: 92bcd1d

abstract function added

README.md CHANGED
@@ -1,5 +1,5 @@
  ---
- title: Literature Review Assistant (a.k.a. Don't Want to Read Papers)
  emoji: 📚
  colorFrom: blue
  colorTo: indigo
@@ -9,11 +9,13 @@ sdk_version: "4.25.0"
  app_file: app.py
  pinned: false
  ---
- # MedicalReviewAgent (Don't Want to Read Papers)
  ## Project Overview
- - An agent that writes literature reviews for me; it should handle collecting literature, classifying and summarizing text, cross-checking scientific facts, and drafting the review.
- - Planned techniques include RAG, function calling, and the like.
- - Still very much a work in progress; pointers from experienced folks are welcome!
  - [Hugging Face demo](https://huggingface.co/spaces/Yijun-Yang/ReadReview/). The ZeroGPU quota is stingy, so local inference is disabled on the Space; don't select a local model there, use an API, or it will error out.

  ## Workflow Diagrams
@@ -97,7 +99,7 @@ huggingface-cli download maidalun1020/bce-reranker-base_v1 --local-dir /root/mod
  ```bash
  conda activate ReviewAgent
  cd MedicalReviewAgent
- python3 app.py --model_downloaded True # use this when running on your own GPU: if the models are already under /root/models, this flag switches to a config file whose model paths are local paths rather than HF repo paths
  python3 app.py # use this if you are not using models stored under /root/models; this is the build configuration for the HF Space
  ```
  Gradio runs locally on port 7860
 
  ---
+ title: Literature Review Assistant
  emoji: 📚
  colorFrom: blue
  colorTo: indigo

  app_file: app.py
  pinned: false
  ---
+ [English](README_en.md) | 中文
+ # MedicalReviewAgent
  ## Project Overview
+ - A medical literature review assistant built on RAG and an agent workflow. It lets users configure local or remote large language models, search PubMed by keyword or PMID to fetch literature, upload PDF files, and create and manage literature databases. Databases can be generated under different parameter settings to serve different needs.
+ - Clustering and annotation of text chunks is the key innovation: by clustering a large number of chunks, the large model only needs to read a few representative chunks and label the clusters to form an overall picture of the database contents.
+ - Finally, the review-writing feature outputs a complete passage of review text, with relevant references, in response to the user's question.
+ - Overall, this small tool aims to help researchers efficiently retrieve, manage, read, and summarize literature.
  - [Hugging Face demo](https://huggingface.co/spaces/Yijun-Yang/ReadReview/). The ZeroGPU quota is stingy, so local inference is disabled on the Space; don't select a local model there, use an API, or it will error out.

  ## Workflow Diagrams

  ```bash
  conda activate ReviewAgent
  cd MedicalReviewAgent
+ python3 applocal.py --model_downloaded True # use this when running on your own GPU: if the models are already under /root/models, this flag switches to a config file whose model paths are local paths rather than HF repo paths
  python3 app.py # use this if you are not using models stored under /root/models; this is the build configuration for the HF Space
  ```
  Gradio runs locally on port 7860
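
The `--model_downloaded` flag above presumably just selects which config file the app loads; this commit does not show that wiring, so the following is only a hypothetical sketch of the pattern (the flag handling and the `config-local.ini` name are assumptions, not the repo's code):

```python
# Hypothetical sketch: a --model_downloaded flag choosing between a config
# whose model paths are local (/root/models/...) and one using HF repo ids.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--model_downloaded', default='False',
                    help='pass True if models already live under /root/models')
args = parser.parse_args()

# config-local.ini is an assumed filename; the repo's actual name may differ
CONFIG_PATH = 'config-local.ini' if args.model_downloaded == 'True' else 'config.ini'
print(f'Loading {CONFIG_PATH}')
```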
README_en.md ADDED
@@ -0,0 +1,70 @@
+ English | [中文](README.md)
+ # MedicalReviewAgent
+
+ ## Project Overview
+
+ - MedicalReviewAgent is a medical literature review assistance tool based on RAG technology and agent workflows. It enables users to configure local or remote large language models, search PubMed via keywords or PMIDs, upload PDF files, and create and manage literature databases. Users can generate databases with different settings for various needs.
+ - Text-block clustering and tagging is the tool's main innovation for managing large volumes of text efficiently. By clustering text blocks, the large model only needs to read a few representative blocks and annotate the clusters to summarize the database content comprehensively.
+ - The "write review" feature generates complete review text with references in response to user queries.
+ - Overall, this tool is designed to help researchers efficiently retrieve, manage, read, and summarize literature.
+ - [Hugging Face Experience Link](https://huggingface.co/spaces/Yijun-Yang/ReadReview/). Note: ZeroGPU quota is limited, so avoid using the local model, as it may result in errors.
+
+ ## Workflow Diagrams
+
+ ### Literature and Knowledge Base Construction
+ ![Literature and Knowledge Base Construction Diagram](https://github.com/jabberwockyang/MedicalReviewAgent/assets/52541128/d70a2ec1-7a20-4b5b-a91c-bf649f657319)
+
+ ### Human-Computer Collaborative Writing
+ ![Human-Computer Collaborative Writing Diagram](https://github.com/jabberwockyang/MedicalReviewAgent/assets/52541128/fc394d8b-1668-4349-9adc-1c4c0a7e0a8b)
+
+ ## Features
+
+ 1. **Model Service Configuration**
+    - **Model Selection**: Allows users to choose remote or local large models from providers such as Kimi, Deepseek, Zhipuai, and GPT.
+
+ 2. **Literature Search + Database Creation**
+    - **Literature Search**: Users can enter keywords, set the search quantity, and run PubMed/PMC literature searches.
+    - **Literature Database Management**: Supports deleting existing literature databases and provides real-time updates on the library's overview.
+    - **Database Creation**: Users can set block sizes and cluster numbers for text clustering.
+    - **Database Management**: Supports creating new databases, deleting existing ones, and viewing database overviews.
+
+ 3. **Writing Reviews**
+    - **Sampling Annotated Article Clusters**: Users can choose the block size and cluster number, set the sampling annotation ratio, and start the annotation process.
+    - **Inspiration Generation**: Based on the annotated article clusters, the large model offers inspiration to help build the framework of questions the review needs to answer.
+    - **Review Generation**: Users input the content or topic they wish to write about, click the generate-review button, and the system automatically generates review text with references.
+
+ ## Highlights
+
+ 1. **Efficient Literature Search and Management**: Quickly finds related literature by keyword and supports uploading existing PDF literature for easy library construction and management.
+ 2. **Flexible Database Generation**: Provides flexible parameters for database generation and supports repeated generation and updates to ensure timeliness and accuracy.
+ 3. **Intelligent Review Generation**: Uses advanced large-model technology for automated article-cluster annotation and inspiration generation, helping users quickly produce high-quality review text.
+ 4. **User-Friendly Interface**: An intuitive interface and detailed usage instructions make it easy to get started and use every feature.
+ 5. **Remote and Local Model Support**: Supports a variety of large-model providers to meet different user needs; configurations can be adjusted flexibly whether models run locally or remotely.
+
+ ## Installation and Running
+
+ Create a new conda environment:
+ ```bash
+ conda create --name ReviewAgent python=3.10.14
+ conda activate ReviewAgent
+ ```
+ Clone the GitHub repository:
+ ```bash
+ git clone https://github.com/jabberwockyang/MedicalReviewAgent.git
+ cd MedicalReviewAgent
+ pip install -r requirements.txt
+ ```
+ Download models with huggingface-cli (optional; HF will download them on first call, but there may be firewall issues):
+ ```bash
+ cd /root && mkdir models
+ cd /root/models
+ # login required
+ huggingface-cli download Qwen/Qwen1.5-7B-Chat --local-dir /root/models/Qwen1.5-7B-Chat
+ huggingface-cli download maidalun1020/bce-embedding-base_v1 --local-dir /root/models/bce-embedding-base_v1
+ huggingface-cli download maidalun1020/bce-reranker-base_v1 --local-dir /root/models/bce-reranker-base_v1
+ ```
+ Start the service:
+ ```bash
+ conda activate ReviewAgent
+ cd MedicalReviewAgent
+ python3 app.py --model_downloaded True # Use this if models are already downloaded under /root/models
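
The clustering-and-annotation idea described in the overview above can be illustrated in a few lines. This is a rough sketch only, not the repo's actual huixiangdou pipeline; it assumes sentence-transformers and scikit-learn are installed and reuses the bce-embedding-base_v1 model the install step downloads: embed the chunks, cluster them, and keep the chunk nearest each centroid as the representative the LLM annotates.

```python
# Minimal sketch of "cluster many chunks, annotate few representatives".
# Assumptions: sentence-transformers and scikit-learn are available; the
# repo's real pipeline differs in detail.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

def representative_chunks(chunks: list[str], k: int) -> dict[int, str]:
    model = SentenceTransformer('maidalun1020/bce-embedding-base_v1')
    emb = model.encode(chunks, normalize_embeddings=True)
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(emb)
    reps = {}
    for c in range(k):
        idx = np.where(km.labels_ == c)[0]
        # the chunk closest to the centroid stands in for the whole cluster
        dists = np.linalg.norm(emb[idx] - km.cluster_centers_[c], axis=1)
        reps[c] = chunks[idx[np.argmin(dists)]]
    return reps  # feed these to the LLM for cluster annotation
```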
app.py CHANGED
@@ -82,7 +82,7 @@ def udate_model_dropdown(remote_company):
  'kimi': ['moonshot-v1-128k'],
  'deepseek': ['deepseek-chat'],
  'zhipuai': ['glm-4'],
- 'gpt': ['gpt-4-32k-0613','gpt-3.5-turbo']
  }
  return gr.Dropdown(choices= model_choices[remote_company])

@@ -107,7 +107,7 @@ def update_remote_config(remote_ornot,remote_company = None,api = None,baseurl =
  return gr.Button("配置已保存")

  # @spaces.GPU(duration=120)
- def get_ready(query:str,chunksize=None,k=None):

  with open(CONFIG_PATH, encoding='utf8') as f:
  config = pytoml.load(f)
@@ -124,6 +124,8 @@ def get_ready(query:str,chunksize=None,k=None):
  except:
  pass

  if query == 'annotation':
  if not chunksize or not k:
  raise ValueError('chunksize or k not provided')
@@ -182,9 +184,11 @@ def update_repo_info():
  pmc_success = repo_info['pmc_success_d']
  scihub_success = repo_info['scihub_success_d']
  failed_download = repo_info['failed_download']

  number_of_upload = number_of_pdf-scihub_success
- return keywords, retmax, search_len, import_len, failed_pmid_len, pmc_success, scihub_success, number_of_pdf, failed_download, number_of_upload
  else:
  return None,None,None,None,None,None,None,None,None,number_of_pdf
  else:
@@ -223,33 +227,39 @@ def delete_articles_repo():
  repodir, workdir, _ = get_ready('repo_work')
  if os.path.exists(repodir):
  shutil.rmtree(repodir)
  if os.path.exists(workdir):
  shutil.rmtree(workdir)

  return gr.Textbox(label="文献库概况",lines =3,
  value = '文献库和相关数据库已删除',
  visible = True)

  def update_repo():
- keys, retmax, search_len, import_len, _, pmc_success, scihub_success, pdflen, failed, pdflen = update_repo_info()
  newinfo = ""
  if keys == None:
  newinfo += '无关键词搜索相关信息\n'
  newinfo += '无导入的PMID\n'
- if pdflen:
- newinfo += f'上传的PDF数量: {pdflen}\n'
  else:
  newinfo += '无上传的PDF\n'
  else:
  newinfo += f'关键词搜索:'
- newinfo += f' 关键词: {keys}\n'
- newinfo += f' 搜索上限: {retmax}\n'
- newinfo += f' 搜索到的PMID数量: {search_len}\n'
  newinfo += f'导入的PMID数量: {import_len}\n'
- newinfo += f'成功获取PMC全文数量: {pmc_success}\n'
- newinfo += f'成功获取SciHub全文数量: {scihub_success}\n'
- newinfo += f"下载失败的ID: {failed}\n"
- newinfo += f'上传的PDF数量: {pdflen}\n'

  return gr.Textbox(label="文献库概况",lines =1,
  value = newinfo,
@@ -259,26 +269,35 @@ def update_database_info():
  with open(CONFIG_PATH, encoding='utf8') as f:
  config = pytoml.load(f)
  workdir = config['feature_store']['work_dir']
- chunkdirs = glob.glob(os.path.join(workdir, 'chunksize_*'))
- chunkdirs.sort()
- list_of_chunksize = [int(chunkdir.split('_')[-1]) for chunkdir in chunkdirs]
- # print(list_of_chunksize)
- jsonobj = {}
- for chunkdir in chunkdirs:
- k_dir = glob.glob(os.path.join(chunkdir, 'cluster_features','cluster_features_*'))
- k_dir.sort()
- list_of_k = [int(k.split('_')[-1]) for k in k_dir]
- jsonobj[int(chunkdir.split('_')[-1])] = list_of_k

-
- new_options = [f"chunksize:{chunksize}, k:{k}" for chunksize in list_of_chunksize for k in jsonobj[chunksize]]

- return new_options, jsonobj

  @spaces.GPU(duration=120)
  def generate_database(chunksize:int,nclusters:str|list[str]):
  # run the database-generation function here
  repodir, workdir, _ = get_ready('repo_work')
  if not os.path.exists(repodir):
  return gr.Textbox(label="数据库已生成",value = '请先生成文献库',visible = True)
  nclusters = [int(i) for i in nclusters]
@@ -295,12 +314,17 @@ def generate_database(chunksize:int,nclusters:str|list[str]):
  chunk_size=chunksize,
  n_clusters=nclusters,
  config_path=CONFIG_PATH)

  # walk all files in repo dir
- file_opr = FileOperation()
  files = file_opr.scan_dir(repo_dir=repodir)
  fs_init.initialize(files=files, work_dir=workdir,file_opr=file_opr)
  file_opr.summarize(files)
  del fs_init
  cache.pop('default')
  texts, _ = update_database_info()
@@ -310,6 +334,7 @@ def delete_database():
  _, workdir, _ = get_ready('repo_work')
  if os.path.exists(workdir):
  shutil.rmtree(workdir)
  return gr.Textbox(label="数据库概况",lines =3,value = '数据库已删除',visible = True)

  def update_database_textbox():
@@ -319,17 +344,24 @@ def update_database_textbox():
  else:
  return gr.Textbox(label="数据库概况",value = '\n'.join(texts),visible = True)

- def update_chunksize_dropdown():
  _, jsonobj = update_database_info()
- return gr.Dropdown(choices= jsonobj.keys())

- def update_ncluster_dropdown(chunksize:int):
  _, jsonobj = update_database_info()
- nclusters = jsonobj[chunksize]
  return gr.Dropdown(choices= nclusters)

  # @spaces.GPU(duration=120)
- def annotation(n,chunksize:int,nclusters:int,remote_ornot:bool):
  '''
  use llm to annotate cluster
  n: percentage of clusters to annotate
@@ -340,7 +372,7 @@ def annotation(n,chunksize:int,nclusters:int,remote_ornot:bool):
  else:
  backend = 'local'

- clusterdir, samples, assistant, theme = get_ready('annotation',chunksize,nclusters)
  new_obj_list = []
  n = round(n * len(samples.keys()))
  for cluster_no in random.sample(samples.keys(), n):
@@ -369,14 +401,14 @@ def annotation(n,chunksize:int,nclusters:int,remote_ornot:bool):
  return '\n\n'.join([obj['annotation'] for obj in new_obj_list])

  # @spaces.GPU(duration=120)
- def inspiration(annotation:str,chunksize:int,nclusters:int,remote_ornot:bool):
  query = 'inspiration'
  if remote_ornot:
  backend = 'remote'
  else:
  backend = 'local'

- clusterdir, annoresult, assistant, theme = get_ready('inspiration',chunksize,nclusters)
  new_obj_list = []

  if annotation is not None: # if the user wants to get inspiration from specific clusters only
@@ -418,13 +450,13 @@ def getpmcurls(references):
  return urls

  @spaces.GPU(duration=120)
- def summarize_text(query,chunksize:int,remote_ornot:bool):
  if remote_ornot:
  backend = 'remote'
  else:
  backend = 'local'

- assistant,_ = get_ready('summarize',chunksize=chunksize,k=None)
  code, reply, references = assistant.generate(query=query,
  history=[],
  groupname='',backend = backend)
@@ -611,6 +643,7 @@ def main_interface():
  with gr.Accordion("聚类标注相关参数", open=True):
  with gr.Row():
  update_options = gr.Button("更新数据库情况", scale=0)
  chunksize = gr.Dropdown([], label="选择块大小", scale=0)
  nclusters = gr.Dropdown([], label="选择聚类数", scale=0)
  ntoread = gr.Slider(
@@ -637,22 +670,23 @@
  output_references = gr.Markdown(label="参考文献")

  update_options.click(update_chunksize_dropdown,
  outputs=[chunksize])

  chunksize.change(update_ncluster_dropdown,
- inputs=[chunksize],
  outputs= [nclusters])

  annotation_button.click(annotation,
- inputs = [ntoread, chunksize, nclusters,remote_ornot],
  outputs=[annotation_output])

  inspiration_button.click(inspiration,
- inputs= [annotation_output, chunksize, nclusters,remote_ornot],
  outputs=[inspiration_output])

  write_button.click(summarize_text,
- inputs=[query, chunksize,remote_ornot],
  outputs =[output_text,output_references])

  demo.launch(share=False, server_name='0.0.0.0', debug=True,show_error=True,allowed_paths=['img_0.jpg'])
 
  'kimi': ['moonshot-v1-128k'],
  'deepseek': ['deepseek-chat'],
  'zhipuai': ['glm-4'],
+ 'gpt': ['gpt-4-32k-0613','gpt-3.5-turbo','gpt-4']
  }
  return gr.Dropdown(choices= model_choices[remote_company])

  return gr.Button("配置已保存")

  # @spaces.GPU(duration=120)
+ def get_ready(query:str,chunksize=None,k=None,use_abstract=False):

  with open(CONFIG_PATH, encoding='utf8') as f:
  config = pytoml.load(f)

  except:
  pass

+ if use_abstract:
+ workdir = workdir + '_ab'
  if query == 'annotation':
  if not chunksize or not k:
  raise ValueError('chunksize or k not provided')

  pmc_success = repo_info['pmc_success_d']
  scihub_success = repo_info['scihub_success_d']
  failed_download = repo_info['failed_download']
+ abstract_success = repo_info['abstract_success']
+ failed_abstract = repo_info['failed_abstract']

  number_of_upload = number_of_pdf-scihub_success
+ return keywords, retmax, search_len, import_len, failed_pmid_len, pmc_success, scihub_success, failed_download, abstract_success, failed_abstract, number_of_upload
  else:
  return None,None,None,None,None,None,None,None,None,number_of_pdf
  else:

  repodir, workdir, _ = get_ready('repo_work')
  if os.path.exists(repodir):
  shutil.rmtree(repodir)
+ shutil.rmtree(repodir + '_ab')
  if os.path.exists(workdir):
  shutil.rmtree(workdir)
+ shutil.rmtree(workdir + '_ab')

  return gr.Textbox(label="文献库概况",lines =3,
  value = '文献库和相关数据库已删除',
  visible = True)

  def update_repo():
+ # keywords, retmax, search_len, import_len, failed_pmid_len, pmc_success, scihub_success, failed_download, abstract_success, failed_abstract, number_of_upload
+ # None,None,None,None,None,None,None,None,None,number_of_pdf
+ keys, retmax, search_len, import_len, _, pmc_success, scihub_success, failed, abstract_success, failed_abstract, pdfuplo = update_repo_info()
  newinfo = ""
  if keys == None:
  newinfo += '无关键词搜索相关信息\n'
  newinfo += '无导入的PMID\n'
+ if pdfuplo:
+ newinfo += f'上传的PDF数量: {pdfuplo}\n'
  else:
  newinfo += '无上传的PDF\n'
  else:
  newinfo += f'关键词搜索:'
+ newinfo += f' 关键词: {keys}\n'
+ newinfo += f' 搜索上限: {retmax}\n'
+ newinfo += f' 搜索到的PMID数量: {search_len}\n'
  newinfo += f'导入的PMID数量: {import_len}\n'
+ newinfo += f' 成功获取PMC全文数量: {pmc_success}\n'
+ newinfo += f' 成功获取SciHub全文数量: {scihub_success}\n'
+ newinfo += f" 下载失败的ID: {failed}\n"
+ newinfo += f" 成功获取摘要的数量: {abstract_success}\n"
+ newinfo += f" 获取摘要失败的数量: {failed_abstract}\n"
+ newinfo += f'上传的PDF数量: {pdfuplo}\n'

  return gr.Textbox(label="文献库概况",lines =1,
  value = newinfo,

  with open(CONFIG_PATH, encoding='utf8') as f:
  config = pytoml.load(f)
  workdir = config['feature_store']['work_dir']
+ abworkdir = workdir + '_ab'
+ options = []
+ total_json_obj = {}
+ for dir in [workdir,abworkdir]:
+ tag = 'FullText' if '_ab' not in dir else 'Abstract'
+
+ chunkdirs = glob.glob(os.path.join(dir, 'chunksize_*'))
+ chunkdirs.sort()
+ list_of_chunksize = [int(chunkdir.split('_')[-1]) for chunkdir in chunkdirs]
+ # print(list_of_chunksize)
+ jsonobj = {}
+ for chunkdir in chunkdirs:
+ k_dir = glob.glob(os.path.join(chunkdir, 'cluster_features','cluster_features_*'))
+ k_dir.sort()
+ list_of_k = [int(k.split('_')[-1]) for k in k_dir]
+ jsonobj[int(chunkdir.split('_')[-1])] = list_of_k

+ total_json_obj[tag] = jsonobj
+ newoptions = [f"{tag}, chunksize:{chunksize}, k:{k}" for chunksize in list_of_chunksize for k in jsonobj[chunksize]]
+ options.extend(newoptions)

+ return options, total_json_obj

  @spaces.GPU(duration=120)
  def generate_database(chunksize:int,nclusters:str|list[str]):
  # run the database-generation function here
  repodir, workdir, _ = get_ready('repo_work')
+ abrepodir = repodir + '_ab'
+ abworkdir = workdir + '_ab'
  if not os.path.exists(repodir):
  return gr.Textbox(label="数据库已生成",value = '请先生成文献库',visible = True)
  nclusters = [int(i) for i in nclusters]

  chunk_size=chunksize,
  n_clusters=nclusters,
  config_path=CONFIG_PATH)
+ file_opr = FileOperation()

  # walk all files in repo dir
  files = file_opr.scan_dir(repo_dir=repodir)
  fs_init.initialize(files=files, work_dir=workdir,file_opr=file_opr)
  file_opr.summarize(files)
+
+ files = file_opr.scan_dir(repo_dir=abrepodir)
+ fs_init.initialize(files=files, work_dir=abworkdir,file_opr=file_opr)
+ file_opr.summarize(files)
+
  del fs_init
  cache.pop('default')
  texts, _ = update_database_info()

  _, workdir, _ = get_ready('repo_work')
  if os.path.exists(workdir):
  shutil.rmtree(workdir)
+ shutil.rmtree(workdir+'_ab')
  return gr.Textbox(label="数据库概况",lines =3,value = '数据库已删除',visible = True)

  def update_database_textbox():

  else:
  return gr.Textbox(label="数据库概况",value = '\n'.join(texts),visible = True)

+ def update_chunksize_dropdown(use_abstract):
  _, jsonobj = update_database_info()
+ if use_abstract:
+ choices = jsonobj['Abstract'].keys()
+ else:
+ choices = jsonobj['FullText'].keys()
+ return gr.Dropdown(choices= choices)

+ def update_ncluster_dropdown(chunksize:int,use_abstract:bool):
  _, jsonobj = update_database_info()
+ if use_abstract:
+ nclusters = jsonobj['Abstract'][chunksize]
+ else:
+ nclusters = jsonobj['FullText'][chunksize]
  return gr.Dropdown(choices= nclusters)

  # @spaces.GPU(duration=120)
+ def annotation(n,chunksize:int,nclusters:int,remote_ornot:bool,use_abstract:bool):
  '''
  use llm to annotate cluster
  n: percentage of clusters to annotate

  else:
  backend = 'local'

+ clusterdir, samples, assistant, theme = get_ready('annotation',chunksize,nclusters,use_abstract)
  new_obj_list = []
  n = round(n * len(samples.keys()))
  for cluster_no in random.sample(samples.keys(), n):

  return '\n\n'.join([obj['annotation'] for obj in new_obj_list])

  # @spaces.GPU(duration=120)
+ def inspiration(annotation:str,chunksize:int,nclusters:int,remote_ornot:bool,use_abstract:bool):
  query = 'inspiration'
  if remote_ornot:
  backend = 'remote'
  else:
  backend = 'local'

+ clusterdir, annoresult, assistant, theme = get_ready('inspiration',chunksize,nclusters,use_abstract)
  new_obj_list = []

  if annotation is not None: # if the user wants to get inspiration from specific clusters only

  return urls

  @spaces.GPU(duration=120)
+ def summarize_text(query,chunksize:int,remote_ornot:bool,use_abstract:bool):
  if remote_ornot:
  backend = 'remote'
  else:
  backend = 'local'

+ assistant,_ = get_ready('summarize',chunksize=chunksize,k=None,use_abstract=use_abstract)
  code, reply, references = assistant.generate(query=query,
  history=[],
  groupname='',backend = backend)

  with gr.Accordion("聚类标注相关参数", open=True):
  with gr.Row():
  update_options = gr.Button("更新数据库情况", scale=0)
+ use_abstract = gr.Checkbox(label="是否仅使用摘要",scale=0)
  chunksize = gr.Dropdown([], label="选择块大小", scale=0)
  nclusters = gr.Dropdown([], label="选择聚类数", scale=0)
  ntoread = gr.Slider(

  output_references = gr.Markdown(label="参考文献")

  update_options.click(update_chunksize_dropdown,
+ inputs=[use_abstract],
  outputs=[chunksize])

  chunksize.change(update_ncluster_dropdown,
+ inputs=[chunksize,use_abstract],
  outputs= [nclusters])

  annotation_button.click(annotation,
+ inputs = [ntoread, chunksize, nclusters,remote_ornot,use_abstract],
  outputs=[annotation_output])

  inspiration_button.click(inspiration,
+ inputs= [annotation_output, chunksize, nclusters,remote_ornot,use_abstract],
  outputs=[inspiration_output])

  write_button.click(summarize_text,
+ inputs=[query, chunksize,remote_ornot,use_abstract],
  outputs =[output_text,output_references])

  demo.launch(share=False, server_name='0.0.0.0', debug=True,show_error=True,allowed_paths=['img_0.jpg'])
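
The `_ab` suffix introduced above puts a second, abstract-only tree next to each full-text one. A self-contained sketch of the directory convention `update_database_info` walks (the paths and sizes here are made up; the real work_dir comes from config.ini):

```python
# Sketch of the layout update_database_info expects:
#   <workdir>/chunksize_<size>/cluster_features/cluster_features_<k>
# plus the parallel <workdir>_ab tree for abstract-only databases.
import glob
import os
import tempfile

root = tempfile.mkdtemp()
for work, size, ks in [('workdir', 2482, [10]), ('workdir_ab', 512, [5, 10])]:
    for k in ks:
        os.makedirs(os.path.join(root, work, f'chunksize_{size}',
                                 'cluster_features', f'cluster_features_{k}'))

for dir in [os.path.join(root, 'workdir'), os.path.join(root, 'workdir_ab')]:
    tag = 'FullText' if '_ab' not in dir else 'Abstract'
    for chunkdir in sorted(glob.glob(os.path.join(dir, 'chunksize_*'))):
        chunksize = int(chunkdir.split('_')[-1])
        for k_dir in sorted(glob.glob(os.path.join(
                chunkdir, 'cluster_features', 'cluster_features_*'))):
            # same option strings the dropdowns present to the user
            print(f"{tag}, chunksize:{chunksize}, k:{int(k_dir.split('_')[-1])}")
```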
applocal.py CHANGED
@@ -82,7 +82,7 @@ def udate_model_dropdown(remote_company):
  'kimi': ['moonshot-v1-128k'],
  'deepseek': ['deepseek-chat'],
  'zhipuai': ['glm-4'],
- 'gpt': ['gpt-4-32k-0613','gpt-3.5-turbo']
  }
  return gr.Dropdown(choices= model_choices[remote_company])

@@ -107,7 +107,7 @@ def update_remote_config(remote_ornot,remote_company = None,api = None,baseurl =
  return gr.Button("配置已保存")

  # @spaces.GPU(duration=360)
- def get_ready(query:str,chunksize=None,k=None):

  with open(CONFIG_PATH, encoding='utf8') as f:
  config = pytoml.load(f)
@@ -124,6 +124,8 @@ def get_ready(query:str,chunksize=None,k=None):
  except:
  pass

  if query == 'annotation':
  if not chunksize or not k:
  raise ValueError('chunksize or k not provided')
@@ -182,9 +184,11 @@ def update_repo_info():
  pmc_success = repo_info['pmc_success_d']
  scihub_success = repo_info['scihub_success_d']
  failed_download = repo_info['failed_download']

  number_of_upload = number_of_pdf-scihub_success
- return keywords, retmax, search_len, import_len, failed_pmid_len, pmc_success, scihub_success, number_of_pdf, failed_download, number_of_upload
  else:
  return None,None,None,None,None,None,None,None,None,number_of_pdf
  else:
@@ -223,33 +227,39 @@ def delete_articles_repo():
  repodir, workdir, _ = get_ready('repo_work')
  if os.path.exists(repodir):
  shutil.rmtree(repodir)
  if os.path.exists(workdir):
  shutil.rmtree(workdir)

  return gr.Textbox(label="文献库概况",lines =3,
  value = '文献库和相关数据库已删除',
  visible = True)

  def update_repo():
- keys, retmax, search_len, import_len, _, pmc_success, scihub_success, pdflen, failed, pdflen = update_repo_info()
  newinfo = ""
  if keys == None:
  newinfo += '无关键词搜索相关信息\n'
  newinfo += '无导入的PMID\n'
- if pdflen:
- newinfo += f'上传的PDF数量: {pdflen}\n'
  else:
  newinfo += '无上传的PDF\n'
  else:
  newinfo += f'关键词搜索:'
- newinfo += f' 关键词: {keys}\n'
- newinfo += f' 搜索上限: {retmax}\n'
- newinfo += f' 搜索到的PMID数量: {search_len}\n'
  newinfo += f'导入的PMID数量: {import_len}\n'
- newinfo += f'成功获取PMC全文数量: {pmc_success}\n'
- newinfo += f'成功获取SciHub全文数量: {scihub_success}\n'
- newinfo += f"下载失败的ID: {failed}\n"
- newinfo += f'上传的PDF数量: {pdflen}\n'

  return gr.Textbox(label="文献库概况",lines =1,
  value = newinfo,
@@ -259,26 +269,35 @@ def update_database_info():
  with open(CONFIG_PATH, encoding='utf8') as f:
  config = pytoml.load(f)
  workdir = config['feature_store']['work_dir']
- chunkdirs = glob.glob(os.path.join(workdir, 'chunksize_*'))
- chunkdirs.sort()
- list_of_chunksize = [int(chunkdir.split('_')[-1]) for chunkdir in chunkdirs]
- # print(list_of_chunksize)
- jsonobj = {}
- for chunkdir in chunkdirs:
- k_dir = glob.glob(os.path.join(chunkdir, 'cluster_features','cluster_features_*'))
- k_dir.sort()
- list_of_k = [int(k.split('_')[-1]) for k in k_dir]
- jsonobj[int(chunkdir.split('_')[-1])] = list_of_k

-
- new_options = [f"chunksize:{chunksize}, k:{k}" for chunksize in list_of_chunksize for k in jsonobj[chunksize]]

- return new_options, jsonobj

  # @spaces.GPU(duration=360)
  def generate_database(chunksize:int,nclusters:str|list[str]):
  # run the database-generation function here
  repodir, workdir, _ = get_ready('repo_work')
  if not os.path.exists(repodir):
  return gr.Textbox(label="数据库已生成",value = '请先生成文献库',visible = True)
  nclusters = [int(i) for i in nclusters]
@@ -295,12 +314,17 @@ def generate_database(chunksize:int,nclusters:str|list[str]):
  chunk_size=chunksize,
  n_clusters=nclusters,
  config_path=CONFIG_PATH)

  # walk all files in repo dir
- file_opr = FileOperation()
  files = file_opr.scan_dir(repo_dir=repodir)
  fs_init.initialize(files=files, work_dir=workdir,file_opr=file_opr)
  file_opr.summarize(files)
  del fs_init
  cache.pop('default')
  texts, _ = update_database_info()
@@ -310,6 +334,7 @@ def delete_database():
  _, workdir, _ = get_ready('repo_work')
  if os.path.exists(workdir):
  shutil.rmtree(workdir)
  return gr.Textbox(label="数据库概况",lines =3,value = '数据库已删除',visible = True)

  def update_database_textbox():
@@ -319,17 +344,24 @@ def update_database_textbox():
  else:
  return gr.Textbox(label="数据库概况",value = '\n'.join(texts),visible = True)

- def update_chunksize_dropdown():
  _, jsonobj = update_database_info()
- return gr.Dropdown(choices= jsonobj.keys())

- def update_ncluster_dropdown(chunksize:int):
  _, jsonobj = update_database_info()
- nclusters = jsonobj[chunksize]
  return gr.Dropdown(choices= nclusters)

  # @spaces.GPU(duration=360)
- def annotation(n,chunksize:int,nclusters:int,remote_ornot:bool):
  '''
  use llm to annotate cluster
  n: percentage of clusters to annotate
@@ -340,7 +372,7 @@ def annotation(n,chunksize:int,nclusters:int,remote_ornot:bool):
  else:
  backend = 'local'

- clusterdir, samples, assistant, theme = get_ready('annotation',chunksize,nclusters)
  new_obj_list = []
  n = round(n * len(samples.keys()))
  for cluster_no in random.sample(samples.keys(), n):
@@ -369,14 +401,14 @@ def annotation(n,chunksize:int,nclusters:int,remote_ornot:bool):
  return '\n\n'.join([obj['annotation'] for obj in new_obj_list])

  # @spaces.GPU(duration=360)
- def inspiration(annotation:str,chunksize:int,nclusters:int,remote_ornot:bool):
  query = 'inspiration'
  if remote_ornot:
  backend = 'remote'
  else:
  backend = 'local'

- clusterdir, annoresult, assistant, theme = get_ready('inspiration',chunksize,nclusters)
  new_obj_list = []

  if annotation is not None: # if the user wants to get inspiration from specific clusters only
@@ -418,13 +450,13 @@ def getpmcurls(references):
  return urls

  # @spaces.GPU(duration=360)
- def summarize_text(query,chunksize:int,remote_ornot:bool):
  if remote_ornot:
  backend = 'remote'
  else:
  backend = 'local'

- assistant,_ = get_ready('summarize',chunksize=chunksize,k=None)
  code, reply, references = assistant.generate(query=query,
  history=[],
  groupname='',backend = backend)
@@ -611,6 +643,7 @@ def main_interface():
  with gr.Accordion("聚类标注相关参数", open=True):
  with gr.Row():
  update_options = gr.Button("更新数据库情况", scale=0)
  chunksize = gr.Dropdown([], label="选择块大小", scale=0)
  nclusters = gr.Dropdown([], label="选择聚类数", scale=0)
  ntoread = gr.Slider(
@@ -637,22 +670,23 @@
  output_references = gr.Markdown(label="参考文献")

  update_options.click(update_chunksize_dropdown,
  outputs=[chunksize])

  chunksize.change(update_ncluster_dropdown,
- inputs=[chunksize],
  outputs= [nclusters])

  annotation_button.click(annotation,
- inputs = [ntoread, chunksize, nclusters,remote_ornot],
  outputs=[annotation_output])

  inspiration_button.click(inspiration,
- inputs= [annotation_output, chunksize, nclusters,remote_ornot],
  outputs=[inspiration_output])

  write_button.click(summarize_text,
- inputs=[query, chunksize,remote_ornot],
  outputs =[output_text,output_references])

  demo.launch(share=False, server_name='0.0.0.0', debug=True,show_error=True,allowed_paths=['img_0.jpg'])
 
  'kimi': ['moonshot-v1-128k'],
  'deepseek': ['deepseek-chat'],
  'zhipuai': ['glm-4'],
+ 'gpt': ['gpt-4-32k-0613','gpt-3.5-turbo','gpt-4']
  }
  return gr.Dropdown(choices= model_choices[remote_company])

  return gr.Button("配置已保存")

  # @spaces.GPU(duration=360)
+ def get_ready(query:str,chunksize=None,k=None,use_abstract=False):

  with open(CONFIG_PATH, encoding='utf8') as f:
  config = pytoml.load(f)

  except:
  pass

+ if use_abstract:
+ workdir = workdir + '_ab'
  if query == 'annotation':
  if not chunksize or not k:
  raise ValueError('chunksize or k not provided')

  pmc_success = repo_info['pmc_success_d']
  scihub_success = repo_info['scihub_success_d']
  failed_download = repo_info['failed_download']
+ abstract_success = repo_info['abstract_success']
+ failed_abstract = repo_info['failed_abstract']

  number_of_upload = number_of_pdf-scihub_success
+ return keywords, retmax, search_len, import_len, failed_pmid_len, pmc_success, scihub_success, failed_download, abstract_success, failed_abstract, number_of_upload
  else:
  return None,None,None,None,None,None,None,None,None,number_of_pdf
  else:

  repodir, workdir, _ = get_ready('repo_work')
  if os.path.exists(repodir):
  shutil.rmtree(repodir)
+ shutil.rmtree(repodir + '_ab')
  if os.path.exists(workdir):
  shutil.rmtree(workdir)
+ shutil.rmtree(workdir + '_ab')

  return gr.Textbox(label="文献库概况",lines =3,
  value = '文献库和相关数据库已删除',
  visible = True)

  def update_repo():
+ # keywords, retmax, search_len, import_len, failed_pmid_len, pmc_success, scihub_success, failed_download, abstract_success, failed_abstract, number_of_upload
+ # None,None,None,None,None,None,None,None,None,number_of_pdf
+ keys, retmax, search_len, import_len, _, pmc_success, scihub_success, failed, abstract_success, failed_abstract, pdfuplo = update_repo_info()
  newinfo = ""
  if keys == None:
  newinfo += '无关键词搜索相关信息\n'
  newinfo += '无导入的PMID\n'
+ if pdfuplo:
+ newinfo += f'上传的PDF数量: {pdfuplo}\n'
  else:
  newinfo += '无上传的PDF\n'
  else:
  newinfo += f'关键词搜索:'
+ newinfo += f' 关键词: {keys}\n'
+ newinfo += f' 搜索上限: {retmax}\n'
+ newinfo += f' 搜索到的PMID数量: {search_len}\n'
  newinfo += f'导入的PMID数量: {import_len}\n'
+ newinfo += f' 成功获取PMC全文数量: {pmc_success}\n'
+ newinfo += f' 成功获取SciHub全文数量: {scihub_success}\n'
+ newinfo += f" 下载失败的ID: {failed}\n"
+ newinfo += f" 成功获取摘要的数量: {abstract_success}\n"
+ newinfo += f" 获取摘要失败的数量: {failed_abstract}\n"
+ newinfo += f'上传的PDF数量: {pdfuplo}\n'

  return gr.Textbox(label="文献库概况",lines =1,
  value = newinfo,

  with open(CONFIG_PATH, encoding='utf8') as f:
  config = pytoml.load(f)
  workdir = config['feature_store']['work_dir']
+ abworkdir = workdir + '_ab'
+ options = []
+ total_json_obj = {}
+ for dir in [workdir,abworkdir]:
+ tag = 'FullText' if '_ab' not in dir else 'Abstract'
+
+ chunkdirs = glob.glob(os.path.join(dir, 'chunksize_*'))
+ chunkdirs.sort()
+ list_of_chunksize = [int(chunkdir.split('_')[-1]) for chunkdir in chunkdirs]
+ # print(list_of_chunksize)
+ jsonobj = {}
+ for chunkdir in chunkdirs:
+ k_dir = glob.glob(os.path.join(chunkdir, 'cluster_features','cluster_features_*'))
+ k_dir.sort()
+ list_of_k = [int(k.split('_')[-1]) for k in k_dir]
+ jsonobj[int(chunkdir.split('_')[-1])] = list_of_k

+ total_json_obj[tag] = jsonobj
+ newoptions = [f"{tag}, chunksize:{chunksize}, k:{k}" for chunksize in list_of_chunksize for k in jsonobj[chunksize]]
+ options.extend(newoptions)

+ return options, total_json_obj

  # @spaces.GPU(duration=360)
  def generate_database(chunksize:int,nclusters:str|list[str]):
  # run the database-generation function here
  repodir, workdir, _ = get_ready('repo_work')
+ abrepodir = repodir + '_ab'
+ abworkdir = workdir + '_ab'
  if not os.path.exists(repodir):
  return gr.Textbox(label="数据库已生成",value = '请先生成文献库',visible = True)
  nclusters = [int(i) for i in nclusters]

  chunk_size=chunksize,
  n_clusters=nclusters,
  config_path=CONFIG_PATH)
+ file_opr = FileOperation()

  # walk all files in repo dir
  files = file_opr.scan_dir(repo_dir=repodir)
  fs_init.initialize(files=files, work_dir=workdir,file_opr=file_opr)
  file_opr.summarize(files)
+
+ files = file_opr.scan_dir(repo_dir=abrepodir)
+ fs_init.initialize(files=files, work_dir=abworkdir,file_opr=file_opr)
+ file_opr.summarize(files)
+
  del fs_init
  cache.pop('default')
  texts, _ = update_database_info()

  _, workdir, _ = get_ready('repo_work')
  if os.path.exists(workdir):
  shutil.rmtree(workdir)
+ shutil.rmtree(workdir+'_ab')
  return gr.Textbox(label="数据库概况",lines =3,value = '数据库已删除',visible = True)

  def update_database_textbox():

  else:
  return gr.Textbox(label="数据库概况",value = '\n'.join(texts),visible = True)

+ def update_chunksize_dropdown(use_abstract):
  _, jsonobj = update_database_info()
+ if use_abstract:
+ choices = jsonobj['Abstract'].keys()
+ else:
+ choices = jsonobj['FullText'].keys()
+ return gr.Dropdown(choices= choices)

+ def update_ncluster_dropdown(chunksize:int,use_abstract:bool):
  _, jsonobj = update_database_info()
+ if use_abstract:
+ nclusters = jsonobj['Abstract'][chunksize]
+ else:
+ nclusters = jsonobj['FullText'][chunksize]
  return gr.Dropdown(choices= nclusters)

  # @spaces.GPU(duration=360)
+ def annotation(n,chunksize:int,nclusters:int,remote_ornot:bool,use_abstract:bool):
  '''
  use llm to annotate cluster
  n: percentage of clusters to annotate

  else:
  backend = 'local'

+ clusterdir, samples, assistant, theme = get_ready('annotation',chunksize,nclusters,use_abstract)
  new_obj_list = []
  n = round(n * len(samples.keys()))
  for cluster_no in random.sample(samples.keys(), n):

  return '\n\n'.join([obj['annotation'] for obj in new_obj_list])

  # @spaces.GPU(duration=360)
+ def inspiration(annotation:str,chunksize:int,nclusters:int,remote_ornot:bool,use_abstract:bool):
  query = 'inspiration'
  if remote_ornot:
  backend = 'remote'
  else:
  backend = 'local'

+ clusterdir, annoresult, assistant, theme = get_ready('inspiration',chunksize,nclusters,use_abstract)
  new_obj_list = []

  if annotation is not None: # if the user wants to get inspiration from specific clusters only

  return urls

  # @spaces.GPU(duration=360)
+ def summarize_text(query,chunksize:int,remote_ornot:bool,use_abstract:bool):
  if remote_ornot:
  backend = 'remote'
  else:
  backend = 'local'

+ assistant,_ = get_ready('summarize',chunksize=chunksize,k=None,use_abstract=use_abstract)
  code, reply, references = assistant.generate(query=query,
  history=[],
  groupname='',backend = backend)

  with gr.Accordion("聚类标注相关参数", open=True):
  with gr.Row():
  update_options = gr.Button("更新数据库情况", scale=0)
+ use_abstract = gr.Checkbox(label="是否仅使用摘要",scale=0)
  chunksize = gr.Dropdown([], label="选择块大小", scale=0)
  nclusters = gr.Dropdown([], label="选择聚类数", scale=0)
  ntoread = gr.Slider(

  output_references = gr.Markdown(label="参考文献")

  update_options.click(update_chunksize_dropdown,
+ inputs=[use_abstract],
  outputs=[chunksize])

  chunksize.change(update_ncluster_dropdown,
+ inputs=[chunksize,use_abstract],
  outputs= [nclusters])

  annotation_button.click(annotation,
+ inputs = [ntoread, chunksize, nclusters,remote_ornot,use_abstract],
  outputs=[annotation_output])

  inspiration_button.click(inspiration,
+ inputs= [annotation_output, chunksize, nclusters,remote_ornot,use_abstract],
  outputs=[inspiration_output])

  write_button.click(summarize_text,
+ inputs=[query, chunksize,remote_ornot,use_abstract],
  outputs =[output_text,output_references])

  demo.launch(share=False, server_name='0.0.0.0', debug=True,show_error=True,allowed_paths=['img_0.jpg'])
config.ini CHANGED
@@ -4,8 +4,8 @@ embedding_model_path = "/root/models/bce-embedding-base_v1"
  reranker_model_path = "/root/models/bce-reranker-base_v1"
  repo_dir = "repodir"
  work_dir = "workdir"
- n_clusters = [10, 20]
- chunk_size = 1024

  [web_search]
  x_api_key = "${YOUR-API-KEY}"

  reranker_model_path = "/root/models/bce-reranker-base_v1"
  repo_dir = "repodir"
  work_dir = "workdir"
+ n_clusters = [10]
+ chunk_size = 2482

  [web_search]
  x_api_key = "${YOUR-API-KEY}"
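
These [feature_store] defaults are what `get_ready` reads back at runtime via pytoml. A minimal sketch of consuming them, assuming only that pytoml is installed and config.ini sits in the working directory, as in app.py:

```python
# Read the [feature_store] defaults changed in this commit.
# Assumes config.ini is in the current directory, as app.py's get_ready does.
import pytoml

with open('config.ini', encoding='utf8') as f:
    config = pytoml.load(f)

fs = config['feature_store']
chunk_size = fs['chunk_size']  # 2482 after this commit
n_clusters = fs['n_clusters']  # [10] after this commit
work_dir = fs['work_dir']      # 'workdir'; abstract databases use 'workdir_ab'
print(chunk_size, n_clusters, work_dir)
```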
huixiangdou/service/findarticles.py CHANGED
@@ -88,6 +88,7 @@ class ArticleRetrieval:
  for docsum in root.findall('DocSum'):
  pmcid = None
  doi = None
  id_value = docsum.find('Id').text
  for item in docsum.findall('.//Item[@Name="doi"]'):
  doi = item.text
@@ -155,11 +156,16 @@ class ArticleRetrieval:
  def fetch_full_text(self):
  if not os.path.exists(self.repo_dir):
  os.makedirs(self.repo_dir)
  print(f"Saving articles to {self.repo_dir}.")
  self.pmc_success = 0
  self.scihub_success = 0
  self.failed_download = []
  downloaded = os.listdir(self.repo_dir)
  for id in tqdm(self.pmc_ids, desc="Fetching full texts", unit="article"):
  # check if file already downloaded
  if f"{id}.txt" in downloaded:
@@ -194,6 +200,27 @@ class ArticleRetrieval:
  self.scihub_success += 1
  else:
  self.failed_download.append(doi)

  def save_config(self):
  config = {
@@ -213,6 +240,8 @@ class ArticleRetrieval:
  "pmc_success_d": self.pmc_success,
  "scihub_success_d": self.scihub_success,
  "failed_download": self.failed_download,

  }
  with open(os.path.join(self.repo_dir, 'info.json'), 'w') as f:
 
  for docsum in root.findall('DocSum'):
  pmcid = None
  doi = None
+ abstract = None
  id_value = docsum.find('Id').text
  for item in docsum.findall('.//Item[@Name="doi"]'):
  doi = item.text

  def fetch_full_text(self):
  if not os.path.exists(self.repo_dir):
  os.makedirs(self.repo_dir)
+ os.makedirs(self.repo_dir + '_ab')
+
  print(f"Saving articles to {self.repo_dir}.")
  self.pmc_success = 0
  self.scihub_success = 0
+ self.abstract_success = 0
  self.failed_download = []
+ self.failed_abstract = []
  downloaded = os.listdir(self.repo_dir)
+ downloaded_ab = os.listdir(self.repo_dir + '_ab')
  for id in tqdm(self.pmc_ids, desc="Fetching full texts", unit="article"):
  # check if file already downloaded
  if f"{id}.txt" in downloaded:

  self.scihub_success += 1
  else:
  self.failed_download.append(doi)
+ for pmid in tqdm(self.pmids, desc="Fetching abstract texts", unit="article"):
+ # check if file already downloaded
+ if f"{pmid}.txt" in downloaded_ab:
+ print(f"File already downloaded: {pmid}")
+ self.abstract_success += 1
+ continue
+ base_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
+ params = {
+ "db": "pubmed",
+ "id": pmid,
+ }
+
+ response = requests.get(base_url, params=params)
+ root = ET.fromstring(response.content)
+ abstract = root.find('.//AbstractText')
+ if abstract is not None:
+ with open(os.path.join(self.repo_dir + '_ab',f'{pmid}.txt'), 'w') as f:
+ f.write(abstract.text)
+ self.abstract_success += 1
+ else:
+ self.failed_abstract.append(pmid)

  def save_config(self):
  config = {

  "pmc_success_d": self.pmc_success,
  "scihub_success_d": self.scihub_success,
  "failed_download": self.failed_download,
+ "abstract_success": self.abstract_success,
+ "failed_abstract": self.failed_abstract

  }
  with open(os.path.join(self.repo_dir, 'info.json'), 'w') as f:
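
One caveat on the new abstract fetcher above: in PubMed efetch XML, structured abstracts are split across several AbstractText nodes (with Label attributes such as BACKGROUND or RESULTS), so `root.find('.//AbstractText')` keeps only the first section. A minimal sketch of a fuller parse, assuming only requests and the standard library; the helper name is illustrative, not the repo's:

```python
# Fetch a PubMed abstract and join all AbstractText sections, since
# structured abstracts spread the text across several labeled nodes.
# Hypothetical helper, not part of the repo.
import xml.etree.ElementTree as ET

import requests

def fetch_abstract(pmid: str) -> str | None:
    base_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
    params = {"db": "pubmed", "id": pmid, "retmode": "xml"}
    root = ET.fromstring(requests.get(base_url, params=params).content)
    parts = []
    for node in root.findall('.//AbstractText'):
        label = node.get('Label')
        text = ''.join(node.itertext()).strip()  # keep text inside inline tags
        parts.append(f"{label}: {text}" if label else text)
    return '\n'.join(parts) or None

# print(fetch_abstract("31452104"))  # example PMID; any valid one works
```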
huixiangdou/service/retriever.py CHANGED
@@ -40,7 +40,7 @@ class Retriever:
  search_type='similarity',
  search_kwargs={
  'score_threshold': 0.15,
- 'k': 5
  })

  self.reordering = LongContextReorder()

  search_type='similarity',
  search_kwargs={
  'score_threshold': 0.15,
+ 'k': 10
  })

  self.reordering = LongContextReorder()
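
Raising `k` from 5 to 10 lets the retriever pass up to twice as many chunks through the 0.15 score threshold into reranking and the LLM context. Independent of the repo's LangChain stack, here is a toy numpy illustration of what thresholded top-k similarity search does (all data below is made up):

```python
# Toy thresholded top-k retrieval over unit-normalized embeddings,
# mimicking search_kwargs={'score_threshold': 0.15, 'k': 10}.
# Vectors are random stand-ins for real chunk embeddings.
import numpy as np

rng = np.random.default_rng(0)
docs = rng.normal(size=(100, 8))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
query = docs[0] + 0.1 * rng.normal(size=8)
query /= np.linalg.norm(query)

def retrieve(k: int, score_threshold: float = 0.15) -> list[int]:
    scores = docs @ query                   # cosine similarity (unit vectors)
    top = np.argsort(scores)[::-1][:k]      # best k candidates
    return [int(i) for i in top if scores[i] >= score_threshold]

print(len(retrieve(5)), len(retrieve(10)))  # k=10 can return more chunks
```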