Easy_Cite_Chip / README.md

zhaoxiang

init

5e5d326 5 months ago

	---
	title: Easy Cite Chip
	emoji: ⚡
	colorFrom: gray
	colorTo: gray
	sdk: gradio
	sdk_version: 4.18.0
	app_file: app.py
	pinned: false
	license: mit
	---

	Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference





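	1. Split the text into sentences with spaCy, then use an LLM (ChatGPT) directly to construct a suitable query for each sentence.
	- Use the LLM to construct a suitable search query; in experiments this worked much better than NER.
	- Sort the search results by `citationCount`.
	- Later, RAG will be used for sentence/paper similarity matching, to make sure a paper does not cite itself.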
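
The sorting in step 1 can be sketched in plain Python. This is a minimal illustration, not the code in this repo: `rank_by_citations` and `own_paper_id` are hypothetical names, and the input is assumed to be a list of dicts carrying the Semantic Scholar fields `paperId` and `citationCount`. The notes plan to handle self-citation through similarity matching later; the ID filter here is only a simpler stand-in for that check.

```python
from typing import Optional


def rank_by_citations(results: list[dict], own_paper_id: Optional[str] = None) -> list[dict]:
    """Sort search results by citationCount (descending), dropping the user's own paper."""
    kept = [r for r in results if r.get("paperId") != own_paper_id]
    # Assumption: a missing or null citationCount is treated as 0.
    return sorted(kept, key=lambda r: r.get("citationCount") or 0, reverse=True)


results = [
    {"paperId": "a", "citationCount": 12},
    {"paperId": "b", "citationCount": 340},
    {"paperId": "c", "citationCount": None},
]
ranked = rank_by_citations(results, own_paper_id="a")
print([r["paperId"] for r in ranked])
```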

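	2. Fetch as many papers from Semantic Scholar as possible: first get the top 30 papers relevant to the query, then keep the top 10 by citation count. These 10 papers are 10 JSON objects; remember to add a `sentence_text` key to each of them.
	4. Repeat steps 1-3 for every sentence of the intro to obtain N JSON objects, each named after its Semantic Scholar `paperId` and corresponding to the N papers related to the intro. Add these N papers to the local literature library (a large number of JSON files).
	5. Build the index incrementally with LlamaIndex on top of the local literature library `papers`.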
	- https://docs.llamaindex.ai/en/stable/examples/discover_llamaindex/document_management/Discord_Thread_Management.html#refresh-the-index-with-new-data
	- docs = papers must be kept up to date
	- the index is updated incrementally from docs
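
Steps 2 to 5 can be sketched with the standard library alone. This is a hedged illustration under stated assumptions: `select_top_cited` and `save_new_papers` are hypothetical helpers, the candidates are fake records shaped like Semantic Scholar results, and the "incremental" part is reduced to reporting which `paperId` files are new, which is the set an index refresh would need to look at.

```python
import json
import tempfile
from pathlib import Path


def select_top_cited(candidates: list[dict], sentence: str, keep: int = 10) -> list[dict]:
    """Keep the `keep` most-cited candidates and tag each with its source sentence."""
    ranked = sorted(candidates, key=lambda p: p.get("citationCount") or 0, reverse=True)
    return [{**p, "sentence_text": sentence} for p in ranked[:keep]]


def save_new_papers(papers: list[dict], library: Path) -> list[str]:
    """Write each paper to <library>/<paperId>.json; return the ids that were new."""
    library.mkdir(parents=True, exist_ok=True)
    added = []
    for paper in papers:
        path = library / f"{paper['paperId']}.json"
        if path.exists():
            continue  # already in the library, so nothing new to index
        path.write_text(json.dumps(paper, ensure_ascii=False), encoding="utf-8")
        added.append(paper["paperId"])
    return added


# 30 fake candidates standing in for one query's search results.
candidates = [{"paperId": f"p{i}", "citationCount": i} for i in range(30)]
top10 = select_top_cited(candidates, "Transformers dominate NLP.")

with tempfile.TemporaryDirectory() as tmp:
    library = Path(tmp) / "papers"
    first = save_new_papers(top10, library)
    second = save_new_papers(top10, library)  # same papers again
    print(len(first), len(second))
```

One limitation of this sketch: a paper that matches several sentences keeps only the first `sentence_text` it was saved with.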

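	6. Build a retriever (https://docs.llamaindex.ai/en/stable/module_guides/querying/retriever/root.html): use each sentence as a query to search the index for semantically similar nodes, get the BibTeX from the nodes (working back to the paperId), and update the intro and the .bib file.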
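
The last part of step 6 (BibTeX entry plus an updated sentence) might look roughly like this. It is a sketch, not the repo's code: `to_bibtex` is a hypothetical helper, the `@article` entry type and the `title`/`authors`/`year` fields are assumptions about the stored JSON, and the entry key is assumed to be the `paperId` so that the id can be read straight back from the .bib file.

```python
def to_bibtex(paper: dict) -> str:
    """Render a paper record as a BibTeX entry keyed by its paperId."""
    authors = " and ".join(a["name"] for a in paper.get("authors", []))
    fields = {
        "title": paper.get("title") or "",
        "author": authors,
        "year": str(paper.get("year") or ""),
    }
    # Skip empty fields so the entry stays valid when metadata is missing.
    body = ",\n".join(f"  {k} = {{{v}}}" for k, v in fields.items() if v)
    return "@article{" + paper["paperId"] + ",\n" + body + "\n}"


paper = {
    "paperId": "abc123",
    "title": "An Example Paper",
    "year": 2020,
    "authors": [{"name": "A. Author"}, {"name": "B. Writer"}],
}
entry = to_bibtex(paper)
cited = "Some claim from the intro. \\cite{" + paper["paperId"] + "}"
print(entry)
print(cited)
```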

	8. Build an automatic metric to evaluate the accuracy of the whole pipeline.
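
The notes do not say which metric is intended. One possible choice, shown here only as an assumption, is micro-averaged precision and recall of the predicted citations against a hand-labelled intro, with both sides given as a mapping from sentence index to a set of `paperId`s. `citation_precision_recall` is a hypothetical name.

```python
def citation_precision_recall(
    predicted: dict[int, set[str]], gold: dict[int, set[str]]
) -> tuple[float, float]:
    """Micro-averaged precision and recall over all sentences."""
    # True positives: predicted ids that also appear in the gold set of the same sentence.
    tp = sum(len(predicted.get(i, set()) & refs) for i, refs in gold.items())
    n_pred = sum(len(p) for p in predicted.values())
    n_gold = sum(len(g) for g in gold.values())
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    return precision, recall


gold = {0: {"a", "b"}, 1: {"c"}}
predicted = {0: {"a"}, 1: {"c", "d"}}
precision, recall = citation_precision_recall(predicted, gold)
print(round(precision, 3), round(recall, 3))
```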

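	7. Match citations within the results of each semantic query. Currently, similar papers are retrieved for each sentence from the entire index; the check_node_sentence_match value is very low and this approach is slow. Attempted solutions:

	- Do not retrieve with the sentence; retrieve with the LLM-generated query instead. A side benefit is a single unified local library that grows over time. Result: failed.

	- Generate the citation directly from the results of the semantic query: build a separate index for each sentence, each containing only 10 papers; use the index's retrieval to rank the 10 papers by similarity; delete the entire local library at the end of each task. A new notebook called main_lite.ipynb was created for this. Result: worked.

	- retriever.retrieve(sentence) => retriever.retrieve(search_query). Result: failed.

	- Try different embedding models.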
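
The approach that worked (rank only the 10 candidates returned for a sentence, instead of searching one global index) can be illustrated without any index library. This is a toy sketch: the bag-of-words `embed` function is a stand-in for a real embedding model, `rank_for_sentence` is a hypothetical name, and the `abstract`/`title` fields are assumptions about the stored paper JSON.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real run would use an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rank_for_sentence(sentence: str, papers: list[dict]) -> list[tuple[float, dict]]:
    """Rank only this sentence's own candidate papers by similarity to it."""
    query = embed(sentence)
    scored = [(cosine(query, embed(p.get("abstract") or p.get("title", ""))), p) for p in papers]
    return sorted(scored, key=lambda s: s[0], reverse=True)


papers = [
    {"paperId": "x", "title": "Image classification with convolutional networks"},
    {"paperId": "y", "title": "Graph neural networks for node representation learning"},
    {"paperId": "z", "title": "A survey of cooking recipes"},
]
ranked = rank_for_sentence("Graph neural networks learn node representations.", papers)
print([(round(score, 2), p["paperId"]) for score, p in ranked])
```

Because each sentence only ever sees its own short candidate list, the ranking stays cheap and the candidates are already on topic, which is consistent with the outcome recorded above.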


	9. Build the demo with Gradio.
	- https://www.gradio.app/guides/the-interface-class#example-inputs

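	10. TODO: give the user all of the top 5 results and write the score into the .bib file, so that the user can automatically choose how many papers to cite based on a threshold.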
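
The threshold idea in the TODO could work along these lines. This is only a sketch of one possible behaviour: `select_citations` is a hypothetical name, and the scores are made-up similarity values rather than output from the pipeline.

```python
def select_citations(scored: list[tuple[str, float]], threshold: float, top_k: int = 5) -> list[str]:
    """Keep at most top_k papers, then drop those scoring below the threshold."""
    top = sorted(scored, key=lambda s: s[1], reverse=True)[:top_k]
    return [paper_id for paper_id, score in top if score >= threshold]


scored = [("a", 0.91), ("b", 0.40), ("c", 0.75), ("d", 0.10), ("e", 0.66), ("f", 0.05)]
print(select_citations(scored, threshold=0.6))
print(select_citations(scored, threshold=0.0))
```

A higher threshold yields fewer citations per sentence, and a threshold of 0 returns the full top 5.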