yangdx committed
Commit 899105a · 2 Parent(s): 391879e 0fd9dfe

Merge remote-tracking branch 'origin/main' into refactor-api-server
.github/ISSUE_TEMPLATE/bug_report.yml ADDED
@@ -0,0 +1,61 @@
+name: Bug Report
+description: File a bug report
+title: "[Bug]: <title>"
+labels: ["bug", "triage"]
+
+body:
+  - type: checkboxes
+    id: existingcheck
+    attributes:
+      label: Do you need to file an issue?
+      description: Please help us manage our time by avoiding duplicates and common bugs with the steps below.
+      options:
+        - label: I have searched the existing issues and this bug is not already filed.
+        - label: I believe this is a legitimate bug, not just a question or feature request.
+  - type: textarea
+    id: description
+    attributes:
+      label: Describe the bug
+      description: A clear and concise description of what the bug is.
+      placeholder: What went wrong?
+  - type: textarea
+    id: reproduce
+    attributes:
+      label: Steps to reproduce
+      description: Steps to reproduce the behavior.
+      placeholder: How can we replicate the issue?
+  - type: textarea
+    id: expected_behavior
+    attributes:
+      label: Expected Behavior
+      description: A clear and concise description of what you expected to happen.
+      placeholder: What should have happened?
+  - type: textarea
+    id: configused
+    attributes:
+      label: LightRAG Config Used
+      description: The LightRAG configuration used for the run.
+      placeholder: The settings content or LightRAG configuration
+      value: |
+        # Paste your config here
+  - type: textarea
+    id: screenshotslogs
+    attributes:
+      label: Logs and screenshots
+      description: If applicable, add screenshots and logs to help explain your problem.
+      placeholder: Add logs and screenshots here
+  - type: textarea
+    id: additional_information
+    attributes:
+      label: Additional Information
+      description: |
+        - LightRAG Version: e.g., v0.1.1
+        - Operating System: e.g., Windows 10, Ubuntu 20.04
+        - Python Version: e.g., 3.8
+        - Related Issues: e.g., #1
+        - Any other relevant information.
+      value: |
+        - LightRAG Version:
+        - Operating System:
+        - Python Version:
+        - Related Issues:
.github/ISSUE_TEMPLATE/config.yml ADDED
@@ -0,0 +1 @@
+blank_issues_enabled: false
.github/ISSUE_TEMPLATE/feature_request.yml ADDED
@@ -0,0 +1,26 @@
+name: Feature Request
+description: File a feature request
+labels: ["enhancement"]
+title: "[Feature Request]: <title>"
+
+body:
+  - type: checkboxes
+    id: existingcheck
+    attributes:
+      label: Do you need to file a feature request?
+      description: Please help us manage our time by avoiding duplicates and common feature requests with the steps below.
+      options:
+        - label: I have searched the existing feature requests and this feature request is not already filed.
+        - label: I believe this is a legitimate feature request, not just a question or bug.
+  - type: textarea
+    id: feature_request_description
+    attributes:
+      label: Feature Request Description
+      description: A clear and concise description of the feature request you would like.
+      placeholder: What would this feature request add or improve?
+  - type: textarea
+    id: additional_context
+    attributes:
+      label: Additional Context
+      description: Add any other context or screenshots about the feature request here.
+      placeholder: Any additional information
.github/ISSUE_TEMPLATE/question.yml ADDED
@@ -0,0 +1,26 @@
+name: Question
+description: Ask a general question
+labels: ["question"]
+title: "[Question]: <title>"
+
+body:
+  - type: checkboxes
+    id: existingcheck
+    attributes:
+      label: Do you need to ask a question?
+      description: Please help us manage our time by avoiding duplicates and common questions with the steps below.
+      options:
+        - label: I have searched the existing questions and discussions and this question is not already answered.
+        - label: I believe this is a legitimate question, not just a bug or feature request.
+  - type: textarea
+    id: question
+    attributes:
+      label: Your Question
+      description: A clear and concise description of your question.
+      placeholder: What is your question?
+  - type: textarea
+    id: context
+    attributes:
+      label: Additional Context
+      description: Provide any additional context or details that might help us understand your question better.
+      placeholder: Add any relevant information here
.github/pull_request_template.md ADDED
@@ -0,0 +1,32 @@
+<!--
+Thanks for contributing to LightRAG!
+
+Please ensure your pull request is ready for review before submitting.
+
+About this template
+
+This template helps contributors provide a clear and concise description of their changes. Feel free to adjust it as needed.
+-->
+
+## Description
+
+[Briefly describe the changes made in this pull request.]
+
+## Related Issues
+
+[Reference any related issues or tasks addressed by this pull request.]
+
+## Changes Made
+
+[List the specific changes made in this pull request.]
+
+## Checklist
+
+- [ ] Changes tested locally
+- [ ] Code reviewed
+- [ ] Documentation updated (if necessary)
+- [ ] Unit tests added (if applicable)
+
+## Additional Notes
+
+[Add any additional notes or context for the reviewer(s).]
README.md CHANGED
@@ -312,7 +312,41 @@ rag = LightRAG(
 To run this experiment on a low-RAM GPU, you should select a small model and tune the context window (increasing the context increases memory consumption). For example, running this Ollama example on a repurposed mining GPU with 6 GB of RAM required setting the context size to 26k while using `gemma2:2b`. It was able to find 197 entities and 19 relations on `book.txt`.

 </details>
+<details>
+<summary> <b>LlamaIndex</b> </summary>
+
+LightRAG supports integration with LlamaIndex.
+
+1. **LlamaIndex** (`llm/llama_index_impl.py`):
+   - Integrates with OpenAI and other providers through LlamaIndex
+   - See [LlamaIndex Documentation](lightrag/llm/Readme.md) for detailed setup and examples
+
+### Example Usage
+
+```python
+# Using LlamaIndex with direct OpenAI access
+from lightrag import LightRAG
+from lightrag.llm.llama_index_impl import llama_index_complete_if_cache, llama_index_embed
+from llama_index.embeddings.openai import OpenAIEmbedding
+from llama_index.llms.openai import OpenAI
+
+rag = LightRAG(
+    working_dir="your/path",
+    llm_model_func=llama_index_complete_if_cache,  # LlamaIndex-compatible completion function
+    embedding_func=EmbeddingFunc(  # LlamaIndex-compatible embedding function
+        embedding_dim=1536,
+        max_token_size=8192,
+        func=lambda texts: llama_index_embed(texts, embed_model=embed_model)
+    ),
+)
+```
+
+#### For detailed documentation and examples, see:
+- [LlamaIndex Documentation](lightrag/llm/Readme.md)
+- [Direct OpenAI Example](examples/lightrag_llamaindex_direct_demo.py)
+- [LiteLLM Proxy Example](examples/lightrag_llamaindex_litellm_demo.py)
+
+</details>
 <details>
 <summary> <b>Conversation History Support</b> </summary>

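Note that the README snippet above references `EmbeddingFunc` and `embed_model` without defining them. A minimal completion sketch, my assumption rather than part of this commit, with the import path taken from the demo scripts added below:

```python
from lightrag.utils import EmbeddingFunc  # used but not imported in the README snippet
from llama_index.embeddings.openai import OpenAIEmbedding

# Assumed definition for the undefined `embed_model` in the snippet above
embed_model = OpenAIEmbedding(model="text-embedding-3-large")
```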
docker-compose.yml CHANGED
@@ -1,5 +1,3 @@
-version: '3.8'
-
 services:
   lightrag:
     build: .
examples/insert_custom_kg.py CHANGED
@@ -87,18 +87,27 @@ custom_kg = {
         {
             "content": "ProductX, developed by CompanyA, has revolutionized the market with its cutting-edge features.",
             "source_id": "Source1",
+            "source_chunk_index": 0,
+        },
+        {
+            "content": "One outstanding feature of ProductX is its advanced AI capabilities.",
+            "source_id": "Source1",
+            "chunk_order_index": 1,
         },
         {
             "content": "PersonA is a prominent researcher at UniversityB, focusing on artificial intelligence and machine learning.",
             "source_id": "Source2",
+            "source_chunk_index": 0,
         },
         {
             "content": "EventY, held in CityC, attracts technology enthusiasts and companies from around the globe.",
             "source_id": "Source3",
+            "source_chunk_index": 0,
         },
         {
             "content": "None",
             "source_id": "UNKNOWN",
+            "source_chunk_index": 0,
         },
     ],
 }
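For context, a hypothetical usage sketch (assumed from the rest of this example file, which is not shown in the hunk): the `custom_kg` dict above, including the new chunk-index fields, is consumed by LightRAG's custom-KG insertion API.

```python
# Assumed usage from the surrounding example; `rag` is a configured LightRAG instance
rag.insert_custom_kg(custom_kg)
```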
examples/lightrag_api_oracle_demo.py CHANGED
@@ -98,7 +98,6 @@ async def init():

     # Initialize LightRAG
     # We use Oracle DB as the KV/vector/graph storage
-    # You can add `addon_params={"example_number": 1, "language": "Simplfied Chinese"}` to control the prompt
     rag = LightRAG(
         enable_llm_cache=False,
         working_dir=WORKING_DIR,
examples/lightrag_llamaindex_direct_demo.py ADDED
@@ -0,0 +1,113 @@
+import os
+from lightrag import LightRAG, QueryParam
+from lightrag.llm.llama_index_impl import (
+    llama_index_complete_if_cache,
+    llama_index_embed,
+)
+from lightrag.utils import EmbeddingFunc
+from llama_index.llms.openai import OpenAI
+from llama_index.embeddings.openai import OpenAIEmbedding
+import asyncio
+
+# Configure working directory
+WORKING_DIR = "./index_default"
+print(f"WORKING_DIR: {WORKING_DIR}")
+
+# Model configuration
+LLM_MODEL = os.environ.get("LLM_MODEL", "gpt-4")
+print(f"LLM_MODEL: {LLM_MODEL}")
+EMBEDDING_MODEL = os.environ.get("EMBEDDING_MODEL", "text-embedding-3-large")
+print(f"EMBEDDING_MODEL: {EMBEDDING_MODEL}")
+EMBEDDING_MAX_TOKEN_SIZE = int(os.environ.get("EMBEDDING_MAX_TOKEN_SIZE", 8192))
+print(f"EMBEDDING_MAX_TOKEN_SIZE: {EMBEDDING_MAX_TOKEN_SIZE}")
+
+# OpenAI configuration
+OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY", "your-api-key-here")
+
+if not os.path.exists(WORKING_DIR):
+    print(f"Creating working directory: {WORKING_DIR}")
+    os.mkdir(WORKING_DIR)
+
+
+# Initialize LLM function
+async def llm_model_func(prompt, system_prompt=None, history_messages=[], **kwargs):
+    try:
+        # Initialize OpenAI if not in kwargs
+        if "llm_instance" not in kwargs:
+            llm_instance = OpenAI(
+                model=LLM_MODEL,
+                api_key=OPENAI_API_KEY,
+                temperature=0.7,
+            )
+            kwargs["llm_instance"] = llm_instance
+
+        response = await llama_index_complete_if_cache(
+            kwargs["llm_instance"],
+            prompt,
+            system_prompt=system_prompt,
+            history_messages=history_messages,
+            **kwargs,
+        )
+        return response
+    except Exception as e:
+        print(f"LLM request failed: {str(e)}")
+        raise
+
+
+# Initialize embedding function
+async def embedding_func(texts):
+    try:
+        embed_model = OpenAIEmbedding(
+            model=EMBEDDING_MODEL,
+            api_key=OPENAI_API_KEY,
+        )
+        return await llama_index_embed(texts, embed_model=embed_model)
+    except Exception as e:
+        print(f"Embedding failed: {str(e)}")
+        raise
+
+
+# Get embedding dimension
+async def get_embedding_dim():
+    test_text = ["This is a test sentence."]
+    embedding = await embedding_func(test_text)
+    embedding_dim = embedding.shape[1]
+    print(f"embedding_dim={embedding_dim}")
+    return embedding_dim
+
+
+# Initialize RAG instance
+rag = LightRAG(
+    working_dir=WORKING_DIR,
+    llm_model_func=llm_model_func,
+    embedding_func=EmbeddingFunc(
+        embedding_dim=asyncio.run(get_embedding_dim()),
+        max_token_size=EMBEDDING_MAX_TOKEN_SIZE,
+        func=embedding_func,
+    ),
+)
+
+# Insert example text
+with open("./book.txt", "r", encoding="utf-8") as f:
+    rag.insert(f.read())
+
+# Test different query modes
+print("\nNaive Search:")
+print(
+    rag.query("What are the top themes in this story?", param=QueryParam(mode="naive"))
+)
+
+print("\nLocal Search:")
+print(
+    rag.query("What are the top themes in this story?", param=QueryParam(mode="local"))
+)
+
+print("\nGlobal Search:")
+print(
+    rag.query("What are the top themes in this story?", param=QueryParam(mode="global"))
+)
+
+print("\nHybrid Search:")
+print(
+    rag.query("What are the top themes in this story?", param=QueryParam(mode="hybrid"))
+)
examples/lightrag_llamaindex_litellm_demo.py ADDED
@@ -0,0 +1,116 @@
+import os
+from lightrag import LightRAG, QueryParam
+from lightrag.llm.llama_index_impl import (
+    llama_index_complete_if_cache,
+    llama_index_embed,
+)
+from lightrag.utils import EmbeddingFunc
+from llama_index.llms.litellm import LiteLLM
+from llama_index.embeddings.litellm import LiteLLMEmbedding
+import asyncio
+
+# Configure working directory
+WORKING_DIR = "./index_default"
+print(f"WORKING_DIR: {WORKING_DIR}")
+
+# Model configuration
+LLM_MODEL = os.environ.get("LLM_MODEL", "gpt-4")
+print(f"LLM_MODEL: {LLM_MODEL}")
+EMBEDDING_MODEL = os.environ.get("EMBEDDING_MODEL", "text-embedding-3-large")
+print(f"EMBEDDING_MODEL: {EMBEDDING_MODEL}")
+EMBEDDING_MAX_TOKEN_SIZE = int(os.environ.get("EMBEDDING_MAX_TOKEN_SIZE", 8192))
+print(f"EMBEDDING_MAX_TOKEN_SIZE: {EMBEDDING_MAX_TOKEN_SIZE}")
+
+# LiteLLM configuration
+LITELLM_URL = os.environ.get("LITELLM_URL", "http://localhost:4000")
+print(f"LITELLM_URL: {LITELLM_URL}")
+LITELLM_KEY = os.environ.get("LITELLM_KEY", "sk-1234")
+
+if not os.path.exists(WORKING_DIR):
+    os.mkdir(WORKING_DIR)
+
+
+# Initialize LLM function
+async def llm_model_func(prompt, system_prompt=None, history_messages=[], **kwargs):
+    try:
+        # Initialize LiteLLM if not in kwargs
+        if "llm_instance" not in kwargs:
+            llm_instance = LiteLLM(
+                model=f"openai/{LLM_MODEL}",  # Format: "provider/model_name"
+                api_base=LITELLM_URL,
+                api_key=LITELLM_KEY,
+                temperature=0.7,
+            )
+            kwargs["llm_instance"] = llm_instance
+
+        response = await llama_index_complete_if_cache(
+            kwargs["llm_instance"],
+            prompt,
+            system_prompt=system_prompt,
+            history_messages=history_messages,
+            **kwargs,
+        )
+        return response
+    except Exception as e:
+        print(f"LLM request failed: {str(e)}")
+        raise
+
+
+# Initialize embedding function
+async def embedding_func(texts):
+    try:
+        embed_model = LiteLLMEmbedding(
+            model_name=f"openai/{EMBEDDING_MODEL}",
+            api_base=LITELLM_URL,
+            api_key=LITELLM_KEY,
+        )
+        return await llama_index_embed(texts, embed_model=embed_model)
+    except Exception as e:
+        print(f"Embedding failed: {str(e)}")
+        raise
+
+
+# Get embedding dimension
+async def get_embedding_dim():
+    test_text = ["This is a test sentence."]
+    embedding = await embedding_func(test_text)
+    embedding_dim = embedding.shape[1]
+    print(f"embedding_dim={embedding_dim}")
+    return embedding_dim
+
+
+# Initialize RAG instance
+rag = LightRAG(
+    working_dir=WORKING_DIR,
+    llm_model_func=llm_model_func,
+    embedding_func=EmbeddingFunc(
+        embedding_dim=asyncio.run(get_embedding_dim()),
+        max_token_size=EMBEDDING_MAX_TOKEN_SIZE,
+        func=embedding_func,
+    ),
+)
+
+# Insert example text
+with open("./book.txt", "r", encoding="utf-8") as f:
+    rag.insert(f.read())
+
+# Test different query modes
+print("\nNaive Search:")
+print(
+    rag.query("What are the top themes in this story?", param=QueryParam(mode="naive"))
+)
+
+print("\nLocal Search:")
+print(
+    rag.query("What are the top themes in this story?", param=QueryParam(mode="local"))
+)
+
+print("\nGlobal Search:")
+print(
+    rag.query("What are the top themes in this story?", param=QueryParam(mode="global"))
+)
+
+print("\nHybrid Search:")
+print(
+    rag.query("What are the top themes in this story?", param=QueryParam(mode="hybrid"))
+)
examples/lightrag_openai_compatible_stream_demo.py CHANGED
@@ -1,9 +1,8 @@
-import os
 import inspect
+import os
 from lightrag import LightRAG
 from lightrag.llm import openai_complete, openai_embed
-from lightrag.utils import EmbeddingFunc
-from lightrag.lightrag import always_get_an_event_loop
+from lightrag.utils import EmbeddingFunc, always_get_an_event_loop
 from lightrag import QueryParam

 # WorkingDir
examples/lightrag_tidb_demo.py CHANGED
@@ -63,7 +63,6 @@ async def main():

     # Initialize LightRAG
     # We use TiDB DB as the KV/vector
-    # You can add `addon_params={"example_number": 1, "language": "Simplfied Chinese"}` to control the prompt
     rag = LightRAG(
         enable_llm_cache=False,
         working_dir=WORKING_DIR,
examples/test_faiss.py CHANGED
@@ -70,7 +70,7 @@ def main():
         ),
         vector_storage="FaissVectorDBStorage",
         vector_db_storage_cls_kwargs={
-            "cosine_better_than_threshold": 0.3  # Your desired threshold
+            "cosine_better_than_threshold": 0.2  # Your desired threshold
         },
     )

lightrag/__init__.py CHANGED
@@ -1,5 +1,5 @@
 from .lightrag import LightRAG as LightRAG, QueryParam as QueryParam

-__version__ = "1.1.7"
+__version__ = "1.1.11"
 __author__ = "Zirui Guo"
 __url__ = "https://github.com/HKUDS/LightRAG"
lightrag/kg/__init__.py CHANGED
@@ -1 +1,157 @@
-# print ("init package vars here. ......")
+STORAGE_IMPLEMENTATIONS = {
+    "KV_STORAGE": {
+        "implementations": [
+            "JsonKVStorage",
+            "MongoKVStorage",
+            "RedisKVStorage",
+            "TiDBKVStorage",
+            "PGKVStorage",
+            "OracleKVStorage",
+        ],
+        "required_methods": ["get_by_id", "upsert"],
+    },
+    "GRAPH_STORAGE": {
+        "implementations": [
+            "NetworkXStorage",
+            "Neo4JStorage",
+            "MongoGraphStorage",
+            "TiDBGraphStorage",
+            "AGEStorage",
+            "GremlinStorage",
+            "PGGraphStorage",
+            "OracleGraphStorage",
+        ],
+        "required_methods": ["upsert_node", "upsert_edge"],
+    },
+    "VECTOR_STORAGE": {
+        "implementations": [
+            "NanoVectorDBStorage",
+            "MilvusVectorDBStorage",
+            "ChromaVectorDBStorage",
+            "TiDBVectorDBStorage",
+            "PGVectorStorage",
+            "FaissVectorDBStorage",
+            "QdrantVectorDBStorage",
+            "OracleVectorDBStorage",
+            "MongoVectorDBStorage",
+        ],
+        "required_methods": ["query", "upsert"],
+    },
+    "DOC_STATUS_STORAGE": {
+        "implementations": [
+            "JsonDocStatusStorage",
+            "PGDocStatusStorage",
+            "PGDocStatusStorage",
+            "MongoDocStatusStorage",
+        ],
+        "required_methods": ["get_docs_by_status"],
+    },
+}
+
+# Storage implementation environment variables without default values
+STORAGE_ENV_REQUIREMENTS: dict[str, list[str]] = {
+    # KV Storage Implementations
+    "JsonKVStorage": [],
+    "MongoKVStorage": [],
+    "RedisKVStorage": ["REDIS_URI"],
+    "TiDBKVStorage": ["TIDB_USER", "TIDB_PASSWORD", "TIDB_DATABASE"],
+    "PGKVStorage": ["POSTGRES_USER", "POSTGRES_PASSWORD", "POSTGRES_DATABASE"],
+    "OracleKVStorage": [
+        "ORACLE_DSN",
+        "ORACLE_USER",
+        "ORACLE_PASSWORD",
+        "ORACLE_CONFIG_DIR",
+    ],
+    # Graph Storage Implementations
+    "NetworkXStorage": [],
+    "Neo4JStorage": ["NEO4J_URI", "NEO4J_USERNAME", "NEO4J_PASSWORD"],
+    "MongoGraphStorage": [],
+    "TiDBGraphStorage": ["TIDB_USER", "TIDB_PASSWORD", "TIDB_DATABASE"],
+    "AGEStorage": [
+        "AGE_POSTGRES_DB",
+        "AGE_POSTGRES_USER",
+        "AGE_POSTGRES_PASSWORD",
+    ],
+    "GremlinStorage": ["GREMLIN_HOST", "GREMLIN_PORT", "GREMLIN_GRAPH"],
+    "PGGraphStorage": [
+        "POSTGRES_USER",
+        "POSTGRES_PASSWORD",
+        "POSTGRES_DATABASE",
+    ],
+    "OracleGraphStorage": [
+        "ORACLE_DSN",
+        "ORACLE_USER",
+        "ORACLE_PASSWORD",
+        "ORACLE_CONFIG_DIR",
+    ],
+    # Vector Storage Implementations
+    "NanoVectorDBStorage": [],
+    "MilvusVectorDBStorage": [],
+    "ChromaVectorDBStorage": [],
+    "TiDBVectorDBStorage": ["TIDB_USER", "TIDB_PASSWORD", "TIDB_DATABASE"],
+    "PGVectorStorage": ["POSTGRES_USER", "POSTGRES_PASSWORD", "POSTGRES_DATABASE"],
+    "FaissVectorDBStorage": [],
+    "QdrantVectorDBStorage": ["QDRANT_URL"],  # QDRANT_API_KEY has default value None
+    "OracleVectorDBStorage": [
+        "ORACLE_DSN",
+        "ORACLE_USER",
+        "ORACLE_PASSWORD",
+        "ORACLE_CONFIG_DIR",
+    ],
+    "MongoVectorDBStorage": [],
+    # Document Status Storage Implementations
+    "JsonDocStatusStorage": [],
+    "PGDocStatusStorage": ["POSTGRES_USER", "POSTGRES_PASSWORD", "POSTGRES_DATABASE"],
+    "MongoDocStatusStorage": [],
+}
+
+# Storage implementation module mapping
+STORAGES = {
+    "NetworkXStorage": ".kg.networkx_impl",
+    "JsonKVStorage": ".kg.json_kv_impl",
+    "NanoVectorDBStorage": ".kg.nano_vector_db_impl",
+    "JsonDocStatusStorage": ".kg.json_doc_status_impl",
+    "Neo4JStorage": ".kg.neo4j_impl",
+    "OracleKVStorage": ".kg.oracle_impl",
+    "OracleGraphStorage": ".kg.oracle_impl",
+    "OracleVectorDBStorage": ".kg.oracle_impl",
+    "MilvusVectorDBStorage": ".kg.milvus_impl",
+    "MongoKVStorage": ".kg.mongo_impl",
+    "MongoDocStatusStorage": ".kg.mongo_impl",
+    "MongoGraphStorage": ".kg.mongo_impl",
+    "MongoVectorDBStorage": ".kg.mongo_impl",
+    "RedisKVStorage": ".kg.redis_impl",
+    "ChromaVectorDBStorage": ".kg.chroma_impl",
+    "TiDBKVStorage": ".kg.tidb_impl",
+    "TiDBVectorDBStorage": ".kg.tidb_impl",
+    "TiDBGraphStorage": ".kg.tidb_impl",
+    "PGKVStorage": ".kg.postgres_impl",
+    "PGVectorStorage": ".kg.postgres_impl",
+    "AGEStorage": ".kg.age_impl",
+    "PGGraphStorage": ".kg.postgres_impl",
+    "GremlinStorage": ".kg.gremlin_impl",
+    "PGDocStatusStorage": ".kg.postgres_impl",
+    "FaissVectorDBStorage": ".kg.faiss_impl",
+    "QdrantVectorDBStorage": ".kg.qdrant_impl",
+}
+
+
+def verify_storage_implementation(storage_type: str, storage_name: str) -> None:
+    """Verify if storage implementation is compatible with specified storage type
+
+    Args:
+        storage_type: Storage type (KV_STORAGE, GRAPH_STORAGE etc.)
+        storage_name: Storage implementation name
+
+    Raises:
+        ValueError: If storage implementation is incompatible or missing required methods
+    """
+    if storage_type not in STORAGE_IMPLEMENTATIONS:
+        raise ValueError(f"Unknown storage type: {storage_type}")
+
+    storage_info = STORAGE_IMPLEMENTATIONS[storage_type]
+    if storage_name not in storage_info["implementations"]:
+        raise ValueError(
+            f"Storage implementation '{storage_name}' is not compatible with {storage_type}. "
+            f"Compatible implementations are: {', '.join(storage_info['implementations'])}"
+        )
lightrag/kg/age_impl.py CHANGED
@@ -34,14 +34,9 @@ if not pm.is_installed("psycopg-pool"):
 if not pm.is_installed("asyncpg"):
     pm.install("asyncpg")

-try:
-    import psycopg
-    from psycopg.rows import namedtuple_row
-    from psycopg_pool import AsyncConnectionPool, PoolTimeout
-except ImportError:
-    raise ImportError(
-        "`psycopg-pool, psycopg[binary,pool], asyncpg` library is not installed. Please install it via pip: `pip install psycopg-pool psycopg[binary,pool] asyncpg`."
-    )
+import psycopg
+from psycopg.rows import namedtuple_row
+from psycopg_pool import AsyncConnectionPool, PoolTimeout


 class AGEQueryException(Exception):
lightrag/kg/chroma_impl.py CHANGED
@@ -10,13 +10,8 @@ import pipmaster as pm
 if not pm.is_installed("chromadb"):
     pm.install("chromadb")

-try:
-    from chromadb import HttpClient, PersistentClient
-    from chromadb.config import Settings
-except ImportError as e:
-    raise ImportError(
-        "`chromadb` library is not installed. Please install it via pip: `pip install chromadb`."
-    ) from e
+from chromadb import HttpClient, PersistentClient
+from chromadb.config import Settings


 @final
@@ -113,9 +108,9 @@ class ChromaVectorDBStorage(BaseVectorStorage):
             raise

     async def upsert(self, data: dict[str, dict[str, Any]]) -> None:
+        logger.info(f"Inserting {len(data)} to {self.namespace}")
         if not data:
-            logger.warning("Empty data provided to vector DB")
-            return []
+            return

         try:
             ids = list(data.keys())
lightrag/kg/faiss_impl.py CHANGED
@@ -20,12 +20,7 @@ from lightrag.base import (
 if not pm.is_installed("faiss"):
     pm.install("faiss")

-try:
-    import faiss
-except ImportError as e:
-    raise ImportError(
-        "`faiss` library is not installed. Please install it via pip: `pip install faiss`."
-    ) from e
+import faiss


 @final
@@ -84,10 +79,9 @@ class FaissVectorDBStorage(BaseVectorStorage):
             ...
         }
         """
-        logger.info(f"Inserting {len(data)} vectors to {self.namespace}")
+        logger.info(f"Inserting {len(data)} to {self.namespace}")
         if not data:
-            logger.warning("You are inserting empty data to the vector DB")
-            return []
+            return

         current_time = time.time()
lightrag/kg/gremlin_impl.py CHANGED
@@ -2,6 +2,7 @@ import asyncio
 import inspect
 import json
 import os
+import pipmaster as pm
 from dataclasses import dataclass
 from typing import Any, Dict, List, final

@@ -20,14 +21,12 @@ from lightrag.utils import logger

 from ..base import BaseGraphStorage

-try:
-    from gremlin_python.driver import client, serializer
-    from gremlin_python.driver.aiohttp.transport import AiohttpTransport
-    from gremlin_python.driver.protocol import GremlinServerError
-except ImportError as e:
-    raise ImportError(
-        "`gremlin` library is not installed. Please install it via pip: `pip install gremlin`."
-    ) from e
+if not pm.is_installed("gremlinpython"):
+    pm.install("gremlinpython")
+
+from gremlin_python.driver import client, serializer
+from gremlin_python.driver.aiohttp.transport import AiohttpTransport
+from gremlin_python.driver.protocol import GremlinServerError


 @final
lightrag/kg/json_doc_status_impl.py CHANGED
@@ -67,6 +67,10 @@ class JsonDocStatusStorage(DocStatusStorage):
         write_json(self._data, self._file_name)

     async def upsert(self, data: dict[str, dict[str, Any]]) -> None:
+        logger.info(f"Inserting {len(data)} to {self.namespace}")
+        if not data:
+            return
+
         self._data.update(data)
         await self.index_done_callback()

lightrag/kg/json_kv_impl.py CHANGED
@@ -43,6 +43,9 @@ class JsonKVStorage(BaseKVStorage):
         return set(keys) - set(self._data.keys())

     async def upsert(self, data: dict[str, dict[str, Any]]) -> None:
+        logger.info(f"Inserting {len(data)} to {self.namespace}")
+        if not data:
+            return
         left_data = {k: v for k, v in data.items() if k not in self._data}
         self._data.update(left_data)

lightrag/kg/milvus_impl.py CHANGED
@@ -14,13 +14,8 @@ if not pm.is_installed("configparser"):
 if not pm.is_installed("pymilvus"):
     pm.install("pymilvus")

-try:
-    import configparser
-    from pymilvus import MilvusClient
-except ImportError as e:
-    raise ImportError(
-        "`pymilvus` library is not installed. Please install it via pip: `pip install pymilvus`."
-    ) from e
+import configparser
+from pymilvus import MilvusClient

 config = configparser.ConfigParser()
 config.read("config.ini", "utf-8")
@@ -80,11 +75,11 @@ class MilvusVectorDBStorage(BaseVectorStorage):
         )

     async def upsert(self, data: dict[str, dict[str, Any]]) -> None:
-        logger.info(f"Inserting {len(data)} vectors to {self.namespace}")
-        if not len(data):
-            logger.warning("You insert an empty data to vector DB")
-            return []
-        list_data = [
+        logger.info(f"Inserting {len(data)} to {self.namespace}")
+        if not data:
+            return
+
+        list_data: list[dict[str, Any]] = [
             {
                 "id": k,
                 **{k1: v1 for k1, v1 in v.items() if k1 in self.meta_fields},
lightrag/kg/mongo_impl.py CHANGED
@@ -25,18 +25,13 @@ if not pm.is_installed("pymongo"):
 if not pm.is_installed("motor"):
     pm.install("motor")

-try:
-    from motor.motor_asyncio import (
-        AsyncIOMotorClient,
-        AsyncIOMotorDatabase,
-        AsyncIOMotorCollection,
-    )
-    from pymongo.operations import SearchIndexModel
-    from pymongo.errors import PyMongoError
-except ImportError as e:
-    raise ImportError(
-        "`motor, pymongo` library is not installed. Please install it via pip: `pip install motor pymongo`."
-    ) from e
+from motor.motor_asyncio import (
+    AsyncIOMotorClient,
+    AsyncIOMotorDatabase,
+    AsyncIOMotorCollection,
+)
+from pymongo.operations import SearchIndexModel
+from pymongo.errors import PyMongoError

 config = configparser.ConfigParser()
 config.read("config.ini", "utf-8")
@@ -113,8 +108,12 @@ class MongoKVStorage(BaseKVStorage):
         return keys - existing_ids

     async def upsert(self, data: dict[str, dict[str, Any]]) -> None:
+        logger.info(f"Inserting {len(data)} to {self.namespace}")
+        if not data:
+            return
+
         if is_namespace(self.namespace, NameSpace.KV_STORE_LLM_RESPONSE_CACHE):
-            update_tasks = []
+            update_tasks: list[Any] = []
             for mode, items in data.items():
                 for k, v in items.items():
                     key = f"{mode}_{k}"
@@ -186,7 +185,10 @@ class MongoDocStatusStorage(DocStatusStorage):
         return data - existing_ids

     async def upsert(self, data: dict[str, dict[str, Any]]) -> None:
-        update_tasks = []
+        logger.info(f"Inserting {len(data)} to {self.namespace}")
+        if not data:
+            return
+        update_tasks: list[Any] = []
         for k, v in data.items():
             data[k]["_id"] = k
             update_tasks.append(
@@ -860,10 +862,9 @@ class MongoVectorDBStorage(BaseVectorStorage):
             logger.debug("vector index already exist")

     async def upsert(self, data: dict[str, dict[str, Any]]) -> None:
-        logger.debug(f"Inserting {len(data)} vectors to {self.namespace}")
+        logger.info(f"Inserting {len(data)} to {self.namespace}")
         if not data:
-            logger.warning("You are inserting an empty data set to vector DB")
-            return []
+            return

         list_data = [
             {
lightrag/kg/nano_vector_db_impl.py CHANGED
@@ -18,12 +18,7 @@ from lightrag.base import (
 if not pm.is_installed("nano-vectordb"):
     pm.install("nano-vectordb")

-try:
-    from nano_vectordb import NanoVectorDB
-except ImportError as e:
-    raise ImportError(
-        "`nano-vectordb` library is not installed. Please install it via pip: `pip install nano-vectordb`."
-    ) from e
+from nano_vectordb import NanoVectorDB


 @final
@@ -50,10 +45,9 @@ class NanoVectorDBStorage(BaseVectorStorage):
         )

     async def upsert(self, data: dict[str, dict[str, Any]]) -> None:
-        logger.info(f"Inserting {len(data)} vectors to {self.namespace}")
-        if not len(data):
-            logger.warning("You insert an empty data to vector DB")
-            return []
+        logger.info(f"Inserting {len(data)} to {self.namespace}")
+        if not data:
+            return

         current_time = time.time()
         list_data = [
lightrag/kg/neo4j_impl.py CHANGED
@@ -23,18 +23,13 @@ import pipmaster as pm
 if not pm.is_installed("neo4j"):
     pm.install("neo4j")

-try:
-    from neo4j import (
-        AsyncGraphDatabase,
-        exceptions as neo4jExceptions,
-        AsyncDriver,
-        AsyncManagedTransaction,
-        GraphDatabase,
-    )
-except ImportError as e:
-    raise ImportError(
-        "`neo4j` library is not installed. Please install it via pip: `pip install neo4j`."
-    ) from e
+from neo4j import (
+    AsyncGraphDatabase,
+    exceptions as neo4jExceptions,
+    AsyncDriver,
+    AsyncManagedTransaction,
+    GraphDatabase,
+)

 config = configparser.ConfigParser()
 config.read("config.ini", "utf-8")
lightrag/kg/networkx_impl.py CHANGED
@@ -17,16 +17,12 @@ import pipmaster as pm

 if not pm.is_installed("networkx"):
     pm.install("networkx")
+
 if not pm.is_installed("graspologic"):
     pm.install("graspologic")

-try:
-    from graspologic import embed
-    import networkx as nx
-except ImportError as e:
-    raise ImportError(
-        "`networkx` library is not installed. Please install it via pip: `pip install networkx`."
-    ) from e
+import networkx as nx
+from graspologic import embed


 @final
lightrag/kg/oracle_impl.py CHANGED
@@ -26,14 +26,8 @@ if not pm.is_installed("graspologic"):
 if not pm.is_installed("oracledb"):
     pm.install("oracledb")

-try:
-    from graspologic import embed
-    import oracledb
-
-except ImportError as e:
-    raise ImportError(
-        "`oracledb` library is not installed. Please install it via pip: `pip install oracledb`."
-    ) from e
+from graspologic import embed
+import oracledb


 class OracleDB:
@@ -51,7 +45,7 @@ class OracleDB:
         self.increment = 1
         logger.info(f"Using the label {self.workspace} for Oracle Graph as identifier")
         if self.user is None or self.password is None:
-            raise ValueError("Missing database user or password in addon_params")
+            raise ValueError("Missing database user or password")

         try:
             oracledb.defaults.fetch_lobs = False
@@ -332,6 +326,10 @@ class OracleKVStorage(BaseKVStorage):

     ################ INSERT METHODS ################
     async def upsert(self, data: dict[str, dict[str, Any]]) -> None:
+        logger.info(f"Inserting {len(data)} to {self.namespace}")
+        if not data:
+            return
+
         if is_namespace(self.namespace, NameSpace.KV_STORE_TEXT_CHUNKS):
             list_data = [
                 {
lightrag/kg/postgres_impl.py CHANGED
@@ -38,14 +38,8 @@ import pipmaster as pm
 if not pm.is_installed("asyncpg"):
     pm.install("asyncpg")

-try:
-    import asyncpg
-    from asyncpg import Pool
-
-except ImportError as e:
-    raise ImportError(
-        "`asyncpg` library is not installed. Please install it via pip: `pip install asyncpg`."
-    ) from e
+import asyncpg
+from asyncpg import Pool


 class PostgreSQLDB:
@@ -61,9 +55,7 @@ class PostgreSQLDB:
         self.pool: Pool | None = None

         if self.user is None or self.password is None or self.database is None:
-            raise ValueError(
-                "Missing database user, password, or database in addon_params"
-            )
+            raise ValueError("Missing database user, password, or database")

     async def initdb(self):
         try:
@@ -353,6 +345,10 @@ class PGKVStorage(BaseKVStorage):

     ################ INSERT METHODS ################
     async def upsert(self, data: dict[str, dict[str, Any]]) -> None:
+        logger.info(f"Inserting {len(data)} to {self.namespace}")
+        if not data:
+            return
+
         if is_namespace(self.namespace, NameSpace.KV_STORE_TEXT_CHUNKS):
             pass
         elif is_namespace(self.namespace, NameSpace.KV_STORE_FULL_DOCS):
@@ -454,10 +450,10 @@ class PGVectorStorage(BaseVectorStorage):
         return upsert_sql, data

     async def upsert(self, data: dict[str, dict[str, Any]]) -> None:
-        logger.info(f"Inserting {len(data)} vectors to {self.namespace}")
-        if not len(data):
-            logger.warning("You insert an empty data to vector DB")
-            return []
+        logger.info(f"Inserting {len(data)} to {self.namespace}")
+        if not data:
+            return
+
         current_time = time.time()
         list_data = [
             {
@@ -618,6 +614,10 @@ class PGDocStatusStorage(DocStatusStorage):
         Args:
             data: dictionary of document IDs and their status data
         """
+        logger.info(f"Inserting {len(data)} to {self.namespace}")
+        if not data:
+            return
+
         sql = """insert into LIGHTRAG_DOC_STATUS(workspace,id,content,content_summary,content_length,chunks_count,status)
                  values($1,$2,$3,$4,$5,$6,$7)
                  on conflict(id,workspace) do update set
lightrag/kg/qdrant_impl.py CHANGED
@@ -15,16 +15,10 @@ config.read("config.ini", "utf-8")

 import pipmaster as pm

-if not pm.is_installed("qdrant_client"):
-    pm.install("qdrant_client")
+if not pm.is_installed("qdrant-client"):
+    pm.install("qdrant-client")

-try:
-    from qdrant_client import QdrantClient, models
-
-except ImportError:
-    raise ImportError(
-        "`qdrant_client` library is not installed. Please install it via pip: `pip install qdrant-client`."
-    )
+from qdrant_client import QdrantClient, models


 def compute_mdhash_id_for_qdrant(
@@ -93,9 +87,9 @@ class QdrantVectorDBStorage(BaseVectorStorage):
         )

     async def upsert(self, data: dict[str, dict[str, Any]]) -> None:
-        if not len(data):
-            logger.warning("You insert an empty data to vector DB")
-            return []
+        logger.info(f"Inserting {len(data)} to {self.namespace}")
+        if not data:
+            return
         list_data = [
             {
                 "id": k,
lightrag/kg/redis_impl.py CHANGED
@@ -49,6 +49,9 @@ class RedisKVStorage(BaseKVStorage):
         return set(keys) - existing_ids

     async def upsert(self, data: dict[str, dict[str, Any]]) -> None:
+        logger.info(f"Inserting {len(data)} to {self.namespace}")
+        if not data:
+            return
         pipe = self._redis.pipeline()

         for k, v in data.items():
lightrag/kg/tidb_impl.py CHANGED
@@ -20,13 +20,7 @@ if not pm.is_installed("pymysql"):
 if not pm.is_installed("sqlalchemy"):
     pm.install("sqlalchemy")

-try:
-    from sqlalchemy import create_engine, text
-
-except ImportError as e:
-    raise ImportError(
-        "`pymysql, sqlalchemy` library is not installed. Please install it via pip: `pip install pymysql sqlalchemy`."
-    ) from e
+from sqlalchemy import create_engine, text


 class TiDB:
@@ -217,6 +211,9 @@ class TiDBKVStorage(BaseKVStorage):

     ################ INSERT full_doc AND chunks ################
     async def upsert(self, data: dict[str, dict[str, Any]]) -> None:
+        logger.info(f"Inserting {len(data)} to {self.namespace}")
+        if not data:
+            return
         left_data = {k: v for k, v in data.items() if k not in self._data}
         self._data.update(left_data)
         if is_namespace(self.namespace, NameSpace.KV_STORE_TEXT_CHUNKS):
@@ -324,12 +321,12 @@ class TiDBVectorDBStorage(BaseVectorStorage):

     ###### INSERT entities And relationships ######
     async def upsert(self, data: dict[str, dict[str, Any]]) -> None:
-        # ignore, upsert in TiDBKVStorage already
-        if not len(data):
-            logger.warning("You insert an empty data to vector DB")
-            return []
+        logger.info(f"Inserting {len(data)} to {self.namespace}")
+        if not data:
+            return
+
         if is_namespace(self.namespace, NameSpace.VECTOR_STORE_CHUNKS):
-            return []
+            return

         logger.info(f"Inserting {len(data)} vectors to {self.namespace}")

         list_data = [
lightrag/lightrag.py CHANGED
@@ -6,7 +6,13 @@ import configparser
6
  from dataclasses import asdict, dataclass, field
7
  from datetime import datetime
8
  from functools import partial
9
- from typing import Any, AsyncIterator, Callable, Iterator, cast
 
 
 
 
 
 
10
 
11
  from .base import (
12
  BaseGraphStorage,
@@ -32,221 +38,37 @@ from .operate import (
32
  from .prompt import GRAPH_FIELD_SEP
33
  from .utils import (
34
  EmbeddingFunc,
 
35
  compute_mdhash_id,
36
  convert_response_to_json,
 
37
  limit_async_func_call,
38
  logger,
39
  set_logger,
 
40
  )
41
  from .types import KnowledgeGraph
42
 
 
43
  config = configparser.ConfigParser()
44
  config.read("config.ini", "utf-8")
45
 
46
- # Storage type and implementation compatibility validation table
47
- STORAGE_IMPLEMENTATIONS = {
48
- "KV_STORAGE": {
49
- "implementations": [
50
- "JsonKVStorage",
51
- "MongoKVStorage",
52
- "RedisKVStorage",
53
- "TiDBKVStorage",
54
- "PGKVStorage",
55
- "OracleKVStorage",
56
- ],
57
- "required_methods": ["get_by_id", "upsert"],
58
- },
59
- "GRAPH_STORAGE": {
60
- "implementations": [
61
- "NetworkXStorage",
62
- "Neo4JStorage",
63
- "MongoGraphStorage",
64
- "TiDBGraphStorage",
65
- "AGEStorage",
66
- "GremlinStorage",
67
- "PGGraphStorage",
68
- "OracleGraphStorage",
69
- ],
70
- "required_methods": ["upsert_node", "upsert_edge"],
71
- },
72
- "VECTOR_STORAGE": {
73
- "implementations": [
74
- "NanoVectorDBStorage",
75
- "MilvusVectorDBStorage",
76
- "ChromaVectorDBStorage",
77
- "TiDBVectorDBStorage",
78
- "PGVectorStorage",
79
- "FaissVectorDBStorage",
80
- "QdrantVectorDBStorage",
81
- "OracleVectorDBStorage",
82
- "MongoVectorDBStorage",
83
- ],
84
- "required_methods": ["query", "upsert"],
85
- },
86
- "DOC_STATUS_STORAGE": {
87
- "implementations": [
88
- "JsonDocStatusStorage",
89
- "PGDocStatusStorage",
90
- "PGDocStatusStorage",
91
- "MongoDocStatusStorage",
92
- ],
93
- "required_methods": ["get_docs_by_status"],
94
- },
95
- }
96
-
97
- # Storage implementation environment variable without default value
98
- STORAGE_ENV_REQUIREMENTS: dict[str, list[str]] = {
99
- # KV Storage Implementations
100
- "JsonKVStorage": [],
101
- "MongoKVStorage": [],
102
- "RedisKVStorage": ["REDIS_URI"],
103
- "TiDBKVStorage": ["TIDB_USER", "TIDB_PASSWORD", "TIDB_DATABASE"],
104
- "PGKVStorage": ["POSTGRES_USER", "POSTGRES_PASSWORD", "POSTGRES_DATABASE"],
105
- "OracleKVStorage": [
106
- "ORACLE_DSN",
107
- "ORACLE_USER",
108
- "ORACLE_PASSWORD",
109
- "ORACLE_CONFIG_DIR",
110
- ],
111
- # Graph Storage Implementations
112
- "NetworkXStorage": [],
113
- "Neo4JStorage": ["NEO4J_URI", "NEO4J_USERNAME", "NEO4J_PASSWORD"],
114
- "MongoGraphStorage": [],
115
- "TiDBGraphStorage": ["TIDB_USER", "TIDB_PASSWORD", "TIDB_DATABASE"],
116
- "AGEStorage": [
117
- "AGE_POSTGRES_DB",
118
- "AGE_POSTGRES_USER",
119
- "AGE_POSTGRES_PASSWORD",
120
- ],
121
- "GremlinStorage": ["GREMLIN_HOST", "GREMLIN_PORT", "GREMLIN_GRAPH"],
122
- "PGGraphStorage": [
123
- "POSTGRES_USER",
124
- "POSTGRES_PASSWORD",
125
- "POSTGRES_DATABASE",
126
- ],
127
- "OracleGraphStorage": [
128
- "ORACLE_DSN",
129
- "ORACLE_USER",
130
- "ORACLE_PASSWORD",
131
- "ORACLE_CONFIG_DIR",
132
- ],
133
- # Vector Storage Implementations
134
- "NanoVectorDBStorage": [],
135
- "MilvusVectorDBStorage": [],
136
- "ChromaVectorDBStorage": [],
137
- "TiDBVectorDBStorage": ["TIDB_USER", "TIDB_PASSWORD", "TIDB_DATABASE"],
138
- "PGVectorStorage": ["POSTGRES_USER", "POSTGRES_PASSWORD", "POSTGRES_DATABASE"],
139
- "FaissVectorDBStorage": [],
140
- "QdrantVectorDBStorage": ["QDRANT_URL"], # QDRANT_API_KEY has default value None
141
- "OracleVectorDBStorage": [
142
- "ORACLE_DSN",
143
- "ORACLE_USER",
144
- "ORACLE_PASSWORD",
145
- "ORACLE_CONFIG_DIR",
146
- ],
147
- "MongoVectorDBStorage": [],
148
- # Document Status Storage Implementations
149
- "JsonDocStatusStorage": [],
150
- "PGDocStatusStorage": ["POSTGRES_USER", "POSTGRES_PASSWORD", "POSTGRES_DATABASE"],
151
- "MongoDocStatusStorage": [],
152
- }
153
-
154
- # Storage implementation module mapping
155
- STORAGES = {
156
- "NetworkXStorage": ".kg.networkx_impl",
157
- "JsonKVStorage": ".kg.json_kv_impl",
158
- "NanoVectorDBStorage": ".kg.nano_vector_db_impl",
159
- "JsonDocStatusStorage": ".kg.json_doc_status_impl",
160
- "Neo4JStorage": ".kg.neo4j_impl",
161
- "OracleKVStorage": ".kg.oracle_impl",
162
- "OracleGraphStorage": ".kg.oracle_impl",
163
- "OracleVectorDBStorage": ".kg.oracle_impl",
164
- "MilvusVectorDBStorage": ".kg.milvus_impl",
165
- "MongoKVStorage": ".kg.mongo_impl",
166
- "MongoDocStatusStorage": ".kg.mongo_impl",
167
- "MongoGraphStorage": ".kg.mongo_impl",
168
- "MongoVectorDBStorage": ".kg.mongo_impl",
169
- "RedisKVStorage": ".kg.redis_impl",
170
- "ChromaVectorDBStorage": ".kg.chroma_impl",
171
- "TiDBKVStorage": ".kg.tidb_impl",
172
- "TiDBVectorDBStorage": ".kg.tidb_impl",
173
- "TiDBGraphStorage": ".kg.tidb_impl",
174
- "PGKVStorage": ".kg.postgres_impl",
175
- "PGVectorStorage": ".kg.postgres_impl",
176
- "AGEStorage": ".kg.age_impl",
177
- "PGGraphStorage": ".kg.postgres_impl",
178
- "GremlinStorage": ".kg.gremlin_impl",
179
- "PGDocStatusStorage": ".kg.postgres_impl",
180
- "FaissVectorDBStorage": ".kg.faiss_impl",
181
- "QdrantVectorDBStorage": ".kg.qdrant_impl",
182
- }
183
-
184
-
185
- def lazy_external_import(module_name: str, class_name: str) -> Callable[..., Any]:
186
- """Lazily import a class from an external module based on the package of the caller."""
187
- # Get the caller's module and package
188
- import inspect
189
-
190
- caller_frame = inspect.currentframe().f_back
191
- module = inspect.getmodule(caller_frame)
192
- package = module.__package__ if module else None
193
-
194
- def import_class(*args: Any, **kwargs: Any):
195
- import importlib
196
-
197
- module = importlib.import_module(module_name, package=package)
198
- cls = getattr(module, class_name)
199
- return cls(*args, **kwargs)
200
-
201
- return import_class
202
-
203
-
204
- def always_get_an_event_loop() -> asyncio.AbstractEventLoop:
205
- """
206
- Ensure that there is always an event loop available.
207
-
208
- This function tries to get the current event loop. If the current event loop is closed or does not exist,
209
- it creates a new event loop and sets it as the current event loop.
210
-
211
- Returns:
212
- asyncio.AbstractEventLoop: The current or newly created event loop.
213
- """
214
- try:
215
- # Try to get the current event loop
216
- current_loop = asyncio.get_event_loop()
217
- if current_loop.is_closed():
218
- raise RuntimeError("Event loop is closed.")
219
- return current_loop
220
-
221
- except RuntimeError:
222
- # If no event loop exists or it is closed, create a new one
223
- logger.info("Creating a new event loop in main thread.")
224
- new_loop = asyncio.new_event_loop()
225
- asyncio.set_event_loop(new_loop)
226
- return new_loop
227
-
228
 
 
229
  @dataclass
230
  class LightRAG:
231
  """LightRAG: Simple and Fast Retrieval-Augmented Generation."""
232
 
 
 
 
233
  working_dir: str = field(
234
- default_factory=lambda: f"./lightrag_cache_{datetime.now().strftime('%Y-%m-%d-%H:%M:%S')}"
235
  )
236
  """Directory where cache and temporary files are stored."""
237
 
238
- embedding_cache_config: dict[str, Any] = field(
239
- default_factory=lambda: {
240
- "enabled": False,
241
- "similarity_threshold": 0.95,
242
- "use_llm_check": False,
243
- }
244
- )
245
- """Configuration for embedding cache.
246
- - enabled: If True, enables caching to avoid redundant computations.
247
- - similarity_threshold: Minimum similarity score to use cached embeddings.
248
- - use_llm_check: If True, validates cached embeddings using an LLM.
249
- """
250
 
251
  kv_storage: str = field(default="JsonKVStorage")
252
  """Storage backend for key-value data."""
@@ -261,32 +83,74 @@ class LightRAG:
261
  """Storage type for tracking document processing statuses."""
262
 
263
  # Logging
264
- current_log_level = logger.level
265
- log_level: int = field(default=current_log_level)
 
266
  """Logging level for the system (e.g., 'DEBUG', 'INFO', 'WARNING')."""
267
 
268
- log_dir: str = field(default=os.getcwd())
269
- """Directory where logs are stored. Defaults to the current working directory."""
 
 
 
 
 
 
 
 
 
 
270
 
271
  # Text chunking
272
- chunk_token_size: int = int(os.getenv("CHUNK_SIZE", "1200"))
 
 
273
  """Maximum number of tokens per text chunk when splitting documents."""
274
 
275
- chunk_overlap_token_size: int = int(os.getenv("CHUNK_OVERLAP_SIZE", "100"))
 
 
276
  """Number of overlapping tokens between consecutive text chunks to preserve context."""
277
 
278
- tiktoken_model_name: str = "gpt-4o-mini"
279
  """Model name used for tokenization when chunking text."""
280
 
281
- # Entity extraction
282
- entity_extract_max_gleaning: int = 1
283
- """Maximum number of entity extraction attempts for ambiguous content."""
284
-
285
- entity_summary_to_max_tokens: int = int(os.getenv("MAX_TOKEN_SUMMARY", "500"))
286
  """Maximum number of tokens used for summarizing extracted entities."""
287
 
 
 
288
  # Node embedding
289
- node_embedding_algorithm: str = "node2vec"
 
 
290
  """Algorithm used for node embedding in knowledge graphs."""
291
 
292
  node2vec_params: dict[str, int] = field(
@@ -308,116 +172,102 @@ class LightRAG:
308
  - random_seed: Seed value for reproducibility.
309
  """
310
 
311
- embedding_func: EmbeddingFunc | None = None
 
 
 
312
  """Function for computing text embeddings. Must be set before use."""
313
 
314
- embedding_batch_num: int = 32
315
  """Batch size for embedding computations."""
316
 
317
- embedding_func_max_async: int = 16
318
  """Maximum number of concurrent embedding function calls."""
319
 
 
 
320
  # LLM Configuration
321
- llm_model_func: Callable[..., object] | None = None
 
 
322
  """Function for interacting with the large language model (LLM). Must be set before use."""
323
 
324
- llm_model_name: str = "meta-llama/Llama-3.2-1B-Instruct"
325
  """Name of the LLM model used for generating responses."""
326
 
327
- llm_model_max_token_size: int = int(os.getenv("MAX_TOKENS", "32768"))
328
  """Maximum number of tokens allowed per LLM response."""
329
 
330
- llm_model_max_async: int = int(os.getenv("MAX_ASYNC", "16"))
331
  """Maximum number of concurrent LLM calls."""
332
 
333
  llm_model_kwargs: dict[str, Any] = field(default_factory=dict)
334
  """Additional keyword arguments passed to the LLM model function."""
335
 
336
  # Storage
 
 
337
  vector_db_storage_cls_kwargs: dict[str, Any] = field(default_factory=dict)
338
  """Additional parameters for vector database storage."""
339
 
340
  namespace_prefix: str = field(default="")
341
  """Prefix for namespacing stored data across different environments."""
342
 
343
- enable_llm_cache: bool = True
344
  """Enables caching for LLM responses to avoid redundant computations."""
345
 
346
- enable_llm_cache_for_entity_extract: bool = True
347
  """If True, enables caching for entity extraction steps to reduce LLM costs."""
348
 
349
  # Extensions
350
- addon_params: dict[str, Any] = field(default_factory=dict)
351
 
352
- # Storages Management
353
- auto_manage_storages_states: bool = True
354
- """If True, lightrag will automatically calls initialize_storages and finalize_storages at the appropriate times."""
355
-
356
- """Dictionary for additional parameters and extensions."""
357
- convert_response_to_json_func: Callable[[str], dict[str, Any]] = (
358
- convert_response_to_json
359
- )
360
-
361
- # Custom Chunking Function
362
- chunking_func: Callable[
363
- [
364
- str,
365
- str | None,
366
- bool,
367
- int,
368
- int,
369
- str,
370
- ],
371
- list[dict[str, Any]],
372
- ] = chunking_by_token_size
373
 
374
- def verify_storage_implementation(
375
- self, storage_type: str, storage_name: str
376
- ) -> None:
377
- """Verify if storage implementation is compatible with specified storage type
378
 
379
- Args:
380
- storage_type: Storage type (KV_STORAGE, GRAPH_STORAGE etc.)
381
- storage_name: Storage implementation name
382
 
383
- Raises:
384
- ValueError: If storage implementation is incompatible or missing required methods
385
- """
386
- if storage_type not in STORAGE_IMPLEMENTATIONS:
387
- raise ValueError(f"Unknown storage type: {storage_type}")
388
 
389
- storage_info = STORAGE_IMPLEMENTATIONS[storage_type]
390
- if storage_name not in storage_info["implementations"]:
391
- raise ValueError(
392
- f"Storage implementation '{storage_name}' is not compatible with {storage_type}. "
393
- f"Compatible implementations are: {', '.join(storage_info['implementations'])}"
394
- )
395
 
396
- def check_storage_env_vars(self, storage_name: str) -> None:
397
- """Check if all required environment variables for storage implementation exist
 
 
 
398
 
399
- Args:
400
- storage_name: Storage implementation name
401
 
402
- Raises:
403
- ValueError: If required environment variables are missing
404
- """
405
- required_vars = STORAGE_ENV_REQUIREMENTS.get(storage_name, [])
406
- missing_vars = [var for var in required_vars if var not in os.environ]
407
 
408
- if missing_vars:
409
- raise ValueError(
410
- f"Storage implementation '{storage_name}' requires the following "
411
- f"environment variables: {', '.join(missing_vars)}"
412
- )
413
 
414
  def __post_init__(self):
415
- os.makedirs(self.log_dir, exist_ok=True)
416
- log_file = os.path.join(self.log_dir, "lightrag.log")
417
- set_logger(log_file)
418
-
419
  logger.setLevel(self.log_level)
 
 
420
  logger.info(f"Logger initialized for working directory: {self.working_dir}")
 
421
  if not os.path.exists(self.working_dir):
422
  logger.info(f"Creating working directory {self.working_dir}")
423
  os.makedirs(self.working_dir)
@@ -432,22 +282,16 @@ class LightRAG:
432
 
433
  for storage_type, storage_name in storage_configs:
434
  # Verify storage implementation compatibility
435
- self.verify_storage_implementation(storage_type, storage_name)
436
  # Check environment variables
437
  # self.check_storage_env_vars(storage_name)
438
 
439
  # Ensure vector_db_storage_cls_kwargs has required fields
440
- default_vector_db_kwargs = {
441
- "cosine_better_than_threshold": float(os.getenv("COSINE_THRESHOLD", "0.2"))
442
- }
443
  self.vector_db_storage_cls_kwargs = {
444
- **default_vector_db_kwargs,
445
  **self.vector_db_storage_cls_kwargs,
446
  }
447
 
448
- # Life cycle
449
- self.storages_status = StoragesStatus.NOT_CREATED
450
-
451
  # Show config
452
  global_config = asdict(self)
453
  _print_config = ",\n ".join([f"{k} = {v}" for k, v in global_config.items()])
@@ -555,7 +399,7 @@ class LightRAG:
555
  )
556
  )
557
 
558
- self.storages_status = StoragesStatus.CREATED
559
 
560
  # Initialize storages
561
  if self.auto_manage_storages_states:
@@ -570,7 +414,7 @@ class LightRAG:
570
 
571
  async def initialize_storages(self):
572
  """Asynchronously initialize the storages"""
573
- if self.storages_status == StoragesStatus.CREATED:
574
  tasks = []
575
 
576
  for storage in (
@@ -588,12 +432,12 @@ class LightRAG:
588
 
589
  await asyncio.gather(*tasks)
590
 
591
- self.storages_status = StoragesStatus.INITIALIZED
592
  logger.debug("Initialized Storages")
593
 
594
  async def finalize_storages(self):
595
  """Asynchronously finalize the storages"""
596
- if self.storages_status == StoragesStatus.INITIALIZED:
597
  tasks = []
598
 
599
  for storage in (
@@ -611,7 +455,7 @@ class LightRAG:
611
 
612
  await asyncio.gather(*tasks)
613
 
614
- self.storages_status = StoragesStatus.FINALIZED
615
  logger.debug("Finalized Storages")
616
 
617
  async def get_graph_labels(self):
@@ -687,7 +531,7 @@ class LightRAG:
687
  return
688
 
689
  update_storage = True
690
- logger.info(f"[New Docs] inserting {len(new_docs)} docs")
691
 
692
  inserting_chunks: dict[str, Any] = {}
693
  for chunk_text in text_chunks:
@@ -780,108 +624,122 @@ class LightRAG:
780
  4. Update the document status
781
  """
782
  # 1. Get all pending, failed, and abnormally terminated processing documents.
783
- to_process_docs: dict[str, DocProcessingStatus] = {}
 
 
 
 
 
784
 
785
- processing_docs = await self.doc_status.get_docs_by_status(DocStatus.PROCESSING)
786
  to_process_docs.update(processing_docs)
787
- failed_docs = await self.doc_status.get_docs_by_status(DocStatus.FAILED)
788
  to_process_docs.update(failed_docs)
789
- pendings_docs = await self.doc_status.get_docs_by_status(DocStatus.PENDING)
790
- to_process_docs.update(pendings_docs)
791
 
792
  if not to_process_docs:
793
  logger.info("All documents have been processed or are duplicates")
794
  return
795
 
796
  # 2. split docs into chunks, insert chunks, update doc status
797
- batch_size = self.addon_params.get("insert_batch_size", 10)
798
  docs_batches = [
799
- list(to_process_docs.items())[i : i + batch_size]
800
- for i in range(0, len(to_process_docs), batch_size)
801
  ]
802
 
803
  logger.info(f"Number of batches to process: {len(docs_batches)}.")
804
 
 
805
  # 3. iterate over batches
806
  for batch_idx, docs_batch in enumerate(docs_batches):
807
- # 4. iterate over batch
808
- for doc_id_processing_status in docs_batch:
809
- doc_id, status_doc = doc_id_processing_status
810
- # Update status in processing
811
- doc_status_id = compute_mdhash_id(status_doc.content, prefix="doc-")
812
- await self.doc_status.upsert(
813
- {
814
- doc_status_id: {
815
- "status": DocStatus.PROCESSING,
816
- "updated_at": datetime.now().isoformat(),
817
- "content": status_doc.content,
818
- "content_summary": status_doc.content_summary,
819
- "content_length": status_doc.content_length,
820
- "created_at": status_doc.created_at,
 
 
 
821
  }
 
 
822
  }
823
- )
824
- # Generate chunks from document
825
- chunks: dict[str, Any] = {
826
- compute_mdhash_id(dp["content"], prefix="chunk-"): {
827
- **dp,
828
- "full_doc_id": doc_id,
829
- }
830
- for dp in self.chunking_func(
831
- status_doc.content,
832
- split_by_character,
833
- split_by_character_only,
834
- self.chunk_overlap_token_size,
835
- self.chunk_token_size,
836
- self.tiktoken_model_name,
837
- )
838
- }
839
-
840
- # Process document (text chunks and full docs) in parallel
841
- tasks = [
842
- self.chunks_vdb.upsert(chunks),
843
- self._process_entity_relation_graph(chunks),
844
- self.full_docs.upsert({doc_id: {"content": status_doc.content}}),
845
- self.text_chunks.upsert(chunks),
846
- ]
847
- try:
848
- await asyncio.gather(*tasks)
849
- await self.doc_status.upsert(
850
- {
851
- doc_status_id: {
852
- "status": DocStatus.PROCESSED,
853
- "chunks_count": len(chunks),
854
- "content": status_doc.content,
855
- "content_summary": status_doc.content_summary,
856
- "content_length": status_doc.content_length,
857
- "created_at": status_doc.created_at,
858
- "updated_at": datetime.now().isoformat(),
859
  }
860
- }
861
- )
862
- await self._insert_done()
863
-
864
- except Exception as e:
865
- logger.error(f"Failed to process document {doc_id}: {str(e)}")
866
- await self.doc_status.upsert(
867
- {
868
- doc_status_id: {
869
- "status": DocStatus.FAILED,
870
- "error": str(e),
871
- "content": status_doc.content,
872
- "content_summary": status_doc.content_summary,
873
- "content_length": status_doc.content_length,
874
- "created_at": status_doc.created_at,
875
- "updated_at": datetime.now().isoformat(),
 
 
 
 
 
876
  }
877
- }
878
- )
879
- continue
880
- logger.info(f"Completed batch {batch_idx + 1} of {len(docs_batches)}.")
 
 
881
 
882
  async def _process_entity_relation_graph(self, chunk: dict[str, Any]) -> None:
883
  try:
884
- new_kg = await extract_entities(
885
  chunk,
886
  knowledge_graph_inst=self.chunk_entity_relation_graph,
887
  entity_vdb=self.entities_vdb,
@@ -889,12 +747,6 @@ class LightRAG:
889
  llm_response_cache=self.llm_response_cache,
890
  global_config=asdict(self),
891
  )
892
- if new_kg is None:
893
- logger.info("No new entities or relationships extracted.")
894
- else:
895
- logger.info("New entities or relationships extracted.")
896
- self.chunk_entity_relation_graph = new_kg
897
-
898
  except Exception as e:
899
  logger.error("Failed to extract entities and relationships")
900
  raise e
@@ -914,6 +766,7 @@ class LightRAG:
914
  if storage_inst is not None
915
  ]
916
  await asyncio.gather(*tasks)
 
917
 
918
  def insert_custom_kg(self, custom_kg: dict[str, Any]) -> None:
919
  loop = always_get_an_event_loop()
@@ -926,11 +779,28 @@ class LightRAG:
926
  all_chunks_data: dict[str, dict[str, str]] = {}
927
  chunk_to_source_map: dict[str, str] = {}
928
  for chunk_data in custom_kg.get("chunks", {}):
929
- chunk_content = chunk_data["content"]
930
  source_id = chunk_data["source_id"]
931
- chunk_id = compute_mdhash_id(chunk_content.strip(), prefix="chunk-")
 
 
932
 
933
- chunk_entry = {"content": chunk_content.strip(), "source_id": source_id}
 
 
934
  all_chunks_data[chunk_id] = chunk_entry
935
  chunk_to_source_map[source_id] = chunk_id
936
  update_storage = True
@@ -1177,7 +1047,6 @@ class LightRAG:
1177
  # ---------------------
1178
  # STEP 1: Keyword Extraction
1179
  # ---------------------
1180
- # We'll assume 'extract_keywords_only(...)' returns (hl_keywords, ll_keywords).
1181
  hl_keywords, ll_keywords = await extract_keywords_only(
1182
  text=query,
1183
  param=param,
@@ -1603,3 +1472,21 @@ class LightRAG:
1603
  result["vector_data"] = vector_data[0] if vector_data else None
1604
 
1605
  return result
 
 
6
  from dataclasses import asdict, dataclass, field
7
  from datetime import datetime
8
  from functools import partial
9
+ from typing import Any, AsyncIterator, Callable, Iterator, cast, final
10
+
11
+ from lightrag.kg import (
12
+ STORAGE_ENV_REQUIREMENTS,
13
+ STORAGES,
14
+ verify_storage_implementation,
15
+ )
16
 
17
  from .base import (
18
  BaseGraphStorage,
 
38
  from .prompt import GRAPH_FIELD_SEP
39
  from .utils import (
40
  EmbeddingFunc,
41
+ always_get_an_event_loop,
42
  compute_mdhash_id,
43
  convert_response_to_json,
44
+ lazy_external_import,
45
  limit_async_func_call,
46
  logger,
47
  set_logger,
48
+ encode_string_by_tiktoken,
49
  )
50
  from .types import KnowledgeGraph
51
 
52
+ # TODO: TO REMOVE @Yannick
53
  config = configparser.ConfigParser()
54
  config.read("config.ini", "utf-8")
55
 
 
56
 
57
+ @final
58
  @dataclass
59
  class LightRAG:
60
  """LightRAG: Simple and Fast Retrieval-Augmented Generation."""
61
 
62
+ # Directory
63
+ # ---
64
+
65
  working_dir: str = field(
66
+ default=f"./lightrag_cache_{datetime.now().strftime('%Y-%m-%d-%H:%M:%S')}"
67
  )
68
  """Directory where cache and temporary files are stored."""
69
 
70
+ # Storage
71
+ # ---
 
 
72
 
73
  kv_storage: str = field(default="JsonKVStorage")
74
  """Storage backend for key-value data."""
 
83
  """Storage type for tracking document processing statuses."""
84
 
85
  # Logging
86
+ # ---
87
+
88
+ log_level: int = field(default=logger.level)
89
  """Logging level for the system (e.g., 'DEBUG', 'INFO', 'WARNING')."""
90
 
91
+ log_file_path: str = field(default=os.path.join(os.getcwd(), "lightrag.log"))
92
+ """Log file path."""
93
+
94
+ # Entity extraction
95
+ # ---
96
+
97
+ entity_extract_max_gleaning: int = field(default=1)
98
+ """Maximum number of entity extraction attempts for ambiguous content."""
99
+
100
+ entity_summary_to_max_tokens: int = field(
101
+ default=int(os.getenv("MAX_TOKEN_SUMMARY", 500))
102
+ )
103
 
104
  # Text chunking
105
+ # ---
106
+
107
+ chunk_token_size: int = field(default=int(os.getenv("CHUNK_SIZE", 1200)))
108
  """Maximum number of tokens per text chunk when splitting documents."""
109
 
110
+ chunk_overlap_token_size: int = field(
111
+ default=int(os.getenv("CHUNK_OVERLAP_SIZE", 100))
112
+ )
113
  """Number of overlapping tokens between consecutive text chunks to preserve context."""
114
 
115
+ tiktoken_model_name: str = field(default="gpt-4o-mini")
116
  """Model name used for tokenization when chunking text."""
117
 
 
118
  """Maximum number of tokens used for summarizing extracted entities."""
119
 
120
+ chunking_func: Callable[
121
+ [
122
+ str,
123
+ str | None,
124
+ bool,
125
+ int,
126
+ int,
127
+ str,
128
+ ],
129
+ list[dict[str, Any]],
130
+ ] = field(default_factory=lambda: chunking_by_token_size)
131
+ """
132
+ Custom chunking function for splitting text into chunks before processing.
133
+
134
+ The function should take the following parameters:
135
+
136
+ - `content`: The text to be split into chunks.
137
+ - `split_by_character`: The character to split the text on. If None, the text is split into chunks of `chunk_token_size` tokens.
138
+ - `split_by_character_only`: If True, the text is split only on the specified character.
139
+ - `chunk_token_size`: The maximum number of tokens per chunk.
140
+ - `chunk_overlap_token_size`: The number of overlapping tokens between consecutive chunks.
141
+ - `tiktoken_model_name`: The name of the tiktoken model to use for tokenization.
142
+
143
+ The function should return a list of dictionaries, where each dictionary contains the following keys:
144
+ - `tokens`: The number of tokens in the chunk.
145
+ - `content`: The text content of the chunk.
146
+
147
+ Defaults to `chunking_by_token_size` if not specified.
148
+ """
149
+
150
  # Node embedding
151
+ # ---
152
+
153
+ node_embedding_algorithm: str = field(default="node2vec")
154
  """Algorithm used for node embedding in knowledge graphs."""
155
 
156
  node2vec_params: dict[str, int] = field(
 
172
  - random_seed: Seed value for reproducibility.
173
  """
174
 
175
+ # Embedding
176
+ # ---
177
+
178
+ embedding_func: EmbeddingFunc | None = field(default=None)
179
  """Function for computing text embeddings. Must be set before use."""
180
 
181
+ embedding_batch_num: int = field(default=32)
182
  """Batch size for embedding computations."""
183
 
184
+ embedding_func_max_async: int = field(default=16)
185
  """Maximum number of concurrent embedding function calls."""
186
 
187
+ embedding_cache_config: dict[str, Any] = field(
188
+ default_factory=lambda: {
189
+ "enabled": False,
190
+ "similarity_threshold": 0.95,
191
+ "use_llm_check": False,
192
+ }
193
+ )
194
+ """Configuration for embedding cache.
195
+ - enabled: If True, enables caching to avoid redundant computations.
196
+ - similarity_threshold: Minimum similarity score to use cached embeddings.
197
+ - use_llm_check: If True, validates cached embeddings using an LLM.
198
+ """
199
+
200
  # LLM Configuration
201
+ # ---
202
+
203
+ llm_model_func: Callable[..., object] | None = field(default=None)
204
  """Function for interacting with the large language model (LLM). Must be set before use."""
205
 
206
+ llm_model_name: str = field(default="gpt-4o-mini")
207
  """Name of the LLM model used for generating responses."""
208
 
209
+ llm_model_max_token_size: int = field(default=int(os.getenv("MAX_TOKENS", 32768)))
210
  """Maximum number of tokens allowed per LLM response."""
211
 
212
+ llm_model_max_async: int = field(default=int(os.getenv("MAX_ASYNC", 16)))
213
  """Maximum number of concurrent LLM calls."""
214
 
215
  llm_model_kwargs: dict[str, Any] = field(default_factory=dict)
216
  """Additional keyword arguments passed to the LLM model function."""
217
 
218
  # Storage
219
+ # ---
220
+
221
  vector_db_storage_cls_kwargs: dict[str, Any] = field(default_factory=dict)
222
  """Additional parameters for vector database storage."""
223
 
224
  namespace_prefix: str = field(default="")
225
  """Prefix for namespacing stored data across different environments."""
226
 
227
+ enable_llm_cache: bool = field(default=True)
228
  """Enables caching for LLM responses to avoid redundant computations."""
229
 
230
+ enable_llm_cache_for_entity_extract: bool = field(default=True)
231
  """If True, enables caching for entity extraction steps to reduce LLM costs."""
232
 
233
  # Extensions
234
+ # ---
235
 
236
+ max_parallel_insert: int = field(default=int(os.getenv("MAX_PARALLEL_INSERT", 20)))
237
+ """Maximum number of parallel insert operations."""
 
 
238
 
239
+ addon_params: dict[str, Any] = field(default_factory=dict)
 
 
 
240
 
241
+ # Storages Management
242
+ # ---
 
243
 
244
+ auto_manage_storages_states: bool = field(default=True)
245
+ """If True, lightrag will automatically calls initialize_storages and finalize_storages at the appropriate times."""
 
 
 
246
 
247
+ # Storages Management
248
+ # ---
 
 
 
 
249
 
250
+ convert_response_to_json_func: Callable[[str], dict[str, Any]] = field(
251
+ default_factory=lambda: convert_response_to_json
252
+ )
253
+ """
254
+ Custom function for converting LLM responses to JSON format.
255
 
256
+ The default function is :func:`.utils.convert_response_to_json`.
257
+ """
258
 
259
+ cosine_better_than_threshold: float = field(
260
+ default=float(os.getenv("COSINE_THRESHOLD", 0.2))
261
+ )
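+ """Cosine-similarity threshold for vector-storage retrieval; forwarded to the vector DB through vector_db_storage_cls_kwargs in __post_init__ (the default can be overridden via the COSINE_THRESHOLD environment variable)."""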
 
 
262
 
263
+ _storages_status: StoragesStatus = field(default=StoragesStatus.NOT_CREATED)
 
 
 
 
264
 
265
  def __post_init__(self):
 
 
 
 
266
  logger.setLevel(self.log_level)
267
+ os.makedirs(os.path.dirname(self.log_file_path), exist_ok=True)
268
+ set_logger(self.log_file_path)
269
  logger.info(f"Logger initialized for working directory: {self.working_dir}")
270
+
271
  if not os.path.exists(self.working_dir):
272
  logger.info(f"Creating working directory {self.working_dir}")
273
  os.makedirs(self.working_dir)
 
282
 
283
  for storage_type, storage_name in storage_configs:
284
  # Verify storage implementation compatibility
285
+ verify_storage_implementation(storage_type, storage_name)
286
  # Check environment variables
287
  # self.check_storage_env_vars(storage_name)
288
 
289
  # Ensure vector_db_storage_cls_kwargs has required fields
 
 
 
290
  self.vector_db_storage_cls_kwargs = {
291
+ "cosine_better_than_threshold": self.cosine_better_than_threshold,
292
  **self.vector_db_storage_cls_kwargs,
293
  }
294
 
 
 
 
295
  # Show config
296
  global_config = asdict(self)
297
  _print_config = ",\n ".join([f"{k} = {v}" for k, v in global_config.items()])
 
399
  )
400
  )
401
 
402
+ self._storages_status = StoragesStatus.CREATED
403
 
404
  # Initialize storages
405
  if self.auto_manage_storages_states:
 
414
 
415
  async def initialize_storages(self):
416
  """Asynchronously initialize the storages"""
417
+ if self._storages_status == StoragesStatus.CREATED:
418
  tasks = []
419
 
420
  for storage in (
 
432
 
433
  await asyncio.gather(*tasks)
434
 
435
+ self._storages_status = StoragesStatus.INITIALIZED
436
  logger.debug("Initialized Storages")
437
 
438
  async def finalize_storages(self):
439
  """Asynchronously finalize the storages"""
440
+ if self._storages_status == StoragesStatus.INITIALIZED:
441
  tasks = []
442
 
443
  for storage in (
 
455
 
456
  await asyncio.gather(*tasks)
457
 
458
+ self._storages_status = StoragesStatus.FINALIZED
459
  logger.debug("Finalized Storages")
460
 
461
  async def get_graph_labels(self):
 
531
  return
532
 
533
  update_storage = True
534
+ logger.info(f"Inserting {len(new_docs)} docs")
535
 
536
  inserting_chunks: dict[str, Any] = {}
537
  for chunk_text in text_chunks:
 
624
  4. Update the document status
625
  """
626
  # 1. Get all pending, failed, and abnormally terminated processing documents.
627
+ # Run the asynchronous status retrievals in parallel using asyncio.gather
628
+ processing_docs, failed_docs, pending_docs = await asyncio.gather(
629
+ self.doc_status.get_docs_by_status(DocStatus.PROCESSING),
630
+ self.doc_status.get_docs_by_status(DocStatus.FAILED),
631
+ self.doc_status.get_docs_by_status(DocStatus.PENDING),
632
+ )
633
 
634
+ to_process_docs: dict[str, DocProcessingStatus] = {}
635
  to_process_docs.update(processing_docs)
 
636
  to_process_docs.update(failed_docs)
637
+ to_process_docs.update(pending_docs)
 
638
 
639
  if not to_process_docs:
640
  logger.info("All documents have been processed or are duplicates")
641
  return
642
 
643
  # 2. split docs into chunks, insert chunks, update doc status
 
644
  docs_batches = [
645
+ list(to_process_docs.items())[i : i + self.max_parallel_insert]
646
+ for i in range(0, len(to_process_docs), self.max_parallel_insert)
647
  ]
648
 
649
  logger.info(f"Number of batches to process: {len(docs_batches)}.")
650
 
651
+ batches: list[Any] = []
652
  # 3. iterate over batches
653
  for batch_idx, docs_batch in enumerate(docs_batches):
654
+
655
+ async def batch(
656
+ batch_idx: int,
657
+ docs_batch: list[tuple[str, DocProcessingStatus]],
658
+ size_batch: int,
659
+ ) -> None:
660
+ logger.info(f"Start processing batch {batch_idx + 1} of {size_batch}.")
661
+ # 4. iterate over batch
662
+ for doc_id_processing_status in docs_batch:
663
+ doc_id, status_doc = doc_id_processing_status
664
+ # Update status in processing
665
+ doc_status_id = compute_mdhash_id(status_doc.content, prefix="doc-")
666
+ # Generate chunks from document
667
+ chunks: dict[str, Any] = {
668
+ compute_mdhash_id(dp["content"], prefix="chunk-"): {
669
+ **dp,
670
+ "full_doc_id": doc_id,
671
  }
672
+ for dp in self.chunking_func(
673
+ status_doc.content,
674
+ split_by_character,
675
+ split_by_character_only,
676
+ self.chunk_overlap_token_size,
677
+ self.chunk_token_size,
678
+ self.tiktoken_model_name,
679
+ )
680
  }
681
+ # Process document (text chunks and full docs) in parallel
682
+ tasks = [
683
+ self.doc_status.upsert(
684
+ {
685
+ doc_status_id: {
686
+ "status": DocStatus.PROCESSING,
687
+ "updated_at": datetime.now().isoformat(),
688
+ "content": status_doc.content,
689
+ "content_summary": status_doc.content_summary,
690
+ "content_length": status_doc.content_length,
691
+ "created_at": status_doc.created_at,
692
+ }
 
 
693
  }
694
+ ),
695
+ self.chunks_vdb.upsert(chunks),
696
+ self._process_entity_relation_graph(chunks),
697
+ self.full_docs.upsert(
698
+ {doc_id: {"content": status_doc.content}}
699
+ ),
700
+ self.text_chunks.upsert(chunks),
701
+ ]
702
+ try:
703
+ await asyncio.gather(*tasks)
704
+ await self.doc_status.upsert(
705
+ {
706
+ doc_status_id: {
707
+ "status": DocStatus.PROCESSED,
708
+ "chunks_count": len(chunks),
709
+ "content": status_doc.content,
710
+ "content_summary": status_doc.content_summary,
711
+ "content_length": status_doc.content_length,
712
+ "created_at": status_doc.created_at,
713
+ "updated_at": datetime.now().isoformat(),
714
+ }
715
  }
716
+ )
717
+ except Exception as e:
718
+ logger.error(f"Failed to process document {doc_id}: {str(e)}")
719
+ await self.doc_status.upsert(
720
+ {
721
+ doc_status_id: {
722
+ "status": DocStatus.FAILED,
723
+ "error": str(e),
724
+ "content": status_doc.content,
725
+ "content_summary": status_doc.content_summary,
726
+ "content_length": status_doc.content_length,
727
+ "created_at": status_doc.created_at,
728
+ "updated_at": datetime.now().isoformat(),
729
+ }
730
+ }
731
+ )
732
+ continue
733
+ logger.info(f"Completed batch {batch_idx + 1} of {len(docs_batches)}.")
734
+
735
+ batches.append(batch(batch_idx, docs_batch, len(docs_batches)))
736
+
737
+ await asyncio.gather(*batches)
738
+ await self._insert_done()
739
 
740
  async def _process_entity_relation_graph(self, chunk: dict[str, Any]) -> None:
741
  try:
742
+ await extract_entities(
743
  chunk,
744
  knowledge_graph_inst=self.chunk_entity_relation_graph,
745
  entity_vdb=self.entities_vdb,
 
747
  llm_response_cache=self.llm_response_cache,
748
  global_config=asdict(self),
749
  )
 
 
 
 
 
 
750
  except Exception as e:
751
  logger.error("Failed to extract entities and relationships")
752
  raise e
 
766
  if storage_inst is not None
767
  ]
768
  await asyncio.gather(*tasks)
769
+ logger.info("All Insert done")
770
 
771
  def insert_custom_kg(self, custom_kg: dict[str, Any]) -> None:
772
  loop = always_get_an_event_loop()
 
779
  all_chunks_data: dict[str, dict[str, str]] = {}
780
  chunk_to_source_map: dict[str, str] = {}
781
  for chunk_data in custom_kg.get("chunks", {}):
782
+ chunk_content = chunk_data["content"].strip()
783
  source_id = chunk_data["source_id"]
784
+ tokens = len(
785
+ encode_string_by_tiktoken(
786
+ chunk_content, model_name=self.tiktoken_model_name
787
+ )
788
+ )
789
+ chunk_order_index = (
790
+ 0
791
+ if "chunk_order_index" not in chunk_data.keys()
792
+ else chunk_data["chunk_order_index"]
793
+ )
794
+ chunk_id = compute_mdhash_id(chunk_content, prefix="chunk-")
795
 
796
+ chunk_entry = {
797
+ "content": chunk_content,
798
+ "source_id": source_id,
799
+ "tokens": tokens,
800
+ "chunk_order_index": chunk_order_index,
801
+ "full_doc_id": source_id,
802
+ "status": DocStatus.PROCESSED,
803
+ }
804
  all_chunks_data[chunk_id] = chunk_entry
805
  chunk_to_source_map[source_id] = chunk_id
806
  update_storage = True
 
1047
  # ---------------------
1048
  # STEP 1: Keyword Extraction
1049
  # ---------------------
 
1050
  hl_keywords, ll_keywords = await extract_keywords_only(
1051
  text=query,
1052
  param=param,
 
1472
  result["vector_data"] = vector_data[0] if vector_data else None
1473
 
1474
  return result
1475
+
1476
+ def check_storage_env_vars(self, storage_name: str) -> None:
1477
+ """Check if all required environment variables for storage implementation exist
1478
+
1479
+ Args:
1480
+ storage_name: Storage implementation name
1481
+
1482
+ Raises:
1483
+ ValueError: If required environment variables are missing
1484
+ """
1485
+ required_vars = STORAGE_ENV_REQUIREMENTS.get(storage_name, [])
1486
+ missing_vars = [var for var in required_vars if var not in os.environ]
1487
+
1488
+ if missing_vars:
1489
+ raise ValueError(
1490
+ f"Storage implementation '{storage_name}' requires the following "
1491
+ f"environment variables: {', '.join(missing_vars)}"
1492
+ )
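A hedged usage sketch of the relocated helper; the backend name and its required variables depend on what `STORAGE_ENV_REQUIREMENTS` registers.

```python
from lightrag import LightRAG

rag = LightRAG(working_dir="./lightrag_cache", graph_storage="Neo4JStorage")
# Raises ValueError listing any variables this backend requires that are
# missing from os.environ; returns silently when everything is set.
rag.check_storage_env_vars("Neo4JStorage")
```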
lightrag/llm/Readme.md ADDED
@@ -0,0 +1,161 @@
 
 
1
+
2
+ 1. **LlamaIndex** (`llm/llama_index_impl.py`):
3
+ - Provides integration with OpenAI and other providers through LlamaIndex
4
+ - Supports both direct API access and proxy services like LiteLLM
5
+ - Handles embeddings and completions with consistent interfaces
6
+ - See example implementations:
7
+ - [Direct OpenAI Usage](../../examples/lightrag_llamaindex_direct_demo.py)
8
+ - [LiteLLM Proxy Usage](../../examples/lightrag_llamaindex_litellm_demo.py)
9
+
10
+ <details>
11
+ <summary> <b>Using LlamaIndex</b> </summary>
12
+
13
+ LightRAG supports LlamaIndex for embeddings and completions in two ways: direct OpenAI usage or through LiteLLM proxy.
14
+
15
+ ### Setup
16
+
17
+ First, install the required dependencies:
18
+ ```bash
19
+ pip install llama-index-llms-litellm llama-index-embeddings-litellm
20
+ ```
21
+
22
+ ### Standard OpenAI Usage
23
+
24
+ ```python
25
+ from lightrag import LightRAG
26
+ from lightrag.llm.llama_index_impl import llama_index_complete_if_cache, llama_index_embed
27
+ from llama_index.embeddings.openai import OpenAIEmbedding
28
+ from llama_index.llms.openai import OpenAI
29
+ from lightrag.utils import EmbeddingFunc
30
+
31
+ # Initialize with direct OpenAI access
32
+ async def llm_model_func(prompt, system_prompt=None, history_messages=[], **kwargs):
33
+ try:
34
+ # Initialize OpenAI if not in kwargs
35
+ if 'llm_instance' not in kwargs:
36
+ llm_instance = OpenAI(
37
+ model="gpt-4",
38
+ api_key="your-openai-key",
39
+ temperature=0.7,
40
+ )
41
+ kwargs['llm_instance'] = llm_instance
42
+
43
+ response = await llama_index_complete_if_cache(
44
+ kwargs['llm_instance'],
45
+ prompt,
46
+ system_prompt=system_prompt,
47
+ history_messages=history_messages,
48
+ **kwargs,
49
+ )
50
+ return response
51
+ except Exception as e:
52
+ logger.error(f"LLM request failed: {str(e)}")
53
+ raise
54
+
55
+ # Initialize LightRAG with OpenAI
56
+ rag = LightRAG(
57
+ working_dir="your/path",
58
+ llm_model_func=llm_model_func,
59
+ embedding_func=EmbeddingFunc(
60
+ embedding_dim=1536,
61
+ max_token_size=8192,
62
+ func=lambda texts: llama_index_embed(
63
+ texts,
64
+ embed_model=OpenAIEmbedding(
65
+ model="text-embedding-3-large",
66
+ api_key="your-openai-key"
67
+ )
68
+ ),
69
+ ),
70
+ )
71
+ ```
72
+
73
+ ### Using LiteLLM Proxy
74
+
75
+ 1. Use any LLM provider through LiteLLM
76
+ 2. Leverage LlamaIndex's embedding and completion capabilities
77
+ 3. Maintain consistent configuration across services
78
+
79
+ ```python
80
+ from lightrag import LightRAG
81
+ from lightrag.llm.llama_index_impl import llama_index_complete_if_cache, llama_index_embed
82
+ from llama_index.llms.litellm import LiteLLM
83
+ from llama_index.embeddings.litellm import LiteLLMEmbedding
84
+ from lightrag.utils import EmbeddingFunc
85
+
86
+ # Initialize with LiteLLM proxy
87
+ async def llm_model_func(prompt, system_prompt=None, history_messages=[], **kwargs):
88
+ try:
89
+ # Initialize LiteLLM if not in kwargs
90
+ if 'llm_instance' not in kwargs:
91
+ llm_instance = LiteLLM(
92
+ model=f"openai/{settings.LLM_MODEL}", # Format: "provider/model_name"
93
+ api_base=settings.LITELLM_URL,
94
+ api_key=settings.LITELLM_KEY,
95
+ temperature=0.7,
96
+ )
97
+ kwargs['llm_instance'] = llm_instance
98
+
99
+ response = await llama_index_complete_if_cache(
100
+ kwargs['llm_instance'],
101
+ prompt,
102
+ system_prompt=system_prompt,
103
+ history_messages=history_messages,
104
+ **kwargs,
105
+ )
106
+ return response
107
+ except Exception as e:
108
+ logger.error(f"LLM request failed: {str(e)}")
109
+ raise
110
+
111
+ # Initialize LightRAG with LiteLLM
112
+ rag = LightRAG(
113
+ working_dir="your/path",
114
+ llm_model_func=llm_model_func,
115
+ embedding_func=EmbeddingFunc(
116
+ embedding_dim=1536,
117
+ max_token_size=8192,
118
+ func=lambda texts: llama_index_embed(
119
+ texts,
120
+ embed_model=LiteLLMEmbedding(
121
+ model_name=f"openai/{settings.EMBEDDING_MODEL}",
122
+ api_base=settings.LITELLM_URL,
123
+ api_key=settings.LITELLM_KEY,
124
+ )
125
+ ),
126
+ ),
127
+ )
128
+ ```
129
+
130
+ ### Environment Variables
131
+
132
+ For OpenAI direct usage:
133
+ ```bash
134
+ OPENAI_API_KEY=your-openai-key
135
+ ```
136
+
137
+ For LiteLLM proxy:
138
+ ```bash
139
+ # LiteLLM Configuration
140
+ LITELLM_URL=http://litellm:4000
141
+ LITELLM_KEY=your-litellm-key
142
+
143
+ # Model Configuration
144
+ LLM_MODEL=gpt-4
145
+ EMBEDDING_MODEL=text-embedding-3-large
146
+ EMBEDDING_MAX_TOKEN_SIZE=8192
147
+ ```
148
+
149
+ ### Key Differences
150
+ 1. **Direct OpenAI**:
151
+ - Simpler setup
152
+ - Direct API access
153
+ - Requires OpenAI API key
154
+
155
+ 2. **LiteLLM Proxy**:
156
+ - Model provider agnostic
157
+ - Centralized API key management
158
+ - Support for multiple providers
159
+ - Better cost control and monitoring
160
+
161
+ </details>
lightrag/llm/llama_index_impl.py ADDED
@@ -0,0 +1,208 @@
 
 
1
+ import pipmaster as pm
+ from typing import List, Optional
+
+ from lightrag.utils import logger
+
+ # Install llama-index before importing from it, so the llama_index
+ # imports below cannot fail when the package is missing.
+ if not pm.is_installed("llama-index"):
+ pm.install("llama-index")
+
+ from llama_index.core.llms import (
+ ChatMessage,
+ MessageRole,
+ ChatResponse,
+ )
13
+
14
+ from llama_index.core.embeddings import BaseEmbedding
15
+ from llama_index.core.settings import Settings as LlamaIndexSettings
16
+ from tenacity import (
17
+ retry,
18
+ stop_after_attempt,
19
+ wait_exponential,
20
+ retry_if_exception_type,
21
+ )
22
+ from lightrag.utils import (
23
+ wrap_embedding_func_with_attrs,
24
+ locate_json_string_body_from_string,
25
+ )
26
+ from lightrag.exceptions import (
27
+ APIConnectionError,
28
+ RateLimitError,
29
+ APITimeoutError,
30
+ )
31
+ import numpy as np
32
+
33
+
34
+ def configure_llama_index(settings: LlamaIndexSettings = None, **kwargs):
35
+ """
36
+ Configure LlamaIndex settings.
37
+
38
+ Args:
39
+ settings: LlamaIndex Settings instance. If None, uses default settings.
40
+ **kwargs: Additional settings to override/configure
41
+ """
42
+ if settings is None:
43
+ settings = LlamaIndexSettings()
44
+
45
+ # Update settings with any provided kwargs
46
+ for key, value in kwargs.items():
47
+ if hasattr(settings, key):
48
+ setattr(settings, key, value)
49
+ else:
50
+ logger.warning(f"Unknown LlamaIndex setting: {key}")
51
+
52
+ # Set as global settings
53
+ LlamaIndexSettings.set_global(settings)
54
+ return settings
55
+
56
+
57
+ def format_chat_messages(messages):
58
+ """Format chat messages into LlamaIndex format."""
59
+ formatted_messages = []
60
+
61
+ for msg in messages:
62
+ role = msg.get("role", "user")
63
+ content = msg.get("content", "")
64
+
65
+ if role == "system":
66
+ formatted_messages.append(
67
+ ChatMessage(role=MessageRole.SYSTEM, content=content)
68
+ )
69
+ elif role == "assistant":
70
+ formatted_messages.append(
71
+ ChatMessage(role=MessageRole.ASSISTANT, content=content)
72
+ )
73
+ elif role == "user":
74
+ formatted_messages.append(
75
+ ChatMessage(role=MessageRole.USER, content=content)
76
+ )
77
+ else:
78
+ logger.warning(f"Unknown role {role}, treating as user message")
79
+ formatted_messages.append(
80
+ ChatMessage(role=MessageRole.USER, content=content)
81
+ )
82
+
83
+ return formatted_messages
84
+
85
+
86
+ @retry(
87
+ stop=stop_after_attempt(3),
88
+ wait=wait_exponential(multiplier=1, min=4, max=60),
89
+ retry=retry_if_exception_type(
90
+ (RateLimitError, APIConnectionError, APITimeoutError)
91
+ ),
92
+ )
93
+ async def llama_index_complete_if_cache(
94
+ model: str,
95
+ prompt: str,
96
+ system_prompt: Optional[str] = None,
97
+ history_messages: List[dict] = [],
98
+ **kwargs,
99
+ ) -> str:
100
+ """Complete the prompt using LlamaIndex."""
101
+ try:
102
+ # Format messages for chat
103
+ formatted_messages = []
104
+
105
+ # Add system message if provided
106
+ if system_prompt:
107
+ formatted_messages.append(
108
+ ChatMessage(role=MessageRole.SYSTEM, content=system_prompt)
109
+ )
110
+
111
+ # Add history messages
112
+ for msg in history_messages:
113
+ formatted_messages.append(
114
+ ChatMessage(
115
+ role=MessageRole.USER
116
+ if msg["role"] == "user"
117
+ else MessageRole.ASSISTANT,
118
+ content=msg["content"],
119
+ )
120
+ )
121
+
122
+ # Add current prompt
123
+ formatted_messages.append(ChatMessage(role=MessageRole.USER, content=prompt))
124
+
125
+ # Get LLM instance from kwargs
126
+ if "llm_instance" not in kwargs:
127
+ raise ValueError("llm_instance must be provided in kwargs")
128
+ llm = kwargs["llm_instance"]
129
+
130
+ # Get response
131
+ response: ChatResponse = await llm.achat(messages=formatted_messages)
132
+
133
+ # In newer versions, the response is in message.content
134
+ content = response.message.content
135
+ return content
136
+
137
+ except Exception as e:
138
+ logger.error(f"Error in llama_index_complete_if_cache: {str(e)}")
139
+ raise
140
+
141
+
142
+ async def llama_index_complete(
143
+ prompt,
144
+ system_prompt=None,
145
+ history_messages=None,
146
+ keyword_extraction=False,
147
+ settings: LlamaIndexSettings = None,
148
+ **kwargs,
149
+ ) -> str:
150
+ """
151
+ Main completion function for LlamaIndex
152
+
153
+ Args:
154
+ prompt: Input prompt
155
+ system_prompt: Optional system prompt
156
+ history_messages: Optional chat history
157
+ keyword_extraction: Whether to extract keywords from response
158
+ settings: Optional LlamaIndex settings
159
+ **kwargs: Additional arguments
160
+ """
161
+ if history_messages is None:
162
+ history_messages = []
163
+
164
+ keyword_extraction = kwargs.pop("keyword_extraction", keyword_extraction)
165
+ result = await llama_index_complete_if_cache(
166
+ kwargs.get("llm_instance"),
167
+ prompt,
168
+ system_prompt=system_prompt,
169
+ history_messages=history_messages,
170
+ **kwargs,
171
+ )
172
+ if keyword_extraction:
173
+ return locate_json_string_body_from_string(result)
174
+ return result
175
+
176
+
177
+ @wrap_embedding_func_with_attrs(embedding_dim=1536, max_token_size=8192)
178
+ @retry(
179
+ stop=stop_after_attempt(3),
180
+ wait=wait_exponential(multiplier=1, min=4, max=60),
181
+ retry=retry_if_exception_type(
182
+ (RateLimitError, APIConnectionError, APITimeoutError)
183
+ ),
184
+ )
185
+ async def llama_index_embed(
186
+ texts: list[str],
187
+ embed_model: BaseEmbedding = None,
188
+ settings: LlamaIndexSettings = None,
189
+ **kwargs,
190
+ ) -> np.ndarray:
191
+ """
192
+ Generate embeddings using LlamaIndex
193
+
194
+ Args:
195
+ texts: List of texts to embed
196
+ embed_model: LlamaIndex embedding model
197
+ settings: Optional LlamaIndex settings
198
+ **kwargs: Additional arguments
199
+ """
200
+ if settings:
201
+ configure_llama_index(settings)
202
+
203
+ if embed_model is None:
204
+ raise ValueError("embed_model must be provided")
205
+
206
+ # Use _get_text_embeddings for batch processing
207
+ embeddings = embed_model._get_text_embeddings(texts)
208
+ return np.array(embeddings)
lightrag/operate.py CHANGED
@@ -329,7 +329,7 @@ async def extract_entities(
329
  relationships_vdb: BaseVectorStorage,
330
  global_config: dict[str, str],
331
  llm_response_cache: BaseKVStorage | None = None,
332
- ) -> BaseGraphStorage | None:
333
  use_llm_func: callable = global_config["llm_model_func"]
334
  entity_extract_max_gleaning = global_config["entity_extract_max_gleaning"]
335
  enable_llm_cache_for_entity_extract: bool = global_config[
@@ -491,11 +491,9 @@ async def extract_entities(
491
  already_processed += 1
492
  already_entities += len(maybe_nodes)
493
  already_relations += len(maybe_edges)
494
- now_ticks = PROMPTS["process_tickers"][
495
- already_processed % len(PROMPTS["process_tickers"])
496
- ]
497
  logger.debug(
498
- f"{now_ticks} Processed {already_processed} chunks, {already_entities} entities(duplicated), {already_relations} relations(duplicated)\r",
499
  )
500
  return dict(maybe_nodes), dict(maybe_edges)
501
 
@@ -524,16 +522,18 @@ async def extract_entities(
524
  ]
525
  )
526
 
527
- if not len(all_entities_data) and not len(all_relationships_data):
528
- logger.warning(
529
- "Didn't extract any entities and relationships, maybe your LLM is not working"
530
- )
531
- return None
532
 
533
- if not len(all_entities_data):
534
- logger.warning("Didn't extract any entities")
535
- if not len(all_relationships_data):
536
- logger.warning("Didn't extract any relationships")
 
 
 
 
537
 
538
  if entity_vdb is not None:
539
  data_for_vdb = {
@@ -562,8 +562,6 @@ async def extract_entities(
562
  }
563
  await relationships_vdb.upsert(data_for_vdb)
564
 
565
- return knowledge_graph_inst
566
-
567
 
568
  async def kg_query(
569
  query: str,
@@ -1328,15 +1326,12 @@ async def _get_edge_data(
1328
  ),
1329
  )
1330
 
1331
- if not all([n is not None for n in edge_datas]):
1332
- logger.warning("Some edges are missing, maybe the storage is damaged")
1333
-
1334
  edge_datas = [
1335
  {
1336
  "src_id": k["src_id"],
1337
  "tgt_id": k["tgt_id"],
1338
  "rank": d,
1339
- "created_at": k.get("__created_at__", None), # 从 KV 存储中获取时间元数据
1340
  **v,
1341
  }
1342
  for k, v, d in zip(results, edge_datas, edge_degree)
@@ -1345,16 +1340,11 @@ async def _get_edge_data(
1345
  edge_datas = sorted(
1346
  edge_datas, key=lambda x: (x["rank"], x["weight"]), reverse=True
1347
  )
1348
- len_edge_datas = len(edge_datas)
1349
  edge_datas = truncate_list_by_token_size(
1350
  edge_datas,
1351
  key=lambda x: x["description"],
1352
  max_token_size=query_param.max_token_for_global_context,
1353
  )
1354
- logger.debug(
1355
- f"Truncate relations from {len_edge_datas} to {len(edge_datas)} (max tokens:{query_param.max_token_for_global_context})"
1356
- )
1357
-
1358
  use_entities, use_text_units = await asyncio.gather(
1359
  _find_most_related_entities_from_relationships(
1360
  edge_datas, query_param, knowledge_graph_inst
 
329
  relationships_vdb: BaseVectorStorage,
330
  global_config: dict[str, str],
331
  llm_response_cache: BaseKVStorage | None = None,
332
+ ) -> None:
333
  use_llm_func: callable = global_config["llm_model_func"]
334
  entity_extract_max_gleaning = global_config["entity_extract_max_gleaning"]
335
  enable_llm_cache_for_entity_extract: bool = global_config[
 
491
  already_processed += 1
492
  already_entities += len(maybe_nodes)
493
  already_relations += len(maybe_edges)
494
+
 
 
495
  logger.debug(
496
+ f"Processed {already_processed} chunks, {already_entities} entities(duplicated), {already_relations} relations(duplicated)\r",
497
  )
498
  return dict(maybe_nodes), dict(maybe_edges)
499
 
 
522
  ]
523
  )
524
 
525
+ if not (all_entities_data or all_relationships_data):
526
+ logger.info("Didn't extract any entities and relationships.")
527
+ return
 
 
528
 
529
+ if not all_entities_data:
530
+ logger.info("Didn't extract any entities")
531
+ if not all_relationships_data:
532
+ logger.info("Didn't extract any relationships")
533
+
534
+ logger.info(
535
+ f"New entities or relationships extracted, entities:{all_entities_data}, relationships:{all_relationships_data}"
536
+ )
537
 
538
  if entity_vdb is not None:
539
  data_for_vdb = {
 
562
  }
563
  await relationships_vdb.upsert(data_for_vdb)
564
 
 
 
565
 
566
  async def kg_query(
567
  query: str,
 
1326
  ),
1327
  )
1328
 
 
 
 
1329
  edge_datas = [
1330
  {
1331
  "src_id": k["src_id"],
1332
  "tgt_id": k["tgt_id"],
1333
  "rank": d,
1334
+ "created_at": k.get("__created_at__", None),
1335
  **v,
1336
  }
1337
  for k, v, d in zip(results, edge_datas, edge_degree)
 
1340
  edge_datas = sorted(
1341
  edge_datas, key=lambda x: (x["rank"], x["weight"]), reverse=True
1342
  )
 
1343
  edge_datas = truncate_list_by_token_size(
1344
  edge_datas,
1345
  key=lambda x: x["description"],
1346
  max_token_size=query_param.max_token_for_global_context,
1347
  )
 
 
 
 
1348
  use_entities, use_text_units = await asyncio.gather(
1349
  _find_most_related_entities_from_relationships(
1350
  edge_datas, query_param, knowledge_graph_inst
lightrag/prompt.py CHANGED
@@ -9,15 +9,14 @@ PROMPTS["DEFAULT_LANGUAGE"] = "English"
9
  PROMPTS["DEFAULT_TUPLE_DELIMITER"] = "<|>"
10
  PROMPTS["DEFAULT_RECORD_DELIMITER"] = "##"
11
  PROMPTS["DEFAULT_COMPLETION_DELIMITER"] = "<|COMPLETE|>"
12
- PROMPTS["process_tickers"] = ["⠋", "⠙", "⠹", "⠸", "⠼", "⠴", "⠦", "⠧", "⠇", "⠏"]
13
 
14
  PROMPTS["DEFAULT_ENTITY_TYPES"] = ["organization", "person", "geo", "event", "category"]
15
 
16
- PROMPTS["entity_extraction"] = """-Goal-
17
  Given a text document that is potentially relevant to this activity and a list of entity types, identify all entities of those types from the text and all relationships among the identified entities.
18
  Use {language} as output language.
19
 
20
- -Steps-
21
  1. Identify all entities. For each identified entity, extract the following information:
22
  - entity_name: Name of the entity, use same language as input text. If English, capitalized the name.
23
  - entity_type: One of the following types: [{entity_types}]
@@ -41,18 +40,17 @@ Format the content-level key words as ("content_keywords"{tuple_delimiter}<high_
41
  5. When finished, output {completion_delimiter}
42
 
43
  ######################
44
- -Examples-
45
  ######################
46
  {examples}
47
 
48
  #############################
49
- -Real Data-
50
  ######################
51
  Entity_types: {entity_types}
52
  Text: {input_text}
53
  ######################
54
- Output:
55
- """
56
 
57
  PROMPTS["entity_extraction_examples"] = [
58
  """Example 1:
@@ -137,7 +135,7 @@ Make sure it is written in third person, and include the entity names so we the
137
  Use {language} as output language.
138
 
139
  #######
140
- -Data-
141
  Entities: {entity_name}
142
  Description List: {description_list}
143
  #######
@@ -205,12 +203,12 @@ Given the query and conversation history, list both high-level and low-level key
205
  - "low_level_keywords" for specific entities or details
206
 
207
  ######################
208
- -Examples-
209
  ######################
210
  {examples}
211
 
212
  #############################
213
- -Real Data-
214
  ######################
215
  Conversation History:
216
  {history}
 
9
  PROMPTS["DEFAULT_TUPLE_DELIMITER"] = "<|>"
10
  PROMPTS["DEFAULT_RECORD_DELIMITER"] = "##"
11
  PROMPTS["DEFAULT_COMPLETION_DELIMITER"] = "<|COMPLETE|>"
 
12
 
13
  PROMPTS["DEFAULT_ENTITY_TYPES"] = ["organization", "person", "geo", "event", "category"]
14
 
15
+ PROMPTS["entity_extraction"] = """---Goal---
16
  Given a text document that is potentially relevant to this activity and a list of entity types, identify all entities of those types from the text and all relationships among the identified entities.
17
  Use {language} as output language.
18
 
19
+ ---Steps---
20
  1. Identify all entities. For each identified entity, extract the following information:
21
  - entity_name: Name of the entity, use same language as input text. If English, capitalized the name.
22
  - entity_type: One of the following types: [{entity_types}]
 
40
  5. When finished, output {completion_delimiter}
41
 
42
  ######################
43
+ ---Examples---
44
  ######################
45
  {examples}
46
 
47
  #############################
48
+ ---Real Data---
49
  ######################
50
  Entity_types: {entity_types}
51
  Text: {input_text}
52
  ######################
53
+ Output:"""
 
54
 
55
  PROMPTS["entity_extraction_examples"] = [
56
  """Example 1:
 
135
  Use {language} as output language.
136
 
137
  #######
138
+ ---Data---
139
  Entities: {entity_name}
140
  Description List: {description_list}
141
  #######
 
203
  - "low_level_keywords" for specific entities or details
204
 
205
  ######################
206
+ ---Examples---
207
  ######################
208
  {examples}
209
 
210
  #############################
211
+ ---Real Data---
212
  ######################
213
  Conversation History:
214
  {history}
lightrag/utils.py CHANGED
@@ -713,3 +713,47 @@ def get_conversation_turns(
713
  )
714
 
715
  return "\n".join(formatted_turns)
 
 
713
  )
714
 
715
  return "\n".join(formatted_turns)
716
+
717
+
718
+ def always_get_an_event_loop() -> asyncio.AbstractEventLoop:
719
+ """
720
+ Ensure that there is always an event loop available.
721
+
722
+ This function tries to get the current event loop. If the current event loop is closed or does not exist,
723
+ it creates a new event loop and sets it as the current event loop.
724
+
725
+ Returns:
726
+ asyncio.AbstractEventLoop: The current or newly created event loop.
727
+ """
728
+ try:
729
+ # Try to get the current event loop
730
+ current_loop = asyncio.get_event_loop()
731
+ if current_loop.is_closed():
732
+ raise RuntimeError("Event loop is closed.")
733
+ return current_loop
734
+
735
+ except RuntimeError:
736
+ # If no event loop exists or it is closed, create a new one
737
+ logger.info("Creating a new event loop in main thread.")
738
+ new_loop = asyncio.new_event_loop()
739
+ asyncio.set_event_loop(new_loop)
740
+ return new_loop
741
+
742
+
743
+ def lazy_external_import(module_name: str, class_name: str) -> Callable[..., Any]:
744
+ """Lazily import a class from an external module based on the package of the caller."""
745
+ # Get the caller's module and package
746
+ import inspect
747
+
748
+ caller_frame = inspect.currentframe().f_back
749
+ module = inspect.getmodule(caller_frame)
750
+ package = module.__package__ if module else None
751
+
752
+ def import_class(*args: Any, **kwargs: Any):
753
+ import importlib
754
+
755
+ module = importlib.import_module(module_name, package=package)
756
+ cls = getattr(module, class_name)
757
+ return cls(*args, **kwargs)
758
+
759
+ return import_class
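Both helpers now live in `lightrag.utils`; a hedged usage sketch follows (the storage module path and class name are illustrative, not taken from this commit).

```python
import asyncio

from lightrag.utils import always_get_an_event_loop, lazy_external_import

# Defer importing a heavy storage backend until it is first instantiated;
# the relative module path resolves against the caller's package.
PGVectorStorage = lazy_external_import(".kg.postgres_impl", "PGVectorStorage")

async def main() -> None:
    await asyncio.sleep(0)  # stand-in for real async work

# Reuse the current event loop when usable, otherwise create and register one.
loop = always_get_an_event_loop()
loop.run_until_complete(main())
```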
reproduce/Step_3.py CHANGED
@@ -1,7 +1,7 @@
1
  import re
2
  import json
3
- import asyncio
4
  from lightrag import LightRAG, QueryParam
 
5
 
6
 
7
  def extract_queries(file_path):
@@ -23,15 +23,6 @@ async def process_query(query_text, rag_instance, query_param):
23
  return None, {"query": query_text, "error": str(e)}
24
 
25
 
26
- def always_get_an_event_loop() -> asyncio.AbstractEventLoop:
27
- try:
28
- loop = asyncio.get_event_loop()
29
- except RuntimeError:
30
- loop = asyncio.new_event_loop()
31
- asyncio.set_event_loop(loop)
32
- return loop
33
-
34
-
35
  def run_queries_and_save_to_json(
36
  queries, rag_instance, query_param, output_file, error_file
37
  ):
 
1
  import re
2
  import json
 
3
  from lightrag import LightRAG, QueryParam
4
+ from lightrag.utils import always_get_an_event_loop
5
 
6
 
7
  def extract_queries(file_path):
 
23
  return None, {"query": query_text, "error": str(e)}
24
 
25
 
 
 
26
  def run_queries_and_save_to_json(
27
  queries, rag_instance, query_param, output_file, error_file
28
  ):
reproduce/Step_3_openai_compatible.py CHANGED
@@ -1,10 +1,9 @@
1
  import os
2
  import re
3
  import json
4
- import asyncio
5
  from lightrag import LightRAG, QueryParam
6
  from lightrag.llm.openai import openai_complete_if_cache, openai_embed
7
- from lightrag.utils import EmbeddingFunc
8
  import numpy as np
9
 
10
 
@@ -55,15 +54,6 @@ async def process_query(query_text, rag_instance, query_param):
55
  return None, {"query": query_text, "error": str(e)}
56
 
57
 
58
- def always_get_an_event_loop() -> asyncio.AbstractEventLoop:
59
- try:
60
- loop = asyncio.get_event_loop()
61
- except RuntimeError:
62
- loop = asyncio.new_event_loop()
63
- asyncio.set_event_loop(loop)
64
- return loop
65
-
66
-
67
  def run_queries_and_save_to_json(
68
  queries, rag_instance, query_param, output_file, error_file
69
  ):
 
1
  import os
2
  import re
3
  import json
 
4
  from lightrag import LightRAG, QueryParam
5
  from lightrag.llm.openai import openai_complete_if_cache, openai_embed
6
+ from lightrag.utils import EmbeddingFunc, always_get_an_event_loop
7
  import numpy as np
8
 
9
 
 
54
  return None, {"query": query_text, "error": str(e)}
55
 
56
 
 
 
57
  def run_queries_and_save_to_json(
58
  queries, rag_instance, query_param, output_file, error_file
59
  ):