---
title: Semantic Diffing for Evolving Knowledge Graphs
emoji: 🔀
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
pinned: false
---

# Semantic Diffing for Evolving Knowledge Graphs

A system for tracking structural changes in knowledge graphs as documents evolve over time. This project extracts entities and relationships from multiple document versions, constructs graph representations, and identifies semantic differences such as added or removed entities and relationships.

The system enables comparison between document snapshots and generates both structured graph diffs and natural-language summaries of detected changes.

---

# Overview

Knowledge graphs evolve as new information becomes available. Tracking changes between versions is critical in domains such as enterprise knowledge management, legal systems, compliance workflows, and technical documentation.

This project implements:

* Entity and relationship extraction from document versions
* Knowledge graph construction using NetworkX
* Graph-level semantic diffing
* Identification of added and removed nodes and edges
* Natural-language summarization of detected changes

---

# Key Features

* Extract entities and relationships from document text
* Build graph representations for multiple document versions
* Compare knowledge graph snapshots
* Detect added entities
* Detect removed entities
* Detect added relationships
* Detect removed relationships
* Generate structured graph diffs
* Produce natural-language summaries of changes
* Visualize knowledge graph snapshots

---

# Frontend

The project ships with a full interactive, animated frontend at `frontend/index.html`, served directly by `app.py`. It includes:

* A live-diff demo that runs instantly against bundled sample data; no API key is needed to try it
* An optional advanced panel for entering a Groq API key to run the diff live against `/api/diff`
* Side-by-side force-directed graph views (D3) of the two knowledge graph versions, color-coded to match `graph_utils.py`'s own diff palette
* A terminal-style animated diff console rendering added/removed/unchanged entities and relations
* A walkthrough of the 5-stage pipeline, architecture breakdown, use cases, and roadmap

To use it, just run `python app.py` and open `http://localhost:5050`.

---

# How It Works

1. Upload two document versions:

   * Baseline document (v1)
   * Updated document (v2)

2. Each document is processed independently:

   * Text is parsed
   * Entities are extracted
   * Relationships are extracted

3. A knowledge graph is created for each version.

4. Graph diffing identifies:

   * New entities
   * Removed entities
   * New relationships
   * Removed relationships

5. A natural-language summary describes the detected changes.
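
The steps above can be sketched as a small orchestration function. This is a minimal illustration of the data flow, not the project's actual code: `run_pipeline` and its parameter names are hypothetical, and the toy stages treat whitespace-separated words as "entities" purely to make the example runnable.

```python
from typing import Any, Callable


def run_pipeline(
    text_v1: str,
    text_v2: str,
    extract: Callable[[str], Any],
    build_graph: Callable[[Any], Any],
    diff_graphs: Callable[[Any, Any], Any],
    summarize: Callable[[Any], str],
) -> tuple[Any, str]:
    """Run steps 2-5 in order; each stage is injected so it can be swapped or stubbed."""
    graph_v1 = build_graph(extract(text_v1))  # steps 2-3 for the baseline
    graph_v2 = build_graph(extract(text_v2))  # steps 2-3 for the update
    diff = diff_graphs(graph_v1, graph_v2)    # step 4
    return diff, summarize(diff)              # step 5


# Toy stages, only to show the data flow.
diff, summary = run_pipeline(
    "alpha beta",
    "beta gamma",
    extract=str.split,
    build_graph=set,
    diff_graphs=lambda g1, g2: {"added": sorted(g2 - g1), "removed": sorted(g1 - g2)},
    summarize=lambda d: f"{len(d['added'])} added, {len(d['removed'])} removed",
)
print(diff)     # {'added': ['gamma'], 'removed': ['alpha']}
print(summary)  # 1 added, 1 removed
```

Passing the stages in as callables keeps the LLM call out of the control flow, so the pipeline can be exercised offline with stubs.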

---

# System Architecture

```text
                 ┌─────────────────────┐
                 │   Document v1       │
                 │   (Baseline)        │
                 └─────────┬───────────┘
                           │
                           ▼
                 ┌─────────────────────┐
                 │ Entity & Relation   │
                 │ Extraction (LLM)    │
                 └─────────┬───────────┘
                           │
                           ▼
                 ┌─────────────────────┐
                 │ Knowledge Graph v1  │
                 └─────────────────────┘


                 ┌─────────────────────┐
                 │   Document v2       │
                 │   (Updated)         │
                 └─────────┬───────────┘
                           │
                           ▼
                 ┌─────────────────────┐
                 │ Entity & Relation   │
                 │ Extraction (LLM)    │
                 └─────────┬───────────┘
                           │
                           ▼
                 ┌─────────────────────┐
                 │ Knowledge Graph v2  │
                 └─────────┬───────────┘
                           │
                           ▼
                 ┌─────────────────────┐
                 │ Graph Diff Engine   │
                 │ - Added Nodes       │
                 │ - Removed Nodes     │
                 │ - Added Edges       │
                 │ - Removed Edges     │
                 └─────────┬───────────┘
                           │
                           ▼
                 ┌─────────────────────┐
                 │ Change Summary      │
                 │ Natural Language    │
                 └─────────────────────┘
```

---

# Installation

Clone the repository:

```bash
git clone https://github.com/your-username/semantic_diffing.git
cd semantic_diffing
```

Install dependencies:

```bash
pip install -r requirements.txt
```

Set the Groq API key:

Linux / macOS:

```bash
export GROQ_API_KEY=your_key_here
```

Windows (PowerShell):

```powershell
$env:GROQ_API_KEY = "your_key_here"
```

Run the application:

```bash
python app.py
```

Then open `http://localhost:5050` in your browser. This serves the full interactive frontend, including an animated live-diff demo, force-directed graph views, and a sample dataset that works out of the box even without an API key.

---

# Input Format

Supported formats:

* `.txt` documents

Two versions are required:

* Baseline document (v1)
* Updated document (v2)

---

# Sample Data

Sample documents are included in the `data/` directory:

* `doc_v1.txt`
  Baseline version of a fictional company description.

* `doc_v2.txt`
  Updated version containing new entities and relationships.

These files allow quick testing of semantic diffing functionality.

---

# Project Structure

```text
semantic_diffing/
│
├── app.py
│   Flask entry point: serves the frontend and the /api/diff endpoint
│
├── semantic_diff.py
│   Entity and relationship extraction
│   Graph diff computation
│
├── graph_utils.py
│   NetworkX graph construction
│   Graph visualization
│
├── frontend/
│   ├── index.html
│   │   Full interactive single-page frontend
│   └── static/
│       ├── css/style.css
│       ├── js/app.js
│       │   Animation, demo orchestration, D3 graph rendering
│       └── js/demo-data.js
│           Bundled offline fixture so the demo works without an API key
│
├── data/
│   ├── doc_v1.txt
│   └── doc_v2.txt
│
├── requirements.txt
│   Python dependencies
│
└── README.md
```

---

# Core Modules

## semantic_diff.py

Responsible for:

* LLM-based entity extraction
* Relationship extraction
* Graph comparison logic
* Detection of semantic differences
* Generation of change summaries

Key operations:

* Extract entities
* Extract relationships
* Compute node differences
* Compute edge differences
* Generate structured diff output
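
The node and edge comparison listed above comes down to set differences. The sketch below is an assumption about how this could be done, not the module's real interface: `compute_diff` is a hypothetical name, the input shape follows the `entities` / `relationships` keys of the extraction prompt, and each relationship is assumed to be a `source` / `relation` / `target` dict as in the example diff output.

```python
def _triple(rel: dict) -> tuple:
    """Turn a relationship dict into a hashable (source, relation, target) key."""
    return (rel["source"], rel["relation"], rel["target"])


def _as_dict(triple: tuple) -> dict:
    """Convert a (source, relation, target) key back into a relationship dict."""
    source, relation, target = triple
    return {"source": source, "relation": relation, "target": target}


def compute_diff(old: dict, new: dict) -> dict:
    """Compare two extraction results shaped like {"entities": [...], "relationships": [...]}."""
    old_nodes, new_nodes = set(old["entities"]), set(new["entities"])
    old_edges = {_triple(r) for r in old["relationships"]}
    new_edges = {_triple(r) for r in new["relationships"]}
    # Sorting makes the output deterministic, which keeps diffs stable across runs.
    return {
        "added_entities": sorted(new_nodes - old_nodes),
        "removed_entities": sorted(old_nodes - new_nodes),
        "added_relationships": [_as_dict(t) for t in sorted(new_edges - old_edges)],
        "removed_relationships": [_as_dict(t) for t in sorted(old_edges - new_edges)],
    }
```

An edge counts as changed whenever any part of its triple differs, so a renamed relation between the same two entities shows up as one removal plus one addition.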

---

## graph_utils.py

Responsible for:

* Building knowledge graphs using NetworkX
* Representing entities as nodes
* Representing relationships as edges
* Visualizing graph snapshots
* Highlighting added and removed elements
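
As a rough sketch of these responsibilities, the snippet below builds a NetworkX graph from an extraction result and labels each node for highlighting. The function names, the relationship dict shape, and the `STATUS_COLORS` palette are assumptions made for illustration; the project's real palette is defined in `graph_utils.py`.

```python
import networkx as nx


def build_graph(entities, relationships):
    """Build a directed multigraph: entities become nodes, relationships become keyed edges."""
    graph = nx.MultiDiGraph()
    graph.add_nodes_from(entities)
    for rel in relationships:
        # The relation label is used as the edge key, so two entities can be
        # linked by several differently named relationships.
        graph.add_edge(rel["source"], rel["target"], key=rel["relation"])
    return graph


# Hypothetical palette, for illustration only.
STATUS_COLORS = {"added": "green", "removed": "red", "unchanged": "gray"}


def node_status(old_graph, new_graph):
    """Label every node across both snapshots as added, removed, or unchanged."""
    status = {}
    for node in set(old_graph.nodes) | set(new_graph.nodes):
        if node not in old_graph:
            status[node] = "added"
        elif node not in new_graph:
            status[node] = "removed"
        else:
            status[node] = "unchanged"
    return status
```

The labels can then be mapped through `STATUS_COLORS` and passed as the `node_color` list when drawing with `nx.draw`.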

---

## app.py

Acts as the Flask entry point: it serves the frontend and exposes the `/api/diff` endpoint.

Responsible for:

* Loading document versions
* Triggering extraction pipeline
* Building graphs
* Running diff computation
* Displaying outputs

---

# Example Output (Graph Diff JSON)

```json
{
  "added_entities": [
    "AI Research Division",
    "Cloud Infrastructure Team"
  ],
  "removed_entities": [
    "Legacy Systems Department"
  ],
  "added_relationships": [
    {
      "source": "ABC Corporation",
      "relation": "launched",
      "target": "AI Research Division"
    }
  ],
  "removed_relationships": [
    {
      "source": "ABC Corporation",
      "relation": "maintains",
      "target": "Legacy Systems Department"
    }
  ]
}
```
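
One way to consume a diff in this shape is to replay it against the baseline snapshot, which also serves as a sanity check that the diff is complete. The helper below is a hypothetical sketch (`apply_diff` is not part of the project), and the inline JSON is a trimmed copy of the example above.

```python
import json


def _key(rel):
    """Hashable identity of a relationship dict."""
    return (rel["source"], rel["relation"], rel["target"])


def apply_diff(entities, relationships, diff):
    """Apply a diff shaped like the JSON above to a baseline snapshot."""
    entities = (set(entities) - set(diff["removed_entities"])) | set(diff["added_entities"])
    removed = {_key(r) for r in diff["removed_relationships"]}
    kept = [r for r in relationships if _key(r) not in removed]
    return sorted(entities), kept + list(diff["added_relationships"])


diff = json.loads("""
{
  "added_entities": ["AI Research Division"],
  "removed_entities": ["Legacy Systems Department"],
  "added_relationships": [
    {"source": "ABC Corporation", "relation": "launched", "target": "AI Research Division"}
  ],
  "removed_relationships": [
    {"source": "ABC Corporation", "relation": "maintains", "target": "Legacy Systems Department"}
  ]
}
""")
baseline_entities = ["ABC Corporation", "Legacy Systems Department"]
baseline_relationships = [
    {"source": "ABC Corporation", "relation": "maintains", "target": "Legacy Systems Department"}
]
entities_v2, relationships_v2 = apply_diff(baseline_entities, baseline_relationships, diff)
print(entities_v2)  # ['ABC Corporation', 'AI Research Division']
```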

---

# Example LLM Extraction Prompt

```text
You are an information extraction system.

Extract structured entities and relationships from the text.

Return output in JSON format using:

{
  "entities": [],
  "relationships": []
}

Rules:

1. Entities should represent meaningful objects such as:
   - Organizations
   - Departments
   - Products
   - Teams
   - Locations

2. Relationships should represent interactions between entities.

Text:

{DOCUMENT_TEXT}
```
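
A prompt like this needs two pieces of glue code: filling in the placeholder and parsing the model's reply. The sketch below shows one possible approach and deliberately leaves out the API call itself; `build_prompt` and `parse_reply` are hypothetical names, and the template is abbreviated.

```python
import json
import re

# Abbreviated copy of the prompt above.
PROMPT_TEMPLATE = """You are an information extraction system.

Return output in JSON format using:

{
  "entities": [],
  "relationships": []
}

Text:

{DOCUMENT_TEXT}
"""


def build_prompt(document_text: str) -> str:
    # str.format would trip over the literal braces of the JSON skeleton,
    # so only the placeholder is substituted.
    return PROMPT_TEMPLATE.replace("{DOCUMENT_TEXT}", document_text)


def parse_reply(reply: str) -> dict:
    """Pull the outermost JSON object out of a model reply and check its shape."""
    # Greedy match from the first '{' to the last '}', tolerating surrounding prose.
    match = re.search(r"\{.*\}", reply, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in reply")
    data = json.loads(match.group(0))
    for field in ("entities", "relationships"):
        if not isinstance(data.get(field), list):
            raise ValueError(f"missing or non-list field: {field}")
    return data
```

Validating the reply before building a graph keeps malformed LLM output from silently producing an empty or partial diff.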

---

# Example Diff Summary

```text
Changes detected between document versions:

- Two new entities were introduced: AI Research Division and Cloud Infrastructure Team.
- One entity was removed: Legacy Systems Department.
- A new relationship was added linking ABC Corporation to AI Research Division.
- A maintenance relationship with Legacy Systems Department was removed.
```

---

# Technologies Used

* Python
* Flask
* NetworkX
* Matplotlib
* D3
* Large Language Models (LLMs)
* Groq API
* Natural Language Processing (NLP)
* Graph Theory

---

# Design Considerations

* Separate graphs are built per document version.
* Diffing operates at both node and edge levels.
* Structured outputs enable downstream analytics.
* Modular design allows extension to multi-version comparison.
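
To illustrate the last point, a two-version diff can be extended to a sequence of versions by comparing each snapshot with its predecessor. This is a hypothetical sketch of such an extension, not an existing feature; it diffs entity sets only, and relationship triples could be handled the same way.

```python
def timeline_diffs(snapshots):
    """Diff each snapshot against its predecessor.

    `snapshots` is an ordered list of (label, entity_set) pairs.
    """
    steps = []
    # zip pairs every snapshot with the one that follows it.
    for (old_label, old), (new_label, new) in zip(snapshots, snapshots[1:]):
        steps.append({
            "from": old_label,
            "to": new_label,
            "added": sorted(new - old),
            "removed": sorted(old - new),
        })
    return steps


history = [
    ("v1", {"ABC Corporation", "Legacy Systems Department"}),
    ("v2", {"ABC Corporation", "AI Research Division"}),
    ("v3", {"ABC Corporation", "AI Research Division", "Cloud Infrastructure Team"}),
]
for step in timeline_diffs(history):
    print(step["from"], "->", step["to"], step["added"], step["removed"])
```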

---

# Limitations

* Extraction accuracy depends on LLM output quality.
* Large graphs can make the visualizations cluttered and hard to read.
* Relationship normalization may require domain tuning.
* Currently supports two-version comparison only.
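
The normalization point matters because a set-based diff treats any spelling variation as a different item, so inconsistent LLM output can surface as spurious add/remove pairs. A minimal normalization pass might look like the sketch below; the function names and the synonym table are hypothetical and would need tuning per domain.

```python
import re

# Hypothetical synonym table, for illustration only.
RELATION_SYNONYMS = {"started": "launched", "founded": "launched", "runs": "maintains"}


def normalize_entity(name: str) -> str:
    """Collapse whitespace and case so 'AI  Research division' matches 'AI Research Division'."""
    return re.sub(r"\s+", " ", name).strip().casefold()


def normalize_relation(label: str) -> str:
    """Lowercase, map spaces and hyphens to underscores, then apply the synonym table."""
    canonical = re.sub(r"[\s\-]+", "_", label.strip().lower())
    return RELATION_SYNONYMS.get(canonical, canonical)
```

Normalized strings are best used as comparison keys only, with the original spelling kept for display.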

---

# Future Improvements

* Multi-version timeline diffing
* Graph history tracking
* Knowledge graph persistence
* Interactive graph exploration
* Graph database integration (Neo4j)
* Graph embedding similarity metrics
* Change severity scoring
* Support for additional document formats

---

# Use Cases

This system can be applied to:

* Enterprise knowledge tracking
* Policy change monitoring
* Technical documentation updates
* Compliance auditing
* Legal contract version comparison
* Organizational change tracking
* Knowledge management systems