---
title: Semantic Diffing for Evolving Knowledge Graphs
emoji: 🔀
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
pinned: false
---

Semantic Diffing for Evolving Knowledge Graphs

A system for tracking structural changes in knowledge graphs as documents evolve over time. This project extracts entities and relationships from multiple document versions, constructs graph representations, and identifies semantic differences such as added or removed entities and relationships.

The system enables comparison between document snapshots and generates both structured graph diffs and natural-language summaries of detected changes.

Overview

Knowledge graphs evolve as new information becomes available. Tracking changes between versions is critical in domains such as enterprise knowledge management, legal systems, compliance workflows, and technical documentation.

This project implements:

Entity and relationship extraction from document versions
Knowledge graph construction using NetworkX
Graph-level semantic diffing
Identification of added and removed nodes and edges
Natural-language summarization of detected changes

Key Features

Extract entities and relationships from document text
Build graph representations for multiple document versions
Compare knowledge graph snapshots
Detect added entities
Detect removed entities
Detect added relationships
Detect removed relationships
Generate structured graph diffs
Produce natural-language summaries of changes
Visualize knowledge graph snapshots

Frontend

The project ships with a full interactive, animated frontend at frontend/index.html, served directly by app.py. It includes:

A live-diff demo that runs instantly against bundled sample data, with no API key needed to try it
An optional advanced panel for entering a Groq API key to run the diff live against /api/diff
Side-by-side force-directed graph views (D3) of the two knowledge graph versions, color-coded to match graph_utils.py's own diff palette
A terminal-style animated diff console rendering added/removed/unchanged entities and relations
A walkthrough of the 5-stage pipeline, architecture breakdown, use cases, and roadmap

To use it, just run python app.py and open http://localhost:5050.

How It Works

1. Upload two document versions:
   - Baseline document (v1)
   - Updated document (v2)
2. Each document is processed independently:
   - Text is parsed
   - Entities are extracted
   - Relationships are extracted
3. A knowledge graph is created for each version.
4. Graph diffing identifies:
   - New entities
   - Removed entities
   - New relationships
   - Removed relationships
5. A natural-language summary describes the detected changes.
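
The steps above can be sketched end to end in a few lines of Python. This is a minimal illustration under stated assumptions, not the project's actual code: `diff_documents`, `to_triples`, and `toy_extract` are hypothetical names, and the toy extractor parses `source | relation | target` lines instead of calling an LLM.

```python
from typing import Callable, Dict, List, Set, Tuple

Triple = Tuple[str, str, str]


def to_triples(relationships: List[Dict[str, str]]) -> Set[Triple]:
    """Turn relationship records into hashable (source, relation, target) tuples."""
    return {(r["source"], r["relation"], r["target"]) for r in relationships}


def diff_documents(text_v1: str, text_v2: str,
                   extract: Callable[[str], dict]) -> dict:
    """Extract from each version independently, then compare the results."""
    old, new = extract(text_v1), extract(text_v2)
    old_entities, new_entities = set(old["entities"]), set(new["entities"])
    old_rels = to_triples(old["relationships"])
    new_rels = to_triples(new["relationships"])
    return {
        "added_entities": sorted(new_entities - old_entities),
        "removed_entities": sorted(old_entities - new_entities),
        "added_relationships": sorted(new_rels - old_rels),
        "removed_relationships": sorted(old_rels - new_rels),
    }


def toy_extract(text: str) -> dict:
    """Hypothetical stand-in for the LLM step (assumption, for illustration only)."""
    relationships = []
    for line in text.strip().splitlines():
        source, relation, target = (part.strip() for part in line.split("|"))
        relationships.append(
            {"source": source, "relation": relation, "target": target}
        )
    entities = {r["source"] for r in relationships} | {
        r["target"] for r in relationships
    }
    return {"entities": sorted(entities), "relationships": relationships}


doc_v1 = "ABC Corporation | maintains | Legacy Systems Department"
doc_v2 = "ABC Corporation | launched | AI Research Division"
result = diff_documents(doc_v1, doc_v2, toy_extract)
print(result["added_entities"])    # ['AI Research Division']
print(result["removed_entities"])  # ['Legacy Systems Department']
```

Passing the extractor in as a callable keeps the comparison logic testable without an API key, which mirrors how the bundled demo works offline.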

System Architecture

                 ┌─────────────────────┐
                 │   Document v1       │
                 │   (Baseline)        │
                 └─────────┬───────────┘
                           │
                           ▼
                 ┌─────────────────────┐
                 │ Entity & Relation   │
                 │ Extraction (LLM)    │
                 └─────────┬───────────┘
                           │
                           ▼
                 ┌─────────────────────┐
                 │ Knowledge Graph v1  │
                 └─────────────────────┘


                 ┌─────────────────────┐
                 │   Document v2       │
                 │   (Updated)         │
                 └─────────┬───────────┘
                           │
                           ▼
                 ┌─────────────────────┐
                 │ Entity & Relation   │
                 │ Extraction (LLM)    │
                 └─────────┬───────────┘
                           │
                           ▼
                 ┌─────────────────────┐
                 │ Knowledge Graph v2  │
                 └─────────┬───────────┘
                           │
                           ▼
                 ┌─────────────────────┐
                 │ Graph Diff Engine   │
                 │ - Added Nodes       │
                 │ - Removed Nodes     │
                 │ - Added Edges       │
                 │ - Removed Edges     │
                 └─────────┬───────────┘
                           │
                           ▼
                 ┌─────────────────────┐
                 │ Change Summary      │
                 │ Natural Language    │
                 └─────────────────────┘

Installation

Clone the repository:

git clone https://github.com/your-username/semantic_diffing.git
cd semantic_diffing

Install dependencies:

pip install -r requirements.txt

Set the Groq API key:

Linux / macOS:

export GROQ_API_KEY=your_key_here

Windows (Command Prompt):

set GROQ_API_KEY=your_key_here

Windows (PowerShell):

$env:GROQ_API_KEY = "your_key_here"

Run the application:

python app.py

Then open http://localhost:5050 in your browser. This serves the interactive frontend, including the animated live-diff demo, force-directed graph views, and a sample dataset that works out of the box even without an API key.

Input Format

Supported formats:

.txt documents

Two versions are required:

Baseline document (v1)
Updated document (v2)

Sample Data

Sample documents are included in the data/ directory:

doc_v1.txt: Baseline version of a fictional company description.
doc_v2.txt: Updated version containing new entities and relationships.

These files allow quick testing of semantic diffing functionality.

Project Structure

semantic_diffing/
│
├── app.py
│   Flask entry point — serves the frontend and the /api/diff endpoint
│
├── semantic_diff.py
│   Entity and relationship extraction
│   Graph diff computation
│
├── graph_utils.py
│   NetworkX graph construction
│   Graph visualization
│
├── frontend/
│   ├── index.html
│   │   Full interactive single-page frontend
│   └── static/
│       ├── css/style.css
│       ├── js/app.js
│       │   Animation, demo orchestration, D3 graph rendering
│       └── js/demo-data.js
│           Bundled offline fixture so the demo works without an API key
│
├── data/
│   ├── doc_v1.txt
│   └── doc_v2.txt
│
├── requirements.txt
│   Python dependencies
│
└── README.md

Core Modules

semantic_diff.py

Responsible for:

LLM-based entity extraction
Relationship extraction
Graph comparison logic
Detection of semantic differences
Generation of change summaries

Key operations:

Extract entities
Extract relationships
Compute node differences
Compute edge differences
Generate structured diff output
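
As a rough sketch of the node and edge difference operations, the snippet below compares two NetworkX graphs and returns a structure shaped like the Example Output section. The function names (`edge_triples`, `diff_graphs`) and the choice to store the relation label in an edge attribute called `relation` are assumptions made for illustration; the actual module may organize this differently.

```python
import networkx as nx


def edge_triples(graph: nx.MultiDiGraph) -> set:
    """Collect edges as (source, relation, target) tuples so they can be compared as sets."""
    return {
        (source, data.get("relation", ""), target)
        for source, target, data in graph.edges(data=True)
    }


def as_records(triples: set) -> list:
    """Convert triples back into JSON-friendly relationship records."""
    return [
        {"source": source, "relation": relation, "target": target}
        for source, relation, target in sorted(triples)
    ]


def diff_graphs(old: nx.MultiDiGraph, new: nx.MultiDiGraph) -> dict:
    """Compute added and removed nodes and edges between two graph snapshots."""
    old_nodes, new_nodes = set(old.nodes), set(new.nodes)
    old_edges, new_edges = edge_triples(old), edge_triples(new)
    return {
        "added_entities": sorted(new_nodes - old_nodes),
        "removed_entities": sorted(old_nodes - new_nodes),
        "added_relationships": as_records(new_edges - old_edges),
        "removed_relationships": as_records(old_edges - new_edges),
    }


old_graph = nx.MultiDiGraph()
old_graph.add_edge("ABC Corporation", "Legacy Systems Department", relation="maintains")

new_graph = nx.MultiDiGraph()
new_graph.add_edge("ABC Corporation", "AI Research Division", relation="launched")
new_graph.add_node("Cloud Infrastructure Team")

diff = diff_graphs(old_graph, new_graph)
print(diff["added_entities"])
```

Comparing edges as full triples means a changed relation label between the same two entities is reported as one removal plus one addition.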

graph_utils.py

Responsible for:

Building knowledge graphs using NetworkX
Representing entities as nodes
Representing relationships as edges
Visualizing graph snapshots
Highlighting added and removed elements
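
A minimal sketch of the graph construction step, assuming the extraction output uses the `entities` / `relationships` shape shown in the prompt and diff examples. The function name `build_graph` and the use of `MultiDiGraph` (so that two entities can be linked by more than one relation) are illustrative choices, not confirmed details of this module.

```python
import networkx as nx


def build_graph(extraction: dict) -> nx.MultiDiGraph:
    """Build a directed multigraph: entities become nodes, relationships become edges."""
    graph = nx.MultiDiGraph()
    # Add entities first so that entities without any relationship still appear.
    graph.add_nodes_from(extraction.get("entities", []))
    for rel in extraction.get("relationships", []):
        # add_edge creates missing endpoint nodes automatically.
        graph.add_edge(rel["source"], rel["target"], relation=rel["relation"])
    return graph


extraction = {
    "entities": ["ABC Corporation", "AI Research Division", "Cloud Infrastructure Team"],
    "relationships": [
        {
            "source": "ABC Corporation",
            "relation": "launched",
            "target": "AI Research Division",
        }
    ],
}

graph = build_graph(extraction)
print(graph.number_of_nodes(), graph.number_of_edges())  # 3 1
```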

app.py

Acts as the Flask entry point: it serves the frontend, exposes the /api/diff endpoint, and orchestrates the pipeline.

Responsible for:

Loading document versions
Triggering extraction pipeline
Building graphs
Running diff computation
Displaying outputs

Example Output (Graph Diff JSON)

{
  "added_entities": [
    "AI Research Division",
    "Cloud Infrastructure Team"
  ],
  "removed_entities": [
    "Legacy Systems Department"
  ],
  "added_relationships": [
    {
      "source": "ABC Corporation",
      "relation": "launched",
      "target": "AI Research Division"
    }
  ],
  "removed_relationships": [
    {
      "source": "ABC Corporation",
      "relation": "maintains",
      "target": "Legacy Systems Department"
    }
  ]
}

Example LLM Extraction Prompt

You are an information extraction system.

Extract structured entities and relationships from the text.

Return output in JSON format using:

{
  "entities": [],
  "relationships": []
}

Rules:

1. Entities should represent meaningful objects such as:
   - Organizations
   - Departments
   - Products
   - Teams
   - Locations

2. Relationships should represent interactions between entities.

Text:

{DOCUMENT_TEXT}
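
One way to fill in this prompt and read back the model's answer is sketched below. The helper names (`build_prompt`, `parse_extraction`) are hypothetical, the template is abbreviated, and no API call is made: the raw response string stands in for whatever the LLM returns.

```python
import json

# Abbreviated copy of the prompt above; note the literal JSON braces.
PROMPT_TEMPLATE = """You are an information extraction system.

Extract structured entities and relationships from the text.

Return output in JSON format using:

{
  "entities": [],
  "relationships": []
}

Text:

{DOCUMENT_TEXT}
"""


def build_prompt(document_text: str) -> str:
    """Insert the document text into the template.

    str.replace is used instead of str.format because the template contains
    literal braces that str.format would try to interpret as fields.
    """
    return PROMPT_TEMPLATE.replace("{DOCUMENT_TEXT}", document_text)


def parse_extraction(raw_response: str) -> dict:
    """Pull the outermost JSON object out of a model response and validate its keys."""
    start, end = raw_response.find("{"), raw_response.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model response")
    payload = json.loads(raw_response[start:end + 1])
    return {
        "entities": list(payload.get("entities", [])),
        "relationships": list(payload.get("relationships", [])),
    }


prompt = build_prompt("ABC Corporation launched the AI Research Division.")
raw = 'Here is the result:\n{"entities": ["ABC Corporation"], "relationships": []}\nDone.'
print(parse_extraction(raw))
```

Slicing from the first `{` to the last `}` tolerates responses that wrap the JSON in extra prose, but it still raises if the model returns malformed JSON, which is usually the safer failure mode.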

Example Diff Summary

Changes detected between document versions:

- Two new entities were introduced: AI Research Division and Cloud Infrastructure Team.
- One entity was removed: Legacy Systems Department.
- A new relationship was added linking ABC Corporation to AI Research Division.
- A maintenance relationship with Legacy Systems Department was removed.
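
For comparison, a purely template-based summary can be produced from the structured diff without any LLM call. The sketch below is an assumption about how such a fallback could look (`summarize_diff` is a hypothetical name); its wording is more mechanical than the example summary above.

```python
def summarize_diff(diff: dict) -> str:
    """Render a structured graph diff as a short plain-text summary."""
    lines = []
    for key, label in (
        ("added_entities", "Added entities"),
        ("removed_entities", "Removed entities"),
    ):
        names = diff.get(key, [])
        if names:
            lines.append(f"- {label} ({len(names)}): {', '.join(names)}")
    for key, label in (
        ("added_relationships", "Added relationship"),
        ("removed_relationships", "Removed relationship"),
    ):
        for rel in diff.get(key, []):
            lines.append(
                f"- {label}: {rel['source']} {rel['relation']} {rel['target']}"
            )
    if not lines:
        return "No changes detected between document versions."
    return "\n".join(["Changes detected between document versions:", "", *lines])


example_diff = {
    "added_entities": ["AI Research Division", "Cloud Infrastructure Team"],
    "removed_entities": ["Legacy Systems Department"],
    "added_relationships": [
        {"source": "ABC Corporation", "relation": "launched",
         "target": "AI Research Division"}
    ],
    "removed_relationships": [
        {"source": "ABC Corporation", "relation": "maintains",
         "target": "Legacy Systems Department"}
    ],
}

summary = summarize_diff(example_diff)
print(summary)
```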

Technologies Used

Python
Flask
NetworkX
Matplotlib
D3
Large Language Models (LLMs)
Groq API
Natural Language Processing (NLP)
Graph Theory

Design Considerations

Separate graphs are built per document version.
Diffing operates at both node and edge levels.
Structured outputs enable downstream analytics.
Modular design allows extension to multi-version comparison.
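
To illustrate the last point, a two-version diff function can be lifted to a sequence of versions by applying it to each consecutive pair. This is only a sketch of the idea: `timeline_diffs` and `entity_diff` are hypothetical names, and the snapshots here are plain entity sets rather than full graphs.

```python
from typing import Callable, List, Sequence, TypeVar

Snapshot = TypeVar("Snapshot")
Diff = TypeVar("Diff")


def timeline_diffs(snapshots: Sequence[Snapshot],
                   diff_fn: Callable[[Snapshot, Snapshot], Diff]) -> List[Diff]:
    """Apply a pairwise diff to each consecutive pair (v1 vs v2, v2 vs v3, ...)."""
    return [diff_fn(old, new) for old, new in zip(snapshots, snapshots[1:])]


def entity_diff(old: set, new: set) -> dict:
    """Toy pairwise diff over entity sets, standing in for the full graph diff."""
    return {"added": sorted(new - old), "removed": sorted(old - new)}


versions = [{"A", "B"}, {"A", "C"}, {"A", "C", "D"}]
steps = timeline_diffs(versions, entity_diff)
print(steps)
# [{'added': ['C'], 'removed': ['B']}, {'added': ['D'], 'removed': []}]
```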

Limitations

Extraction accuracy depends on LLM output quality.
Large graphs may increase visualization complexity.
Relationship normalization may require domain tuning.
Currently supports two-version comparison only.
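
As an example of the normalization concern above, the sketch below canonicalizes labels before comparison so that differences in casing or spacing are not reported as changes. The function names and the synonym table are hypothetical; a real table would need tuning for each domain.

```python
import re
import unicodedata
from typing import Dict, Optional

# Hypothetical synonym table mapping relation variants to one canonical label.
RELATION_SYNONYMS = {"started": "launched", "founded": "launched"}


def normalize_label(label: str) -> str:
    """Canonicalize a label: Unicode-normalize, collapse whitespace, and case-fold."""
    text = unicodedata.normalize("NFKC", label)
    text = re.sub(r"\s+", " ", text).strip()
    return text.casefold()


def normalize_relation(relation: str,
                       synonyms: Optional[Dict[str, str]] = None) -> str:
    """Normalize a relation label, then map known synonyms to a canonical form."""
    table = RELATION_SYNONYMS if synonyms is None else synonyms
    key = normalize_label(relation)
    return table.get(key, key)


print(normalize_label("  AI   Research\tDivision "))  # ai research division
print(normalize_relation("Started"))                  # launched
print(normalize_relation("Maintains"))                # maintains
```

Normalized labels are best used only as comparison keys, with the original spelling kept alongside for display.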

Future Improvements

Multi-version timeline diffing
Graph history tracking
Knowledge graph persistence
Interactive graph exploration
Graph database integration (Neo4j)
Graph embedding similarity metrics
Change severity scoring
Support for additional document formats

Use Cases

This system can be applied to:

Enterprise knowledge tracking
Policy change monitoring
Technical documentation updates
Compliance auditing
Legal contract version comparison
Organizational change tracking
Knowledge management systems