
Course Project: FileOrganizer

A CLI tool that uses local LLMs and AI agents to intelligently organize files, with special focus on research paper management.


Overview

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                      FileOrganizer CLI                       β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  $ fileorg scan ~/Downloads                                  β”‚
β”‚  $ fileorg organize ~/Papers --strategy=by-topic             β”‚
β”‚  $ fileorg deduplicate ~/Research --similarity=0.9           β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Architecture

Files ──► Content Analysis ──► AI Classification ──► Organized Structure
              β”‚                        β”‚
              β–Ό                        β–Ό
        PDF Extraction          Docker Model Runner
        Metadata Tools            (Local LLM)
              β”‚                        β”‚
              └────────►MCPβ—„β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Data Flow

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Files/PDFs  │────►│   Content    │────►│  MCP Server  β”‚
β”‚   (Input)    β”‚     β”‚  Extraction  β”‚     β”‚   (Tools)    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜
                                                  β”‚
                                                  β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Organized   │◄────│  Agent Crew  │◄────│  Local LLM   β”‚
β”‚  Structure   β”‚     β”‚  (CrewAI)    β”‚     β”‚   (Docker)   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Agent System

| Agent | Role | Tools | Output |
| --- | --- | --- | --- |
| Scanner Agent | Discovers files, extracts metadata | File I/O, PDF extraction, hash generation | File inventory, metadata catalog |
| Classifier Agent | Categorizes files by content and context | LLM analysis, embeddings, similarity | Category assignments, topic tags |
| Organizer Agent | Creates folder structure and moves files | File operations, naming strategies | Organized directory tree |
| Deduplicator Agent | Finds and handles duplicate files | Hash comparison, content similarity | Duplicate reports, cleanup actions |

Agent Workflow

User Request: "Organize research papers by topic"
                    β”‚
                    β–Ό
         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
         β”‚   Scanner Agent     β”‚
         β”‚  "What files do we  β”‚
         β”‚   have and what     β”‚
         β”‚   are they about?"  β”‚
         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                    β”‚ File Inventory
                    β–Ό
         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
         β”‚  Classifier Agent   β”‚
         β”‚  "What topics and   β”‚
         β”‚   categories emerge β”‚
         β”‚   from the content?"β”‚
         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                    β”‚ Categories
                    β–Ό
         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
         β”‚  Organizer Agent    β”‚
         β”‚  "Create folder     β”‚
         β”‚   structure and     β”‚
         β”‚   move files"       β”‚
         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                    β”‚ Organization Plan
                    β–Ό
         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
         β”‚ Deduplicator Agent  β”‚
         β”‚  "Find and handle   β”‚
         β”‚   duplicate files"  β”‚
         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                    β”‚
                    β–Ό
          Organized Directory
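
A minimal sketch of how this pipeline could be wired up with CrewAI. The roles and goals mirror the agent configuration shown later; the backstories, task descriptions, and the choice of a sequential process are illustrative assumptions, not the project's actual code:

# Hypothetical sketch of src/fileorg/agents/crew.py
from crewai import Agent, Task, Crew, Process

scanner = Agent(
    role="File Scanner",
    goal="Discover and catalog all files with metadata",
    backstory="A meticulous archivist who never misses a file.",
)
classifier = Agent(
    role="Content Classifier",
    goal="Categorize files by content and context",
    backstory="A librarian with a knack for spotting topics.",
)
organizer = Agent(
    role="File Organizer",
    goal="Create optimal folder structure and organize files",
    backstory="An engineer who loves tidy directory trees.",
)

scan_task = Task(
    description="Scan {path} and produce a file inventory with metadata.",
    expected_output="A JSON inventory of files",
    agent=scanner,
)
classify_task = Task(
    description="Assign a topic category to every file in the inventory.",
    expected_output="Category assignments and topic tags",
    agent=classifier,
)
organize_task = Task(
    description="Propose a folder structure and a move plan per category.",
    expected_output="An organization plan",
    agent=organizer,
)

crew = Crew(
    agents=[scanner, classifier, organizer],
    tasks=[scan_task, classify_task, organize_task],
    process=Process.sequential,  # Scanner -> Classifier -> Organizer
)
result = crew.kickoff(inputs={"path": "~/Papers"})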

CLI Commands

fileorg scan

Scan a directory and analyze its contents.

# Scan a directory
fileorg scan ~/Downloads

# Scan with detailed analysis
fileorg scan ~/Papers --analyze-content

# Scan and export inventory
fileorg scan ~/Research --export inventory.json

# Scan specific file types
fileorg scan ~/Documents --types pdf,docx,txt

Options:

| Flag | Description | Default |
| --- | --- | --- |
| --analyze-content | Extract and analyze file contents | false |
| --export | Export inventory to JSON/CSV | None |
| --types | Comma-separated file extensions to scan | All |
| --recursive | Scan subdirectories | true |
| --max-depth | Maximum directory depth | 10 |
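
Since the CLI is built with Typer, the scan command might look roughly like the sketch below. The option names follow the table above; the function body is an assumption (depth limiting and content analysis are omitted for brevity):

# Hypothetical sketch of the scan command in src/fileorg/cli.py
from pathlib import Path
from typing import Optional

import typer

app = typer.Typer()

@app.command()
def scan(
    path: Path = typer.Argument(..., help="Directory to scan"),
    analyze_content: bool = typer.Option(False, help="Extract and analyze file contents"),
    export: Optional[Path] = typer.Option(None, help="Export inventory to JSON/CSV"),
    types: Optional[str] = typer.Option(None, help="Comma-separated file extensions to scan"),
    recursive: bool = typer.Option(True, help="Scan subdirectories"),
    max_depth: int = typer.Option(10, help="Maximum directory depth"),
) -> None:
    """Scan a directory and analyze its contents."""
    # Typer maps analyze_content to --analyze-content automatically
    wanted = {f".{t.strip().lower()}" for t in types.split(",")} if types else None
    pattern = "**/*" if recursive else "*"
    files = [
        p for p in path.expanduser().glob(pattern)
        if p.is_file() and (wanted is None or p.suffix.lower() in wanted)
    ]
    typer.echo(f"Found {len(files)} files under {path}")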

fileorg organize

Organize files using AI-powered strategies.

# Organize by topic (AI-powered)
fileorg organize ~/Papers --strategy=by-topic

# Organize by date
fileorg organize ~/Photos --strategy=by-date --format="%Y/%m"

# Organize with custom naming
fileorg organize ~/Papers --rename --pattern="{year}_{author}_{title}"

# Dry run to preview changes
fileorg organize ~/Downloads --dry-run

# Interactive mode
fileorg organize ~/Research --interactive

Options:

| Flag | Description | Default |
| --- | --- | --- |
| --strategy | Organization strategy: by-topic, by-date, by-type, by-author, smart | smart |
| --rename | Rename files intelligently | false |
| --pattern | Naming pattern for renamed files | {original} |
| --dry-run | Preview changes without executing | false |
| --interactive | Confirm each action | false |
| --output | Output directory | Same as input |
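
As a concrete illustration, the by-date strategy with --format="%Y/%m" reduces to bucketing files by modification time. A minimal sketch (the real strategies.py presumably adds backups, collision handling, and the other strategies):

# Hypothetical sketch of a by-date organization strategy
import shutil
from datetime import datetime
from pathlib import Path

def organize_by_date(root: Path, fmt: str = "%Y/%m", dry_run: bool = True) -> None:
    for f in sorted(root.iterdir()):
        if not f.is_file():
            continue
        # Bucket by modification time, e.g. 2024/03/report.pdf
        bucket = root / datetime.fromtimestamp(f.stat().st_mtime).strftime(fmt)
        print(f"{f.name} -> {bucket.relative_to(root)}/")
        if not dry_run:
            bucket.mkdir(parents=True, exist_ok=True)
            shutil.move(str(f), str(bucket / f.name))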

fileorg deduplicate

Find and handle duplicate files.

# Find duplicates by hash
fileorg deduplicate ~/Downloads

# Find similar files (content-based)
fileorg deduplicate ~/Papers --similarity=0.9

# Auto-delete duplicates (keep newest)
fileorg deduplicate ~/Photos --auto-delete --keep=newest

# Move duplicates to folder
fileorg deduplicate ~/Documents --move-to=./duplicates

Options:

| Flag | Description | Default |
| --- | --- | --- |
| --similarity | Similarity threshold (0.0-1.0) for content matching | 1.0 (exact) |
| --method | Detection method: hash, content, metadata | hash |
| --auto-delete | Automatically delete duplicates | false |
| --keep | Which copy to keep: newest, oldest, largest, smallest | newest |
| --move-to | Move duplicates to a directory instead of deleting | None |
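
The default hash method boils down to grouping files by a content digest and reporting any digest shared by more than one file. A sketch of the idea behind hash_based.py (SHA-256 here; the stack also lists xxhash as a faster non-cryptographic option):

# Hypothetical sketch of hash-based duplicate detection
import hashlib
from collections import defaultdict
from pathlib import Path

def find_duplicates(root: Path) -> dict[str, list[Path]]:
    groups: dict[str, list[Path]] = defaultdict(list)
    for f in root.rglob("*"):
        if f.is_file():
            # Reads whole files; chunked hashing is kinder to large files
            digest = hashlib.sha256(f.read_bytes()).hexdigest()
            groups[digest].append(f)
    return {h: paths for h, paths in groups.items() if len(paths) > 1}

for paths in find_duplicates(Path("~/Downloads").expanduser()).values():
    print(f"{len(paths)} copies: {[p.name for p in paths]}")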

fileorg research

Special commands for research paper management.

# Extract metadata from PDFs
fileorg research extract ~/Papers

# Generate bibliography
fileorg research bibliography ~/Papers --format=bibtex --output=refs.bib

# Find related papers
fileorg research related "attention mechanisms" --in ~/Papers

# Create reading list
fileorg research reading-list ~/Papers --topic "transformers" --order=citations

Options:

| Flag | Description | Default |
| --- | --- | --- |
| --format | Bibliography format: bibtex, apa, mla | bibtex |
| --output | Output file path | stdout |
| --order | Sort order: date, citations, relevance | relevance |
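
Under the hood, much of fileorg research extract can come from pypdf's document-information dictionary, with first-page text as a fallback since PDF metadata is often empty. A sketch (real papers usually need stronger heuristics):

# Hypothetical sketch of PDF metadata extraction with pypdf
from pathlib import Path

from pypdf import PdfReader

def extract_metadata(pdf_path: Path) -> dict:
    reader = PdfReader(pdf_path)
    meta = reader.metadata  # may be None or mostly empty
    lines = (reader.pages[0].extract_text() or "").splitlines()
    fallback_title = lines[0].strip() if lines else "untitled"
    return {
        "title": (meta.title if meta else None) or fallback_title,
        "author": meta.author if meta else None,
        "pages": len(reader.pages),
    }

print(extract_metadata(Path("~/Papers/attention_is_all_you_need.pdf").expanduser()))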

fileorg config

Manage configuration settings.

# Show current config
fileorg config show

# Set LLM model
fileorg config set llm.model "llama3.2:3b"

# Set default strategy
fileorg config set organize.default_strategy "by-topic"

# Reset to defaults
fileorg config reset

fileorg stats

Show statistics about files and organization.

# Show directory statistics
fileorg stats ~/Papers

# Show organization suggestions
fileorg stats ~/Downloads --suggest

# Export statistics
fileorg stats ~/Research --export stats.json

Configuration

Configuration is stored in ~/.config/fileorg/config.toml or ./fileorg.toml in the project directory.

[fileorg]
version = "1.0.0"

[llm]
provider = "docker"           # docker, ollama, openai
model = "llama3.2:3b"
temperature = 0.7
max_tokens = 4096
base_url = "http://localhost:11434"

[llm.docker]
runtime = "nvidia"            # nvidia, cpu
memory_limit = "8g"

[agents]
verbose = false
max_iterations = 10

[agents.scanner]
role = "File Scanner"
goal = "Discover and catalog all files with metadata"

[agents.classifier]
role = "Content Classifier"
goal = "Categorize files by content and context"

[agents.organizer]
role = "File Organizer"
goal = "Create optimal folder structure and organize files"

[agents.deduplicator]
role = "Duplicate Detector"
goal = "Find and handle duplicate files efficiently"

[organize]
default_strategy = "smart"
create_backups = true
backup_dir = "./.fileorg_backup"

[organize.naming]
sanitize = true
max_length = 255
replace_spaces = "_"

[research]
extract_metadata = true
auto_rename = true
naming_pattern = "{year}_{author}_{title}"
generate_bibliography = true

[deduplication]
default_method = "hash"
similarity_threshold = 0.95
auto_delete = false
keep_strategy = "newest"

[pdf]
extract_text = true
extract_metadata = true
ocr_enabled = false           # Enable OCR for scanned PDFs

[observability]
enabled = true
provider = "langfuse"         # langfuse, langsmith, console
trace_agents = true
log_tokens = true
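
config.py can merge these two files with the standard-library tomllib, letting the project-level file override the user-level one. A minimal sketch, assuming Python 3.11+ and a shallow per-section merge:

# Hypothetical sketch of src/fileorg/config.py
import tomllib
from pathlib import Path

USER_CONFIG = Path.home() / ".config" / "fileorg" / "config.toml"
PROJECT_CONFIG = Path("fileorg.toml")

def load_config() -> dict:
    config: dict = {}
    for path in (USER_CONFIG, PROJECT_CONFIG):  # later files win
        if path.is_file():
            with path.open("rb") as f:
                for section, values in tomllib.load(f).items():
                    config.setdefault(section, {}).update(values)
    return config

cfg = load_config()
model = cfg.get("llm", {}).get("model", "llama3.2:3b")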

Docker Stack

docker-compose.yml

version: "3.9"

services:
  # Local LLM served by Ollama (a stand-in for Docker Model Runner)
  llm:
    image: ollama/ollama:latest
    runtime: nvidia
    environment:
      - OLLAMA_HOST=0.0.0.0
    volumes:
      - ollama_data:/root/.ollama
    ports:
      - "11434:11434"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:11434/api/tags"]
      interval: 30s
      timeout: 10s
      retries: 3

  # MCP Server for file operations and PDF tools
  mcp-server:
    build:
      context: ./src/fileorg/mcp
      dockerfile: Dockerfile
    environment:
      - MCP_PORT=3000
    volumes:
      - ./workspace:/workspace
    ports:
      - "3000:3000"
    depends_on:
      - llm

  # Main application (for containerized usage)
  fileorg:
    build:
      context: .
      dockerfile: Dockerfile
    environment:
      - LLM_BASE_URL=http://llm:11434
      - MCP_SERVER_URL=http://mcp-server:3000
    volumes:
      - ./workspace:/workspace
      - ./config:/config:ro
    depends_on:
      llm:
        condition: service_healthy
      mcp-server:
        condition: service_started
    profiles:
      - cli

volumes:
  ollama_data:

Running the Stack

# Start LLM and MCP server
docker compose up -d llm mcp-server

# Pull the model (first time only)
docker compose exec llm ollama pull llama3.2:3b

# Run FileOrganizer commands
docker compose run --rm fileorg scan /workspace/papers
docker compose run --rm fileorg organize /workspace/papers --strategy=by-topic
docker compose run --rm fileorg deduplicate /workspace/downloads

# Or run locally with Docker backend
fileorg scan ~/Papers
fileorg organize ~/Papers --strategy=by-topic
fileorg deduplicate ~/Downloads
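
With the stack running, the client in llm/client.py can reach the Ollama endpoint through LangChain. A sketch assuming the langchain-ollama package, with model and base_url taken from the configuration above:

# Hypothetical sketch of src/fileorg/llm/client.py
from langchain_ollama import ChatOllama

llm = ChatOllama(
    model="llama3.2:3b",
    base_url="http://localhost:11434",  # the llm service from docker-compose.yml
    temperature=0.7,
)

reply = llm.invoke(
    "Suggest a folder name for a paper titled 'Attention Is All You Need'. "
    "Answer with the folder name only."
)
print(reply.content)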

Project Structure

fileorg/
β”œβ”€β”€ pyproject.toml              # pixi/uv project config
β”œβ”€β”€ pixi.lock
β”œβ”€β”€ docker-compose.yml          # Full stack orchestration
β”œβ”€β”€ Dockerfile
β”œβ”€β”€ fileorg.toml                # Default configuration
β”œβ”€β”€ README.md
β”‚
β”œβ”€β”€ src/
β”‚   └── fileorg/
β”‚       β”œβ”€β”€ __init__.py
β”‚       β”œβ”€β”€ __main__.py         # Entry point
β”‚       β”œβ”€β”€ cli.py              # Typer CLI commands
β”‚       β”œβ”€β”€ config.py           # TOML configuration loader
β”‚       β”‚
β”‚       β”œβ”€β”€ scanner/            # File discovery and analysis
β”‚       β”‚   β”œβ”€β”€ __init__.py
β”‚       β”‚   β”œβ”€β”€ discovery.py    # File system traversal
β”‚       β”‚   β”œβ”€β”€ metadata.py     # Metadata extraction
β”‚       β”‚   β”œβ”€β”€ pdf_reader.py   # PDF text/metadata extraction
β”‚       β”‚   └── hashing.py      # File hashing utilities
β”‚       β”‚
β”‚       β”œβ”€β”€ classifier/         # Content classification
β”‚       β”‚   β”œβ”€β”€ __init__.py
β”‚       β”‚   β”œβ”€β”€ embeddings.py   # Generate embeddings
β”‚       β”‚   β”œβ”€β”€ clustering.py   # Topic clustering
β”‚       β”‚   β”œβ”€β”€ categorizer.py  # AI-powered categorization
β”‚       β”‚   └── similarity.py   # Content similarity
β”‚       β”‚
β”‚       β”œβ”€β”€ organizer/          # File organization
β”‚       β”‚   β”œβ”€β”€ __init__.py
β”‚       β”‚   β”œβ”€β”€ strategies.py   # Organization strategies
β”‚       β”‚   β”œβ”€β”€ naming.py       # File naming logic
β”‚       β”‚   β”œβ”€β”€ structure.py    # Directory structure creation
β”‚       β”‚   └── mover.py        # Safe file operations
β”‚       β”‚
β”‚       β”œβ”€β”€ deduplicator/       # Duplicate detection
β”‚       β”‚   β”œβ”€β”€ __init__.py
β”‚       β”‚   β”œβ”€β”€ hash_based.py   # Hash-based detection
β”‚       β”‚   β”œβ”€β”€ content_based.py # Content similarity detection
β”‚       β”‚   └── handler.py      # Duplicate handling
β”‚       β”‚
β”‚       β”œβ”€β”€ research/           # Research paper tools
β”‚       β”‚   β”œβ”€β”€ __init__.py
β”‚       β”‚   β”œβ”€β”€ extractor.py    # PDF metadata extraction
β”‚       β”‚   β”œβ”€β”€ bibliography.py # Bibliography generation
β”‚       β”‚   β”œβ”€β”€ citation.py     # Citation parsing
β”‚       β”‚   └── scholar.py      # Academic search integration
β”‚       β”‚
β”‚       β”œβ”€β”€ agents/             # CrewAI agents
β”‚       β”‚   β”œβ”€β”€ __init__.py
β”‚       β”‚   β”œβ”€β”€ crew.py         # Crew orchestration
β”‚       β”‚   β”œβ”€β”€ scanner.py      # Scanner agent
β”‚       β”‚   β”œβ”€β”€ classifier.py   # Classifier agent
β”‚       β”‚   β”œβ”€β”€ organizer.py    # Organizer agent
β”‚       β”‚   └── deduplicator.py # Deduplicator agent
β”‚       β”‚
β”‚       β”œβ”€β”€ tools/              # Agent tools
β”‚       β”‚   β”œβ”€β”€ __init__.py
β”‚       β”‚   β”œβ”€β”€ file_tools.py   # File operation tools
β”‚       β”‚   β”œβ”€β”€ pdf_tools.py    # PDF processing tools
β”‚       β”‚   β”œβ”€β”€ search_tools.py # Search and query tools
β”‚       β”‚   └── analysis.py     # Content analysis tools
β”‚       β”‚
β”‚       β”œβ”€β”€ mcp/                # MCP server
β”‚       β”‚   β”œβ”€β”€ __init__.py
β”‚       β”‚   β”œβ”€β”€ server.py       # MCP server implementation
β”‚       β”‚   β”œβ”€β”€ tools.py        # MCP tool definitions
β”‚       β”‚   └── Dockerfile      # MCP server container
β”‚       β”‚
β”‚       β”œβ”€β”€ llm/                # LLM integration
β”‚       β”‚   β”œβ”€β”€ __init__.py
β”‚       β”‚   β”œβ”€β”€ client.py       # LLM client (Docker/Ollama/OpenAI)
β”‚       β”‚   └── prompts.py      # Prompt templates
β”‚       β”‚
β”‚       └── observability/      # Logging & tracing
β”‚           β”œβ”€β”€ __init__.py
β”‚           β”œβ”€β”€ tracing.py      # Distributed tracing
β”‚           └── metrics.py      # Token/cost tracking
β”‚
β”œβ”€β”€ tests/
β”‚   β”œβ”€β”€ __init__.py
β”‚   β”œβ”€β”€ conftest.py             # Pytest fixtures
β”‚   β”œβ”€β”€ test_cli.py
β”‚   β”œβ”€β”€ test_scanner.py
β”‚   β”œβ”€β”€ test_classifier.py
β”‚   β”œβ”€β”€ test_organizer.py
β”‚   β”œβ”€β”€ test_deduplicator.py
β”‚   β”œβ”€β”€ test_research.py
β”‚   └── fixtures/
β”‚       β”œβ”€β”€ sample_papers/
β”‚       β”‚   β”œβ”€β”€ paper1.pdf
β”‚       β”‚   β”œβ”€β”€ paper2.pdf
β”‚       β”‚   └── paper3.pdf
β”‚       β”œβ”€β”€ sample_files/
β”‚       └── expected_outputs/
β”‚
β”œβ”€β”€ workspace/                  # Working directory
β”‚   └── .gitkeep
β”‚
└── docs/                       # Documentation (Quarto)
    β”œβ”€β”€ _quarto.yml
    β”œβ”€β”€ index.qmd
    └── chapters/
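
The MCP server in src/fileorg/mcp/server.py exposes file operations as tools the agents can call. With the official MCP Python SDK that could look like the sketch below; the specific tool names and set are assumptions:

# Hypothetical sketch of src/fileorg/mcp/server.py
from pathlib import Path

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("fileorg")

@mcp.tool()
def list_files(directory: str) -> list[str]:
    """Return the names of all regular files directly inside a directory."""
    return [p.name for p in Path(directory).iterdir() if p.is_file()]

@mcp.tool()
def file_size(path: str) -> int:
    """Return a file's size in bytes."""
    return Path(path).stat().st_size

if __name__ == "__main__":
    mcp.run()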

Technology Stack

| Category | Tools |
| --- | --- |
| Package Management | pixi, uv |
| CLI Framework | Typer, Rich |
| Local LLM | Docker Model Runner, Ollama |
| LLM Framework | LangChain |
| Multi-Agent | CrewAI |
| MCP | Docker MCP Toolkit |
| PDF Processing | PyPDF2, pdfplumber, pypdf |
| Embeddings | sentence-transformers |
| File Operations | pathlib, shutil |
| Hashing | hashlib, xxhash |
| Metadata | exifread, mutagen |
| Similarity | scikit-learn, faiss |
| Observability | Langfuse, OpenTelemetry |
| Testing | pytest, DeepEval |
| Containerization | Docker, Docker Compose |

Example Usage

End-to-End Workflow

# 1. Start the Docker stack
docker compose up -d

# 2. Scan your messy Downloads folder
fileorg scan ~/Downloads --analyze-content --export downloads_inventory.json

# 3. Organize files by type and date
fileorg organize ~/Downloads --strategy=smart --dry-run
# Review the plan, then execute
fileorg organize ~/Downloads --strategy=smart

# 4. Organize research papers by topic
fileorg scan ~/Papers --types=pdf --analyze-content
fileorg organize ~/Papers --strategy=by-topic --rename --pattern="{year}_{author}_{title}"

# 5. Find and handle duplicates
fileorg deduplicate ~/Papers --similarity=0.95 --move-to=./duplicates

# 6. Extract metadata and generate bibliography
fileorg research extract ~/Papers
fileorg research bibliography ~/Papers --format=bibtex --output=references.bib

# 7. Create a reading list on a specific topic
fileorg research reading-list ~/Papers --topic "transformers" --order=citations

# 8. View statistics
fileorg stats ~/Papers

Research Paper Organization Example

# Before:
~/Papers/
β”œβ”€β”€ paper_final.pdf
β”œβ”€β”€ attention_is_all_you_need.pdf
β”œβ”€β”€ bert_paper.pdf
β”œβ”€β”€ gpt3.pdf
β”œβ”€β”€ vision_transformer.pdf
β”œβ”€β”€ download (1).pdf
β”œβ”€β”€ download (2).pdf
└── thesis_draft_v5.pdf

# Run organization
fileorg organize ~/Papers --strategy=by-topic --rename

# After:
~/Papers/
β”œβ”€β”€ Natural_Language_Processing/
β”‚   β”œβ”€β”€ Transformers/
β”‚   β”‚   β”œβ”€β”€ 2017_Vaswani_Attention_Is_All_You_Need.pdf
β”‚   β”‚   β”œβ”€β”€ 2018_Devlin_BERT_Pretraining.pdf
β”‚   β”‚   └── 2020_Brown_GPT3_Language_Models.pdf
β”‚   └── Other/
β”‚       └── 2023_Smith_Thesis_Draft.pdf
β”œβ”€β”€ Computer_Vision/
β”‚   └── Transformers/
β”‚       └── 2020_Dosovitskiy_Vision_Transformer.pdf
└── Uncategorized/
    └── 2024_Unknown_Document.pdf

Duplicate Detection Example

# Find exact duplicates
fileorg deduplicate ~/Downloads
# Found 15 duplicate files (45 MB)
# β€’ download.pdf (3 copies)
# β€’ image.jpg (2 copies)
# β€’ report.docx (2 copies)

# Find similar papers (different versions)
fileorg deduplicate ~/Papers --similarity=0.9 --method=content
# Found 3 similar file groups:
# β€’ attention_paper.pdf, attention_is_all_you_need.pdf (95% similar)
# β€’ bert_preprint.pdf, bert_final.pdf (98% similar)

# Auto-cleanup (keep newest)
fileorg deduplicate ~/Downloads --auto-delete --keep=newest
# βœ“ Deleted 15 duplicate files, freed 45 MB

Learning Outcomes

By building FileOrganizer, learners will be able to:

  1. βœ… Set up modern Python projects with pixi and reproducible environments
  2. βœ… Build professional CLI tools with Typer and Rich
  3. βœ… Run local LLMs using Docker Model Runner
  4. βœ… Process and extract content from PDF files
  5. βœ… Build MCP servers to connect AI agents to file systems
  6. βœ… Design multi-agent systems with CrewAI
  7. βœ… Implement content-based similarity and clustering
  8. βœ… Generate embeddings for semantic search
  9. βœ… Handle file operations safely with backups and dry-run modes
  10. βœ… Implement observability for AI applications
  11. βœ… Test non-deterministic systems effectively
  12. βœ… Deploy self-hosted AI applications with Docker Compose

Advanced Features

Smart Organization Strategy

The smart strategy uses AI to analyze file content and context to determine the best organization approach:

# Pseudocode for the smart strategy
def smart_organize(files):
    # 1. Analyze file types and content
    analysis = scanner_agent.analyze(files)

    # 2. Pick a strategy based on what the analysis found
    if analysis.mostly_pdfs_with_academic_content():
        strategy = "by-topic-hierarchical"
    elif analysis.mostly_media_files():
        strategy = "by-date-and-type"
    elif analysis.mixed_work_documents():
        strategy = "by-project-and-date"
    else:
        strategy = "by-type"  # fallback when no clear pattern emerges

    # 3. Execute with AI-powered categorization
    categories = classifier_agent.categorize(files, strategy)
    return organizer_agent.execute(categories, strategy)

Research Paper Features

Special handling for academic PDFs:

  • Metadata Extraction: Title, authors, year, abstract, keywords
  • Citation Parsing: Extract and parse references
  • Smart Naming: {year}_{first_author}_{short_title}.pdf
  • Topic Clustering: Group papers by research area
  • Citation Network: Identify related papers
  • Bibliography Generation: BibTeX, APA, MLA formats

Deduplication Strategies

Multiple methods for finding duplicates:

  1. Hash-based: Exact file matches (fastest)
  2. Content-based: Similar content using embeddings (sketched below)
  3. Metadata-based: Same title/author but different files
  4. Fuzzy matching: Handle renamed or modified files
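
Content-based detection (method 2) can be built on sentence-transformers embeddings plus cosine similarity. A sketch where the model choice, the toy texts, and the 0.9 threshold are illustrative:

# Hypothetical sketch of content-based duplicate detection
from itertools import combinations

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

docs = {
    "attention_paper.pdf": "We propose the Transformer, based solely on attention...",
    "attention_is_all_you_need.pdf": "The Transformer relies entirely on attention mechanisms...",
    "bert_final.pdf": "BERT pre-trains deep bidirectional representations from text...",
}

names = list(docs)
embeddings = model.encode([docs[n] for n in names], normalize_embeddings=True)
scores = util.cos_sim(embeddings, embeddings)  # pairwise cosine-similarity matrix

for i, j in combinations(range(len(names)), 2):
    similarity = float(scores[i][j])
    if similarity >= 0.9:
        print(f"{names[i]} ~ {names[j]} ({similarity:.0%} similar)")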

This project serves as the main example in the Learning Path for building AI-powered CLI tools.