
Course Project: FileOrganizer

A CLI tool that uses local LLMs and AI agents to intelligently organize files, with special focus on research paper management.


Overview

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                      FileOrganizer CLI                       β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  $ fileorg scan ~/Downloads                                  β”‚
β”‚  $ fileorg organize ~/Papers --strategy=by-topic             β”‚
β”‚  $ fileorg deduplicate ~/Research --similarity=0.9           β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Architecture

Files ──► Content Analysis ──► AI Classification ──► Organized Structure
              β”‚                        β”‚
              β–Ό                        β–Ό
        PDF Extraction          Docker Model Runner
        Metadata Tools            (Local LLM)
              β”‚                        β”‚
              └────────►MCPβ—„β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Data Flow

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Files/PDFs  │────►│   Content    │────►│  MCP Server  β”‚
β”‚   (Input)    β”‚     β”‚  Extraction  β”‚     β”‚   (Tools)    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜
                                                  β”‚
                                                  β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Organized   │◄────│  Agent Crew  │◄────│  Local LLM   β”‚
β”‚  Structure   β”‚     β”‚  (CrewAI)    β”‚     β”‚   (Docker)   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Agent System

| Agent | Role | Tools | Output |
| --- | --- | --- | --- |
| Scanner Agent | Discovers files, extracts metadata | File I/O, PDF extraction, hash generation | File inventory, metadata catalog |
| Classifier Agent | Categorizes files by content and context | LLM analysis, embeddings, similarity | Category assignments, topic tags |
| Organizer Agent | Creates folder structure and moves files | File operations, naming strategies | Organized directory tree |
| Deduplicator Agent | Finds and handles duplicate files | Hash comparison, content similarity | Duplicate reports, cleanup actions |

Agent Workflow

User Request: "Organize research papers by topic"
                    β”‚
                    β–Ό
         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
         β”‚   Scanner Agent     β”‚
         β”‚  "What files do we  β”‚
         β”‚   have and what     β”‚
         β”‚   are they about?"  β”‚
         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                    β”‚ File Inventory
                    β–Ό
         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
         β”‚  Classifier Agent   β”‚
         β”‚  "What topics and   β”‚
         β”‚   categories emerge β”‚
         β”‚   from the content?"β”‚
         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                    β”‚ Categories
                    β–Ό
         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
         β”‚  Organizer Agent    β”‚
         β”‚  "Create folder     β”‚
         β”‚   structure and     β”‚
         β”‚   move files"       β”‚
         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                    β”‚ Organization Plan
                    β–Ό
         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
         β”‚ Deduplicator Agent  β”‚
         β”‚  "Find and handle   β”‚
         β”‚   duplicate files"  β”‚
         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                    β”‚
                    β–Ό
          Organized Directory
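
A minimal sketch of how this pipeline could be wired up with CrewAI. The roles and goals mirror the agent configuration shown later; the backstories, task descriptions, and the choice of a sequential process are illustrative assumptions, not the project's actual code:

# Hypothetical sketch of src/fileorg/agents/crew.py
from crewai import Agent, Task, Crew, Process

scanner = Agent(
    role="File Scanner",
    goal="Discover and catalog all files with metadata",
    backstory="A meticulous archivist who never misses a file.",
)
classifier = Agent(
    role="Content Classifier",
    goal="Categorize files by content and context",
    backstory="A librarian with a knack for spotting topics.",
)
organizer = Agent(
    role="File Organizer",
    goal="Create optimal folder structure and organize files",
    backstory="An engineer who loves tidy directory trees.",
)

scan_task = Task(
    description="Scan {path} and produce a file inventory with metadata.",
    expected_output="A JSON inventory of files",
    agent=scanner,
)
classify_task = Task(
    description="Assign a topic category to every file in the inventory.",
    expected_output="Category assignments and topic tags",
    agent=classifier,
)
organize_task = Task(
    description="Propose a folder structure and a move plan per category.",
    expected_output="An organization plan",
    agent=organizer,
)

crew = Crew(
    agents=[scanner, classifier, organizer],
    tasks=[scan_task, classify_task, organize_task],
    process=Process.sequential,  # Scanner -> Classifier -> Organizer
)
result = crew.kickoff(inputs={"path": "~/Papers"})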

CLI Commands

fileorg scan

Scan a directory and analyze its contents.

# Scan a directory
fileorg scan ~/Downloads

# Scan with detailed analysis
fileorg scan ~/Papers --analyze-content

# Scan and export inventory
fileorg scan ~/Research --export inventory.json

# Scan specific file types
fileorg scan ~/Documents --types pdf,docx,txt

Options:

| Flag | Description | Default |
| --- | --- | --- |
| --analyze-content | Extract and analyze file contents | false |
| --export | Export inventory to JSON/CSV | None |
| --types | Comma-separated file extensions to scan | All |
| --recursive | Scan subdirectories | true |
| --max-depth | Maximum directory depth | 10 |
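
Since the CLI is built with Typer, the scan command might look roughly like the sketch below. The option names follow the table above; the function body is an assumption (depth limiting and content analysis are omitted for brevity):

# Hypothetical sketch of the scan command in src/fileorg/cli.py
from pathlib import Path
from typing import Optional

import typer

app = typer.Typer()

@app.command()
def scan(
    path: Path = typer.Argument(..., help="Directory to scan"),
    analyze_content: bool = typer.Option(False, help="Extract and analyze file contents"),
    export: Optional[Path] = typer.Option(None, help="Export inventory to JSON/CSV"),
    types: Optional[str] = typer.Option(None, help="Comma-separated file extensions to scan"),
    recursive: bool = typer.Option(True, help="Scan subdirectories"),
    max_depth: int = typer.Option(10, help="Maximum directory depth"),
) -> None:
    """Scan a directory and analyze its contents."""
    # Typer maps analyze_content to --analyze-content automatically
    wanted = {f".{t.strip().lower()}" for t in types.split(",")} if types else None
    pattern = "**/*" if recursive else "*"
    files = [
        p for p in path.expanduser().glob(pattern)
        if p.is_file() and (wanted is None or p.suffix.lower() in wanted)
    ]
    typer.echo(f"Found {len(files)} files under {path}")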

fileorg organize

Organize files using AI-powered strategies.

# Organize by topic (AI-powered)
fileorg organize ~/Papers --strategy=by-topic

# Organize by date
fileorg organize ~/Photos --strategy=by-date --format="%Y/%m"

# Organize with custom naming
fileorg organize ~/Papers --rename --pattern="{year}_{author}_{title}"

# Dry run to preview changes
fileorg organize ~/Downloads --dry-run

# Interactive mode
fileorg organize ~/Research --interactive

Options:

| Flag | Description | Default |
| --- | --- | --- |
| --strategy | Organization strategy: by-topic, by-date, by-type, by-author, smart | smart |
| --rename | Rename files intelligently | false |
| --pattern | Naming pattern for renamed files | {original} |
| --dry-run | Preview changes without executing | false |
| --interactive | Confirm each action | false |
| --output | Output directory | Same as input |
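
As a concrete illustration, the by-date strategy with --format="%Y/%m" reduces to bucketing files by modification time. A minimal sketch (the real strategies.py presumably adds backups, collision handling, and the other strategies):

# Hypothetical sketch of a by-date organization strategy
import shutil
from datetime import datetime
from pathlib import Path

def organize_by_date(root: Path, fmt: str = "%Y/%m", dry_run: bool = True) -> None:
    for f in sorted(root.iterdir()):
        if not f.is_file():
            continue
        # Bucket by modification time, e.g. 2024/03/report.pdf
        bucket = root / datetime.fromtimestamp(f.stat().st_mtime).strftime(fmt)
        print(f"{f.name} -> {bucket.relative_to(root)}/")
        if not dry_run:
            bucket.mkdir(parents=True, exist_ok=True)
            shutil.move(str(f), str(bucket / f.name))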

fileorg deduplicate

Find and handle duplicate files.

# Find duplicates by hash
fileorg deduplicate ~/Downloads

# Find similar files (content-based)
fileorg deduplicate ~/Papers --similarity=0.9

# Auto-delete duplicates (keep newest)
fileorg deduplicate ~/Photos --auto-delete --keep=newest

# Move duplicates to folder
fileorg deduplicate ~/Documents --move-to=./duplicates

Options:

| Flag | Description | Default |
| --- | --- | --- |
| --similarity | Similarity threshold (0.0-1.0) for content matching | 1.0 (exact) |
| --method | Detection method: hash, content, metadata | hash |
| --auto-delete | Automatically delete duplicates | false |
| --keep | Which copy to keep: newest, oldest, largest, smallest | newest |
| --move-to | Move duplicates to a directory instead of deleting | None |
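
The default hash method boils down to grouping files by a content digest and reporting any digest shared by more than one file. A sketch of the idea behind hash_based.py (SHA-256 here; the stack also lists xxhash as a faster non-cryptographic option):

# Hypothetical sketch of hash-based duplicate detection
import hashlib
from collections import defaultdict
from pathlib import Path

def find_duplicates(root: Path) -> dict[str, list[Path]]:
    groups: dict[str, list[Path]] = defaultdict(list)
    for f in root.rglob("*"):
        if f.is_file():
            # Reads whole files; chunked hashing is kinder to large files
            digest = hashlib.sha256(f.read_bytes()).hexdigest()
            groups[digest].append(f)
    return {h: paths for h, paths in groups.items() if len(paths) > 1}

for paths in find_duplicates(Path("~/Downloads").expanduser()).values():
    print(f"{len(paths)} copies: {[p.name for p in paths]}")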

fileorg research

Special commands for research paper management.

# Extract metadata from PDFs
fileorg research extract ~/Papers

# Generate bibliography
fileorg research bibliography ~/Papers --format=bibtex --output=refs.bib

# Find related papers
fileorg research related "attention mechanisms" --in ~/Papers

# Create reading list
fileorg research reading-list ~/Papers --topic "transformers" --order=citations

Options:

| Flag | Description | Default |
| --- | --- | --- |
| --format | Bibliography format: bibtex, apa, mla | bibtex |
| --output | Output file path | stdout |
| --order | Sort order: date, citations, relevance | relevance |
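
Under the hood, much of fileorg research extract can come from pypdf's document-information dictionary, with first-page text as a fallback since PDF metadata is often empty. A sketch (real papers usually need stronger heuristics):

# Hypothetical sketch of PDF metadata extraction with pypdf
from pathlib import Path

from pypdf import PdfReader

def extract_metadata(pdf_path: Path) -> dict:
    reader = PdfReader(pdf_path)
    meta = reader.metadata  # may be None or mostly empty
    lines = (reader.pages[0].extract_text() or "").splitlines()
    fallback_title = lines[0].strip() if lines else "untitled"
    return {
        "title": (meta.title if meta else None) or fallback_title,
        "author": meta.author if meta else None,
        "pages": len(reader.pages),
    }

print(extract_metadata(Path("~/Papers/attention_is_all_you_need.pdf").expanduser()))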

fileorg config

Manage configuration settings.

# Show current config
fileorg config show

# Set LLM model
fileorg config set llm.model "llama3.2:3b"

# Set default strategy
fileorg config set organize.default_strategy "by-topic"

# Reset to defaults
fileorg config reset

fileorg stats

Show statistics about files and organization.

# Show directory statistics
fileorg stats ~/Papers

# Show organization suggestions
fileorg stats ~/Downloads --suggest

# Export statistics
fileorg stats ~/Research --export stats.json

Configuration

Configuration is stored in ~/.config/fileorg/config.toml or ./fileorg.toml in the project directory.

[fileorg]
version = "1.0.0"

[llm]
provider = "docker"           # docker, ollama, openai
model = "llama3.2:3b"
temperature = 0.7
max_tokens = 4096
base_url = "http://localhost:11434"

[llm.docker]
runtime = "nvidia"            # nvidia, cpu
memory_limit = "8g"

[agents]
verbose = false
max_iterations = 10

[agents.scanner]
role = "File Scanner"
goal = "Discover and catalog all files with metadata"

[agents.classifier]
role = "Content Classifier"
goal = "Categorize files by content and context"

[agents.organizer]
role = "File Organizer"
goal = "Create optimal folder structure and organize files"

[agents.deduplicator]
role = "Duplicate Detector"
goal = "Find and handle duplicate files efficiently"

[organize]
default_strategy = "smart"
create_backups = true
backup_dir = "./.fileorg_backup"

[organize.naming]
sanitize = true
max_length = 255
replace_spaces = "_"

[research]
extract_metadata = true
auto_rename = true
naming_pattern = "{year}_{author}_{title}"
generate_bibliography = true

[deduplication]
default_method = "hash"
similarity_threshold = 0.95
auto_delete = false
keep_strategy = "newest"

[pdf]
extract_text = true
extract_metadata = true
ocr_enabled = false           # Enable OCR for scanned PDFs

[observability]
enabled = true
provider = "langfuse"         # langfuse, langsmith, console
trace_agents = true
log_tokens = true
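
config.py can merge these two files with the standard-library tomllib, letting the project-level file override the user-level one. A minimal sketch, assuming Python 3.11+ and a shallow per-section merge:

# Hypothetical sketch of src/fileorg/config.py
import tomllib
from pathlib import Path

USER_CONFIG = Path.home() / ".config" / "fileorg" / "config.toml"
PROJECT_CONFIG = Path("fileorg.toml")

def load_config() -> dict:
    config: dict = {}
    for path in (USER_CONFIG, PROJECT_CONFIG):  # later files win
        if path.is_file():
            with path.open("rb") as f:
                for section, values in tomllib.load(f).items():
                    config.setdefault(section, {}).update(values)
    return config

cfg = load_config()
model = cfg.get("llm", {}).get("model", "llama3.2:3b")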

Docker Stack

docker-compose.yml

version: "3.9"

services:
  # Local LLM served by Ollama (a stand-in for Docker Model Runner)
  llm:
    image: ollama/ollama:latest
    runtime: nvidia
    environment:
      - OLLAMA_HOST=0.0.0.0
    volumes:
      - ollama_data:/root/.ollama
    ports:
      - "11434:11434"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:11434/api/tags"]
      interval: 30s
      timeout: 10s
      retries: 3

  # MCP Server for file operations and PDF tools
  mcp-server:
    build:
      context: ./src/fileorg/mcp
      dockerfile: Dockerfile
    environment:
      - MCP_PORT=3000
    volumes:
      - ./workspace:/workspace
    ports:
      - "3000:3000"
    depends_on:
      - llm

  # Main application (for containerized usage)
  fileorg:
    build:
      context: .
      dockerfile: Dockerfile
    environment:
      - LLM_BASE_URL=http://llm:11434
      - MCP_SERVER_URL=http://mcp-server:3000
    volumes:
      - ./workspace:/workspace
      - ./config:/config:ro
    depends_on:
      llm:
        condition: service_healthy
      mcp-server:
        condition: service_started
    profiles:
      - cli

volumes:
  ollama_data:

Running the Stack

# Start LLM and MCP server
docker compose up -d llm mcp-server

# Pull the model (first time only)
docker compose exec llm ollama pull llama3.2:3b

# Run FileOrganizer commands
docker compose run --rm fileorg scan /workspace/papers
docker compose run --rm fileorg organize /workspace/papers --strategy=by-topic
docker compose run --rm fileorg deduplicate /workspace/downloads

# Or run locally with Docker backend
fileorg scan ~/Papers
fileorg organize ~/Papers --strategy=by-topic
fileorg deduplicate ~/Downloads
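
With the stack running, the client in llm/client.py can reach the Ollama endpoint through LangChain. A sketch assuming the langchain-ollama package, with model and base_url taken from the configuration above:

# Hypothetical sketch of src/fileorg/llm/client.py
from langchain_ollama import ChatOllama

llm = ChatOllama(
    model="llama3.2:3b",
    base_url="http://localhost:11434",  # the llm service from docker-compose.yml
    temperature=0.7,
)

reply = llm.invoke(
    "Suggest a folder name for a paper titled 'Attention Is All You Need'. "
    "Answer with the folder name only."
)
print(reply.content)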

Project Structure

fileorg/
β”œβ”€β”€ pyproject.toml              # pixi/uv project config
β”œβ”€β”€ pixi.lock
β”œβ”€β”€ docker-compose.yml          # Full stack orchestration
β”œβ”€β”€ Dockerfile
β”œβ”€β”€ fileorg.toml                # Default configuration
β”œβ”€β”€ README.md
β”‚
β”œβ”€β”€ src/
β”‚   └── fileorg/
β”‚       β”œβ”€β”€ __init__.py
β”‚       β”œβ”€β”€ __main__.py         # Entry point
β”‚       β”œβ”€β”€ cli.py              # Typer CLI commands
β”‚       β”œβ”€β”€ config.py           # TOML configuration loader
β”‚       β”‚
β”‚       β”œβ”€β”€ scanner/            # File discovery and analysis
β”‚       β”‚   β”œβ”€β”€ __init__.py
β”‚       β”‚   β”œβ”€β”€ discovery.py    # File system traversal
β”‚       β”‚   β”œβ”€β”€ metadata.py     # Metadata extraction
β”‚       β”‚   β”œβ”€β”€ pdf_reader.py   # PDF text/metadata extraction
β”‚       β”‚   └── hashing.py      # File hashing utilities
β”‚       β”‚
β”‚       β”œβ”€β”€ classifier/         # Content classification
β”‚       β”‚   β”œβ”€β”€ __init__.py
β”‚       β”‚   β”œβ”€β”€ embeddings.py   # Generate embeddings
β”‚       β”‚   β”œβ”€β”€ clustering.py   # Topic clustering
β”‚       β”‚   β”œβ”€β”€ categorizer.py  # AI-powered categorization
β”‚       β”‚   └── similarity.py   # Content similarity
β”‚       β”‚
β”‚       β”œβ”€β”€ organizer/          # File organization
β”‚       β”‚   β”œβ”€β”€ __init__.py
β”‚       β”‚   β”œβ”€β”€ strategies.py   # Organization strategies
β”‚       β”‚   β”œβ”€β”€ naming.py       # File naming logic
β”‚       β”‚   β”œβ”€β”€ structure.py    # Directory structure creation
β”‚       β”‚   └── mover.py        # Safe file operations
β”‚       β”‚
β”‚       β”œβ”€β”€ deduplicator/       # Duplicate detection
β”‚       β”‚   β”œβ”€β”€ __init__.py
β”‚       β”‚   β”œβ”€β”€ hash_based.py   # Hash-based detection
β”‚       β”‚   β”œβ”€β”€ content_based.py # Content similarity detection
β”‚       β”‚   └── handler.py      # Duplicate handling
β”‚       β”‚
β”‚       β”œβ”€β”€ research/           # Research paper tools
β”‚       β”‚   β”œβ”€β”€ __init__.py
β”‚       β”‚   β”œβ”€β”€ extractor.py    # PDF metadata extraction
β”‚       β”‚   β”œβ”€β”€ bibliography.py # Bibliography generation
β”‚       β”‚   β”œβ”€β”€ citation.py     # Citation parsing
β”‚       β”‚   └── scholar.py      # Academic search integration
β”‚       β”‚
β”‚       β”œβ”€β”€ agents/             # CrewAI agents
β”‚       β”‚   β”œβ”€β”€ __init__.py
β”‚       β”‚   β”œβ”€β”€ crew.py         # Crew orchestration
β”‚       β”‚   β”œβ”€β”€ scanner.py      # Scanner agent
β”‚       β”‚   β”œβ”€β”€ classifier.py   # Classifier agent
β”‚       β”‚   β”œβ”€β”€ organizer.py    # Organizer agent
β”‚       β”‚   └── deduplicator.py # Deduplicator agent
β”‚       β”‚
β”‚       β”œβ”€β”€ tools/              # Agent tools
β”‚       β”‚   β”œβ”€β”€ __init__.py
β”‚       β”‚   β”œβ”€β”€ file_tools.py   # File operation tools
β”‚       β”‚   β”œβ”€β”€ pdf_tools.py    # PDF processing tools
β”‚       β”‚   β”œβ”€β”€ search_tools.py # Search and query tools
β”‚       β”‚   └── analysis.py     # Content analysis tools
β”‚       β”‚
β”‚       β”œβ”€β”€ mcp/                # MCP server
β”‚       β”‚   β”œβ”€β”€ __init__.py
β”‚       β”‚   β”œβ”€β”€ server.py       # MCP server implementation
β”‚       β”‚   β”œβ”€β”€ tools.py        # MCP tool definitions
β”‚       β”‚   └── Dockerfile      # MCP server container
β”‚       β”‚
β”‚       β”œβ”€β”€ llm/                # LLM integration
β”‚       β”‚   β”œβ”€β”€ __init__.py
β”‚       β”‚   β”œβ”€β”€ client.py       # LLM client (Docker/Ollama/OpenAI)
β”‚       β”‚   └── prompts.py      # Prompt templates
β”‚       β”‚
β”‚       └── observability/      # Logging & tracing
β”‚           β”œβ”€β”€ __init__.py
β”‚           β”œβ”€β”€ tracing.py      # Distributed tracing
β”‚           └── metrics.py      # Token/cost tracking
β”‚
β”œβ”€β”€ tests/
β”‚   β”œβ”€β”€ __init__.py
β”‚   β”œβ”€β”€ conftest.py             # Pytest fixtures
β”‚   β”œβ”€β”€ test_cli.py
β”‚   β”œβ”€β”€ test_scanner.py
β”‚   β”œβ”€β”€ test_classifier.py
β”‚   β”œβ”€β”€ test_organizer.py
β”‚   β”œβ”€β”€ test_deduplicator.py
β”‚   β”œβ”€β”€ test_research.py
β”‚   └── fixtures/
β”‚       β”œβ”€β”€ sample_papers/
β”‚       β”‚   β”œβ”€β”€ paper1.pdf
β”‚       β”‚   β”œβ”€β”€ paper2.pdf
β”‚       β”‚   └── paper3.pdf
β”‚       β”œβ”€β”€ sample_files/
β”‚       └── expected_outputs/
β”‚
β”œβ”€β”€ workspace/                  # Working directory
β”‚   └── .gitkeep
β”‚
└── docs/                       # Documentation (Quarto)
    β”œβ”€β”€ _quarto.yml
    β”œβ”€β”€ index.qmd
    └── chapters/
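
The MCP server in src/fileorg/mcp/server.py exposes file operations as tools the agents can call. With the official MCP Python SDK that could look like the sketch below; the specific tool names and set are assumptions:

# Hypothetical sketch of src/fileorg/mcp/server.py
from pathlib import Path

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("fileorg")

@mcp.tool()
def list_files(directory: str) -> list[str]:
    """Return the names of all regular files directly inside a directory."""
    return [p.name for p in Path(directory).iterdir() if p.is_file()]

@mcp.tool()
def file_size(path: str) -> int:
    """Return a file's size in bytes."""
    return Path(path).stat().st_size

if __name__ == "__main__":
    mcp.run()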

Technology Stack

| Category | Tools |
| --- | --- |
| Package Management | pixi, uv |
| CLI Framework | Typer, Rich |
| Local LLM | Docker Model Runner, Ollama |
| LLM Framework | LangChain |
| Multi-Agent | CrewAI |
| MCP | Docker MCP Toolkit |
| PDF Processing | PyPDF2, pdfplumber, pypdf |
| Embeddings | sentence-transformers |
| File Operations | pathlib, shutil |
| Hashing | hashlib, xxhash |
| Metadata | exifread, mutagen |
| Similarity | scikit-learn, faiss |
| Observability | Langfuse, OpenTelemetry |
| Testing | pytest, DeepEval |
| Containerization | Docker, Docker Compose |

Example Usage

End-to-End Workflow

# 1. Start the Docker stack
docker compose up -d

# 2. Scan your messy Downloads folder
fileorg scan ~/Downloads --analyze-content --export downloads_inventory.json

# 3. Organize files by type and date
fileorg organize ~/Downloads --strategy=smart --dry-run
# Review the plan, then execute
fileorg organize ~/Downloads --strategy=smart

# 4. Organize research papers by topic
fileorg scan ~/Papers --types=pdf --analyze-content
fileorg organize ~/Papers --strategy=by-topic --rename --pattern="{year}_{author}_{title}"

# 5. Find and handle duplicates
fileorg deduplicate ~/Papers --similarity=0.95 --move-to=./duplicates

# 6. Extract metadata and generate bibliography
fileorg research extract ~/Papers
fileorg research bibliography ~/Papers --format=bibtex --output=references.bib

# 7. Create a reading list on a specific topic
fileorg research reading-list ~/Papers --topic "transformers" --order=citations

# 8. View statistics
fileorg stats ~/Papers

Research Paper Organization Example

# Before:
~/Papers/
β”œβ”€β”€ paper_final.pdf
β”œβ”€β”€ attention_is_all_you_need.pdf
β”œβ”€β”€ bert_paper.pdf
β”œβ”€β”€ gpt3.pdf
β”œβ”€β”€ vision_transformer.pdf
β”œβ”€β”€ download (1).pdf
β”œβ”€β”€ download (2).pdf
└── thesis_draft_v5.pdf

# Run organization
fileorg organize ~/Papers --strategy=by-topic --rename

# After:
~/Papers/
β”œβ”€β”€ Natural_Language_Processing/
β”‚   β”œβ”€β”€ Transformers/
β”‚   β”‚   β”œβ”€β”€ 2017_Vaswani_Attention_Is_All_You_Need.pdf
β”‚   β”‚   β”œβ”€β”€ 2018_Devlin_BERT_Pretraining.pdf
β”‚   β”‚   └── 2020_Brown_GPT3_Language_Models.pdf
β”‚   └── Other/
β”‚       └── 2023_Smith_Thesis_Draft.pdf
β”œβ”€β”€ Computer_Vision/
β”‚   └── Transformers/
β”‚       └── 2020_Dosovitskiy_Vision_Transformer.pdf
└── Uncategorized/
    └── 2024_Unknown_Document.pdf

Duplicate Detection Example

# Find exact duplicates
fileorg deduplicate ~/Downloads
# Found 15 duplicate files (45 MB)
# β€’ download.pdf (3 copies)
# β€’ image.jpg (2 copies)
# β€’ report.docx (2 copies)

# Find similar papers (different versions)
fileorg deduplicate ~/Papers --similarity=0.9 --method=content
# Found 3 similar file groups:
# β€’ attention_paper.pdf, attention_is_all_you_need.pdf (95% similar)
# β€’ bert_preprint.pdf, bert_final.pdf (98% similar)

# Auto-cleanup (keep newest)
fileorg deduplicate ~/Downloads --auto-delete --keep=newest
# βœ“ Deleted 15 duplicate files, freed 45 MB

Learning Outcomes

By building FileOrganizer, learners will be able to:

  1. βœ… Set up modern Python projects with pixi and reproducible environments
  2. βœ… Build professional CLI tools with Typer and Rich
  3. βœ… Run local LLMs using Docker Model Runner
  4. βœ… Process and extract content from PDF files
  5. βœ… Build MCP servers to connect AI agents to file systems
  6. βœ… Design multi-agent systems with CrewAI
  7. βœ… Implement content-based similarity and clustering
  8. βœ… Generate embeddings for semantic search
  9. βœ… Handle file operations safely with backups and dry-run modes
  10. βœ… Implement observability for AI applications
  11. βœ… Test non-deterministic systems effectively
  12. βœ… Deploy self-hosted AI applications with Docker Compose

Advanced Features

Smart Organization Strategy

The smart strategy uses AI to analyze file content and context to determine the best organization approach:

# Pseudocode for the smart strategy
def smart_organize(files):
    # 1. Analyze file types and content
    analysis = scanner_agent.analyze(files)

    # 2. Pick a strategy based on what the analysis found
    if analysis.mostly_pdfs_with_academic_content():
        strategy = "by-topic-hierarchical"
    elif analysis.mostly_media_files():
        strategy = "by-date-and-type"
    elif analysis.mixed_work_documents():
        strategy = "by-project-and-date"
    else:
        strategy = "by-type"  # fallback when no clear pattern emerges

    # 3. Execute with AI-powered categorization
    categories = classifier_agent.categorize(files, strategy)
    return organizer_agent.execute(categories, strategy)

Research Paper Features

Special handling for academic PDFs:

  • Metadata Extraction: Title, authors, year, abstract, keywords
  • Citation Parsing: Extract and parse references
  • Smart Naming: {year}_{first_author}_{short_title}.pdf
  • Topic Clustering: Group papers by research area
  • Citation Network: Identify related papers
  • Bibliography Generation: BibTeX, APA, MLA formats

Deduplication Strategies

Multiple methods for finding duplicates:

  1. Hash-based: Exact file matches (fastest)
  2. Content-based: Similar content using embeddings (sketched below)
  3. Metadata-based: Same title/author but different files
  4. Fuzzy matching: Handle renamed or modified files
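
Content-based detection (method 2) can be built on sentence-transformers embeddings plus cosine similarity. A sketch where the model choice, the toy texts, and the 0.9 threshold are illustrative:

# Hypothetical sketch of content-based duplicate detection
from itertools import combinations

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

docs = {
    "attention_paper.pdf": "We propose the Transformer, based solely on attention...",
    "attention_is_all_you_need.pdf": "The Transformer relies entirely on attention mechanisms...",
    "bert_final.pdf": "BERT pre-trains deep bidirectional representations from text...",
}

names = list(docs)
embeddings = model.encode([docs[n] for n in names], normalize_embeddings=True)
scores = util.cos_sim(embeddings, embeddings)  # pairwise cosine-similarity matrix

for i, j in combinations(range(len(names)), 2):
    similarity = float(scores[i][j])
    if similarity >= 0.9:
        print(f"{names[i]} ~ {names[j]} ({similarity:.0%} similar)")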

This project serves as the main example in the Learning Path for building AI-powered CLI tools.