
# data/github/

The curated knowledge base for every GitHub project. This is the highest-signal section of the ArunCore dataset — structured, reviewed documentation for each project that demonstrates real engineering capability.


## How It Was Built

1. All repos were cloned into `temp/repos/` locally
2. Source code, READMEs, and requirements files were analysed file by file
3. Five dataset files were generated per Tier 1 project (written from code analysis, not just READMEs)
4. Arun filled in `decisions.md` with the why behind each architectural choice
5. Final files were reviewed and formatted before being committed here

## Folder Structure

One subfolder per project:

```
github/
├── legal_RAG_system/           ← Tier 1: full 5-file treatment
├── real_state_listing_scraper/ ← Tier 1: full 5-file treatment
├── personal_ai_agent/          ← Tier 1: full 5-file treatment
├── result_anomaly/             ← Tier 1: full 5-file treatment
├── Agentic_AI_Projects/        ← Tier 2: metadata + readme only
├── web_wizard/                 ← Tier 2: metadata + readme only
└── neural_arun_labs/           ← Tier 2: metadata + readme only
```

## File Types Per Project

| File | Tier 1 | Tier 2 | Purpose |
|------|--------|--------|---------|
| `metadata.json` | ✅ | ✅ | Machine-readable project facts: URL, stack, status, visibility |
| `readme.md` | ✅ | ✅ | Clean problem/solution/features (no install noise) |
| `architecture.md` | ✅ | ❌ | System design: components, data flow, design patterns used |
| `code_summaries.json` | ✅ | ❌ | Per-module summaries with GitHub file URLs |
| `decisions.md` | ✅ | ❌ | Key architectural decisions + reasoning (written by Arun) |
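For illustration, here is a plausible shape for one project's `metadata.json`. The table above names only URL, stack, status, and visibility; every other field, and all concrete values, are assumptions rather than real file contents.

```python
import json

# Hypothetical metadata.json for one Tier 1 project. Field names beyond
# url / tech_stack / status / visibility (and all values) are illustrative.
metadata = {
    "project_name": "legal_RAG_system",
    "url": "https://github.com/<owner>/legal_RAG_system",  # placeholder owner
    "tech_stack": ["Python", "ChromaDB", "OpenAI API"],    # assumed stack
    "status": "active",
    "visibility": "public",
    "tier": 1,
}

# "Machine-readable" here means the file must round-trip cleanly as JSON.
serialized = json.dumps(metadata, indent=2)
assert json.loads(serialized) == metadata
```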

## Tiering Criteria

**Tier 1 — Full treatment:** Projects that demonstrate real, original engineering problem-solving. Flagship work. These are the projects that define Arun professionally.

**Tier 2 — Lightweight:** Projects that show breadth across domains or active learning, but lack the depth of complexity to warrant full architecture documentation.


## How This Dataset Is Used

During ingestion, each file is:

  1. Chunked with metadata tags (source, project_name, type, tech_stack, status, visibility)
2. Embedded using OpenAI `text-embedding-3-small`
  3. Stored in the vector database
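The tagging in step 1 can be sketched as a simple fixed-size chunker. Only the metadata fields come from the list above; the character-based split, chunk size, and overlap are assumptions about a pipeline this README does not detail.

```python
def chunk_with_tags(text, project_name, doc_type, tech_stack, status,
                    visibility, chunk_size=800, overlap=100):
    """Split one dataset file into overlapping chunks, each carrying the
    metadata tags used later for filtered retrieval. Fixed-size character
    chunking is an assumption; the real pipeline may split differently."""
    tags = {
        "source": "data/github",
        "project_name": project_name,
        "type": doc_type,
        "tech_stack": tech_stack,
        "status": status,
        "visibility": visibility,
    }
    step = chunk_size - overlap
    return [
        {"text": text[i:i + chunk_size], "metadata": dict(tags)}
        for i in range(0, max(len(text), 1), step)
    ]
```

Each chunk's `text` would then be embedded and stored alongside its `metadata` dict.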

When a user asks "What RAG projects have you built?", the retrieval engine pulls from `legal_RAG_system/architecture.md` and `legal_RAG_system/readme.md`. When asked "Why did you use ChromaDB?", it retrieves from `legal_RAG_system/decisions.md`.
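A toy version of that routing, assuming retrieval pairs vector similarity with exact-match metadata filters (the `filter_chunks` helper and the two corpus entries below are hypothetical; vector stores such as ChromaDB expose an equivalent `where` filter):

```python
def filter_chunks(chunks, **must_match):
    """Keep only chunks whose metadata matches every given key exactly,
    a stand-in for the metadata filter applied alongside vector search."""
    return [c for c in chunks
            if all(c["metadata"].get(k) == v for k, v in must_match.items())]

corpus = [  # two illustrative chunks, not real dataset content
    {"text": "Retrieval pipeline design ...",
     "metadata": {"project_name": "legal_RAG_system", "type": "architecture"}},
    {"text": "Chose ChromaDB because ...",
     "metadata": {"project_name": "legal_RAG_system", "type": "decisions"}},
]

# A "why" question routes to decisions.md chunks for the matching project.
hits = filter_chunks(corpus, project_name="legal_RAG_system", type="decisions")
```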