
# data/github/

The curated knowledge base for every GitHub project. This is the highest-signal section of the ArunCore dataset — structured, reviewed documentation for each project that demonstrates real engineering capability.


## How It Was Built

1. All repos were cloned into `temp/repos/` locally
2. Source code, READMEs, and requirements files were analysed file by file
3. Five dataset files were generated per Tier 1 project (written from code analysis, not just READMEs)
4. Arun filled in `decisions.md` with the why behind each architectural choice
5. Final files were reviewed and formatted before being committed here

## Folder Structure

One subfolder per project:

```
github/
├── legal_RAG_system/           ← Tier 1: full 5-file treatment
├── real_state_listing_scraper/ ← Tier 1: full 5-file treatment
├── personal_ai_agent/          ← Tier 1: full 5-file treatment
├── result_anomaly/             ← Tier 1: full 5-file treatment
├── Agentic_AI_Projects/        ← Tier 2: metadata + readme only
├── web_wizard/                 ← Tier 2: metadata + readme only
└── neural_arun_labs/           ← Tier 2: metadata + readme only
```

## File Types Per Project

| File | Tier 1 | Tier 2 | Purpose |
|------|--------|--------|---------|
| `metadata.json` | ✅ | ✅ | Machine-readable project facts: URL, stack, status, visibility |
| `readme.md` | ✅ | ✅ | Clean problem/solution/features (no install noise) |
| `architecture.md` | ✅ | ❌ | System design: components, data flow, design patterns used |
| `code_summaries.json` | ✅ | ❌ | Per-module summaries with GitHub file URLs |
| `decisions.md` | ✅ | ❌ | Key architectural decisions + reasoning (written by Arun) |
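For illustration, here is a plausible shape for one project's `metadata.json`. The table above names only URL, stack, status, and visibility; every other field, and all concrete values, are assumptions rather than real file contents.

```python
import json

# Hypothetical metadata.json for one Tier 1 project. Field names beyond
# url / tech_stack / status / visibility (and all values) are illustrative.
metadata = {
    "project_name": "legal_RAG_system",
    "url": "https://github.com/<owner>/legal_RAG_system",  # placeholder owner
    "tech_stack": ["Python", "ChromaDB", "OpenAI API"],    # assumed stack
    "status": "active",
    "visibility": "public",
    "tier": 1,
}

# "Machine-readable" here means the file must round-trip cleanly as JSON.
serialized = json.dumps(metadata, indent=2)
assert json.loads(serialized) == metadata
```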

## Tiering Criteria

**Tier 1 — Full treatment:** Projects that demonstrate real, original engineering problem-solving. Flagship work. These are the projects that define Arun professionally.

**Tier 2 — Lightweight:** Projects that show breadth across domains or active learning, but lack the depth of complexity to warrant full architecture documentation.


## How This Dataset Is Used

During ingestion, each file is:

  1. Chunked with metadata tags (source, project_name, type, tech_stack, status, visibility)
2. Embedded using OpenAI `text-embedding-3-small`
  3. Stored in the vector database
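The tagging in step 1 can be sketched as a simple fixed-size chunker. Only the metadata fields come from the list above; the character-based split, chunk size, and overlap are assumptions about a pipeline this README does not detail.

```python
def chunk_with_tags(text, project_name, doc_type, tech_stack, status,
                    visibility, chunk_size=800, overlap=100):
    """Split one dataset file into overlapping chunks, each carrying the
    metadata tags used later for filtered retrieval. Fixed-size character
    chunking is an assumption; the real pipeline may split differently."""
    tags = {
        "source": "data/github",
        "project_name": project_name,
        "type": doc_type,
        "tech_stack": tech_stack,
        "status": status,
        "visibility": visibility,
    }
    step = chunk_size - overlap
    return [
        {"text": text[i:i + chunk_size], "metadata": dict(tags)}
        for i in range(0, max(len(text), 1), step)
    ]
```

Each chunk's `text` would then be embedded and stored alongside its `metadata` dict.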

When a user asks "What RAG projects have you built?", the retrieval engine pulls from `legal_RAG_system/architecture.md` and `legal_RAG_system/readme.md`. When asked "Why did you use ChromaDB?", it retrieves from `legal_RAG_system/decisions.md`.
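A toy version of that routing, assuming retrieval pairs vector similarity with exact-match metadata filters (the `filter_chunks` helper and the two corpus entries below are hypothetical; vector stores such as ChromaDB expose an equivalent `where` filter):

```python
def filter_chunks(chunks, **must_match):
    """Keep only chunks whose metadata matches every given key exactly,
    a stand-in for the metadata filter applied alongside vector search."""
    return [c for c in chunks
            if all(c["metadata"].get(k) == v for k, v in must_match.items())]

corpus = [  # two illustrative chunks, not real dataset content
    {"text": "Retrieval pipeline design ...",
     "metadata": {"project_name": "legal_RAG_system", "type": "architecture"}},
    {"text": "Chose ChromaDB because ...",
     "metadata": {"project_name": "legal_RAG_system", "type": "decisions"}},
]

# A "why" question routes to decisions.md chunks for the matching project.
hits = filter_chunks(corpus, project_name="legal_RAG_system", type="decisions")
```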