
Codebase Explanation & Walkthrough

This document provides a detailed technical explanation of the "AI Stock Prediction System" (Stockker). Use this to understand the underlying logic and architecture.

1. System Architecture (The Big Picture)

The system operates as a Microservices-based Pipeline with four distinct stages:

  1. Ingestion Layer: Fetches raw market data (Open, High, Low, Close, Volume) from Alpha Vantage.
  2. Processing Layer: Transforms raw data into technical indicators (Features).
  3. Training Orchestration: A Prefect pipeline that trains, evaluates, and saves models.
  4. Inference API: A FastAPI server that loads these models to provide real-time predictions.

2. Key Components Explained

A. Data Ingestion & Processing

File: src/processing/features.py

Feature engineering is critical for financial ML. We treat the market data as a time-series problem and derive the following indicators (a minimal sketch follows the list):

  • SMA (Simple Moving Average): Calculates the trend over 20 and 50 days.
  • RSI (Relative Strength Index): A momentum oscillator (0-100) to identify overbought/oversold conditions.
  • MACD (Moving Average Convergence Divergence): Tracks momentum changes.
  • Target Variables:
    • target_price: Next day's closing price (Regression).
    • target_direction: 1 (Up) or 0 (Down) for next day (Classification).
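
The sketch below shows these transformations, assuming a pandas DataFrame with a close column. The 14-day RSI window and 12/26-day MACD spans are standard conventions assumed here, not confirmed from features.py:

```python
import pandas as pd

def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative feature engineering; mirrors the indicators listed above."""
    # Trend: simple moving averages over 20 and 50 days
    df["sma_20"] = df["close"].rolling(window=20).mean()
    df["sma_50"] = df["close"].rolling(window=50).mean()

    # Momentum: RSI (simple rolling-mean variant, standard 14-day window)
    delta = df["close"].diff()
    gain = delta.clip(lower=0).rolling(window=14).mean()
    loss = (-delta.clip(upper=0)).rolling(window=14).mean()
    df["rsi"] = 100 - 100 / (1 + gain / loss)

    # Momentum shifts: MACD = 12-day EMA minus 26-day EMA
    ema_12 = df["close"].ewm(span=12, adjust=False).mean()
    ema_26 = df["close"].ewm(span=26, adjust=False).mean()
    df["macd"] = ema_12 - ema_26

    # Targets: next day's close (regression) and up/down flag (classification)
    df["target_price"] = df["close"].shift(-1)
    df["target_direction"] = (df["target_price"] > df["close"]).astype(int)

    # Drop rows made incomplete by rolling windows and the shifted target
    return df.dropna()
```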

B. Machine Learning Models (Ensemble Approach)

File: src/models/train.py

Instead of relying on a single algorithm, we use Ensemble Learning for robustness.

1. Regression (Predicting Price)

We use a Voting Regressor, which averages predictions from three models:

  • Linear Regression: Captures simple linear trends.
  • Random Forest Regressor: Captures complex, non-linear patterns (100 Decision Trees).
  • SVR (Support Vector Regressor): Uses an RBF kernel to find the optimal hyperplane in high-dimensional space.

Why combine them? Averaging the three predictions reduces the variance and error of any single model (see the sketch below).
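
A minimal sketch of this ensemble with scikit-learn. The 100 trees and RBF kernel come from the description above; the random_state and remaining hyperparameters are illustrative:

```python
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR

# VotingRegressor averages the predictions of its estimators
price_model = VotingRegressor(
    estimators=[
        ("lr", LinearRegression()),
        ("rf", RandomForestRegressor(n_estimators=100, random_state=42)),
        ("svr", SVR(kernel="rbf")),
    ]
)
# price_model.fit(X_train, y_train)
# price_model.predict(X_test)  # mean of the three models' outputs
```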

2. Classification (Predicting Direction)

We use a Voting Classifier (Soft Voting) combining:

  • Random Forest Classifier: Robust against overfitting.
  • SVC (Support Vector Classifier): Good at separating classes with clear margins.

Soft voting averages each model's predicted class probabilities instead of counting their final votes, which yields more nuanced predictions (sketched below).
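
A minimal sketch, again with illustrative hyperparameters. Note that SVC must be constructed with probability=True so it can contribute probabilities to a soft vote:

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.svm import SVC

direction_model = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
        ("svc", SVC(kernel="rbf", probability=True)),
    ],
    voting="soft",  # average predicted probabilities, not hard 0/1 votes
)
```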

3. Unsupervised Learning (Market Analysis)

  • K-Means Clustering: Groups market days into 3 clusters (e.g., Low Volatility, High Volatility) based on volatility and RSI.
  • PCA: Reduces our four indicator dimensions (SMA20, SMA50, RSI, MACD) to 2 principal components for 2D visualization. Both steps are sketched below.
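
A sketch of both steps. The volatility column is an assumption (a 20-day rolling standard deviation of daily returns); the other column names mirror the features above, and scaling before the distance-based methods is a standard precaution, not confirmed from the source:

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumed volatility proxy: 20-day rolling std of daily returns
df["volatility"] = df["close"].pct_change().rolling(window=20).std()
df = df.dropna()

# Group market days into 3 regimes based on volatility and RSI
X_cluster = StandardScaler().fit_transform(df[["volatility", "rsi"]])
df["regime"] = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_cluster)

# Project the four indicators onto 2 principal components for plotting
X_ind = StandardScaler().fit_transform(df[["sma_20", "sma_50", "rsi", "macd"]])
coords_2d = PCA(n_components=2).fit_transform(X_ind)
```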

C. Orchestration (The Pipeline)

File: src/orchestration/flows.py

We use Prefect to manage the workflow.

  • main_pipeline: Loops through our stock list (AAPL, GOOGL, MSFT, AMZN, TSLA, NVDA).
  • For each stock, it sequentially runs: Fetch -> Process -> Train -> Evaluate -> Notify Discord.
  • Error Handling: If one stock fails, the pipeline logs the error (via Discord) and continues to the next, ensuring resilience. A sketch of this structure follows.
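
A minimal sketch of the flow's shape. Task bodies are stubs, the evaluation step is elided, and a print stands in for the Discord notification:

```python
from prefect import flow, task

TICKERS = ["AAPL", "GOOGL", "MSFT", "AMZN", "TSLA", "NVDA"]

@task
def fetch(symbol: str): ...   # pull OHLCV from Alpha Vantage

@task
def process(raw): ...         # build indicator features

@task
def train(features): ...      # fit and save the ensembles

@flow
def main_pipeline():
    for symbol in TICKERS:
        try:
            raw = fetch(symbol)
            features = process(raw)
            train(features)
        except Exception as exc:
            # One bad ticker shouldn't halt the run: log and move on
            # (the real flow reports this to Discord).
            print(f"{symbol} failed: {exc}")
```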

D. The API (Model Serving)

File: src/api/main.py

  • Dynamic Loading: On startup (@app.on_event("startup")), the API scans the models/ directory. It dynamically loads whatever models it finds (e.g., models/NVDA/regression_model.pkl), making the system easily extensible to new stocks without changing code.
  • Endpoints: Exposes REST endpoints (/predict/price, /predict/direction) that the frontend consumes. The loading pattern is sketched below.
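
A sketch of the startup scan and a lookup endpoint. It assumes the .pkl files load with joblib (plain pickle would also work), and the response shape is illustrative:

```python
from pathlib import Path

import joblib
from fastapi import FastAPI, HTTPException

app = FastAPI()
models: dict[str, object] = {}

@app.on_event("startup")
def load_models() -> None:
    # Pick up every models/<TICKER>/regression_model.pkl found on disk,
    # so new stocks appear automatically after the next restart.
    for path in Path("models").glob("*/regression_model.pkl"):
        models[path.parent.name] = joblib.load(path)

@app.get("/predict/price")
def predict_price(symbol: str):
    model = models.get(symbol.upper())
    if model is None:
        raise HTTPException(status_code=404, detail=f"No model for {symbol}")
    # The real endpoint builds features from the latest market data;
    # this placeholder only demonstrates the lookup.
    return {"symbol": symbol.upper(), "model_loaded": True}
```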

3. Infrastructure & DevOps

Docker

File: Dockerfile & docker-compose.yml

  • We containerize the application to ensure it runs identically on your laptop and in the cloud.
  • The docker-compose setup includes a Postgres service, which is used exclusively by Prefect to store flow run history. The main app uses a file-based system (CSVs/PKLs) for simplicity and portability.

CI/CD (GitHub Actions)

File: .github/workflows/deploy_to_hf.yml

  • On every push to main, GitHub Actions automatically:
    1. Runs pytest to verify code correctness.
    2. Pushes the code to the Hugging Face Space, triggering a new deployment.

4. Why This Architecture?

  • Modularity: Separation of concerns (Ingestion vs Training vs Serving) makes debugging easy.
  • Scalability: Adding a new stock (like NVDA) only requires adding a ticker string to the list; the pipeline handles the rest.
  • Reliability: Ensembles prevent "putting all eggs in one basket" model-wise.