YAML Metadata Warning:empty or missing yaml metadata in repo card

Check out the documentation for more information.

KairoAI - Complete Project Documentation

Indian Sign Language Learning App with AI-Powered Hand Detection

Author: Megh Modi
Created: December 18, 2025
Version: 1.0.0
Status: Planning & Architecture Phase


Table of Contents

  1. Executive Summary
  2. Project Vision & Goals
  3. Technical Architecture
  4. Technology Stack
  5. Understanding the AI Pipeline
  6. Data Flow & Pipeline
  7. MediaPipe Explained
  8. DNN Model Explained
  9. Platform Channels Explained
  10. Dataset Creation Guide
  11. Model Training Guide
  12. Implementation Roadmap
  13. Code Structure
  14. Challenges & Solutions
  15. Feasibility Assessment
  16. Resources & Learning Path

1. Executive Summary

What is KairoAI?

KairoAI is an Indian Sign Language (ISL) learning application designed specifically for children. The app uses real-time hand gesture detection via the device camera to teach ISL alphabets and words, providing instant feedback to students.

Core Innovation

The app combines three powerful technologies:

  • Flutter for cross-platform UI
  • MediaPipe for hand detection (running natively on Android)
  • TensorFlow Lite for sign language classification

Key Differentiator

Unlike traditional learning apps, KairoAI provides real-time visual feedback by:

  1. Showing the user what sign to make
  2. Detecting their hand position using the camera
  3. Validating if they're making the correct sign
  4. Providing instant feedback (success/try again)

2. Project Vision & Goals

Primary Goal

Create an accessible, engaging platform for children to learn Indian Sign Language through interactive, AI-powered lessons.

Target Users

  • Primary: Children aged 6-14 learning ISL
  • Secondary: Parents and educators teaching ISL
  • Tertiary: Anyone interested in learning ISL

Core Features

1. Lesson Mode

  • Display a target alphabet (e.g., "A") or word (e.g., "MEGH")
  • Open device camera
  • Detect student's hand sign in real-time
  • Validate against expected sign
  • Show success animation/sound on correct detection
  • Provide guidance hints on incorrect attempts

2. Quiz Mode

  • Present random alphabets or words
  • Student performs signs sequentially
  • Each detected letter is validated in order
  • Progress only on correct detection
  • Track accuracy and completion time

3. Progress Tracking

  • Store lesson completion in Firebase Firestore
  • Track quiz scores and accuracy
  • Visualize learning progress over time
  • Gamification elements (badges, streaks)

3. Technical Architecture

High-Level Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                        FLUTTER LAYER (UI)                        β”‚
β”‚                         Written in Dart                          β”‚
β”‚                                                                  β”‚
β”‚  β€’ Lessons UI          β€’ Quiz UI           β€’ Progress Dashboard β”‚
β”‚  β€’ Camera Preview      β€’ Feedback Animations                    β”‚
β”‚  β€’ Firebase Integration (Auth, Firestore)                       β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                         β”‚
                         β”‚ Platform Channels (Bridge)
                         β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                     KOTLIN LAYER (Android Native)                β”‚
β”‚                                                                  β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”          β”‚
β”‚  β”‚   CameraX    β”‚β†’ β”‚  MediaPipe   β”‚β†’ β”‚  TensorFlow  β”‚          β”‚
β”‚  β”‚              β”‚  β”‚    Hands     β”‚  β”‚     Lite     β”‚          β”‚
β”‚  β”‚ Capture      β”‚  β”‚ Detect hand  β”‚  β”‚ Classify     β”‚          β”‚
β”‚  β”‚ frames       β”‚  β”‚ Extract 21   β”‚  β”‚ sign         β”‚          β”‚
β”‚  β”‚              β”‚  β”‚ landmarks    β”‚  β”‚              β”‚          β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜          β”‚
β”‚                                                                  β”‚
β”‚  Returns: { letter: "A", confidence: 0.95, handDetected: true } β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Architecture Principles

Why Hybrid Architecture?

Component Layer Reason
UI & Navigation Flutter (Dart) Cross-platform, fast development, beautiful UI
Camera & ML Kotlin (Native) Direct hardware access, optimized performance
Firebase Flutter (Dart) Easy integration, real-time sync

Key Design Decision

DO NOT attempt to run MediaPipe or camera processing in Dart.

Why?

  • Flutter cannot directly access native camera APIs efficiently
  • MediaPipe requires native Android/iOS libraries
  • ML inference is faster in native code
  • Better battery performance with native implementation

4. Technology Stack

Flutter Side (Dart)

Core Dependencies

dependencies:
  flutter:
    sdk: flutter
  
  # Firebase
  firebase_core: ^2.24.2        # Firebase initialization
  firebase_auth: ^4.16.0        # User authentication
  cloud_firestore: ^4.14.0      # Database for progress tracking
  
  # State Management
  provider: ^6.1.1              # For managing app state
  
  # Navigation
  go_router: ^13.0.0            # Declarative routing
  
  # UI/UX Enhancements
  lottie: ^3.0.0                # Success animations
  audioplayers: ^5.2.1          # Sound effects
  
  # Utilities
  equatable: ^2.0.5             # Value comparison

Why These Libraries?

Library Purpose Alternative
provider Simple state management riverpod, bloc
go_router Type-safe routing auto_route, manual routing
lottie Beautiful animations flare, custom animations

Android/Kotlin Side

Build Configuration

// android/app/build.gradle.kts

plugins {
    id("com.android.application")
    id("kotlin-android")
    id("dev.flutter.flutter-gradle-plugin")
}

android {
    namespace = "com.kairo.ai"
    compileSdk = 34
    
    defaultConfig {
        applicationId = "com.kairo.ai"
        minSdk = 26        // Required for CameraX
        targetSdk = 34
        versionCode = 1
        versionName = "1.0"
    }

    compileOptions {
        sourceCompatibility = JavaVersion.VERSION_17
        targetCompatibility = JavaVersion.VERSION_17
    }

    kotlinOptions {
        jvmTarget = "17"
    }

    // Required for TFLite model files
    aaptOptions {
        noCompress("tflite")
    }
}

Dependencies

dependencies {
    // MediaPipe Tasks Vision (Hand Landmark Detection)
    implementation("com.google.mediapipe:tasks-vision:0.10.14")
    
    // TensorFlow Lite
    implementation("org.tensorflow:tensorflow-lite:2.14.0")
    implementation("org.tensorflow:tensorflow-lite-support:0.4.4")
    
    // CameraX (Camera API)
    implementation("androidx.camera:camera-core:1.3.1")
    implementation("androidx.camera:camera-camera2:1.3.1")
    implementation("androidx.camera:camera-lifecycle:1.3.1")
    implementation("androidx.camera:camera-view:1.3.1")
    
    // Coroutines for async operations
    implementation("org.jetbrains.kotlinx:kotlinx-coroutines-android:1.7.3")
}

Python Side (Model Training)

Required Libraries

tensorflow==2.15.0           # Deep learning framework
mediapipe==0.10.9            # Hand landmark extraction
opencv-python==4.8.1.78      # Image processing
numpy==1.26.2                # Numerical operations
pandas==2.1.3                # Data manipulation
scikit-learn==1.3.2          # ML utilities
matplotlib==3.8.2            # Visualization

Installation

pip install tensorflow mediapipe opencv-python numpy pandas scikit-learn matplotlib

5. Understanding the AI Pipeline

What is AI Doing in This App?

The AI has one primary job: "Look at the camera and tell me which ISL letter the user is showing"

The Problem Breakdown

Traditional Approach (Pure Image Classification)

Camera Image β†’ CNN Model β†’ Letter
Problem: Slow, requires huge dataset, background sensitive

Our Smart Approach (Landmark-based Classification)

Camera Image β†’ MediaPipe (find hand) β†’ Extract landmarks β†’ DNN Model β†’ Letter
Benefit: Fast, small dataset, background-independent

Why Two AI Models?

Model 1: MediaPipe Hands (Google's Pre-trained Model)

Job: Find the hand and identify 21 key points

Input: Camera frame (640Γ—480 pixels)
Output: 21 landmark points (x, y, z coordinates)

Example landmarks:
Point 0: Wrist
Point 1-4: Thumb (base to tip)
Point 5-8: Index finger
Point 9-12: Middle finger
Point 13-16: Ring finger
Point 17-20: Pinky finger

Why use it?

  • Already trained by Google on millions of images
  • Works in real-time on mobile devices
  • Handles different hand sizes, skin tones, lighting
  • Free to use

Model 2: Your Custom TFLite Model (Train Yourself)

Job: Classify the 21 landmark points into ISL letters

Input: 63 numbers (21 points Γ— 3 coordinates)
Output: Letter (A-Z) + confidence score

Example:
Input: [0.45, 0.82, 0.01, 0.52, 0.75, ...]
Output: { letter: "A", confidence: 0.95 }

Why train your own?

  • ISL signs are unique (different from ASL)
  • You control accuracy by adding more training data
  • Model is tiny (~50-100 KB)
  • Fast inference (~1-5ms)

6. Data Flow & Pipeline

Complete Pipeline: Camera β†’ Detection β†’ Flutter UI

Step-by-Step Data Transformation

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ STEP 1: CAMERA CAPTURE (CameraX)                                 β”‚
β”‚                                                                  β”‚
β”‚ Input:  Nothing (hardware)                                       β”‚
β”‚ Process: Open camera, capture frames at 30 FPS                  β”‚
β”‚ Output: Bitmap (640Γ—480 RGB image)                              β”‚
β”‚ Data Size: 921,600 values (640 Γ— 480 Γ— 3)                       β”‚
β”‚                                                                  β”‚
β”‚ Visual:                                                          β”‚
β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                                             β”‚
β”‚ β”‚ β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘ β”‚                                             β”‚
β”‚ β”‚ β–‘β–‘β–‘β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‘β–‘β–‘β–‘β–‘ β”‚  ← Raw camera frame                        β”‚
β”‚ β”‚ β–‘β–‘β–ˆβ–ˆβ–ˆβ–‘β–‘β–‘β–ˆβ–ˆβ–ˆβ–‘β–‘β–‘β–‘ β”‚     (user's hand visible)                  β”‚
β”‚ β”‚ β–‘β–‘β–ˆβ–ˆβ–ˆβ–‘β–‘β–‘β–ˆβ–ˆβ–ˆβ–‘β–‘β–‘β–‘ β”‚                                             β”‚
β”‚ β”‚ β–‘β–‘β–‘β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‘β–‘β–‘β–‘β–‘ β”‚                                             β”‚
β”‚ β”‚ β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘ β”‚                                             β”‚
β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                                             β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                         β”‚
                         β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ STEP 2: HAND DETECTION (MediaPipe)                               β”‚
β”‚                                                                  β”‚
β”‚ Input:  Bitmap (921,600 values)                                  β”‚
β”‚ Process:                                                         β”‚
β”‚   1. Detect if hand exists in frame                             β”‚
β”‚   2. Locate 21 anatomical landmarks                             β”‚
β”‚   3. Extract (x, y, z) for each point                           β”‚
β”‚ Output: FloatArray[63] = [x0,y0,z0, x1,y1,z1, ..., x20,y20,z20]β”‚
β”‚ Data Size: 63 float values                                      β”‚
β”‚ Reduction: 921,600 β†’ 63 (99.99% reduction!)                     β”‚
β”‚                                                                  β”‚
β”‚ Visual - 21 Landmark Points:                                     β”‚
β”‚         8   12  16  20  (fingertips)                            β”‚
β”‚         |   |   |   |                                           β”‚
β”‚     7   11  15  19  |                                           β”‚
β”‚     |   |   |   |   |                                           β”‚
β”‚     6   10  14  18  |                                           β”‚
β”‚     |   |   |   |   |                                           β”‚
β”‚     5───9───13──17β”€β”€β”˜                                           β”‚
β”‚      \                                                           β”‚
β”‚       4───3───2───1 (thumb)                                     β”‚
β”‚            \                                                     β”‚
β”‚             0 (wrist)                                            β”‚
β”‚                                                                  β”‚
β”‚ Example Output for Letter "A":                                  β”‚
β”‚ [0.45, 0.82, 0.01,  ← Point 0 (wrist)                          β”‚
β”‚  0.52, 0.75, 0.02,  ← Point 1 (thumb base)                     β”‚
β”‚  0.58, 0.65, 0.03,  ← Point 2 (thumb middle)                   β”‚
β”‚  ...                                                             β”‚
β”‚  0.63, 0.42, 0.01]  ← Point 20 (pinky tip)                     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                         β”‚
                         β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ STEP 3: SIGN CLASSIFICATION (TensorFlow Lite)                    β”‚
β”‚                                                                  β”‚
β”‚ Input:  FloatArray[63]                                           β”‚
β”‚ Process:                                                         β”‚
β”‚   Neural Network Layers:                                         β”‚
β”‚   [63] β†’ [128 neurons] β†’ [64 neurons] β†’ [32 neurons] β†’ [26]    β”‚
β”‚         Dense Layer    Dense Layer    Dense Layer    Softmax    β”‚
β”‚                                                                  β”‚
β”‚ Output: Probabilities for each letter                           β”‚
β”‚   [p_A, p_B, p_C, ..., p_Z]                                     β”‚
β”‚                                                                  β”‚
β”‚ Example:                                                         β”‚
β”‚ Input:  [0.45, 0.82, 0.01, ...]                                 β”‚
β”‚ Output: [0.01, 0.03, 0.95, 0.00, ...]                          β”‚
β”‚          (1%   3%    95%   0%   ...)                            β”‚
β”‚                                                                  β”‚
β”‚ Best Prediction: Letter "C" with 95% confidence                 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                         β”‚
                         β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ STEP 4: SEND TO FLUTTER (Platform Channel - EventChannel)        β”‚
β”‚                                                                  β”‚
β”‚ Kotlin prepares data as Map:                                     β”‚
β”‚ {                                                                β”‚
β”‚   "letter": "C",                                                 β”‚
β”‚   "confidence": 0.95,                                            β”‚
β”‚   "handDetected": true,                                          β”‚
β”‚   "timestamp": 1702907345000                                     β”‚
β”‚ }                                                                β”‚
β”‚                                                                  β”‚
β”‚ Sends through EventChannel (continuous stream)                  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                         β”‚
                         β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ STEP 5: FLUTTER RECEIVES & DISPLAYS (Dart)                       β”‚
β”‚                                                                  β”‚
β”‚ EventChannel stream listener receives data                      β”‚
β”‚                                                                  β”‚
β”‚ setState(() {                                                    β”‚
β”‚   currentLetter = "C";                                           β”‚
β”‚   confidence = 0.95;                                             β”‚
β”‚   handDetected = true;                                           β”‚
β”‚ });                                                              β”‚
β”‚                                                                  β”‚
β”‚ UI Updates:                                                      β”‚
β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                                  β”‚
β”‚ β”‚    🎯 Target: Letter A     β”‚                                  β”‚
β”‚ β”‚                            β”‚                                  β”‚
β”‚ β”‚    πŸ“· [Camera Preview]     β”‚                                  β”‚
β”‚ β”‚                            β”‚                                  β”‚
β”‚ β”‚  ❌ Detected: C (95%)      β”‚ ← Red (incorrect)               β”‚
β”‚ β”‚  πŸ’‘ Hint: Try again!       β”‚                                  β”‚
β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                                  β”‚
β”‚                                                                  β”‚
β”‚ If correct (C == target):                                        β”‚
β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                                  β”‚
β”‚ β”‚  βœ… Correct! (95%)         β”‚ ← Green (success)               β”‚
β”‚ β”‚  πŸŽ‰ [Success Animation]    β”‚                                  β”‚
β”‚ β”‚  πŸ”Š [Success Sound]        β”‚                                  β”‚
β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                                  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Processing Speed

Step Processing Time Frequency
Camera Frame ~33ms 30 FPS
MediaPipe Detection ~10-20ms Per frame
TFLite Classification ~1-5ms Per detection
Platform Channel <1ms Per result
Total Latency ~50ms ~20 detections/sec

This means users get near-instant feedback!


7. MediaPipe Explained

What is MediaPipe?

MediaPipe is Google's open-source framework for building multimodal (video, audio, text) applied machine learning pipelines.

MediaPipe Hands Solution

Specifically designed to detect and track hands in real-time.

Key Capabilities:

  • Detects up to 2 hands simultaneously
  • Works in various lighting conditions
  • Handles different hand sizes and skin tones
  • Runs efficiently on mobile devices
  • Provides 21 3D landmark points per hand

The 21 Hand Landmarks

Landmark Numbering System:

        8   12  16  20
        β”‚   β”‚   β”‚   β”‚
    7───11──15──19──│
    β”‚   β”‚   β”‚   β”‚   β”‚
    6───10──14──18──│
    β”‚   β”‚   β”‚   β”‚   β”‚
    5───9───13──17β”€β”€β”˜
     \
      4───3───2───1
           \
            0

Point 0:  Wrist
Point 1:  Thumb CMC (base)
Point 2:  Thumb MCP
Point 3:  Thumb IP
Point 4:  Thumb tip

Point 5:  Index finger MCP
Point 6:  Index finger PIP
Point 7:  Index finger DIP
Point 8:  Index finger tip

Point 9:  Middle finger MCP
Point 10: Middle finger PIP
Point 11: Middle finger DIP
Point 12: Middle finger tip

Point 13: Ring finger MCP
Point 14: Ring finger PIP
Point 15: Ring finger DIP
Point 16: Ring finger tip

Point 17: Pinky MCP
Point 18: Pinky PIP
Point 19: Pinky DIP
Point 20: Pinky tip

Coordinate System

Each landmark has 3 coordinates:

X Coordinate

  • Range: 0.0 to 1.0
  • 0.0 = left edge of image
  • 1.0 = right edge of image
  • Normalized (independent of image resolution)

Y Coordinate

  • Range: 0.0 to 1.0
  • 0.0 = top edge of image
  • 1.0 = bottom edge of image
  • Normalized (independent of image resolution)

Z Coordinate

  • Approximate depth from wrist
  • Smaller values = closer to camera
  • Relative to wrist (Point 0)
  • Units: roughly in same scale as X

Example Coordinates

Letter "A" (closed fist with thumb up):

Point 0 (Wrist):        x=0.50, y=0.70, z=0.00
Point 1 (Thumb base):   x=0.48, y=0.65, z=0.02
Point 2 (Thumb mid):    x=0.46, y=0.58, z=0.03
Point 3 (Thumb bend):   x=0.44, y=0.52, z=0.04
Point 4 (Thumb tip):    x=0.42, y=0.45, z=0.05
Point 5 (Index base):   x=0.54, y=0.66, z=0.01
Point 6 (Index mid):    x=0.56, y=0.68, z=0.00
Point 7 (Index bend):   x=0.57, y=0.69, z=-0.01
Point 8 (Index tip):    x=0.58, y=0.70, z=-0.02
...

MediaPipe Integration Code

// File: android/app/src/main/kotlin/com/kairo/ai/ml/HandLandmarkDetector.kt

import com.google.mediapipe.tasks.vision.handlandmarker.HandLandmarker
import com.google.mediapipe.tasks.vision.handlandmarker.HandLandmarkerResult
import com.google.mediapipe.framework.image.BitmapImageBuilder
import com.google.mediapipe.framework.image.MPImage
import android.graphics.Bitmap
import android.content.Context

class HandLandmarkDetector(context: Context) {
    
    private val handLandmarker: HandLandmarker
    
    init {
        // Configure MediaPipe Hands
        val options = HandLandmarker.HandLandmarkerOptions.builder()
            .setBaseOptions(
                BaseOptions.builder()
                    .setModelAssetPath("hand_landmarker.task")  // MediaPipe's pre-trained model
                    .build()
            )
            .setNumHands(1)  // Detect only one hand
            .setMinHandDetectionConfidence(0.5f)  // 50% confidence threshold
            .setMinHandPresenceConfidence(0.5f)
            .setMinTrackingConfidence(0.5f)
            .build()
        
        handLandmarker = HandLandmarker.createFromOptions(context, options)
    }
    
    /**
     * Detect hand landmarks from a camera frame
     * 
     * @param bitmap The camera frame
     * @return FloatArray of 63 values [x0,y0,z0, x1,y1,z1, ..., x20,y20,z20]
     *         or null if no hand detected
     */
    fun detectLandmarks(bitmap: Bitmap): FloatArray? {
        // Convert Android Bitmap to MediaPipe Image
        val mpImage: MPImage = BitmapImageBuilder(bitmap).build()
        
        // Run hand detection
        val result: HandLandmarkerResult = handLandmarker.detect(mpImage)
        
        // Check if any hands were detected
        if (result.landmarks().isEmpty()) {
            return null  // No hand found
        }
        
        // Get landmarks from first detected hand
        val handLandmarks = result.landmarks()[0]
        
        // Convert to flat array
        val landmarkArray = FloatArray(63)
        
        for (i in 0 until 21) {
            val landmark = handLandmarks[i]
            landmarkArray[i * 3 + 0] = landmark.x()
            landmarkArray[i * 3 + 1] = landmark.y()
            landmarkArray[i * 3 + 2] = landmark.z()
        }
        
        return landmarkArray
    }
    
    /**
     * Optional: Normalize landmarks relative to wrist
     * This makes the model robust to hand position/size
     */
    fun normalizeLandmarks(landmarks: FloatArray): FloatArray {
        val normalized = FloatArray(63)
        
        // Get wrist coordinates (point 0)
        val wristX = landmarks[0]
        val wristY = landmarks[1]
        val wristZ = landmarks[2]
        
        // Normalize all points relative to wrist
        for (i in 0 until 21) {
            normalized[i * 3 + 0] = landmarks[i * 3 + 0] - wristX
            normalized[i * 3 + 1] = landmarks[i * 3 + 1] - wristY
            normalized[i * 3 + 2] = landmarks[i * 3 + 2] - wristZ
        }
        
        return normalized
    }
}

8. DNN Model Explained

What is a DNN (Dense Neural Network)?

A DNN is a type of artificial neural network where every neuron in one layer is connected to every neuron in the next layer.

Simple Analogy

Think of it as a decision-making chain:

Your Brain Recognizing a Friend:

Eyes see features β†’ Brain processes patterns β†’ Brain decides who it is
(height, hair,     (tall + brown hair      ("It's John!")
 glasses, voice)    + glasses = pattern)

DNN for Sign Language

Input: 63 numbers β†’ Hidden layers process β†’ Output: Letter prediction
(hand landmarks)    (find patterns)          (A-Z with confidence)

DNN vs CNN vs Other Models

Model Type Input Best For When to Use
DNN (Dense) Numbers/Features Structured data, pre-extracted features When you have landmark coordinates
CNN (Convolutional) Images Raw image classification When working with raw pixels
RNN/LSTM Sequences Time series, text When order matters (sentences, videos)
Transformer Sequences Language, large-scale Complex NLP tasks (ChatGPT)

Why DNN for KairoAI?

Comparison: CNN vs DNN Approach

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    CNN APPROACH (Not Used)                       β”‚
β”‚                                                                  β”‚
β”‚  Camera Frame (640Γ—480Γ—3 = 921,600 values)                      β”‚
β”‚       ↓                                                          β”‚
β”‚  Convolutional Layers (find edges, shapes)                      β”‚
β”‚       ↓                                                          β”‚
β”‚  Pooling Layers (reduce size)                                   β”‚
β”‚       ↓                                                          β”‚
β”‚  More Conv Layers (find complex patterns)                       β”‚
β”‚       ↓                                                          β”‚
β”‚  Dense Layers (classify)                                        β”‚
β”‚       ↓                                                          β”‚
β”‚  Output: Letter                                                 β”‚
β”‚                                                                  β”‚
β”‚  Problems:                                                       β”‚
β”‚  ❌ Slow (100-200ms inference)                                  β”‚
β”‚  ❌ Large model (10-50 MB)                                      β”‚
β”‚  ❌ Needs 10,000+ images per class                              β”‚
β”‚  ❌ Background variations affect accuracy                       β”‚
β”‚  ❌ Sensitive to lighting, hand size, position                  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚              DNN APPROACH (Our Choice) βœ…                        β”‚
β”‚                                                                  β”‚
β”‚  MediaPipe Output (63 landmark values)                          β”‚
β”‚       ↓                                                          β”‚
β”‚  Dense Layer 1 (128 neurons, ReLU activation)                   β”‚
β”‚       ↓                                                          β”‚
β”‚  Dense Layer 2 (64 neurons, ReLU activation)                    β”‚
β”‚       ↓                                                          β”‚
β”‚  Dense Layer 3 (32 neurons, ReLU activation)                    β”‚
β”‚       ↓                                                          β”‚
β”‚  Output Layer (26 neurons, Softmax activation)                  β”‚
β”‚       ↓                                                          β”‚
β”‚  Output: Letter + Confidence                                    β”‚
β”‚                                                                  β”‚
β”‚  Benefits:                                                       β”‚
β”‚  βœ… Fast (1-5ms inference)                                      β”‚
β”‚  βœ… Tiny model (~50-100 KB)                                     β”‚
β”‚  βœ… Needs only 500-1000 images per class                        β”‚
β”‚  βœ… Background doesn't matter (landmarks only)                  β”‚
β”‚  βœ… Works in any lighting, any hand size                        β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Model Architecture

Layer-by-Layer Breakdown

# Our DNN Model Architecture

Input Layer:
  Shape: (63,)
  Description: 63 landmark coordinates
  Example: [0.45, 0.82, 0.01, 0.52, 0.75, ...]

    ↓ (fully connected)

Hidden Layer 1:
  Neurons: 128
  Activation: ReLU (Rectified Linear Unit)
  Dropout: 30% (prevents overfitting)
  Description: Learns basic patterns (finger positions)

    ↓ (fully connected)

Hidden Layer 2:
  Neurons: 64
  Activation: ReLU
  Dropout: 30%
  Description: Learns complex patterns (hand shapes)

    ↓ (fully connected)

Hidden Layer 3:
  Neurons: 32
  Activation: ReLU
  Dropout: 20%
  Description: Learns letter-specific features

    ↓ (fully connected)

Output Layer:
  Neurons: 26
  Activation: Softmax
  Description: Probability for each letter (A-Z)
  Example Output: [0.01, 0.03, 0.95, 0.00, ..., 0.01]
                   (1%   3%    95%   0%        1% )
                    A     B     C     D    ...  Z

What Each Layer Does

1. Input Layer (63 neurons)

  • Receives raw landmark coordinates
  • No processing, just passes data forward

2. Hidden Layer 1 (128 neurons)

  • Learns basic geometric relationships
  • Examples:
    • "Are fingers spread apart?"
    • "Is thumb extended?"
    • "What's the palm orientation?"

3. Hidden Layer 2 (64 neurons)

  • Combines basic patterns into complex ones
  • Examples:
    • "Thumb up + fingers curled = might be 'A'"
    • "All fingers extended = might be 'B'"

4. Hidden Layer 3 (32 neurons)

  • Fine-tunes letter-specific features
  • Distinguishes similar signs
  • Examples:
    • "Is this 'M' or 'N'?" (very similar in ISL)

5. Output Layer (26 neurons)

  • Each neuron represents one letter
  • Softmax ensures probabilities sum to 1.0
  • Highest probability = predicted letter

Activation Functions

ReLU (Rectified Linear Unit)

Formula: f(x) = max(0, x)

Graph:
    β”‚   β•±
    β”‚  β•±
    β”‚ β•±
────┼─────
    β”‚
    
Benefits:
- Fast to compute
- Prevents vanishing gradients
- Works well for hidden layers

Softmax

Formula: softmax(xi) = e^xi / Ξ£(e^xj)

Purpose:
- Converts raw scores to probabilities
- All outputs sum to 1.0
- Used in output layer for classification

Example:
Raw scores: [2.1, 0.5, 4.2, 1.3]
After softmax: [0.12, 0.02, 0.84, 0.02]
               (12%   2%   84%   2% )

Dropout Layers

Purpose: Prevent overfitting

During Training:
- Randomly "turn off" 30% of neurons
- Forces network to learn robust features
- Network can't rely on specific neurons

During Inference (app usage):
- All neurons active
- Uses learned patterns to classify

Model Training Code

# File: model_training/train_model.py

import tensorflow as tf
from tensorflow import keras
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Load landmark dataset
print("Loading dataset...")
data = pd.read_csv('landmarks_dataset.csv')

print(f"Dataset shape: {data.shape}")
print(f"Classes: {sorted(data['label'].unique())}")

# Separate features (X) and labels (y)
X = data.iloc[:, :-1].values  # 63 landmark columns
y = data.iloc[:, -1].values   # Label column

# Encode labels: A=0, B=1, C=2, ..., Z=25
encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y)
y_categorical = keras.utils.to_categorical(y_encoded)

# Save label mapping
label_mapping = {i: label for i, label in enumerate(encoder.classes_)}
print(f"Label mapping: {label_mapping}")

# Train/test split (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y_categorical,
    test_size=0.2,
    random_state=42,
    stratify=y_categorical  # Maintain class distribution
)

print(f"\nTraining samples: {len(X_train)}")
print(f"Testing samples: {len(X_test)}")

# Build the DNN model
model = keras.Sequential([
    # Input layer
    keras.layers.Input(shape=(63,), name='landmark_input'),
    
    # Hidden layer 1
    keras.layers.Dense(128, activation='relu', name='dense_1'),
    keras.layers.BatchNormalization(),
    keras.layers.Dropout(0.3),
    
    # Hidden layer 2
    keras.layers.Dense(64, activation='relu', name='dense_2'),
    keras.layers.BatchNormalization(),
    keras.layers.Dropout(0.3),
    
    # Hidden layer 3
    keras.layers.Dense(32, activation='relu', name='dense_3'),
    keras.layers.Dropout(0.2),
    
    # Output layer
    keras.layers.Dense(len(encoder.classes_), activation='softmax', name='output')
])

# Compile model
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    loss='categorical_crossentropy',
    metrics=['accuracy']
)

# Display model summary
model.summary()

# Training callbacks
callbacks = [
    # Stop training if validation loss doesn't improve for 10 epochs
    keras.callbacks.EarlyStopping(
        monitor='val_loss',
        patience=10,
        restore_best_weights=True
    ),
    
    # Reduce learning rate if validation loss plateaus
    keras.callbacks.ReduceLROnPlateau(
        monitor='val_loss',
        factor=0.5,
        patience=5,
        min_lr=0.00001
    ),
    
    # Save best model during training
    keras.callbacks.ModelCheckpoint(
        'best_model.h5',
        monitor='val_accuracy',
        save_best_only=True
    )
]

# Train the model
print("\nTraining model...")
history = model.fit(
    X_train, y_train,
    epochs=100,
    batch_size=32,
    validation_split=0.2,  # Use 20% of training data for validation
    callbacks=callbacks,
    verbose=1
)

# Evaluate on test set
print("\nEvaluating on test set...")
test_loss, test_accuracy = model.evaluate(X_test, y_test, verbose=0)
print(f"Test Loss: {test_loss:.4f}")
print(f"Test Accuracy: {test_accuracy*100:.2f}%")

# Save final model
model.save('isl_model.h5')
print("\nModel saved as 'isl_model.h5'")

# Convert to TFLite
print("\nConverting to TFLite...")
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open('isl_model.tflite', 'wb') as f:
    f.write(tflite_model)

print(f"TFLite model saved!")
print(f"Model size: {len(tflite_model) / 1024:.2f} KB")

Model Summary Output

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 landmark_input (InputLayer) [(None, 63)]              0         
                                                                 
 dense_1 (Dense)             (None, 128)               8,192     
                                                                 
 batch_normalization         (None, 128)               512       
                                                                 
 dropout (Dropout)           (None, 128)               0         
                                                                 
 dense_2 (Dense)             (None, 64)                8,256     
                                                                 
 batch_normalization_1       (None, 64)                256       
                                                                 
 dropout_1 (Dropout)         (None, 64)                0         
                                                                 
 dense_3 (Dense)             (None, 32)                2,080     
                                                                 
 dropout_2 (Dropout)         (None, 32)                0         
                                                                 
 output (Dense)              (None, 26)                858       
                                                                 
=================================================================
Total params: 20,154 (78.73 KB)
Trainable params: 19,770 (77.23 KB)
Non-trainable params: 384 (1.50 KB)
_________________________________________________________________

What Happens During Training

Epoch 1/100
────────────────────────────────────────
Batch 1/250:  Forward pass β†’ Calculate loss β†’ Backward pass β†’ Update weights
Batch 2/250:  Forward pass β†’ Calculate loss β†’ Backward pass β†’ Update weights
...
Batch 250/250: Forward pass β†’ Calculate loss β†’ Backward pass β†’ Update weights

Validation:
- Test on validation set (20% of training data)
- Calculate validation accuracy
- If improved, save as best model

Epoch 1: loss=0.5423, accuracy=0.8234, val_loss=0.4321, val_accuracy=0.8567
Epoch 2: loss=0.3215, accuracy=0.8876, val_loss=0.3102, val_accuracy=0.8923
Epoch 3: loss=0.2543, accuracy=0.9123, val_loss=0.2876, val_accuracy=0.9034
...
Epoch 45: loss=0.0234, accuracy=0.9892, val_loss=0.0456, val_accuracy=0.9845
Epoch 46: loss=0.0231, accuracy=0.9894, val_loss=0.0458, val_accuracy=0.9844
(No improvement for 10 epochs β†’ Early stopping triggered)

Best model: Epoch 45 with val_accuracy=0.9845

9. Platform Channels Explained

What Are Platform Channels?

Platform channels are Flutter's official mechanism for communication between Dart code and native platform code (Kotlin/Swift).

The Problem They Solve

Flutter (Dart) runs in its own runtime
    ↓
Cannot directly access:
- Native camera APIs
- MediaPipe library (Android/iOS)
- Hardware sensors
- Native ML frameworks
- Bluetooth, NFC, etc.

Solution: Platform Channels = Bridge between worlds

Types of Platform Channels

1. MethodChannel (Request-Response)

Use Case: One-time requests with responses

Flutter asks β†’ Kotlin does work β†’ Kotlin responds β†’ Flutter receives

Example: Start/stop camera, take photo, get device info

// Flutter side
final result = await methodChannel.invokeMethod('getCameraStatus');
print(result); // "active" or "inactive"
// Kotlin side
methodChannel.setMethodCallHandler { call, result ->
    when (call.method) {
        "getCameraStatus" -> {
            val status = if (cameraActive) "active" else "inactive"
            result.success(status)
        }
    }
}

2. EventChannel (Continuous Stream)

Use Case: Continuous data stream from native to Flutter

Kotlin continuously sends β†’ Flutter receives stream β†’ UI updates in real-time

Example: Hand detection results, sensor data, location updates

// Flutter side
eventChannel.receiveBroadcastStream().listen((data) {
    print(data); // Continuous updates
});
// Kotlin side
eventChannel.setStreamHandler(object : EventChannel.StreamHandler {
    override fun onListen(arguments: Any?, events: EventChannel.EventSink?) {
        // Send data continuously
        events?.success(detectionResult)
    }
})

3. BasicMessageChannel (Bi-directional)

Use Case: Custom protocols, binary data

Not commonly used for this project

Comparison Table

Feature MethodChannel EventChannel
Direction Bi-directional Kotlin β†’ Flutter (one way)
Pattern Request-Response Stream
Use Case Commands Continuous data
Example "Start camera" Detection results every 50ms
Frequency On-demand Continuous

Complete Implementation for KairoAI

Flutter Side (Dart)

// File: lib/services/sign_detection_service.dart

import 'package:flutter/services.dart';
import 'dart:async';

class SignDetectionService {
  // Method channel for commands
  static const MethodChannel _methodChannel =
      MethodChannel('com.kairo.ai/detection');
  
  // Event channel for continuous detection stream
  static const EventChannel _eventChannel =
      EventChannel('com.kairo.ai/detection_stream');
  
  Stream<DetectionResult>? _detectionStream;
  
  /// Start hand sign detection
  Future<void> startDetection() async {
    try {
      await _methodChannel.invokeMethod('startDetection');
      print('βœ… Detection started');
    } on PlatformException catch (e) {
      print('❌ Error starting detection: ${e.message}');
      rethrow;
    }
  }
  
  /// Stop hand sign detection
  Future<void> stopDetection() async {
    try {
      await _methodChannel.invokeMethod('stopDetection');
      print('βœ… Detection stopped');
    } on PlatformException catch (e) {
      print('❌ Error stopping detection: ${e.message}');
      rethrow;
    }
  }
  
  /// Get continuous stream of detection results
  Stream<DetectionResult> get detectionStream {
    _detectionStream ??= _eventChannel
        .receiveBroadcastStream()
        .map((event) {
          final data = Map<String, dynamic>.from(event);
          return DetectionResult.fromMap(data);
        });
    
    return _detectionStream!;
  }
  
  /// Check camera permission status
  Future<bool> checkCameraPermission() async {
    try {
      final bool hasPermission =
          await _methodChannel.invokeMethod('checkCameraPermission');
      return hasPermission;
    } on PlatformException catch (e) {
      print('❌ Error checking permission: ${e.message}');
      return false;
    }
  }
  
  /// Request camera permission
  Future<bool> requestCameraPermission() async {
    try {
      final bool granted =
          await _methodChannel.invokeMethod('requestCameraPermission');
      return granted;
    } on PlatformException catch (e) {
      print('❌ Error requesting permission: ${e.message}');
      return false;
    }
  }
}

/// Data class for detection results
class DetectionResult {
  final String letter;
  final double confidence;
  final bool handDetected;
  final int timestamp;
  
  DetectionResult({
    required this.letter,
    required this.confidence,
    required this.handDetected,
    required this.timestamp,
  });
  
  factory DetectionResult.fromMap(Map<String, dynamic> map) {
    return DetectionResult(
      letter: map['letter'] as String,
      confidence: (map['confidence'] as num).toDouble(),
      handDetected: map['handDetected'] as bool,
      timestamp: map['timestamp'] as int,
    );
  }
  
  @override
  String toString() {
    return 'DetectionResult(letter: $letter, confidence: ${confidence.toStringAsFixed(2)}, handDetected: $handDetected)';
  }
}

Kotlin Side (Android)

// File: android/app/src/main/kotlin/com/kairo/ai/MainActivity.kt

package com.kairo.ai

import android.Manifest
import android.content.pm.PackageManager
import androidx.core.app.ActivityCompat
import androidx.core.content.ContextCompat
import io.flutter.embedding.android.FlutterActivity
import io.flutter.embedding.engine.FlutterEngine
import io.flutter.plugin.common.MethodChannel
import io.flutter.plugin.common.EventChannel
import android.graphics.Bitmap

class MainActivity : FlutterActivity() {
    
    // Channel names (must match Flutter side)
    private val METHOD_CHANNEL = "com.kairo.ai/detection"
    private val EVENT_CHANNEL = "com.kairo.ai/detection_stream"
    
    // Camera permission request code
    private val CAMERA_PERMISSION_CODE = 100
    
    // Our custom classes (to be implemented)
    private lateinit var cameraManager: CameraManager
    private lateinit var handDetector: HandLandmarkDetector
    private lateinit var signClassifier: SignClassifier
    
    // Event sink for streaming data to Flutter
    private var eventSink: EventChannel.EventSink? = null
    
    override fun configureFlutterEngine(flutterEngine: FlutterEngine) {
        super.configureFlutterEngine(flutterEngine)
        
        // Initialize our components
        cameraManager = CameraManager(this)
        handDetector = HandLandmarkDetector(this)
        signClassifier = SignClassifier(this)
        
        // Setup MethodChannel for commands
        MethodChannel(
            flutterEngine.dartExecutor.binaryMessenger,
            METHOD_CHANNEL
        ).setMethodCallHandler { call, result ->
            when (call.method) {
                "startDetection" -> {
                    startDetection()
                    result.success(null)
                }
                
                "stopDetection" -> {
                    stopDetection()
                    result.success(null)
                }
                
                "checkCameraPermission" -> {
                    val hasPermission = checkCameraPermission()
                    result.success(hasPermission)
                }
                
                "requestCameraPermission" -> {
                    requestCameraPermission()
                    result.success(null)  // Actual result comes via callback
                }
                
                else -> {
                    result.notImplemented()
                }
            }
        }
        
        // Setup EventChannel for streaming data
        EventChannel(
            flutterEngine.dartExecutor.binaryMessenger,
            EVENT_CHANNEL
        ).setStreamHandler(object : EventChannel.StreamHandler {
            
            override fun onListen(arguments: Any?, events: EventChannel.EventSink?) {
                eventSink = events
                println("βœ… Flutter is now listening to detection stream")
            }
            
            override fun onCancel(arguments: Any?) {
                eventSink = null
                stopDetection()
                println("❌ Flutter stopped listening to detection stream")
            }
        })
    }
    
    /**
     * Start camera and begin hand detection
     */
    private fun startDetection() {
        println("πŸŽ₯ Starting camera and detection...")
        
        cameraManager.startCamera { bitmap ->
            processFrame(bitmap)
        }
    }
    
    /**
     * Stop camera and detection
     */
    private fun stopDetection() {
        println("πŸ›‘ Stopping camera and detection...")
        cameraManager.stopCamera()
    }
    
    /**
     * Process each camera frame
     */
    private fun processFrame(bitmap: Bitmap) {
        // Step 1: Detect hand landmarks
        val landmarks = handDetector.detectLandmarks(bitmap)
        
        if (landmarks != null) {
            // Step 2: Classify the sign
            val result = signClassifier.classify(landmarks)
            
            // Step 3: Send to Flutter
            val data = mapOf(
                "letter" to result.letter,
                "confidence" to result.confidence,
                "handDetected" to true,
                "timestamp" to System.currentTimeMillis()
            )
            
            // Must run on UI thread
            runOnUiThread {
                eventSink?.success(data)
            }
            
        } else {
            // No hand detected
            val data = mapOf(
                "letter" to "",
                "confidence" to 0.0,
                "handDetected" to false,
                "timestamp" to System.currentTimeMillis()
            )
            
            runOnUiThread {
                eventSink?.success(data)
            }
        }
    }
    
    /**
     * Check if camera permission is granted
     */
    private fun checkCameraPermission(): Boolean {
        return ContextCompat.checkSelfPermission(
            this,
            Manifest.permission.CAMERA
        ) == PackageManager.PERMISSION_GRANTED
    }
    
    /**
     * Request camera permission from user
     */
    private fun requestCameraPermission() {
        ActivityCompat.requestPermissions(
            this,
            arrayOf(Manifest.permission.CAMERA),
            CAMERA_PERMISSION_CODE
        )
    }
    
    /**
     * Handle permission request result
     */
    override fun onRequestPermissionsResult(
        requestCode: Int,
        permissions: Array<out String>,
        grantResults: IntArray
    ) {
        super.onRequestPermissionsResult(requestCode, permissions, grantResults)
        
        when (requestCode) {
            CAMERA_PERMISSION_CODE -> {
                val granted = grantResults.isNotEmpty() &&
                        grantResults[0] == PackageManager.PERMISSION_GRANTED
                
                if (granted) {
                    println("βœ… Camera permission granted")
                } else {
                    println("❌ Camera permission denied")
                }
            }
        }
    }
}

Communication Flow Diagram

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                         FLUTTER (Dart)                           β”‚
β”‚                                                                  β”‚
β”‚  User taps "Start Lesson"                                       β”‚
β”‚          ↓                                                       β”‚
β”‚  LessonScreen calls:                                             β”‚
β”‚  signDetectionService.startDetection()                          β”‚
β”‚          ↓                                                       β”‚
β”‚  MethodChannel sends: "startDetection"                          β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                         β”‚
                         β–Ό Platform Channel Bridge
                         β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                      KOTLIN (Android)                            β”‚
β”‚                                                                  β”‚
β”‚  MethodChannel receives: "startDetection"                       β”‚
β”‚          ↓                                                       β”‚
β”‚  MainActivity.startDetection()                                  β”‚
β”‚          ↓                                                       β”‚
β”‚  CameraManager.startCamera()                                    β”‚
β”‚          ↓                                                       β”‚
β”‚  Camera frames arrive (30 FPS)                                  β”‚
β”‚          ↓                                                       β”‚
β”‚  processFrame(bitmap) for each frame:                           β”‚
β”‚    1. HandDetector.detectLandmarks(bitmap) β†’ 63 floats          β”‚
β”‚    2. SignClassifier.classify(landmarks) β†’ Letter + confidence  β”‚
β”‚    3. Package into Map                                          β”‚
β”‚    4. EventChannel sends to Flutter                             β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                         β”‚
                         β–Ό Event Channel Stream
                         β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                         FLUTTER (Dart)                           β”‚
β”‚                                                                  β”‚
β”‚  EventChannel.receiveBroadcastStream()                          β”‚
β”‚          ↓                                                       β”‚
β”‚  .listen((data) {                                               β”‚
β”‚    setState(() {                                                β”‚
β”‚      currentLetter = data['letter'];                            β”‚
β”‚      confidence = data['confidence'];                           β”‚
β”‚    });                                                          β”‚
β”‚  })                                                             β”‚
β”‚          ↓                                                       β”‚
β”‚  UI rebuilds with new data                                      β”‚
β”‚  Shows: "Detected: A (95%)"                                     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Data Format Examples

Method Channel (Command)

Flutter β†’ Kotlin:
{
  "method": "startDetection",
  "arguments": null
}

Kotlin β†’ Flutter:
{
  "success": true
}

Event Channel (Stream)

Kotlin β†’ Flutter (continuous stream):

Frame 1:
{
  "letter": "A",
  "confidence": 0.87,
  "handDetected": true,
  "timestamp": 1702907345000
}

Frame 2 (50ms later):
{
  "letter": "A",
  "confidence": 0.91,
  "handDetected": true,
  "timestamp": 1702907345050
}

Frame 3 (50ms later):
{
  "letter": "A",
  "confidence": 0.95,
  "handDetected": true,
  "timestamp": 1702907345100
}

Frame 4 (user changes hand):
{
  "letter": "B",
  "confidence": 0.23,
  "handDetected": true,
  "timestamp": 1702907345150
}

Frame 5:
{
  "letter": "B",
  "confidence": 0.78,
  "handDetected": true,
  "timestamp": 1702907345200
}

Error Handling

// Flutter side - Handling errors

try {
  await signDetectionService.startDetection();
} on PlatformException catch (e) {
  switch (e.code) {
    case 'CAMERA_ERROR':
      showSnackBar('Camera failed to start');
      break;
    case 'PERMISSION_DENIED':
      showSnackBar('Camera permission required');
      break;
    case 'MEDIAPIPE_ERROR':
      showSnackBar('Hand detection failed');
      break;
    default:
      showSnackBar('Unknown error: ${e.message}');
  }
}
// Kotlin side - Sending errors

try {
    startCamera()
    result.success(null)
} catch (e: SecurityException) {
    result.error("PERMISSION_DENIED", "Camera permission not granted", null)
} catch (e: CameraAccessException) {
    result.error("CAMERA_ERROR", "Failed to access camera: ${e.message}", null)
} catch (e: Exception) {
    result.error("UNKNOWN_ERROR", e.message, null)
}

10. Dataset Creation Guide

Understanding the Dataset

What You Need

For a 26-letter ISL alphabet app, you need:

Dataset Size Calculation:
- 26 letters (A-Z)
- 500-1000 images per letter (recommended)
- Total: 13,000 - 26,000 images

Actual Data After Extraction:
- Each image β†’ 1 row in CSV
- Each row = 63 landmark values + 1 label
- Final CSV: 13,000-26,000 rows Γ— 64 columns

Dataset Quality Factors

Factor Impact on Accuracy Recommendation
Number of samples High 500+ per letter
Variety of people High 5-10 different people
Hand orientations Medium Multiple angles
Lighting conditions Low (landmarks robust) Normal indoor lighting OK
Background None (landmarks only) Any background works
Camera distance Medium Keep consistent (arm's length)

Option 1: Use Existing Dataset (Fastest)

Step 1: Find ISL Dataset

# Search on Kaggle
https://www.kaggle.com/search?q=indian+sign+language

# Popular datasets:
# 1. "ISL Dataset" by various authors
# 2. "Indian Sign Language Recognition Dataset"
# 3. "ISL Alphabet Dataset"

Step 2: Download and Extract

# Download from Kaggle (requires Kaggle account)
kaggle datasets download -d <dataset-name>

# Extract
unzip dataset.zip

# Expected structure:
ISL_Dataset/
β”œβ”€β”€ A/
β”‚   β”œβ”€β”€ img_001.jpg
β”‚   β”œβ”€β”€ img_002.jpg
β”‚   └── ...
β”œβ”€β”€ B/
β”‚   β”œβ”€β”€ img_001.jpg
β”‚   └── ...
└── Z/
    └── ...

Option 2: Collect Your Own (Better Accuracy)

Method 1: Video Recording (Recommended)

Equipment Needed:
- Smartphone camera
- Good lighting (natural or indoor)
- Plain background (optional but helpful)

Process:
1. Record 30-second video per letter
2. Person makes the sign continuously
3. Vary hand position slightly
4. Extract frames β†’ 200-300 images per video

Advantages:
- Quick data collection (30 min for all 26 letters)
- Natural hand movements
- Variety in positioning

Step-by-Step Video Collection

# File: data_collection/extract_frames_from_video.py

import cv2
import os

def extract_frames_from_video(video_path, output_folder, letter, frame_interval=3):
    """
    Extract frames from video at specified interval
    
    Args:
        video_path: Path to video file
        output_folder: Where to save frames
        letter: ISL letter (A-Z)
        frame_interval: Extract every Nth frame (3 = every 3rd frame)
    """
    # Create output directory
    letter_folder = os.path.join(output_folder, letter)
    os.makedirs(letter_folder, exist_ok=True)
    
    # Open video
    cap = cv2.VideoCapture(video_path)
    
    frame_count = 0
    saved_count = 0
    
    print(f"Processing video: {video_path}")
    
    while True:
        ret, frame = cap.read()
        
        if not ret:
            break
        
        # Extract every Nth frame
        if frame_count % frame_interval == 0:
            output_path = os.path.join(
                letter_folder,
                f"{letter}_{saved_count:04d}.jpg"
            )
            cv2.imwrite(output_path, frame)
            saved_count += 1
        
        frame_count += 1
    
    cap.release()
    
    print(f"βœ… Extracted {saved_count} frames for letter '{letter}'")
    print(f"   Saved to: {letter_folder}")

# Usage
if __name__ == "__main__":
    # Extract frames from all videos
    videos = [
        ("videos/letter_A.mp4", "A"),
        ("videos/letter_B.mp4", "B"),
        # ... add all 26 letters
    ]
    
    output_folder = "extracted_frames"
    
    for video_path, letter in videos:
        extract_frames_from_video(video_path, output_folder, letter, frame_interval=3)
    
    print("\nβœ… All frames extracted!")

Method 2: Photo Collection App

# File: data_collection/photo_collector.py

import cv2
import os
import time

def collect_photos_for_letter(letter, num_photos=500):
    """
    Interactive photo collection using webcam
    
    Args:
        letter: ISL letter to collect (A-Z)
        num_photos: Number of photos to capture
    """
    # Create output directory
    output_folder = f"collected_data/{letter}"
    os.makedirs(output_folder, exist_ok=True)
    
    # Open webcam
    cap = cv2.VideoCapture(0)
    
    print<!-- filepath: d:\study files\FlutterProjects\KairoAI\DOCUMENTATION.md -->
# KairoAI - Complete Project Documentation
## Indian Sign Language Learning App with AI-Powered Hand Detection

**Author:** Megh Modi  
**Created:** December 18, 2025  
**Version:** 1.0.0  
**Status:** Planning & Architecture Phase

---

# Table of Contents

1. [Executive Summary](#executive-summary)
2. [Project Vision & Goals](#project-vision--goals)
3. [Technical Architecture](#technical-architecture)
4. [Technology Stack](#technology-stack)
5. [Understanding the AI Pipeline](#understanding-the-ai-pipeline)
6. [Data Flow & Pipeline](#data-flow--pipeline)
7. [MediaPipe Explained](#mediapipe-explained)
8. [DNN Model Explained](#dnn-model-explained)
9. [Platform Channels Explained](#platform-channels-explained)
10. [Dataset Creation Guide](#dataset-creation-guide)
11. [Model Training Guide](#model-training-guide)
12. [Implementation Roadmap](#implementation-roadmap)
13. [Code Structure](#code-structure)
14. [Challenges & Solutions](#challenges--solutions)
15. [Feasibility Assessment](#feasibility-assessment)
16. [Resources & Learning Path](#resources--learning-path)

---

# 1. Executive Summary

## What is KairoAI?

KairoAI is an Indian Sign Language (ISL) learning application designed specifically for children. The app uses real-time hand gesture detection via the device camera to teach ISL alphabets and words, providing instant feedback to students.

## Core Innovation

The app combines three powerful technologies:
- **Flutter** for cross-platform UI
- **MediaPipe** for hand detection (running natively on Android)
- **TensorFlow Lite** for sign language classification

## Key Differentiator

Unlike traditional learning apps, KairoAI provides **real-time visual feedback** by:
1. Showing the user what sign to make
2. Detecting their hand position using the camera
3. Validating if they're making the correct sign
4. Providing instant feedback (success/try again)

---

# 2. Project Vision & Goals

## Primary Goal

Create an accessible, engaging platform for children to learn Indian Sign Language through interactive, AI-powered lessons.

## Target Users

- **Primary:** Children aged 6-14 learning ISL
- **Secondary:** Parents and educators teaching ISL
- **Tertiary:** Anyone interested in learning ISL

## Core Features

### 1. Lesson Mode
- Display a target alphabet (e.g., "A") or word (e.g., "MEGH")
- Open device camera
- Detect student's hand sign in real-time
- Validate against expected sign
- Show success animation/sound on correct detection
- Provide guidance hints on incorrect attempts

### 2. Quiz Mode
- Present random alphabets or words
- Student performs signs sequentially
- Each detected letter is validated in order
- Progress only on correct detection
- Track accuracy and completion time

### 3. Progress Tracking
- Store lesson completion in Firebase Firestore
- Track quiz scores and accuracy
- Visualize learning progress over time
- Gamification elements (badges, streaks)

---

# 3. Technical Architecture

## High-Level Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ FLUTTER LAYER (UI) β”‚ β”‚ Written in Dart β”‚ β”‚ β”‚ β”‚ β€’ Lessons UI β€’ Quiz UI β€’ Progress Dashboard β”‚ β”‚ β€’ Camera Preview β€’ Feedback Animations β”‚ β”‚ β€’ Firebase Integration (Auth, Firestore) β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”‚ Platform Channels (Bridge) β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ KOTLIN LAYER (Android Native) β”‚ β”‚ β”‚ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ β”‚ β”‚ CameraX β”‚β†’ β”‚ MediaPipe β”‚β†’ β”‚ TensorFlow β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ Hands β”‚ β”‚ Lite β”‚ β”‚ β”‚ β”‚ Capture β”‚ β”‚ Detect hand β”‚ β”‚ Classify β”‚ β”‚ β”‚ β”‚ frames β”‚ β”‚ Extract 21 β”‚ β”‚ sign β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ landmarks β”‚ β”‚ β”‚ β”‚ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”‚ β”‚ β”‚ Returns: { letter: "A", confidence: 0.95, handDetected: true } β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜


## Architecture Principles

### Why Hybrid Architecture?

| Component | Layer | Reason |
|-----------|-------|--------|
| **UI & Navigation** | Flutter (Dart) | Cross-platform, fast development, beautiful UI |
| **Camera & ML** | Kotlin (Native) | Direct hardware access, optimized performance |
| **Firebase** | Flutter (Dart) | Easy integration, real-time sync |

### Key Design Decision

**DO NOT attempt to run MediaPipe or camera processing in Dart.**

Why?
- Flutter cannot directly access native camera APIs efficiently
- MediaPipe requires native Android/iOS libraries
- ML inference is faster in native code
- Better battery performance with native implementation

---

# 4. Technology Stack

## Flutter Side (Dart)

### Core Dependencies

```yaml
dependencies:
  flutter:
    sdk: flutter
  
  # Firebase
  firebase_core: ^2.24.2        # Firebase initialization
  firebase_auth: ^4.16.0        # User authentication
  cloud_firestore: ^4.14.0      # Database for progress tracking
  
  # State Management
  provider: ^6.1.1              # For managing app state
  
  # Navigation
  go_router: ^13.0.0            # Declarative routing
  
  # UI/UX Enhancements
  lottie: ^3.0.0                # Success animations
  audioplayers: ^5.2.1          # Sound effects
  
  # Utilities
  equatable: ^2.0.5             # Value comparison

Why These Libraries?

Library Purpose Alternative
provider Simple state management riverpod, bloc
go_router Type-safe routing auto_route, manual routing
lottie Beautiful animations flare, custom animations

Android/Kotlin Side

Build Configuration

// android/app/build.gradle.kts

plugins {
    id("com.android.application")
    id("kotlin-android")
    id("dev.flutter.flutter-gradle-plugin")
}

android {
    namespace = "com.kairo.ai"
    compileSdk = 34
    
    defaultConfig {
        applicationId = "com.kairo.ai"
        minSdk = 26        // Required for CameraX
        targetSdk = 34
        versionCode = 1
        versionName = "1.0"
    }

    compileOptions {
        sourceCompatibility = JavaVersion.VERSION_17
        targetCompatibility = JavaVersion.VERSION_17
    }

    kotlinOptions {
        jvmTarget = "17"
    }

    // Required for TFLite model files
    aaptOptions {
        noCompress("tflite")
    }
}

Dependencies

dependencies {
    // MediaPipe Tasks Vision (Hand Landmark Detection)
    implementation("com.google.mediapipe:tasks-vision:0.10.14")
    
    // TensorFlow Lite
    implementation("org.tensorflow:tensorflow-lite:2.14.0")
    implementation("org.tensorflow:tensorflow-lite-support:0.4.4")
    
    // CameraX (Camera API)
    implementation("androidx.camera:camera-core:1.3.1")
    implementation("androidx.camera:camera-camera2:1.3.1")
    implementation("androidx.camera:camera-lifecycle:1.3.1")
    implementation("androidx.camera:camera-view:1.3.1")
    
    // Coroutines for async operations
    implementation("org.jetbrains.kotlinx:kotlinx-coroutines-android:1.7.3")
}

Python Side (Model Training)

Required Libraries

tensorflow==2.15.0           # Deep learning framework
mediapipe==0.10.9            # Hand landmark extraction
opencv-python==4.8.1.78      # Image processing
numpy==1.26.2                # Numerical operations
pandas==2.1.3                # Data manipulation
scikit-learn==1.3.2          # ML utilities
matplotlib==3.8.2            # Visualization

Installation

pip install tensorflow mediapipe opencv-python numpy pandas scikit-learn matplotlib

5. Understanding the AI Pipeline

What is AI Doing in This App?

The AI has one primary job: "Look at the camera and tell me which ISL letter the user is showing"

The Problem Breakdown

Traditional Approach (Pure Image Classification)

Camera Image β†’ CNN Model β†’ Letter
Problem: Slow, requires huge dataset, background sensitive

Our Smart Approach (Landmark-based Classification)

Camera Image β†’ MediaPipe (find hand) β†’ Extract landmarks β†’ DNN Model β†’ Letter
Benefit: Fast, small dataset, background-independent

Why Two AI Models?

Model 1: MediaPipe Hands (Google's Pre-trained Model)

Job: Find the hand and identify 21 key points

Input: Camera frame (640Γ—480 pixels)
Output: 21 landmark points (x, y, z coordinates)

Example landmarks:
Point 0: Wrist
Point 1-4: Thumb (base to tip)
Point 5-8: Index finger
Point 9-12: Middle finger
Point 13-16: Ring finger
Point 17-20: Pinky finger

Why use it?

  • Already trained by Google on millions of images
  • Works in real-time on mobile devices
  • Handles different hand sizes, skin tones, lighting
  • Free to use

Model 2: Your Custom TFLite Model (Train Yourself)

Job: Classify the 21 landmark points into ISL letters

Input: 63 numbers (21 points Γ— 3 coordinates)
Output: Letter (A-Z) + confidence score

Example:
Input: [0.45, 0.82, 0.01, 0.52, 0.75, ...]
Output: { letter: "A", confidence: 0.95 }

Why train your own?

  • ISL signs are unique (different from ASL)
  • You control accuracy by adding more training data
  • Model is tiny (~50-100 KB)
  • Fast inference (~1-5ms)

6. Data Flow & Pipeline

Complete Pipeline: Camera β†’ Detection β†’ Flutter UI

Step-by-Step Data Transformation

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ STEP 1: CAMERA CAPTURE (CameraX)                                 β”‚
β”‚                                                                  β”‚
β”‚ Input:  Nothing (hardware)                                       β”‚
β”‚ Process: Open camera, capture frames at 30 FPS                  β”‚
β”‚ Output: Bitmap (640Γ—480 RGB image)                              β”‚
β”‚ Data Size: 921,600 values (640 Γ— 480 Γ— 3)                       β”‚
β”‚                                                                  β”‚
β”‚ Visual:                                                          β”‚
β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                                             β”‚
β”‚ β”‚ β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘ β”‚                                             β”‚
β”‚ β”‚ β–‘β–‘β–‘β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‘β–‘β–‘β–‘β–‘ β”‚  ← Raw camera frame                        β”‚
β”‚ β”‚ β–‘β–‘β–ˆβ–ˆβ–ˆβ–‘β–‘β–‘β–ˆβ–ˆβ–ˆβ–‘β–‘β–‘β–‘ β”‚     (user's hand visible)                  β”‚
β”‚ β”‚ β–‘β–‘β–ˆβ–ˆβ–ˆβ–‘β–‘β–‘β–ˆβ–ˆβ–ˆβ–‘β–‘β–‘β–‘ β”‚                                             β”‚
β”‚ β”‚ β–‘β–‘β–‘β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‘β–‘β–‘β–‘β–‘ β”‚                                             β”‚
β”‚ β”‚ β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘ β”‚                                             β”‚
β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                                             β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                         β”‚
                         β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ STEP 2: HAND DETECTION (MediaPipe)                               β”‚
β”‚                                                                  β”‚
β”‚ Input:  Bitmap (921,600 values)                                  β”‚
β”‚ Process:                                                         β”‚
β”‚   1. Detect if hand exists in frame                             β”‚
β”‚   2. Locate 21 anatomical landmarks                             β”‚
β”‚   3. Extract (x, y, z) for each point                           β”‚
β”‚ Output: FloatArray[63] = [x0,y0,z0, x1,y1,z1, ..., x20,y20,z20]β”‚
β”‚ Data Size: 63 float values                                      β”‚
β”‚ Reduction: 921,600 β†’ 63 (99.99% reduction!)                     β”‚
β”‚                                                                  β”‚
β”‚ Visual - 21 Landmark Points:                                     β”‚
β”‚         8   12  16  20  (fingertips)                            β”‚
β”‚         |   |   |   |                                           β”‚
β”‚     7   11  15  19  |                                           β”‚
β”‚     |   |   |   |   |                                           β”‚
β”‚     6   10  14  18  |                                           β”‚
β”‚     |   |   |   |   |                                           β”‚
β”‚     5───9───13──17β”€β”€β”˜                                           β”‚
β”‚      \                                                           β”‚
β”‚       4───3───2───1 (thumb)                                     β”‚
β”‚            \                                                     β”‚
β”‚             0 (wrist)                                            β”‚
β”‚                                                                  β”‚
β”‚ Example Output for Letter "A":                                  β”‚
β”‚ [0.45, 0.82, 0.01,  ← Point 0 (wrist)                          β”‚
β”‚  0.52, 0.75, 0.02,  ← Point 1 (thumb base)                     β”‚
β”‚  0.58, 0.65, 0.03,  ← Point 2 (thumb middle)                   β”‚
β”‚  ...                                                             β”‚
β”‚  0.63, 0.42, 0.01]  ← Point 20 (pinky tip)                     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                         β”‚
                         β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ STEP 3: SIGN CLASSIFICATION (TensorFlow Lite)                    β”‚
β”‚                                                                  β”‚
β”‚ Input:  FloatArray[63]                                           β”‚
β”‚ Process:                                                         β”‚
β”‚   Neural Network Layers:                                         β”‚
β”‚   [63] β†’ [128 neurons] β†’ [64 neurons] β†’ [32 neurons] β†’ [26]    β”‚
β”‚         Dense Layer    Dense Layer    Dense Layer    Softmax    β”‚
β”‚                                                                  β”‚
β”‚ Output: Probabilities for each letter                           β”‚
β”‚   [p_A, p_B, p_C, ..., p_Z]                                     β”‚
β”‚                                                                  β”‚
β”‚ Example:                                                         β”‚
β”‚ Input:  [0.45, 0.82, 0.01, ...]                                 β”‚
β”‚ Output: [0.01, 0.03, 0.95, 0.00, ...]                          β”‚
β”‚          (1%   3%    95%   0%   ...)                            β”‚
β”‚                                                                  β”‚
β”‚ Best Prediction: Letter "C" with 95% confidence                 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                         β”‚
                         β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ STEP 4: SEND TO FLUTTER (Platform Channel - EventChannel)        β”‚
β”‚                                                                  β”‚
β”‚ Kotlin prepares data as Map:                                     β”‚
β”‚ {                                                                β”‚
β”‚   "letter": "C",                                                 β”‚
β”‚   "confidence": 0.95,                                            β”‚
β”‚   "handDetected": true,                                          β”‚
β”‚   "timestamp": 1702907345000                                     β”‚
β”‚ }                                                                β”‚
β”‚                                                                  β”‚
β”‚ Sends through EventChannel (continuous stream)                  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                         β”‚
                         β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ STEP 5: FLUTTER RECEIVES & DISPLAYS (Dart)                       β”‚
β”‚                                                                  β”‚
β”‚ EventChannel stream listener receives data                      β”‚
β”‚                                                                  β”‚
β”‚ setState(() {                                                    β”‚
β”‚   currentLetter = "C";                                           β”‚
β”‚   confidence = 0.95;                                             β”‚
β”‚   handDetected = true;                                           β”‚
β”‚ });                                                              β”‚
β”‚                                                                  β”‚
β”‚ UI Updates:                                                      β”‚
β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                                  β”‚
β”‚ β”‚    🎯 Target: Letter A     β”‚                                  β”‚
β”‚ β”‚                            β”‚                                  β”‚
β”‚ β”‚    πŸ“· [Camera Preview]     β”‚                                  β”‚
β”‚ β”‚                            β”‚                                  β”‚
β”‚ β”‚  ❌ Detected: C (95%)      β”‚ ← Red (incorrect)               β”‚
β”‚ β”‚  πŸ’‘ Hint: Try again!       β”‚                                  β”‚
β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                                  β”‚
β”‚                                                                  β”‚
β”‚ If correct (C == target):                                        β”‚
β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                                  β”‚
β”‚ β”‚  βœ… Correct! (95%)         β”‚ ← Green (success)               β”‚
β”‚ β”‚  πŸŽ‰ [Success Animation]    β”‚                                  β”‚
β”‚ β”‚  πŸ”Š [Success Sound]        β”‚                                  β”‚
β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                                  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Processing Speed

Step Processing Time Frequency
Camera Frame ~33ms 30 FPS
MediaPipe Detection ~10-20ms Per frame
TFLite Classification ~1-5ms Per detection
Platform Channel <1ms Per result
Total Latency ~50ms ~20 detections/sec

This means users get near-instant feedback!


7. MediaPipe Explained

What is MediaPipe?

MediaPipe is Google's open-source framework for building multimodal (video, audio, text) applied machine learning pipelines.

MediaPipe Hands Solution

Specifically designed to detect and track hands in real-time.

Key Capabilities:

  • Detects up to 2 hands simultaneously
  • Works in various lighting conditions
  • Handles different hand sizes and skin tones
  • Runs efficiently on mobile devices
  • Provides 21 3D landmark points per hand

The 21 Hand Landmarks

Landmark Numbering System:

        8   12  16  20
        β”‚   β”‚   β”‚   β”‚
    7───11──15──19──│
    β”‚   β”‚   β”‚   β”‚   β”‚
    6───10──14──18──│
    β”‚   β”‚   β”‚   β”‚   β”‚
    5───9───13──17β”€β”€β”˜
     \
      4───3───2───1
           \
            0

Point 0:  Wrist
Point 1:  Thumb CMC (base)
Point 2:  Thumb MCP
Point 3:  Thumb IP
Point 4:  Thumb tip

Point 5:  Index finger MCP
Point 6:  Index finger PIP
Point 7:  Index finger DIP
Point 8:  Index finger tip

Point 9:  Middle finger MCP
Point 10: Middle finger PIP
Point 11: Middle finger DIP
Point 12: Middle finger tip

Point 13: Ring finger MCP
Point 14: Ring finger PIP
Point 15: Ring finger DIP
Point 16: Ring finger tip

Point 17: Pinky MCP
Point 18: Pinky PIP
Point 19: Pinky DIP
Point 20: Pinky tip

Coordinate System

Each landmark has 3 coordinates:

X Coordinate

  • Range: 0.0 to 1.0
  • 0.0 = left edge of image
  • 1.0 = right edge of image
  • Normalized (independent of image resolution)

Y Coordinate

  • Range: 0.0 to 1.0
  • 0.0 = top edge of image
  • 1.0 = bottom edge of image
  • Normalized (independent of image resolution)

Z Coordinate

  • Approximate depth from wrist
  • Smaller values = closer to camera
  • Relative to wrist (Point 0)
  • Units: roughly in same scale as X

Example Coordinates

Letter "A" (closed fist with thumb up):

Point 0 (Wrist):        x=0.50, y=0.70, z=0.00
Point 1 (Thumb base):   x=0.48, y=0.65, z=0.02
Point 2 (Thumb mid):    x=0.46, y=0.58, z=0.03
Point 3 (Thumb bend):   x=0.44, y=0.52, z=0.04
Point 4 (Thumb tip):    x=0.42, y=0.45, z=0.05
Point 5 (Index base):   x=0.54, y=0.66, z=0.01
Point 6 (Index mid):    x=0.56, y=0.68, z=0.00
Point 7 (Index bend):   x=0.57, y=0.69, z=-0.01
Point 8 (Index tip):    x=0.58, y=0.70, z=-0.02
...

MediaPipe Integration Code

// File: android/app/src/main/kotlin/com/kairo/ai/ml/HandLandmarkDetector.kt

import com.google.mediapipe.tasks.vision.handlandmarker.HandLandmarker
import com.google.mediapipe.tasks.vision.handlandmarker.HandLandmarkerResult
import com.google.mediapipe.framework.image.BitmapImageBuilder
import com.google.mediapipe.framework.image.MPImage
import android.graphics.Bitmap
import android.content.Context

class HandLandmarkDetector(context: Context) {
    
    private val handLandmarker: HandLandmarker
    
    init {
        // Configure MediaPipe Hands
        val options = HandLandmarker.HandLandmarkerOptions.builder()
            .setBaseOptions(
                BaseOptions.builder()
                    .setModelAssetPath("hand_landmarker.task")  // MediaPipe's pre-trained model
                    .build()
            )
            .setNumHands(1)  // Detect only one hand
            .setMinHandDetectionConfidence(0.5f)  // 50% confidence threshold
            .setMinHandPresenceConfidence(0.5f)
            .setMinTrackingConfidence(0.5f)
            .build()
        
        handLandmarker = HandLandmarker.createFromOptions(context, options)
    }
    
    /**
     * Detect hand landmarks from a camera frame
     * 
     * @param bitmap The camera frame
     * @return FloatArray of 63 values [x0,y0,z0, x1,y1,z1, ..., x20,y20,z20]
     *         or null if no hand detected
     */
    fun detectLandmarks(bitmap: Bitmap): FloatArray? {
        // Convert Android Bitmap to MediaPipe Image
        val mpImage: MPImage = BitmapImageBuilder(bitmap).build()
        
        // Run hand detection
        val result: HandLandmarkerResult = handLandmarker.detect(mpImage)
        
        // Check if any hands were detected
        if (result.landmarks().isEmpty()) {
            return null  // No hand found
        }
        
        // Get landmarks from first detected hand
        val handLandmarks = result.landmarks()[0]
        
        // Convert to flat array
        val landmarkArray = FloatArray(63)
        
        for (i in 0 until 21) {
            val landmark = handLandmarks[i]
            landmarkArray[i * 3 + 0] = landmark.x()
            landmarkArray[i * 3 + 1] = landmark.y()
            landmarkArray[i * 3 + 2] = landmark.z()
        }
        
        return landmarkArray
    }
    
    /**
     * Optional: Normalize landmarks relative to wrist
     * This makes the model robust to hand position/size
     */
    fun normalizeLandmarks(landmarks: FloatArray): FloatArray {
        val normalized = FloatArray(63)
        
        // Get wrist coordinates (point 0)
        val wristX = landmarks[0]
        val wristY = landmarks[1]
        val wristZ = landmarks[2]
        
        // Normalize all points relative to wrist
        for (i in 0 until 21) {
            normalized[i * 3 + 0] = landmarks[i * 3 + 0] - wristX
            normalized[i * 3 + 1] = landmarks[i * 3 + 1] - wristY
            normalized[i * 3 + 2] = landmarks[i * 3 + 2] - wristZ
        }
        
        return normalized
    }
}

8. DNN Model Explained

What is a DNN (Dense Neural Network)?

A DNN is a type of artificial neural network where every neuron in one layer is connected to every neuron in the next layer.

Simple Analogy

Think of it as a decision-making chain:

Your Brain Recognizing a Friend:

Eyes see features β†’ Brain processes patterns β†’ Brain decides who it is
(height, hair,     (tall + brown hair      ("It's John!")
 glasses, voice)    + glasses = pattern)

DNN for Sign Language

Input: 63 numbers β†’ Hidden layers process β†’ Output: Letter prediction
(hand landmarks)    (find patterns)          (A-Z with confidence)

DNN vs CNN vs Other Models

Model Type Input Best For When to Use
DNN (Dense) Numbers/Features Structured data, pre-extracted features When you have landmark coordinates
CNN (Convolutional) Images Raw image classification When working with raw pixels
RNN/LSTM Sequences Time series, text When order matters (sentences, videos)
Transformer Sequences Language, large-scale Complex NLP tasks (ChatGPT)

Why DNN for KairoAI?

Comparison: CNN vs DNN Approach

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    CNN APPROACH (Not Used)                       β”‚
β”‚                                                                  β”‚
β”‚  Camera Frame (640Γ—480Γ—3 = 921,600 values)                      β”‚
β”‚       ↓                                                          β”‚
β”‚  Convolutional Layers (find edges, shapes)                      β”‚
β”‚       ↓                                                          β”‚
β”‚  Pooling Layers (reduce size)                                   β”‚
β”‚       ↓                                                          β”‚
β”‚  More Conv Layers (find complex patterns)                       β”‚
β”‚       ↓                                                          β”‚
β”‚  Dense Layers (classify)                                        β”‚
β”‚       ↓                                                          β”‚
β”‚  Output: Letter                                                 β”‚
β”‚                                                                  β”‚
β”‚  Problems:                                                       β”‚
β”‚  ❌ Slow (100-200ms inference)                                  β”‚
β”‚  ❌ Large model (10-50 MB)                                      β”‚
β”‚  ❌ Needs 10,000+ images per class                              β”‚
β”‚  ❌ Background variations affect accuracy                       β”‚
β”‚  ❌ Sensitive to lighting, hand size, position                  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚              DNN APPROACH (Our Choice) βœ…                        β”‚
β”‚                                                                  β”‚
β”‚  MediaPipe Output (63 landmark values)                          β”‚
β”‚       ↓                                                          β”‚
β”‚  Dense Layer 1 (128 neurons, ReLU activation)                   β”‚
β”‚       ↓                                                          β”‚
β”‚  Dense Layer 2 (64 neurons, ReLU activation)                    β”‚
β”‚       ↓                                                          β”‚
β”‚  Dense Layer 3 (32 neurons, ReLU activation)                    β”‚
β”‚       ↓                                                          β”‚
β”‚  Output Layer (26 neurons, Softmax activation)                  β”‚
β”‚       ↓                                                          β”‚
β”‚  Output: Letter + Confidence                                    β”‚
β”‚                                                                  β”‚
β”‚  Benefits:                                                       β”‚
β”‚  βœ… Fast (1-5ms inference)                                      β”‚
β”‚  βœ… Tiny model (~50-100 KB)                                     β”‚
β”‚  βœ… Needs only 500-1000 images per class                        β”‚
β”‚  βœ… Background doesn't matter (landmarks only)                  β”‚
β”‚  βœ… Works in any lighting, any hand size                        β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Model Architecture

Layer-by-Layer Breakdown

# Our DNN Model Architecture

Input Layer:
  Shape: (63,)
  Description: 63 landmark coordinates
  Example: [0.45, 0.82, 0.01, 0.52, 0.75, ...]

    ↓ (fully connected)

Hidden Layer 1:
  Neurons: 128
  Activation: ReLU (Rectified Linear Unit)
  Dropout: 30% (prevents overfitting)
  Description: Learns basic patterns (finger positions)

    ↓ (fully connected)

Hidden Layer 2:
  Neurons: 64
  Activation: ReLU
  Dropout: 30%
  Description: Learns complex patterns (hand shapes)

    ↓ (fully connected)

Hidden Layer 3:
  Neurons: 32
  Activation: ReLU
  Dropout: 20%
  Description: Learns letter-specific features

    ↓ (fully connected)

Output Layer:
  Neurons: 26
  Activation: Softmax
  Description: Probability for each letter (A-Z)
  Example Output: [0.01, 0.03, 0.95, 0.00, ..., 0.01]
                   (1%   3%    95%   0%        1% )
                    A     B     C     D    ...  Z

What Each Layer Does

1. Input Layer (63 neurons)

  • Receives raw landmark coordinates
  • No processing, just passes data forward

2. Hidden Layer 1 (128 neurons)

  • Learns basic geometric relationships
  • Examples:
    • "Are fingers spread apart?"
    • "Is thumb extended?"
    • "What's the palm orientation?"

3. Hidden Layer 2 (64 neurons)

  • Combines basic patterns into complex ones
  • Examples:
    • "Thumb up + fingers curled = might be 'A'"
    • "All fingers extended = might be 'B'"

4. Hidden Layer 3 (32 neurons)

  • Fine-tunes letter-specific features
  • Distinguishes similar signs
  • Examples:
    • "Is this 'M' or 'N'?" (very similar in ISL)

5. Output Layer (26 neurons)

  • Each neuron represents one letter
  • Softmax ensures probabilities sum to 1.0
  • Highest probability = predicted letter

Activation Functions

ReLU (Rectified Linear Unit)

Formula: f(x) = max(0, x)

Graph:
    β”‚   β•±
    β”‚  β•±
    β”‚ β•±
────┼─────
    β”‚
    
Benefits:
- Fast to compute
- Prevents vanishing gradients
- Works well for hidden layers

Softmax

Formula: softmax(xi) = e^xi / Ξ£(e^xj)

Purpose:
- Converts raw scores to probabilities
- All outputs sum to 1.0
- Used in output layer for classification

Example:
Raw scores: [2.1, 0.5, 4.2, 1.3]
After softmax: [0.12, 0.02, 0.84, 0.02]
               (12%   2%   84%   2% )

Dropout Layers

Purpose: Prevent overfitting

During Training:
- Randomly "turn off" 30% of neurons
- Forces network to learn robust features
- Network can't rely on specific neurons

During Inference (app usage):
- All neurons active
- Uses learned patterns to classify

Model Training Code

# File: model_training/train_model.py

import tensorflow as tf
from tensorflow import keras
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Load landmark dataset
print("Loading dataset...")
data = pd.read_csv('landmarks_dataset.csv')

print(f"Dataset shape: {data.shape}")
print(f"Classes: {sorted(data['label'].unique())}")

# Separate features (X) and labels (y)
X = data.iloc[:, :-1].values  # 63 landmark columns
y = data.iloc[:, -1].values   # Label column

# Encode labels: A=0, B=1, C=2, ..., Z=25
encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y)
y_categorical = keras.utils.to_categorical(y_encoded)

# Save label mapping
label_mapping = {i: label for i, label in enumerate(encoder.classes_)}
print(f"Label mapping: {label_mapping}")

# Train/test split (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y_categorical,
    test_size=0.2,
    random_state=42,
    stratify=y_categorical  # Maintain class distribution
)

print(f"\nTraining samples: {len(X_train)}")
print(f"Testing samples: {len(X_test)}")

# Build the DNN model
model = keras.Sequential([
    # Input layer
    keras.layers.Input(shape=(63,), name='landmark_input'),
    
    # Hidden layer 1
    keras.layers.Dense(128, activation='relu', name='dense_1'),
    keras.layers.BatchNormalization(),
    keras.layers.Dropout(0.3),
    
    # Hidden layer 2
    keras.layers.Dense(64, activation='relu', name='dense_2'),
    keras.layers.BatchNormalization(),
    keras.layers.Dropout(0.3),
    
    # Hidden layer 3
    keras.layers.Dense(32, activation='relu', name='dense_3'),
    keras.layers.Dropout(0.2),
    
    # Output layer
    keras.layers.Dense(len(encoder.classes_), activation='softmax', name='output')
])

# Compile model
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    loss='categorical_crossentropy',
    metrics=['accuracy']
)

# Display model summary
model.summary()

# Training callbacks
callbacks = [
    # Stop training if validation loss doesn't improve for 10 epochs
    keras.callbacks.EarlyStopping(
        monitor='val_loss',
        patience=10,
        restore_best_weights=True
    ),
    
    # Reduce learning rate if validation loss plateaus
    keras.callbacks.ReduceLROnPlateau(
        monitor='val_loss',
        factor=0.5,
        patience=5,
        min_lr=0.00001
    ),
    
    # Save best model during training
    keras.callbacks.ModelCheckpoint(
        'best_model.h5',
        monitor='val_accuracy',
        save_best_only=True
    )
]

# Train the model
print("\nTraining model...")
history = model.fit(
    X_train, y_train,
    epochs=100,
    batch_size=32,
    validation_split=0.2,  # Use 20% of training data for validation
    callbacks=callbacks,
    verbose=1
)

# Evaluate on test set
print("\nEvaluating on test set...")
test_loss, test_accuracy = model.evaluate(X_test, y_test, verbose=0)
print(f"Test Loss: {test_loss:.4f}")
print(f"Test Accuracy: {test_accuracy*100:.2f}%")

# Save final model
model.save('isl_model.h5')
print("\nModel saved as 'isl_model.h5'")

# Convert to TFLite
print("\nConverting to TFLite...")
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open('isl_model.tflite', 'wb') as f:
    f.write(tflite_model)

print(f"TFLite model saved!")
print(f"Model size: {len(tflite_model) / 1024:.2f} KB")

Model Summary Output

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 landmark_input (InputLayer) [(None, 63)]              0         
                                                                 
 dense_1 (Dense)             (None, 128)               8,192     
                                                                 
 batch_normalization         (None, 128)               512       
                                                                 
 dropout (Dropout)           (None, 128)               0         
                                                                 
 dense_2 (Dense)             (None, 64)                8,256     
                                                                 
 batch_normalization_1       (None, 64)                256       
                                                                 
 dropout_1 (Dropout)         (None, 64)                0         
                                                                 
 dense_3 (Dense)             (None, 32)                2,080     
                                                                 
 dropout_2 (Dropout)         (None, 32)                0         
                                                                 
 output (Dense)              (None, 26)                858       
                                                                 
=================================================================
Total params: 20,154 (78.73 KB)
Trainable params: 19,770 (77.23 KB)
Non-trainable params: 384 (1.50 KB)
_________________________________________________________________

What Happens During Training

Epoch 1/100
────────────────────────────────────────
Batch 1/250:  Forward pass β†’ Calculate loss β†’ Backward pass β†’ Update weights
Batch 2/250:  Forward pass β†’ Calculate loss β†’ Backward pass β†’ Update weights
...
Batch 250/250: Forward pass β†’ Calculate loss β†’ Backward pass β†’ Update weights

Validation:
- Test on validation set (20% of training data)
- Calculate validation accuracy
- If improved, save as best model

Epoch 1: loss=0.5423, accuracy=0.8234, val_loss=0.4321, val_accuracy=0.8567
Epoch 2: loss=0.3215, accuracy=0.8876, val_loss=0.3102, val_accuracy=0.8923
Epoch 3: loss=0.2543, accuracy=0.9123, val_loss=0.2876, val_accuracy=0.9034
...
Epoch 45: loss=0.0234, accuracy=0.9892, val_loss=0.0456, val_accuracy=0.9845
Epoch 46: loss=0.0231, accuracy=0.9894, val_loss=0.0458, val_accuracy=0.9844
(No improvement for 10 epochs β†’ Early stopping triggered)

Best model: Epoch 45 with val_accuracy=0.9845

9. Platform Channels Explained

What Are Platform Channels?

Platform channels are Flutter's official mechanism for communication between Dart code and native platform code (Kotlin/Swift).

The Problem They Solve

Flutter (Dart) runs in its own runtime
    ↓
Cannot directly access:
- Native camera APIs
- MediaPipe library (Android/iOS)
- Hardware sensors
- Native ML frameworks
- Bluetooth, NFC, etc.

Solution: Platform Channels = Bridge between worlds

Types of Platform Channels

1. MethodChannel (Request-Response)

Use Case: One-time requests with responses

Flutter asks β†’ Kotlin does work β†’ Kotlin responds β†’ Flutter receives

Example: Start/stop camera, take photo, get device info

// Flutter side
final result = await methodChannel.invokeMethod('getCameraStatus');
print(result); // "active" or "inactive"
// Kotlin side
methodChannel.setMethodCallHandler { call, result ->
    when (call.method) {
        "getCameraStatus" -> {
            val status = if (cameraActive) "active" else "inactive"
            result.success(status)
        }
    }
}

2. EventChannel (Continuous Stream)

Use Case: Continuous data stream from native to Flutter

Kotlin continuously sends β†’ Flutter receives stream β†’ UI updates in real-time

Example: Hand detection results, sensor data, location updates

// Flutter side
eventChannel.receiveBroadcastStream().listen((data) {
    print(data); // Continuous updates
});
// Kotlin side
eventChannel.setStreamHandler(object : EventChannel.StreamHandler {
    override fun onListen(arguments: Any?, events: EventChannel.EventSink?) {
        // Send data continuously
        events?.success(detectionResult)
    }
})

3. BasicMessageChannel (Bi-directional)

Use Case: Custom protocols, binary data

Not commonly used for this project

Comparison Table

Feature MethodChannel EventChannel
Direction Bi-directional Kotlin β†’ Flutter (one way)
Pattern Request-Response Stream
Use Case Commands Continuous data
Example "Start camera" Detection results every 50ms
Frequency On-demand Continuous

Complete Implementation for KairoAI

Flutter Side (Dart)

// File: lib/services/sign_detection_service.dart

import 'package:flutter/services.dart';
import 'dart:async';

class SignDetectionService {
  // Method channel for commands
  static const MethodChannel _methodChannel =
      MethodChannel('com.kairo.ai/detection');
  
  // Event channel for continuous detection stream
  static const EventChannel _eventChannel =
      EventChannel('com.kairo.ai/detection_stream');
  
  Stream<DetectionResult>? _detectionStream;
  
  /// Start hand sign detection
  Future<void> startDetection() async {
    try {
      await _methodChannel.invokeMethod('startDetection');
      print('βœ… Detection started');
    } on PlatformException catch (e) {
      print('❌ Error starting detection: ${e.message}');
      rethrow;
    }
  }
  
  /// Stop hand sign detection
  Future<void> stopDetection() async {
    try {
      await _methodChannel.invokeMethod('stopDetection');
      print('βœ… Detection stopped');
    } on PlatformException catch (e) {
      print('❌ Error stopping detection: ${e.message}');
      rethrow;
    }
  }
  
  /// Get continuous stream of detection results
  Stream<DetectionResult> get detectionStream {
    _detectionStream ??= _eventChannel
        .receiveBroadcastStream()
        .map((event) {
          final data = Map<String, dynamic>.from(event);
          return DetectionResult.fromMap(data);
        });
    
    return _detectionStream!;
  }
  
  /// Check camera permission status
  Future<bool> checkCameraPermission() async {
    try {
      final bool hasPermission =
          await _methodChannel.invokeMethod('checkCameraPermission');
      return hasPermission;
    } on PlatformException catch (e) {
      print('❌ Error checking permission: ${e.message}');
      return false;
    }
  }
  
  /// Request camera permission
  Future<bool> requestCameraPermission() async {
    try {
      final bool granted =
          await _methodChannel.invokeMethod('requestCameraPermission');
      return granted;
    } on PlatformException catch (e) {
      print('❌ Error requesting permission: ${e.message}');
      return false;
    }
  }
}

/// Data class for detection results
class DetectionResult {
  final String letter;
  final double confidence;
  final bool handDetected;
  final int timestamp;
  
  DetectionResult({
    required this.letter,
    required this.confidence,
    required this.handDetected,
    required this.timestamp,
  });
  
  factory DetectionResult.fromMap(Map<String, dynamic> map) {
    return DetectionResult(
      letter: map['letter'] as String,
      confidence: (map['confidence'] as num).toDouble(),
      handDetected: map['handDetected'] as bool,
      timestamp: map['timestamp'] as int,
    );
  }
  
  @override
  String toString() {
    return 'DetectionResult(letter: $letter, confidence: ${confidence.toStringAsFixed(2)}, handDetected: $handDetected)';
  }
}

Kotlin Side (Android)

// File: android/app/src/main/kotlin/com/kairo/ai/MainActivity.kt

package com.kairo.ai

import android.Manifest
import android.content.pm.PackageManager
import androidx.core.app.ActivityCompat
import androidx.core.content.ContextCompat
import io.flutter.embedding.android.FlutterActivity
import io.flutter.embedding.engine.FlutterEngine
import io.flutter.plugin.common.MethodChannel
import io.flutter.plugin.common.EventChannel
import android.graphics.Bitmap

class MainActivity : FlutterActivity() {
    
    // Channel names (must match Flutter side)
    private val METHOD_CHANNEL = "com.kairo.ai/det…

<!-- Continue from line 1131 -->

    private val EVENT_CHANNEL = "com.kairo.ai/detection_stream"
    
    // Camera permission request code
    private val CAMERA_PERMISSION_CODE = 100
    
    // Our custom classes (to be implemented)
    private lateinit var cameraManager: CameraManager
    private lateinit var handDetector: HandLandmarkDetector
    private lateinit var signClassifier: SignClassifier
    
    // Event sink for streaming data to Flutter
    private var eventSink: EventChannel.EventSink? = null
    
    override fun configureFlutterEngine(flutterEngine: FlutterEngine) {
        super.configureFlutterEngine(flutterEngine)
        
        // Initialize our components
        cameraManager = CameraManager(this)
        handDetector = HandLandmarkDetector(this)
        signClassifier = SignClassifier(this)
        
        // Setup MethodChannel for commands
        MethodChannel(
            flutterEngine.dartExecutor.binaryMessenger,
            METHOD_CHANNEL
        ).setMethodCallHandler { call, result ->
            when (call.method) {
                "startDetection" -> {
                    startDetection()
                    result.success(null)
                }
                
                "stopDetection" -> {
                    stopDetection()
                    result.success(null)
                }
                
                "checkCameraPermission" -> {
                    val hasPermission = checkCameraPermission()
                    result.success(hasPermission)
                }
                
                "requestCameraPermission" -> {
                    requestCameraPermission()
                    result.success(null)
                }
                
                else -> {
                    result.notImplemented()
                }
            }
        }
        
        // Setup EventChannel for streaming data
        EventChannel(
            flutterEngine.dartExecutor.binaryMessenger,
            EVENT_CHANNEL
        ).setStreamHandler(object : EventChannel.StreamHandler {
            
            override fun onListen(arguments: Any?, events: EventChannel.EventSink?) {
                eventSink = events
                println("βœ… Flutter is now listening to detection stream")
            }
            
            override fun onCancel(arguments: Any?) {
                eventSink = null
                stopDetection()
                println("❌ Flutter stopped listening to detection stream")
            }
        })
    }
    
    private fun startDetection() {
        println("πŸŽ₯ Starting camera and detection...")
        cameraManager.startCamera { bitmap ->
            processFrame(bitmap)
        }
    }
    
    private fun stopDetection() {
        println("πŸ›‘ Stopping camera and detection...")
        cameraManager.stopCamera()
    }
    
    private fun processFrame(bitmap: Bitmap) {
        // Step 1: Detect hand landmarks
        val landmarks = handDetector.detectLandmarks(bitmap)
        
        if (landmarks != null) {
            // Step 2: Classify the sign
            val result = signClassifier.classify(landmarks)
            
            // Step 3: Send to Flutter
            val data = mapOf(
                "letter" to result.letter,
                "confidence" to result.confidence,
                "handDetected" to true,
                "timestamp" to System.currentTimeMillis()
            )
            
            runOnUiThread {
                eventSink?.success(data)
            }
        } else {
            // No hand detected
            val data = mapOf(
                "letter" to "",
                "confidence" to 0.0,
                "handDetected" to false,
                "timestamp" to System.currentTimeMillis()
            )
            
            runOnUiThread {
                eventSink?.success(data)
            }
        }
    }
    
    private fun checkCameraPermission(): Boolean {
        return ContextCompat.checkSelfPermission(
            this,
            Manifest.permission.CAMERA
        ) == PackageManager.PERMISSION_GRANTED
    }
    
    private fun requestCameraPermission() {
        ActivityCompat.requestPermissions(
            this,
            arrayOf(Manifest.permission.CAMERA),
            CAMERA_PERMISSION_CODE
        )
    }
    
    override fun onRequestPermissionsResult(
        requestCode: Int,
        permissions: Array<out String>,
        grantResults: IntArray
    ) {
        super.onRequestPermissionsResult(requestCode, permissions, grantResults)
        
        when (requestCode) {
            CAMERA_PERMISSION_CODE -> {
                val granted = grantResults.isNotEmpty() &&
                        grantResults[0] == PackageManager.PERMISSION_GRANTED
                
                if (granted) {
                    println("βœ… Camera permission granted")
                } else {
                    println("❌ Camera permission denied")
                }
            }
        }
    }
}

Communication Flow Diagram

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                         FLUTTER (Dart)                           β”‚
β”‚                                                                  β”‚
β”‚  User taps "Start Lesson"                                       β”‚
β”‚          ↓                                                       β”‚
β”‚  LessonScreen calls:                                             β”‚
β”‚  signDetectionService.startDetection()                          β”‚
β”‚          ↓                                                       β”‚
β”‚  MethodChannel sends: "startDetection"                          β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                         β”‚
                         β–Ό Platform Channel Bridge
                         β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                      KOTLIN (Android)                            β”‚
β”‚                                                                  β”‚
β”‚  MethodChannel receives: "startDetection"                       β”‚
β”‚          ↓                                                       β”‚
β”‚  MainActivity.startDetection()                                  β”‚
β”‚          ↓                                                       β”‚
β”‚  CameraManager.startCamera()                                    β”‚
β”‚          ↓                                                       β”‚
β”‚  Camera frames arrive (30 FPS)                                  β”‚
β”‚          ↓                                                       β”‚
β”‚  processFrame(bitmap) for each frame:                           β”‚
β”‚    1. HandDetector.detectLandmarks(bitmap) β†’ 63 floats          β”‚
β”‚    2. SignClassifier.classify(landmarks) β†’ Letter + confidence  β”‚
β”‚    3. Package into Map                                          β”‚
β”‚    4. EventChannel sends to Flutter                             β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                         β”‚
                         β–Ό Event Channel Stream
                         β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                         FLUTTER (Dart)                           β”‚
β”‚                                                                  β”‚
β”‚  EventChannel.receiveBroadcastStream()                          β”‚
β”‚          ↓                                                       β”‚
β”‚  .listen((data) {                                               β”‚
β”‚    setState(() {                                                β”‚
β”‚      currentLetter = data['letter'];                            β”‚
β”‚      confidence = data['confidence'];                           β”‚
β”‚    });                                                          β”‚
β”‚  })                                                             β”‚
β”‚          ↓                                                       β”‚
β”‚  UI rebuilds with new data                                      β”‚
β”‚  Shows: "Detected: A (95%)"                                     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

10. Dataset Creation Guide

Understanding the Dataset

What You Need

Dataset Size Calculation:
- 26 letters (A-Z)
- 500-1000 images per letter (recommended)
- Total: 13,000 - 26,000 images

Actual Data After Extraction:
- Each image β†’ 1 row in CSV
- Each row = 63 landmark values + 1 label
- Final CSV: 13,000-26,000 rows Γ— 64 columns

Option 1: Use Existing Dataset (Fastest)

Find ISL Dataset on Kaggle

# Popular datasets:
# 1. "ISL Dataset" 
# 2. "Indian Sign Language Recognition Dataset"
# 3. "ISL Alphabet Dataset"

# Expected structure:
ISL_Dataset/
β”œβ”€β”€ A/
β”‚   β”œβ”€β”€ img_001.jpg
β”‚   β”œβ”€β”€ img_002.jpg
β”‚   └── ...
β”œβ”€β”€ B/
β”‚   └── ...
└── Z/
    └── ...

Option 2: Extract Landmarks from Images

Complete Extraction Script

# File: model_training/extract_landmarks.py

import mediapipe as mp
import cv2
import os
import csv
import numpy as np

mp_hands = mp.solutions.hands
hands = mp_hands.Hands(
    static_image_mode=True,
    max_num_hands=1,
    min_detection_confidence=0.5
)

def extract_landmarks_from_image(image_path):
    """
    Extract 21 hand landmarks from an image.
    Returns 63 values (21 points Γ— 3 coordinates) or None if no hand detected.
    """
    image = cv2.imread(image_path)
    if image is None:
        return None
    
    image_rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    results = hands.process(image_rgb)
    
    if not results.multi_hand_landmarks:
        return None
    
    hand_landmarks = results.multi_hand_landmarks[0]
    landmarks = []
    
    for landmark in hand_landmarks.landmark:
        landmarks.extend([landmark.x, landmark.y, landmark.z])
    
    return landmarks

def normalize_landmarks(landmarks):
    """
    Normalize landmarks relative to wrist and scale
    Makes model robust to hand position and size
    """
    landmarks = np.array(landmarks).reshape(21, 3)
    
    # Get wrist position (point 0)
    wrist = landmarks[0]
    
    # Translate to origin (wrist at 0,0,0)
    landmarks = landmarks - wrist
    
    # Calculate bounding box
    min_vals = landmarks.min(axis=0)
    max_vals = landmarks.max(axis=0)
    
    # Scale to unit box
    scale = max_vals - min_vals
    scale[scale == 0] = 1  # Avoid division by zero
    landmarks = (landmarks - min_vals) / scale
    
    return landmarks.flatten()

def process_dataset(dataset_path, output_csv):
    """
    Process all images in dataset and save landmarks to CSV.
    """
    header = []
    for i in range(21):
        header.extend([f'x{i}', f'y{i}', f'z{i}'])
    header.append('label')
    
    total_images = 0
    successful_extractions = 0
    failed_extractions = 0
    
    with open(output_csv, 'w', newline='') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(header)
        
        for label in sorted(os.listdir(dataset_path)):
            label_path = os.path.join(dataset_path, label)
            
            if not os.path.isdir(label_path):
                continue
            
            print(f"Processing class: {label}")
            
            for image_name in os.listdir(label_path):
                image_path = os.path.join(label_path, image_name)
                
                if not image_name.lower().endswith(('.jpg', '.jpeg', '.png', '.bmp')):
                    continue
                
                total_images += 1
                landmarks = extract_landmarks_from_image(image_path)
                
                if landmarks is not None:
                    # Normalize landmarks
                    normalized = normalize_landmarks(landmarks)
                    writer.writerow(list(normalized) + [label])
                    successful_extractions += 1
                else:
                    failed_extractions += 1
                    print(f"  ⚠️ No hand detected: {image_name}")
    
    print("\n" + "="*50)
    print("EXTRACTION COMPLETE")
    print("="*50)
    print(f"Total images processed: {total_images}")
    print(f"Successful extractions: {successful_extractions}")
    print(f"Failed extractions: {failed_extractions}")
    print(f"Success rate: {(successful_extractions/total_images)*100:.1f}%")
    print(f"Output saved to: {output_csv}")

if __name__ == "__main__":
    DATASET_PATH = "ISL_Dataset"  # Change to your dataset path
    OUTPUT_CSV = "landmarks_dataset.csv"
    
    process_dataset(DATASET_PATH, OUTPUT_CSV)

11. Model Training Guide

Training Process Overview

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                     TRAINING PIPELINE                             β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Step 1: Load CSV Dataset
  ↓
Step 2: Split into Features (X) and Labels (y)
  ↓
Step 3: Encode Labels (A→0, B→1, ..., Z→25)
  ↓
Step 4: Split into Train/Test Sets (80%/20%)
  ↓
Step 5: Build DNN Model
  ↓
Step 6: Train Model (with callbacks)
  ↓
Step 7: Evaluate on Test Set
  ↓
Step 8: Save Model (.h5)
  ↓
Step 9: Convert to TFLite (.tflite)
  ↓
Step 10: Deploy to Android App

Complete Training Script

# File: model_training/train_model.py

import tensorflow as tf
from tensorflow import keras
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt

# Load dataset
print("πŸ“Š Loading dataset...")
data = pd.read_csv('landmarks_dataset.csv')

print(f"Dataset shape: {data.shape}")
print(f"Number of samples: {len(data)}")
print(f"Classes: {sorted(data['label'].unique())}")

# Check for class imbalance
class_counts = data['label'].value_counts()
print("\nπŸ“ˆ Class distribution:")
print(class_counts)

# Separate features and labels
X = data.iloc[:, :-1].values  # 63 landmark columns
y = data.iloc[:, -1].values   # Label column

# Encode labels
encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y)
y_categorical = keras.utils.to_categorical(y_encoded)

# Save label mapping
label_mapping = {i: label for i, label in enumerate(encoder.classes_)}
print(f"\n🏷️ Label mapping: {label_mapping}")

# Save encoder for later use
import pickle
with open('label_encoder.pkl', 'wb') as f:
    pickle.dump(encoder, f)

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y_categorical,
    test_size=0.2,
    random_state=42,
    stratify=y_categorical
)

print(f"\nπŸ“š Training samples: {len(X_train)}")
print(f"πŸ“š Testing samples: {len(X_test)}")

# Build model
model = keras.Sequential([
    keras.layers.Input(shape=(63,), name='landmark_input'),
    
    keras.layers.Dense(128, activation='relu', name='dense_1'),
    keras.layers.BatchNormalization(),
    keras.layers.Dropout(0.3),
    
    keras.layers.Dense(64, activation='relu', name='dense_2'),
    keras.layers.BatchNormalization(),
    keras.layers.Dropout(0.3),
    
    keras.layers.Dense(32, activation='relu', name='dense_3'),
    keras.layers.Dropout(0.2),
    
    keras.layers.Dense(len(encoder.classes_), activation='softmax', name='output')
])

model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    loss='categorical_crossentropy',
    metrics=['accuracy']
)

model.summary()

# Callbacks
callbacks = [
    keras.callbacks.EarlyStopping(
        monitor='val_loss',
        patience=15,
        restore_best_weights=True,
        verbose=1
    ),
    
    keras.callbacks.ReduceLROnPlateau(
        monitor='val_loss',
        factor=0.5,
        patience=5,
        min_lr=0.00001,
        verbose=1
    ),
    
    keras.callbacks.ModelCheckpoint(
        'best_model.h5',
        monitor='val_accuracy',
        save_best_only=True,
        verbose=1
    )
]

# Train
print("\nπŸš€ Training model...")
history = model.fit(
    X_train, y_train,
    epochs=100,
    batch_size=32,
    validation_split=0.2,
    callbacks=callbacks,
    verbose=1
)

# Evaluate
print("\nπŸ“Š Evaluating on test set...")
test_loss, test_accuracy = model.evaluate(X_test, y_test, verbose=0)
print(f"Test Loss: {test_loss:.4f}")
print(f"Test Accuracy: {test_accuracy*100:.2f}%")

# Plot training history
plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
plt.plot(history.history['accuracy'], label='Training Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.title('Model Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()

plt.subplot(1, 2, 2)
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.title('Model Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()

plt.tight_layout()
plt.savefig('training_history.png')
print("πŸ“ˆ Training history saved to training_history.png")

# Save model
model.save('isl_model.h5')
print("\nπŸ’Ύ Model saved as 'isl_model.h5'")

# Convert to TFLite
print("\nπŸ”„ Converting to TFLite...")
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open('isl_model.tflite', 'wb') as f:
    f.write(tflite_model)

print(f"βœ… TFLite model saved!")
print(f"πŸ“¦ Model size: {len(tflite_model) / 1024:.2f} KB")

print("\nπŸŽ‰ Training complete!")
print("\nπŸ“‹ Next steps:")
print("1. Copy isl_model.tflite to android/app/src/main/assets/")
print("2. Copy label_encoder.pkl for future use")
print("3. Test the model in your Android app")

12. Implementation Roadmap

Phase 1: Proof of Concept (1-2 Weeks)

Goal: Detect hand and show "Hand detected!" in Flutter

Tasks:

  1. Setup Android Dependencies

    • Add MediaPipe to build.gradle.kts
    • Add CameraX libraries
    • Configure permissions
  2. Basic Camera Integration

    • Create CameraManager.kt
    • Open camera
    • Capture frames
  3. Hand Detection Only

    • Create HandLandmarkDetector.kt
    • Detect hand (yes/no)
    • Print landmarks to logcat
  4. Platform Channel Setup

    • Create MethodChannel
    • Send "Hand detected: true/false" to Flutter
    • Display in Flutter UI

Success Criteria:

  • Camera opens
  • Hand detection works
  • Flutter shows "Hand detected!" message

Phase 2: Single Letter Recognition (1-2 Weeks)

Goal: Recognize letter "A" only

Tasks:

  1. Collect Data for Letter A

    • Record 100 images of letter "A"
    • Record 100 images of letter "B" (for contrast)
    • Extract landmarks to CSV
  2. Train 2-Class Model

    • Train DNN to distinguish A vs B
    • Convert to TFLite
    • Achieve 85%+ accuracy
  3. Integrate TFLite

    • Create SignClassifier.kt
    • Load .tflite model
    • Run inference on landmarks
  4. Flutter Integration

    • EventChannel for continuous detection
    • Show detected letter
    • Show confidence score

Success Criteria:

  • Model can distinguish A from B
  • Real-time detection works
  • Flutter UI shows letter and confidence

Phase 3: Full Alphabet (2-3 Weeks)

Goal: Recognize all 26 letters

Tasks:

  1. Complete Dataset

    • Find or collect full ISL dataset
    • Extract landmarks for all 26 letters
    • Validate data quality
  2. Train 26-Class Model

    • Train full model
    • Aim for 90%+ test accuracy
    • Optimize for mobile inference
  3. Lesson Mode UI

    • Create lesson screen
    • Show target letter
    • Camera preview
    • Success/failure feedback
  4. Quiz Mode Logic

    • Random letter selection
    • Sequential validation
    • Score tracking

Success Criteria:

  • 90%+ accuracy on test set
  • All 26 letters recognized
  • Lesson and quiz modes working

Phase 4: Polish & Deploy (1-2 Weeks)

Goal: Production-ready app

Tasks:

  1. Firebase Integration

    • Store user progress
    • Track quiz scores
    • Sync across devices
  2. UI/UX Polish

    • Lottie animations
    • Sound effects
    • Smooth transitions
    • Error handling
  3. Performance Optimization

    • Reduce latency
    • Battery optimization
    • Memory management
  4. Testing

    • User testing
    • Edge cases
    • Different devices
    • Bug fixes

Success Criteria:

  • App feels polished
  • No crashes
  • Good user feedback
  • Ready for release

13. Code Structure

Complete Project Structure

KairoAI/
β”œβ”€β”€ android/
β”‚   └── app/
β”‚       β”œβ”€β”€ src/
β”‚       β”‚   └── main/
β”‚       β”‚       β”œβ”€β”€ kotlin/com/kairo/ai/
β”‚       β”‚       β”‚   β”œβ”€β”€ MainActivity.kt
β”‚       β”‚       β”‚   β”œβ”€β”€ camera/
β”‚       β”‚       β”‚   β”‚   └── CameraManager.kt
β”‚       β”‚       β”‚   └── ml/
β”‚       β”‚       β”‚       β”œβ”€β”€ HandLandmarkDetector.kt
β”‚       β”‚       β”‚       └── SignClassifier.kt
β”‚       β”‚       β”œβ”€β”€ assets/
β”‚       β”‚       β”‚   β”œβ”€β”€ isl_model.tflite
β”‚       β”‚       β”‚   └── hand_landmarker.task
β”‚       β”‚       └── AndroidManifest.xml
β”‚       └── build.gradle.kts
β”‚
β”œβ”€β”€ lib/
β”‚   β”œβ”€β”€ main.dart
β”‚   β”œβ”€β”€ services/
β”‚   β”‚   └── sign_detection_service.dart
β”‚   β”œβ”€β”€ models/
β”‚   β”‚   β”œβ”€β”€ lesson.dart
β”‚   β”‚   └── quiz_result.dart
β”‚   β”œβ”€β”€ screens/
β”‚   β”‚   β”œβ”€β”€ home_screen.dart
β”‚   β”‚   β”œβ”€β”€ lesson_screen.dart
β”‚   β”‚   β”œβ”€β”€ quiz_screen.dart
β”‚   β”‚   └── progress_screen.dart
β”‚   β”œβ”€β”€ widgets/
β”‚   β”‚   β”œβ”€β”€ camera_preview.dart
β”‚   β”‚   β”œβ”€β”€ detection_overlay.dart
β”‚   β”‚   └── success_animation.dart
β”‚   └── providers/
β”‚       └── detection_provider.dart
β”‚
β”œβ”€β”€ model_training/
β”‚   β”œβ”€β”€ extract_landmarks.py
β”‚   β”œβ”€β”€ train_model.py
β”‚   β”œβ”€β”€ convert_to_tflite.py
β”‚   β”œβ”€β”€ requirements.txt
β”‚   └── README.md
β”‚
β”œβ”€β”€ assets/
β”‚   β”œβ”€β”€ animations/
β”‚   β”‚   β”œβ”€β”€ success.json
β”‚   β”‚   └── loading.json
β”‚   β”œβ”€β”€ sounds/
β”‚   β”‚   β”œβ”€β”€ success.mp3
β”‚   β”‚   └── error.mp3
β”‚   └── images/
β”‚       └── isl_alphabet_guide.png
β”‚
β”œβ”€β”€ pubspec.yaml
β”œβ”€β”€ README.md
└── DOCUMENTATION.md (this file)

14. Challenges & Solutions

Technical Challenges

Challenge 1: MediaPipe Not Detecting Hand

Symptoms:

  • landmarks returns null
  • No hand landmarks extracted

Possible Causes & Solutions:

Cause Solution
Poor lighting Improve lighting conditions
Hand too small in frame Move camera closer
Hand partially out of frame Ensure full hand visible
Wrong model file Download correct hand_landmarker.task
Low confidence threshold Reduce minHandDetectionConfidence to 0.3

Debug Code:

val result = handLandmarker.detect(mpImage)
println("Hands detected: ${result.landmarks().size}")

if (result.landmarks().isEmpty()) {
    println("❌ No hand detected")
    println("Try: better lighting, move hand closer, show full hand")
}

Challenge 2: Low Model Accuracy

Symptoms:

  • Test accuracy < 85%
  • Wrong letter predictions
  • Low confidence scores

Solutions:

  1. Collect More Data

    # Aim for at least 500 samples per letter
    # Current: 100 per letter β†’ Low
    # Target: 500+ per letter β†’ Good
    
  2. Normalize Landmarks

    def normalize_landmarks(landmarks):
        landmarks = np.array(landmarks).reshape(21, 3)
        wrist = landmarks[0]
        landmarks = landmarks - wrist  # Relative to wrist
        # Scale to unit box
        min_vals = landmarks.min(axis=0)
        max_vals = landmarks.max(axis=0)
        scale = max_vals - min_vals
        landmarks = (landmarks - min_vals) / scale
        return landmarks.flatten()
    
  3. Tune Hyperparameters

    # Try different:
    - Learning rates: 0.0001, 0.001, 0.01
    - Batch sizes: 16, 32, 64
    - Dropout rates: 0.2, 0.3, 0.5
    - Number of layers: 2, 3, 4
    

Challenge 3: Platform Channel Not Working

Symptoms:

  • Flutter doesn't receive data
  • MissingPluginException
  • No communication

Solutions:

  1. Check Channel Names Match

    // Flutter
    MethodChannel('com.kairo.ai/detection');  // Must match exactly
    
    // Kotlin
    MethodChannel(..., "com.kairo.ai/detection");  // Same string
    
  2. Verify Plugin Registration

    // MainActivity must extend FlutterActivity
    class MainActivity : FlutterActivity() {
        // ...
    }
    
  3. Debug with Logs

    // Flutter
    try {
        await _methodChannel.invokeMethod('test');
        print('βœ… Channel working');
    } catch (e) {
        print('❌ Channel error: $e');
    }
    
    // Kotlin
    methodChannel.setMethodCallHandler { call, result ->
        println("πŸ“ž Received call: ${call.method}")
        result.success("OK")
    }
    

Challenge 4: App Crashes on TFLite Inference

Symptoms:

  • App crashes when detecting
  • IllegalArgumentException
  • ArrayIndexOutOfBoundsException

Solutions:

  1. Check Input Shape

    // Model expects: [1, 63]
    val inputBuffer = Array(1) { FloatArray(63) }
    inputBuffer[0] = landmarks  // landmarks must be exactly 63 floats
    
    if (landmarks.size != 63) {
        println("❌ Wrong input size: ${landmarks.size}, expected 63")
        return
    }
    
  2. Check Output Shape

    // Model outputs: [1, 26]
    val outputBuffer = Array(1) { FloatArray(26) }
    
    interpreter.run(inputBuffer, outputBuffer)
    
    val probabilities = outputBuffer[0]
    if (probabilities.size != 26) {
        println("❌ Wrong output size: ${probabilities.size}")
    }
    
  3. Verify Model File

    // Check if model file exists and loads correctly
    try {
        val modelFile = loadModelFile(context, "isl_model.tflite")
        println("βœ… Model loaded: ${modelFile.capacity()} bytes")
    } catch (e: Exception) {
        println("❌ Failed to load model: ${e.message}")
    }
    

Common Pitfalls

Pitfall 1: Not Normalizing Landmarks

Problem: Model accuracy drops when user changes hand position or distance

Solution: Always normalize landmarks during both training and inference

# Training time
def normalize_landmarks(landmarks):
    # Make relative to wrist and scale to unit box
    # ...

# Inference time (Kotlin)
fun normalizeLandmarks(landmarks: FloatArray): FloatArray {
    // Same normalization logic
    // ...
}

Pitfall 2: Imbalanced Dataset

Problem: Some letters have 1000 samples, others have 100

Solution: Balance the dataset

# Check class distribution
print(data['label'].value_counts())

# Undersample majority classes or oversample minority classes
from imblearn.over_sampling import SMOTE
X_balanced, y_balanced = SMOTE().fit_resample(X, y)

Pitfall 3: Forgetting to Close Camera

Problem: Camera stays on even after leaving screen, draining battery

Solution: Properly manage lifecycle

class LessonScreen extends StatefulWidget {
  @override
  _LessonScreenState createState() => _LessonScreenState();
}

class _LessonScreenState extends State<LessonScreen> {
  @override
  void initState() {
    super.initState();
    signDetectionService.startDetection();
  }
  
  @override
  void dispose() {
    signDetectionService.stopDetection();  // Important!
    super.dispose();
  }
}

15. Feasibility Assessment

Is This Project Doable?

βœ… Absolutely YES, Here's Why:

  1. All Technology Exists

    • MediaPipe: Production-ready, used by millions
    • TFLite: Mature, optimized for mobile
    • Flutter Platform Channels: Well-documented
  2. Similar Apps Exist

    • ASL learning apps
    • Gesture recognition apps
    • Hand tracking AR filters
  3. You Have Foundation

    • Flutter project setup βœ…
    • Firebase working βœ…
    • Android configuration βœ…

Realistic Timeline

Beginner Path (No ML/Native experience)

Phase Time Milestone
Learning 1-2 weeks Understand MediaPipe, TFLite basics
Setup 3-5 days Add dependencies, configure Android
Data Collection 1 week Find/collect dataset, extract landmarks
Model Training 1-2 days Train model, achieve 85%+ accuracy
Integration 1-2 weeks Platform channels, camera, detection
Testing 1 week Fix bugs, improve accuracy
Polish 1 week UI/UX, animations, Firebase
TOTAL 6-8 weeks Working MVP

With Experience Path

Phase Time Milestone
Setup 1 day Dependencies configured
Data 2-3 days Dataset ready
Training 4 hours Model trained
Integration 1 week Everything connected
Testing 3-4 days Bugs fixed
TOTAL 2-3 weeks Working app

Skills You'll Gain

Even if this takes longer than expected:

βœ… Flutter platform channels (valuable)
βœ… Android native development (Kotlin)
βœ… Machine learning basics (TensorFlow)
βœ… Computer vision (MediaPipe)
βœ… Mobile camera handling
βœ… Firebase integration
βœ… Real-world app architecture
βœ… Complex system integration
βœ… Problem-solving under constraints

Value: These skills are highly marketable

Risk Mitigation

Risk 1: Can't Collect Enough Data

Mitigation:

  • Use Kaggle datasets (many available)
  • Start with 10 letters instead of 26
  • Use data augmentation

Risk 2: Model Accuracy Too Low

Mitigation:

  • Start with 70% accuracy goal (acceptable for MVP)
  • Improve gradually with more data
  • Use transfer learning if needed

Risk 3: Platform Channel Issues

Mitigation:

  • Use exact code patterns from documentation
  • Start with simple "Hello World" channel
  • Test incrementally (string β†’ number β†’ map)

16. Resources & Learning Path

Week 1: Learn Basics

MediaPipe Resources

  1. Official Docs

  2. YouTube Tutorials

    • "MediaPipe Hands Tutorial" by Nicholas Renotte
    • "Real-time Hand Tracking" by TensorFlow
  3. Example Code

    git clone https://github.com/google/mediapipe
    cd mediapipe/examples/hand_landmarker/android
    # Open in Android Studio and run
    

TensorFlow Lite Resources

  1. Official Docs

  2. Tutorials

    • "TFLite on Android" by TensorFlow
    • "Custom Model Deployment" tutorials

Week 2-3: Hands-On Practice

Practice Project 1: Hand Detection

Goal: Display camera feed and draw hand landmarks

// Simple app that just shows hand landmarks
class MainActivity : AppCompatActivity() {
    // Use MediaPipe to detect hand
    // Draw landmarks on camera preview
    // No classification yet
}

Practice Project 2: Platform Channel Hello World

Goal: Send message from Kotlin to Flutter

// Flutter
final result = await channel.invokeMethod('sayHello');
print(result); // "Hello from Kotlin!"
// Kotlin
channel.setMethodCallHandler { call, result ->
    if (call.method == "sayHello") {
        result.success("Hello from Kotlin!")
    }
}

Recommended Development Tools

Essential

Tool Purpose Download
Android Studio Android development Download
VS Code Flutter development Download
Python 3.10+ Model training Download

Optional but Helpful

Tool Purpose
Google Colab Free GPU for training
Postman API testing
Firebase Console Database management
Android Device Real device testing

Community Support

Where to Get Help

  1. Stack Overflow

    • Tag: [flutter] [mediapipe]
    • Tag: [tensorflow-lite]
  2. GitHub Discussions

  3. Discord Communities

    • Flutter Discord
    • ML/AI Discord servers
  4. Reddit

    • r/FlutterDev
    • r/MachineLearning
    • r/androiddev

Debugging Checklist

When Something Doesn't Work

β–‘ Check logs (Android Studio Logcat)
β–‘ Verify dependencies versions match
β–‘ Clean and rebuild project
β–‘ Restart Android Studio
β–‘ Check permissions in AndroidManifest.xml
β–‘ Verify channel names match exactly
β–‘ Test on real device (not emulator)
β–‘ Check model file exists in assets
β–‘ Verify input/output shapes
β–‘ Print debug information at each step

Next Steps: Your Action Plan

Immediate (This Week)

  1. Download MediaPipe Example

    git clone https://github.com/google/mediapipe
    cd mediapipe/examples/hand_landmarker/android
    
  2. Run the Example

    • Open in Android Studio
    • Build and run on device
    • Verify hand detection works
  3. Report Back

    • Did it work?
    • Any errors?
    • What did you learn?

Week 2-3

  1. Create Basic Flutter App

    • Camera preview
    • Platform channel setup
    • Simple hand detection
  2. Collect Small Dataset

    • 50 images each of letters A and B
    • Extract landmarks
    • Train 2-class model
  3. Integrate TFLite

    • Load model in Android
    • Run inference
    • Send results to Flutter

Week 4-6

  1. Expand to Full Alphabet
  2. Build UI
  3. Add Firebase
  4. Test and Polish

Final Thoughts

This Project is:

βœ… Technically feasible - All pieces exist and work βœ… Educationally valuable - You'll learn A LOT βœ… Portfolio-worthy - Impressive for job applications βœ… Challenging but doable - With persistence

This Project is NOT:

❌ A weekend project ❌ Impossible for beginners ❌ Requiring PhD-level ML knowledge ❌ Dependent on expensive tools


Success Factors

You WILL succeed if you:

  1. βœ… Start small (hand detection first, full app later)
  2. βœ… Break problems into tiny steps
  3. βœ… Debug systematically (logs everywhere)
  4. βœ… Ask for help when stuck (community is helpful)
  5. βœ… Accept imperfection (70% accuracy is a great start)
  6. βœ… Stay persistent (debugging takes time)

You might struggle if you:

  1. ❌ Try to do everything at once
  2. ❌ Skip the learning phase
  3. ❌ Give up at first error
  4. ❌ Aim for perfection immediately

Contact & Support

If you need help while building this:

  1. GitHub Discussions - Most responsive
  2. Stack Overflow - Tag your questions properly
  3. Flutter Discord - Real-time chat
  4. This AI Assistant - Come back anytime!

Conclusion

KairoAI is an ambitious but achievable project.

You have:

  • βœ… Clear architecture
  • βœ… Detailed implementation guide
  • βœ… Code examples for every component
  • βœ… Realistic timeline
  • βœ… Troubleshooting guides

Now it's time to build!

Start with the MediaPipe example this week. Once you see hand detection working on your device, you'll realize this is not just possibleβ€”it's inevitable.

Good luck! πŸš€


Last updated: December 18, 2025 Version: 1.0.0 Author: Megh Modi


Appendix: Quick Reference

Key Commands

# Flutter
flutter doctor
flutter clean
flutter pub get
flutter run

# Android
./gradlew clean
./gradlew assembleDebug

# Python
pip install -r requirements.txt
python extract_landmarks.py
python train_model.py

Important File Paths

android/app/src/main/assets/isl_model.tflite
android/app/src/main/kotlin/com/kairo/ai/MainActivity.kt
lib/services/sign_detection_service.dart
model_training/train_model.py

Channel Names (Must Match!)

com.kairo.ai/detection          # MethodChannel
com.kairo.ai/detection_stream   # EventChannel

Common Error Codes

Error Meaning Fix
MissingPluginException Channel not registered Check MainActivity
PlatformException Native code error Check Kotlin logs
IllegalArgumentException Wrong input shape Verify array size is 63
FileNotFoundException Model not found Check assets folder

End of Documentation

Downloads last month
8
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support