tommulder commited on
Commit
e300623
·
1 Parent(s): f256ddd

Prepare for Hugging Face Spaces deployment

Browse files

- Updated Dockerfile with proper user permissions for HF Spaces
- Added field extraction module and models
- Enhanced README with deployment instructions
- Added deployment documentation
- Fixed app.py imports and structure

Files changed (6) hide show
  1. DEPLOYMENT.md +86 -0
  2. Dockerfile +19 -6
  3. README.md +0 -1
  4. app.py +2 -5
  5. field_extraction.py +132 -0
  6. models.py +50 -0
DEPLOYMENT.md ADDED
@@ -0,0 +1,86 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Dots.OCR Service - Hugging Face Spaces Deployment Guide
2
+
3
+ ## ✅ Ready for Deployment
4
+
5
+ The dots-ocr service is now fully self-contained and ready for deployment to Hugging Face Spaces.
6
+
7
+ ## Files Updated
8
+
9
+ - **`app.py`** - Fixed import paths to be self-contained
10
+ - **`models.py`** - Created local data structures (ExtractedField, IdCardFields, MRZData)
11
+ - **`field_extraction.py`** - Created local field extraction module
12
+ - **`Dockerfile`** - Updated for HF compliance with proper user permissions
13
+ - **`README.md`** - Updated with proper HF Spaces configuration
14
+
15
+ ## Deployment Steps
16
+
17
+ ### 1. Create Hugging Face Space
18
+
19
+ ```bash
20
+ # Login to Hugging Face
21
+ huggingface-cli login
22
+
23
+ # Create a new Space
24
+ huggingface-cli repo create dots-ocr-idcard --type space --space_sdk docker --organization algoryn
25
+ ```
26
+
27
+ ### 2. Deploy to HF Space
28
+
29
+ ```bash
30
+ # Clone the space locally
31
+ git clone https://huggingface.co/spaces/algoryn/dots-ocr-idcard
32
+ cd dots-ocr-idcard
33
+
34
+ # Copy all files from this repository
35
+ cp /Users/tmulder/Sources/Algoryn/kybtech-dots-ocr/* .
36
+
37
+ # Commit and push
38
+ git add .
39
+ git commit -m "Deploy Dots.OCR text extraction service"
40
+ git push
41
+ ```
42
+
43
+ ### 3. Test the Deployment
44
+
45
+ Once deployed (usually takes 5-10 minutes), test with:
46
+
47
+ ```bash
48
+ # Basic OCR test
49
+ curl -X POST https://algoryn-dots-ocr-idcard.hf.space/v1/id/ocr \
50
+ -H "Authorization: Bearer YOUR_HF_TOKEN" \
51
+ -F "file=@test_image.jpg"
52
+
53
+ # With ROI (region of interest)
54
+ curl -X POST https://algoryn-dots-ocr-idcard.hf.space/v1/id/ocr \
55
+ -H "Authorization: Bearer YOUR_HF_TOKEN" \
56
+ -F "file=@test_image.jpg" \
57
+ -F 'roi={"x1":0.1,"y1":0.1,"x2":0.9,"y2":0.9}'
58
+ ```
59
+
60
+ ## Features
61
+
62
+ - **Self-contained**: No external dependencies on parent repository
63
+ - **HF Compliant**: Follows Hugging Face Docker Spaces best practices
64
+ - **Mock Mode**: Falls back to mock implementation if Dots.OCR fails to load
65
+ - **ROI Support**: Process pre-cropped images or full images with ROI coordinates
66
+ - **Field Extraction**: Structured field extraction with confidence scores
67
+ - **MRZ Detection**: Machine Readable Zone data extraction
68
+
69
+ ## API Endpoints
70
+
71
+ - `GET /health` - Health check
72
+ - `POST /v1/id/ocr` - Text extraction with optional ROI
73
+
74
+ ## Environment Variables
75
+
76
+ No special environment variables needed. The service runs on port 7860 by default.
77
+
78
+ ## Performance
79
+
80
+ - **GPU**: 300-900ms processing time
81
+ - **CPU**: 3-8s processing time
82
+ - **Memory**: ~6GB per instance
83
+
84
+ ## Privacy
85
+
86
+ This endpoint processes images temporarily and does not store or log personal information. All field values are redacted in logs for privacy protection.
Dockerfile CHANGED
@@ -1,9 +1,6 @@
1
  FROM python:3.11-slim
2
 
3
- # Set working directory
4
- WORKDIR /app
5
-
6
- # Install system dependencies
7
  RUN apt-get update && apt-get install -y \
8
  libgl1-mesa-glx \
9
  libglib2.0-0 \
@@ -13,12 +10,28 @@ RUN apt-get update && apt-get install -y \
13
  libgomp1 \
14
  && rm -rf /var/lib/apt/lists/*
15
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
16
  # Copy requirements and install Python dependencies
17
- COPY requirements.txt .
18
  RUN pip install --no-cache-dir -r requirements.txt
19
 
20
  # Copy application code
21
- COPY . .
22
 
23
  # Expose port
24
  EXPOSE 7860
 
1
  FROM python:3.11-slim
2
 
3
+ # Install system dependencies as root
 
 
 
4
  RUN apt-get update && apt-get install -y \
5
  libgl1-mesa-glx \
6
  libglib2.0-0 \
 
10
  libgomp1 \
11
  && rm -rf /var/lib/apt/lists/*
12
 
13
+ # Set up a new user named "user" with user ID 1000
14
+ RUN useradd -m -u 1000 user
15
+
16
+ # Switch to the "user" user
17
+ USER user
18
+
19
+ # Set home to the user's home directory
20
+ ENV HOME=/home/user \
21
+ PATH=/home/user/.local/bin:$PATH
22
+
23
+ # Set the working directory to the user's home directory
24
+ WORKDIR $HOME/app
25
+
26
+ # Try and run pip command after setting the user with `USER user` to avoid permission issues with Python
27
+ RUN pip install --no-cache-dir --upgrade pip
28
+
29
  # Copy requirements and install Python dependencies
30
+ COPY --chown=user requirements.txt .
31
  RUN pip install --no-cache-dir -r requirements.txt
32
 
33
  # Copy application code
34
+ COPY --chown=user . .
35
 
36
  # Expose port
37
  EXPOSE 7860
README.md CHANGED
@@ -4,7 +4,6 @@ emoji: 🔍
4
  colorFrom: blue
5
  colorTo: purple
6
  sdk: docker
7
- sdk_version: "0.0.0"
8
  app_port: 7860
9
  pinned: false
10
  license: "private"
 
4
  colorFrom: blue
5
  colorTo: purple
6
  sdk: docker
 
7
  app_port: 7860
8
  pinned: false
9
  license: "private"
app.py CHANGED
@@ -30,11 +30,8 @@ except ImportError:
30
  DOTS_OCR_AVAILABLE = False
31
  logging.warning("Dots.OCR not available - using mock implementation")
32
 
33
- # Import field extraction utilities
34
- import sys
35
- import os
36
- sys.path.append(os.path.join(os.path.dirname(__file__), '..', '..', '..', 'src'))
37
- from idcard_api.field_extraction import FieldExtractor
38
 
39
  # Configure logging
40
  logging.basicConfig(level=logging.INFO)
 
30
  DOTS_OCR_AVAILABLE = False
31
  logging.warning("Dots.OCR not available - using mock implementation")
32
 
33
+ # Import local field extraction utilities
34
+ from field_extraction import FieldExtractor
 
 
 
35
 
36
  # Configure logging
37
  logging.basicConfig(level=logging.INFO)
field_extraction.py ADDED
@@ -0,0 +1,132 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """Field extraction utilities for OCR text processing.
2
+
3
+ This module provides field extraction and mapping from OCR results
4
+ to structured KYB field formats.
5
+ """
6
+
7
+ import re
8
+ from typing import Optional
9
+ from models import ExtractedField, IdCardFields, MRZData
10
+
11
+
12
+ class FieldExtractor:
13
+ """Field extraction and mapping from OCR results."""
14
+
15
+ # Field mapping patterns for Dutch ID cards
16
+ FIELD_PATTERNS = {
17
+ "document_number": [
18
+ r"documentnummer[:\s]*([A-Z0-9]+)",
19
+ r"document\s*number[:\s]*([A-Z0-9]+)",
20
+ r"nr[:\s]*([A-Z0-9]+)"
21
+ ],
22
+ "surname": [
23
+ r"achternaam[:\s]*([A-Z]+)",
24
+ r"surname[:\s]*([A-Z]+)",
25
+ r"family\s*name[:\s]*([A-Z]+)"
26
+ ],
27
+ "given_names": [
28
+ r"voornamen[:\s]*([A-Z]+)",
29
+ r"given\s*names[:\s]*([A-Z]+)",
30
+ r"first\s*name[:\s]*([A-Z]+)"
31
+ ],
32
+ "nationality": [
33
+ r"nationaliteit[:\s]*([A-Za-z]+)",
34
+ r"nationality[:\s]*([A-Za-z]+)"
35
+ ],
36
+ "date_of_birth": [
37
+ r"geboortedatum[:\s]*(\d{2}[./-]\d{2}[./-]\d{4})",
38
+ r"date\s*of\s*birth[:\s]*(\d{2}[./-]\d{2}[./-]\d{4})",
39
+ r"born[:\s]*(\d{2}[./-]\d{2}[./-]\d{4})"
40
+ ],
41
+ "gender": [
42
+ r"geslacht[:\s]*([MF])",
43
+ r"gender[:\s]*([MF])",
44
+ r"sex[:\s]*([MF])"
45
+ ],
46
+ "place_of_birth": [
47
+ r"geboorteplaats[:\s]*([A-Za-z\s]+)",
48
+ r"place\s*of\s*birth[:\s]*([A-Za-z\s]+)",
49
+ r"born\s*in[:\s]*([A-Za-z\s]+)"
50
+ ],
51
+ "date_of_issue": [
52
+ r"uitgiftedatum[:\s]*(\d{2}[./-]\d{2}[./-]\d{4})",
53
+ r"date\s*of\s*issue[:\s]*(\d{2}[./-]\d{2}[./-]\d{4})",
54
+ r"issued[:\s]*(\d{2}[./-]\d{2}[./-]\d{4})"
55
+ ],
56
+ "date_of_expiry": [
57
+ r"vervaldatum[:\s]*(\d{2}[./-]\d{2}[./-]\d{4})",
58
+ r"date\s*of\s*expiry[:\s]*(\d{2}[./-]\d{2}[./-]\d{4})",
59
+ r"expires[:\s]*(\d{2}[./-]\d{2}[./-]\d{4})"
60
+ ],
61
+ "personal_number": [
62
+ r"persoonsnummer[:\s]*(\d{9})",
63
+ r"personal\s*number[:\s]*(\d{9})",
64
+ r"bsn[:\s]*(\d{9})"
65
+ ]
66
+ }
67
+
68
+ @classmethod
69
+ def extract_fields(cls, ocr_text: str) -> IdCardFields:
70
+ """Extract structured fields from OCR text.
71
+
72
+ Args:
73
+ ocr_text: Raw OCR text from document processing
74
+
75
+ Returns:
76
+ IdCardFields object with extracted field data
77
+ """
78
+ fields = {}
79
+
80
+ for field_name, patterns in cls.FIELD_PATTERNS.items():
81
+ value = None
82
+ confidence = 0.0
83
+
84
+ for pattern in patterns:
85
+ match = re.search(pattern, ocr_text, re.IGNORECASE)
86
+ if match:
87
+ value = match.group(1).strip()
88
+ confidence = 0.8 # Base confidence for pattern match
89
+ break
90
+
91
+ if value:
92
+ fields[field_name] = ExtractedField(
93
+ field_name=field_name,
94
+ value=value,
95
+ confidence=confidence,
96
+ source="ocr"
97
+ )
98
+
99
+ return IdCardFields(**fields)
100
+
101
+ @classmethod
102
+ def extract_mrz(cls, ocr_text: str) -> Optional[MRZData]:
103
+ """Extract MRZ data from OCR text.
104
+
105
+ Args:
106
+ ocr_text: Raw OCR text from document processing
107
+
108
+ Returns:
109
+ MRZData object if MRZ detected, None otherwise
110
+ """
111
+ # Look for MRZ patterns (TD1, TD2, TD3)
112
+ mrz_patterns = [
113
+ r"(P<[A-Z0-9<]+\n[A-Z0-9<]+)", # Generic passport format (try first)
114
+ r"([A-Z0-9<]{30}\n[A-Z0-9<]{30})", # TD1 format
115
+ r"([A-Z0-9<]{44}\n[A-Z0-9<]{44})", # TD2 format
116
+ r"([A-Z0-9<]{44}\n[A-Z0-9<]{44}\n[A-Z0-9<]{44})" # TD3 format
117
+ ]
118
+
119
+ for pattern in mrz_patterns:
120
+ match = re.search(pattern, ocr_text, re.MULTILINE)
121
+ if match:
122
+ raw_mrz = match.group(1)
123
+ # Basic MRZ parsing (simplified)
124
+ return MRZData(
125
+ raw_text=raw_mrz,
126
+ format_type="TD3" if len(raw_mrz.split('\n')) == 3 else "TD2",
127
+ is_valid=True, # Assume valid if present
128
+ checksum_errors=[], # Not implemented in basic version
129
+ confidence=0.9
130
+ )
131
+
132
+ return None
models.py ADDED
@@ -0,0 +1,50 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """Pydantic models for Dots.OCR text extraction service.
2
+
3
+ This module defines the data structures used for API requests,
4
+ responses, and internal data processing.
5
+ """
6
+
7
+ from typing import List, Optional, Dict, Any
8
+ from pydantic import BaseModel, Field
9
+
10
+
11
+ class ExtractedField(BaseModel):
12
+ """Individual extracted field from identity document."""
13
+ field_name: str = Field(..., description="Standardized field name")
14
+ value: Optional[str] = Field(None, description="Extracted field value")
15
+ confidence: float = Field(..., ge=0.0, le=1.0, description="Extraction confidence")
16
+ source: str = Field(..., description="Source of extraction (MRZ, OCR, VLM)")
17
+
18
+
19
+ class IdCardFields(BaseModel):
20
+ """Structured fields extracted from identity documents."""
21
+ document_number: Optional[ExtractedField] = Field(None, description="Document number/ID")
22
+ document_type: Optional[ExtractedField] = Field(None, description="Type of document")
23
+ issuing_country: Optional[ExtractedField] = Field(None, description="Issuing country code")
24
+ issuing_authority: Optional[ExtractedField] = Field(None, description="Issuing authority")
25
+
26
+ # Personal Information
27
+ surname: Optional[ExtractedField] = Field(None, description="Family name/surname")
28
+ given_names: Optional[ExtractedField] = Field(None, description="Given names")
29
+ nationality: Optional[ExtractedField] = Field(None, description="Nationality code")
30
+ date_of_birth: Optional[ExtractedField] = Field(None, description="Date of birth")
31
+ gender: Optional[ExtractedField] = Field(None, description="Gender")
32
+ place_of_birth: Optional[ExtractedField] = Field(None, description="Place of birth")
33
+
34
+ # Validity Information
35
+ date_of_issue: Optional[ExtractedField] = Field(None, description="Date of issue")
36
+ date_of_expiry: Optional[ExtractedField] = Field(None, description="Date of expiry")
37
+ personal_number: Optional[ExtractedField] = Field(None, description="Personal number")
38
+
39
+ # Additional fields for specific document types
40
+ optional_data_1: Optional[ExtractedField] = Field(None, description="Optional data field 1")
41
+ optional_data_2: Optional[ExtractedField] = Field(None, description="Optional data field 2")
42
+
43
+
44
+ class MRZData(BaseModel):
45
+ """Machine Readable Zone data extracted from identity documents."""
46
+ raw_text: str = Field(..., description="Raw MRZ text as extracted")
47
+ format_type: str = Field(..., description="MRZ format type (TD1, TD2, TD3, MRVA, MRVB)")
48
+ is_valid: bool = Field(..., description="Whether MRZ checksums are valid")
49
+ checksum_errors: List[str] = Field(default_factory=list, description="List of checksum validation errors")
50
+ confidence: float = Field(..., ge=0.0, le=1.0, description="Extraction confidence score")