---
language:
- hmr
- lus
- en
license: cc-by-nc-sa-4.0
tags:
- hmar
- mizo
- northeast-india
- language-preservation
- nlp
- textual-corpus
- low-resource-languages
pretty_name: Hmar Digital Corpus (WIP)
size_categories:
- n<1K
configs:
- config_name: default
  data_files:
  - split: train
    path: "metadata.jsonl"
---

# 📚 Hmar Digital Corpus & Research Archive

## 🚧 Status: Work in Progress

This archive documents and preserves Hmar-related literature through **manual digitization of physical volumes** and the curation of rare digital materials. Each work is processed individually, from scan to metadata, to prioritize long-term preservation and research usability over scale.

**Current Workflow:**

- **Digitization:** High-quality scans of physical books and documents.
- **Curation:** Collection and verification of existing digital research papers.
- **Standardization:** Structured metadata (`metadata.json`) for every entry to ensure consistent indexing and retrieval.

---

## 📋 Project Philosophy

The **Hmar Digital Corpus** exists to prevent the loss of Hmar literature and historical records. By converting limited-circulation books, community publications, and local research into a structured digital archive, the project ensures that the Hmar language (ISO 639-3: `hmr`) remains accessible for academic, linguistic, and cultural research.

This project prioritizes **preservation, provenance, and transparency** over textual polish or scale.

---

## 📂 Repository Structure

The corpus follows an **Atomic Folder** model: each book or document is a self-contained unit.

```text
books/
├── hmar/       # Hmar-language literature & textbooks
├── mizo/       # Mizo-language works concerning Hmar history
├── english/    # Academic research, ethnographies, and linguistics
└── bilingual/  # Dictionaries, primers, and parallel texts
```

Each document folder typically contains:

- the scanned PDF (authoritative artifact),
- a `metadata.json` file,
- and, when available, an `ocr/` directory.
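
The Atomic Folder layout above lends itself to a simple directory walk. The following is a minimal sketch, not part of the archive's tooling: it assumes a local checkout with a `books/` root, treats any folder containing `metadata.json` as one document unit, and uses a hypothetical helper name, `find_document_folders`. The top-level folder under the root is reported as the category, and the `ocr/` directory is treated as optional.

```python
from pathlib import Path


def find_document_folders(root):
    """Return one record per folder under `root` that contains a metadata.json.

    Hypothetical helper: the record keys below are illustrative and are not
    defined by the corpus itself.
    """
    root = Path(root)
    records = []
    for meta_path in sorted(root.rglob("metadata.json")):
        folder = meta_path.parent
        parts = folder.relative_to(root).parts
        records.append({
            "folder": folder,
            # First path component under the root, e.g. "hmar" or "bilingual".
            "category": parts[0] if parts else "",
            # The scanned PDF is the authoritative artifact.
            "pdfs": sorted(p.name for p in folder.glob("*.pdf")),
            # OCR is optional, so only record whether the directory exists.
            "has_ocr": (folder / "ocr").is_dir(),
        })
    return records
```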
---

## 🔍 Metadata & Searchability

Every document directory includes a `metadata.json` file, enabling consistent organization and search across the archive.

- **Core Fields:** Title (bilingual where applicable), Author, Year, Publisher.
- **Extended Metadata:** Contextual information such as institutional approvals, author credentials, edition notes, or physical source details.

Metadata is treated as first-class data and is required even when OCR is absent.

---

## 📁 OCR Directory Structure

When available, OCR output is stored in a dedicated `ocr/` subdirectory within each book or document folder.

**Example structure:**

```text
bilingual/
└── hmar_tawng_inchukna/
    ├── hmar_tawng_inchukna_2012.pdf
    ├── metadata.json
    └── ocr/
        └── hmar_tawng_inchukna_2012.txt
```

### OCR File Semantics

- The `ocr/` directory contains **raw, untouched OCR output**.
- In the current corpus, OCR is typically stored as a **single consolidated `.txt` file** corresponding to the scanned PDF.
- In future entries, OCR may also be organized per page or per chapter where useful.
- No semantic corrections, spell-checking, normalization, or stylistic cleanup is applied unless explicitly documented in metadata.

The presence of an `ocr/` directory **does not imply textual accuracy or verification**. OCR text exists to support search, reference, and future annotation, not to serve as a definitive transcription.

The scanned PDF remains the **authoritative source** unless a text has been explicitly marked as *verified* or *curated* in its metadata.

Any corrected or curated text, if introduced, will exist as a **separate layer** and will retain explicit references to the original OCR output.

---

## 🛠 Progress & Limitations

1. **Preservation first:** Priority is given to securing high-quality scans and accurate metadata.
2. **OCR is secondary:** OCR is generated opportunistically and may be incomplete or error-prone.
3. **Incremental growth:** Materials are added as they are processed and documented.
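
Because metadata is required while OCR is optional and unverified, a consumer of the archive might load a document folder roughly as sketched below. This is an illustration under stated assumptions, not the project's loader: the card names the core fields but not their JSON keys, so the lowercase keys in `CORE_FIELDS` are hypothetical, as is the helper name `load_document`. OCR text is returned as-is, and `None` signals that no `ocr/` directory or `.txt` file is present.

```python
import json
from pathlib import Path

# Hypothetical key names: the card lists Title, Author, Year, and Publisher
# as core fields but does not specify how they are spelled in metadata.json.
CORE_FIELDS = ("title", "author", "year", "publisher")


def load_document(folder):
    """Load metadata and any raw OCR text from one document folder."""
    folder = Path(folder)
    metadata = json.loads((folder / "metadata.json").read_text(encoding="utf-8"))

    # Report missing core fields rather than failing, so gaps can be reviewed.
    missing = [key for key in CORE_FIELDS if key not in metadata]

    # OCR is optional; concatenate any .txt files found, without cleanup.
    ocr_text = None
    ocr_dir = folder / "ocr"
    if ocr_dir.is_dir():
        ocr_files = sorted(ocr_dir.glob("*.txt"))
        if ocr_files:
            ocr_text = "\n".join(p.read_text(encoding="utf-8") for p in ocr_files)

    return {"metadata": metadata, "missing_fields": missing, "ocr_text": ocr_text}
```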
---

## ⚖️ Rights & Usage

This archive is provided for **non-commercial research, linguistic study, and cultural preservation**.

Copyright remains with the original authors, societies, and publishers (e.g., HLS, HSA, AIRTSC). Rights holders may request removal or modification of materials through the repository’s discussion channel.