
NER Training Data

This directory contains all training, evaluation, and reference data for the Mongolian NER model (Nomio4640/ner-mongolian on HuggingFace Hub).

None of these files are used at runtime by the web app. They are only used for model training and evaluation.

Labeling Formats

Two different labeling formats are used across this project. Both are converted to the same BERT subword inputs and labels during preprocessing, so the choice of format does not by itself affect model quality.

CoNLL (BIO tags) — used in data/

Token-level format. One token per line followed by its tags, with the BIO NER label in the last column. Sentences are separated by blank lines.

Батболд O O B-PER
гишүүн  O O O

Улаанбаатар O O B-LOC
хотод       O O O

Files ending in .txt and .conll both use this CoNLL format; the extension makes no technical difference.
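A minimal sketch of reading this format, assuming whitespace-separated columns with the token first and the NER tag last (as in the example above). The function name `parse_conll` is hypothetical and is not taken from the project's training code.

```python
from typing import List, Tuple

Sentence = List[Tuple[str, str]]  # (token, NER tag) pairs


def parse_conll(text: str) -> List[Sentence]:
    """Split CoNLL text into sentences of (token, tag) pairs.

    Assumption: the first column is the token and the last column is the
    BIO tag; any middle columns are ignored.
    """
    sentences: List[Sentence] = []
    current: Sentence = []
    for line in text.splitlines():
        parts = line.split()
        if not parts:  # a blank line closes the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        current.append((parts[0], parts[-1]))
    if current:  # the file may not end with a blank line
        sentences.append(current)
    return sentences


sample = """Батболд O O B-PER
гишүүн  O O O

Улаанбаатар O O B-LOC
хотод       O O O
"""

for sentence in parse_conll(sample):
    print(sentence)
```

The same function works for both .txt and .conll files, since only the content matters.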

JSONL (character offsets) — used in datav2/

Sentence-level format. One JSON object per line with character-position labels.

{"text": "Батболд гишүүн", "labels": [[0, 7, "PER"]]}
{"text": "Улаанбаатар хотод", "labels": [[0, 11, "LOC"]]}

A .json file holds a single JSON array or object wrapping everything, whereas a .jsonl file holds one independent JSON object per line, which makes it easier to stream and append to.
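A short sketch of reading JSONL records and checking their labels, assuming the offsets are end-exclusive in the same way as Python slicing (so `[0, 7]` covers the 7 characters of "Батболд"). The helper names `read_jsonl` and `entity_texts` are hypothetical.

```python
import json
from typing import Iterable, Iterator, List, Tuple


def read_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed record per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def entity_texts(record: dict) -> List[Tuple[str, str]]:
    """Return (surface text, label) for each labeled span in a record."""
    text = record["text"]
    entities = []
    for start, end, label in record["labels"]:
        # Assumption: offsets are end-exclusive and must lie inside the text.
        if not 0 <= start < end <= len(text):
            raise ValueError(f"span {start}-{end} is outside the text")
        entities.append((text[start:end], label))
    return entities


sample_lines = [
    '{"text": "Батболд гишүүн", "labels": [[0, 7, "PER"]]}',
    '{"text": "Улаанбаатар хотод", "labels": [[0, 11, "LOC"]]}',
]

for record in read_jsonl(sample_lines):
    print(entity_texts(record))
```

Slicing each span back out of the text like this is a cheap way to catch off-by-one offsets before training. For a real file, pass an open file handle (for example `open(path, encoding="utf-8")`) instead of `sample_lines`.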

Why two formats?

  • CoNLL was used for v1 (manual annotation + silver labeling). Standard format for NER datasets.
  • JSONL was used for v2 (synthetic data generation). Character offsets are easier to produce programmatically — no need to pre-tokenize.
  • Both get converted to BERT subword token labels during training. The Colab training code has a separate preprocessing function for each format (tokenize_conll_data() and tokenize_json_data()), but the model sees the same tensors.
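To illustrate why the two formats carry the same information, the sketch below converts a character-offset record into word-level BIO rows. It assumes whitespace tokenization and end-exclusive offsets, and the function `spans_to_bio` is hypothetical; it is not the project's tokenize_json_data().

```python
import re
from typing import List, Sequence, Tuple


def spans_to_bio(text: str, labels: Sequence[Sequence]) -> List[Tuple[str, str]]:
    """Convert [start, end, label] spans into (token, BIO tag) rows.

    Assumption: tokens are whitespace-separated, and a token that overlaps
    a span at all takes that span's label.
    """
    rows = []
    for match in re.finditer(r"\S+", text):
        tag = "O"
        for start, end, label in labels:
            if match.start() < end and match.end() > start:  # token overlaps span
                # The first token of a span gets B-, later ones get I-.
                prefix = "B" if match.start() <= start else "I"
                tag = f"{prefix}-{label}"
                break
        rows.append((match.group(), tag))
    return rows


record = {"text": "Улаанбаатар хотод", "labels": [[0, 11, "LOC"]]}
for token, tag in spans_to_bio(record["text"], record["labels"]):
    print(token, tag)
```

One caveat of this simplification: when a span covers only the stem of a suffixed word, the whole whitespace token is tagged, so the character offsets are strictly more precise than the word-level tags.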

Shared Validation & Test Sets

data/valid.txt and data/test.txt are used by both v1 and v2 training pipelines. This is intentional — using the same evaluation set allows fair comparison of model performance across different training approaches.

Directory Structure

data/ — CoNLL Training Data (v1 Pipeline)

| File | Size | Description |
| --- | --- | --- |
| train.txt | 2.5MB | Original manually-annotated gold training data |
| train_final.txt | 6.0MB | Final training file: train.txt plus auto-labeled silver data, with label fixes applied. This is what gets uploaded to Colab |
| valid.txt | 275KB | Validation split (shared across v1 and v2, used for early stopping) |
| test.txt | 307KB | Test split (shared across v1 and v2, used by eval/ scripts) |

datav2/ — JSONL Training Data (v2 Pipeline)

| File | Description |
| --- | --- |
| generate_training_data.py | Data generation script (run locally). Reads NER-dataset/ reference files and produces synthetic training sentences with Mongolian case suffixes, politician names, companies, etc. |
| training_v2_cells.py | Model training code (copy-paste into Google Colab cells). Loads train_v2_merged.jsonl for training and data/valid.txt for validation. Handles character-offset to BERT subword alignment |
| train_v2_merged.jsonl | Final v2 training file (20,696 sentences). All generated data merged and shuffled |
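The character-offset to subword alignment mentioned for training_v2_cells.py can be sketched roughly as follows. This is an illustrative version under stated assumptions, not the code in that file: `align_labels` is a hypothetical name, the subword offsets are made up for the example, and special tokens are assumed to carry the offset `(0, 0)`, which is how Hugging Face fast tokenizers report them when called with `return_offsets_mapping=True`.

```python
from typing import Dict, List, Sequence, Tuple

IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss


def align_labels(
    offsets: Sequence[Tuple[int, int]],
    spans: Sequence[Tuple[int, int, str]],
    label2id: Dict[str, int],
) -> List[int]:
    """Assign a label id to each subword from character-level spans."""
    label_ids = []
    for tok_start, tok_end in offsets:
        if tok_start == tok_end:  # special token such as [CLS] or [SEP]
            label_ids.append(IGNORE_INDEX)
            continue
        tag = "O"
        for start, end, label in spans:
            if tok_start >= start and tok_end <= end:  # subword lies inside span
                tag = ("B-" if tok_start == start else "I-") + label
                break
        label_ids.append(label2id[tag])
    return label_ids


# Hypothetical subword split of "Батболд гишүүн": [CLS] Бат ##болд гишүүн [SEP]
offsets = [(0, 0), (0, 3), (3, 7), (8, 14), (0, 0)]
spans = [(0, 7, "PER")]
label2id = {"O": 0, "B-PER": 1, "I-PER": 2}

print(align_labels(offsets, spans, label2id))
```

Here every subword inside an entity receives a label. Another common choice is to label only the first subword of each word and set the rest to -100; which one the project uses is not stated in this README.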

Intermediate per-entity JSONL files (per_names.jsonl, org_*.jsonl, etc.) are gitignored. Regenerate with:

cd Data/datav2 && python generate_training_data.py

NER-dataset/ — Reference Data for Data Generation

Source datasets used by generate_training_data.py to create synthetic training examples. Not used directly for training or at runtime.

| File | Description |
| --- | --- |
| NER_v1.0.json.gz | Base NER dataset (10,162 sentences, gzip-compressed JSON). Python can read it directly with gzip.open() |
| locations.json | All Mongolian administrative locations (279 entries: аймаг, сум, дүүрэг, хороо, хороолол) with parent hierarchy and coordinates |
| districts.csv | Flat list of 353 sums/districts with parent aimag. Supplements locations.json with 280 additional sums |
| mongolian_abbreviations.csv | Organization abbreviations (526 entries, e.g. МҮОХ → Монголын Үндэсний Олимпийн Хороо) |
| countries.csv | Country names for LOC entity generation |
| mongolian_news_demo.csv | Demo news articles (500 rows) for testing |

Large compressed files (mongolian_personal_names.csv.gz, mongolian_company_names.csv.gz, mongolian_clan_names.csv.gz) are gitignored — keep locally if needed for regenerating datav2.
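Reading a gzip-compressed JSON file needs no separate decompression step. The sketch below assumes the file holds a single JSON document; if NER_v1.0.json.gz turns out to be line-delimited, each line would need its own `json.loads()` call instead. `load_json_gz` is a hypothetical helper, and the demo writes a small temporary file rather than touching the real dataset.

```python
import gzip
import json
import tempfile
from pathlib import Path


def load_json_gz(path):
    """Load a gzip-compressed JSON document in text mode as UTF-8."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)


# Round-trip demo with made-up data, so the example runs anywhere.
demo = [{"text": "Батболд гишүүн", "labels": [[0, 7, "PER"]]}]
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.json.gz"
    with gzip.open(demo_path, "wt", encoding="utf-8") as handle:
        json.dump(demo, handle, ensure_ascii=False)
    loaded = load_json_gz(demo_path)

print(loaded == demo)
```

The same `gzip.open(..., "rt")` pattern applies to the gitignored .csv.gz files, with `csv.reader` in place of `json.load`.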

Training Pipelines

v1 (CoNLL)

train.txt (manual annotation)
    +
silver data (auto-labeled by model, high confidence)
    |
    v  [merge_train.py → fix_labels.py]
train_final.txt
    |
    v  Upload to Google Colab
    |
    v  Fine-tune BERT model
    |
    v  Push to HuggingFace Hub: Nomio4640/ner-mongolian

v2 (JSONL)

NER-dataset/ reference data (names, locations, abbreviations)
    |
    v  [generate_training_data.py] (run locally)
    |
intermediate JSONL files (gitignored, regenerable)
    |
    v  [merged + shuffled]
train_v2_merged.jsonl
    |
    v  Upload to Google Colab
    |
    v  [training_v2_cells.py] Fine-tune BERT model
    |
    v  Push to HuggingFace Hub: Nomio4640/ner-mongolian

Both pipelines use data/valid.txt for validation and data/test.txt for evaluation.

How to Regenerate Files

# Regenerate v2 intermediate JSONL files from reference data
cd Data/datav2 && python generate_training_data.py

# v1 intermediate files (merge_train.py, fix_labels.py) were removed from repo
# but can be recovered from git history if needed:
# git log --all --oneline -- scripts/