---
license: other
pretty_name: Multilingual Dictionary Data
size_categories:
- 1M<n<10M
configs:
- config_name: default
  data_files:
  - split: train
    path: hf_dataset//*.jsonl
---
Multilingual Dictionary Data
This is a cleaned and structured multilingual dictionary dataset prepared for publication on Hugging Face Datasets.
The final release data is stored under hf_dataset/. The original decoded exports in this repository were normalized, cleaned, filtered, and converted into JSONL shards that Hugging Face can load directly.
Dataset Structure
The Hugging Face dataset loader should read:
hf_dataset//*.jsonl
Each JSONL file contains one record per line. The final release schema is:
- id: Stable record identifier generated from source path, headword, and cleaned text.
- headword: Dictionary headword or lookup term.
- text: Cleaned dictionary body text.
- text_length: Character length of the cleaned text field.
- source_group: Top-level source group, such as main, GetWord, other, pro, qamus, Address, or national.
- source_name: Source dictionary name or subdirectory name.
- source_id: Compact source identifier.
- source_path: Relative source path of the original decoded JSONL file.
- entry_type: Entry type from the decoded source when available.
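The schema above can be checked with a few lines of standard-library Python. The following is a minimal sketch, not part of the release tooling: the helper names `iter_records` and `validate_record` are hypothetical, and it assumes that every field is a string except `text_length` (an integer) and that `entry_type` may be null when the decoded source did not provide one.

```python
import json
from pathlib import Path
from typing import Iterator

# Field names come from the schema list; the types are assumptions.
SCHEMA_FIELDS = {
    "id": str,
    "headword": str,
    "text": str,
    "text_length": int,
    "source_group": str,
    "source_name": str,
    "source_id": str,
    "source_path": str,
    "entry_type": str,
}


def iter_records(data_dir: str) -> Iterator[dict]:
    """Yield one dict per non-empty line from every .jsonl shard under data_dir."""
    for shard in sorted(Path(data_dir).rglob("*.jsonl")):
        with shard.open(encoding="utf-8") as handle:
            for line in handle:
                if line.strip():
                    yield json.loads(line)


def validate_record(record: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the record passed."""
    problems = []
    for name, expected in SCHEMA_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
            continue
        value = record[name]
        # Assumption: entry_type may be null when the source had no type.
        if name == "entry_type" and value is None:
            continue
        if not isinstance(value, expected):
            problems.append(f"wrong type for {name}: {type(value).__name__}")
    text, length = record.get("text"), record.get("text_length")
    if isinstance(text, str) and isinstance(length, int) and len(text) != length:
        problems.append("text_length does not match len(text)")
    return problems
```

For example, `[r["id"] for r in iter_records("hf_dataset") if validate_record(r)]` would collect the identifiers of any records that fail these checks.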
Data Cleaning
The final hf_dataset/ release directory is intentionally separate from intermediate working directories.
The cleaning pipeline performed the following operations:
- Removed clearly corrupted source files from the release set.
- Removed control characters and private-use Unicode characters from released text.
- Removed inline dictionary marker codes from released text.
- Removed empty or too-short placeholder records.
- Removed exact duplicate records within each source file.
- Dropped raw decoder metadata from the final release records.
- Kept source tracing fields so every released record can still be traced back to its decoded source file.
The final release records do not include raw_text, binary offsets, blob identifiers, or decoder-only metadata.
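Several of the cleaning operations listed above can be illustrated as follows. This is a sketch under stated assumptions and not the code in `scripts/build_clean_dataset.py`: the function names, the `MIN_TEXT_LENGTH` threshold, and the choice of `(headword, text)` as the duplicate key are all hypothetical. Newlines and tabs are kept because released text contains line breaks. Removal of inline marker codes is omitted because their format is not described here.

```python
import unicodedata

# Hypothetical threshold; the real pipeline's cutoff is not documented here.
MIN_TEXT_LENGTH = 2


def strip_unwanted_characters(text: str) -> str:
    """Drop control (Cc) and private-use (Co) characters, keeping newlines and tabs."""
    kept = []
    for ch in text:
        category = unicodedata.category(ch)
        if category == "Co":
            continue
        if category == "Cc" and ch not in "\n\t":
            continue
        kept.append(ch)
    return "".join(kept)


def clean_source_file(records, min_length: int = MIN_TEXT_LENGTH) -> list[dict]:
    """Clean the records of one source file and drop short or duplicate entries."""
    seen = set()
    cleaned = []
    for record in records:
        text = strip_unwanted_characters(record.get("text", "")).strip()
        if len(text) < min_length:
            continue  # empty or too-short placeholder
        key = (record.get("headword"), text)  # assumed definition of "exact duplicate"
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({**record, "text": text, "text_length": len(text)})
    return cleaned
```

Deduplication is scoped to a single call, which mirrors the "within each source file" wording: the same entry appearing in two different source files would be kept in both.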
Release Statistics
Current final release statistics:
- Data files: 267 JSONL shards.
- Input records considered: 4,887,244.
- Released records: 4,886,829.
- Removed empty or too-short records: 137.
- Removed exact duplicates within source files: 278.
- Corrupted source files excluded: 2.
Records by source group:
- main: 3,084,918
- GetWord: 1,712,181
- pro: 55,335
- other: 16,504
- qamus: 16,290
- Address: 1,078
- national: 523
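The figures above are internally consistent, which can be confirmed with a short arithmetic check. The variable names are illustrative; the numbers are copied from the lists above.

```python
# Counts copied from the release statistics.
input_records = 4_887_244
removed_short = 137
removed_duplicates = 278
released_records = 4_886_829

records_by_group = {
    "main": 3_084_918,
    "GetWord": 1_712_181,
    "pro": 55_335,
    "other": 16_504,
    "qamus": 16_290,
    "Address": 1_078,
    "national": 523,
}

# Input minus both removal categories should equal the released total.
assert input_records - removed_short - removed_duplicates == released_records

# The per-group counts should add up to the released total.
assert sum(records_by_group.values()) == released_records
```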
Usage
After the dataset is uploaded to Hugging Face, it can be loaded with:
from datasets import load_dataset

dataset = load_dataset("YOUR_USERNAME/dictionary_data")
print(dataset["train"][0])
A typical record looks like:
{
  "id": "bdef9ba6690a5ba0",
  "headword": "ئا",
  "text": "دۇنيا يەر - جاي ناملىرى\n拉帕克\nlā pà kè",
  "text_length": 36,
  "source_group": "main",
  "source_name": "uy_han",
  "source_id": "main/uy_han",
  "source_path": "main/uy_han/entries.jsonl",
  "entry_type": "text"
}
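Because each record is a plain dictionary, simple summaries need nothing beyond the standard library. The sketch below uses hypothetical helper names and accepts any iterable of record dictionaries, such as the `train` split after loading or records read directly from the JSONL shards.

```python
from collections import Counter
from typing import Iterable


def count_by_source_group(records: Iterable[dict]) -> Counter:
    """Count how many records belong to each source_group."""
    return Counter(record["source_group"] for record in records)


def find_headword(records: Iterable[dict], headword: str) -> list[dict]:
    """Return every record whose headword matches exactly."""
    return [record for record in records if record["headword"] == headword]
```

A linear scan like `find_headword` is adequate for spot checks. For repeated lookups over several million records, building a dictionary keyed by `headword` once would likely be a better choice.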
Rebuilding the Dataset
The final release can be rebuilt from the decoded source directories with:
python scripts/prepare_dataset.py --repo-root . --out-dir hf_data_base --report-dir reports
python scripts/build_clean_dataset.py --input-dir hf_data_base --output-dir hf_dataset
python scripts/audit_dataset.py --data-dir hf_dataset --report-dir reports
The intermediate hf_data_base/ directory is not intended as the release dataset. Only hf_dataset/ is referenced by the Hugging Face dataset card.
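If the three rebuild steps need to run unattended, they can be chained from Python so that a failure in one step stops the rest. This is only a convenience sketch: the wrapper functions are hypothetical, while the script paths and flags are the ones given in the commands above.

```python
import subprocess
import sys


def build_pipeline_commands(
    repo_root: str = ".",
    base_dir: str = "hf_data_base",
    out_dir: str = "hf_dataset",
    report_dir: str = "reports",
) -> list[list[str]]:
    """Return the three rebuild commands as argument lists, in execution order."""
    python = sys.executable  # run the scripts with the current interpreter
    return [
        [python, "scripts/prepare_dataset.py", "--repo-root", repo_root,
         "--out-dir", base_dir, "--report-dir", report_dir],
        [python, "scripts/build_clean_dataset.py", "--input-dir", base_dir,
         "--output-dir", out_dir],
        [python, "scripts/audit_dataset.py", "--data-dir", out_dir,
         "--report-dir", report_dir],
    ]


def run_pipeline(commands: list[list[str]]) -> None:
    """Run each command in turn; check=True raises CalledProcessError on failure."""
    for command in commands:
        subprocess.run(command, check=True)
```

Calling `run_pipeline(build_pipeline_commands())` from the repository root would then be equivalent to running the three commands by hand.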
Quality Reports
Generated reports are available under reports/:
- reports/quality_report.json: Latest audit summary.
- reports/quality_samples.jsonl: Sample records for detected quality warnings.
- reports/errors.jsonl: Records that could not be parsed during preprocessing.
The current audit of hf_dataset/ reports no corrupted sources, no control-character issues, no private-use character issues, no replacement-character issues, and no too-short text records. Remaining very_long_text warnings are retained because they correspond to valid long dictionary or encyclopedia-style entries.
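The kinds of checks named in the audit summary can be approximated per record as shown below. This is a hedged sketch, not the logic of `scripts/audit_dataset.py`: only the `very_long_text` warning name comes from this document, while the other warning names, the function name, and both thresholds are assumptions chosen for illustration.

```python
import unicodedata

# Hypothetical thresholds; the audit script's actual limits are not documented here.
MIN_TEXT_LENGTH = 2
VERY_LONG_TEXT = 20_000


def audit_text(text: str) -> list[str]:
    """Return the quality warnings that apply to one cleaned text value."""
    warnings = []
    # Newlines and tabs are expected in entries, so they are not counted as control characters.
    categories = {unicodedata.category(ch) for ch in text if ch not in "\n\t"}
    if "Cc" in categories:
        warnings.append("control_character")
    if "Co" in categories:
        warnings.append("private_use_character")
    if "\ufffd" in text:
        warnings.append("replacement_character")
    if len(text.strip()) < MIN_TEXT_LENGTH:
        warnings.append("too_short_text")
    if len(text) > VERY_LONG_TEXT:
        warnings.append("very_long_text")
    return warnings
```

Under these assumptions, a release that matches the audit summary would produce only `very_long_text` warnings, if any.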
License
The license is currently marked as other in the dataset metadata.
Before public release, verify that the source dictionary data can legally be redistributed on Hugging Face and update this section with the exact license terms.