---
license: other
pretty_name: Multilingual Dictionary Data
size_categories:
- 1M<n<10M
configs:
- config_name: default
  data_files:
  - split: train
    path: hf_dataset//*.jsonl
---

Multilingual Dictionary Data

This is a cleaned and structured multilingual dictionary dataset prepared for publication on Hugging Face Datasets.

The final release data is stored under hf_dataset/. The original decoded exports in this repository were normalized, cleaned, filtered, and converted into JSONL shards that Hugging Face can load directly.

Dataset Structure

The Hugging Face dataset loader should read:

hf_dataset//*.jsonl

Each JSONL file contains one record per line. The final release schema is:

id: Stable record identifier generated from source path, headword, and cleaned text.
headword: Dictionary headword or lookup term.
text: Cleaned dictionary body text.
text_length: Character length of the cleaned text field.
source_group: Top-level source group, such as main, GetWord, other, pro, qamus, Address, or national.
source_name: Source dictionary name or subdirectory name.
source_id: Compact source identifier.
source_path: Relative source path of the original decoded JSONL file.
entry_type: Entry type from the decoded source when available.
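The schema above can be checked mechanically. The following is a minimal sketch, not part of the release tooling: the function name `check_record` is hypothetical, and it assumes that every field except `text_length` is a string, that `entry_type` may be null when the source did not provide one, and that `text_length` counts Unicode code points as Python's `len()` does.

```python
# Field names are taken from the schema list; the types are assumptions.
REQUIRED_STR_FIELDS = (
    "id", "headword", "text",
    "source_group", "source_name", "source_id", "source_path",
)

def check_record(record: dict) -> list:
    """Return a list of problems found in one record; an empty list means it passed."""
    problems = []
    for name in REQUIRED_STR_FIELDS:
        if not isinstance(record.get(name), str):
            problems.append(f"{name}: expected a string")

    # Assumption: entry_type is optional because it is only set "when available".
    entry_type = record.get("entry_type")
    if entry_type is not None and not isinstance(entry_type, str):
        problems.append("entry_type: expected a string or null")

    # text_length should agree with the cleaned text it describes.
    text, length = record.get("text"), record.get("text_length")
    if not isinstance(length, int):
        problems.append("text_length: expected an integer")
    elif isinstance(text, str) and length != len(text):
        problems.append(f"text_length: {length} does not match len(text) = {len(text)}")
    return problems
```

Running `check_record` over every line of a shard and collecting the non-empty results gives a quick sanity check before upload.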

Data Cleaning

The final hf_dataset/ release directory is intentionally separate from intermediate working directories.

The cleaning pipeline performed the following operations:

Removed clearly corrupted source files from the release set.
Removed control characters and private-use Unicode characters from released text.
Removed inline dictionary marker codes from released text.
Removed empty or too-short placeholder records.
Removed exact duplicate records within each source file.
Dropped raw decoder metadata from the final release records.
Kept source tracing fields so every released record can still be traced back to its decoded source file.

The final release records do not include raw_text, binary offsets, blob identifiers, or decoder-only metadata.
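Several of the steps above can be illustrated in a few lines. This is a sketch under stated assumptions, not the code in scripts/build_clean_dataset.py: the names `clean_text`, `clean_records`, and `MIN_TEXT_LENGTH` are hypothetical, the length threshold is a placeholder because the real cutoff is not stated here, newlines and tabs are kept on the assumption that they are meaningful layout, and duplicates are keyed on headword plus cleaned text. Marker-code removal is omitted because the marker format is not described.

```python
import unicodedata

MIN_TEXT_LENGTH = 2  # hypothetical threshold; the actual cutoff is not documented here

def clean_text(text: str) -> str:
    """Drop control (Cc) and private-use (Co) characters, keeping newlines and tabs."""
    kept = []
    for ch in text:
        if ch in "\n\t":
            kept.append(ch)  # assumption: line breaks and tabs are legitimate layout
        elif unicodedata.category(ch) in ("Cc", "Co"):
            continue
        else:
            kept.append(ch)
    return "".join(kept).strip()

def clean_records(records):
    """Clean the records of one source file and drop short or duplicate entries."""
    seen = set()
    cleaned = []
    for record in records:
        text = clean_text(record.get("text", ""))
        if len(text) < MIN_TEXT_LENGTH:
            continue  # empty or too-short placeholder
        key = (record.get("headword", ""), text)  # assumed definition of "exact duplicate"
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({**record, "text": text, "text_length": len(text)})
    return cleaned
```

Because `seen` is created per call, calling `clean_records` once per source file removes duplicates within a file only, which matches the scope described above.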

Release Statistics

Current final release statistics:

Data files: 267 JSONL shards.
Input records considered: 4,887,244.
Released records: 4,886,829.
Removed empty or too-short records: 137.
Removed exact duplicates within source files: 278.
Corrupted source files excluded: 2.

Records by source group:

main: 3,084,918
GetWord: 1,712,181
pro: 55,335
other: 16,504
qamus: 16,290
Address: 1,078
national: 523
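The figures above are internally consistent, which can be confirmed with a short check. The variable names are illustrative; the numbers are copied from the two lists.

```python
input_records = 4_887_244
removed_short = 137
removed_duplicates = 278
released_records = 4_886_829

records_by_group = {
    "main": 3_084_918,
    "GetWord": 1_712_181,
    "pro": 55_335,
    "other": 16_504,
    "qamus": 16_290,
    "Address": 1_078,
    "national": 523,
}

# Input minus both removal counts should equal the released total.
assert input_records - removed_short - removed_duplicates == released_records

# The per-group counts should add up to the released total as well.
assert sum(records_by_group.values()) == released_records
```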

Usage

After the dataset is uploaded to Hugging Face, it can be loaded with:

from datasets import load_dataset

dataset = load_dataset("YOUR_USERNAME/dictionary_data")
print(dataset["train"][0])
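For a local checkout, the shards can also be read with the standard library alone. The sketch below assumes only what this card states, namely that each shard is a UTF-8 JSONL file with one record per line somewhere under the data directory; the helper names `iter_records` and `count_by_group` are hypothetical.

```python
import json
from collections import Counter
from pathlib import Path

def iter_records(data_dir):
    """Yield every record from every .jsonl file below data_dir, in sorted path order."""
    for path in sorted(Path(data_dir).rglob("*.jsonl")):
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:  # skip blank lines
                    yield json.loads(line)

def count_by_group(data_dir):
    """Count records per source_group across all shards."""
    return Counter(record["source_group"] for record in iter_records(data_dir))
```

Calling `count_by_group("hf_dataset")` on the release directory should reproduce the per-group totals listed under Release Statistics.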

A typical record looks like:

{
  "id": "bdef9ba6690a5ba0",
  "headword": "ئا",
  "text": "دۇنيا يەر - جاي ناملىرى\n拉帕克\nlā pà kè",
  "text_length": 36,
  "source_group": "main",
  "source_name": "uy_han",
  "source_id": "main/uy_han",
  "source_path": "main/uy_han/entries.jsonl",
  "entry_type": "text"
}
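The id in the record above is 16 hexadecimal characters, and the schema says it is derived from the source path, headword, and cleaned text. This card does not say which hash or separator is used, so the following is only one plausible scheme: the function name `make_record_id`, the choice of SHA-1, the unit-separator delimiter, and the truncation to 16 characters are all assumptions, and the output is not expected to match the released ids.

```python
import hashlib

def make_record_id(source_path: str, headword: str, text: str) -> str:
    """Derive a deterministic 16-hex-character id from the three identifying fields."""
    # Hypothetical scheme: join with a separator unlikely to occur in the fields,
    # hash the UTF-8 bytes, and keep the first 16 hex digits.
    payload = "\x1f".join((source_path, headword, text)).encode("utf-8")
    return hashlib.sha1(payload).hexdigest()[:16]
```

Any scheme of this shape gives the same id on every rebuild as long as the three inputs are unchanged, which is what makes the identifier stable.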

Rebuilding the Dataset

The final release can be rebuilt from the decoded source directories with:

python scripts/prepare_dataset.py --repo-root . --out-dir hf_data_base --report-dir reports
python scripts/build_clean_dataset.py --input-dir hf_data_base --output-dir hf_dataset
python scripts/audit_dataset.py --data-dir hf_dataset --report-dir reports

The intermediate hf_data_base/ directory is not intended as the release dataset. Only hf_dataset/ is referenced by the Hugging Face dataset card.

Quality Reports

Generated reports are available under reports/:

reports/quality_report.json: Latest audit summary.
reports/quality_samples.jsonl: Sample records for detected quality warnings.
reports/errors.jsonl: Records that could not be parsed during preprocessing.

The current audit of hf_dataset/ reports no corrupted sources, no control-character issues, no private-use character issues, no replacement-character issues, and no too-short text records. Remaining very_long_text warnings are retained because they correspond to valid long dictionary or encyclopedia-style entries.

License

The license is currently marked as other in the dataset metadata.

Before public release, verify that the source dictionary data can legally be redistributed on Hugging Face and update this section with the exact license terms.
