NaturalnessEncoder

NaturalnessEncoder is a DeBERTa-v3-base encoder fine-tuned with LoRA and an ordinal supervised contrastive objective on HU-Nat, an 8-level naturalness spectrum spanning machine-translation output, learner English, and graded native prose. On the held-out test split, a frozen-embedding ridge probe recovers the ordinal label at Spearman ρ = +0.941 and pairwise cosine geometry tracks label proximity at ρ = +0.891.

The encoder maps English text into a 128-dimensional space where cosine similarity reflects proximity on the naturalness spectrum: texts with similar degrees of fluency, idiomaticity, and distributional fit sit close together; texts that differ strongly are pushed apart.

It is intended for relative comparison, retrieval, ranking, and corpus-level analysis. It is not a hard classifier and should not be used as an uncalibrated absolute score.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("foudil/lens-naturalness-encoder")

# Same factual content rewritten at three native readability/register levels
sentences = [
    "Scientists have found a new kind of small creature in the deep sea. It is very different from anything they have seen before.",
    "Researchers have discovered a new species of small organism living in the deep ocean, unlike anything previously identified.",
    "Marine biologists have identified a novel taxonomic group of microorganisms inhabiting the abyssal zone, distinct from any extant phylogenetic lineage.",
]
emb = model.encode(sentences)
print(model.similarity(emb, emb).numpy().round(3))
# [[1.000  0.702  0.433]
#  [0.702  1.000  0.924]
#  [0.433  0.924  1.000]]
# → Adjacent native levels are closer than distant ones; the gradient is monotonic.

Why this model exists

Naturalness is the degree to which a text conforms to the surface, syntactic, and idiomatic regularities of fluent English, considered separately from whether it preserves a source meaning, expresses a particular sentiment, or belongs to a specific topic.

Existing tools only partially address this problem:

MT evaluation metrics such as COMET or BLEURT are designed around source/reference settings and often mix fluency with adequacy.
Language-model perplexity is a one-dimensional proxy that can penalise rare but natural domains, reward generic low-entropy text, and inherit the distributional biases of the language model used.
Semantic encoders are optimised to preserve meaning, not fluency or idiomaticity.

NaturalnessEncoder is trained with an ordinal supervised contrastive objective. Pairs whose labels are close on an 8-level empirical spectrum are pulled together; pairs at distant levels are pushed apart. Attraction strength decays with ordinal distance, so the model learns a continuous metric space rather than a set of discrete categories.
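The weighting idea can be illustrated with a small, dependency-free sketch. It is not the training code: the decay form `exp(-|y_i - y_j| / decay)`, the temperature value, and the function names `ordinal_weight` and `ordinal_supcon_loss` are assumptions chosen for illustration, since this card does not state the exact loss formula.

```python
import math

def ordinal_weight(y_i, y_j, decay=1.0):
    """Hypothetical positive weight: 1 for equal labels, decaying with ordinal distance."""
    return math.exp(-abs(y_i - y_j) / decay)

def ordinal_supcon_loss(embeddings, labels, temperature=0.1, decay=1.0):
    """Weighted SupCon-style loss over unit-length embeddings (illustrative form only)."""
    n = len(embeddings)
    total = 0.0
    for i in range(n):
        others = [j for j in range(n) if j != i]
        # For unit vectors the dot product is the cosine similarity.
        logits = [
            sum(a * b for a, b in zip(embeddings[i], embeddings[j])) / temperature
            for j in others
        ]
        # Numerically stable log-sum-exp over all other items in the batch.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        weights = [ordinal_weight(labels[i], labels[j], decay) for j in others]
        # Weighted mean of log-probabilities: pairs with nearby labels dominate.
        total += -sum(w * (l - log_denom) for w, l in zip(weights, logits)) / sum(weights)
    return total / n

# Toy batch with labels 0, 1, and 7 on the unit circle.
labels = [0, 1, 7]
near = [math.cos(0.2), math.sin(0.2)]
good = [[1.0, 0.0], near, [-1.0, 0.0]]   # levels 0 and 1 adjacent, level 7 far away
bad = [[1.0, 0.0], [-1.0, 0.0], near]    # levels 0 and 7 adjacent, level 1 far away

loss_good = ordinal_supcon_loss(good, labels)
loss_bad = ordinal_supcon_loss(bad, labels)
print(loss_good < loss_bad)
```

Under these assumptions, a layout that places ordinally close labels near each other receives a lower loss than one that places distant labels together, which is the behaviour described above.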

Model details

Property	Value
Base model	`microsoft/deberta-v3-base`
Architecture	DeBERTa-v3-base + masked mean pooling + 2-layer MLP projection head (768→768 GELU, 768→128)
Adaptation	LoRA (r=32, α=64) applied to all `nn.Linear` modules in the backbone, merged for release
Output dimension	128, ℓ₂-normalized
Similarity function	Cosine similarity
Max sequence length	100 tokens
Training data	`foudil/HU-Nat`: 52,694 train / 13,174 test, 8-level ordinal labels
Training objective	Ordinal supervised contrastive learning with exponential-decay positive weighting
Language	English

The training data is HU-Nat, the Naturalness Spectrum dataset: an 8-level ordinal corpus assembled from WMT06 machine-translation outputs, W&I+LOCNESS learner English, and OneStopEnglish graded native prose. See the dataset card for source details, label construction, splits, and confound audit.
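The pooling and normalization steps named in the architecture row can be sketched in plain Python. This is a simplified illustration with hypothetical helper names (`masked_mean_pool`, `l2_normalize`). The 2-layer projection head, which sits between pooling and normalization in the released model, is omitted here.

```python
import math

def masked_mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask is 1, ignoring padding positions."""
    dim = len(token_vectors[0])
    kept = [v for v, m in zip(token_vectors, attention_mask) if m == 1]
    count = max(len(kept), 1)  # guard against an all-padding row
    return [sum(v[d] for v in kept) / count for d in range(dim)]

def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit length so that dot products equal cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, eps) for x in vec]

# Two real tokens followed by one padding token that must not affect the result.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = masked_mean_pool(tokens, mask)
unit_vec = l2_normalize(pooled)
print(pooled, unit_vec)
```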

Evaluation and dataset audit

Evaluation was run on the held-out test split of 13,174 sentences. The purpose of these probes is twofold:

measure whether the encoder recovers the intended ordinal structure; and
audit whether that structure is reducible to unrelated semantic or affective side signals.

The semantic and affective models below are therefore audit baselines, not competing naturalness systems. They test whether a generic meaning encoder or an emotion/tone encoder can explain the same label geometry without being trained for naturalness.

Global probes

Probe	Metric	NaturalnessEncoder	Semantic audit encoder (mpnet)	Affective audit encoder (RoBERTa)
Frozen-embedding ridge probe	Spearman ρ with ordinal label	+0.941	+0.697	+0.582
Frozen-embedding ridge probe	Kendall τ with ordinal label	+0.829	+0.531	+0.431
Pairwise cosine geometry probe	Spearman ρ with graded label similarity	+0.891	+0.066	−0.005
Pairwise cosine geometry probe	Pearson r with graded label similarity	+0.909	+0.066	−0.021
Pairwise cosine geometry probe	Mean far-vs-near cosine gap	+0.570	+0.030	+0.034

The global ridge scores for the audit encoders should be read cautiously: a linear probe can exploit corpus-level regularities that co-vary with the labels, including topic, genre, readability, learner level, or corpus source. The stricter pairwise cosine geometry probes show that generic semantic and affective spaces do not naturally reproduce the ordinal naturalness geometry.
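A pairwise cosine geometry probe of this kind can be sketched as follows. The exact definition of graded label similarity is not given in this card, so the sketch assumes `1 - |y_i - y_j| / 7`; the helper names and the toy embeddings are likewise illustrative and do not reproduce the reported evaluation.

```python
import math
from itertools import combinations

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

def spearman(x, y):
    """Spearman rho is the Pearson correlation of the rank-transformed values."""
    return pearson(average_ranks(x), average_ranks(y))

def geometry_probe(embeddings, labels, num_levels=8):
    cos_sims, label_sims = [], []
    for i, j in combinations(range(len(labels)), 2):
        cos_sims.append(cosine(embeddings[i], embeddings[j]))
        # Assumed graded similarity: 1 for equal labels, 0 for opposite ends of the scale.
        label_sims.append(1 - abs(labels[i] - labels[j]) / (num_levels - 1))
    return spearman(cos_sims, label_sims)

# Toy embeddings laid out on an arc so that angular distance grows with label distance.
labels = list(range(8))
embeddings = [[math.cos(0.3 * y), math.sin(0.3 * y)] for y in labels]
rho = geometry_probe(embeddings, labels)
print(f"{rho:+.3f}")
```

Unlike a ridge probe, this check fits no parameters, so a high value cannot come from a learned linear readout exploiting corpus-level regularities.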

Intra-source replication

Source group	NaturalnessEncoder ρ, frozen ridge probe	NaturalnessEncoder ρ, pairwise cosine geometry probe
MT / WMT06 (2 levels)	+0.138	+0.008
Learner / W&I+LOCNESS (3 levels)	+0.600	+0.222
Native / OneStopEnglish (3 levels)	+0.827	+0.577

The native/OneStopEnglish subset is the cleanest internal replication because its Elementary, Intermediate, and Advanced versions are rewrites of the same articles. This holds topic and broad communicative content relatively constant while varying readability, syntactic complexity, and surface realisation.

The learner subset provides a weaker but still meaningful replication across CEFR-related learner levels. The MT subset has little within-source resolution, which is expected given that it contains only two levels and that WMT-style fluency judgements are noisy and system-dependent.

What the audit does and does not show

These results support the claim that the learned space is not merely a semantic-similarity space or an affective-tone space. They do not prove that the labels isolate a pure construct called naturalness. The source corpora differ in genre, collection method, annotation scheme, and surface conventions, so residual corpus identity signal may remain.

The intended interpretation is therefore conservative: NaturalnessEncoder learns the empirical ordering induced by this dataset and recovers that ordering most reliably where the source design best controls content, especially within OneStopEnglish.

On reading the geometry

The encoder learns a non-uniform metric. Distances between the MT, learner, and native macro-bands are much larger than distances between adjacent levels inside a band. As a consequence:

sentences from different macro-bands may have cosine similarity close to zero or below zero;
sentences from the same macro-band may have cosine similarity above 0.9 even when their labels differ by one rank;
raw cosine values should be treated as ordinal evidence, not interval-scaled naturalness scores.

This reflects the empirical structure of the training data rather than a guarantee that the 8 labels are equally spaced. Applications that require a scalar score should fit a calibration layer or monotone transformation on an appropriate validation set.
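One possible monotone transformation is isotonic regression, sketched here with the pool-adjacent-violators algorithm. The choice of technique, the function names, and the toy scores are assumptions for illustration only. In practice the raw scores would come from the encoder (for example, cosine similarity to an anchor) and the targets from a labelled validation set.

```python
import bisect

def isotonic_fit(scores, targets):
    """Pool-adjacent-violators: returns (sorted scores, non-decreasing fitted values)."""
    pairs = sorted(zip(scores, targets))
    # Each block stores [sum of targets, count]; merge while block means decrease.
    blocks = []
    for _, t in pairs:
        blocks.append([t, 1])
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return [s for s, _ in pairs], fitted

def isotonic_predict(x, xs, fitted):
    """Step-function lookup: use the fitted value of the nearest calibration score at or below x."""
    idx = min(max(bisect.bisect_right(xs, x) - 1, 0), len(xs) - 1)
    return fitted[idx]

# Toy calibration set: raw scores are mostly, but not perfectly, ordered by level.
raw_scores = [-0.3, 0.0, 0.1, 0.8, 0.85, 0.9]
levels = [0, 2, 1, 5, 7, 6]

xs, fitted = isotonic_fit(raw_scores, levels)
calibrated = isotonic_predict(0.5, xs, fitted)
print(fitted, calibrated)
```

The fitted map preserves the ordering of the raw scores while absorbing the non-uniform spacing between bands, which is why a monotone fit is a natural companion to an ordinal metric.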

Usage

Within-band ordering

The opening example shows the intended within-band behaviour: versions of the same content at adjacent native levels are closer than versions at distant levels.

Cross-band ranking against a native anchor

anchor = "The committee agreed that the proposal warranted further investigation before any decision could be made."
corpus = [
    "After deliberation, the committee determined that the proposal merited additional review prior to reaching a decision.",
    "The committee said the idea needs more looking into before deciding anything.",
    "The committee they agreed that the proposal need more investigate before any decision can be make.",
    "Committee agreed proposal need more investigation before make decision any.",
    "committee committee agree more investigation proposal decision before",
]

scores = model.similarity(model.encode([anchor]), model.encode(corpus))[0]
for s, t in sorted(zip(scores.tolist(), corpus), reverse=True):
    print(f"{s:+.3f}  {t[:80]}")
# +0.836  The committee said the idea needs more looking into...
# +0.821  After deliberation, the committee determined...
# +0.010  The committee they agreed that the proposal need more investigate...
# -0.114  Committee agreed proposal need more investigation before make decision any.
# -0.279  committee committee agree more investigation proposal decision before

The ranking is monotonic across the broad spectrum. In this example, both native paraphrases rank above learner-like and MT-like variants, and the simpler native paraphrase slightly outranks the more formal one. This suggests that the model is not simply rewarding formality, although formality and lexical sophistication can still act as side signals in some domains.

Scalar score by calibration

For applications that require a scalar score, fit a calibration procedure on representative data. One simple option is projection onto a spectrum axis computed from calibration centroids:

# axis = F.normalize(centroid_lvl7 - centroid_lvl0, dim=0)  # precompute on calibration data
# score = emb @ axis  # scalar, then calibrate or rescale downstream

On the held-out test set, this projected score correlates with the integer label at Spearman ρ = +0.891. A frozen-embedding ridge probe, which is a more flexible linear functional, reaches ρ = +0.941 on the same split, as reported in the global probes table above. Either number should be interpreted as in-distribution recovery of the dataset labels, not as a universal absolute naturalness score.
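The centroid-axis projection can be written out as a small self-contained sketch. The helper names (`mean_vector`, `spectrum_axis`, `project`) and the 2-dimensional toy embeddings are hypothetical. With the real model, the calibration embeddings would be the 128-dimensional outputs of `model.encode` on labelled calibration data.

```python
import math

def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def unit(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def spectrum_axis(calib_embeddings, calib_labels, low=0, high=7):
    """Unit vector pointing from the lowest-level centroid to the highest-level centroid."""
    c_low = mean_vector([e for e, y in zip(calib_embeddings, calib_labels) if y == low])
    c_high = mean_vector([e for e, y in zip(calib_embeddings, calib_labels) if y == high])
    return unit([h - l for h, l in zip(c_high, c_low)])

def project(embedding, axis):
    """Scalar position of an embedding along the spectrum axis (uncalibrated)."""
    return sum(a * b for a, b in zip(embedding, axis))

# Toy calibration data: two level-0 and two level-7 embeddings, all unit length.
calib_embeddings = [[-1.0, 0.0], [-0.8, 0.6], [1.0, 0.0], [0.8, 0.6]]
calib_labels = [0, 0, 7, 7]

axis = spectrum_axis(calib_embeddings, calib_labels)
score = project([0.6, 0.8], axis)
print(axis, score)
```

The projected value is only a position along one direction of the space, so it still needs rescaling or calibration before it is reported as a score.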

Intended use

Naturalness-aware ranking, retrieval, or filtering of English text
Auxiliary signal for MT, paraphrase, or text-generation evaluation where source/reference information is unavailable
Feature input for downstream systems that need a graded representation of fluency or idiomaticity
Corpus-level analysis of register, learner-language, or generation artifacts

Out-of-scope use

Hard classification. The model produces graded embeddings, not category labels.
Uncalibrated absolute scoring. Cosine similarities are most meaningful comparatively; scalar scores require calibration.
Non-English text. The model has not been trained or evaluated on other languages.
High-stakes assessment of writers. The model is not validated for individual proficiency assessment, grading, hiring, diagnosis, or pedagogy.
Literary or highly marked registers. Fiction, poetry, rhetorical prose, dialectal writing, and deliberately non-standard styles are underrepresented or absent.
Adequacy or factuality evaluation. A text can be natural while mistranslating, hallucinating, or contradicting a source.

Limitations

Dataset-derived construct. The model learns the ordering induced by HU-Nat. That ordering is useful but not a complete theory of naturalness.
Corpus-source confounds. The spectrum combines MT output, learner essays, and graded native prose. Corpus identity, genre, topic distribution, annotation practice, and preprocessing artifacts may contribute residual signal.
Upper-band side signal. In OneStopEnglish, readability level is partly expressed through lexical and syntactic sophistication. More complex native prose may therefore score higher than simpler but equally fluent prose.
Lower-band noise. WMT06 fluency labels and learner-level labels are coarser and noisier than the native readability triplets, limiting within-band resolution.
Non-uniform spacing. The learned metric is not equally spaced across the 8 labels. Treat it as ordinal unless calibrated.
Surface artifacts. Punctuation, capitalisation, tokenisation, and MT-pipeline artifacts may carry signal beyond the intended construct.
Domain coverage. Training data does not cover all natural English domains, dialects, or styles.

Licence and source-data terms

The model weights are released under Apache 2.0. This applies to the released parameters only, not to the training data.

HU-Nat aggregates material derived from WMT06, W&I+LOCNESS, and OneStopEnglish, and each source remains governed by its original licence or terms of use. In particular, OneStopEnglish is released under CC BY-SA 4.0, and the BEA-2019 W&I+LOCNESS release is subject to non-commercial-use terms. Users who redistribute the underlying data, or derivatives of it, are responsible for complying with those source terms; the Apache 2.0 grant on the weights does not relicense the source corpora.

Downstream users should preserve attribution to all source corpora and consult the HU-Nat dataset card before redistributing data derived from the sources.

Citation

If you use this model, please cite the three constituent source corpora below. Citations for the accompanying article and the HU-Nat dataset will be added on publication.

@inproceedings{koehn-monz-2006-manual,
  title     = {Manual and Automatic Evaluation of Machine Translation between {E}uropean Languages},
  author    = {Koehn, Philipp and Monz, Christof},
  booktitle = {Proceedings on the Workshop on Statistical Machine Translation},
  year      = {2006},
  month     = jun,
  address   = {New York City},
  publisher = {Association for Computational Linguistics},
  pages     = {102--121},
  url       = {https://aclanthology.org/W06-3114/}
}

@inproceedings{bryant-etal-2019-bea,
  title     = {The {BEA}-2019 Shared Task on Grammatical Error Correction},
  author    = {Bryant, Christopher and Felice, Mariano and Andersen, {\O}istein E. and Briscoe, Ted},
  booktitle = {Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications},
  year      = {2019},
  month     = aug,
  address   = {Florence, Italy},
  publisher = {Association for Computational Linguistics},
  pages     = {52--75},
  doi       = {10.18653/v1/W19-4406},
  url       = {https://aclanthology.org/W19-4406/}
}

@inproceedings{vajjala-lucic-2018-onestopenglish,
  title     = {{OneStopEnglish} corpus: A new corpus for automatic readability assessment and text simplification},
  author    = {Vajjala, Sowmya and Lu{\v{c}}i{\'c}, Ivana},
  booktitle = {Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications},
  year      = {2018},
  month     = jun,
  address   = {New Orleans, Louisiana},
  publisher = {Association for Computational Linguistics},
  pages     = {297--304},
  doi       = {10.18653/v1/W18-0535},
  url       = {https://aclanthology.org/W18-0535/}
}

Framework versions

Python 3.9.25
transformers 4.57.6
PyTorch 2.8.0
Tokenizers 0.22.2
