NaturalnessEncoder
NaturalnessEncoder is a DeBERTa-v3-base encoder fine-tuned with LoRA and an ordinal supervised contrastive objective on HU-Nat, an 8-level naturalness spectrum spanning machine-translation output, learner English, and graded native prose. On the held-out test split, a frozen-embedding ridge probe recovers the ordinal label at Spearman ρ = +0.941 and pairwise cosine geometry tracks label proximity at ρ = +0.891.
The encoder maps English text into a 128-dimensional space where cosine similarity reflects proximity on the naturalness spectrum: texts with similar degrees of fluency, idiomaticity, and distributional fit sit close together; texts that differ strongly are pushed apart.
It is intended for relative comparison, retrieval, ranking, and corpus-level analysis. It is not a hard classifier and should not be used as an uncalibrated absolute score.
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("foudil/lens-naturalness-encoder")
# Same factual content rewritten at three native readability/register levels
sentences = [
    "Scientists have found a new kind of small creature in the deep sea. It is very different from anything they have seen before.",
    "Researchers have discovered a new species of small organism living in the deep ocean, unlike anything previously identified.",
    "Marine biologists have identified a novel taxonomic group of microorganisms inhabiting the abyssal zone, distinct from any extant phylogenetic lineage.",
]
emb = model.encode(sentences)
print(model.similarity(emb, emb).numpy().round(3))
# [[1.000 0.702 0.433]
# [0.702 1.000 0.924]
# [0.433 0.924 1.000]]
# → Adjacent native levels are closer than distant ones; the gradient is monotonic.
Why this model exists
Naturalness is the degree to which a text conforms to the surface, syntactic, and idiomatic regularities of fluent English, considered separately from whether it preserves a source meaning, expresses a particular sentiment, or belongs to a specific topic.
Existing tools only partially address this problem:
- MT evaluation metrics such as COMET or BLEURT are designed around source/reference settings and often mix fluency with adequacy.
- Language-model perplexity is a one-dimensional proxy that can penalise rare but natural domains, reward generic low-entropy text, and inherit the distributional biases of the language model used.
- Semantic encoders are optimised to preserve meaning, not fluency or idiomaticity.
NaturalnessEncoder is trained with an ordinal supervised contrastive objective. Pairs whose labels are close on an 8-level empirical spectrum are pulled together; pairs at distant levels are pushed apart. Attraction strength decays with ordinal distance, so the model learns a continuous metric space rather than a set of discrete categories.
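The card does not give the exact loss, so the following is only a minimal sketch of one way an ordinal supervised contrastive objective with exponentially decaying positive weights could look. The function name `ordinal_supcon_loss`, the `temperature` and `decay` parameters, and the weighting `exp(-decay * |y_i - y_j|)` are assumptions made for illustration, not the released training code. It is written in plain Python so it runs without dependencies.

```python
import math

def ordinal_supcon_loss(embeddings, labels, temperature=0.1, decay=1.0):
    """Soft supervised contrastive loss (illustrative, not the released code).

    embeddings: list of L2-normalised vectors (lists of floats), at least two
    labels: list of integer ordinal levels, e.g. 0..7
    """
    n = len(embeddings)
    total = 0.0
    for i in range(n):
        others = [j for j in range(n) if j != i]
        # Temperature-scaled cosine similarity (dot product of unit vectors).
        logits = [
            sum(a * b for a, b in zip(embeddings[i], embeddings[j])) / temperature
            for j in others
        ]
        # Stable log-sum-exp over every other item: the contrastive denominator.
        top = max(logits)
        log_denom = top + math.log(sum(math.exp(l - top) for l in logits))
        # Assumed weighting: attraction decays exponentially with ordinal distance.
        weights = [math.exp(-decay * abs(labels[i] - labels[j])) for j in others]
        weighted_log_prob = sum(w * (l - log_denom) for w, l in zip(weights, logits))
        total += -weighted_log_prob / sum(weights)
    return total / n

def unit(angle_deg):
    """2-D unit vector at the given angle, used as a toy embedding."""
    a = math.radians(angle_deg)
    return [math.cos(a), math.sin(a)]

batch = [unit(0), unit(5), unit(30), unit(180)]
# Labels that agree with the geometry should incur a lower loss than scrambled ones.
loss_ordered = ordinal_supcon_loss(batch, [0, 0, 1, 7])
loss_scrambled = ordinal_supcon_loss(batch, [0, 7, 1, 0])
print(loss_ordered < loss_scrambled)
```

Because every pair receives a non-zero weight, there is no hard positive/negative split in this sketch: near levels are attracted strongly, far levels almost not at all, which is one way to obtain a graded rather than categorical space.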
Model details
| Property | Value |
|---|---|
| Base model | microsoft/deberta-v3-base |
| Architecture | DeBERTa-v3-base + masked mean pooling + 2-layer MLP projection head (768→768 GELU, 768→128) |
| Adaptation | LoRA (r=32, α=64) applied to all nn.Linear modules in the backbone, merged for release |
| Output dimension | 128, ℓ₂-normalized |
| Similarity function | Cosine similarity |
| Max sequence length | 100 tokens |
| Training data | foudil/HU-Nat: 52,694 train / 13,174 test, 8-level ordinal labels |
| Training objective | Ordinal supervised contrastive learning with exponential-decay positive weighting |
| Language | English |
The training data is HU-Nat, the Naturalness Spectrum dataset: an 8-level ordinal corpus assembled from WMT06 machine-translation outputs, W&I+LOCNESS learner English, and OneStopEnglish graded native prose. See the dataset card for source details, label construction, splits, and the confound audit.
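The architecture row in the table above mentions masked mean pooling and ℓ₂-normalised outputs. As a rough illustration of those two steps only, here is a dependency-free sketch. The helper names `masked_mean_pool` and `l2_normalize` are hypothetical, and the MLP projection head that sits between pooling and normalisation in the actual model is omitted.

```python
import math

def masked_mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1, ignoring padding."""
    dim = len(token_vectors[0])
    summed = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            summed = [s + v for s, v in zip(summed, vec)]
            count += 1
    count = max(count, 1)  # guard against an all-padding sequence
    return [s / count for s in summed]

def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / max(norm, eps) for v in vec]

# Toy sequence: two real tokens followed by one padding position.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = masked_mean_pool(tokens, mask)  # the padding vector does not contribute
unit_vec = l2_normalize(pooled)
```

Masking matters here because padded positions would otherwise pull the sentence vector toward whatever the backbone emits for padding tokens.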
Evaluation and dataset audit
Evaluation was run on the held-out test split of 13,174 sentences. The purpose of these probes is twofold:
- measure whether the encoder recovers the intended ordinal structure; and
- audit whether that structure is reducible to unrelated semantic or affective side signals.
The semantic and affective models below are therefore audit baselines, not competing naturalness systems. They test whether a generic meaning encoder or an emotion/tone encoder can explain the same label geometry without being trained for naturalness.
Global probes
| Probe | Metric | NaturalnessEncoder | Semantic audit encoder (mpnet) | Affective audit encoder (RoBERTa) |
|---|---|---|---|---|
| Frozen-embedding ridge probe | Spearman ρ with ordinal label | +0.941 | +0.697 | +0.582 |
| Frozen-embedding ridge probe | Kendall τ with ordinal label | +0.829 | +0.531 | +0.431 |
| Pairwise cosine geometry probe | Spearman ρ with graded label similarity | +0.891 | +0.066 | −0.005 |
| Pairwise cosine geometry probe | Pearson r with graded label similarity | +0.909 | +0.066 | −0.021 |
| Pairwise cosine geometry probe | Mean far-vs-near cosine gap | +0.570 | +0.030 | +0.034 |
The global ridge scores for the audit encoders should be read cautiously: a linear probe can exploit corpus-level regularities that co-vary with the labels, including topic, genre, readability, learner level, or corpus source. The stricter pairwise cosine geometry probes show that generic semantic and affective spaces do not naturally reproduce the ordinal naturalness geometry.
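For readers who want to reproduce something like the pairwise cosine geometry probe, the sketch below shows one plausible reading of it. The card does not define "graded label similarity", so the formula `1 - |y_i - y_j| / (num_levels - 1)` is an assumption, as are all function names. Since Spearman's ρ is rank-based, any strictly decreasing function of the label distance would give the same value.

```python
import math
from itertools import combinations

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def average_ranks(values):
    """Ranks starting at 1; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman's rho as the Pearson correlation of average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def pairwise_geometry_probe(embeddings, labels, num_levels=8):
    """Correlate pairwise cosine with an assumed graded label similarity."""
    cos_sims, label_sims = [], []
    for i, j in combinations(range(len(labels)), 2):
        cos_sims.append(cosine(embeddings[i], embeddings[j]))
        label_sims.append(1.0 - abs(labels[i] - labels[j]) / (num_levels - 1))
    return spearman(cos_sims, label_sims)

# Toy embeddings on an arc: the angle grows with the label, so cosine falls with label distance.
labels = list(range(8))
embeddings = [[math.cos(math.radians(20 * y)), math.sin(math.radians(20 * y))] for y in labels]
rho = pairwise_geometry_probe(embeddings, labels)
```

Unlike a ridge probe, this check fits no parameters, which is why it is the stricter test: the ordinal structure has to be present in the raw cosine geometry. On a test split of 13,174 sentences the full pair set is large, so sampling pairs would be a practical necessity.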
Intra-source replication
| Source group | NaturalnessEncoder ρ, frozen ridge probe | NaturalnessEncoder ρ, pairwise cosine geometry probe |
|---|---|---|
| MT / WMT06 (2 levels) | +0.138 | +0.008 |
| Learner / W&I+LOCNESS (3 levels) | +0.600 | +0.222 |
| Native / OneStopEnglish (3 levels) | +0.827 | +0.577 |
The native/OneStopEnglish subset is the cleanest internal replication because its Elementary, Intermediate, and Advanced versions are rewrites of the same articles. This holds topic and broad communicative content relatively constant while varying readability, syntactic complexity, and surface realisation.
The learner subset provides a weaker but still meaningful replication across CEFR-related learner levels. The MT subset has little within-source resolution, which is expected given that it contains only two levels and that WMT-style fluency judgements are noisy and system-dependent.
What the audit does and does not show
These results support the claim that the learned space is not merely a semantic-similarity space or an affective-tone space. They do not prove that the labels isolate a pure construct called naturalness. The source corpora differ in genre, collection method, annotation scheme, and surface conventions, so residual corpus identity signal may remain.
The intended interpretation is therefore conservative: NaturalnessEncoder learns the empirical ordering induced by this dataset and recovers that ordering most reliably where the source design best controls content, especially within OneStopEnglish.
On reading the geometry
The encoder learns a non-uniform metric. Distances between the MT, learner, and native macro-bands are much larger than distances between adjacent levels inside a band. As a consequence:
- sentences from different macro-bands may have cosine similarity close to zero or below zero;
- sentences from the same macro-band may have cosine similarity above 0.9 even when their labels differ by one rank;
- raw cosine values should be treated as ordinal evidence, not interval-scaled naturalness scores.
This reflects the empirical structure of the training data rather than a guarantee that the 8 labels are equally spaced. Applications that require a scalar score should fit a calibration layer or monotone transformation on an appropriate validation set.
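One simple monotone transformation is isotonic regression fitted with the pool-adjacent-violators algorithm. The sketch below is an illustration under stated assumptions rather than a recommended pipeline: `fit_isotonic` and `predict_isotonic` are hypothetical helper names, and the raw scores and target labels are made-up calibration values.

```python
from bisect import bisect_right

def fit_isotonic(scores, targets):
    """Pool-adjacent-violators fit; returns (sorted_scores, fitted_values)."""
    pairs = sorted(zip(scores, targets))
    xs = [p[0] for p in pairs]
    # Each block stores [sum_of_targets, count]; adjacent blocks are merged
    # whenever the earlier block's mean exceeds the later block's mean.
    blocks = []
    for _, y in pairs:
        blocks.append([y, 1])
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)
    return xs, fitted

def predict_isotonic(xs, fitted, score):
    """Step-function lookup: value of the nearest calibration point at or below score."""
    idx = bisect_right(xs, score) - 1
    idx = min(max(idx, 0), len(xs) - 1)
    return fitted[idx]

# Hypothetical calibration set: raw scores (e.g. cosine to an anchor) and ordinal labels.
raw_scores = [-0.3, -0.1, 0.0, 0.4, 0.5, 0.9]
targets = [0, 2, 1, 4, 6, 7]
xs, fitted = fit_isotonic(raw_scores, targets)
calibrated = predict_isotonic(xs, fitted, 0.45)
```

The fit only assumes that higher raw scores should never map to lower calibrated values, which matches the ordinal reading of the geometry. In practice an established implementation such as scikit-learn's `IsotonicRegression` would be the more robust choice.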
Usage
Within-band ordering
The opening example shows the intended within-band behaviour: versions of the same content at adjacent native levels are closer than versions at distant levels.
Cross-band ranking against a native anchor
# Reuses `model` loaded in the opening example.
anchor = "The committee agreed that the proposal warranted further investigation before any decision could be made."
corpus = [
    "After deliberation, the committee determined that the proposal merited additional review prior to reaching a decision.",
    "The committee said the idea needs more looking into before deciding anything.",
    "The committee they agreed that the proposal need more investigate before any decision can be make.",
    "Committee agreed proposal need more investigation before make decision any.",
    "committee committee agree more investigation proposal decision before",
]
scores = model.similarity(model.encode([anchor]), model.encode(corpus))[0]
# Print candidates from most to least similar to the native anchor.
for s, t in sorted(zip(scores.tolist(), corpus), reverse=True):
    print(f"{s:+.3f} {t[:80]}")
# +0.836 The committee said the idea needs more looking into...
# +0.821 After deliberation, the committee determined...
# +0.010 The committee they agreed that the proposal need more investigate...
# -0.114 Committee agreed proposal need more investigation before make decision any.
# -0.279 committee committee agree more investigation proposal decision before
The ranking is monotonic across the broad spectrum. In this example, both native paraphrases rank above learner-like and MT-like variants, and the simpler native paraphrase slightly outranks the more formal one. This suggests that the model is not simply rewarding formality, although formality and lexical sophistication can still act as side signals in some domains.
Scalar score by calibration
For applications that require a scalar score, fit a calibration procedure on representative data. One simple option is projection onto a spectrum axis computed from calibration centroids:
# axis = F.normalize(centroid_lvl7 - centroid_lvl0, dim=0) # precompute on calibration data
# score = emb @ axis # scalar, then calibrate or rescale downstream
On the held-out test set, this projected score correlates with the integer label at Spearman ρ = +0.891. (A frozen-embedding ridge probe — a more flexible linear functional — reaches ρ = +0.941 on the same split, reported in the global probes table above.) Either number should be interpreted as in-distribution recovery of the dataset labels, not as a universal absolute naturalness score.
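The two commented lines above can be expanded into a small runnable sketch. It follows the same idea (unit vector from the level-0 centroid to the level-7 centroid, then a dot product), but the helper names and the 2-D toy embeddings are hypothetical, and plain Python lists stand in for the tensors the pseudocode implies.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def spectrum_axis(embeddings, labels, low=0, high=7):
    """Unit vector pointing from the lowest-level centroid to the highest-level one."""
    low_c = centroid([e for e, y in zip(embeddings, labels) if y == low])
    high_c = centroid([e for e, y in zip(embeddings, labels) if y == high])
    diff = [h - l for h, l in zip(high_c, low_c)]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]

def project(embedding, axis):
    """Scalar position of an embedding along the spectrum axis."""
    return sum(e * a for e, a in zip(embedding, axis))

# Toy calibration data: two level-0 and two level-7 embeddings in 2-D.
cal_embeddings = [[1.0, 0.0], [0.8, 0.6], [-0.6, 0.8], [-1.0, 0.0]]
cal_labels = [0, 0, 7, 7]
axis = spectrum_axis(cal_embeddings, cal_labels)

score_low = project([1.0, 0.0], axis)
score_mid = project([0.0, 1.0], axis)
score_high = project([-1.0, 0.0], axis)
```

The projection is still an uncalibrated quantity; a rescaling or monotone calibration step on representative data is needed before it can be read as a score.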
Intended use
- Naturalness-aware ranking, retrieval, or filtering of English text
- Auxiliary signal for MT, paraphrase, or text-generation evaluation where source/reference information is unavailable
- Feature input for downstream systems that need a graded representation of fluency or idiomaticity
- Corpus-level analysis of register, learner-language, or generation artifacts
Out-of-scope use
- Hard classification. The model produces graded embeddings, not category labels.
- Uncalibrated absolute scoring. Cosine similarities are most meaningful comparatively; scalar scores require calibration.
- Non-English text. The model has not been trained or evaluated on other languages.
- High-stakes assessment of writers. The model is not validated for individual proficiency assessment, grading, hiring, diagnosis, or pedagogy.
- Literary or highly marked registers. Fiction, poetry, rhetorical prose, dialectal writing, and deliberately non-standard styles are underrepresented or absent.
- Adequacy or factuality evaluation. A text can be natural while mistranslating, hallucinating, or contradicting a source.
Limitations
- Dataset-derived construct. The model learns the ordering induced by HU-Nat. That ordering is useful but not a complete theory of naturalness.
- Corpus-source confounds. The spectrum combines MT output, learner essays, and graded native prose. Corpus identity, genre, topic distribution, annotation practice, and preprocessing artifacts may contribute residual signal.
- Upper-band side signal. In OneStopEnglish, readability level is partly expressed through lexical and syntactic sophistication. More complex native prose may therefore score higher than simpler but equally fluent prose.
- Lower-band noise. WMT06 fluency labels and learner-level labels are coarser and noisier than the native readability triplets, limiting within-band resolution.
- Non-uniform spacing. The learned metric is not equally spaced across the 8 labels. Treat it as ordinal unless calibrated.
- Surface artifacts. Punctuation, capitalisation, tokenisation, and MT-pipeline artifacts may carry signal beyond the intended construct.
- Domain coverage. Training data does not cover all natural English domains, dialects, or styles.
Licence and source-data terms
The model weights are released under Apache 2.0. This applies to the released parameters only, not to the training data.
HU-Nat aggregates material derived from WMT06, W&I+LOCNESS, and OneStopEnglish, and each source remains governed by its original licence or terms of use. In particular, OneStopEnglish is released under CC BY-SA 4.0, and the BEA-2019 W&I+LOCNESS release is subject to non-commercial-use terms. Users who redistribute the underlying data, or derivatives of it, are responsible for complying with those source terms; the Apache 2.0 grant on the weights does not relicense the source corpora.
Downstream users should preserve attribution to all source corpora and consult the HU-Nat dataset card before redistributing data derived from the sources.
Citation
If you use this model, please cite the three constituent source corpora below. Citations for the accompanying article and the HU-Nat dataset will be added on publication.
@inproceedings{koehn-monz-2006-manual,
title = {Manual and Automatic Evaluation of Machine Translation between {E}uropean Languages},
author = {Koehn, Philipp and Monz, Christof},
booktitle = {Proceedings on the Workshop on Statistical Machine Translation},
year = {2006},
month = jun,
address = {New York City},
publisher = {Association for Computational Linguistics},
pages = {102--121},
url = {https://aclanthology.org/W06-3114/}
}
@inproceedings{bryant-etal-2019-bea,
title = {The {BEA}-2019 Shared Task on Grammatical Error Correction},
author = {Bryant, Christopher and Felice, Mariano and Andersen, {\O}istein E. and Briscoe, Ted},
booktitle = {Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications},
year = {2019},
month = aug,
address = {Florence, Italy},
publisher = {Association for Computational Linguistics},
pages = {52--75},
doi = {10.18653/v1/W19-4406},
url = {https://aclanthology.org/W19-4406/}
}
@inproceedings{vajjala-lucic-2018-onestopenglish,
title = {{OneStopEnglish} corpus: A new corpus for automatic readability assessment and text simplification},
author = {Vajjala, Sowmya and Lu{\v{c}}i{\'c}, Ivana},
booktitle = {Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications},
year = {2018},
month = jun,
address = {New Orleans, Louisiana},
publisher = {Association for Computational Linguistics},
pages = {297--304},
doi = {10.18653/v1/W18-0535},
url = {https://aclanthology.org/W18-0535/}
}
Framework versions
- Python 3.9.25
- transformers 4.57.6
- PyTorch 2.8.0
- Tokenizers 0.22.2