GIST-small-markov-slop-detector

This BERT-based classifier is trained to distinguish coherent human-written text from text generated by a Markov chain.

As expected, the classifier achieves near-perfect performance (98.2% accuracy on the evaluation set), largely because BERT's attention mechanism captures long-range contextual dependencies, whereas a Markov chain conditions each token only on the previous state.
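
To make the contrast concrete, here is a minimal sketch of a first-order, word-level Markov chain generator. It only illustrates the technique: the toy corpus, the function names (`build_chain`, `generate`), and the chain order are assumptions, not the procedure used to build the agentlans/markov-slop dataset.

```python
import random
from collections import defaultdict

def build_chain(words):
    """Map each word to the list of words that follow it in the corpus."""
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain

def generate(chain, start, length, seed=0):
    """Sample up to `length` words; each choice depends only on the previous word."""
    rng = random.Random(seed)
    output = [start]
    for _ in range(length - 1):
        candidates = chain.get(output[-1])
        if not candidates:  # dead end: the last word never had a successor
            break
        output.append(rng.choice(candidates))
    return output

# Toy corpus (hypothetical); real corpora are far larger.
corpus = "the cat sat on the mat and the dog sat on the rug".split()
chain = build_chain(corpus)
sample = generate(chain, start="the", length=12)
print(" ".join(sample))
```

Every adjacent word pair in the output occurs somewhere in the corpus, so the text looks locally plausible, but nothing ties a word to anything earlier than its immediate predecessor.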

Dataset

Class distribution across the training and test splits:

Label	Train	Test	Total
markov	7998	2000	9998
real	8000	2000	10000
Total	15998	4000	19998

Real samples from: agentlans/high-quality-text-long sample_k10000
Markov samples from: agentlans/markov-slop

Model Specification

Model type: bert
Problem Type: single_label_classification
Number of Labels: 2
Vocabulary Size: 30522
License: MIT

Use

To get started with this model in Python using the Hugging Face Transformers library, run the following code:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the tokenizer and the fine-tuned classifier from the Hugging Face Hub.
model_id = "agentlans/GIST-small-markov-slop-detector"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

text = "Replace this with your input text."
# Truncate inputs that exceed the tokenizer's maximum sequence length.
inputs = tokenizer(text, return_tensors="pt", truncation=True)

# Inference only, so gradient tracking is disabled.
with torch.no_grad():
    logits = model(**inputs).logits

# Pick the highest-scoring class along the label dimension.
predicted_class_id = logits.argmax(dim=-1).item()
predicted_class_name = model.config.id2label[predicted_class_id]

print(f"Predicted Class ID: {predicted_class_id}")
print(f"Predicted Class Name: {predicted_class_name}")

Intended Uses & Limitations

Intended Use

This model is designed for binary sequence classification: deciding whether a passage is Markov chain output (markov) or coherent human-written text (real). The class labels map to IDs as follows:

Label ID	Label Name
0	markov
1	real
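
As a sketch of how this mapping could be applied to raw model outputs, the snippet below turns a pair of logits into a label and a confidence score using a softmax. The logit values are made up for illustration, and `classify_logits` is a hypothetical helper, not part of the Transformers API. On a loaded model the same mapping is available as `model.config.id2label`.

```python
import math

# Label mapping from the table above.
ID2LABEL = {0: "markov", 1: "real"}

def classify_logits(logits):
    """Return (label, probability) for one example's logits via a stable softmax."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max to avoid overflow
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best], probs[best]

# Hypothetical logits for a single input, ordered by label ID.
label, confidence = classify_logits([-2.1, 3.4])
print(label, round(confidence, 3))
```

Reporting the probability alongside the label makes it possible to apply a stricter threshold than a plain argmax when false positives are costly.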

Training Details

Hyperparameters

The following hyperparameters were used during fine-tuning:

Learning Rate: 5e-05
Train Batch Size: 8
Eval Batch Size: 8
Optimizer: AdamW (adamw_torch_fused)
Number of Epochs: 3.0
Mixed Precision: BF16

Advanced Training Configuration

Optimization & Regularization

Gradient Accumulation Steps: 1
Learning Rate Scheduler: linear
Warmup Steps: 0
Warmup Ratio: None
Weight Decay: 0.0
Max Gradient Norm: 1.0

Hardware & Reproducibility

Number of GPUs: 1
Seed: 42

Training Results & Evaluation

Fine-tuning produced the following summary metrics. The reported validation loss matches the epoch 1 entry in the training log, which is the lowest of the three epochs:

Metric	Value
Train Loss	0.0593
Validation Loss	0.0693
Validation F1 Score	N/A
Total FLOPs	7.9037e+14

Speed Performance

Training Runtime: 106.2373 seconds
Train Samples per Second: 451.762
Evaluation Runtime: 3.2116 seconds
Eval Samples per Second: 1245.467
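
These figures can be cross-checked against the dataset sizes, since throughput multiplied by runtime should give the number of samples processed. The sketch below does that arithmetic, assuming the training set is seen once per epoch for 3 epochs and that the evaluation figures refer to a single pass over the test split.

```python
# Values copied from this model card.
train_samples, test_samples, epochs = 15998, 4000, 3
train_runtime_s, train_samples_per_s = 106.2373, 451.762
eval_runtime_s, eval_samples_per_s = 3.2116, 1245.467

# Samples processed = throughput * runtime.
train_processed = train_samples_per_s * train_runtime_s
eval_processed = eval_samples_per_s * eval_runtime_s

print(f"train: {train_processed:.0f} processed vs {train_samples * epochs} expected")
print(f"eval:  {eval_processed:.0f} processed vs {test_samples} expected")
```

Both products agree with the expected counts to within rounding.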

Training Logs History

Step	Epoch	Learning Rate	Training Loss	Validation Loss	Validation F1
500	0.25	4.5842e-05	0.1758	N/A	N/A
1000	0.5	4.1675e-05	0.1194	N/A	N/A
1500	0.75	3.7508e-05	0.1157	N/A	N/A
2000	1.0	3.3342e-05	0.0829	0.0693	N/A
2500	1.25	2.9175e-05	0.0405	N/A	N/A
3000	1.5	2.5008e-05	0.0334	N/A	N/A
3500	1.75	2.0842e-05	0.0464	N/A	N/A
4000	2.0	1.6675e-05	0.0412	0.0949	N/A
4500	2.25	1.2508e-05	0.0113	N/A	N/A
5000	2.5	8.3417e-06	0.0099	N/A	N/A
5500	2.75	4.1750e-06	0.0188	N/A	N/A
6000	3.0	8.3333e-09	0.0159	0.0898	N/A
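
The learning-rate column is consistent with a linear decay from 5e-05 to zero over 6000 optimizer steps with no warmup. The following sketch reproduces that schedule. `linear_lr` is a hypothetical helper, and the one-step offset (the value logged at step s matching the schedule at step s - 1) is an inference from the numbers in the table rather than documented behavior.

```python
import math

base_lr = 5e-05  # Learning Rate from the hyperparameters
train_samples, batch_size, epochs = 15998, 8, 3

steps_per_epoch = math.ceil(train_samples / batch_size)  # the last batch is partial
total_steps = steps_per_epoch * epochs

def linear_lr(step, total=total_steps, lr=base_lr):
    """Linear decay with zero warmup: lr scaled by the fraction of steps remaining."""
    return lr * max(0.0, (total - step) / total)

# (step, logged learning rate) pairs taken from the table above.
logged = [(500, 4.5842e-05), (3000, 2.5008e-05), (6000, 8.3333e-09)]
for step, logged_lr in logged:
    print(step, f"{linear_lr(step - 1):.4e}", f"{logged_lr:.4e}")
```

The step count also explains the epoch column: 2000 steps per epoch at batch size 8 puts the evaluation points at steps 2000, 4000, and 6000.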

Framework Versions

Transformers: 5.0.0.dev0
PyTorch: 2.9.1+cu128

Safetensors

Model size: 33.4M params
Tensor type: F32

Model tree for agentlans/GIST-small-markov-slop-detector

Base model: avsolatorio/GIST-small-Embedding-v0
