XLM-RoBERTa Job Taxonomy — Main (German)

Fine-tuned xlm-roberta-base for classifying German job listings into 21 top-level industry categories.

Part of the JobBlast taxonomy model family — a system for automated classification of German-language job postings. This model handles the broad top-level split across all industries; for detailed IT role classification see the related IT model below.

Test Metrics

Metric	Value
Accuracy	81.75%
F1 macro	78.15%
F1 weighted	81.88%

Evaluated on a held-out test set of 3,320 German job listings.

Per-class results

Category	Precision	Recall	F1	Support
Administration & Office	0.704	0.737	0.720	190
Construction & Building	0.702	0.733	0.717	90
Design & Creative	0.714	0.800	0.755	25
Education & Teaching	0.620	0.689	0.653	45
Engineering	0.788	0.769	0.778	363
Finance, Accounting & Controlling	0.778	0.755	0.766	200
General Management & Consulting	0.611	0.767	0.680	129
Healthcare & Medical	0.892	0.892	0.892	194
Hospitality, Gastronomy & Tourism	0.897	0.776	0.832	67
Human Resources	0.667	0.783	0.720	23
IT & Software	0.907	0.880	0.893	500
Insurance & Real Estate	0.819	0.907	0.861	75
Legal	0.688	0.786	0.733	14
Logistics, Transport & Warehouse	0.859	0.905	0.882	74
Marketing, Communications & PR	0.794	0.711	0.750	38
Production & Manufacturing	0.712	0.767	0.739	129
Public Sector, Security & Defense	0.695	0.725	0.710	91
Sales & Business Development	0.908	0.908	0.908	500
Science & Research	0.733	0.815	0.772	27
Skilled Trades & Crafts	0.873	0.780	0.824	431
Social Work & Care	0.826	0.826	0.826	115
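
As a sanity check, the aggregate F1 scores can be recomputed from the per-class table. The sketch below copies the F1 and support columns by hand (the dictionary name `PER_CLASS` and both helper functions are illustrative, not part of the repository) and should land close to the reported 78.15% macro and 81.88% weighted F1, up to rounding of the three-decimal table values.

```python
# (F1, support) per class, copied from the per-class results table.
PER_CLASS = {
    "Administration & Office": (0.720, 190),
    "Construction & Building": (0.717, 90),
    "Design & Creative": (0.755, 25),
    "Education & Teaching": (0.653, 45),
    "Engineering": (0.778, 363),
    "Finance, Accounting & Controlling": (0.766, 200),
    "General Management & Consulting": (0.680, 129),
    "Healthcare & Medical": (0.892, 194),
    "Hospitality, Gastronomy & Tourism": (0.832, 67),
    "Human Resources": (0.720, 23),
    "IT & Software": (0.893, 500),
    "Insurance & Real Estate": (0.861, 75),
    "Legal": (0.733, 14),
    "Logistics, Transport & Warehouse": (0.882, 74),
    "Marketing, Communications & PR": (0.750, 38),
    "Production & Manufacturing": (0.739, 129),
    "Public Sector, Security & Defense": (0.710, 91),
    "Sales & Business Development": (0.908, 500),
    "Science & Research": (0.772, 27),
    "Skilled Trades & Crafts": (0.824, 431),
    "Social Work & Care": (0.826, 115),
}


def macro_f1(rows: dict) -> float:
    """Unweighted mean of per-class F1: every class counts equally."""
    return sum(f1 for f1, _ in rows.values()) / len(rows)


def weighted_f1(rows: dict) -> float:
    """Support-weighted mean of per-class F1: large classes dominate."""
    total = sum(n for _, n in rows.values())
    return sum(f1 * n for f1, n in rows.values()) / total


print(f"macro F1:    {macro_f1(PER_CLASS):.4f}")     # 0.7815
print(f"weighted F1: {weighted_f1(PER_CLASS):.4f}")  # 0.8188
```

The gap between the two numbers reflects the class imbalance: the two 500-sample classes both score around 0.90 and pull the weighted average up, while the macro average gives the small, weaker classes the same say.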

Repository Structure

Ashybalka/xlm-roberta-taxonomy-main-de/
├── config.json                    # shared — label map, model config
├── tokenizer.json                 # shared — fast tokenizer
├── tokenizer_config.json          # shared
├── sentencepiece.bpe.model        # shared
├── special_tokens_map.json        # shared
├── test_metrics.json              # evaluation results
├── classification_report.txt      # full per-class report
│
├── pytorch/
│   ├── config.json                # needed for from_pretrained(subfolder=)
│   └── model.safetensors          # GPU inference / fine-tuning (~1.1 GB)
│
├── onnx/
│   └── model.onnx                 # CPU fp32 inference (~1.1 GB)
│
└── onnx-int8/
    └── model_quantized.onnx       # CPU INT8 quantized (~280 MB, 3-4× smaller)

Usage

Input Format

[Job Title] [SEP] [Job Description]

text = "Pflegefachkraft [SEP] Wir suchen eine examinierte Pflegefachkraft " \
       "für die Betreuung von Bewohnern in unserer Senioreneinrichtung."
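
When titles and descriptions come from separate fields, a small helper keeps the input layout consistent. This is a minimal sketch: the function name `build_input` is hypothetical, and the whitespace normalization is an assumption about reasonable preprocessing, not a documented training-time step.

```python
import re


def _squash(text: str) -> str:
    """Collapse runs of whitespace (newlines, tabs) into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def build_input(title: str, description: str) -> str:
    """Join both fields in the '[Job Title] [SEP] [Job Description]' layout."""
    return f"{_squash(title)} [SEP] {_squash(description)}"


print(build_input("Pflegefachkraft ", "Wir suchen eine\nexaminierte  Pflegefachkraft."))
# Pflegefachkraft [SEP] Wir suchen eine examinierte Pflegefachkraft.
```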

PyTorch (GPU / fine-tuning)

from transformers import pipeline

clf = pipeline(
    "text-classification",
    model="Ashybalka/xlm-roberta-taxonomy-main-de",
    subfolder="pytorch",
    device=0,  # GPU; -1 for CPU
)

result = clf("DevOps Engineer [SEP] Kubernetes, CI/CD, Monitoring mit Prometheus.")
print(result)
# [{'label': 'IT & Software', 'score': 0.9712}]

ONNX fp32 (CPU inference)

from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer, pipeline

model     = ORTModelForSequenceClassification.from_pretrained(
                "Ashybalka/xlm-roberta-taxonomy-main-de",
                subfolder="onnx"
            )
tokenizer = AutoTokenizer.from_pretrained("Ashybalka/xlm-roberta-taxonomy-main-de")
clf       = pipeline("text-classification", model=model, tokenizer=tokenizer)

result = clf("Maurer [SEP] Erfahrener Maurer für Hochbau und Sanierungsarbeiten gesucht.")
print(result)
# [{'label': 'Construction & Building', 'score': 0.9483}]

ONNX INT8 (CPU, lightweight)

from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer, pipeline

model     = ORTModelForSequenceClassification.from_pretrained(
                "Ashybalka/xlm-roberta-taxonomy-main-de",
                subfolder="onnx-int8"
            )
tokenizer = AutoTokenizer.from_pretrained("Ashybalka/xlm-roberta-taxonomy-main-de")
clf       = pipeline("text-classification", model=model, tokenizer=tokenizer)

Direct ONNX Runtime (no transformers)

For production FastAPI services or environments without transformers:

import json
import numpy as np
import onnxruntime as ort
from huggingface_hub import snapshot_download
from tokenizers import Tokenizer

# download model
path = snapshot_download(
    "Ashybalka/xlm-roberta-taxonomy-main-de",
    allow_patterns=["onnx/model.onnx", "tokenizer.json", "config.json"]
)

# load
session   = ort.InferenceSession(f"{path}/onnx/model.onnx",
                                  providers=["CPUExecutionProvider"])
tokenizer = Tokenizer.from_file(f"{path}/tokenizer.json")
tokenizer.enable_truncation(max_length=510)  # leaves room for <s> and </s> within the 512-token limit
tokenizer.no_padding()

with open(f"{path}/config.json") as f:
    labels = [v for _, v in sorted(json.load(f)["id2label"].items(), key=lambda x: int(x[0]))]

vocab    = tokenizer.get_vocab()
bos, eos = vocab["<s>"], vocab["</s>"]

def classify(title: str, description: str) -> dict:
    text     = f"{title} [SEP] {description}"
    encoding = tokenizer.encode(text, add_special_tokens=False)
    ids      = [bos] + encoding.ids + [eos]
    mask     = [1] * len(ids)

    logits = session.run(None, {
        "input_ids":      np.array([ids],  dtype=np.int64),
        "attention_mask": np.array([mask], dtype=np.int64),
    })[0][0]

    exp   = np.exp(logits - logits.max())
    probs = exp / exp.sum()
    idx   = int(np.argmax(probs))

    return {"category": labels[idx], "confidence": round(float(probs[idx]), 4)}

print(classify("Python Developer", "FastAPI, Docker, PostgreSQL, REST API Entwicklung."))
# {'category': 'IT & Software', 'confidence': 0.9534}
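
The `classify` function above returns only the single best category. For borderline listings it can help to look at the runner-up classes as well. The following is a dependency-free sketch of a top-k softmax; the function name `top_k` and the example logits are made up for illustration.

```python
import math
from typing import List, Sequence, Tuple


def top_k(logits: Sequence[float], labels: Sequence[str], k: int = 3) -> List[Tuple[str, float]]:
    """Return the k most probable (label, probability) pairs."""
    # Subtract the max before exponentiating for numerical stability,
    # mirroring the softmax used in classify().
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    ranked = sorted(zip(labels, exps), key=lambda pair: pair[1], reverse=True)
    return [(label, e / total) for label, e in ranked[:k]]


# Made-up logits for three classes, purely for demonstration.
for label, prob in top_k([2.0, 1.0, 0.0], ["IT & Software", "Engineering", "Legal"], k=2):
    print(f"{label}: {prob:.4f}")
```

To use it with the ONNX session, pass `logits.tolist()` and the `labels` list loaded from `config.json`.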

Training Details

Parameter	Value
Base model	`FacebookAI/xlm-roberta-base`
Labeling	LLM consensus (3-model voting)
Agreement filter	2/3 or 3/3 required
Max length	512 tokens
Learning rate	2e-5
Class weighting	Balanced + sample weights by agreement
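
The labeling and weighting rows can be illustrated with a short sketch. It shows one plausible reading of "3-model voting with a 2/3 agreement filter" and of "balanced" class weights (the common n_samples / (n_classes * class_count) heuristic). The function names are hypothetical, and the per-sample weights in `AGREEMENT_WEIGHT` are invented placeholders, since the actual values used in training are not stated.

```python
from collections import Counter
from typing import Dict, Optional, Sequence, Tuple


def consensus(votes: Sequence[str]) -> Optional[Tuple[str, int]]:
    """Majority label among the LLM votes, or None if fewer than 2 agree."""
    label, count = Counter(votes).most_common(1)[0]
    return (label, count) if count >= 2 else None


def balanced_class_weights(labels: Sequence[str]) -> Dict[str, float]:
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * cnt) for c, cnt in counts.items()}


# ASSUMPTION: placeholder values; the card does not state the real weights.
AGREEMENT_WEIGHT = {3: 1.0, 2: 0.5}

print(consensus(["IT & Software", "IT & Software", "Engineering"]))
# ('IT & Software', 2)
print(consensus(["Legal", "Engineering", "Human Resources"]))
# None
```

Under this reading, a listing kept with 2/3 agreement would contribute less to the loss than one with unanimous labels, and rare classes would be up-weighted relative to frequent ones.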

Limitations

Optimized for German job listings only
Class imbalance in the test set: IT & Software and Sales & Business Development have 500 samples each, while Legal has 14 and Design & Creative has 25, so metrics for the small classes carry high variance
Weakest classes: Education & Teaching (F1 0.653), General Management & Consulting (F1 0.680), and Public Sector, Security & Defense (F1 0.710); these are broad or semantically overlapping categories
Administration & Office and General Management & Consulting act as catch-all categories for ambiguous office roles and may absorb borderline listings
Inputs longer than 512 tokens are truncated — use title + first paragraph for best results
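
The last point, keeping the title plus the first paragraph, can be sketched as follows. The helper name is hypothetical, and it assumes paragraphs in the raw description are separated by blank lines, which may not hold for every scraped listing.

```python
import re


def title_plus_first_paragraph(title: str, description: str) -> str:
    """Build a short model input from the title and the first paragraph only."""
    # ASSUMPTION: paragraphs are separated by one or more blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", description) if p.strip()]
    first = paragraphs[0] if paragraphs else ""
    return f"{title.strip()} [SEP] {first}"


listing = "Erfahrener Maurer für Hochbau gesucht.\n\nWir bieten: Firmenwagen, 30 Tage Urlaub."
print(title_plus_first_paragraph("Maurer", listing))
# Maurer [SEP] Erfahrener Maurer für Hochbau gesucht.
```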

Related Models

Model	Categories	Use case
xlm-roberta-taxonomy-main-de (this model)	21 top-level	General job classification across all industries
xlm-roberta-taxonomy-it-de	14 IT subcategories	Detailed IT role classification

A typical pipeline: this main model assigns the top-level category, and listings classified as IT & Software are then passed to the IT model (xlm-roberta-taxonomy-it-de) for fine-grained role classification.
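
The two-stage routing described above can be sketched without loading either model by passing the classifiers in as callables. This assumes both behave like a transformers text-classification pipeline (returning a list with one `{'label': ..., 'score': ...}` dict per input) and that the IT model accepts the same input text; the function name and result keys are illustrative.

```python
from typing import Callable, Dict, List, Optional

# Anything that maps a text to [{'label': str, 'score': float}].
Classifier = Callable[[str], List[Dict]]

IT_LABEL = "IT & Software"


def classify_two_stage(text: str, main_clf: Classifier, it_clf: Classifier) -> Dict[str, Optional[object]]:
    """Run the main model, then the IT model only for IT & Software listings."""
    top = main_clf(text)[0]
    result: Dict[str, Optional[object]] = {
        "category": top["label"],
        "score": top["score"],
        "it_role": None,
        "it_role_score": None,
    }
    if top["label"] == IT_LABEL:
        sub = it_clf(text)[0]
        result["it_role"] = sub["label"]
        result["it_role_score"] = sub["score"]
    return result
```

With real models, `main_clf` and `it_clf` would be the `clf` objects built in the Usage section for each repository. Skipping the second call for non-IT listings keeps CPU cost down when most of the traffic falls into other categories.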
