XLM-RoBERTa Job Taxonomy: Main (German)

Fine-tuned xlm-roberta-base for classifying German job listings into 21 top-level industry categories.

Part of the JobBlast taxonomy model family, a system for automated classification of German-language job postings. This model handles the broad top-level split across all industries; for detailed IT role classification, see the related IT model below.


Test Metrics

Metric        Value
Accuracy      81.75%
F1 macro      78.15%
F1 weighted   81.88%

Evaluated on a held-out test set of 3,320 German job listings.

Per-class results

Category                           Precision  Recall  F1     Support
Administration & Office            0.704      0.737   0.720  190
Construction & Building            0.702      0.733   0.717  90
Design & Creative                  0.714      0.800   0.755  25
Education & Teaching               0.620      0.689   0.653  45
Engineering                        0.788      0.769   0.778  363
Finance, Accounting & Controlling  0.778      0.755   0.766  200
General Management & Consulting    0.611      0.767   0.680  129
Healthcare & Medical               0.892      0.892   0.892  194
Hospitality, Gastronomy & Tourism  0.897      0.776   0.832  67
Human Resources                    0.667      0.783   0.720  23
IT & Software                      0.907      0.880   0.893  500
Insurance & Real Estate            0.819      0.907   0.861  75
Legal                              0.688      0.786   0.733  14
Logistics, Transport & Warehouse   0.859      0.905   0.882  74
Marketing, Communications & PR     0.794      0.711   0.750  38
Production & Manufacturing         0.712      0.767   0.739  129
Public Sector, Security & Defense  0.695      0.725   0.710  91
Sales & Business Development       0.908      0.908   0.908  500
Science & Research                 0.733      0.815   0.772  27
Skilled Trades & Crafts            0.873      0.780   0.824  431
Social Work & Care                 0.826      0.826   0.826  115
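
As a sanity check, the two aggregate F1 scores can be recomputed from the per-class table. The sketch below is not the authors' evaluation script. It averages the rounded F1 values copied from the table, so the results may differ from the reported numbers in the last decimal place. The names `PER_CLASS`, `macro_f1`, and `weighted_f1` are chosen here for illustration.

```python
# (F1, support) pairs copied from the per-class table.
PER_CLASS = {
    "Administration & Office": (0.720, 190),
    "Construction & Building": (0.717, 90),
    "Design & Creative": (0.755, 25),
    "Education & Teaching": (0.653, 45),
    "Engineering": (0.778, 363),
    "Finance, Accounting & Controlling": (0.766, 200),
    "General Management & Consulting": (0.680, 129),
    "Healthcare & Medical": (0.892, 194),
    "Hospitality, Gastronomy & Tourism": (0.832, 67),
    "Human Resources": (0.720, 23),
    "IT & Software": (0.893, 500),
    "Insurance & Real Estate": (0.861, 75),
    "Legal": (0.733, 14),
    "Logistics, Transport & Warehouse": (0.882, 74),
    "Marketing, Communications & PR": (0.750, 38),
    "Production & Manufacturing": (0.739, 129),
    "Public Sector, Security & Defense": (0.710, 91),
    "Sales & Business Development": (0.908, 500),
    "Science & Research": (0.772, 27),
    "Skilled Trades & Crafts": (0.824, 431),
    "Social Work & Care": (0.826, 115),
}


def macro_f1(rows):
    """Unweighted mean of per-class F1: every class counts equally."""
    return sum(f1 for f1, _ in rows.values()) / len(rows)


def weighted_f1(rows):
    """Support-weighted mean of per-class F1: large classes dominate."""
    total = sum(n for _, n in rows.values())
    return sum(f1 * n for f1, n in rows.values()) / total


total_support = sum(n for _, n in PER_CLASS.values())
print(f"classes={len(PER_CLASS)} support={total_support}")
print(f"macro F1    = {macro_f1(PER_CLASS):.4f}")
print(f"weighted F1 = {weighted_f1(PER_CLASS):.4f}")
```

Macro F1 comes out below weighted F1 because small, harder classes such as Legal (14 samples) count as much in the unweighted mean as the two 500-sample classes.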

Repository Structure

Ashybalka/xlm-roberta-taxonomy-main-de/
├── config.json                    # shared: label map, model config
├── tokenizer.json                 # shared: fast tokenizer
├── tokenizer_config.json          # shared
├── sentencepiece.bpe.model        # shared
├── special_tokens_map.json        # shared
├── test_metrics.json              # evaluation results
├── classification_report.txt      # full per-class report
│
├── pytorch/
│   ├── config.json                # needed for from_pretrained(subfolder=)
│   └── model.safetensors          # GPU inference / fine-tuning (~1.1 GB)
│
├── onnx/
│   └── model.onnx                 # CPU fp32 inference (~1.1 GB)
│
└── onnx-int8/
    └── model_quantized.onnx       # CPU INT8 quantized (~280 MB, 3-4× smaller)

Usage

Input Format

[Job Title] [SEP] [Job Description]
text = (
    "Pflegefachkraft [SEP] Wir suchen eine examinierte Pflegefachkraft "
    "für die Betreuung von Bewohnern in unserer Senioreneinrichtung."
)

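A small helper can keep the input format consistent across callers. This is a minimal sketch: the function name `build_input` and the whitespace normalization are assumptions for illustration, since the card does not say how the training text was preprocessed.

```python
def build_input(title: str, description: str) -> str:
    """Join title and description with the literal ' [SEP] ' marker."""
    # Collapse newlines, tabs, and repeated spaces that scraped listings
    # often contain. Assumption: single-line text is acceptable input.
    title = " ".join(title.split())
    description = " ".join(description.split())
    return f"{title} [SEP] {description}"


print(build_input("Pflegefachkraft", "Wir suchen eine examinierte\nPflegefachkraft."))
# Pflegefachkraft [SEP] Wir suchen eine examinierte Pflegefachkraft.
```
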
PyTorch (GPU / fine-tuning)

from transformers import pipeline

clf = pipeline(
    "text-classification",
    model="Ashybalka/xlm-roberta-taxonomy-main-de",
    subfolder="pytorch",
    device=0,  # GPU; -1 for CPU
)

result = clf("DevOps Engineer [SEP] Kubernetes, CI/CD, Monitoring mit Prometheus.")
print(result)
# [{'label': 'IT & Software', 'score': 0.9712}]

ONNX fp32 (CPU inference)

from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer, pipeline

model     = ORTModelForSequenceClassification.from_pretrained(
                "Ashybalka/xlm-roberta-taxonomy-main-de",
                subfolder="onnx"
            )
tokenizer = AutoTokenizer.from_pretrained("Ashybalka/xlm-roberta-taxonomy-main-de")
clf       = pipeline("text-classification", model=model, tokenizer=tokenizer)

result = clf("Maurer [SEP] Erfahrener Maurer für Hochbau und Sanierungsarbeiten gesucht.")
print(result)
# [{'label': 'Construction & Building', 'score': 0.9483}]

ONNX INT8 (CPU, lightweight)

from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer, pipeline

model     = ORTModelForSequenceClassification.from_pretrained(
                "Ashybalka/xlm-roberta-taxonomy-main-de",
                subfolder="onnx-int8",
                file_name="model_quantized.onnx",  # named explicitly; not the default model.onnx
            )
tokenizer = AutoTokenizer.from_pretrained("Ashybalka/xlm-roberta-taxonomy-main-de")
clf       = pipeline("text-classification", model=model, tokenizer=tokenizer)

Direct ONNX Runtime (no transformers)

For production FastAPI services or environments without transformers:

import json
import numpy as np
import onnxruntime as ort
from huggingface_hub import snapshot_download
from tokenizers import Tokenizer

# download model
path = snapshot_download(
    "Ashybalka/xlm-roberta-taxonomy-main-de",
    allow_patterns=["onnx/model.onnx", "tokenizer.json", "config.json"]
)

# load
session   = ort.InferenceSession(f"{path}/onnx/model.onnx",
                                  providers=["CPUExecutionProvider"])
tokenizer = Tokenizer.from_file(f"{path}/tokenizer.json")
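# Truncate to 510 tokens: the 512-token limit minus the <s> and </s>
# tokens that classify() adds manually.
tokenizer.enable_truncation(max_length=510)
tokenizer.no_padding()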

with open(f"{path}/config.json") as f:
    labels = [v for _, v in sorted(json.load(f)["id2label"].items(), key=lambda x: int(x[0]))]

vocab    = tokenizer.get_vocab()
bos, eos = vocab["<s>"], vocab["</s>"]

def classify(title: str, description: str) -> dict:
    text     = f"{title} [SEP] {description}"
    encoding = tokenizer.encode(text, add_special_tokens=False)
    ids      = [bos] + encoding.ids + [eos]
    mask     = [1] * len(ids)

    logits = session.run(None, {
        "input_ids":      np.array([ids],  dtype=np.int64),
        "attention_mask": np.array([mask], dtype=np.int64),
    })[0][0]

    exp   = np.exp(logits - logits.max())
    probs = exp / exp.sum()
    idx   = int(np.argmax(probs))

    return {"category": labels[idx], "confidence": round(float(probs[idx]), 4)}

print(classify("Python Developer", "FastAPI, Docker, PostgreSQL, REST API Entwicklung."))
# {'category': 'IT & Software', 'confidence': 0.9534}
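
When the top prediction is uncertain, it can help to look at several candidates rather than only the argmax. The sketch below applies the same numerically stable softmax to a plain list of logits and returns the k most likely labels. The helper name `top_k`, the example logits, and the three-label list are made up for illustration. In practice the logits would come from `session.run` and the labels from `config.json`.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(logits, labels, k=3):
    """Return the k (label, probability) pairs with the highest probability."""
    probs = softmax(logits)
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return [(label, round(p, 4)) for label, p in ranked[:k]]


# Illustrative values only, not real model output.
demo_labels = ["Engineering", "IT & Software", "Sales & Business Development"]
demo_logits = [1.0, 3.0, 0.5]
print(top_k(demo_logits, demo_labels, k=2))
```
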

Training Details

Parameter          Value
Base model         FacebookAI/xlm-roberta-base
Labeling           LLM consensus (3-model voting)
Agreement filter   2/3 or 3/3 required
Max length         512 tokens
Learning rate      2e-5
Class weighting    Balanced + sample weights by agreement
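
The labeling rows describe a majority vote over three LLM annotators with an agreement filter. The card does not publish the labeling code or the exact sample weights, so the following is only a sketch of how such a filter could work. The function name `consensus_label` and the use of the agreement ratio as the sample weight are assumptions.

```python
from collections import Counter
from typing import Optional, Sequence, Tuple


def consensus_label(votes: Sequence[str]) -> Optional[Tuple[str, float]]:
    """Return (label, sample_weight) if at least 2 of 3 votes agree, else None."""
    if len(votes) != 3:
        raise ValueError("expected exactly three votes")
    label, count = Counter(votes).most_common(1)[0]
    if count < 2:
        return None  # all three annotators disagree: drop the example
    # Assumed weighting: the agreement ratio, 1.0 for 3/3 and about 0.667 for 2/3.
    return label, count / 3


print(consensus_label(["Legal", "Legal", "Legal"]))
print(consensus_label(["Legal", "Legal", "Human Resources"]))
print(consensus_label(["Legal", "Engineering", "Human Resources"]))
```

Unanimous examples get full weight, 2/3 examples get reduced weight, and examples with no majority are excluded from training.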

Limitations

  • Optimized for German job listings only
  • Class imbalance in the test set: IT & Software and Sales & Business Development (500 samples each) vs. Legal (14 samples) and Design & Creative (25 samples), so metrics for small classes have higher variance
  • Weakest classes: Education & Teaching (F1 0.653); General Management & Consulting (F1 0.680); Public Sector, Security & Defense (F1 0.710); Construction & Building (F1 0.717). These are broad or semantically overlapping categories
  • Administration & Office and General Management & Consulting act as catch-all categories for ambiguous office roles and may absorb borderline listings
  • Inputs longer than 512 tokens are truncated; use the title plus the first paragraph for best results
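
The truncation advice in the last bullet can be applied before tokenization. Here is a minimal sketch that keeps only the first paragraph of a description. The helper name `first_paragraph` and the blank-line paragraph rule are assumptions, and real listings may need HTML stripping first.

```python
import re


def first_paragraph(description: str) -> str:
    """Return the first non-empty block of text, splitting on blank lines."""
    for block in re.split(r"\n\s*\n", description.strip()):
        block = " ".join(block.split())  # collapse internal whitespace
        if block:
            return block
    return ""


listing = "Wir suchen einen Maurer für Hochbau.\n\nIhre Aufgaben:\n- Mauerwerk\n- Sanierung"
text = f"Maurer [SEP] {first_paragraph(listing)}"
print(text)
# Maurer [SEP] Wir suchen einen Maurer für Hochbau.
```

This does not guarantee that the result fits in 512 tokens; tokenizer truncation still applies to long first paragraphs.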

Related Models

Model                                       Categories            Use case
xlm-roberta-taxonomy-main-de (this model)   21 top-level          General job classification across all industries
xlm-roberta-taxonomy-it-de                  14 IT subcategories   Detailed IT role classification

A typical pipeline: this main model assigns the top-level category, and listings classified as IT & Software are then passed to the IT model (xlm-roberta-taxonomy-it-de) for fine-grained IT role classification.
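
The two-stage flow can be sketched as a small routing function. The classifiers are passed in as plain callables so the sketch stays independent of how each model is loaded. `route`, `main_clf`, and `it_clf` are hypothetical names, and the stub classifiers in the demo return made-up labels rather than real model output.

```python
from typing import Callable, Dict, Optional

Classifier = Callable[[str], str]  # maps input text to a predicted label


def route(text: str, main_clf: Classifier, it_clf: Classifier) -> Dict[str, Optional[str]]:
    """Run the main model, then the IT model only for IT & Software listings."""
    category = main_clf(text)
    subcategory = it_clf(text) if category == "IT & Software" else None
    return {"category": category, "it_subcategory": subcategory}


# Stub classifiers standing in for the two models (illustrative only).
def fake_main(text: str) -> str:
    return "IT & Software" if "DevOps" in text else "Skilled Trades & Crafts"


def fake_it(text: str) -> str:
    return "example-it-subcategory"  # placeholder, not a real label of the IT model


print(route("DevOps Engineer [SEP] Kubernetes, CI/CD.", fake_main, fake_it))
print(route("Maurer [SEP] Hochbau und Sanierung.", fake_main, fake_it))
```

Only listings routed to IT & Software pay the cost of a second model call.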
