How to use Ashybalka/xlm-roberta-taxonomy-main-de with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("text-classification", model="Ashybalka/xlm-roberta-taxonomy-main-de")
# Load model directly
from transformers import AutoTokenizer, AutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained("Ashybalka/xlm-roberta-taxonomy-main-de")
model = AutoModelForSequenceClassification.from_pretrained("Ashybalka/xlm-roberta-taxonomy-main-de")
Fine-tuned xlm-roberta-base for classifying German job listings into 21 top-level industry categories.
Part of the JobBlast taxonomy model family, a system for automated classification of German-language job postings. This model handles the broad top-level split across all industries; for detailed IT role classification, see the related IT model below.
| Metric | Value |
|---|---|
| Accuracy | 81.75% |
| F1 macro | 78.15% |
| F1 weighted | 81.88% |
Evaluated on a held-out test set of 3,320 German job listings.
| Category | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| Administration & Office | 0.704 | 0.737 | 0.720 | 190 |
| Construction & Building | 0.702 | 0.733 | 0.717 | 90 |
| Design & Creative | 0.714 | 0.800 | 0.755 | 25 |
| Education & Teaching | 0.620 | 0.689 | 0.653 | 45 |
| Engineering | 0.788 | 0.769 | 0.778 | 363 |
| Finance, Accounting & Controlling | 0.778 | 0.755 | 0.766 | 200 |
| General Management & Consulting | 0.611 | 0.767 | 0.680 | 129 |
| Healthcare & Medical | 0.892 | 0.892 | 0.892 | 194 |
| Hospitality, Gastronomy & Tourism | 0.897 | 0.776 | 0.832 | 67 |
| Human Resources | 0.667 | 0.783 | 0.720 | 23 |
| IT & Software | 0.907 | 0.880 | 0.893 | 500 |
| Insurance & Real Estate | 0.819 | 0.907 | 0.861 | 75 |
| Legal | 0.688 | 0.786 | 0.733 | 14 |
| Logistics, Transport & Warehouse | 0.859 | 0.905 | 0.882 | 74 |
| Marketing, Communications & PR | 0.794 | 0.711 | 0.750 | 38 |
| Production & Manufacturing | 0.712 | 0.767 | 0.739 | 129 |
| Public Sector, Security & Defense | 0.695 | 0.725 | 0.710 | 91 |
| Sales & Business Development | 0.908 | 0.908 | 0.908 | 500 |
| Science & Research | 0.733 | 0.815 | 0.772 | 27 |
| Skilled Trades & Crafts | 0.873 | 0.780 | 0.824 | 431 |
| Social Work & Care | 0.826 | 0.826 | 0.826 | 115 |
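The aggregate scores can be cross-checked against the per-category table. The sketch below copies the F1 and Support columns from the table and recomputes macro F1 (unweighted mean over classes) and weighted F1 (mean weighted by support). The names `PER_CLASS`, `macro_f1`, and `weighted_f1` are illustrative and not part of the model repository.

```python
# (F1, support) pairs copied from the per-category table above.
PER_CLASS = {
    "Administration & Office": (0.720, 190),
    "Construction & Building": (0.717, 90),
    "Design & Creative": (0.755, 25),
    "Education & Teaching": (0.653, 45),
    "Engineering": (0.778, 363),
    "Finance, Accounting & Controlling": (0.766, 200),
    "General Management & Consulting": (0.680, 129),
    "Healthcare & Medical": (0.892, 194),
    "Hospitality, Gastronomy & Tourism": (0.832, 67),
    "Human Resources": (0.720, 23),
    "IT & Software": (0.893, 500),
    "Insurance & Real Estate": (0.861, 75),
    "Legal": (0.733, 14),
    "Logistics, Transport & Warehouse": (0.882, 74),
    "Marketing, Communications & PR": (0.750, 38),
    "Production & Manufacturing": (0.739, 129),
    "Public Sector, Security & Defense": (0.710, 91),
    "Sales & Business Development": (0.908, 500),
    "Science & Research": (0.772, 27),
    "Skilled Trades & Crafts": (0.824, 431),
    "Social Work & Care": (0.826, 115),
}


def macro_f1(rows):
    """Unweighted mean of per-class F1: every class counts equally."""
    return sum(f1 for f1, _ in rows.values()) / len(rows)


def weighted_f1(rows):
    """Mean of per-class F1 weighted by the number of test samples per class."""
    total = sum(n for _, n in rows.values())
    return sum(f1 * n for f1, n in rows.values()) / total


total_support = sum(n for _, n in PER_CLASS.values())
print(total_support)                      # size of the test set
print(round(macro_f1(PER_CLASS), 4))      # compare with "F1 macro"
print(round(weighted_f1(PER_CLASS), 4))   # compare with "F1 weighted"
```

Up to rounding of the three-decimal table values, the results agree with the reported 3,320 test listings, 78.15% macro F1, and 81.88% weighted F1. The gap between the two scores reflects that the largest classes (IT & Software, Sales & Business Development) are also among the best-performing ones.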
Ashybalka/xlm-roberta-taxonomy-main-de/
├── config.json                 # shared: label map, model config
├── tokenizer.json              # shared: fast tokenizer
├── tokenizer_config.json       # shared
├── sentencepiece.bpe.model     # shared
├── special_tokens_map.json     # shared
├── test_metrics.json           # evaluation results
├── classification_report.txt   # full per-class report
│
├── pytorch/
│   ├── config.json             # needed for from_pretrained(subfolder=)
│   └── model.safetensors       # GPU inference / fine-tuning (~1.1 GB)
│
├── onnx/
│   └── model.onnx              # CPU fp32 inference (~1.1 GB)
│
└── onnx-int8/
    └── model_quantized.onnx    # CPU INT8 quantized (~280 MB, 3-4× smaller)
[Job Title] [SEP] [Job Description]
text = "Pflegefachkraft [SEP] Wir suchen eine examinierte Pflegefachkraft " \
       "für die Betreuung von Bewohnern in unserer Senioreneinrichtung."
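The input format above can be wrapped in a small helper so that every caller builds the string the same way. This is a minimal sketch: `format_listing` is a hypothetical name, and the whitespace normalization is an assumption rather than documented preprocessing for this model.

```python
import re


def format_listing(title: str, description: str) -> str:
    """Build the '[Job Title] [SEP] [Job Description]' input string."""

    def clean(s: str) -> str:
        # Assumption: collapse runs of whitespace (newlines, tabs) into single
        # spaces. The model card does not state how inputs were preprocessed.
        return re.sub(r"\s+", " ", s).strip()

    return f"{clean(title)} [SEP] {clean(description)}"


print(format_listing("Maurer", "Hochbau\nund  Sanierungsarbeiten"))
```

The literal `[SEP]` marker is written as plain text between title and description, exactly as in the examples on this page.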
from transformers import pipeline
clf = pipeline(
    "text-classification",
    model="Ashybalka/xlm-roberta-taxonomy-main-de",
    subfolder="pytorch",
    device=0,  # GPU; -1 for CPU
)
result = clf("DevOps Engineer [SEP] Kubernetes, CI/CD, Monitoring mit Prometheus.")
print(result)
# [{'label': 'IT & Software', 'score': 0.9712}]
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer, pipeline
model = ORTModelForSequenceClassification.from_pretrained(
    "Ashybalka/xlm-roberta-taxonomy-main-de",
    subfolder="onnx",
)
tokenizer = AutoTokenizer.from_pretrained("Ashybalka/xlm-roberta-taxonomy-main-de")
clf = pipeline("text-classification", model=model, tokenizer=tokenizer)
result = clf("Maurer [SEP] Erfahrener Maurer für Hochbau und Sanierungsarbeiten gesucht.")
print(result)
# [{'label': 'Construction & Building', 'score': 0.9483}]
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer, pipeline
model = ORTModelForSequenceClassification.from_pretrained(
    "Ashybalka/xlm-roberta-taxonomy-main-de",
    subfolder="onnx-int8",
)
tokenizer = AutoTokenizer.from_pretrained("Ashybalka/xlm-roberta-taxonomy-main-de")
clf = pipeline("text-classification", model=model, tokenizer=tokenizer)
For production FastAPI services or environments without transformers:
import json
import numpy as np
import onnxruntime as ort
from huggingface_hub import snapshot_download
from tokenizers import Tokenizer
# Download only the files needed for ONNX inference
path = snapshot_download(
    "Ashybalka/xlm-roberta-taxonomy-main-de",
    allow_patterns=["onnx/model.onnx", "tokenizer.json", "config.json"],
)
# Load the ONNX session and the fast tokenizer
session = ort.InferenceSession(
    f"{path}/onnx/model.onnx",
    providers=["CPUExecutionProvider"],
)
tokenizer = Tokenizer.from_file(f"{path}/tokenizer.json")
tokenizer.enable_truncation(max_length=510)  # leaves room for <s> and </s>
tokenizer.no_padding()
with open(f"{path}/config.json") as f:
    id2label = json.load(f)["id2label"]
# Order labels by numeric class id so that list index == logit index
labels = [v for _, v in sorted(id2label.items(), key=lambda x: int(x[0]))]
vocab = tokenizer.get_vocab()
bos, eos = vocab["<s>"], vocab["</s>"]


def classify(title: str, description: str) -> dict:
    text = f"{title} [SEP] {description}"
    # Special tokens are added manually below, so disable them here
    encoding = tokenizer.encode(text, add_special_tokens=False)
    ids = [bos] + encoding.ids + [eos]
    mask = [1] * len(ids)
    logits = session.run(None, {
        "input_ids": np.array([ids], dtype=np.int64),
        "attention_mask": np.array([mask], dtype=np.int64),
    })[0][0]
    # Numerically stable softmax
    exp = np.exp(logits - logits.max())
    probs = exp / exp.sum()
    idx = int(np.argmax(probs))
    return {"category": labels[idx], "confidence": round(float(probs[idx]), 4)}
print(classify("Python Developer", "FastAPI, Docker, PostgreSQL, REST API Entwicklung."))
# {'category': 'IT & Software', 'confidence': 0.9534}
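`classify` returns only the single best category. For borderline listings it can help to inspect the runner-up categories as well. The following is a sketch of a top-k variant of the same softmax step, written in pure Python so it runs without NumPy; `top_k` is a hypothetical helper, not part of the model repository.

```python
import math


def top_k(logits, labels, k=3):
    """Return the k most probable (label, probability) pairs.

    `logits` is a sequence of raw scores, one per class, and `labels`
    is the list of class names in the same order.
    """
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return [(label, round(p, 4)) for label, p in ranked[:k]]


# Toy logits for illustration only; real logits come from session.run(...).
print(top_k([2.0, 1.0, 0.1], ["IT & Software", "Engineering", "Legal"], k=2))
```

Inside `classify`, the same helper could be called as `top_k(logits.tolist(), labels)` in place of the argmax step.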
| Parameter | Value |
|---|---|
| Base model | FacebookAI/xlm-roberta-base |
| Labeling | LLM consensus (3-model voting) |
| Agreement filter | 2/3 or 3/3 required |
| Max length | 512 tokens |
| Learning rate | 2e-5 |
| Class weighting | Balanced + sample weights by agreement |
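The labeling and weighting rows can be illustrated with a short sketch. It shows majority voting over three labelers with a 2/3 agreement filter, plus the common "balanced" class-weight heuristic `n_samples / (n_classes * class_count)`. The exact scheme used for training is not documented here, so the function names, the choice of that heuristic, and the use of the agreement fraction as a sample weight are all assumptions.

```python
from collections import Counter


def consensus(votes, min_agree=2):
    """Majority label among the labelers' votes, or None if agreement is too low.

    Returns (label, agreement), where agreement is the fraction of votes
    for the winning label, e.g. 2/3 or 3/3 for three labelers.
    """
    label, count = Counter(votes).most_common(1)[0]
    if count < min_agree:
        return None  # dropped by the agreement filter
    return label, count / len(votes)


def balanced_class_weights(labels):
    """'Balanced' heuristic: rare classes get proportionally larger weights."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * cnt) for c, cnt in counts.items()}


print(consensus(["IT & Software", "IT & Software", "Engineering"]))
print(consensus(["Legal", "Engineering", "Human Resources"]))  # no majority
print(balanced_class_weights(["Legal", "Sales", "Sales", "Sales"]))
```

Under these assumptions, a per-sample training weight could be formed as the class weight multiplied by the agreement fraction, so that unanimously labeled listings count more than 2/3 majorities.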
Class sizes are imbalanced: IT & Software and Sales & Business Development have 500 samples each, while Legal has 14 and Design & Creative has 25, so metrics for the small classes have higher variance.
The weakest categories are Education & Teaching (F1 0.653), General Management & Consulting (F1 0.680), and Construction & Building (F1 0.717), which are broad or semantically overlapping.
Administration & Office and General Management & Consulting act as catch-all categories for ambiguous office roles and may absorb borderline listings.

| Model | Categories | Use case |
|---|---|---|
| xlm-roberta-taxonomy-main-de (this model) | 21 top-level | General job classification across all industries |
| xlm-roberta-taxonomy-it-de | 14 IT subcategories | Detailed IT role classification |
A typical pipeline: this main model assigns the top-level category, and listings classified as IT & Software are then passed to the IT model (xlm-roberta-taxonomy-it-de) for fine-grained IT role classification.
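The two-stage routing can be sketched as follows. The function assumes both classifiers are callables that return Transformers-pipeline-style output (a list of dicts with `label` and `score`); `classify_two_stage` and the result keys are illustrative names, not an API provided by either model.

```python
from typing import Callable, Dict, List

# Assumed shape: clf(text) -> [{"label": str, "score": float}, ...]
Classifier = Callable[[str], List[Dict]]


def classify_two_stage(
    text: str,
    main_clf: Classifier,
    it_clf: Classifier,
    it_label: str = "IT & Software",
) -> Dict:
    """Run the main model first; call the IT model only for IT listings."""
    top = main_clf(text)[0]
    result = {
        "category": top["label"],
        "category_score": top["score"],
        "it_role": None,
    }
    if top["label"] == it_label:
        # Second stage: fine-grained IT role classification.
        sub = it_clf(text)[0]
        result["it_role"] = sub["label"]
        result["it_role_score"] = sub["score"]
    return result
```

In practice `main_clf` and `it_clf` would be two `pipeline("text-classification", ...)` objects, one per model. Passing them in as arguments keeps the routing logic testable with simple stubs and avoids loading either model at import time.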