Model Card for CALDISS-AAU/da-reported-speech-e5
A fine-tuned few-shot (SetFit) model for detecting reported speech in Danish transcripts.
Model Details
- Base model: intfloat/multilingual-e5-large
- Language: Danish (da)
- Task: Reported speech detection
- Training data: Danish jobcenter conversation transcripts
Model Description
- Model Type: SetFit
- Sentence Transformer body: intfloat/multilingual-e5-large
- Classification head: a LogisticRegression instance
- Maximum Sequence Length: 512 tokens
- Number of Classes: 2 classes
- Language: Danish
- License: MIT License
This model is a few-shot classifier fine-tuned on transcribed interviews from a job center in Denmark. It is designed for binary classification of reported speech, identifying sentences where a speaker references or quotes another person.
To support real-world usage, this model is integrated into a two-part processing pipeline that allows users to analyze interview documents and highlight relevant sentences.
This model is used in a document processing pipeline that performs the following tasks:
- 1️⃣ Input Handling: Accepts .docx files containing interview transcripts.
- 2️⃣ Sentence Segmentation: Splits the document into individual sentences.
- 3️⃣ Sentence Classification: Applies the trained model to classify sentences based on reported speech criteria.
- 4️⃣ HTML-Based Highlighting: Adds visual markers (via HTML tags) to classified sentences.
- 5️⃣ Output Generation: Produces a .docx file with highlighted sentences, preserving the original content.
Additionally, a GUI-based wrapper (built with Gooey) provides a user-friendly .exe program, allowing non-technical users to process documents efficiently. For a more in-depth view of the GUI, please see the GitHub repository linked below.
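The snippet below is only a minimal, illustrative sketch of the classification-and-highlighting step; it assumes python-docx for document handling and a naive regex sentence splitter, and it uses .docx highlighting rather than the HTML-based markers used in the actual pipeline.

import re
from docx import Document
from docx.enum.text import WD_COLOR_INDEX
from setfit import SetFitModel

model = SetFitModel.from_pretrained("CALDISS-AAU/da-reported-speech-e5")

def highlight_reported_speech(in_path: str, out_path: str) -> None:
    """Highlight sentences classified as reported speech in a .docx transcript."""
    doc = Document(in_path)
    for paragraph in doc.paragraphs:
        # Naive sentence segmentation; the real pipeline may use a proper splitter
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", paragraph.text) if s.strip()]
        if not sentences:
            continue
        labels = model.predict(sentences)
        # Rebuild the paragraph, highlighting sentences labelled as reported speech
        paragraph.clear()
        for sentence, label in zip(sentences, labels):
            run = paragraph.add_run(sentence + " ")
            if label == "reported-speech":  # label name as listed in this card
                run.font.highlight_color = WD_COLOR_INDEX.YELLOW
    doc.save(out_path)

highlight_reported_speech("interview.docx", "interview_highlighted.docx")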
- Developed by: CALDISS, AAU
- Funded by: Aalborg University
- Model type: Few-shot text classifier
- Language(s) (NLP): Danish
- License: MIT
- Finetuned from model: intfloat/multilingual-e5-large
Model Sources
- Repository: Project repository
- Paper: Work in progress by the collaborating researcher.
Uses
The model is trained and evaluated on text snippets of "reported speech" from Danish interviews between citizens and job counselors. It is intended to identify "reported speech" in similar text documents of that genre and is assumed unsuitable for general classification of "reported speech".
Intended users include researchers or analysts working with Danish conversational data or transcripts who are specifically interested in reported speech as a phenomenon.
The following groups (but not only these) may find it useful:
Social scientists & political scientists:
- Analysing interview transcripts for social research.
- Identifying speech patterns in employment, front-desk services or other institutional/governmental settings.
Linguists & NLP researchers:
- Studying reported speech in Danish.
- Developing methods for classifying speech using Transformer architectures.
Out-of-Scope Use
- This model is not designed for live conversation analysis or chatbot-like interactions. It works best in offline document processing workflows.
- General-purpose text classification outside reported speech.
- Live conversational AI or real-time speech processing.
- Multilingual applications (this model is optimized for Danish only).
Bias, Risks, and Limitations
The model is trained on Danish job center interviews, so performance may vary on other types of texts.
Binary classification is based on reported speech detection, but edge cases may exist.
While based on a multilingual model, this fine-tuned version is specifically optimized for Danish. Performance may be unreliable in other languages.
The model assumes transcripts. Messy, informal, or highly unstructured text (e.g., speech-to-text outputs with errors) may reduce accuracy.
Recommendations
Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.
How to Get Started with the Model
Use the code below to get started with the model. It is a SetFit model, so it should be loaded with the setfit library rather than plain transformers.
from setfit import SetFitModel

# Load the fine-tuned SetFit model (sentence-transformer body + LogisticRegression head)
model = SetFitModel.from_pretrained("CALDISS-AAU/da-reported-speech-e5")

# Classify a sentence for reported speech
text = "Han sagde: 'Jeg kommer i morgen.'"
prediction = model.predict([text])
print(prediction)  # e.g. ['reported-speech']
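If class probabilities are needed (for example, to flag borderline sentences for manual review), the logistic-regression head also exposes probability estimates. This is a minimal sketch assuming the same model object as above.
# Probability per class from the LogisticRegression head
probs = model.predict_proba([text])
print(probs)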
Training Details
Training Data
Training data consists of 55 transcripts of conversations between a citizen and a social worker, collected from a Danish jobcenter. The data is therefore sensitive and not attached to this model card. The data was evaluated to be balanced, with a 50/50 split between the two labels.
Training Procedure
Pretraining & Base Model:
This model is fine-tuned on top of intfloat/multilingual-e5-large, a transformer-based model optimized for embedding-based retrieval. The base model was pretrained with contrastive learning on large-scale multilingual datasets, making it well-suited for semantic similarity and classification tasks.
Fine-Tuning Details
Training Dataset:
The model was fine-tuned using labelled transcribed interviews from a Danish job center.
Due to the sensitive nature of the data, it is not publicly available.
Objective:
The model was trained for binary classification of reported speech.
Labels indicate whether a sentence contains reported speech (reported-speech, not reported-speech).
Training Configuration:
- Few-shot learning approach with domain-specific samples
- Batch size: 32
- Body learning rate: 1.0770502781075495e-06
- Solver (classification head): lbfgs
- Number of epochs: 6
- Max iterations (classification head): 279
- Evaluation metrics: Accuracy & F1-score
Technical Implementation
Tokenization performed using the SentencePiece-based tokenizer from intfloat/multilingual-e5-large.
Fine-tuning was done using PyTorch and the SetFit trainer (built on the Hugging Face ecosystem).
The model is optimized for batch inference rather than real-time processing.
📌 For more details on the architecture, refer to the base model: multilingual-e5-large.
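As a quick illustration of the tokenization step, the base model's SentencePiece tokenizer can be inspected directly from the Hugging Face Hub (this is purely illustrative and not part of the fine-tuning code):
from transformers import AutoTokenizer

# SentencePiece-based tokenizer shipped with the base model
tokenizer = AutoTokenizer.from_pretrained("intfloat/multilingual-e5-large")
print(tokenizer.tokenize("Han sagde, at han kommer i morgen."))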
Training Hyperparameters
- batch_size: (32, 32)
- num_epochs: (6, 6)
- max_steps: -1
- sampling_strategy: oversampling
- body_learning_rate: (1.0770502781075495e-06, 1.0770502781075495e-06)
- head_learning_rate: 0.01
- loss: CosineSimilarityLoss
- distance_metric: cosine_distance
- margin: 0.25
- end_to_end: False
- use_amp: False
- warmup_proportion: 0.1
- seed: 42
- eval_max_steps: -1
- load_best_model_at_end: True
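For illustration only, the hyperparameters above map onto SetFit's TrainingArguments roughly as in the sketch below. The two example sentences are hypothetical placeholders, since the actual training data is sensitive and not public.

from datasets import Dataset
from sentence_transformers.losses import CosineSimilarityLoss
from setfit import SetFitModel, Trainer, TrainingArguments

# Hypothetical placeholder examples; the real training data is not public
train_ds = Dataset.from_dict({
    "text": ["Han sagde, at han ville komme i morgen.", "Vejret er fint i dag."],
    "label": ["reported-speech", "not reported-speech"],
})

# Base model with a LogisticRegression head (solver/max_iter as listed above)
model = SetFitModel.from_pretrained(
    "intfloat/multilingual-e5-large",
    head_params={"solver": "lbfgs", "max_iter": 279},
)

args = TrainingArguments(
    batch_size=(32, 32),
    num_epochs=(6, 6),
    body_learning_rate=(1.0770502781075495e-06, 1.0770502781075495e-06),
    head_learning_rate=0.01,
    sampling_strategy="oversampling",
    loss=CosineSimilarityLoss,
    warmup_proportion=0.1,
    seed=42,
)

# eval_dataset is reused here only to keep the sketch self-contained
trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=train_ds)
trainer.train()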
Evaluation
Testing Data, Factors & Metrics
- Accuracy: 0.9724770642201835
- Precision: 0.9557522123893806
- Recall: 0.9908256880733946
- F1: 0.972972972972973
Metrics
The model was evaluated using standard classification metrics to measure its performance.
Evaluation Metrics:
Accuracy: Measures the overall correctness of predictions.
F1-Score: Balances precision and recall, ensuring that both false positives and false negatives are considered.
Precision: Measures how many of the predicted reported speech sentences are actually correct.
Results:
- Not reported speech: Precision: 0.959, Recall: 0.924, F1-score: 0.941
- Reported speech: Precision: 0.927, Recall: 0.961, F1-score: 0.943
- Accuracy: 0.942
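Per-class scores of this kind can be reproduced with scikit-learn's classification_report; the label lists below are dummy placeholders, not the actual held-out test split.
from sklearn.metrics import classification_report

# Dummy placeholder labels; the real test split is not public
y_true = ["reported-speech", "not reported-speech", "reported-speech", "not reported-speech"]
y_pred = ["reported-speech", "not reported-speech", "reported-speech", "reported-speech"]

print(classification_report(y_true, y_pred, digits=3))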
Hardware used
- Hardware Type: 48 CPU cores (AMD EPYC 9454), 192 GB memory, 1 NVIDIA H100 GPU
- Hours used: 50
- Cloud Provider: UCloud (SDU)
- Compute Region: Cloud services based at the University of Southern Denmark, Aarhus University and Aalborg University
Compute Infrastructure
UCloud cloud infrastructure available at the Danish universities.
Framework Versions
- Python: 3.12.3
- SetFit: 1.0.3
- Sentence Transformers: 3.0.1
- Transformers: 4.39.0
- PyTorch: 2.4.1+cu121
- Datasets: 2.21.0
- Tokenizers: 0.15.2
BibTeX:
@article{tunstall2022efficient,
  doi       = {10.48550/ARXIV.2209.11055},
  url       = {https://arxiv.org/abs/2209.11055},
  author    = {Tunstall, Lewis and Reimers, Nils and Jo, Unso Eun Seo and Bates, Luke and Korat, Daniel and Wasserblat, Moshe and Pereg, Oren},
  keywords  = {Computation and Language (cs.CL), FOS: Computer and information sciences},
  title     = {Efficient Few-Shot Learning Without Prompts},
  publisher = {arXiv},
  year      = {2022},
  copyright = {Creative Commons Attribution 4.0 International}
}
Model Card Authors
- Matias Kokholm Appel - mkap@adm.aau.dk
- Kristian Gade Kjelmann - kgk@adm.aau.dk
- Nana Ohmeyer
Model Card Contact