Model Card for CALDISS-AAU/da-reported-speech-e5
A fine-tuned few-shot (SetFit) model for detecting reported speech in Danish transcripts.
Model Details
- Base model: intfloat/multilingual-e5-large
- Language: Danish (da)
- Task: Reported speech detection
- Training data: Danish jobcenter conversation transcripts
Model Description
- Model Type: SetFit
- Sentence Transformer body: intfloat/multilingual-e5-large
- Classification head: a LogisticRegression instance
- Maximum Sequence Length: 512 tokens
- Number of Classes: 2 classes
- Language: Danish
- License: MIT License
This model is a few-shot classifier fine-tuned on transcribed interviews from a job center in Denmark. It is designed for binary classification of reported speech, identifying sentences where a speaker references or quotes another person.
To support real-world usage, this model is integrated into a two-part processing pipeline that allows users to analyze interview documents and highlight relevant sentences.
This model is used in a document processing pipeline that performs the following tasks:
- 1️⃣ Input Handling: Accepts .docx files containing interview transcripts.
- 2️⃣ Sentence Segmentation: Splits the document into individual sentences.
- 3️⃣ Sentence Classification: Applies the trained model to classify sentences based on reported speech criteria.
- 4️⃣ HTML-Based Highlighting: Adds visual markers (via HTML tags) to classified sentences.
- 5️⃣ Output Generation: Produces a .docx file with highlighted sentences, preserving the original content.
Additionally, a GUI-based wrapper (built with Gooey) provides a user-friendly .exe program, allowing non-technical users to process documents efficiently. For a more in-depth view of the GUI, please see the GitHub repository linked below.
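The snippet below is only a minimal, illustrative sketch of the classification-and-highlighting step; it assumes python-docx for document handling and a naive regex sentence splitter, and it uses .docx highlighting rather than the HTML-based markers used in the actual pipeline.

import re
from docx import Document
from docx.enum.text import WD_COLOR_INDEX
from setfit import SetFitModel

model = SetFitModel.from_pretrained("CALDISS-AAU/da-reported-speech-e5")

def highlight_reported_speech(in_path: str, out_path: str) -> None:
    """Highlight sentences classified as reported speech in a .docx transcript."""
    doc = Document(in_path)
    for paragraph in doc.paragraphs:
        # Naive sentence segmentation; the real pipeline may use a proper splitter
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", paragraph.text) if s.strip()]
        if not sentences:
            continue
        labels = model.predict(sentences)
        # Rebuild the paragraph, highlighting sentences labelled as reported speech
        paragraph.clear()
        for sentence, label in zip(sentences, labels):
            run = paragraph.add_run(sentence + " ")
            if label == "reported-speech":  # label name as listed in this card
                run.font.highlight_color = WD_COLOR_INDEX.YELLOW
    doc.save(out_path)

highlight_reported_speech("interview.docx", "interview_highlighted.docx")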
- Developed by: CALDISS, AAU
- Funded by: Aalborg University
- Model type: Few-shot text classifier
- Language(s) (NLP): Danish
- License: MIT
- Finetuned from model: intfloat/multilingual-e5-large
Model Sources
- Repository: Project repository
- Paper: Work in progress by the collaborating researcher.
Uses
The model is trained and evaluated on text snippets of "reported speech" from Danish interviews between citizens and job counselors. It is intended to identify "reported speech" in similar text documents of that genre and is assumed unsuitable for general classification of "reported speech".
Intended users include researchers or analysts working with Danish conversational data or transcripts who are specifically interested in reported speech as a phenomenon.
The following groups (but not only these) may find it useful:
Social scientists & political scientists:
- Analysing interview transcripts for social research.
- Identifying speech patterns in employment, front-desk services or other institutional/governmental settings.
Linguists & NLP researchers:
- Studying reported speech in Danish.
- Developing methods for classifying speech using Transformer architectures.
Out-of-Scope Use
- This model is not designed for live conversation analysis or chatbot-like interactions. It works best in offline document processing workflows.
- General-purpose text classification outside reported speech.
- Live conversational AI or real-time speech processing.
- Multilingual applications (this model is optimized for Danish only).
Bias, Risks, and Limitations
The model is trained on Danish job center interviews, so performance may vary on other types of texts.
Binary classification is based on reported speech detection, but edge cases may exist.
While based on a multilingual model, this fine-tuned version is specifically optimized for Danish. Performance may be unreliable in other languages.
The model assumes transcripts. Messy, informal, or highly unstructured text (e.g., speech-to-text outputs with errors) may reduce accuracy.
Recommendations
Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.
How to Get Started with the Model
Use the code below to get started with the model. It is a SetFit model, so it should be loaded with the setfit library rather than plain transformers.
from setfit import SetFitModel

# Load the fine-tuned SetFit model (sentence-transformer body + LogisticRegression head)
model = SetFitModel.from_pretrained("CALDISS-AAU/da-reported-speech-e5")

# Classify a sentence for reported speech
text = "Han sagde: 'Jeg kommer i morgen.'"
prediction = model.predict([text])
print(prediction)  # e.g. ['reported-speech']
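If class probabilities are needed (for example, to flag borderline sentences for manual review), the logistic-regression head also exposes probability estimates. This is a minimal sketch assuming the same model object as above.
# Probability per class from the LogisticRegression head
probs = model.predict_proba([text])
print(probs)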
Training Details
Training Data
Training data consists of 55 transcripts of conversations between a citizen and a social worker, collected from a Danish jobcenter. The data is therefore sensitive and not attached to this model card. The data was evaluated to be balanced, with a 50/50 split between the two labels.
Training Procedure
Pretraining & Base Model:
This model is fine-tuned on top of intfloat/multilingual-e5-large, a transformer-based model optimized for embedding-based retrieval. The base model was pretrained with contrastive learning on large-scale multilingual datasets, making it well-suited for semantic similarity and classification tasks.
Fine-Tuning Details
Training Dataset:
The model was fine-tuned using labelled transcribed interviews from a Danish job center.
Due to the sensitive nature of the data, it is not publicly available.
Objective:
The model was trained for binary classification of reported speech.
Labels indicate whether a sentence contains reported speech (reported-speech, not reported-speech).
Training Configuration:
- Few-shot learning approach with domain-specific samples
- Batch size: 32
- Body learning rate: 1.0770502781075495e-06
- Solver (classification head): lbfgs
- Number of epochs: 6
- Max iterations (classification head): 279
- Evaluation metrics: Accuracy & F1-score
Technical Implementation
Tokenization performed using the SentencePiece-based tokenizer from intfloat/multilingual-e5-large.
Fine-tuning was done using PyTorch and the SetFit trainer (built on the Hugging Face ecosystem).
The model is optimized for batch inference rather than real-time processing.
📌 For more details on the architecture, refer to the base model: multilingual-e5-large.
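As a quick illustration of the tokenization step, the base model's SentencePiece tokenizer can be inspected directly from the Hugging Face Hub (this is purely illustrative and not part of the fine-tuning code):
from transformers import AutoTokenizer

# SentencePiece-based tokenizer shipped with the base model
tokenizer = AutoTokenizer.from_pretrained("intfloat/multilingual-e5-large")
print(tokenizer.tokenize("Han sagde, at han kommer i morgen."))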
Training Hyperparameters
- batch_size: (32, 32)
- num_epochs: (6, 6)
- max_steps: -1
- sampling_strategy: oversampling
- body_learning_rate: (1.0770502781075495e-06, 1.0770502781075495e-06)
- head_learning_rate: 0.01
- loss: CosineSimilarityLoss
- distance_metric: cosine_distance
- margin: 0.25
- end_to_end: False
- use_amp: False
- warmup_proportion: 0.1
- seed: 42
- eval_max_steps: -1
- load_best_model_at_end: True
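For illustration only, the hyperparameters above map onto SetFit's TrainingArguments roughly as in the sketch below. The two example sentences are hypothetical placeholders, since the actual training data is sensitive and not public.

from datasets import Dataset
from sentence_transformers.losses import CosineSimilarityLoss
from setfit import SetFitModel, Trainer, TrainingArguments

# Hypothetical placeholder examples; the real training data is not public
train_ds = Dataset.from_dict({
    "text": ["Han sagde, at han ville komme i morgen.", "Vejret er fint i dag."],
    "label": ["reported-speech", "not reported-speech"],
})

# Base model with a LogisticRegression head (solver/max_iter as listed above)
model = SetFitModel.from_pretrained(
    "intfloat/multilingual-e5-large",
    head_params={"solver": "lbfgs", "max_iter": 279},
)

args = TrainingArguments(
    batch_size=(32, 32),
    num_epochs=(6, 6),
    body_learning_rate=(1.0770502781075495e-06, 1.0770502781075495e-06),
    head_learning_rate=0.01,
    sampling_strategy="oversampling",
    loss=CosineSimilarityLoss,
    warmup_proportion=0.1,
    seed=42,
)

# eval_dataset is reused here only to keep the sketch self-contained
trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=train_ds)
trainer.train()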
Evaluation
Testing Data, Factors & Metrics
- Accuracy: 0.9724770642201835
- Precision: 0.9557522123893806
- Recall: 0.9908256880733946
- F1: 0.972972972972973
Metrics
The model was evaluated using standard classification metrics to measure its performance.
Evaluation Metrics:
Accuracy: Measures the overall correctness of predictions.
F1-Score: Balances precision and recall, ensuring that both false positives and false negatives are considered.
Precision: Measures how many of the predicted reported speech sentences are actually correct.
Results:
- Not reported speech: Precision: 0.959, Recall: 0.924, F1-score: 0.941
- Reported speech: Precision: 0.927, Recall: 0.961, F1-score: 0.943
- Accuracy: 0.942
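Per-class scores of this kind can be reproduced with scikit-learn's classification_report; the label lists below are dummy placeholders, not the actual held-out test split.
from sklearn.metrics import classification_report

# Dummy placeholder labels; the real test split is not public
y_true = ["reported-speech", "not reported-speech", "reported-speech", "not reported-speech"]
y_pred = ["reported-speech", "not reported-speech", "reported-speech", "reported-speech"]

print(classification_report(y_true, y_pred, digits=3))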
Hardware used
- Hardware Type: 48 CPU cores (AMD EPYC 9454), 192 GB memory, 1 NVIDIA H100 GPU
- Hours used: 50
- Cloud Provider: UCloud (SDU)
- Compute Region: Cloud services based at the University of Southern Denmark, Aarhus University and Aalborg University
Compute Infrastructure
UCloud cloud infrastructure available at the Danish universities.
Framework Versions
- Python: 3.12.3
- SetFit: 1.0.3
- Sentence Transformers: 3.0.1
- Transformers: 4.39.0
- PyTorch: 2.4.1+cu121
- Datasets: 2.21.0
- Tokenizers: 0.15.2
BibTeX:
@article{tunstall2022efficient,
  doi       = {10.48550/ARXIV.2209.11055},
  url       = {https://arxiv.org/abs/2209.11055},
  author    = {Tunstall, Lewis and Reimers, Nils and Jo, Unso Eun Seo and Bates, Luke and Korat, Daniel and Wasserblat, Moshe and Pereg, Oren},
  keywords  = {Computation and Language (cs.CL), FOS: Computer and information sciences},
  title     = {Efficient Few-Shot Learning Without Prompts},
  publisher = {arXiv},
  year      = {2022},
  copyright = {Creative Commons Attribution 4.0 International}
}
Model Card Authors
- Matias Kokholm Appel - mkap@adm.aau.dk
- Kristian Gade Kjelmann - kgk@adm.aau.dk
- Nana Ohmeyer
Model Card Contact