---
license: apache-2.0
language:
- multilingual
- af
- am
- ar
- as
- az
- be
- bg
- bn
- br
- bs
- ca
- cs
- cy
- da
- de
- el
- en
- eo
- es
- et
- eu
- fa
- fi
- fr
- fy
- ga
- gd
- gl
- gu
- ha
- he
- hi
- hr
- hu
- hy
- id
- is
- it
- ja
- jv
- ka
- kk
- km
- kn
- ko
- ku
- ky
- la
- lo
- lt
- lv
- mg
- mk
- ml
- mn
- mr
- ms
- my
- ne
- nl
- 'no'
- om
- or
- pa
- pl
- ps
- pt
- ro
- ru
- sa
- sd
- si
- sk
- sl
- so
- sq
- sr
- su
- sv
- sw
- ta
- te
- th
- tl
- tr
- ug
- uk
- ur
- uz
- vi
- xh
- yi
- zh
tags:
- text-classification
- register
- web-register
- genre
---
# Web register classification (English model)
A web register classifier for texts in English, fine-tuned from [XLM-RoBERTa-large](https://huggingface.co/FacebookAI/xlm-roberta-large).
The model is trained with the [Corpus of Online Registers of English (CORE)](https://github.com/TurkuNLP/CORE-corpus) to classify documents based on the [CORE taxonomy](https://turkunlp.org/register-annotation-docs/).
It is designed to support the development of open language models and to assist linguists in analyzing register variation.
For a multilingual CORE classifier, see [here](https://huggingface.co/TurkuNLP/web-register-classification-multilingual).
## Model Details
### Model Description
- **Developed by:** TurkuNLP
- **Funded by:** The Research Council of Finland, Emil Aaltonen Foundation, University of Turku
- **Shared by:** TurkuNLP
- **Model type:** Language model
- **Language(s) (NLP):** English
- **License:** apache-2.0
- **Finetuned from model:** FacebookAI/xlm-roberta-large
### Model Sources
- **Repository:** Coming soon!
- **Paper:** Coming soon!
## Register labels and their abbreviations
Below is a list of the register labels predicted by the model. Note that some labels are hierarchical; when a sublabel is predicted, its parent label is also predicted.
For a more detailed description of the label scheme, see [here](https://turkunlp.org/register-annotation-docs/).
The main labels are uppercase. To include only these main labels in the predictions, keep just the uppercase labels from the model's output (see the filtering sketch after the example code below).
- **LY:** Lyrical
- **SP:** Spoken
- **it:** Interview
- **ID:** Interactive discussion
- **NA:** Narrative
- **ne:** News report
- **sr:** Sports report
- **nb:** Narrative blog
- **HI:** How-to or instructions
- **re:** Recipe
- **IN:** Informational description
- **en:** Encyclopedia article
- **ra:** Research article
- **dtp:** Description of a thing or person
- **fi:** Frequently asked questions
- **lt:** Legal terms and conditions
- **OP:** Opinion
- **rv:** Review
- **ob:** Opinion blog
- **rs:** Denominational religious blog or sermon
- **av:** Advice
- **IP:** Informational persuasion
- **ds:** Description with intent to sell
- **ed:** News & opinion blog or editorial
## How to Get Started with the Model
Use the code below to get started with the model.
```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_id = "TurkuNLP/web-register-classification-en"
# Load model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained(model_id).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Text to be categorized
text = "A text to be categorized"
# Tokenize text
inputs = tokenizer([text], return_tensors="pt", padding=True, truncation=True, max_length=512).to(device)
with torch.no_grad():
    outputs = model(**inputs)
# Apply sigmoid to the logits to get probabilities
probabilities = torch.sigmoid(outputs.logits).squeeze()
# Choose a threshold for predicting labels (0.40 was optimal on the English test set; see Evaluation below)
threshold = 0.5
predicted_label_indices = (probabilities > threshold).nonzero(as_tuple=True)[0]
# Extract readable labels using id2label
id2label = model.config.id2label
predicted_labels = [id2label[idx.item()] for idx in predicted_label_indices]
print("Predicted labels:", predicted_labels)
```
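To restrict the output to the main (uppercase) register labels, a simple filter over the predicted labels suffices. This is a minimal sketch; the helper name `keep_main_labels` is ours, not part of the model:

```python
def keep_main_labels(labels):
    """Keep only the main CORE labels, which are uppercase in this scheme."""
    return [label for label in labels if label.isupper()]

# Continuing from the example above:
print("Predicted main labels:", keep_main_labels(predicted_labels))
```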
## Training Details
### Training Data
The model was trained using the Multilingual CORE Corpora, which will be published soon.
### Training Procedure
#### Training Hyperparameters
- **Batch size:** 8
- **Epochs:** 9
- **Learning rate:** 0.00003
- **Precision:** bfloat16 (non-mixed precision)
- **TF32:** Enabled
- **Seed:** 42
- **Max sequence length:** 512 tokens
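The training code is not yet released; as a rough illustration only, the hyperparameters above would map onto Hugging Face `TrainingArguments` along these lines. Note that the card specifies *non-mixed* bfloat16, which likely means the model weights themselves were kept in `torch.bfloat16` rather than using the mixed-precision `bf16=True` flag shown here:

```python
from transformers import TrainingArguments

# Illustrative sketch only: an approximate mapping of the listed hyperparameters,
# not the released training script. output_dir is a hypothetical path.
training_args = TrainingArguments(
    output_dir="web-register-classification-en",
    per_device_train_batch_size=8,
    num_train_epochs=9,
    learning_rate=3e-5,
    bf16=True,   # the card specifies non-mixed bfloat16; see the note above
    tf32=True,
    seed=42,
)
```

The 512-token maximum sequence length is applied at tokenization time (`max_length=512`), as in the getting-started example.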
#### Inference time
Average inference time (measured over 1,000 iterations) on a single NVIDIA A100 GPU with a batch size of one is **17 ms** per example. With larger batches, inference can be considerably faster.
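As a sketch of batched inference, reusing `model`, `tokenizer`, `device`, `id2label`, and `threshold` from the getting-started example above:

```python
texts = ["First text to classify", "Second text to classify"]

# Tokenize the whole batch at once; padding aligns the sequences
batch = tokenizer(texts, return_tensors="pt", padding=True,
                  truncation=True, max_length=512).to(device)

with torch.no_grad():
    logits = model(**batch).logits

# One row of probabilities per input text
probabilities = torch.sigmoid(logits)
for text, probs in zip(texts, probabilities):
    labels = [id2label[i] for i, p in enumerate(probs) if p > threshold]
    print(text, "->", labels)
```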
## Evaluation
Micro-averaged F1 scores and optimized prediction thresholds (test set):
| Language | F1 (All labels) | F1 (Main labels) | Threshold |
| -------- | --------------- | ---------------- | ----------|
| English | 0.74 | 0.76 | 0.40 |
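The thresholds above were tuned on held-out data. A minimal sketch of how such a sweep might look, maximizing micro-averaged F1 over a grid (`dev_probs` and `dev_true` are hypothetical arrays of sigmoid outputs and binary gold labels, both of shape `(n_examples, n_labels)`):

```python
import numpy as np
from sklearn.metrics import f1_score

def best_threshold(dev_probs, dev_true):
    """Pick the threshold that maximizes micro-averaged F1 on a development set."""
    best_t, best_f1 = 0.5, 0.0
    for t in np.arange(0.05, 0.95, 0.05):
        f1 = f1_score(dev_true, (dev_probs > t).astype(int), average="micro")
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1
```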
## Technical Specifications
### Compute Infrastructure
- Mahti supercomputer (CSC - IT Center for Science, Finland)
- 1 x NVIDIA A100-SXM4-40GB
#### Software
- torch 2.2.1
- transformers 4.39.3
## Citation
If you use this model, please cite the following publication:
```bibtex
@misc{henriksson2024untanglingunrestrictedwebautomatic,
title={Untangling the Unrestricted Web: Automatic Identification of Multilingual Registers},
author={Erik Henriksson and Amanda Myntti and Anni Eskelinen and Selcen Erten-Johansson and Saara Hellström and Veronika Laippala},
year={2024},
eprint={2406.19892},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2406.19892},
}
```
Earlier related work includes the following:
```bibtex
@article{Laippala.etal2022,
title = {Register Identification from the Unrestricted Open {{Web}} Using the {{Corpus}} of {{Online Registers}} of {{English}}},
author = {Laippala, Veronika and R{\"o}nnqvist, Samuel and Oinonen, Miika and Kyr{\"o}l{\"a}inen, Aki-Juhani and Salmela, Anna and Biber, Douglas and Egbert, Jesse and Pyysalo, Sampo},
year = {2022},
journal = {Language Resources and Evaluation},
issn = {1574-0218},
doi = {10.1007/s10579-022-09624-1},
url = {https://doi.org/10.1007/s10579-022-09624-1},
}
@article{Skantsi_Laippala_2023,
title = {Analyzing the unrestricted web: The {Finnish} corpus of online registers},
doi = {10.1017/S0332586523000021},
journal = {Nordic Journal of Linguistics},
author = {Skantsi, Valtteri and Laippala, Veronika},
year = {2023},
pages = {1–31}
}
```
## Model Card Contact
Erik Henriksson, Hugging Face username: erikhenriksson