# Classifying Text into DB07 Codes

This model is a fine-tuned version of [xlm-roberta-base](https://huggingface.co/xlm-roberta-base) that classifies Danish descriptions of business activities into [Dansk Branchekode DB07](https://www.dst.dk/en/Statistik/dokumentation/nomenklaturer/dansk-branchekode-db07) codes.


## Data
Approximately 2.5 million business names and activity descriptions from Norwegian and Danish businesses were used to fine-tune the model. The Norwegian descriptions were translated into Danish, and the Norwegian SN 2007 codes were mapped to Danish DB07 codes.

Each activity description was concatenated with the business name, with the separator token `</s>` between them. The model was therefore trained on input texts of the form `f"{description_of_activity}</s>{business_name}"`.
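
The input format above can be reproduced with a small helper. This is a minimal sketch, not part of the model's own code: `build_input`, `SEP_TOKEN`, and the example business name `Sko ApS` are hypothetical names chosen for illustration. Stripping whitespace and falling back to the description alone when no business name is given are assumptions, not documented training behaviour.

```python
SEP_TOKEN = "</s>"  # separator token named in the training-format description


def build_input(description_of_activity: str, business_name: str = "") -> str:
    """Join an activity description and a business name in the training format.

    Hypothetical helper: whitespace stripping and the description-only
    fallback are assumptions made for this sketch.
    """
    description = description_of_activity.strip()
    name = business_name.strip()
    if not name:
        # No business name available: pass the description on its own.
        return description
    return f"{description}{SEP_TOKEN}{name}"


text = build_input("Vi sælger sko", "Sko ApS")
print(text)
```

The resulting string can be passed directly to the classification pipeline.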

## Quick Start

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

model_id = "erst/xlm-roberta-base-finetuned-db07"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

# "text-classification" is the canonical task name ("sentiment-analysis" is an alias).
# By default the pipeline returns only the highest-scoring label.
pl = pipeline(
    "text-classification",
    model=model,
    tokenizer=tokenizer,
)

# Activity description only
pl("Vi sælger sko")

# Activity description and business name, separated by </s>
pl("We sell clothes</s>Clothing ApS")
```

## License

This model is released under the MIT License.