# Classifying Text into DB07 Codes
This model is [xlm-roberta-base](https://huggingface.co/xlm-roberta-base) fine-tuned to classify Danish descriptions of activities into [Dansk Branchekode DB07](https://www.dst.dk/en/Statistik/dokumentation/nomenklaturer/dansk-branchekode-db07) codes.
## Data
Approximately 2.5 million business names and descriptions of activities from Norwegian and Danish businesses were used to fine-tune the model. The Norwegian descriptions were translated into Danish, and the Norwegian SN 2007 codes were mapped to Danish DB07 codes.
Each activity description was concatenated with the business name, with the separator token `</s>` between them. The model was therefore trained on input texts of the form `f"{description_of_activity}</s>{business_name}"`.
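The input format above can be reproduced with a small helper. This is only a minimal sketch: the names `build_input` and `SEP_TOKEN` are hypothetical and not part of the model or of `transformers`, and treating the business name as optional is an assumption based on the description-only example in the Quick Start.

```python
from typing import Optional

# Separator token used between description and business name during training
SEP_TOKEN = "</s>"


def build_input(description_of_activity: str, business_name: Optional[str] = None) -> str:
    """Build an input string in the format the model was trained on.

    Hypothetical helper for illustration. If no business name is given,
    only the description is returned (an assumption, mirroring the
    description-only example in the Quick Start).
    """
    description = description_of_activity.strip()
    if business_name is None or not business_name.strip():
        return description
    return f"{description}{SEP_TOKEN}{business_name.strip()}"


text = build_input("Vi sælger sko", "Sko ApS")
```

The resulting string can be passed directly to the classification pipeline shown in the Quick Start.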
## Quick Start
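```python
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification

# Load the fine-tuned tokenizer and classification model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("erst/xlm-roberta-base-finetuned-db07")
model = AutoModelForSequenceClassification.from_pretrained("erst/xlm-roberta-base-finetuned-db07")

# "text-classification" is the canonical name of the "sentiment-analysis" task alias.
# By default the pipeline returns only the highest-scoring label.
pl = pipeline(
    "text-classification",
    model=model,
    tokenizer=tokenizer,
)

# Description of activity only
pl("Vi sælger sko")
# Description of activity and business name, separated by </s>
pl("We sell clothes</s>Clothing ApS")
```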
## License
This model is released under the MIT License.