erikhenriksson committed on
Commit
663e709
1 Parent(s): d1e0b7b

Update README.md

  1. README.md +37 -46
README.md CHANGED
@@ -9,18 +9,14 @@ language:
9
  metrics:
10
  - f1
11
  ---
12
- # Web register classification model (multilingual model)
13
 
14
- A web register classification model
15
 
16
  ## Model Details
17
 
18
  ### Model Description
19
 
20
- <!-- Provide a longer summary of what this model is. -->
21
-
22
-
23
-
24
  - **Developed by:** TurkuNLP
25
  - **Funded by:** The Research Council of Finland, Eemil Aaltonen Foundation, University of Turku
26
  - **Shared by:** TurkuNLP
@@ -43,34 +39,56 @@ It is designed to support the development of open language models and for lingui
43
 
44
  ## How to Get Started with the Model
45
 
 
 
46
  ```
 
 
 
47
  device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
48
 
49
- # Init model
50
- model = AutoModelForSequenceClassification.from_pretrained(cfg.model_path).to(
51
-     device
52
- )
53
- model.eval()
54
 
55
- # Get the original model's name and init tokenizer
56
- with open(f"{cfg.model_path}/config.json", "r") as config_file:
57
-     config = json.load(config_file)
58
- tokenizer = AutoTokenizer.from_pretrained(config.get("_name_or_path"))
59
 
60
- ```
 
 
 
 
61
 
62
 
63
  ## Training Details
64
 
65
  ### Training Data
66
 
67
- The dataset that the model was trained on will be published soon!
68
 
69
  ### Training Procedure
70
 
71
  #### Training Hyperparameters
72
 
73
  - **Batch size:** 8
 
74
  - **Learning rate:** 0.00005
75
  - **Precision:** bfloat16 (non-mixed precision)
76
  - **TF32:** Enabled
@@ -85,34 +103,6 @@ Average inference time (across 1000 iterations), using a single NVIDIA A100 GPU
85
 
86
  Coming soon
87
 
88
- ### Testing Data, Factors & Metrics
89
-
90
- #### Testing Data
91
-
92
- <!-- This should link to a Dataset Card if possible. -->
93
-
94
- [More Information Needed]
95
-
96
- #### Factors
97
-
98
- <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
99
-
100
- [More Information Needed]
101
-
102
- #### Metrics
103
-
104
- <!-- These are the evaluation metrics being used, ideally with a description of why. -->
105
-
106
- [More Information Needed]
107
-
108
- ### Results
109
-
110
- [More Information Needed]
111
-
112
- #### Summary
113
-
114
-
115
-
116
 
117
  ## Technical Specifications
118
 
@@ -126,7 +116,8 @@ NVIDIA A100-SXM4-40GB
126
 
127
  #### Software
128
 
129
- Pytorch
 
130
 
131
  ## Citation
132
 
 
9
  metrics:
10
  - f1
11
  ---
12
+ # Web register classification (multilingual model)
13
 
14
+ A web register classification model fine-tuned from XLM-RoBERTa-large.
15
 
16
  ## Model Details
17
 
18
  ### Model Description
19
 
 
 
 
 
20
  - **Developed by:** TurkuNLP
21
  - **Funded by:** The Research Council of Finland, Eemil Aaltonen Foundation, University of Turku
22
  - **Shared by:** TurkuNLP
 
39
 
40
  ## How to Get Started with the Model
41
 
42
+ Use the code below to get started with the model.
43
+
44
  ```
45
+ import torch
46
+ from transformers import AutoModelForSequenceClassification, AutoTokenizer
47
+
48
  device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
49
 
50
+ model_id = "TurkuNLP/multilingual-web-register-classification"
 
 
 
 
51
 
52
+ # Load model and tokenizer
53
+ model = AutoModelForSequenceClassification.from_pretrained(model_id).to(device)
54
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
 
55
 
56
+ # Text to be categorized
57
+ text = "A text to be categorized"
58
+
59
+ # Tokenize text
60
+ inputs = tokenizer([text], return_tensors="pt", padding=True, truncation=True, max_length=512).to(device)
61
 
62
+ with torch.no_grad():
63
+     outputs = model(**inputs)
64
+
65
+ # Apply sigmoid to the logits to get probabilities
66
+ probabilities = torch.sigmoid(outputs.logits).squeeze()
67
+
68
+ # Determine a threshold for predicting labels (e.g., 0.5)
69
+ threshold = 0.5
70
+ predicted_label_indices = (probabilities > threshold).nonzero(as_tuple=True)[0]
71
+
72
+ # Extract readable labels using id2label
73
+ id2label = model.config.id2label
74
+ predicted_labels = [id2label[idx.item()] for idx in predicted_label_indices]
75
+
76
+ print("Predicted labels:", predicted_labels)
77
+
78
+ ```
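The decoding step in the snippet above (sigmoid on each logit, then a threshold, then an `id2label` lookup) can be illustrated without loading the model. The following is a minimal sketch in plain Python, not part of the committed README: the label names and logit values are hypothetical placeholders, and `decode_multilabel` is an illustrative helper rather than a `transformers` function.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def decode_multilabel(logits, id2label, threshold=0.5):
    """Return every label whose sigmoid probability exceeds the threshold.

    Each logit is scored independently, so zero, one, or several labels
    can be returned for the same text.
    """
    probabilities = [sigmoid(z) for z in logits]
    return [id2label[i] for i, p in enumerate(probabilities) if p > threshold]


# Hypothetical label map and logits, for illustration only
id2label = {0: "label_a", 1: "label_b", 2: "label_c"}
logits = [2.0, -1.0, 0.3]

predicted = decode_multilabel(logits, id2label)
print("Predicted labels:", predicted)
```

Because each label is thresholded on its own, raising `threshold` trades recall for precision; a suitable value would normally be chosen on held-out data rather than fixed at 0.5.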
79
 
80
  ## Training Details
81
 
82
  ### Training Data
83
 
84
+ The model was trained using the Multilingual CORE Corpora, which will be published soon.
85
 
86
  ### Training Procedure
87
 
88
  #### Training Hyperparameters
89
 
90
  - **Batch size:** 8
91
+ - **Epochs:** 7
92
  - **Learning rate:** 0.00005
93
  - **Precision:** bfloat16 (non-mixed precision)
94
  - **TF32:** Enabled
 
103
 
104
  Coming soon
105
 
106
 
107
  ## Technical Specifications
108
 
 
116
 
117
  #### Software
118
 
119
+ torch 2.2.1
120
+ transformers 4.39.3
121
 
122
  ## Citation
123