mnavas committed on
Commit
2fa4faa
1 Parent(s): 9091bbd
README.md ADDED
@@ -0,0 +1,119 @@
+ ---
+ license: apache-2.0
+ tags:
+ - generated_from_trainer
+ metrics:
+ - f1
+ - accuracy
+ model-index:
+ - name: roberta-finetuned-CPV_Spanish
+   results: []
+ ---
+
+ <!-- This model card has been generated automatically according to the information the Trainer had access to. You
+ should probably proofread and complete it, then remove this comment. -->
+
+ # roberta-finetuned-CPV_Spanish
+
+ This model is a fine-tuned version of [PlanTL-GOB-ES/roberta-base-bne](https://huggingface.co/PlanTL-GOB-ES/roberta-base-bne) on a dataset derived from Spanish Public Procurement documents from 2019. The whole fine-tuning process is available in the following [Kaggle notebook](https://www.kaggle.com/code/marianavasloro/fine-tuned-roberta-for-spanish-cpv-codes).
+
+ It achieves the following results on the evaluation set:
+ - Loss: 0.0465
+ - F1: 0.7918
+ - ROC AUC: 0.8860
+ - Accuracy: 0.7376
+ - Coverage Error: 10.2744
+ - Label Ranking Average Precision Score: 0.7973
+
+ ## Intended uses & limitations
+
+ This model predicts only the first two digits (the division) of a CPV (Common Procurement Vocabulary) code. The CPV divisions are listed below:
+
+ | Division | English | Spanish |
+ |----------|:-------:|---------|
+ | 03 | Agricultural, farming, fishing, forestry and related products | Productos de la agricultura, ganadería, pesca, silvicultura y productos afines |
+ | 09 | Petroleum products, fuel, electricity and other sources of energy | Derivados del petróleo, combustibles, electricidad y otras fuentes de energía |
+ | 14 | Mining, basic metals and related products | Productos de la minería, de metales de base y productos afines |
+ | 15 | Food, beverages, tobacco and related products | Alimentos, bebidas, tabaco y productos afines |
+ | 16 | Agricultural machinery | Maquinaria agrícola |
+ | 18 | Clothing, footwear, luggage articles and accessories | Prendas de vestir, calzado, artículos de viaje y accesorios |
+ | 19 | Leather and textile fabrics, plastic and rubber materials | Piel y textiles, materiales de plástico y caucho |
+ | 22 | Printed matter and related products | Impresos y productos relacionados |
+ | 24 | Chemical products | Productos químicos |
+ | 30 | Office and computing machinery, equipment and supplies except furniture and software packages | Máquinas, equipo y artículos de oficina y de informática, excepto mobiliario y paquetes de software |
+ | 31 | Electrical machinery, apparatus, equipment and consumables; lighting | Máquinas, aparatos, equipo y productos consumibles eléctricos; iluminación |
+ | 32 | Radio, television, communication, telecommunication and related equipment | Equipos de radio, televisión, comunicaciones y telecomunicaciones y equipos conexos |
+ | 33 | Medical equipments, pharmaceuticals and personal care products | Equipamiento y artículos médicos, farmacéuticos y de higiene personal |
+ | 34 | Transport equipment and auxiliary products to transportation | Equipos de transporte y productos auxiliares |
+ | 35 | Security, fire-fighting, police and defence equipment | Equipo de seguridad, extinción de incendios, policía y defensa |
+ | 37 | Musical instruments, sport goods, games, toys, handicraft, art materials and accessories | Instrumentos musicales, artículos deportivos, juegos, juguetes, artículos de artesanía, materiales artísticos y accesorios |
+ | 38 | Laboratory, optical and precision equipments (excl. glasses) | Equipo de laboratorio, óptico y de precisión (excepto gafas) |
+ | 39 | Furniture (incl. office furniture), furnishings, domestic appliances (excl. lighting) and cleaning products | Mobiliario (incluido el de oficina), complementos de mobiliario, aparatos electrodomésticos (excluida la iluminación) y productos de limpieza |
+ | 41 | Collected and purified water | Agua recogida y depurada |
+ | 42 | Industrial machinery | Maquinaria industrial |
+ | 43 | Machinery for mining, quarrying, construction equipment | Maquinaria para la minería y la explotación de canteras y equipo de construcción |
+ | 44 | Construction structures and materials; auxiliary products to construction (except electric apparatus) | Estructuras y materiales de construcción; productos auxiliares para la construcción (excepto aparatos eléctricos) |
+ | 45 | Construction work | Trabajos de construcción |
+ | 48 | Software package and information systems | Paquetes de software y sistemas de información |
+ | 50 | Repair and maintenance services | Servicios de reparación y mantenimiento |
+ | 51 | Installation services (except software) | Servicios de instalación (excepto software) |
+ | 55 | Hotel, restaurant and retail trade services | Servicios comerciales al por menor de hostelería y restauración |
+ | 60 | Transport services (excl. Waste transport) | Servicios de transporte (excluido el transporte de residuos) |
+ | 63 | Supporting and auxiliary transport services; travel agencies services | Servicios de transporte complementarios y auxiliares; servicios de agencias de viajes |
+ | 64 | Postal and telecommunications services | Servicios de correos y telecomunicaciones |
+ | 65 | Public utilities | Servicios públicos |
+ | 66 | Financial and insurance services | Servicios financieros y de seguros |
+ | 70 | Real estate services | Servicios inmobiliarios |
+ | 71 | Architectural, construction, engineering and inspection services | Servicios de arquitectura, construcción, ingeniería e inspección |
+ | 72 | IT services: consulting, software development, Internet and support | Servicios TI: consultoría, desarrollo de software, Internet y apoyo |
+ | 73 | Research and development services and related consultancy services | Servicios de investigación y desarrollo y servicios de consultoría conexos |
+ | 75 | Administration, defence and social security services | Servicios de administración pública, defensa y servicios de seguridad social |
+ | 76 | Services related to the oil and gas industry | Servicios relacionados con la industria del gas y del petróleo |
+ | 77 | Agricultural, forestry, horticultural, aquacultural and apicultural services | Servicios agrícolas, forestales, hortícolas, acuícolas y apícolas |
+ | 79 | Business services: law, marketing, consulting, recruitment, printing and security | Servicios a empresas: legislación, mercadotecnia, asesoría, selección de personal, imprenta y seguridad |
+ | 80 | Education and training services | Servicios de enseñanza y formación |
+ | 85 | Health and social work services | Servicios de salud y asistencia social |
+ | 90 | Sewage, refuse, cleaning and environmental services | Servicios de alcantarillado, basura, limpieza y medio ambiente |
+ | 92 | Recreational, cultural and sporting services | Servicios de esparcimiento, culturales y deportivos |
+ | 98 | Other community, social and personal services | Otros servicios comunitarios, sociales o personales |
+
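Because the model works at division level only, a full CPV code has to be reduced to its first two digits before it can be compared with a prediction. The following is a minimal sketch of that reduction, not code from the model card. It assumes the usual CPV layout of eight digits optionally followed by a hyphen and a check digit; the helper name `cpv_division` and the sample codes are hypothetical and used only for illustration.

```python
import re

# Assumed CPV layout: eight digits, optionally followed by "-<check digit>".
# The check digit is accepted but not verified here.
_CPV_PATTERN = re.compile(r"^(\d{8})(?:-(\d))?$")


def cpv_division(code: str) -> str:
    """Return the two-digit division of a CPV code (hypothetical helper)."""
    match = _CPV_PATTERN.match(code.strip())
    if match is None:
        raise ValueError(f"not a valid CPV code: {code!r}")
    return match.group(1)[:2]


# A few division names copied from the table above, for illustration only.
DIVISION_NAMES = {
    "45": "Construction work",
    "72": "IT services: consulting, software development, Internet and support",
    "90": "Sewage, refuse, cleaning and environmental services",
}

if __name__ == "__main__":
    for sample in ("45233120-6", "72000000-5", "90910000"):
        division = cpv_division(sample)
        print(sample, "->", division, DIVISION_NAMES.get(division, "(not listed here)"))
```

Keeping the division as a string rather than an integer preserves leading zeros, which matters for divisions such as `03` and `09`.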
+ ## Training and evaluation data
+
+ ### Training hyperparameters
+
+ The following hyperparameters were used during training:
+ - learning_rate: 2e-05
+ - train_batch_size: 8
+ - eval_batch_size: 8
+ - seed: 42
+ - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
+ - lr_scheduler_type: linear
+ - num_epochs: 10
+
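To make the `linear` scheduler setting concrete, here is a small sketch of how the learning rate would decay over the run. The base rate, batch size and epoch count come from the list above, and the 9054 steps per epoch come from the step counts reported under Training results. The card does not mention warmup, so zero warmup is an assumption, as is training on a single device without gradient accumulation (used only for the rough dataset-size estimate).

```python
# Values taken from the model card.
BASE_LR = 2e-05
TRAIN_BATCH_SIZE = 8
NUM_EPOCHS = 10
STEPS_PER_EPOCH = 9054  # step count reported after epoch 1
TOTAL_STEPS = STEPS_PER_EPOCH * NUM_EPOCHS


def linear_lr(step: int, warmup_steps: int = 0) -> float:
    """Linear warmup followed by linear decay to zero.

    warmup_steps=0 is an assumption; the card does not state a warmup.
    """
    if step < warmup_steps:
        return BASE_LR * step / max(1, warmup_steps)
    remaining = max(0, TOTAL_STEPS - step)
    return BASE_LR * remaining / max(1, TOTAL_STEPS - warmup_steps)


# Upper bound on training examples per epoch, assuming one device and no
# gradient accumulation (the last batch of an epoch may be smaller).
MAX_EXAMPLES_PER_EPOCH = STEPS_PER_EPOCH * TRAIN_BATCH_SIZE

if __name__ == "__main__":
    print("total optimizer steps:", TOTAL_STEPS)
    print("max examples per epoch:", MAX_EXAMPLES_PER_EPOCH)
    for epoch in (0, 5, 10):
        print(f"lr after epoch {epoch}: {linear_lr(epoch * STEPS_PER_EPOCH):.2e}")
```

Under these assumptions the rate starts at 2e-05, is halved by the end of epoch 5, and reaches zero at step 90540.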
+ ### Training results
+
+ | Training Loss | Epoch | Step | Validation Loss | F1 | ROC AUC | Accuracy | Coverage Error | Label Ranking Average Precision Score |
+ |:-------------:|:-----:|:-----:|:---------------:|:------:|:-------:|:--------:|:--------------:|:-------------------------------------:|
+ | 0.0354 | 1.0 | 9054 | 0.0362 | 0.7560 | 0.8375 | 0.6963 | 14.0835 | 0.7357 |
+ | 0.0311 | 2.0 | 18108 | 0.0331 | 0.7756 | 0.8535 | 0.7207 | 12.7880 | 0.7633 |
+ | 0.0235 | 3.0 | 27162 | 0.0333 | 0.7823 | 0.8705 | 0.7283 | 11.5179 | 0.7811 |
+ | 0.0157 | 4.0 | 36216 | 0.0348 | 0.7821 | 0.8699 | 0.7274 | 11.5836 | 0.7798 |
+ | 0.011 | 5.0 | 45270 | 0.0377 | 0.7799 | 0.8787 | 0.7239 | 10.9173 | 0.7841 |
+ | 0.008 | 6.0 | 54324 | 0.0395 | 0.7854 | 0.8787 | 0.7309 | 10.9042 | 0.7879 |
+ | 0.0042 | 7.0 | 63378 | 0.0421 | 0.7872 | 0.8823 | 0.7300 | 10.5687 | 0.7903 |
+ | 0.0025 | 8.0 | 72432 | 0.0439 | 0.7884 | 0.8867 | 0.7305 | 10.2220 | 0.7934 |
+ | 0.0015 | 9.0 | 81486 | 0.0456 | 0.7889 | 0.8872 | 0.7316 | 10.1781 | 0.7945 |
+ | 0.001 | 10.0 | 90540 | 0.0465 | 0.7918 | 0.8860 | 0.7376 | 10.2744 | 0.7973 |
+
+
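The table shows the metrics disagreeing about which epoch is best: validation loss bottoms out early while F1 and accuracy keep improving. The sketch below simply re-reads a subset of the table's columns to make that explicit. The `EpochResult` type and variable names are illustrative, and the numbers are copied from the rows above.

```python
from typing import NamedTuple


class EpochResult(NamedTuple):
    epoch: int
    val_loss: float
    f1: float
    roc_auc: float
    accuracy: float


# Copied from the training results table (subset of columns).
RESULTS = [
    EpochResult(1, 0.0362, 0.7560, 0.8375, 0.6963),
    EpochResult(2, 0.0331, 0.7756, 0.8535, 0.7207),
    EpochResult(3, 0.0333, 0.7823, 0.8705, 0.7283),
    EpochResult(4, 0.0348, 0.7821, 0.8699, 0.7274),
    EpochResult(5, 0.0377, 0.7799, 0.8787, 0.7239),
    EpochResult(6, 0.0395, 0.7854, 0.8787, 0.7309),
    EpochResult(7, 0.0421, 0.7872, 0.8823, 0.7300),
    EpochResult(8, 0.0439, 0.7884, 0.8867, 0.7305),
    EpochResult(9, 0.0456, 0.7889, 0.8872, 0.7316),
    EpochResult(10, 0.0465, 0.7918, 0.8860, 0.7376),
]

best_by_loss = min(RESULTS, key=lambda r: r.val_loss)
best_by_f1 = max(RESULTS, key=lambda r: r.f1)
best_by_roc_auc = max(RESULTS, key=lambda r: r.roc_auc)

if __name__ == "__main__":
    print("lowest validation loss: epoch", best_by_loss.epoch)
    print("highest F1:             epoch", best_by_f1.epoch)
    print("highest ROC AUC:        epoch", best_by_roc_auc.epoch)
```

The evaluation figures quoted at the top of the card match the epoch 10 row, which is the best epoch by F1 and accuracy but not by validation loss.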
+ ### Framework versions
+
+ - Transformers 4.16.2
+ - PyTorch 1.9.1
+ - Datasets 1.18.4
+ - Tokenizers 0.11.6
+
+
+ ### Acknowledgments
+
+ This work has been supported by the NextProcurement European Action (grant agreement INEA/CEF/ICT/A2020/2373713-Action 2020-ES-IA-0255) and the Madrid Government (Comunidad de Madrid-Spain) under the Multiannual Agreement with Universidad Politécnica de Madrid in the line Support for R&D projects for Beatriz Galindo researchers, in the context of the V PRICIT (Regional Programme of Research and Technological Innovation). We also acknowledge the participation of Jennifer Tabita in the preparation of the initial set of notebooks, and the AI4Gov master's students from the first cohort for their validation of the approach. Source of the data: Ministerio de Hacienda.
config.json ADDED
@@ -0,0 +1,123 @@
+ {
+   "_name_or_path": "PlanTL-GOB-ES/roberta-base-bne",
+   "architectures": [
+     "RobertaForSequenceClassification"
+   ],
+   "attention_probs_dropout_prob": 0.0,
+   "bos_token_id": 0,
+   "classifier_dropout": null,
+   "eos_token_id": 2,
+   "gradient_checkpointing": false,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.0,
+   "hidden_size": 768,
+   "id2label": {
+     "0": "03",
+     "1": "09",
+     "2": "14",
+     "3": "15",
+     "4": "16",
+     "5": "18",
+     "6": "19",
+     "7": "22",
+     "8": "24",
+     "9": "30",
+     "10": "31",
+     "11": "32",
+     "12": "33",
+     "13": "34",
+     "14": "35",
+     "15": "37",
+     "16": "38",
+     "17": "39",
+     "18": "41",
+     "19": "42",
+     "20": "43",
+     "21": "44",
+     "22": "45",
+     "23": "48",
+     "24": "50",
+     "25": "51",
+     "26": "55",
+     "27": "60",
+     "28": "63",
+     "29": "64",
+     "30": "65",
+     "31": "66",
+     "32": "70",
+     "33": "71",
+     "34": "72",
+     "35": "73",
+     "36": "75",
+     "37": "76",
+     "38": "77",
+     "39": "79",
+     "40": "80",
+     "41": "85",
+     "42": "90",
+     "43": "92",
+     "44": "98"
+   },
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "label2id": {
+     "03": 0,
+     "09": 1,
+     "14": 2,
+     "15": 3,
+     "16": 4,
+     "18": 5,
+     "19": 6,
+     "22": 7,
+     "24": 8,
+     "30": 9,
+     "31": 10,
+     "32": 11,
+     "33": 12,
+     "34": 13,
+     "35": 14,
+     "37": 15,
+     "38": 16,
+     "39": 17,
+     "41": 18,
+     "42": 19,
+     "43": 20,
+     "44": 21,
+     "45": 22,
+     "48": 23,
+     "50": 24,
+     "51": 25,
+     "55": 26,
+     "60": 27,
+     "63": 28,
+     "64": 29,
+     "65": 30,
+     "66": 31,
+     "70": 32,
+     "71": 33,
+     "72": 34,
+     "73": 35,
+     "75": 36,
+     "76": 37,
+     "77": 38,
+     "79": 39,
+     "80": 40,
+     "85": 41,
+     "90": 42,
+     "92": 43,
+     "98": 44
+   },
+   "layer_norm_eps": 1e-05,
+   "max_position_embeddings": 514,
+   "model_type": "roberta",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "pad_token_id": 1,
+   "position_embedding_type": "absolute",
+   "problem_type": "multi_label_classification",
+   "torch_dtype": "float32",
+   "transformers_version": "4.16.2",
+   "type_vocab_size": 1,
+   "use_cache": true,
+   "vocab_size": 50262
+ }
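Since `problem_type` is `multi_label_classification`, each of the 45 outputs is an independent yes/no decision rather than one choice among 45, so logits are typically passed through a per-label sigmoid and thresholded. The sketch below illustrates that decoding with the `id2label` mapping from this config. The 0.5 threshold and the function names are assumptions, since the commit does not state how predictions were thresholded, and the logits in the usage lines are made up.

```python
import math

# Division codes in the order given by "id2label" in config.json.
CPV_DIVISIONS = [
    "03", "09", "14", "15", "16", "18", "19", "22", "24", "30",
    "31", "32", "33", "34", "35", "37", "38", "39", "41", "42",
    "43", "44", "45", "48", "50", "51", "55", "60", "63", "64",
    "65", "66", "70", "71", "72", "73", "75", "76", "77", "79",
    "80", "85", "90", "92", "98",
]
ID2LABEL = dict(enumerate(CPV_DIVISIONS))


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def decode_multilabel(logits: list[float], threshold: float = 0.5) -> list[str]:
    """Map one logit per label to the divisions whose probability passes the threshold.

    threshold=0.5 is an assumed default, not a value taken from the commit.
    """
    if len(logits) != len(ID2LABEL):
        raise ValueError(f"expected {len(ID2LABEL)} logits, got {len(logits)}")
    return [ID2LABEL[i] for i, z in enumerate(logits) if sigmoid(z) >= threshold]


if __name__ == "__main__":
    # Made-up logits: strongly negative everywhere except two labels.
    example = [-5.0] * len(ID2LABEL)
    example[22] = 3.0   # index 22 maps to division "45"
    example[34] = 0.2   # index 34 maps to division "72"
    print(decode_multilabel(example))
```

A document can therefore receive zero, one or several divisions, which is why the card reports ranking metrics such as coverage error alongside F1.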
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ec3b704a4ba9c6b5a13fd4000da3684ea15e1d9f446da3b7a008ab1e702ed3fd
+ size 498797101
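The three lines above are a Git LFS pointer, not the weights themselves: the real file is fetched by its SHA-256 object ID. As a rough illustration, the sketch below parses such a pointer and turns the byte size into an approximate parameter count. It assumes 4 bytes per parameter (matching `"torch_dtype": "float32"` in config.json) and ignores serialization overhead, so the figure is only an estimate. The function name `parse_lfs_pointer` is hypothetical.

```python
# Pointer text copied from the pytorch_model.bin entry in this commit.
POINTER_TEXT = """\
version https://git-lfs.github.com/spec/v1
oid sha256:ec3b704a4ba9c6b5a13fd4000da3684ea15e1d9f446da3b7a008ab1e702ed3fd
size 498797101
"""


def parse_lfs_pointer(text: str) -> dict:
    """Parse the "key value" lines of a Git LFS pointer file (hypothetical helper)."""
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        key, _, value = line.partition(" ")
        fields[key] = value
    algorithm, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algorithm": algorithm,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


pointer = parse_lfs_pointer(POINTER_TEXT)

# Rough estimate only: assumes every byte stores float32 weights.
approx_params_millions = pointer["size_bytes"] / 4 / 1e6

if __name__ == "__main__":
    print(pointer["hash_algorithm"], pointer["digest"][:12] + "...")
    print(f"size: {pointer['size_bytes'] / 1e6:.1f} MB")
    print(f"approx. parameters: {approx_params_millions:.1f} M")
```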
special_tokens_map.json ADDED
@@ -0,0 +1 @@
+ {"bos_token": {"content": "<s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}, "eos_token": {"content": "</s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}, "unk_token": {"content": "<unk>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}, "sep_token": {"content": "</s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}, "pad_token": {"content": "<pad>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}, "cls_token": {"content": "<s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}, "mask_token": {"content": "<mask>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true}}
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1 @@
+ {"unk_token": {"content": "<unk>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "__type": "AddedToken"}, "bos_token": {"content": "<s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "__type": "AddedToken"}, "eos_token": {"content": "</s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "__type": "AddedToken"}, "add_prefix_space": false, "errors": "replace", "sep_token": {"content": "</s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "__type": "AddedToken"}, "cls_token": {"content": "<s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "__type": "AddedToken"}, "pad_token": {"content": "<pad>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "__type": "AddedToken"}, "mask_token": {"content": "<mask>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true, "__type": "AddedToken"}, "trim_offsets": true, "max_len": 512, "special_tokens_map_file": null, "name_or_path": "PlanTL-GOB-ES/roberta-base-bne", "tokenizer_class": "RobertaTokenizer"}
training_args.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:de5181bca1f3a92e7db43208603a46f084990047a8236f7ddce2a6fa818dd85b
+ size 3055
vocab.json ADDED
The diff for this file is too large to render. See raw diff