---
license: apache-2.0
tags:
- flair
- token-classification
- sequence-tagger-model
language: es
datasets:
- conll2003
- BSC-LT/NextProcurement-NER-Spanish-UTE-Company-annotated
widget:
- text: "PODACESA OBRAS Y SERVICIOS, S.A, y ECR INFRAESTRUCTURAS Y SERVICIOS HIDRÁULICOS S.L., constituidos en UTE PODACESA-ECR realizan la siguiente oferta:"
- text: "PODACESA OBRAS Y SERVICIOS, S.A realiza la siguiente oferta:"
---

# Recognition of UTEs and company mentions in Flair

This model was trained with [Flair](https://github.com/flairNLP/flair/) to recognise mentions of UTEs (Unión Temporal de Empresas, temporary joint ventures of companies) and companies in public tenders. It is a fine-tune of the flair/ner-spanish-large model (retrained from scratch to include additional tags) and is based on document-level XLM-R embeddings and [FLERT](https://arxiv.org/pdf/2011.06993v1.pdf/).

## Demo: How to use in Flair

Requires: **[Flair](https://github.com/flairNLP/flair/)** (`pip install flair`)

```python
from flair.data import Sentence
from flair.models import SequenceTagger

# load tagger
tagger = SequenceTagger.load("BSC-LT/NextProcurement-NER-Spanish-UTE-Company")

# make example sentence
sentence = Sentence("PODACESA OBRAS Y SERVICIOS, S.A, y ECR INFRAESTRUCTURAS Y SERVICIOS HIDRÁULICOS S.L., constituidos en UTE PODACESA-ECR realizan la siguiente oferta:")

# predict NER tags
tagger.predict(sentence)

# print sentence
print(sentence)

# print predicted NER spans
print('The following NER tags are found:')

# iterate over entities and print
for entity in sentence.get_spans('ner'):
    print(entity)
```

This yields the following output:

```
Sentence[24]: "PODACESA OBRAS Y SERVICIOS, S.A, y ECR INFRAESTRUCTURAS Y SERVICIOS HIDRAULICOS S.L., constituidos en UTE PODACESA-ECR realizan la siguiente oferta:" → ["PODACESA OBRAS Y SERVICIOS, S.A, y ECR INFRAESTRUCTURAS Y SERVICIOS HIDRAULICOS S.L."/UTE, "PODACESA-ECR"/UTE]
The following NER tags are found:
Span[0:14]: "PODACESA OBRAS Y SERVICIOS, S.A, y ECR INFRAESTRUCTURAS Y SERVICIOS HIDRAULICOS S.L." → UTE (0.995)
Span[18:19]: "PODACESA-ECR" → UTE (0.9955)
```

With the sentence "PODACESA OBRAS Y SERVICIOS, S.A realiza la siguiente oferta:", the output is:

```
Sentence[11]: "PODACESA OBRAS Y SERVICIOS, S.A realiza la siguiente oferta:" → ["PODACESA OBRAS Y SERVICIOS, S.A"/SINGLE_COMPANY]
The following NER tags are found:
Span[0:6]: "PODACESA OBRAS Y SERVICIOS, S.A" → SINGLE_COMPANY (1.0)
```

## Training: Script to train this model

The following Flair script was used to train this model (**TODO: update**):

```python
import torch

# 1. get the corpus
from flair.datasets import CONLL_03_SPANISH

corpus = CONLL_03_SPANISH()

# 2. what tag do we want to predict?
tag_type = 'ner'

# 3. make the tag dictionary from the corpus
# (recent Flair releases replace this call with
#  corpus.make_label_dictionary(label_type=tag_type))
tag_dictionary = corpus.make_tag_dictionary(tag_type=tag_type)

# 4. initialize fine-tuneable transformer embeddings WITH document context
from flair.embeddings import TransformerWordEmbeddings

embeddings = TransformerWordEmbeddings(
    model='xlm-roberta-large',
    layers="-1",
    subtoken_pooling="first",
    fine_tune=True,
    use_context=True,
)

# 5. initialize bare-bones sequence tagger (no CRF, no RNN, no reprojection)
from flair.models import SequenceTagger

tagger = SequenceTagger(
    hidden_size=256,
    embeddings=embeddings,
    tag_dictionary=tag_dictionary,
    tag_type='ner',
    use_crf=False,
    use_rnn=False,
    reproject_embeddings=False,
)

# 6. initialize trainer with AdamW optimizer
from flair.trainers import ModelTrainer

trainer = ModelTrainer(tagger, corpus, optimizer=torch.optim.AdamW)

# 7. run training with XLM parameters (20 epochs, small LR)
from torch.optim.lr_scheduler import OneCycleLR

trainer.train('resources/taggers/ner-spanish-large',
              learning_rate=5.0e-6,
              mini_batch_size=4,
              mini_batch_chunk_size=1,
              max_epochs=20,
              scheduler=OneCycleLR,
              embeddings_storage_mode='none',
              weight_decay=0.,
              )
```

## Evaluation Results

```
Results:
- F-score (micro) 0.7431
- F-score (macro) 0.7429
- Accuracy 0.5944

By class:
                precision    recall  f1-score   support

           UTE     0.7568    0.7887    0.7724        71
SINGLE_COMPANY     0.6538    0.7846    0.7133        65

     micro avg     0.7039    0.7868    0.7431       136
     macro avg     0.7053    0.7867    0.7429       136
  weighted avg     0.7076    0.7868    0.7442       136
```

## Additional information

### Author
The Language Technologies Unit from Barcelona Supercomputing Center.

### Contact
For further information, please send an email to .

### Copyright
Copyright (c) 2023 by Language Technologies Unit, Barcelona Supercomputing Center.

### License
[Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)

### Funding
This work has been promoted and financed by the European Commission Health and Digital Executive Agency, Connecting Europe Facility, Grant Agreement Nº INEA/CEF/ICT/A2020/2373713, Action Title Open Harmonized and Enriched Procurement Data Platform (nextProcurement), Action number 2020-ES-IA-0255.

### Disclaimer
The model published in this repository is intended for a generalist purpose and is available to third parties under a permissive Apache License, Version 2.0. Be aware that the model may have biases and/or any other undesirable distortions.

When third parties deploy or provide systems and/or services to other parties using this model (or any system based on it) or become users of the model, they should note that it is their responsibility to mitigate the risks arising from its use and, in any event, to comply with applicable regulations, including regulations regarding the use of Artificial Intelligence.

In no event shall the owner and creator of the model (Barcelona Supercomputing Center) be liable for any results arising from the use made by third parties.
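
## Example: Checking the reported F-scores

The per-class figures in the Evaluation Results table can be cross-checked with a few lines of plain Python. The sketch below recomputes each F1 as the harmonic mean of precision and recall, then derives the macro and support-weighted averages. The helper `f1` and the dictionary `per_class` are illustrative names, not part of Flair, and the inputs are the already-rounded values copied from the table, so the results should only be expected to agree with the report to about three decimal places.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 0.0 if total == 0 else 2 * precision * recall / total


# (precision, recall, support) per class, copied from the evaluation table.
per_class = {
    "UTE": (0.7568, 0.7887, 71),
    "SINGLE_COMPANY": (0.6538, 0.7846, 65),
}

# Per-class F1 from the rounded precision and recall values.
class_f1 = {label: f1(p, r) for label, (p, r, _) in per_class.items()}

# Macro average: unweighted mean of the per-class F1 scores.
macro_f1 = sum(class_f1.values()) / len(class_f1)

# Weighted average: per-class F1 weighted by the number of gold spans.
total_support = sum(s for _, _, s in per_class.values())
weighted_f1 = sum(class_f1[label] * s for label, (_, _, s) in per_class.items()) / total_support

# Micro F1 from the micro-averaged precision and recall in the table.
micro_f1 = f1(0.7039, 0.7868)

for label, score in class_f1.items():
    print(f"{label:>14}  f1={score:.4f}")
print(f"{'micro avg':>14}  f1={micro_f1:.4f}")
print(f"{'macro avg':>14}  f1={macro_f1:.4f}")
print(f"{'weighted avg':>14}  f1={weighted_f1:.4f}")
```

Small differences in the last digit are expected, because the report computes its averages from unrounded counts while this sketch starts from four-decimal values.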