Token Classification
Flair
PyTorch
Spanish
sequence-tagger-model

Recognition of UTEs and company mentions in Flair

This is a model trained using Flair to recognise mentions of UTEs (Unión Temporal de Empresas) and companies in public tenders.

It is a fine-tuned version of the flair/ner-spanish-large model (retrained from scratch to include additional tags).

The model is based on document-level XLM-R embeddings and FLERT.
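To make "document-level" concrete, here is a minimal sketch of the idea behind FLERT-style context: each sentence is embedded together with some tokens from its neighbouring sentences, and only the original tokens are tagged. The function name `add_document_context` and the `context_size` parameter are hypothetical, chosen for illustration. This is not Flair's implementation, which handles context internally when `use_context=True` is set.

```python
from typing import List, Tuple


def add_document_context(
    sentences: List[List[str]],
    index: int,
    context_size: int = 64,  # illustrative value, an assumption
) -> Tuple[List[str], int, int]:
    """Pad the sentence at `index` with neighbouring tokens.

    Returns (tokens, start, end), where tokens[start:end] is the
    original sentence and the rest is left/right context.
    """
    # Flatten all tokens that come before and after the target sentence.
    left = [tok for sent in sentences[:index] for tok in sent]
    right = [tok for sent in sentences[index + 1:] for tok in sent]

    # Keep only the closest `context_size` tokens on each side.
    left_context = left[-context_size:] if context_size > 0 else []
    right_context = right[:context_size]

    target = sentences[index]
    tokens = left_context + target + right_context
    start = len(left_context)
    return tokens, start, start + len(target)


# A toy three-sentence document (made-up text, for illustration only).
document = [
    ["Se", "presenta", "oferta", "."],
    ["UTE", "PODACESA-ECR", "realiza", "la", "oferta", "."],
    ["Importe", "total", "."],
]
tokens, start, end = add_document_context(document, index=1, context_size=2)
```

The returned offsets let a tagger ignore predictions on the context tokens and keep only those for `tokens[start:end]`.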

Demo: How to use in Flair

Requires: Flair (pip install flair)

from flair.data import Sentence
from flair.models import SequenceTagger
# load tagger
tagger = SequenceTagger.load("BSC-LT/NextProcurement-NER-Spanish-UTE-Company")
# make example sentence
sentence = Sentence("PODACESA OBRAS Y SERVICIOS, S.A, y ECR INFRAESTRUCTURAS Y SERVICIOS HIDRÁULICOS S.L., constituidos en UTE PODACESA-ECR realizan la siguiente oferta:")
# predict NER tags
tagger.predict(sentence)
# print sentence
print(sentence)
# print predicted NER spans
print('The following NER tags are found:')
# iterate over entities and print
for entity in sentence.get_spans('ner'):
    print(entity)

This yields the following output:

Sentence[24]: "PODACESA OBRAS Y SERVICIOS, S.A, y ECR INFRAESTRUCTURAS Y SERVICIOS HIDRAULICOS S.L., constituidos en UTE PODACESA-ECR realizan la siguiente oferta:" → ["PODACESA OBRAS Y SERVICIOS, S.A, y ECR INFRAESTRUCTURAS Y SERVICIOS HIDRAULICOS S.L."/UTE, "PODACESA-ECR"/UTE]
The following NER tags are found:
Span[0:14]: "PODACESA OBRAS Y SERVICIOS, S.A, y ECR INFRAESTRUCTURAS Y SERVICIOS HIDRAULICOS S.L." → UTE (0.995)
Span[18:19]: "PODACESA-ECR" → UTE (0.9955)

With the sentence "PODACESA OBRAS Y SERVICIOS, S.A realiza la siguiente oferta:", the output is:

Sentence[11]: "PODACESA OBRAS Y SERVICIOS, S.A realiza la siguiente oferta:" → ["PODACESA OBRAS Y SERVICIOS, S.A"/SINGLE_COMPANY]
The following NER tags are found:
Span[0:6]: "PODACESA OBRAS Y SERVICIOS, S.A" → SINGLE_COMPANY (1.0)
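Downstream code usually needs the predictions as plain data instead of printed spans. The sketch below groups entities by label and optionally drops low-confidence ones. The helper `group_by_label` and the `min_score` threshold are hypothetical, not part of Flair. The entity values are copied from the demo output above, so the block runs without loading the model.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (surface text, label, confidence score)
Entity = Tuple[str, str, float]


def group_by_label(
    entities: Iterable[Entity], min_score: float = 0.0
) -> Dict[str, List[str]]:
    """Group entity texts by label, keeping those with score >= min_score."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for text, label, score in entities:
        if score >= min_score:
            grouped[label].append(text)
    return dict(grouped)


# With Flair, the tuples could be built from the predicted spans, e.g.:
#   entities = [
#       (s.text, s.get_label("ner").value, s.get_label("ner").score)
#       for s in sentence.get_spans("ner")
#   ]
# Here they are copied from the demo output instead.
entities = [
    ("PODACESA OBRAS Y SERVICIOS, S.A, y ECR INFRAESTRUCTURAS Y SERVICIOS HIDRAULICOS S.L.", "UTE", 0.995),
    ("PODACESA-ECR", "UTE", 0.9955),
    ("PODACESA OBRAS Y SERVICIOS, S.A", "SINGLE_COMPANY", 1.0),
]
grouped = group_by_label(entities, min_score=0.9)
```

A suitable `min_score` depends on the application; the 0.9 used here is only an example value.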

Training: Script to train this model

The following Flair script was used to train this model (TODO: update):

import torch
# 1. get the corpus
from flair.datasets import CONLL_03_SPANISH
corpus = CONLL_03_SPANISH()
# 2. what tag do we want to predict?
tag_type = 'ner'
# 3. make the tag dictionary from the corpus
tag_dictionary = corpus.make_tag_dictionary(tag_type=tag_type)
# 4. initialize fine-tuneable transformer embeddings WITH document context
from flair.embeddings import TransformerWordEmbeddings
embeddings = TransformerWordEmbeddings(
    model='xlm-roberta-large',
    layers="-1",
    subtoken_pooling="first",
    fine_tune=True,
    use_context=True,
)
# 5. initialize bare-bones sequence tagger (no CRF, no RNN, no reprojection)
from flair.models import SequenceTagger
tagger = SequenceTagger(
    hidden_size=256,
    embeddings=embeddings,
    tag_dictionary=tag_dictionary,
    tag_type='ner',
    use_crf=False,
    use_rnn=False,
    reproject_embeddings=False,
)
# 6. initialize trainer with AdamW optimizer
from flair.trainers import ModelTrainer
trainer = ModelTrainer(tagger, corpus, optimizer=torch.optim.AdamW)
# 7. run training with XLM parameters (20 epochs, small LR)
from torch.optim.lr_scheduler import OneCycleLR
trainer.train('resources/taggers/ner-spanish-large',
              learning_rate=5.0e-6,
              mini_batch_size=4,
              mini_batch_chunk_size=1,
              max_epochs=20,
              scheduler=OneCycleLR,
              embeddings_storage_mode='none',
              weight_decay=0.,
              )
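The script above reads a token-level corpus and derives its tag dictionary from it. As an illustration of what such token-level labels might look like for the UTE and SINGLE_COMPANY tags, the sketch below converts span annotations into BIO tags. The BIO scheme and the helper `spans_to_bio` are assumptions: the source does not state how the training data was annotated. The tokens and span offsets are taken from the demo output.

```python
from typing import List, Tuple


def spans_to_bio(num_tokens: int, spans: List[Tuple[int, int, str]]) -> List[str]:
    """Convert (start, end, label) token spans, end exclusive, into BIO tags."""
    tags = ["O"] * num_tokens
    for start, end, label in spans:
        if not 0 <= start < end <= num_tokens:
            raise ValueError(f"span ({start}, {end}) is out of range")
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the entity
    return tags


# Tokenisation and span offsets as shown in the demo output (24 tokens).
tokens = [
    "PODACESA", "OBRAS", "Y", "SERVICIOS", ",", "S.A", ",", "y", "ECR",
    "INFRAESTRUCTURAS", "Y", "SERVICIOS", "HIDRAULICOS", "S.L.", ",",
    "constituidos", "en", "UTE", "PODACESA-ECR", "realizan", "la",
    "siguiente", "oferta", ":",
]
spans = [(0, 14, "UTE"), (18, 19, "UTE")]
tags = spans_to_bio(len(tokens), spans)

# The set of distinct tags is what a tag dictionary would need to cover.
tag_set = sorted(set(tags))
```

Each `(token, tag)` pair would then correspond to one line of a column-format training file.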

Evaluation Results

Results:
- F-score (micro) 0.7431
- F-score (macro) 0.7429
- Accuracy 0.5944

By class:
                precision    recall  f1-score   support

           UTE     0.7568    0.7887    0.7724        71
SINGLE_COMPANY     0.6538    0.7846    0.7133        65

     micro avg     0.7039    0.7868    0.7431       136
     macro avg     0.7053    0.7867    0.7429       136
  weighted avg     0.7076    0.7868    0.7442       136
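The averages in the table can be reproduced from the per-class rows. The sketch below recomputes F1 as the harmonic mean of precision and recall, then derives the macro average (unweighted mean over classes) and the weighted average (mean weighted by support). The inputs are the rounded values from the table, so results may differ from the reported figures in the last decimal place.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, support) per class, copied from the table above.
report = {
    "UTE": (0.7568, 0.7887, 71),
    "SINGLE_COMPANY": (0.6538, 0.7846, 65),
}

per_class_f1 = {label: f1(p, r) for label, (p, r, _) in report.items()}

# Macro average: every class counts equally.
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)

# Weighted average: every class counts in proportion to its support.
total_support = sum(support for _, _, support in report.values())
weighted_f1 = sum(
    per_class_f1[label] * support for label, (_, _, support) in report.items()
) / total_support

# Micro average: F1 of the pooled precision and recall from the table.
micro_f1 = f1(0.7039, 0.7868)

for name, value in [("micro", micro_f1), ("macro", macro_f1), ("weighted", weighted_f1)]:
    print(f"F1 ({name}): {value:.4f}")
```

With only two classes of similar size, the three averages land close together, which matches the table.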

Additional information

Author

The Language Technologies Unit from Barcelona Supercomputing Center.

Contact

For further information, please send an email to langtech@bsc.es.

Copyright

Copyright (c) 2023 by the Language Technologies Unit, Barcelona Supercomputing Center.

License

Apache License, Version 2.0

Funding

This work has been promoted and financed by the European Commission Health and Digital Executive Agency, Connecting Europe Facility, Grant Agreement Nº INEA/CEF/ICT/A2020/2373713, Action Title Open Harmonized and Enriched Procurement Data Platform (nextProcurement), Action number 2020-ES-IA-0255.

Disclaimer


The model published in this repository is intended for general-purpose use and is available to third parties under the permissive Apache License, Version 2.0.

Be aware that the model may have biases and/or any other undesirable distortions.

When third parties deploy or provide systems and/or services to other parties using this model (or any system based on it) or become users of the model, they should note that it is their responsibility to mitigate the risks arising from its use and, in any event, to comply with applicable regulations, including regulations regarding the use of Artificial Intelligence.

In no event shall the owner and creator of the model (Barcelona Supercomputing Center) be liable for any results arising from the use made by third parties.
