---
tags:
- software engineering
- ner
- named-entity recognition
- token-classification
widget:
- text: >-
    In the field of computer graphics, a graphics processing unit (GPU) utilizes algorithms such as ray tracing, a rendering technique, to create realistic lighting effects in applications like Adobe Acrobat and Microsoft Excel.
  example_title: example 1
- text: >-
    By utilizing the TensorFlow and FastAPI libraries with Python, we are optimizing neural network training on devices like the Samsung Gear S2 and Intel T5300 processor.
  example_title: example 2
language:
- en
datasets:
- wikiser
license: apache-2.0
---

# Software Entity Recognition with Noise-robust Learning

We train a BERT model for the task of software entity recognition (SER). The training data comes from WikiSER, a corpus of 1.7M sentences extracted from Wikipedia. The model uses _self-regularization_ during finetuning, which makes it robust to noise in software-domain text, such as misannotations and differing naming conventions.

The model recognizes 12 fine-grained named entities: `Algorithm`, `Application`, `Architecture`, `Data_Structure`, `Device`, `Error_Name`, `General_Concept`, `Language`, `Library`, `License`, `Operating_System`, and `Protocol`.


| Type             | Examples                                   |
|------------------|--------------------------------------------|
| Algorithm        | Auction algorithm, Collaborative filtering |
| Application      | Adobe Acrobat, Microsoft Excel             |
| Architecture     | Graphics processing unit, Wishbone         |
| Data_Structure   | Array, Hash table, mXOR linked list        |
| Device           | Samsung Gear S2, iPad, Intel T5300         |
| Error_Name       | Buffer overflow, Memory leak               |
| General_Concept  | Memory management, Nouvelle AI             |
| Language         | C++, Java, Python, Rust                    |
| Library          | Beautiful Soup, FastAPI                    |
| License          | Cryptix General License, MIT License       |
| Operating_System | Linux, Ubuntu, Red Hat OS, MorphOS         |
| Protocol         | TLS, FTPS, HTTP 404                        |

## Model details

Paper: https://arxiv.org/abs/2308.10564

Code: https://github.com/taidnguyen/software_entity_recognition

Finetuned from model: `bert-base-cased`

Checkpoint for large version: https://huggingface.co/taidng/wikiser-bert-large

## How to use

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load the finetuned tokenizer and token-classification model from the Hub.
tokenizer = AutoTokenizer.from_pretrained("taidng/wikiser-bert-base")
model = AutoModelForTokenClassification.from_pretrained("taidng/wikiser-bert-base")

# Wrap both in a NER pipeline that returns one prediction per token.
nlp = pipeline("ner", model=model, tokenizer=tokenizer)

example = "Windows XP was originally bundled with Internet Explorer 6."
ner_results = nlp(example)
print(ner_results)
```

## Citation

```bibtex
@inproceedings{nguyen2023software,
  title={Software Entity Recognition with Noise-Robust Learning},
  author={Nguyen, Tai and Di, Yifeng and Lee, Joohan and Chen, Muhao and Zhang, Tianyi},
  booktitle={Proceedings of the 38th IEEE/ACM International Conference on Automated Software Engineering (ASE'23)},
  year={2023},
  organization={IEEE/ACM}
}
```
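
## Grouping token predictions into entities

The `ner` pipeline in the usage snippet returns one dictionary per token (with `entity`, `score`, `word`, `start`, and `end` keys), so a multi-word name such as "Internet Explorer 6" arrives as several pieces. The sketch below shows one possible way to merge those pieces into entity spans by character offset. It assumes the checkpoint emits `B-`/`I-` prefixed labels, which this card does not state. `group_entities` is a hypothetical helper, and the sample predictions are hand-written for illustration, not real model output.

```python
from typing import Dict, List


def group_entities(tokens: List[Dict], text: str) -> List[Dict]:
    """Merge per-token predictions into entity spans via character offsets."""
    spans: List[Dict] = []
    for tok in sorted(tokens, key=lambda t: t["start"]):
        label = tok["entity"]
        if label == "O":
            continue
        # Assumption: labels follow a "B-Type" / "I-Type" scheme.
        prefix, _, ent_type = label.partition("-")
        if not ent_type:  # bare label without a B-/I- prefix
            prefix, ent_type = "B", label
        last = spans[-1] if spans else None
        same_type = last is not None and last["type"] == ent_type
        # Word pieces of one word touch each other (start == previous end).
        is_word_piece = same_type and tok["start"] == last["end"]
        # An I- token may follow after at most one separating character.
        is_inside = same_type and prefix == "I" and 0 <= tok["start"] - last["end"] <= 1
        if is_word_piece or is_inside:
            last["end"] = tok["end"]
            last["scores"].append(tok["score"])
        else:
            spans.append({"type": ent_type, "start": tok["start"],
                          "end": tok["end"], "scores": [tok["score"]]})
    return [{"type": s["type"], "text": text[s["start"]:s["end"]],
             "start": s["start"], "end": s["end"],
             "score": sum(s["scores"]) / len(s["scores"])} for s in spans]


text = "Windows XP was originally bundled with Internet Explorer 6."
# Hand-written sample in the pipeline's output format (NOT real model output).
sample = [
    {"entity": "B-Operating_System", "score": 0.99, "word": "Windows", "start": 0, "end": 7},
    {"entity": "I-Operating_System", "score": 0.98, "word": "X", "start": 8, "end": 9},
    {"entity": "I-Operating_System", "score": 0.97, "word": "##P", "start": 9, "end": 10},
    {"entity": "B-Application", "score": 0.99, "word": "Internet", "start": 39, "end": 47},
    {"entity": "I-Application", "score": 0.98, "word": "Explorer", "start": 48, "end": 56},
    {"entity": "I-Application", "score": 0.96, "word": "6", "start": 57, "end": 58},
]

entities = group_entities(sample, text)
print([(e["type"], e["text"]) for e in entities])
# → [('Operating_System', 'Windows XP'), ('Application', 'Internet Explorer 6')]
```

In practice, passing `aggregation_strategy="simple"` to `pipeline(...)` asks `transformers` to do a similar grouping itself. The manual version is mainly useful when custom merge rules are needed.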