---
tags:
- software engineering
- ner
- named-entity recognition
- token-classification
widget:
- text: >-
        In the field of computer graphics, a graphics processing unit (GPU) utilizes algorithms such as ray tracing, a rendering technique, to create realistic lighting effects in applications like Adobe Acrobat and Microsoft Excel.
  example_title: example 1
- text: >-
        By utilizing the TensorFlow and FastAPI libraries with Python, we are optimizing neural network training on devices like the Samsung Gear S2 and Intel T5300 processor.
  example_title: example 2
language:
- en
datasets:
- wikiser
license: apache-2.0
---
# Software Entity Recognition with Noise-robust Learning

We train a BERT model for the task of software entity recognition (SER).
The training data comes from WikiSER, a corpus of 1.7M sentences extracted from Wikipedia.
The model uses _self-regularization_ during fine-tuning, which makes it robust to noise in software-domain text, such as misannotations and inconsistent naming conventions.

The model recognizes 12 fine-grained named entities: `Algorithm`, `Application`, `Architecture`, `Data_Structure`, `Device`, `Error_Name`, `General_Concept`, `Language`,
`Library`, `License`, `Operating_System`, and `Protocol`.

| Type             | Examples                                              |
|------------------|-------------------------------------------------------|
| Algorithm        | Auction algorithm, Collaborative filtering            |
| Application      | Adobe Acrobat, Microsoft Excel                       |
| Architecture     | Graphics processing unit, Wishbone                   |
| Data_Structure   | Array, Hash table, XOR linked list                    |
| Device           | Samsung Gear S2, iPad, Intel T5300                    |
| Error_Name       | Buffer overflow, Memory leak                         |
| General_Concept  | Memory management, Nouvelle AI                       |
| Language         | C++, Java, Python, Rust                               |
| Library          | Beautiful Soup, FastAPI                               |
| License          | Cryptix General License, MIT License                  |
| Operating_System | Linux, Ubuntu, Red Hat OS, MorphOS                   |
| Protocol         | TLS, FTPS, HTTP 404                                   |

## Model details

Paper: https://arxiv.org/abs/2308.10564

Code: https://github.com/taidnguyen/software_entity_recognition

Finetuned from model: `bert-base-cased` 

Checkpoint for large version: https://huggingface.co/taidng/wikiser-bert-large

## How to use

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

# Load the tokenizer and the fine-tuned token-classification checkpoint
tokenizer = AutoTokenizer.from_pretrained("taidng/wikiser-bert-base")
model = AutoModelForTokenClassification.from_pretrained("taidng/wikiser-bert-base")

# Build an NER pipeline and tag an example sentence
nlp = pipeline("ner", model=model, tokenizer=tokenizer)
example = "Windows XP was originally bundled with Internet Explorer 6."

ner_results = nlp(example)
print(ner_results)
```
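
By default, the `ner` pipeline returns one prediction per word piece. Passing `aggregation_strategy="simple"` to `pipeline` should instead merge pieces into spans, each a dict with `entity_group`, `score`, `word`, `start`, and `end` keys. The sketch below shows one way to group such spans by entity type. The helper `group_by_type`, the `min_score` threshold, and the sample spans are illustrative assumptions: the spans are hand-written in the aggregated output format and are not real model output.

```python
from collections import defaultdict


def group_by_type(spans, min_score=0.5):
    """Collect entity surface forms per entity type, skipping low-confidence spans.

    `spans` is assumed to follow the aggregated pipeline output format:
    a list of dicts with `entity_group`, `score`, and `word` keys.
    """
    grouped = defaultdict(list)
    for span in spans:
        if span["score"] >= min_score:
            grouped[span["entity_group"]].append(span["word"])
    return dict(grouped)


# Hypothetical, hand-written spans for the example sentence above
# (scores are made up for illustration, not produced by the model).
sample_spans = [
    {"entity_group": "Operating_System", "score": 0.98, "word": "Windows XP", "start": 0, "end": 10},
    {"entity_group": "Application", "score": 0.97, "word": "Internet Explorer 6", "start": 39, "end": 58},
    {"entity_group": "Application", "score": 0.30, "word": "bundled", "start": 26, "end": 33},
]

print(group_by_type(sample_spans))
```

With real predictions, `sample_spans` would be replaced by the list returned from the aggregated pipeline; the threshold is a judgment call and may need tuning per use case.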

## Citation

```bibtex
@inproceedings{nguyen2023software,
  title={Software Entity Recognition with Noise-Robust Learning},
  author={Nguyen, Tai and Di, Yifeng and Lee, Joohan and Chen, Muhao and Zhang, Tianyi},
  booktitle={Proceedings of the 38th IEEE/ACM International Conference on Automated Software Engineering (ASE'23)},
  year={2023},
  organization={IEEE/ACM}
}
```