---
language: en
datasets:
- wnut_17
license: mit
metrics:
- f1
widget:
- text: "My name is Sylvain and I live in Paris"
example_title: "Parisian"
- text: "My name is Sarah and I live in London"
example_title: "Londoner"
---
# Reddit NER for place names
Fine-tuned `twitter-roberta-base` for named entity recognition, trained on `wnut_17` plus 498 additional Reddit comments. This model is intended solely for extracting place names from social media text; all other entity types have therefore been removed.
This model was created with two key goals:
1. Improved NER results on social media
2. Target only place names
_**NOTE:** A small bug means sub-words may be given incorrect BILUO tags; the post-processing below accounts for this._
## Use in `transformers`
```python
from transformers import pipeline
# token-level NER pipeline backed by the fine-tuned model and its tokenizer
generator = pipeline(
    task="ner",
    model="cjber/reddit-ner-place_names",
    tokenizer="cjber/reddit-ner-place_names",
)

out = generator("I live north of liverpool in Waterloo")

# raw sub-word tokens and their predicted BILUO tags
entities = [item["word"] for item in out]
labels = [item["entity"] for item in out]
```
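Each entry in `out` is a per-token dict from the `transformers` NER pipeline; a sketch of its shape is below, with hypothetical score, index, and offset values rather than actual model output:
```python
# illustrative entry only: the keys come from the pipeline, the values are made up
[
    {"entity": "U-location", "score": 0.99, "index": 5,
     "word": "Ġliverpool", "start": 16, "end": 25},
    ...
]
```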
The label-index mapping is required for the post-processing stages that follow:
```python
class Label:
    """BILUO tag scheme for the single `location` entity type."""

    labels: dict[str, int] = {
        "O": 0,
        "B-location": 1,
        "I-location": 2,
        "L-location": 3,
        "U-location": 4,
    }
    idx: dict[int, str] = {v: k for k, v in labels.items()}  # index -> tag
    count: int = len(labels)
```
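As a quick sanity check, both values below follow directly from the mapping above:
```python
>>> Label.idx[4]
'U-location'
>>> Label.count
5
```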
Combine sub-words into whole words:
```python
def combine_subwords(tokens: list[str], tags: list[int]) -> tuple[list[str], list[str]]:
    # drop special tokens before merging
    keep = [
        i for i, token in enumerate(tokens) if token not in ["<s>", "<pad>", "</s>"]
    ]
    tokens = [tokens[i] for i in keep]
    tags = [tags[i] for i in keep]

    # walk backwards, appending each sub-word piece (no leading "Ġ")
    # onto the token before it
    for j in range(len(tokens) - 1, 0, -1):
        if not tokens[j].startswith("Ġ"):
            tokens[j - 1] = tokens[j - 1] + tokens[j]

    # keep only word-initial tokens, stripping the "Ġ" marker,
    # each paired with the tag of its first sub-word
    word_starts = [i for i, token in enumerate(tokens) if token.startswith("Ġ")]
    tags = [tags[i] for i in word_starts]
    tokens = [tokens[i][1:] for i in word_starts]

    tags_str: list[str] = [Label.idx[i] for i in tags]
    return tokens, tags_str
names, labels = combine_subwords(entities, [Label.labels[lb] for lb in labels])
```
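At this stage the sub-words are merged but the tags are still BILUO strings; for the example sentence this should look like the following (assuming the model tags each place name as a single-token `U-` entity):
```python
>>> names
['liverpool', 'Waterloo']
>>> labels
['U-location', 'U-location']
```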
Combine BILUO tags:
```python
def combine_biluo(tokens: list[str], tags: list[str]) -> tuple[list[str], list[str]]:
    tokens_biluo = tokens.copy()
    tags_biluo = tags.copy()

    # merge each B- token with the I-/L- tokens that follow it,
    # building a single multi-word place name
    for idx, tag in enumerate(tags_biluo):
        if idx + 1 < len(tags_biluo) and tag[0] == "B":
            i = 1
            while tags_biluo[idx + i][0] not in ["B", "O", "U"]:
                tokens_biluo[idx] = f"{tokens_biluo[idx]} {tokens_biluo[idx + i]}"
                i += 1
                if idx + i == len(tokens_biluo):
                    break

    # drop the I-/L- tokens that were merged into their B- token
    zipped = [
        (token, tag)
        for (token, tag) in zip(tokens_biluo, tags_biluo)
        if tag[0] not in ["I", "L"]
    ]
    if zipped:
        tokens_biluo, tags_biluo = zip(*zipped)
        # strip the "B-"/"U-" prefix, leaving the bare entity type
        tags_biluo = [tag[2:] if tag != "O" else tag for tag in tags_biluo]
        return list(tokens_biluo), tags_biluo
    return [], []
names, labels = combine_biluo(names, labels)
```
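For a multi-word place name, the BILUO merging works as follows (the tokens and tags here are invented for illustration, not model output):
```python
>>> combine_biluo(["New", "York"], ["B-location", "L-location"])
(['New York'], ['location'])
```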
This gives:
```python
>>> names
['liverpool', 'Waterloo']
>>> labels
['location', 'location']
``` |