cjber/reddit-ner-place_names

@@ -21,8 +21,6 @@ This model was created with two key goals:
 1. Improved NER results on social media
 2. Target only place names
-_**NOTE:** There is a small bug with sub-words having incorrect BILUO tags. The following processing accounts for this._
 ## Use in `transformers`
 ```python
@@ -32,93 +30,23 @@ generator = pipeline(
     task="ner",
     model="cjber/reddit-ner-place_names",
     tokenizer="cjber/reddit-ner-place_names",
 )
 out = generator("I live north of liverpool in Waterloo")
-entities = [item["word"] for item in out]
-labels = [item["entity"] for item in out]
-```
-Label idx values are required for the following stages:
-```python
-class Label:
-    labels: dict[str, int] = {
-        "O": 0,
-        "B-location": 1,
-        "I-location": 2,
-        "L-location": 3,
-        "U-location": 4,
-    }
-    idx: dict[int, str] = {v: k for k, v in labels.items()}
-    count: int = len(labels)
 ```
-Combine subwords:
 ```python
-def combine_subwords(tokens: list[str], tags: list[int]) -> tuple[list[str], list[str]]:
-    idx = [
-        idx for idx, token in enumerate(tokens) if token not in ["<s>", "<pad>", "</s>"]
-    ]
-    tokens = [tokens[i] for i in idx]
-    tags = [tags[i] for i in idx]
-    for idx, _ in enumerate(tokens):
-        idx += 1
-        if not tokens[-idx + 1].startswith("Ġ"):
-            tokens[-idx] = tokens[-idx] + tokens[-idx + 1]
-    subwords = [i for i, _ in enumerate(tokens) if tokens[i].startswith("Ġ")]
-    tags = [tags[i] for i in subwords]
-    tokens = [tokens[i][1:] for i in subwords]
-    tags_str: list[str] = [Label.idx[i] for i in tags]
-    return tokens, tags_str
-names, labels = combine_subwords(entities, [Label.labels[lb] for lb in labels])
-```
-Combine BILUO tags:
-```python
-def combine_biluo(tokens: list[str], tags: list[str]) -> tuple[list[str], list[str]]:
-    tokens_biluo = tokens.copy()
-    tags_biluo = tags.copy()
-    for idx, tag in enumerate(tags_biluo):
-        if idx + 1 < len(tags_biluo) and tag[0] == "B":
-            i = 1
-            while tags_biluo[idx + i][0] not in ["B", "O", "U"]:
-                tokens_biluo[idx] = f"{tokens_biluo[idx]} {tokens_biluo[idx + i]}"
-                i += 1
-                if idx + i == len(tokens_biluo):
-                    break
-    zipped = [
-        (token, tag)
-        for (token, tag) in zip(tokens_biluo, tags_biluo)
-        if tag[0] not in ["I", "L"]
-    ]
-    if list(zipped):
-        tokens_biluo, tags_biluo = zip(*zipped)
-        tags_biluo = [tag[2:] if tag != "O" else tag for tag in tags_biluo]
-        return list(tokens_biluo), tags_biluo
-    else:
-        return [], []
-names, labels = combine_biluo(names, labels)
-```
-This gives:
-```python
->>> names
-['liverpool', 'Waterloo']
->>> labels
-['location', 'location']
 ```

 1. Improved NER results on social media
 2. Target only place names
 ## Use in `transformers`
 ```python
     task="ner",
     model="cjber/reddit-ner-place_names",
     tokenizer="cjber/reddit-ner-place_names",
+    aggregation_strategy="simple",
 )
 out = generator("I live north of liverpool in Waterloo")
 ```
+`out` gives:
 ```python
+[{'entity_group': 'location',
+  'score': 0.94054973,
+  'word': ' liverpool',
+  'start': 16,
+  'end': 25},
+ {'entity_group': 'location',
+  'score': 0.99520856,
+  'word': ' Waterloo',
+  'start': 29,
+  'end': 37}]
 ```
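
With `aggregation_strategy="simple"` the pipeline already merges sub-words and BILUO tags, so the removed helper functions are no longer needed. The sketch below shows one way to get plain `names` and `labels` lists back from the aggregated result. It is only an illustration: it hard-codes the output shown above instead of calling the model, and the variable names `text`, `names`, `spans`, and `labels` are assumptions, not part of the model card.

```python
# Input sentence and aggregated pipeline output, copied from the example above.
text = "I live north of liverpool in Waterloo"
out = [
    {"entity_group": "location", "score": 0.94054973,
     "word": " liverpool", "start": 16, "end": 25},
    {"entity_group": "location", "score": 0.99520856,
     "word": " Waterloo", "start": 29, "end": 37},
]

# Option 1: strip the leading space that the tokenizer leaves on each word.
names = [item["word"].strip() for item in out]

# Option 2: slice the original text with the character offsets instead.
spans = [text[item["start"]:item["end"]] for item in out]

# One label per aggregated entity.
labels = [item["entity_group"] for item in out]

print(names)
print(labels)
```

Slicing by `start` and `end` keeps the original casing and spacing of the input, which may be preferable when the tokenizer's `word` field differs slightly from the source text.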