{ "cells": [ { "cell_type": "raw", "metadata": {}, "source": [ "---\n", "title: 21 Named Entity Recognition using Transformer\n", "description: An implementation of Transformer to perform token classification and identify species in PubMed abstracts\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": { "id": "m8qFH7JQE4ht" }, "source": [ "" ] }, { "cell_type": "markdown", "metadata": { "id": "djp2XOqO88qw" }, "source": [ "# Named Entity Recognition (NER)\n", "\n", "NER is an information extraction technique that identifies and classifies named entities in text. These entities can be pre-defined and generic, such as location names, organizations, and times, or they can be domain-specific, such as the entities found in a resume.\n", "\n", "The goal of a named entity recognition (NER) system is to identify all textual mentions of the named entities. This can be broken down into two sub-tasks: identifying the boundaries of the named entity (NE) and identifying its type.\n", "\n", "Named entity recognition is well suited to a classifier-based approach. In particular, a tagger can be built that labels each word in a sentence using the IOB format, where chunks are labelled by their appropriate type.\n", "\n", "The IOB tagging system contains tags of the form:\n", "\n", "* B-{CHUNK_TYPE} – for the word at the Beginning of a chunk\n", "* I-{CHUNK_TYPE} – for words Inside a chunk\n", "* O – for words Outside any chunk\n", "\n", "## Approaches to NER\n", "* **Classical Approaches:** mostly rule-based.\n", "* **Machine Learning Approaches:** there are two main methods in this category: \n", " * Treat the problem as multi-class classification, where the named entity types are the labels, so that standard classification algorithms can be applied. 
The problem here is that identifying and labeling named entities requires a thorough understanding of the context of a sentence and the sequence of word labels in it, which this method ignores.\n", " * Conditional Random Field (CRF) model. It is a probabilistic graphical model that can be used to model sequential data such as the labels of words in a sentence. The CRF model can capture the features of the current and previous labels in a sequence, but it cannot take the context of the following labels into account; this shortcoming, plus the extra feature engineering involved in training a CRF model, makes it less appealing for adoption in industry.\n", "* **Deep Learning Approaches:** bidirectional RNNs and Transformers." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "2KLIs8HHyrHd" }, "outputs": [], "source": [ "%%capture\n", "!pip install datasets" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "zy7PRVmH7ssP" }, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "from datasets import load_dataset\n", "\n", "plt.style.use('fivethirtyeight')" ] }, { "cell_type": "markdown", "metadata": { "id": "c-nHc7hB7wUF" }, "source": [ "# Getting the Dataset and EDA\n", "\n", "We will be working with the S800 corpus, a novel abstract-based, manually annotated corpus. 
S800 comprises 800 PubMed abstracts in which organism mentions were identified and mapped to the corresponding NCBI Taxonomy identifiers.\n", "\n", "It is available on the Hugging Face Datasets Hub." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "oiFtTMyry0rK" }, "outputs": [], "source": [ "%%capture\n", "# Load the S800 corpus from the Hugging Face Datasets Hub\n", "dataset = load_dataset('species_800')\n", "\n", "# One row per token: explode the parallel 'tokens' and 'ner_tags' lists (multi-column explode needs pandas >= 1.3)\n", "train_df = pd.DataFrame(dataset['train']).explode(['tokens', 'ner_tags']).dropna()\n", "valid_df = pd.DataFrame(dataset['validation']).explode(['tokens', 'ner_tags']).dropna()\n", "test_df = pd.DataFrame(dataset['test']).explode(['tokens', 'ner_tags']).dropna()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 206 }, "id": "YwmEqJAxzTTj", "outputId": "0a2965db-a864-470c-e31d-c7a21aa344a0" }, "outputs": [ { "data": { "text/html": [ "\n", "
\n",
"|   | id | tokens | ner_tags |\n",
"|---|---|---|---|\n",
"| 0 | 0 | Methanoregula | 1 |\n",
"| 0 | 0 | formicica | 2 |\n",
"| 0 | 0 | sp | 0 |\n",
"| 0 | 0 | . | 0 |\n",
"| 0 | 0 | nov | 0 |\n",
"