{ "cells": [ { "cell_type": "markdown", "id": "1c71aba7-c0f3-4378-9b63-55529e0994b4", "metadata": {}, "source": [ "# Data\n", "\n", "We use the following dataset for fine-tuning:\n", "\n", "- the [dataset](https://zenodo.org/record/7695390) from a [recent study](https://www.biorxiv.org/content/10.1101/2023.04.10.536208v1) with titles and labels of PubMed articles. \n", "\n", "It contains 20 million articles, but only titles are provided (no abstracts; those can be [downloaded](https://www.nlm.nih.gov/databases/download/pubmed_medline.html) separately by article PMID). Fine-tuning a model on this much data takes considerable time and compute (approximate costs are [given in the paper](https://www.biorxiv.org/content/10.1101/2023.04.10.536208v1)), so below we use a simplified dataset and train on article titles only." ] }, { "cell_type": "markdown", "id": "e9874f4a-3898-4c89-a0f7-04eeabf2b389", "metadata": { "tags": [] }, "source": [ "# Models\n", "\n", "As the base model we use a BERT trained on biomedical data (from PubMed). 
\n", "\n", "- [BiomedNLP-PubMedBERT](https://huggingface.co/microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract)" ] }, { "cell_type": "markdown", "id": "991e48e7-897f-45a3-8a0b-539ea67b4eb5", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "id": "2f130f05-21ee-46f9-889f-488e8c676aba", "metadata": {}, "source": [ "# Imports" ] }, { "cell_type": "code", "execution_count": 1, "id": "757a0582-1b8c-4f1c-b26f-544688e391f4", "metadata": { "tags": [] }, "outputs": [], "source": [ "import torch\n", "import transformers\n", "import numpy as np\n", "import pandas as pd\n", "from tqdm import tqdm\n", "\n", "from datasets import Dataset, ClassLabel\n", "from transformers import AutoTokenizer, AutoModelForMaskedLM, AutoModelForSequenceClassification\n", "from transformers import TrainingArguments, Trainer\n", "from transformers import pipeline\n", "import evaluate" ] }, { "cell_type": "markdown", "id": "daa2aa21-de67-44a9-a0ff-1a913e425ccc", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "id": "03847b87-d096-49a5-b6e2-023fa08b94c2", "metadata": {}, "source": [ "# Load data" ] }, { "cell_type": "markdown", "id": "b3e902ea-4e0f-4d76-b27b-59e472b2b556", "metadata": {}, "source": [ "Load the fine-tuning data; specifically, we need the article titles and labels (this dataset contains no abstracts)." 
] }, { "cell_type": "code", "execution_count": 2, "id": "1be8f69e-bd7d-4ca9-ba9f-044b8e7bc497", "metadata": { "tags": [] }, "outputs": [], "source": [ "df = pd.read_csv(\"pubmed_landscape_data.csv\")" ] }, { "cell_type": "code", "execution_count": 62, "id": "ae78e0e8-a600-4607-8c1e-82ecdae17e2d", "metadata": { "tags": [] }, "outputs": [], "source": [ "df = df[df.Labels != \"unlabeled\"]\n", "df = df[~df.Title.isnull()]" ] }, { "cell_type": "code", "execution_count": 63, "id": "7715556f-8709-40cf-aa8c-3fecbfa3c1f4", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(7123406, 10)\n" ] }, { "data": { "text/html": [ "
|  | Title | Journal | PMID | Year | x | y | Labels | Colors | text | label |
|---|---|---|---|---|---|---|---|---|---|---|
| 18 | Determination of some in vitro growth requirem... | Journal of general microbiology | 1133574 | 1975.0 | -140.830 | 26.596 | microbiology | #B79762 | Determination of some in vitro growth requirem... | microbiology |
| 19 | Degradation of agar by a gram-negative bacterium. | Journal of general microbiology | 1133575 | 1975.0 | -72.913 | -4.436 | microbiology | #B79762 | Degradation of agar by a gram-negative bacterium. | microbiology |
| 20 | Choroid plexus isografts in rats. | Journal of neuropathology and experimental neu... | 1133586 | 1975.0 | -46.561 | 96.421 | neurology | #009271 | Choroid plexus isografts in rats. | neurology |
| 29 | Preliminary report on a mass screening program... | The Journal of pediatrics | 1133648 | 1975.0 | 45.033 | 39.256 | pediatric | #004D43 | Preliminary report on a mass screening program... | pediatric |
| 30 | Hepatic changes in young infants with cystic f... | The Journal of pediatrics | 1133649 | 1975.0 | 118.380 | 61.870 | pediatric | #004D43 | Hepatic changes in young infants with cystic f... | pediatric |