{ "cells": [ { "cell_type": "markdown", "id": "a9fffce5-83e3-4838-8335-acb2e3b50c35", "metadata": {}, "source": [ "# 2.1 DNA分词器构建" ] }, { "cell_type": "markdown", "id": "f28b0950-37dc-4f78-ae6c-9fca33d513fc", "metadata": {}, "source": [ "## **分词算法**\n", "\n", "### **什么是分词**\n", "分词就是把一个文本序列,分成一个一个的token/词,对于英文这种天生带空格的语言,一般使用空格和标点分词就行了,而对于中文等语言,并没有特殊的符号来分词,因此,一般需要设计专门的分词算法,对于大模型而言,一般需要处理多种语言,因此,也需要专门的分词算法。\n", "\n", "在大模型(如 BERT、GPT 系列、T5 等)中,分词器(tokenizer)扮演着至关重要的角色。它负责将原始文本转换为模型可以处理的格式,即将文本分解成 token 序列,并将这些 token 映射到模型词汇表中的唯一 ID。分词器的选择和配置直接影响模型的性能和效果。以下是几种常见的分词器及其特点,特别关注它们在大型语言模型中的应用。\n", "\n", "### 1. **WordPiece 分词器**\n", "\n", "- **使用场景**:广泛应用于 BERT 及其变体。\n", "- **工作原理**:基于频率统计,从语料库中学习最有效的词汇表。它根据子词(subword)在文本中的出现频率来决定如何分割单词。例如,“playing” 可能被分为 “play” 和 “##ing”,其中“##”表示该部分是前一个 token 的延续。\n", "- **优点**:\n", " - 处理未知词汇能力强,能够将未见过的词汇分解为已知的子词。\n", " - 兼容性好,适合多种语言任务。\n", "- **缺点**:\n", " - 需要额外的标记(如 `##`)来指示子词,可能影响某些应用场景下的可读性。\n", "\n", "### 2. **Byte Pair Encoding (BPE)**\n", "\n", "- **使用场景**:广泛应用于 GPT 系列、RoBERTa、XLM-R 等模型。\n", "- **工作原理**:通过迭代地合并最常见的字符对来构建词汇表。BPE 是一种无监督的学习方法,能够在不依赖于预先定义的词汇表的情况下进行分词。\n", "- **优点**:\n", " - 灵活性高,适应性强,尤其适用于多语言模型。\n", " - 不需要特殊标记,生成的词汇表更简洁。\n", "- **缺点**:\n", " - 对于某些语言或领域特定的词汇,可能会产生较短的子词,导致信息丢失。\n", "\n", "### 3. **SentencePiece**\n", "\n", "- **使用场景**:常见于 T5、mBART 等多语言模型。\n", "- **工作原理**:结合了 BPE 和 WordPiece 的优点,同时支持字符级和词汇级分词。它可以在没有空格的语言(如中文、日文)中表现良好。\n", "- **优点**:\n", " - 支持无空格语言,适合多语言处理。\n", " - 学习速度快,适应性强。\n", "- **缺点**:\n", " - 对于某些特定领域的专业术语,可能需要额外的预处理步骤。\n", "\n", "### 4. **Character-Level Tokenizer**\n", "\n", "- **使用场景**:较少用于大型语言模型,但在某些特定任务(如拼写检查、手写识别)中有应用。\n", "- **工作原理**:直接将每个字符视为一个 token。这种方式简单直接,但通常会导致较大的词汇表。\n", "- **优点**:\n", " - 简单易实现,不需要复杂的训练过程。\n", " - 对于字符级别的任务非常有效。\n", "- **缺点**:\n", " - 词汇表较大,计算资源消耗较多。\n", " - 捕捉上下文信息的能力较弱。\n", "\n", "### 5. **Unigram Language Model**\n", "\n", "- **使用场景**:主要用于 SentencePiece 中。\n", "- **工作原理**:基于概率分布,选择最优的分词方案以最大化似然函数。这种方法类似于 BPE,但在构建词汇表时考虑了更多的统计信息。\n", "- **优点**:\n", " - 统计基础强,优化效果好。\n", " - 适应性强,适用于多种语言和任务。\n", "- **缺点**:\n", " - 计算复杂度较高,训练时间较长。\n", "\n", "### 分词器的关键特性\n", "\n", "无论选择哪种分词器,以下几个关键特性都是设计和应用中需要考虑的:\n", "\n", "- **词汇表大小**:决定了模型所能识别的词汇量。较大的词汇表可以捕捉更多细节,但也增加了内存和计算需求。\n", "- **处理未知词汇的能力**:好的分词器应该能够有效地处理未登录词(OOV, Out-Of-Vocabulary),将其分解为已知的子词。\n", "- **多语言支持**:对于多语言模型,分词器应能处理不同语言的文本,尤其是那些没有明显分隔符的语言。\n", "- **效率和速度**:分词器的执行速度直接影响整个数据处理管道的效率,尤其是在大规模数据集上。\n", "- **兼容性和灵活性**:分词器应与目标模型架构兼容,并且能够灵活适应不同的任务需求。" ] }, { "cell_type": "markdown", "id": "165e2594-277d-44d0-b582-77859a0bc0b2", "metadata": {}, "source": [ "## DNA等生物序列分词\n", "在生物信息学中,DNA 和蛋白质序列的处理与自然语言处理(NLP)有相似之处,但也有其独特性。为了提取这些生物序列的特征并用于机器学习或深度学习模型,通常需要将长序列分解成更小的片段(类似于 NLP 中的“分词”),以便更好地捕捉局部和全局特征。以下是几种常见的方法,用于对 DNA 和蛋白质序列进行“分词”,以提取有用的特征。\n", "\n", "### 1. **K-mer 分解**\n", "\n", "**定义**:K-mer 是指长度为 k 的连续子序列。例如,在 DNA 序列中,一个 3-mer 可能是 \"ATG\" 或 \"CGA\"。\n", "\n", "**应用**:\n", "- **DNA 序列**:常用的 k 值范围从 3 到 6。较小的 k 值可以捕捉到更细粒度的信息,而较大的 k 值则有助于识别更长的模式。\n", "- **蛋白质序列**:k 值通常较大,因为氨基酸的数量较多(20 种),较长的 k-mer 可以捕捉到重要的结构域或功能区域。\n", "\n", "**优点**:\n", "- 简单且直观,易于实现。\n", "- 可以捕捉到短序列中的局部特征。\n", "\n", "**缺点**:\n", "- 对于非常长的序列,生成的 k-mer 数量会非常大,导致维度爆炸问题。\n", "- 不同位置的 k-mer 之间缺乏上下文关系。" ] }, { "cell_type": "code", "execution_count": 2, "id": "29c390ef-2e9d-493e-9991-69ecb835b52b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "DNA 3-mers: ['ATG', 'TGC', 'GCG', 'CGT', 'GTA', 'TAC', 'ACG', 'CGT', 'GTA']\n", "Protein 4-mers: ['MKQH', 'KQHK', 'QHKA', 'HKAM', 'KAMI', 'AMIV', 'MIVA', 'IVAL', 'VALI', 'ALIV', 'LIVL', 'IVLI', 'VLIT', 'LITA', 'ITAY']\n" ] } ], "source": [ "#示例代码(Python)\n", "\n", "def k_mer(seq, k):\n", " return [seq[i:i+k] for i in range(len(seq) - k + 1)]\n", "\n", "dna_sequence = \"ATGCGTACGTA\"\n", "protein_sequence = \"MKQHKAMIVALIVLITAY\"\n", "\n", "print(\"DNA 3-mers:\", k_mer(dna_sequence, 3))\n", "print(\"Protein 4-mers:\", k_mer(protein_sequence, 4))" ] }, { "cell_type": "markdown", "id": "7ced2bfb-bd42-425a-a3ad-54c9573609c5", "metadata": {}, "source": [ "### 2. **滑动窗口**\n", "\n", "**定义**:滑动窗口方法通过设定一个固定大小的窗口沿着序列移动,并在每个位置提取窗口内的子序列。这与 K-mer 类似,但允许重叠。\n", "\n", "**应用**:\n", "- **DNA 和蛋白质序列**:窗口大小可以根据具体任务调整,如基因预测、蛋白质结构预测等。\n", "\n", "**优点**:\n", "- 提供了更多的灵活性,可以控制窗口的步长和大小。\n", "- 有助于捕捉局部和全局特征。\n", "\n", "**缺点**:\n", "- 计算复杂度较高,尤其是当窗口大小较大时。" ] }, { "cell_type": "code", "execution_count": 4, "id": "82cecf91-0076-4c12-b11c-b35120581ef9", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Sliding window (DNA, size=3, step=1): ['ATG', 'TGC', 'GCG', 'CGT', 'GTA', 'TAC', 'ACG', 'CGT', 'GTA']\n", "Sliding window (Protein, size=4, step=2): ['MKQH', 'QHKA', 'KAMI', 'MIVA', 'VALI', 'LIVL', 'VLIT', 'ITAY']\n" ] } ], "source": [ "def sliding_window(seq, window_size, step=1):\n", " return [seq[i:i+window_size] for i in range(0, len(seq) - window_size + 1, step)]\n", "\n", "dna_sequence = \"ATGCGTACGTA\"\n", "protein_sequence = \"MKQHKAMIVALIVLITAY\"\n", "\n", "print(\"Sliding window (DNA, size=3, step=1):\", sliding_window(dna_sequence, 3))\n", "print(\"Sliding window (Protein, size=4, step=2):\", sliding_window(protein_sequence, 4, step=2))" ] }, { "cell_type": "markdown", "id": "c33ab920-b451-4846-93d4-20da5a4e1001", "metadata": {}, "source": [ "### 3. **词表分词和嵌入式表示**\n", "\n", "**定义**:使用预训练的嵌入模型(如 Word2Vec、BERT 等)来将每个 token 映射到高维向量空间中。对于生物序列,可以使用专门设计的嵌入模型,如 ProtTrans、ESM 等。\n", "\n", "**应用**:\n", "- **DNA 和蛋白质序列**:嵌入模型可以捕捉到序列中的语义信息和上下文依赖关系。\n", "\n", "**优点**:\n", "- 捕捉到丰富的语义信息,适合复杂的下游任务。\n", "- 可以利用大规模预训练模型的优势。\n", "\n", "**缺点**:\n", "- 需要大量的计算资源来进行预训练。\n", "- 模型复杂度较高,解释性较差。" ] }, { "cell_type": "code", "execution_count": 5, "id": "02bf2af0-6077-4b27-8822-f1c3f22914fa", "metadata": {}, "outputs": [], "source": [ "import subprocess\n", "import os\n", "# 设置环境变量, autodl一般区域\n", "result = subprocess.run('bash -c \"source /etc/network_turbo && env | grep proxy\"', shell=True, capture_output=True, text=True)\n", "output = result.stdout\n", "for line in output.splitlines():\n", " if '=' in line:\n", " var, value = line.split('=', 1)\n", " os.environ[var] = value\n", "\n", "\"\"\"\n", "import os\n", "\n", "# 设置环境变量, autodl专区 其他idc\n", "os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'\n", "\n", "# 打印环境变量以确认设置成功\n", "print(os.environ.get('HF_ENDPOINT'))\n", "\"\"\"" ] }, { "cell_type": "code", "execution_count": 15, "id": "d43b60ee-67f2-4d06-95ea-966c01084fc4", "metadata": { "scrolled": true }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n", "To disable this warning, you can either:\n", "\t- Avoid using `tokenizers` before the fork if possible\n", "\t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "['ATGCG', 'TACG', 'T', 'A']\n", "Embeddings shape: torch.Size([1, 4, 768])\n" ] } ], "source": [ "from transformers import AutoTokenizer, AutoModel\n", "import torch\n", "\n", "# 加载预训练的蛋白质嵌入模型\n", "tokenizer = AutoTokenizer.from_pretrained(\"dnagpt/gpt_dna_v0\")\n", "model = AutoModel.from_pretrained(\"dnagpt/gpt_dna_v0\")\n", "\n", "dna_sequence = \"ATGCGTACGTA\"\n", "print(tokenizer.tokenize(dna_sequence))\n", "\n", "# 编码序列\n", "inputs = tokenizer(dna_sequence, return_tensors=\"pt\")\n", "\n", "# 获取嵌入\n", "with torch.no_grad():\n", " outputs = model(**inputs)\n", " embeddings = outputs.last_hidden_state\n", "\n", "print(\"Embeddings shape:\", embeddings.shape)" ] }, { "cell_type": "markdown", "id": "c24f10dc-1117-4493-9333-5ed6d898f44a", "metadata": {}, "source": [ "### **训练DNA BPE分词器**\n", "\n", "以上方法展示了如何对 DNA 和蛋白质序列进行“分词”,以提取有用的特征。选择哪种方法取决于具体的任务需求和数据特性。对于简单的分类或回归任务,K-mer 分解或滑动窗口可能是足够的;而对于更复杂的任务,如序列标注或结构预测,基于词汇表的方法或嵌入表示可能会提供更好的性能。\n", "\n", "目前大部分生物序列大模型的论文中,使用最多的依然是传统的K-mer,但一些SOTA的论文则以BEP为主。而BEP分词也是目前GPT、llama等主流自然语言大模型使用的基础分词器。\n", "\n", "因此,我们也演示下从头训练一个DNA BPE分词器的方法。\n", "\n", "我们首先看下GPT2模型,默认的分词器,对DNA序列分词的结果:" ] }, { "cell_type": "code", "execution_count": 10, "id": "43f1eb8b-1cc2-4ab5-aa8e-2a63132be98c", "metadata": {}, "outputs": [], "source": [ "from tokenizers import (\n", " decoders,\n", " models,\n", " normalizers,\n", " pre_tokenizers,\n", " processors,\n", " trainers,\n", " Tokenizer,\n", ")\n", "from transformers import AutoTokenizer" ] }, { "cell_type": "code", "execution_count": 15, "id": "27e88f7b-1399-418b-9b91-f970762fac0c", "metadata": {}, "outputs": [], "source": [ "gpt2_tokenizer = AutoTokenizer.from_pretrained('gpt2')\n", "gpt2_tokenizer.pad_token = gpt2_tokenizer.eos_token" ] }, { "cell_type": "code", "execution_count": 16, "id": "4b015db7-63ba-4909-b02f-07634b3d5584", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['T', 'GG', 'C', 'GT', 'GA', 'AC', 'CC', 'GG', 'G', 'AT', 'C', 'GG', 'G']" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gpt2_tokenizer.tokenize(\"TGGCGTGAACCCGGGATCGGG\")" ] }, { "cell_type": "markdown", "id": "a246fbc9-9e29-4b63-bdf7-f80635d06d1e", "metadata": {}, "source": [ "可以看到,gpt2模型因为是以英文为主的BPE分词模型,分解的都是1到2个字母的结果,这样显然很难充分表达生物语义,因此,我们使用DNA序列来训练1个BPE分词器,代码也非常简单:" ] }, { "cell_type": "code", "execution_count": 2, "id": "8357a695-1c29-4b5c-8099-d2e337189410", "metadata": {}, "outputs": [], "source": [ "tokenizer = Tokenizer(models.BPE())\n", "tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False, use_regex=False) #use_regex=False,空格当成一般字符串\n", "trainer = trainers.BpeTrainer(vocab_size=30000, special_tokens=[\"<|endoftext|>\"]) #3w words" ] }, { "cell_type": "code", "execution_count": 3, "id": "32c95888-1498-45cf-8453-421219cc7d45", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "\n" ] } ], "source": [ "tokenizer.train([\"../01-data_env/data/dna_1g.txt\"], trainer=trainer) #all file list, take 10-20 min" ] }, { "cell_type": "code", "execution_count": 4, "id": "5ffdd717-72ed-4a37-bafc-b4a0f61f8ff1", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['TG', 'GCGTGAA', 'CCCGG', 'GATCGG', 'G']\n" ] } ], "source": [ "encoding = tokenizer.encode(\"TGGCGTGAACCCGGGATCGGG\")\n", "print(encoding.tokens)" ] }, { "cell_type": "markdown", "id": "a96e7838-6c23-4446-bf86-b098cd93214a", "metadata": {}, "source": [ "可以看到,以DNA数据训练的分词器,分词效果明显要好的多,各种长度的词都有。" ] }, { "cell_type": "code", "execution_count": 5, "id": "f1d757c1-702b-4147-9207-471f422f67b2", "metadata": {}, "outputs": [], "source": [ "tokenizer.save(\"dna_bpe_dict.json\")" ] }, { "cell_type": "code", "execution_count": 6, "id": "caf8ecea-359e-487b-b456-fab546b9da0d", "metadata": {}, "outputs": [], "source": [ "#然后我们可以使用from_file() 方法从该文件里重新加载 Tokenizer 对象:\n", "new_tokenizer = Tokenizer.from_file(\"dna_bpe_dict.json\")" ] }, { "cell_type": "code", "execution_count": 7, "id": "8ec6f045-bc30-4012-8027-a879df8def3a", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "('dna_bpe_dict/tokenizer_config.json',\n", " 'dna_bpe_dict/special_tokens_map.json',\n", " 'dna_bpe_dict/vocab.json',\n", " 'dna_bpe_dict/merges.txt',\n", " 'dna_bpe_dict/added_tokens.json',\n", " 'dna_bpe_dict/tokenizer.json')" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#要在 🤗 Transformers 中使用这个标记器,我们必须将它包裹在一个 PreTrainedTokenizerFast 类中\n", "from transformers import GPT2TokenizerFast\n", "dna_tokenizer = GPT2TokenizerFast(tokenizer_object=new_tokenizer)\n", "dna_tokenizer.save_pretrained(\"dna_bpe_dict\")\n", "#dna_tokenizer.push_to_hub(\"dna_bpe_dict_1g\", organization=\"dnagpt\", use_auth_token=\"hf_*****\") # push to huggingface" ] }, { "cell_type": "code", "execution_count": 11, "id": "f84506d8-6208-4027-aad7-2b68a1bc16d6", "metadata": {}, "outputs": [], "source": [ "tokenizer_new = AutoTokenizer.from_pretrained('dna_bpe_dict')" ] }, { "cell_type": "code", "execution_count": 12, "id": "d40d4d53-6fed-445c-afb5-c0346ab854c8", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['TG', 'GCGTGAA', 'CCCGG', 'GATCGG', 'G']" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tokenizer_new.tokenize(\"TGGCGTGAACCCGGGATCGGG\")" ] }, { "cell_type": "code", "execution_count": null, "id": "640302f6-f740-41a4-ae92-ca4c43d97493", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.3" } }, "nbformat": 4, "nbformat_minor": 5 }