{ "cells": [ { "cell_type": "markdown", "id": "eca241b7", "metadata": {}, "source": [ "# Fine-tuning a custom Sentence Transformers model using synthetic data\n", "\n", "This notebook shows, at a high level, how to define a pipeline that uses an LLM to generate synthetic datasets for training or fine-tuning Sentence Transformers models on a custom domain.\n", "\n", "## Why fine-tune?\n", "\n", "Many good open source embedding models are already available, but you may:\n", "\n", "- work in a specific domain where existing embeddings don't perform well\n", "- have a specific concept of similarity you want to capture\n", "- want to optimize for a particular task\n", "\n", "In all of these cases, even a little fine-tuning might help.\n", "\n", "## How to get custom data?\n", "\n", "One of the main barriers to fine-tuning a custom model has been the cost and effort of creating the training datasets. Recently, LLMs have increasingly been used to generate synthetic datasets. In this series of notebooks, we'll see how to use an LLM to create training datasets for fine-tuning a sentence similarity model.\n", "\n", "Before we start creating our synthetic dataset, we'll do some initial exploration and preparation of the dataset we're working with.\n" ] }, { "cell_type": "markdown", "id": "6177d35d", "metadata": {}, "source": [ "
\n", " Tip: This notebook focuses on one particular dataset, but you should be able to adapt it fairly easily to any other dataset on the Hugging Face Hub. \n", "
" ] }, { "cell_type": "markdown", "id": "35e484b3", "metadata": {}, "source": [ "If you are running this notebook in Colab, you can use the following command to install the necessary libraries. If you are running in the Synthetic datasets workshop Space, everything is already installed." ] }, { "cell_type": "code", "execution_count": null, "id": "ed4c4b75", "metadata": {}, "outputs": [], "source": [ "#%pip install \"datasets>=2.18.0\" llama_index rich" ] }, { "cell_type": "markdown", "id": "dc1663d4", "metadata": {}, "source": [ "## 01. Preparing the data\n", "\n", "In this notebook, we'll focus on exploring the dataset and preparing it for generating our synthetic data. Depending on how well you already know your dataset, you might spend less time on this step. However, it's always worth looking at the data before generating synthetic data, since the approach you take may depend on the data you have." ] }, { "cell_type": "code", "execution_count": 1, "id": "d60d4d45-7eed-46cd-8404-5b645357daca", "metadata": { "tags": [] }, "outputs": [], "source": [ "import random\n", "import uuid\n", "from multiprocessing import cpu_count\n", "from typing import Any, Dict, Optional\n", "\n", "from datasets import load_dataset\n", "from huggingface_hub import login\n", "from llama_index.core import Document\n", "from llama_index.core.node_parser import SentenceSplitter\n", "from rich import print as rich_print" ] }, { "cell_type": "code", "execution_count": 2, "id": "bce86058", "metadata": { "tags": [] }, "outputs": [], "source": [ "NUM_PROC = cpu_count()" ] }, { "cell_type": "markdown", "id": "ca75b4b7", "metadata": {}, "source": [ "## Authenticate with the Hub\n", "\n", "You need to authenticate with the Hub to be able to push datasets to it. You can create a token by going to [https://huggingface.co/settings/tokens](https://huggingface.co/settings/tokens) and creating a new token. You will need a token with write access. 
It's suggested to create a new token for this workshop (you can always revoke it later)." ] }, { "cell_type": "code", "execution_count": 3, "id": "787899d1", "metadata": { "tags": [] }, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "ccceaaffb40445dc86f2564c1a12deea", "version_major": 2, "version_minor": 0 }, "text/plain": [ "VBox(children=(HTML(value='
[\n", " 'That Congress— (1) recognizes and celebrates the abolition of slavery more than 150 years ago in the Latin \n", "American countries of Mexico, Chile, Uruguay, Bolivia, Colombia, Ecuador, Argentina, Peru, and Venezuela; (2) \n", "recognizes the social, political, and cultural contributions of enslaved blacks and their descendants in Latin \n", "America; (3) acknowledges the impact of slavery and the existence of racial discrimination that have led to \n", "disparate social conditions and lack of civil liberties in Latin America; (4) urges the United States Government to\n", "work with the governments of Latin American countries to promote the visibility of the descendants of enslaved \n", "blacks in such countries and to recognize the importance of supporting international and regional efforts to \n", "eliminate racial and ethnic discrimination, such as the International Convention on the Elimination of All Forms of\n", "Racial Discrimination (signed at New York on December 21, 1965); and (5) urges the countries of Latin America to \n", "work with the United States and the international community to assist in addressing poverty and other targets in \n", "accordance with the United Nations Millennium Development Goals (as contained in United Nations General Assembly \n", "Resolution 55/2 (September 2000)).',\n", " 'That it is the sense of Congress that— (1) the United States should support the principles of democracy and \n", "constitutional rule in the Republic of Haiti, under which President Jean-Bertrand Aristide was elected, and oppose \n", "any and all attempts to remove President Aristide from office prior to the completion of his term under the \n", "Constitution of Haiti; (2) the United States should condemn the violent activities of groups of thugs, former \n", "members of Haiti’s disbanded army, and paramilitary organizations in Haiti; and (3) the United States, working with\n", "the United Nations, the Organization of American States (OAS), and 
other countries, should immediately provide \n", "assistance to Haiti to strengthen, reinforce, and professionalize the Haitian police force in order to enable the \n", "Haitian police force to restore law and order and preserve democracy in Haiti.'\n", "]\n", "\n" ], "text/plain": [ "\u001b[1m[\u001b[0m\n", " \u001b[32m'That Congress— \u001b[0m\u001b[32m(\u001b[0m\u001b[32m1\u001b[0m\u001b[32m)\u001b[0m\u001b[32m recognizes and celebrates the abolition of slavery more than 150 years ago in the Latin \u001b[0m\n", "\u001b[32mAmerican countries of Mexico, Chile, Uruguay, Bolivia, Colombia, Ecuador, Argentina, Peru, and Venezuela; \u001b[0m\u001b[32m(\u001b[0m\u001b[32m2\u001b[0m\u001b[32m)\u001b[0m\u001b[32m \u001b[0m\n", "\u001b[32mrecognizes the social, political, and cultural contributions of enslaved blacks and their descendants in Latin \u001b[0m\n", "\u001b[32mAmerica; \u001b[0m\u001b[32m(\u001b[0m\u001b[32m3\u001b[0m\u001b[32m)\u001b[0m\u001b[32m acknowledges the impact of slavery and the existence of racial discrimination that have led to \u001b[0m\n", "\u001b[32mdisparate social conditions and lack of civil liberties in Latin America; \u001b[0m\u001b[32m(\u001b[0m\u001b[32m4\u001b[0m\u001b[32m)\u001b[0m\u001b[32m urges the United States Government to\u001b[0m\n", "\u001b[32mwork with the governments of Latin American countries to promote the visibility of the descendants of enslaved \u001b[0m\n", "\u001b[32mblacks in such countries and to recognize the importance of supporting international and regional efforts to \u001b[0m\n", "\u001b[32meliminate racial and ethnic discrimination, such as the International Convention on the Elimination of All Forms of\u001b[0m\n", "\u001b[32mRacial Discrimination \u001b[0m\u001b[32m(\u001b[0m\u001b[32msigned at New York on December 21, 1965\u001b[0m\u001b[32m)\u001b[0m\u001b[32m; and \u001b[0m\u001b[32m(\u001b[0m\u001b[32m5\u001b[0m\u001b[32m)\u001b[0m\u001b[32m urges the countries of Latin America to \u001b[0m\n", 
"\u001b[32mwork with the United States and the international community to assist in addressing poverty and other targets in \u001b[0m\n", "\u001b[32maccordance with the United Nations Millennium Development Goals \u001b[0m\u001b[32m(\u001b[0m\u001b[32mas contained in United Nations General Assembly \u001b[0m\n", "\u001b[32mResolution 55/2 \u001b[0m\u001b[32m(\u001b[0m\u001b[32mSeptember 2000\u001b[0m\u001b[32m)\u001b[0m\u001b[32m)\u001b[0m\u001b[32m.'\u001b[0m,\n", " \u001b[32m'That it is the sense of Congress that— \u001b[0m\u001b[32m(\u001b[0m\u001b[32m1\u001b[0m\u001b[32m)\u001b[0m\u001b[32m the United States should support the principles of democracy and \u001b[0m\n", "\u001b[32mconstitutional rule in the Republic of Haiti, under which President Jean-Bertrand Aristide was elected, and oppose \u001b[0m\n", "\u001b[32many and all attempts to remove President Aristide from office prior to the completion of his term under the \u001b[0m\n", "\u001b[32mConstitution of Haiti; \u001b[0m\u001b[32m(\u001b[0m\u001b[32m2\u001b[0m\u001b[32m)\u001b[0m\u001b[32m the United States should condemn the violent activities of groups of thugs, former \u001b[0m\n", "\u001b[32mmembers of Haiti’s disbanded army, and paramilitary organizations in Haiti; and \u001b[0m\u001b[32m(\u001b[0m\u001b[32m3\u001b[0m\u001b[32m)\u001b[0m\u001b[32m the United States, working with\u001b[0m\n", "\u001b[32mthe United Nations, the Organization of American States \u001b[0m\u001b[32m(\u001b[0m\u001b[32mOAS\u001b[0m\u001b[32m)\u001b[0m\u001b[32m, and other countries, should immediately provide \u001b[0m\n", "\u001b[32massistance to Haiti to strengthen, reinforce, and professionalize the Haitian police force in order to enable the \u001b[0m\n", "\u001b[32mHaitian police force to restore law and order and preserve democracy in Haiti.'\u001b[0m\n", "\u001b[1m]\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "rich_print(ds[4:6]['text'])" ] }, { "cell_type": "markdown", "id": 
"df7129ec", "metadata": {}, "source": [ "We can see these texts are relatively short but if we take a look at other examples in this dataset we'll see there are some much longer ones. For most datasets we're working with which haven't already been preprocessed in some way, we'll find that we need to do some work to split the texts into smaller segments. " ] }, { "cell_type": "markdown", "id": "1c8a813d", "metadata": {}, "source": [ "## Chunking our text\n", "\n", "We'll need to split our text into smaller chunks to be able to use it for training a sentence similarity model. There are two main reasons for this:\n", "\n", "- Sentence Transformers models have a maximum input length for text/tokens they can process. This number depends on the model you're using. \n", "- Longer sections of text are more likely to be about multiple topics which can make it harder for the model to learn a specific type of similarity.\n", "\n", "Whilst the maximum embedding size for many open source models has grown recently we may still want to split our text into smaller chunks to ensure we have logical units of text to work with.\n", "\n", "### How to decide on the right chunk size\n", "\n", "Deciding on the right chunk size can be a bit of a balancing act and can depend on the specific dataset you're working with and the end application for your embedding model. One of the main applications of a custom sentence similarity model is to help improve the performance of a Retrieval Augmented Generation (RAG) application. In this case, you might want to split your text into chunks that are similar in length to the passages you'll be working with in your RAG application. \n", "\n", "\n", "### Splitting with Llama-index\n", "\n", "There are many libraries that have been developed for helping with RAG applications that can also help us with splitting our text into chunks. 
One of these is `Llama-index` which we'll use in this notebook.\n", "\n", "Llama-index has many different approaches for splitting texts (see [node_parsers](https://docs.llamaindex.ai/en/stable/api_reference/node_parsers/)). In this notebook we'll use the rather simple `SentenceSplitter`, which splits text into chunks while preferring to keep sentences complete:\n", "\n", ">In general, this class tries to keep sentences and paragraphs together. Therefore compared to the original TokenTextSplitter, there are less likely to be hanging sentences or parts of sentences at the end of the node chunk.\n", "\n", "If your data is in a format like HTML or Markdown, other parsers are likely to be worth exploring. There is also a `SemanticSplitterNodeParser` which \"splits a document into Nodes, with each node being a group of semantically related sentences.\" This could be worth exploring, but it is more computationally expensive and, depending on the text you are working with, might not lead to much better results.\n", "\n", "### What size should we split our text into?\n", "\n", "If we look at the docstring for `SentenceSplitter`, we can see that the default value for `chunk_size` is `1024` and the default for `chunk_overlap` is `200`. We might want to adjust these to see what size makes sense for our data. 
\n" ] }, { "cell_type": "code", "execution_count": 7, "id": "cba67505", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[0;31mInit signature:\u001b[0m\n", "\u001b[0mSentenceSplitter\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\u001b[0m\n", "\u001b[0;34m\u001b[0m \u001b[0mseparator\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mstr\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m' '\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", "\u001b[0;34m\u001b[0m \u001b[0mchunk_size\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mint\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;36m1024\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", "\u001b[0;34m\u001b[0m \u001b[0mchunk_overlap\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mint\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;36m200\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", "\u001b[0;34m\u001b[0m \u001b[0mtokenizer\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mOptional\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mCallable\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", "\u001b[0;34m\u001b[0m \u001b[0mparagraph_separator\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mstr\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m'\\n\\n\\n'\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", "\u001b[0;34m\u001b[0m \u001b[0mchunking_tokenizer_fn\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mOptional\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mCallable\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mstr\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mList\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mstr\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", "\u001b[0;34m\u001b[0m \u001b[0msecondary_chunking_regex\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mstr\u001b[0m \u001b[0;34m=\u001b[0m 
\u001b[0;34m'[^,.;。?!]+[,.;。?!]?'\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", "\u001b[0;34m\u001b[0m \u001b[0mcallback_manager\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mOptional\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mllama_index\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcore\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcallbacks\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mbase\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mCallbackManager\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", "\u001b[0;34m\u001b[0m \u001b[0minclude_metadata\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mbool\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mTrue\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", "\u001b[0;34m\u001b[0m \u001b[0minclude_prev_next_rel\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mbool\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mTrue\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", "\u001b[0;34m\u001b[0m \u001b[0mid_func\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mOptional\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mCallable\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mint\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mllama_index\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcore\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mschema\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mDocument\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mstr\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", "\u001b[0;34m\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;34m->\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mDocstring:\u001b[0m \n", "Parse text with a preference for complete sentences.\n", "\n", "In general, this class tries to keep sentences and paragraphs together. 
Therefore\n", "compared to the original TokenTextSplitter, there are less likely to be\n", "hanging sentences or parts of sentences at the end of the node chunk.\n", "\u001b[0;31mInit docstring:\u001b[0m Initialize with parameters.\n", "\u001b[0;31mFile:\u001b[0m ~/Documents/tutorials/space/synthetic-data-workshop/.venv/lib/python3.11/site-packages/llama_index/core/node_parser/text/sentence.py\n", "\u001b[0;31mType:\u001b[0m ModelMetaclass\n", "\u001b[0;31mSubclasses:\u001b[0m " ] } ], "source": [ "?SentenceSplitter" ] }, { "cell_type": "code", "execution_count": 8, "id": "ba57172b", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(1024, 200)" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "splitter = SentenceSplitter()\n", "splitter.chunk_size, splitter.chunk_overlap" ] }, { "cell_type": "markdown", "id": "d51435a3", "metadata": {}, "source": [ "Let's load an example text and see how different sizes of chunks look." ] }, { "cell_type": "code", "execution_count": 9, "id": "14bdace0", "metadata": {}, "outputs": [], "source": [ "doc = Document.from_dict({\"text\": ds[200]['text']})" ] }, { "cell_type": "code", "execution_count": 10, "id": "b0f2861c", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[TextNode(id_='b77fc036-8f99-415c-8142-f9a672f450bc', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={: RelatedNodeInfo(node_id='e59781d0-ad81-4411-9415-697b650e3d7e', node_type=, metadata={}, hash='19c0d72fca6cf938248ff70d34f7af953ce06be4b57f1748652d7fa878b87185'), : RelatedNodeInfo(node_id='5f7e3f35-a6dd-481b-968a-c70b2717c3d5', node_type=, metadata={}, hash='45627c584bb296bcb3c49c428092399194c37c6726210acedd5b22530af6d5a8')}, text='1. Short title; table of contents \\n(a) Short title \\nThis Act may be cited as the National Forest Organizational Camp Fee Improvement Act of 2003. 
(b) Table of contents \\nThe table of contents for this Act is as follows: Sec. 1. Short title; table of contents Sec. 2. Findings, purpose, and definitions Sec. 3. Fees for occupancy and use of National Forest System lands and facilities by organizational camps Sec. 4. Implementation Sec. 5. Relationship to other laws Sec. 6. Deposit and expenditure of use fees Sec. 7. Ministerial issuance or amendment authorization 2. Findings, purpose, and definitions \\n(a) Findings \\nCongress finds the following: (1) Organizational camps, such as those administered by the Boy Scouts, Girl Scouts, and faith-based and community-based organizations, provide a valuable service to young people, individuals with a disability, and their families by promoting physical, mental, and spiritual health through activities conducted in a natural environment. (2) The 192,000,0000 acres of national forests and grasslands of the National Forest System managed for multiple uses by the Forest Service provides an ideal setting for such organizational camps. (3) The Federal Government should charge land use fees for the occupancy and use of National Forest System lands by such organizational camps that, while based on the fair market value of the land in use, also recognize the benefits provided to society by such organizational camps, do not preclude the ability of such organizational camps from utilizing these lands, and permit capital investment in, and maintenance of, camp facilities by such organizational camps or their sponsoring organizations. (4) Organizational camps should— (A) ensure that their facilities meet applicable building and safety codes, including fire and health codes; (B) have annual inspections as required by local law, including at a minimum inspections for fire and food safety; and (C) have in place safety plans that address fire and medical emergencies and encounters with wildlife. 
(b) Purpose \\nIt is the purpose of this Act to establish a land use fee system that provides for an equitable return to the Federal Government for the occupancy and use of National Forest System lands by organizational camps that serve young people or individuals with a disability. (c) Definitions \\nIn this Act: (1) The term organizational camp means a public or semi-public camp that— (A) is developed on National Forest System lands by a nonprofit organization or governmental entity; (B) provides a valuable service to the public by using such lands as a setting to introduce young people or individuals with a disability to activities that they may not otherwise experience and to educate them on natural resource issues; and (C) does not have as its primary purpose raising revenue through commercial activities. (2) The term Secretary means the Secretary of Agriculture, acting through the Chief of the Forest Service. (3) The term individual with a disability has the meaning given the term in section 7 of the Rehabilitation Act of 1973 (29 U.S.C. 705). (4) The term children at risk means children who are raised in poverty or in single-parent homes or are subject to such circumstances as parental drug abuse, homelessness, or child abuse. (5) The term change in control means— (A) in the case of a corporation, the sale or transfer of a controlling interest in the corporation; (B) in the case of a partnership or limited liability company, the sale or transfer of a controlling interest in the partnership or limited liability company; and (C) in the case of an individual, the sale or transfer of an organizational camp to another party. 3. 
Fees for occupancy and use of National Forest System lands and facilities by organizational camps \\n(a) Land use fee \\n(1) Percentage of land value \\nThe Secretary shall charge an annual land use fee for each organizational camp for its occupancy and use of National Forest System lands equal to five percent of the product of the following: (A) The total number of acres of National Forest System lands authorized for the organizational camp. (B) The estimated per-acre market value of land and buildings in the county where the camp is located, as reported in the most recent Census of Agriculture conducted by the National Agricultural Statistics Service. (2) Annual adjustment \\nThe land use fee determined under paragraph (1) for an organizational camp shall be adjusted annually by the annual compounded rate of change between the two most recent Censuses of Agriculture. (3) Reduction in fees \\n(A) Based on type of participants \\nThe Secretary shall reduce the land use fee determined under paragraph (1) for an organizational camp if the organizational camp is attended by individuals with a disability or children at risk.', start_char_idx=0, end_char_idx=4828, text_template='{metadata_str}\\n\\n{content}', metadata_template='{key}: {value}', metadata_seperator='\\n'),\n", " TextNode(id_='5f7e3f35-a6dd-481b-968a-c70b2717c3d5', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={: RelatedNodeInfo(node_id='e59781d0-ad81-4411-9415-697b650e3d7e', node_type=, metadata={}, hash='19c0d72fca6cf938248ff70d34f7af953ce06be4b57f1748652d7fa878b87185'), : RelatedNodeInfo(node_id='b77fc036-8f99-415c-8142-f9a672f450bc', node_type=, metadata={}, hash='bc3a3fa910ae2a4b931df0affbcb1e9ceb8a31876ea173f0411d363da53cb335'), : RelatedNodeInfo(node_id='0191012f-76d7-46b3-bdd3-ba84581f04c3', node_type=, metadata={}, hash='d935674e66f4f56c1e706195b7d0e764369299c171aadf0e091a8167e1daecf3')}, text=\"(B) The estimated per-acre market value 
of land and buildings in the county where the camp is located, as reported in the most recent Census of Agriculture conducted by the National Agricultural Statistics Service. (2) Annual adjustment \\nThe land use fee determined under paragraph (1) for an organizational camp shall be adjusted annually by the annual compounded rate of change between the two most recent Censuses of Agriculture. (3) Reduction in fees \\n(A) Based on type of participants \\nThe Secretary shall reduce the land use fee determined under paragraph (1) for an organizational camp if the organizational camp is attended by individuals with a disability or children at risk. The amount of the reduction for a year shall bear the same ratio to the land use fee determined under paragraph (1) for the organizational camp as the total number of individuals with a disability and children at risk who attend the organizational camp bears to the total number of individuals who attend the organizational camp for the year. (B) Based on type of programs \\nAfter making the reduction required by subparagraph (A), the Secretary shall also reduce the land use fee determined under paragraph (1) for an organizational camp if the organizational camp provides youth programs for individuals attending the camp consisting of organized and supervised social, citizenship, character-building, or faith-based activities oriented to outdoor-recreation experiences. The amount of the reduction for a year shall be equal to 60 percent of the land use fee determined under paragraph (1), as adjusted under subparagraph (A). (C) Relation to minimum fee \\nNotwithstanding subparagraphs (A) and (B), the reductions made under this paragraph may not reduce the land use fee for an organizational camp below the minimum land use fee required to be charged under paragraph (4). 
(D) Special considerations \\nFor purposes of determining the amount of the land use fee reduction required under subparagraph (A) or (B), the Secretary may not take into consideration the existence of sponsorships or scholarships to assist individuals in attending the organizational camp. (4) Minimum land use fee \\nThe Secretary shall charge a minimum land use fee under paragraph (1) that represents, on average, the Secretary's cost annually to administer an organizational camp special use authorization in the National Forest Region in which the organizational camp is located. Notwithstanding paragraph (3) or subsection (d), the minimum land use fee shall not be subject to a reduction or waiver. (b) Facility use fee \\n(1) Percentage of facilities value \\nIf an organizational camp uses a Government-owned facility on National Forest System lands pursuant to section 7 of the Act of April 24, 1950 (commonly known as the Granger-Thye Act; 16 U.S.C. 580d), the Secretary shall charge, in addition to the land use fee imposed under subsection (a), a facility use fee equal to five percent of the value of the authorized facilities, as determined by the Secretary. (2) Reduction in fees prohibited \\nNotwithstanding subsection (d), the facility use fees determined under paragraph (1) shall not be subject to a reduction or waiver. (c) Fee related to receipt of other revenues \\nIf an organizational camp derives revenue from the use of National Forest System lands or authorized facilities described in subsection (b) for purposes other than to introduce young people or individuals with a disability to activities that they may not otherwise experience and to educate them on natural resource issues, the Secretary shall charge, in addition to the land use fee imposed under subsection (a) and the facility use fee imposed under subsection (b), an additional fee equal to five percent of that revenue. 
(d) Work-in-lieu program \\nSubject to subsections (a)(4) and (b)(2), section 3 of the Federal Timber Contract Payment Modification Act (16 U.S.C. 539f) shall apply to the use fees imposed under this section. 4. Implementation \\n(a) Prompt implementation \\nThe Secretary shall issue direction regarding implementation of this Act by interim directive within 180 days after the date of the enactment of this Act. The Secretary shall implement this Act beginning with the first billing cycle for organizational camp special use authorizations occurring more than 180 days after the date of the enactment of this Act. (b) Phase-in of use fee increases \\nIn issuing any direction regarding implementation of this Act under subsection (a), the Secretary shall consider whether to phase-in any significant increases in annual land or facility use fees for organizational camps. 5. Relationship to other laws \\nExcept as specifically provided by this Act, nothing in this Act supersedes or otherwise affects any provision of law, regulation, or policy regarding the issuance or administration of authorizations for organizational camps regarding the occupancy and use of National Forest System lands. 6. 
Deposit and expenditure of use fees \\n(a) Deposit and availability \\nUnless subject to section 7 of the Act of April 24, 1950 (commonly known as the Granger-Thye Act; 16 U.S.C.\", start_char_idx=4143, end_char_idx=9275, text_template='{metadata_str}\\n\\n{content}', metadata_template='{key}: {value}', metadata_seperator='\\n'),\n", " TextNode(id_='0191012f-76d7-46b3-bdd3-ba84581f04c3', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={: RelatedNodeInfo(node_id='e59781d0-ad81-4411-9415-697b650e3d7e', node_type=, metadata={}, hash='19c0d72fca6cf938248ff70d34f7af953ce06be4b57f1748652d7fa878b87185'), : RelatedNodeInfo(node_id='5f7e3f35-a6dd-481b-968a-c70b2717c3d5', node_type=, metadata={}, hash='45627c584bb296bcb3c49c428092399194c37c6726210acedd5b22530af6d5a8')}, text=\"The Secretary shall implement this Act beginning with the first billing cycle for organizational camp special use authorizations occurring more than 180 days after the date of the enactment of this Act. (b) Phase-in of use fee increases \\nIn issuing any direction regarding implementation of this Act under subsection (a), the Secretary shall consider whether to phase-in any significant increases in annual land or facility use fees for organizational camps. 5. Relationship to other laws \\nExcept as specifically provided by this Act, nothing in this Act supersedes or otherwise affects any provision of law, regulation, or policy regarding the issuance or administration of authorizations for organizational camps regarding the occupancy and use of National Forest System lands. 6. Deposit and expenditure of use fees \\n(a) Deposit and availability \\nUnless subject to section 7 of the Act of April 24, 1950 (commonly known as the Granger-Thye Act; 16 U.S.C. 
580d), use fees collected by the Secretary under this Act shall be deposited in a special account in the Treasury and shall remain available to the Secretary for expenditure, without further appropriation until expended, for the purposes described in subsection (c). (b) Transfer \\nUpon request of the Secretary, the Secretary of the Treasury shall transfer to the Secretary from the special account such amounts as the Secretary may request. The Secretary shall accept and use such amounts in accordance with subsection (c). (c) Use \\nUse fees deposited pursuant to subsection (a) and transferred to the Secretary under subsection (b) shall be expended for monitoring of Forest Service special use authorizations, administration of the Forest Service's special program, interpretive programs, environmental analysis, environmental restoration, and similar purposes. 7. Ministerial issuance or amendment authorization \\n(a) NEPA exception \\nThe ministerial issuance or amendment of an organizational camp special use authorization shall not be subject to the National Environmental Policy Act of 1969 (42 U.S.C. 4321 et seq.). (b) Rule of construction \\nFor purposes of subsection (a), the ministerial issuance or amendment of an authorization occurs only when the issuance or amendment of the authorization would not change the physical environment or the activities, facilities, or program of the operations governed by the authorization, and at least one of the following apply: (1) The authorization is issued upon a change in control of the holder of an existing authorization. (2) The holder, upon expiration of an authorization, is issued a new authorization. 
(3) The authorization is amended— (A) to effectuate administrative changes, such as modification of the land use fee or conversion to a new special use authorization form; or (B) to include nondiscretionary environmental standards or to conform with current law.\", start_char_idx=8318, end_char_idx=11200, text_template='{metadata_str}\\n\\n{content}', metadata_template='{key}: {value}', metadata_seperator='\\n')]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "splits = splitter.get_nodes_from_documents([doc])\n", "splits" ] }, { "cell_type": "code", "execution_count": 11, "id": "d6375311", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
1. Short title; table of contents \n",
       "(a) Short title \n",
       "This Act may be cited as the National Forest Organizational Camp Fee Improvement Act of 2003. (b) Table of contents\n",
       "The table of contents for this Act is as follows: Sec. 1. Short title; table of contents Sec. 2. Findings, purpose,\n",
       "and definitions Sec. 3. Fees for occupancy and use of National Forest System lands and facilities by organizational\n",
       "camps Sec. 4. Implementation Sec. 5. Relationship to other laws Sec. 6. Deposit and expenditure of use fees Sec. 7.\n",
       "Ministerial issuance or amendment authorization 2. Findings, purpose, and definitions \n",
       "(a) Findings \n",
       "Congress finds the following: (1) Organizational camps, such as those administered by the Boy Scouts, Girl Scouts, \n",
       "and faith-based and community-based organizations, provide a valuable service to young people, individuals with a \n",
       "disability, and their families by promoting physical, mental, and spiritual health through activities conducted in \n",
       "a natural environment. (2) The 192,000,0000 acres of national forests and grasslands of the National Forest System \n",
       "managed for multiple uses by the Forest Service provides an ideal setting for such organizational camps. (3) The \n",
       "Federal Government should charge land use fees for the occupancy and use of National Forest System lands by such \n",
       "organizational camps that, while based on the fair market value of the land in use, also recognize the benefits \n",
       "provided to society by such organizational camps, do not preclude the ability of such organizational camps from \n",
       "utilizing these lands, and permit capital investment in, and maintenance of, camp facilities by such organizational\n",
       "camps or their sponsoring organizations. (4) Organizational camps should— (A) ensure that their facilities meet \n",
       "applicable building and safety codes, including fire and health codes; (B) have annual inspections as required by \n",
       "local law, including at a minimum inspections for fire and food safety; and (C) have in place safety plans that \n",
       "address fire and medical emergencies and encounters with wildlife. (b) Purpose \n",
       "It is the purpose of this Act to establish a land use fee system that provides for an equitable return to the \n",
       "Federal Government for the occupancy and use of National Forest System lands by organizational camps that serve \n",
       "young people or individuals with a disability. (c) Definitions \n",
       "In this Act: (1) The term organizational camp means a public or semi-public camp that— (A) is developed on National\n",
       "Forest System lands by a nonprofit organization or governmental entity; (B) provides a valuable service to the \n",
       "public by using such lands as a setting to introduce young people or individuals with a disability to activities \n",
       "that they may not otherwise experience and to educate them on natural resource issues; and (C) does not have as its\n",
       "primary purpose raising revenue through commercial activities. (2) The term Secretary means the Secretary of \n",
       "Agriculture, acting through the Chief of the Forest Service. (3) The term individual with a disability has the \n",
       "meaning given the term in section 7 of the Rehabilitation Act of 1973 (29 U.S.C. 705). (4) The term children at \n",
       "risk means children who are raised in poverty or in single-parent homes or are subject to such circumstances as \n",
       "parental drug abuse, homelessness, or child abuse. (5) The term change in control means— (A) in the case of a \n",
       "corporation, the sale or transfer of a controlling interest in the corporation; (B) in the case of a partnership or\n",
       "limited liability company, the sale or transfer of a controlling interest in the partnership or limited liability \n",
       "company; and (C) in the case of an individual, the sale or transfer of an organizational camp to another party. 3. \n",
       "Fees for occupancy and use of National Forest System lands and facilities by organizational camps \n",
       "(a) Land use fee \n",
       "(1) Percentage of land value \n",
       "The Secretary shall charge an annual land use fee for each organizational camp for its occupancy and use of \n",
       "National Forest System lands equal to five percent of the product of the following: (A) The total number of acres \n",
       "of National Forest System lands authorized for the organizational camp. (B) The estimated per-acre market value of \n",
       "land and buildings in the county where the camp is located, as reported in the most recent Census of Agriculture \n",
       "conducted by the National Agricultural Statistics Service. (2) Annual adjustment \n",
       "The land use fee determined under paragraph (1) for an organizational camp shall be adjusted annually by the annual\n",
       "compounded rate of change between the two most recent Censuses of Agriculture. (3) Reduction in fees \n",
       "(A) Based on type of participants \n",
       "The Secretary shall reduce the land use fee determined under paragraph (1) for an organizational camp if the \n",
       "organizational camp is attended by individuals with a disability or children at risk.\n",
       "
\n" ], "text/plain": [ "\u001b[1;36m1\u001b[0m. Short title; table of contents \n", "\u001b[1m(\u001b[0ma\u001b[1m)\u001b[0m Short title \n", "This Act may be cited as the National Forest Organizational Camp Fee Improvement Act of \u001b[1;36m2003\u001b[0m. \u001b[1m(\u001b[0mb\u001b[1m)\u001b[0m Table of contents\n", "The table of contents for this Act is as follows: Sec. \u001b[1;36m1\u001b[0m. Short title; table of contents Sec. \u001b[1;36m2\u001b[0m. Findings, purpose,\n", "and definitions Sec. \u001b[1;36m3\u001b[0m. Fees for occupancy and use of National Forest System lands and facilities by organizational\n", "camps Sec. \u001b[1;36m4\u001b[0m. Implementation Sec. \u001b[1;36m5\u001b[0m. Relationship to other laws Sec. \u001b[1;36m6\u001b[0m. Deposit and expenditure of use fees Sec. \u001b[1;36m7\u001b[0m.\n", "Ministerial issuance or amendment authorization \u001b[1;36m2\u001b[0m. Findings, purpose, and definitions \n", "\u001b[1m(\u001b[0ma\u001b[1m)\u001b[0m Findings \n", "Congress finds the following: \u001b[1m(\u001b[0m\u001b[1;36m1\u001b[0m\u001b[1m)\u001b[0m Organizational camps, such as those administered by the Boy Scouts, Girl Scouts, \n", "and faith-based and community-based organizations, provide a valuable service to young people, individuals with a \n", "disability, and their families by promoting physical, mental, and spiritual health through activities conducted in \n", "a natural environment. \u001b[1m(\u001b[0m\u001b[1;36m2\u001b[0m\u001b[1m)\u001b[0m The \u001b[1;36m192\u001b[0m,\u001b[1;36m000\u001b[0m,\u001b[1;36m0000\u001b[0m acres of national forests and grasslands of the National Forest System \n", "managed for multiple uses by the Forest Service provides an ideal setting for such organizational camps. 
\u001b[1m(\u001b[0m\u001b[1;36m3\u001b[0m\u001b[1m)\u001b[0m The \n", "Federal Government should charge land use fees for the occupancy and use of National Forest System lands by such \n", "organizational camps that, while based on the fair market value of the land in use, also recognize the benefits \n", "provided to society by such organizational camps, do not preclude the ability of such organizational camps from \n", "utilizing these lands, and permit capital investment in, and maintenance of, camp facilities by such organizational\n", "camps or their sponsoring organizations. \u001b[1m(\u001b[0m\u001b[1;36m4\u001b[0m\u001b[1m)\u001b[0m Organizational camps should— \u001b[1m(\u001b[0mA\u001b[1m)\u001b[0m ensure that their facilities meet \n", "applicable building and safety codes, including fire and health codes; \u001b[1m(\u001b[0mB\u001b[1m)\u001b[0m have annual inspections as required by \n", "local law, including at a minimum inspections for fire and food safety; and \u001b[1m(\u001b[0mC\u001b[1m)\u001b[0m have in place safety plans that \n", "address fire and medical emergencies and encounters with wildlife. \u001b[1m(\u001b[0mb\u001b[1m)\u001b[0m Purpose \n", "It is the purpose of this Act to establish a land use fee system that provides for an equitable return to the \n", "Federal Government for the occupancy and use of National Forest System lands by organizational camps that serve \n", "young people or individuals with a disability. 
\u001b[1m(\u001b[0mc\u001b[1m)\u001b[0m Definitions \n", "In this Act: \u001b[1m(\u001b[0m\u001b[1;36m1\u001b[0m\u001b[1m)\u001b[0m The term organizational camp means a public or semi-public camp that— \u001b[1m(\u001b[0mA\u001b[1m)\u001b[0m is developed on National\n", "Forest System lands by a nonprofit organization or governmental entity; \u001b[1m(\u001b[0mB\u001b[1m)\u001b[0m provides a valuable service to the \n", "public by using such lands as a setting to introduce young people or individuals with a disability to activities \n", "that they may not otherwise experience and to educate them on natural resource issues; and \u001b[1m(\u001b[0mC\u001b[1m)\u001b[0m does not have as its\n", "primary purpose raising revenue through commercial activities. \u001b[1m(\u001b[0m\u001b[1;36m2\u001b[0m\u001b[1m)\u001b[0m The term Secretary means the Secretary of \n", "Agriculture, acting through the Chief of the Forest Service. \u001b[1m(\u001b[0m\u001b[1;36m3\u001b[0m\u001b[1m)\u001b[0m The term individual with a disability has the \n", "meaning given the term in section \u001b[1;36m7\u001b[0m of the Rehabilitation Act of \u001b[1;36m1973\u001b[0m \u001b[1m(\u001b[0m\u001b[1;36m29\u001b[0m U.S.C. \u001b[1;36m705\u001b[0m\u001b[1m)\u001b[0m. \u001b[1m(\u001b[0m\u001b[1;36m4\u001b[0m\u001b[1m)\u001b[0m The term children at \n", "risk means children who are raised in poverty or in single-parent homes or are subject to such circumstances as \n", "parental drug abuse, homelessness, or child abuse. 
\u001b[1m(\u001b[0m\u001b[1;36m5\u001b[0m\u001b[1m)\u001b[0m The term change in control means— \u001b[1m(\u001b[0mA\u001b[1m)\u001b[0m in the case of a \n", "corporation, the sale or transfer of a controlling interest in the corporation; \u001b[1m(\u001b[0mB\u001b[1m)\u001b[0m in the case of a partnership or\n", "limited liability company, the sale or transfer of a controlling interest in the partnership or limited liability \n", "company; and \u001b[1m(\u001b[0mC\u001b[1m)\u001b[0m in the case of an individual, the sale or transfer of an organizational camp to another party. \u001b[1;36m3\u001b[0m. \n", "Fees for occupancy and use of National Forest System lands and facilities by organizational camps \n", "\u001b[1m(\u001b[0ma\u001b[1m)\u001b[0m Land use fee \n", "\u001b[1m(\u001b[0m\u001b[1;36m1\u001b[0m\u001b[1m)\u001b[0m Percentage of land value \n", "The Secretary shall charge an annual land use fee for each organizational camp for its occupancy and use of \n", "National Forest System lands equal to five percent of the product of the following: \u001b[1m(\u001b[0mA\u001b[1m)\u001b[0m The total number of acres \n", "of National Forest System lands authorized for the organizational camp. \u001b[1m(\u001b[0mB\u001b[1m)\u001b[0m The estimated per-acre market value of \n", "land and buildings in the county where the camp is located, as reported in the most recent Census of Agriculture \n", "conducted by the National Agricultural Statistics Service. \u001b[1m(\u001b[0m\u001b[1;36m2\u001b[0m\u001b[1m)\u001b[0m Annual adjustment \n", "The land use fee determined under paragraph \u001b[1m(\u001b[0m\u001b[1;36m1\u001b[0m\u001b[1m)\u001b[0m for an organizational camp shall be adjusted annually by the annual\n", "compounded rate of change between the two most recent Censuses of Agriculture. 
\u001b[1m(\u001b[0m\u001b[1;36m3\u001b[0m\u001b[1m)\u001b[0m Reduction in fees \n", "\u001b[1m(\u001b[0mA\u001b[1m)\u001b[0m Based on type of participants \n", "The Secretary shall reduce the land use fee determined under paragraph \u001b[1m(\u001b[0m\u001b[1;36m1\u001b[0m\u001b[1m)\u001b[0m for an organizational camp if the \n", "organizational camp is attended by individuals with a disability or children at risk.\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "rich_print(splits[0].text)" ] }, { "cell_type": "code", "execution_count": 12, "id": "4779af81", "metadata": {}, "outputs": [], "source": [ "splitter = SentenceSplitter(chunk_size=128, chunk_overlap=0)" ] }, { "cell_type": "code", "execution_count": 13, "id": "6d026858", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
1. Short title; table of contents \n",
       "(a) Short title \n",
       "This Act may be cited as the National Forest Organizational Camp Fee Improvement Act of 2003. (b) Table of contents\n",
       "The table of contents for this Act is as follows: Sec. 1. Short title; table of contents Sec. 2. Findings, purpose,\n",
       "and definitions Sec. 3. Fees for occupancy and use of National Forest System lands and facilities by organizational\n",
       "camps Sec. 4. Implementation Sec. 5. Relationship to other laws Sec. 6.\n",
       "
\n" ], "text/plain": [ "\u001b[1;36m1\u001b[0m. Short title; table of contents \n", "\u001b[1m(\u001b[0ma\u001b[1m)\u001b[0m Short title \n", "This Act may be cited as the National Forest Organizational Camp Fee Improvement Act of \u001b[1;36m2003\u001b[0m. \u001b[1m(\u001b[0mb\u001b[1m)\u001b[0m Table of contents\n", "The table of contents for this Act is as follows: Sec. \u001b[1;36m1\u001b[0m. Short title; table of contents Sec. \u001b[1;36m2\u001b[0m. Findings, purpose,\n", "and definitions Sec. \u001b[1;36m3\u001b[0m. Fees for occupancy and use of National Forest System lands and facilities by organizational\n", "camps Sec. \u001b[1;36m4\u001b[0m. Implementation Sec. \u001b[1;36m5\u001b[0m. Relationship to other laws Sec. \u001b[1;36m6\u001b[0m.\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "splits = splitter.get_nodes_from_documents([doc])\n", "rich_print(splits[0].text)" ] }, { "cell_type": "code", "execution_count": 14, "id": "763f90a9", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "20" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Number of documents to sample when checking the chunk size\n", "sample_size = 10\n", "documents = [Document.from_dict({\"text\": ds[i]['text']}) for i in range(sample_size)]\n", "splits = splitter.get_nodes_from_documents(documents)\n", "len(splits)\n", "# uncomment to see the output\n", "# for split in splits:\n", "#     rich_print(split.text)" ] }, { "cell_type": "markdown", "id": "1ed1a8cd", "metadata": {}, "source": [ "The texts in this dataset are dense in the topics they cover, so a smaller chunk size such as 128 makes sense: each chunk is then more likely to focus on a single topic. If you are using a different dataset, experiment with different chunk sizes to see what works best for your data."
] }, { "cell_type": "markdown", "id": "41fb3f85", "metadata": {}, "source": [ "## Process our full dataset\n", "\n", "Now that we've decided on a chunk size, let's process our full dataset. We'll split each text into chunks and save these to a new dataset." ] }, { "cell_type": "code", "execution_count": 15, "id": "ada9122f-a505-4bc0-80f6-228be2067891", "metadata": { "tags": [] }, "outputs": [], "source": [ "def split_texts(\n", "    examples: Dict[str, Any],\n", "    text_column_name: str = \"text\",\n", "    id_column_name: Optional[str] = None,\n", "    splitter: Optional[SentenceSplitter] = None,\n", "):\n", "    if splitter is None:\n", "        # If not provided, use the default splitter\n", "        splitter = SentenceSplitter()\n", "    texts = examples[text_column_name]\n", "    if id_column_name is None:\n", "        # Generate random ids if not provided\n", "        ids = [str(uuid.uuid4()) for _ in range(len(texts))]\n", "    else:\n", "        ids = examples[id_column_name]\n", "    sections = []\n", "    ids_ = []\n", "    for text, id_ in zip(texts, ids):\n", "        # Create a document for each text\n", "        document = Document(text=text)\n", "        # Split the document into nodes\n", "        nodes = splitter.get_nodes_from_documents([document])\n", "        # Extract the text from each node\n", "        sentences = [n.text for n in nodes]\n", "        # Extend the sections list with these chunks\n", "        sections.extend(sentences)\n", "        # Repeat the source id once per chunk so each chunk maps back to its original row\n", "        ids_.extend([id_] * len(sentences))\n", "    return {\"section\": sections, \"id\": ids_}" ] }, { "cell_type": "markdown", "id": "d7221f82", "metadata": {}, "source": [ "We can now split the full dataset. \n", "\n", "If you are using a different dataset, remember to adjust `text_column_name` if the column containing the text has a different name. If your dataset has an `id` column, pass its name as `id_column_name`; otherwise leave it as `None` and the function will generate a random id for each row."
] }, { "cell_type": "code", "execution_count": 16, "id": "336c0f08", "metadata": {}, "outputs": [], "source": [ "splitter = SentenceSplitter(chunk_size=128, chunk_overlap=0)" ] }, { "cell_type": "code", "execution_count": 17, "id": "3bb3807f-37ac-4e52-b37a-dc7ebc4a3446", "metadata": { "tags": [] }, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "9aad33479c3348619a33327e168b5f45", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Map (num_proc=8): 0%| | 0/125246 [00:00