{ "cells": [ { "cell_type": "markdown", "id": "eca241b7", "metadata": {}, "source": [ "# Fine-tuning a custom Sentence Transformers model using synthetic data\n", "\n", "This notebook shows, at a high level, how to define a pipeline that uses an LLM to generate synthetic datasets for training or fine-tuning Sentence Transformers models on a custom domain.\n", "\n", "## Why fine-tune?\n", "\n", "There are already many good open-source embedding models you can use, but you may:\n", "\n", "- work in a specific domain where existing embeddings might not work well\n", "- have a specific concept of similarity you want to capture\n", "- want to optimize for a particular task\n", "\n", "In all of these cases, even a little fine-tuning might help.\n", "\n", "## How to get custom data?\n", "\n", "One of the main barriers to fine-tuning a custom model has been the cost and effort involved in creating the datasets needed for this training. Recently, LLMs have increasingly been used to generate synthetic datasets. In this series of notebooks, we'll see how to use an LLM to create training datasets for fine-tuning a sentence similarity model.\n", "\n", "Before we start creating our dataset, we'll do some initial exploration and preparation of the data we're working with.\n" ] }, { "cell_type": "markdown", "id": "6177d35d", "metadata": {}, "source": [ "
1. Short title; table of contents \n", "(a) Short title \n", "This Act may be cited as the National Forest Organizational Camp Fee Improvement Act of 2003. (b) Table of contents\n", "The table of contents for this Act is as follows: Sec. 1. Short title; table of contents Sec. 2. Findings, purpose,\n", "and definitions Sec. 3. Fees for occupancy and use of National Forest System lands and facilities by organizational\n", "camps Sec. 4. Implementation Sec. 5. Relationship to other laws Sec. 6. Deposit and expenditure of use fees Sec. 7.\n", "Ministerial issuance or amendment authorization 2. Findings, purpose, and definitions \n", "(a) Findings \n", "Congress finds the following: (1) Organizational camps, such as those administered by the Boy Scouts, Girl Scouts, \n", "and faith-based and community-based organizations, provide a valuable service to young people, individuals with a \n", "disability, and their families by promoting physical, mental, and spiritual health through activities conducted in \n", "a natural environment. (2) The 192,000,0000 acres of national forests and grasslands of the National Forest System \n", "managed for multiple uses by the Forest Service provides an ideal setting for such organizational camps. (3) The \n", "Federal Government should charge land use fees for the occupancy and use of National Forest System lands by such \n", "organizational camps that, while based on the fair market value of the land in use, also recognize the benefits \n", "provided to society by such organizational camps, do not preclude the ability of such organizational camps from \n", "utilizing these lands, and permit capital investment in, and maintenance of, camp facilities by such organizational\n", "camps or their sponsoring organizations. 
(4) Organizational camps should— (A) ensure that their facilities meet \n", "applicable building and safety codes, including fire and health codes; (B) have annual inspections as required by \n", "local law, including at a minimum inspections for fire and food safety; and (C) have in place safety plans that \n", "address fire and medical emergencies and encounters with wildlife. (b) Purpose \n", "It is the purpose of this Act to establish a land use fee system that provides for an equitable return to the \n", "Federal Government for the occupancy and use of National Forest System lands by organizational camps that serve \n", "young people or individuals with a disability. (c) Definitions \n", "In this Act: (1) The term organizational camp means a public or semi-public camp that— (A) is developed on National\n", "Forest System lands by a nonprofit organization or governmental entity; (B) provides a valuable service to the \n", "public by using such lands as a setting to introduce young people or individuals with a disability to activities \n", "that they may not otherwise experience and to educate them on natural resource issues; and (C) does not have as its\n", "primary purpose raising revenue through commercial activities. (2) The term Secretary means the Secretary of \n", "Agriculture, acting through the Chief of the Forest Service. (3) The term individual with a disability has the \n", "meaning given the term in section 7 of the Rehabilitation Act of 1973 (29 U.S.C. 705). (4) The term children at \n", "risk means children who are raised in poverty or in single-parent homes or are subject to such circumstances as \n", "parental drug abuse, homelessness, or child abuse. 
(5) The term change in control means— (A) in the case of a \n", "corporation, the sale or transfer of a controlling interest in the corporation; (B) in the case of a partnership or\n", "limited liability company, the sale or transfer of a controlling interest in the partnership or limited liability \n", "company; and (C) in the case of an individual, the sale or transfer of an organizational camp to another party. 3. \n", "Fees for occupancy and use of National Forest System lands and facilities by organizational camps \n", "(a) Land use fee \n", "(1) Percentage of land value \n", "The Secretary shall charge an annual land use fee for each organizational camp for its occupancy and use of \n", "National Forest System lands equal to five percent of the product of the following: (A) The total number of acres \n", "of National Forest System lands authorized for the organizational camp. (B) The estimated per-acre market value of \n", "land and buildings in the county where the camp is located, as reported in the most recent Census of Agriculture \n", "conducted by the National Agricultural Statistics Service. (2) Annual adjustment \n", "The land use fee determined under paragraph (1) for an organizational camp shall be adjusted annually by the annual\n", "compounded rate of change between the two most recent Censuses of Agriculture. (3) Reduction in fees \n", "(A) Based on type of participants \n", "The Secretary shall reduce the land use fee determined under paragraph (1) for an organizational camp if the \n", "organizational camp is attended by individuals with a disability or children at risk.\n", "\n" ], "text/plain": [ "\u001b[1;36m1\u001b[0m. Short title; table of contents \n", "\u001b[1m(\u001b[0ma\u001b[1m)\u001b[0m Short title \n", "This Act may be cited as the National Forest Organizational Camp Fee Improvement Act of \u001b[1;36m2003\u001b[0m. \u001b[1m(\u001b[0mb\u001b[1m)\u001b[0m Table of contents\n", "The table of contents for this Act is as follows: Sec. 
\u001b[1;36m1\u001b[0m. Short title; table of contents Sec. \u001b[1;36m2\u001b[0m. Findings, purpose,\n", "and definitions Sec. \u001b[1;36m3\u001b[0m. Fees for occupancy and use of National Forest System lands and facilities by organizational\n", "camps Sec. \u001b[1;36m4\u001b[0m. Implementation Sec. \u001b[1;36m5\u001b[0m. Relationship to other laws Sec. \u001b[1;36m6\u001b[0m. Deposit and expenditure of use fees Sec. \u001b[1;36m7\u001b[0m.\n", "Ministerial issuance or amendment authorization \u001b[1;36m2\u001b[0m. Findings, purpose, and definitions \n", "\u001b[1m(\u001b[0ma\u001b[1m)\u001b[0m Findings \n", "Congress finds the following: \u001b[1m(\u001b[0m\u001b[1;36m1\u001b[0m\u001b[1m)\u001b[0m Organizational camps, such as those administered by the Boy Scouts, Girl Scouts, \n", "and faith-based and community-based organizations, provide a valuable service to young people, individuals with a \n", "disability, and their families by promoting physical, mental, and spiritual health through activities conducted in \n", "a natural environment. \u001b[1m(\u001b[0m\u001b[1;36m2\u001b[0m\u001b[1m)\u001b[0m The \u001b[1;36m192\u001b[0m,\u001b[1;36m000\u001b[0m,\u001b[1;36m0000\u001b[0m acres of national forests and grasslands of the National Forest System \n", "managed for multiple uses by the Forest Service provides an ideal setting for such organizational camps. \u001b[1m(\u001b[0m\u001b[1;36m3\u001b[0m\u001b[1m)\u001b[0m The \n", "Federal Government should charge land use fees for the occupancy and use of National Forest System lands by such \n", "organizational camps that, while based on the fair market value of the land in use, also recognize the benefits \n", "provided to society by such organizational camps, do not preclude the ability of such organizational camps from \n", "utilizing these lands, and permit capital investment in, and maintenance of, camp facilities by such organizational\n", "camps or their sponsoring organizations. 
\u001b[1m(\u001b[0m\u001b[1;36m4\u001b[0m\u001b[1m)\u001b[0m Organizational camps should— \u001b[1m(\u001b[0mA\u001b[1m)\u001b[0m ensure that their facilities meet \n", "applicable building and safety codes, including fire and health codes; \u001b[1m(\u001b[0mB\u001b[1m)\u001b[0m have annual inspections as required by \n", "local law, including at a minimum inspections for fire and food safety; and \u001b[1m(\u001b[0mC\u001b[1m)\u001b[0m have in place safety plans that \n", "address fire and medical emergencies and encounters with wildlife. \u001b[1m(\u001b[0mb\u001b[1m)\u001b[0m Purpose \n", "It is the purpose of this Act to establish a land use fee system that provides for an equitable return to the \n", "Federal Government for the occupancy and use of National Forest System lands by organizational camps that serve \n", "young people or individuals with a disability. \u001b[1m(\u001b[0mc\u001b[1m)\u001b[0m Definitions \n", "In this Act: \u001b[1m(\u001b[0m\u001b[1;36m1\u001b[0m\u001b[1m)\u001b[0m The term organizational camp means a public or semi-public camp that— \u001b[1m(\u001b[0mA\u001b[1m)\u001b[0m is developed on National\n", "Forest System lands by a nonprofit organization or governmental entity; \u001b[1m(\u001b[0mB\u001b[1m)\u001b[0m provides a valuable service to the \n", "public by using such lands as a setting to introduce young people or individuals with a disability to activities \n", "that they may not otherwise experience and to educate them on natural resource issues; and \u001b[1m(\u001b[0mC\u001b[1m)\u001b[0m does not have as its\n", "primary purpose raising revenue through commercial activities. \u001b[1m(\u001b[0m\u001b[1;36m2\u001b[0m\u001b[1m)\u001b[0m The term Secretary means the Secretary of \n", "Agriculture, acting through the Chief of the Forest Service. 
\u001b[1m(\u001b[0m\u001b[1;36m3\u001b[0m\u001b[1m)\u001b[0m The term individual with a disability has the \n", "meaning given the term in section \u001b[1;36m7\u001b[0m of the Rehabilitation Act of \u001b[1;36m1973\u001b[0m \u001b[1m(\u001b[0m\u001b[1;36m29\u001b[0m U.S.C. \u001b[1;36m705\u001b[0m\u001b[1m)\u001b[0m. \u001b[1m(\u001b[0m\u001b[1;36m4\u001b[0m\u001b[1m)\u001b[0m The term children at \n", "risk means children who are raised in poverty or in single-parent homes or are subject to such circumstances as \n", "parental drug abuse, homelessness, or child abuse. \u001b[1m(\u001b[0m\u001b[1;36m5\u001b[0m\u001b[1m)\u001b[0m The term change in control means— \u001b[1m(\u001b[0mA\u001b[1m)\u001b[0m in the case of a \n", "corporation, the sale or transfer of a controlling interest in the corporation; \u001b[1m(\u001b[0mB\u001b[1m)\u001b[0m in the case of a partnership or\n", "limited liability company, the sale or transfer of a controlling interest in the partnership or limited liability \n", "company; and \u001b[1m(\u001b[0mC\u001b[1m)\u001b[0m in the case of an individual, the sale or transfer of an organizational camp to another party. \u001b[1;36m3\u001b[0m. \n", "Fees for occupancy and use of National Forest System lands and facilities by organizational camps \n", "\u001b[1m(\u001b[0ma\u001b[1m)\u001b[0m Land use fee \n", "\u001b[1m(\u001b[0m\u001b[1;36m1\u001b[0m\u001b[1m)\u001b[0m Percentage of land value \n", "The Secretary shall charge an annual land use fee for each organizational camp for its occupancy and use of \n", "National Forest System lands equal to five percent of the product of the following: \u001b[1m(\u001b[0mA\u001b[1m)\u001b[0m The total number of acres \n", "of National Forest System lands authorized for the organizational camp. 
\u001b[1m(\u001b[0mB\u001b[1m)\u001b[0m The estimated per-acre market value of \n", "land and buildings in the county where the camp is located, as reported in the most recent Census of Agriculture \n", "conducted by the National Agricultural Statistics Service. \u001b[1m(\u001b[0m\u001b[1;36m2\u001b[0m\u001b[1m)\u001b[0m Annual adjustment \n", "The land use fee determined under paragraph \u001b[1m(\u001b[0m\u001b[1;36m1\u001b[0m\u001b[1m)\u001b[0m for an organizational camp shall be adjusted annually by the annual\n", "compounded rate of change between the two most recent Censuses of Agriculture. \u001b[1m(\u001b[0m\u001b[1;36m3\u001b[0m\u001b[1m)\u001b[0m Reduction in fees \n", "\u001b[1m(\u001b[0mA\u001b[1m)\u001b[0m Based on type of participants \n", "The Secretary shall reduce the land use fee determined under paragraph \u001b[1m(\u001b[0m\u001b[1;36m1\u001b[0m\u001b[1m)\u001b[0m for an organizational camp if the \n", "organizational camp is attended by individuals with a disability or children at risk.\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "rich_print(splits[0].text)" ] }, { "cell_type": "code", "execution_count": 12, "id": "4779af81", "metadata": {}, "outputs": [], "source": [ "splitter = SentenceSplitter(chunk_size=128, chunk_overlap=0)" ] }, { "cell_type": "code", "execution_count": 13, "id": "6d026858", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
1. Short title; table of contents \n", "(a) Short title \n", "This Act may be cited as the National Forest Organizational Camp Fee Improvement Act of 2003. (b) Table of contents\n", "The table of contents for this Act is as follows: Sec. 1. Short title; table of contents Sec. 2. Findings, purpose,\n", "and definitions Sec. 3. Fees for occupancy and use of National Forest System lands and facilities by organizational\n", "camps Sec. 4. Implementation Sec. 5. Relationship to other laws Sec. 6.\n", "\n" ], "text/plain": [ "\u001b[1;36m1\u001b[0m. Short title; table of contents \n", "\u001b[1m(\u001b[0ma\u001b[1m)\u001b[0m Short title \n", "This Act may be cited as the National Forest Organizational Camp Fee Improvement Act of \u001b[1;36m2003\u001b[0m. \u001b[1m(\u001b[0mb\u001b[1m)\u001b[0m Table of contents\n", "The table of contents for this Act is as follows: Sec. \u001b[1;36m1\u001b[0m. Short title; table of contents Sec. \u001b[1;36m2\u001b[0m. Findings, purpose,\n", "and definitions Sec. \u001b[1;36m3\u001b[0m. Fees for occupancy and use of National Forest System lands and facilities by organizational\n", "camps Sec. \u001b[1;36m4\u001b[0m. Implementation Sec. \u001b[1;36m5\u001b[0m. Relationship to other laws Sec. 
\u001b[1;36m6\u001b[0m.\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "splits = splitter.get_nodes_from_documents([doc])\n", "rich_print(splits[0].text)" ] }, { "cell_type": "code", "execution_count": 14, "id": "763f90a9", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "20" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sample_size = 10\n", "documents = [Document.from_dict({\"text\": ds[i]['text']}) for i in range(sample_size)]\n", "splits = splitter.get_nodes_from_documents(documents)\n", "len(splits)\n", "# uncomment to see the output\n", "# for split in splits:\n", "#     rich_print(split.text)" ] }, { "cell_type": "markdown", "id": "1ed1a8cd", "metadata": {}, "source": [ "Because the texts in this dataset cover many topics densely, a smaller chunk size such as 128 seems to make sense: it helps ensure that each chunk captures a specific topic. If you are using a different dataset, you may want to experiment with different chunk sizes to see what works best for your data." ] }, { "cell_type": "markdown", "id": "41fb3f85", "metadata": {}, "source": [ "## Process our full dataset\n", "\n", "Now that we've decided on a chunk size, let's process our full dataset. We'll split each text into chunks and save these to a new dataset." 
] }, { "cell_type": "code", "execution_count": 15, "id": "ada9122f-a505-4bc0-80f6-228be2067891", "metadata": { "tags": [] }, "outputs": [], "source": [ "def split_texts(\n", "    examples: Dict[str, Any],\n", "    text_column_name: str = \"text\",\n", "    id_column_name: Optional[str] = None,\n", "    splitter: Optional[SentenceSplitter] = None,\n", "):\n", "    if splitter is None:\n", "        # if not provided, use the default splitter\n", "        splitter = SentenceSplitter()\n", "    texts = examples[text_column_name]\n", "    if id_column_name is None:\n", "        # Generate random ids if not provided\n", "        ids = [str(uuid.uuid4()) for _ in range(len(texts))]\n", "    else:\n", "        ids = examples[id_column_name]\n", "    sections = []\n", "    ids_ = []\n", "    for text, id_ in zip(texts, ids):\n", "        # Create a document for each text\n", "        document = Document(text=text)\n", "        # Split the document into nodes\n", "        nodes = splitter.get_nodes_from_documents([document])\n", "        # Extract the text from each node\n", "        sentences = [n.text for n in nodes]\n", "        # Extend the sections list with these sentences\n", "        sections.extend(sentences)\n", "        # Extend the ids_ list with the corresponding id, repeated for each sentence\n", "        ids_.extend([id_] * len(sentences))\n", "    return {\"section\": sections, \"id\": ids_}" ] }, { "cell_type": "markdown", "id": "d7221f82", "metadata": {}, "source": [ "We can now split the full dataset.\n", "\n", "If you are using a different dataset, remember to adjust `text_column_name` if the column containing the text has a different name. If your dataset has an `id` column, you can pass its name as `id_column_name`; otherwise leave this as `None` and the function will generate a random id for each row." 
] }, { "cell_type": "code", "execution_count": 16, "id": "336c0f08", "metadata": {}, "outputs": [], "source": [ "splitter = SentenceSplitter(chunk_size=128, chunk_overlap=0)" ] }, { "cell_type": "code", "execution_count": 17, "id": "3bb3807f-37ac-4e52-b37a-dc7ebc4a3446", "metadata": { "tags": [] }, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "9aad33479c3348619a33327e168b5f45", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Map (num_proc=8): 0%| | 0/125246 [00:00, ? examples/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "Dataset({\n", " features: ['id', 'congress', 'bill_type', 'bill_number', 'bill_version', 'sections', 'sections_length', 'text', 'text_length', 'summary', 'summary_length', 'title'],\n", " num_rows: 125246\n", "})" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "chunked_ds = ds.map(\n", " split_texts,\n", " batched=True,\n", " num_proc=NUM_PROC,\n", " remove_columns=list(ds.column_names),\n", " fn_kwargs={\"text_column_name\": \"text\", \"id_column_name\": \"id\", \"splitter\": splitter},\n", ")\n", "ds" ] }, { "cell_type": "code", "execution_count": 18, "id": "e1319f17", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Dataset({\n", " features: ['id', 'section'],\n", " num_rows: 3446013\n", "})" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "chunked_ds" ] }, { "cell_type": "code", "execution_count": 19, "id": "1c8c01ec", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['(4) Coal \\nThe term coal means bituminous\\t\\t\\t\\tcoal, subbituminous coal, and lignite. (d) Aggregate\\t\\t\\t\\tcredits \\n(1) In\\t\\t\\t\\tgeneral \\nNo credit shall be allowed under this section with\\t\\t\\t\\trespect to any qualifying clean coal project unless such project is certified\\t\\t\\t\\tby the Secretary under subsection (e).',\n", " '1. 
Short\\t\\t\\t title; table of contents \\n(a) Short\\t\\t\\t title \\nThis Act may be cited\\t\\t\\t as the Skilled Worker Immigration and\\t\\t\\t Fairness Act. (b) Table of\\t\\t\\t contents \\nThe table of contents for this Act is as follows: Sec. 1. Short title; table of\\t\\t\\t\\tcontents. Sec. 2. H–1B visas. Sec. 3. Employment-based immigration. Sec. 4. H–1B visa fraud and abuse protections. 2.',\n", " 'Remote control locomotive use \\n(a) Prohibition \\nNo railroad carrier shall operate or cause to be operated on the general system of railroad transportation a remote control locomotive to carry hazardous materials. (b) Penalty \\n(1) A railroad carrier that knowingly violates this section or a rule issued under this section is liable to the United States Government for a civil penalty of at least $5,000 but not more than $50,000 for each violation.']" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sample_idx = random.sample(range(len(chunked_ds)),k=3)\n", "chunked_ds.select(sample_idx)[:]['section']" ] }, { "cell_type": "markdown", "id": "f9f757c5", "metadata": {}, "source": [ "## Pushing the data to the hub\n", "\n", "We can save the data locally to use in the next notebook but it's often easier to work with the data if we push it to the hub. This way we can easily access the data in the next notebook." 
] }, { "cell_type": "code", "execution_count": 21, "id": "93cb06cc", "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "cbf35fcd72694d0e9e842daaccfd7a46", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Uploading the dataset shards: 0%| | 0/4 [00:00, ?it/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "a06440fe1467488dadfb8a8e3cbda147", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Creating parquet from Arrow format: 0%| | 0/862 [00:00, ?ba/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "33aaab93d1d74744a7a5bcdc14db7a18", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Creating parquet from Arrow format: 0%| | 0/862 [00:00, ?ba/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "d2b8ac857f83403aa7f5be36e9a5b95d", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Creating parquet from Arrow format: 0%| | 0/862 [00:00, ?ba/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "dfda42582a124c928a06726e6aebf5fd", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Creating parquet from Arrow format: 0%| | 0/862 [00:00, ?ba/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "CommitInfo(commit_url='https://huggingface.co/datasets/davanstrien/bill_summary_us_chunks/commit/e9c23f8e002cda39422c1a39bc95c8e5cd37213b', commit_message='Upload dataset', commit_description='', oid='e9c23f8e002cda39422c1a39bc95c8e5cd37213b', pr_url=None, pr_revision=None, pr_num=None)" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "chunked_ds.push_to_hub(\"davanstrien/bill_summary_us_chunks\")" ] }, { 
"cell_type": "markdown", "id": "c4f070a4", "metadata": {}, "source": [ "## Next steps\n", "\n", "In the next notebook, we'll look at how we can use an LLM to generate synthetic data for fine-tuning our custom Sentence Transformers model. If you are running this notebook in the Synthetic Dataset Workshop Space you can find the next notebook in the workspace. If you are running this notebook locally you can find the next notebook in the Hugging Face repository. " ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.5" } }, "nbformat": 4, "nbformat_minor": 5 }