GitVac
Don't forget to vacuum your git repo.

What's GitVac?
GitVac is like a vacuum cleaner for code fixes. It's a series of 3B, 8B, 14B, and 32B models. The goal? Have AI extract correct roleplay scenarios that can be further used to fine-tune a powerful on-prem coding agent.
How were the models made?
I distilled samples from r1 through multiple rounds of trial and error. About 2.4k questions were fired off, with 1.1k making the verification cut. My rough estimate puts r1 at around 15% success, even after multiple tries.
Training Data
The data used to train the models came from r1 outputs via distillation. However, to gauge the model's accuracy, o3 and r1 were used to run evals with the same prompts.
How is verification done?
A lot of models are already trained on function calling syntax. In my earlier attempts at prompting, I found that presenting patches and having models assemble functions together was not the most effective approach. After testing various methods, I settled on using a pre-created function calling structure that the model could simply copy and adapt. Additionally, this made it easy to verify whether the functions were present in the formatted answer.
For every patch, the AI needs to...
Drop a read call for each file it needs to check out:
<tool_call=read_file>{"name": "read_file", "parameters": {"path": "pandas/tests/io/formats/test_to_markdown.py"}}</tool_call>
Then work some string-replacing magic (sometimes one tweak, sometimes a whole bunch):
<tool_call=str_replace_editor>{"name": "str_replace_editor", "parameters": {"file_path": "modin/pandas/test/test_series.py", "old_string": " df_equals(modin_result, pandas_result)\n\n\n# This tests the bug from https://github.com/modin-project/modin/issues/3736\ndef test_loc_setting_categorical_series():\n modin_series = pd.Series([\"a\", \"b\", \"c\"], dtype=\"category\")\n", "new_string": " df_equals(modin_result, pandas_result)\n\n\ndef test_loc_with_boolean_series():\n modin_series, pandas_series = create_test_series([1, 2, 3])\n modin_mask, pandas_mask = create_test_series([True, False, False])\n modin_result = modin_series.loc[modin_mask]\n pandas_result = pandas_series.loc[pandas_mask]\n df_equals(modin_result, pandas_result)\n\n\n# This tests the bug from https://github.com/modin-project/modin/issues/3736\ndef test_loc_setting_categorical_series():\n modin_series = pd.Series([\"a\", \"b\", \"c\"], dtype=\"category\")\n"}}</tool_call>
The TL;DR: to finish the problem, the model must open every file in the patch and apply all of the diffs. To count as correct, the tool calls must be returned with one-to-one string accuracy, because verification searches for the exact string. If each read call is made exactly once and all the tool calls are present, that's a good dataset sample!
Additionally, please be aware that this dataset allows terminal access, but it is only used for deletions that come from GitHub patches and nothing else:
<tool_call=terminal_access>{"name": "terminal_access", "parameters": {"command": "...", "working_dir": "..."}}</tool_call>
As a final check we make sure the model calls:
<tool_call=attempt_completion>{"name": "attempt_completion", "parameters": {}}</tool_call>
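To make that check concrete, here is a minimal verifier sketch in Python. It is my own illustration rather than the actual GitVac code: it assumes you already hold the exact tool-call strings extracted from the patch, and it simply confirms that every one of them appears verbatim, that each read call shows up exactly once, and that the answer finishes with attempt_completion.

import re
from collections import Counter

def verify_sample(model_output: str, expected_calls: list[str]) -> bool:
    # Every pre-built tool call (reads, string replacements, deletes) must appear verbatim.
    for call in expected_calls:
        if call not in model_output:
            return False
    # Each read_file call should be issued exactly once.
    reads = re.findall(r"<tool_call=read_file>.*?</tool_call>", model_output)
    if any(count != 1 for count in Counter(reads).values()):
        return False
    # The sample must end by calling attempt_completion.
    return "<tool_call=attempt_completion>" in model_output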
Final Output
The final output is a list of XML tags that represent a reason and an action.
<reason>
This is where all the analysis and reasoning lives
</reason>
<action>
This is where the tool call magic happens
</action>
The model outputs a reason for each action call in the patch.
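For post-processing, the reason/action pairs are easy to pull back out of the raw text. Here is a small hypothetical helper, assuming the exact tag layout shown above:

import re

REASON_ACTION = re.compile(
    r"<reason>\s*(.*?)\s*</reason>\s*<action>\s*(.*?)\s*</action>",
    re.DOTALL,
)

def extract_steps(output: str) -> list[dict]:
    # Returns the ordered list of {reason, action} steps from one sample.
    return [
        {"reason": reason, "action": action}
        for reason, action in REASON_ACTION.findall(output)
    ]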
GitVac-7B-Turbo
<reason>
I will begin by reading the test file to understand the expected behavior and then proceed to inspect the main implementation file. After confirming the current implementation, I will apply the required modifications to update the import statements and add the missing functions. Finally, I will update the test file to include new tests for the non-zero and where functions.
</reason>
<action>
<tool_call=read_file>{"name": "read_file", "parameters": {"path": "dask/array/tests/test_masked.py"}}</tool_call>
</action>
<reason>
Now I will inspect the main implementation file to determine the current state of the import section.
</reason>
<action>
<tool_call=read_file>{"name": "read_file", "parameters": {"path": "dask/array/ma.py"}}</tool_call>
</action>
<reason>
Before making modifications, I will list the directory structure to ensure that all necessary files are present.
</reason>
<action>
<tool_call=list_directory>{"name": "list_directory", "parameters": {"path": "dask/array", "recursive": "true"}}</tool_call>
</action>
<reason>
I will now perform a ripgrep search for "np.ma.core" to locate any references to the numpy masked array functions in the repository.
</reason>
<action>
<tool_call=ripgrep_search>{"name": "ripgrep_search", "parameters": {"pattern": "np.ma.core", "path": "dask/array", "file_pattern": "*.py"}}</tool_call>
</action>
<reason>
Based on the input, I will now update the import section in the main implementation file to include the missing "elemwise" and "nonzero" imports.
</reason>
<action>
<tool_call=str_replace_editor>{"name": "str_replace_editor", "parameters": {"file_path": "dask/array/ma.py", "old_string": "import numpy as np\n\nfrom dask.array import chunk\nfrom dask.array.core import asanyarray, blockwise, map_blocks\nfrom dask.array.reductions import reduction\nfrom dask.array.routines import _average\nfrom dask.base import normalize_token\nfrom dask.utils import derived_from\n\n", "new_string": "import numpy as np\n\nfrom dask.array import chunk\nfrom dask.array.core import asanyarray, blockwise, elemwise, map_blocks\nfrom dask.array.reductions import reduction\nfrom dask.array.routines import _average\nfrom dask.array.routines import nonzero as _nonzero\nfrom dask.base import normalize_token\nfrom dask.utils import derived_from\n\n"}}</tool_call>
</action>
<reason>
Next, I will update the import section in the same file to include "nonzero" from dask.array.routines.
</reason>
<action>
<tool_call=str_replace_editor>{"name": "str_replace_editor", "parameters": {"file_path": "dask/array/ma.py", "old_string": "import numpy as np\n\nfrom dask.array import chunk\nfrom dask.array.core import asanyarray, blockwise, map_blocks\nfrom dask.array.reductions import reduction\nfrom dask.array.routines import _average\nfrom dask.base import normalize_token\nfrom dask.utils import derived_from\n\n", "new_string": "import numpy as np\n\nfrom dask.array import chunk\nfrom dask.array.core import asanyarray, blockwise, elemwise, map_blocks\nfrom dask.array.reductions import reduction\nfrom dask.array.routines import _average\nfrom dask.array.routines import nonzero as _nonzero\nfrom dask.base import normalize_token\nfrom dask.utils import derived_from\n\n"}}</tool_call>
</action>
<reason>
I will now update the "empty_like" function to include the new non-zero function and then add the "nonzero" and "where" functions with the appropriate docstring and behavior.
</reason>
<action>
<tool_call=str_replace_editor>{"name": "str_replace_editor", "parameters": {"file_path": "dask/array/ma.py", "old_string": "def empty_like(a, **kwargs):\n a = asanyarray(a)\n return a.map_blocks(np.ma.core.empty_like, **kwargs)\n", "new_string": "def empty_like(a, **kwargs):\n a = asanyarray(a)\n return a.map_blocks(np.ma.core.empty_like, **kwargs)\n\n\n@derived_from(np.ma.core)\ndef nonzero(a):\n return _nonzero(getdata(a) * ~getmaskarray(a))\n\n\n@derived_from(np.ma.core)\ndef where(condition, x=None, y=None):\n if (x is None) != (y is None):\n raise ValueError(\"either both or neither of x and y should be given\")\n if (x is None) and (y is None):\n return nonzero(condition)\n else:\n return elemwise(np.ma.where, condition, x, y)\n"}}</tool_call>
</action>
<reason>
I will now update the import section in the test file to ensure that it imports the new non-zero function.
</reason>
<action>
<tool_call=str_replace_editor>{"name": "str_replace_editor", "parameters": {"file_path": "dask/array/ma.py", "old_string": "import numpy as np\n\nfrom dask.array import chunk\nfrom dask.array.core import asanyarray, blockwise, map_blocks\nfrom dask.array.reductions import reduction\nfrom dask.array.routines import _average\nfrom dask.base import normalize_token\nfrom dask.utils import derived_from\n\n", "new_string": "import numpy as np\n\nfrom dask.array import chunk\nfrom dask.array.core import asanyarray, blockwise, elemwise, map_blocks\nfrom dask.array.reductions import reduction\nfrom dask.array.routines import _average\nfrom dask.array.routines import nonzero as _nonzero\nfrom dask.base import normalize_token\nfrom dask.utils import derived_from\n\n"}}</tool_call>
</action>
<reason>
Next, I will update the import section in the test file to include the missing "elemwise" import.
</reason>
<action>
<tool_call=str_replace_editor>{"name": "str_replace_editor", "parameters": {"file_path": "dask/array/ma.py", "old_string": "import numpy as np\n\nfrom dask.array import chunk\nfrom dask.array.core import asanyarray, blockwise, map_blocks\nfrom dask.array.reductions import reduction\nfrom dask.array.routines import _average\nfrom dask.base import normalize_token\nfrom dask.utils import derived_from\n\n", "new_string": "import numpy as np\n\nfrom dask.array import chunk\nfrom dask.array.core import asanyarray, blockwise, elemwise, map_blocks\nfrom dask.array.reductions import reduction\nfrom dask.array.routines import _average\nfrom dask.array.routines import nonzero as _nonzero\nfrom dask.base import normalize_token\nfrom dask.utils import derived_from\n\n"}}</tool_call>
</action>
<reason>
I will now update the test file to add tests for the new non-zero and where functions.
</reason>
<action>
<tool_call=str_replace_editor>{"name": "str_replace_editor", "parameters": {"file_path": "dask/array/tests/test_masked.py", "old_string": " assert_eq(da.ma.getmaskarray(res), np.ma.getmaskarray(sol))\n else:\n assert_eq(res, sol)\n", "new_string": " assert_eq(da.ma.getmaskarray(res), np.ma.getmaskarray(sol))\n else:\n assert_eq(res, sol)\n\n\ndef test_nonzero():\n data = np.arange(9).reshape((3, 3))\n mask = np.array([[True, False, False], [True, True, False], [True, False, True]])\n a = np.ma.array(data, mask=mask)\n d_a = da.ma.masked_array(data=data, mask=mask, chunks=2)\n\n for c1, c2 in [\n (a > 4, d_a > 4),\n (a, d_a),\n (a <= -2, d_a <= -2),\n (a == 0, d_a == 0),\n ]:\n sol = np.ma.nonzero(c1)\n res = da.ma.nonzero(c2)\n\n assert isinstance(res, type(sol))\n assert len(res) == len(sol)\n\n for i in range(len(sol)):\n assert_eq(res[i], sol[i])\n\n\ndef test_where():\n # Copied and adapted from the da.where test.\n x = np.random.randint(10, size=(15, 14))\n mask = np.random.choice(a=[False, True], size=(15, 14), p=[0.5, 0.5])\n x[5, 5] = x[4, 4] = 0 # Ensure some false elements\n d = da.ma.masked_array(x, mask=mask, chunks=(4, 5))\n x = np.ma.array(x, mask=mask)\n y = np.random.randint(10, size=15).astype(np.uint8)\n e = da.from_array(y, chunks=(4,))\n\n # Nonzero test\n sol = np.ma.where(x)\n res = da.ma.where(d)\n for i in range(len(sol)):\n assert_eq(res[i], sol[i])\n\n for c1, c2 in [\n (d > 5, x > 5),\n (d, x),\n (1, 1),\n (5, 5),\n (True, True),\n (np.True_, np.True_),\n (0, 0),\n (False, False),\n (np.False_, np.False_),\n ]:\n for b1, b2 in [(0, 0), (-e[:, None], -y[:, None]), (e[:14], y[:14])]:\n w1 = da.ma.where(c1, d, b1)\n w2 = np.ma.where(c2, x, b2)\n assert_eq(w1, w2)\n"}}</tool_call>
</action>
<reason>
I have now applied all the necessary modifications to both the implementation and test files. This update ensures that the non-zero and where functions preserve the mask of masked arrays, as required.
</reason>
<action>
<tool_call=attempt_completion>{"name": "attempt_completion", "parameters": {}}</tool_call>
</action>
GitVac-R-14B
<think>
Okay, let's tackle this problem. The user is having an issue where using the ClusterArn as a filter in the describe_db_clusters method with Boto3 is returning a "cluster not found" error even though the cluster exists. The problem analysis shows that when passing the DBClusterIdentifier with an ARN, it's failing. The expected behavior is that using the ARN should work as a filter, just like Boto3 allows passing either ClusterArn or DBClusterIdentifier.
First, I need to understand the problem thoroughly. The user's test case creates a cluster with a specific DBClusterIdentifier, then tries to describe it using both the identifier and the ARN. The test fails when using the ARN. The hints mention that this is a bug in Moto, the mocking library for AWS services. The dataset examples show code changes in moto/rds/models.py and tests/test_rds/test_rds_clusters.py. The changes involve modifying the describe_db_clusters method to split the cluster_identifier by ':' and take the last part, effectively extracting the cluster ID from the ARN. Also, the test cases are updated to capture the ARN when creating a cluster and then using it to describe the clusters.
Now, the dataset examples show a series of tool calls: reading files, searching for patterns, and replacing strings. The actions involve reading the models.py and test files, searching for specific patterns like "describe_db_clusters" and "DBClusterArn", and replacing code blocks to handle ARNs correctly. The output structure is a sequence of reasoning and actions, ending with an attempt_completion call.
For the dataset structure, each example has a reason and an action. The actions involve using read_file, ripgrep_search, list_directory, and str_replace_editor. The input generation needs to cover various scenarios where DBClusterIdentifier is passed as an ARN or a plain cluster ID, ensuring that the filtering works correctly. The output derivation must check if the describe_db_clusters call returns the correct clusters when using the ARN or the ID.
In the dataset examples provided, the actions are split into multiple steps: reading the files, searching for patterns, and modifying the code. Each step is a separate action with a reason. The output derivation involves checking the length of DBClusters and ensuring that the correct clusters are returned when filtering by ARN or ID.
The key components here are the file modifications in moto's models and the test cases. The challenge is ensuring that the filtering logic correctly handles ARNs by extracting the cluster ID. Potential pitfalls include incorrect string splitting (e.g., if the ARN has more than one colon), or not properly updating the test cases to capture the ARN and use it correctly.
The best scenario to generate would involve creating a cluster with a specific ID, obtaining its ARN, and then querying describe_db_clusters with both the ID and the ARN to verify that both return the cluster. Edge cases might include clusters with special characters in their IDs, very long ARNs, or clusters created in different regions.
In the output derivation, the main checks are whether the cluster is found when using the ARN and when using the plain ID. The dataset examples show that after modifying the code, the describe_db_clusters method correctly splits the ARN to get the cluster ID, allowing the lookup to succeed.
So, to generate the dataset, I need to create test cases that cover different ARN formats, ensure that the filtering logic in the code correctly processes them, and validate that the test expectations are met. The input generation should include valid ARNs with varying lengths and regions, and the output derivation must confirm that the clusters are found when queried with either the ARN or the plain ID.
</think>
<reason>
I will first inspect the current implementation by reading the moto/rds/models.py file to understand how describe_db_clusters is handling the cluster_identifier.
</reason>
<action>
<tool_call=read_file>{"name": "read_file", "parameters": {"path": "moto/rds/models.py"}}</tool_call>
</action>
<reason>
Next, I will inspect the test file tests/test_rds/test_rds_clusters.py to see how the tests are structured and what expectations are being set for describe_db_clusters.
</reason>
<action>
<tool_call=read_file>{"name": "read_file", "parameters": {"path": "tests/test_rds/test_rds_clusters.py"}}</tool_call>
</action>
<reason>
To further verify the current implementation and test expectations, I will search for any references to "describe_db_clusters" in the moto/rds/models.py file.
</reason>
<action>
<tool_call=ripgrep_search>{"name": "ripgrep_search", "parameters": {"pattern": "describe_db_clusters", "path": "moto/rds/models.py", "file_pattern": "*.py"}}</tool_call>
</action>
<reason>
I will also search for "DBClusterArn" in the test file to see how the tests are handling the ARN values.
</reason>
<action>
<tool_call=ripgrep_search>{"name": "ripgrep_search", "parameters": {"pattern": "DBClusterArn", "path": "tests/test_rds/test_rds_clusters.py", "file_pattern": "*.py"}}</tool_call>
</action>
<reason>
Now that I have a good understanding of the current implementation and tests, I will list the directory structure of the moto project to ensure I am working with the correct files.
</reason>
<action>
<tool_call=list_directory>{"name": "list_directory", "parameters": {"path": ".", "recursive": "true"}}</tool_call>
</action>
<reason>
Based on the investigation, I will now update the describe_db_clusters implementation in moto/rds/models.py to properly handle ARN values by extracting the cluster identifier from the ARN.
</reason>
<action>
<tool_call=str_replace_editor>{"name": "str_replace_editor", "parameters": {"file_path": "moto/rds/models.py", "old_string": "\n def describe_db_clusters(self, cluster_identifier):\n if cluster_identifier:\n if cluster_identifier in self.clusters:\n return [self.clusters[cluster_identifier]]\n if cluster_identifier in self.neptune.clusters:\n", "new_string": "\n def describe_db_clusters(self, cluster_identifier):\n if cluster_identifier:\n # ARN to identifier\n # arn:aws:rds:eu-north-1:123456789012:cluster:cluster --> cluster-id\n cluster_identifier = cluster_identifier.split(\":\")[-1]\n if cluster_identifier in self.clusters:\n return [self.clusters[cluster_identifier]]\n if cluster_identifier in self.neptune.clusters:\n"}}</tool_call>
</action>
<reason>
I will now update the test file tests/test_rds/test_rds_clusters.py to capture the ARN value returned when creating a cluster and use it to verify the describe_db_clusters functionality.
</reason>
<action>
<tool_call=str_replace_editor>{"name": "str_replace_editor", "parameters": {"file_path": "tests/test_rds/test_rds_clusters.py", "old_string": " MasterUserPassword=\"hunter2_\",\n )\n\n client.create_db_cluster(\n DBClusterIdentifier=\"cluster-id2\",\n Engine=\"aurora\",\n MasterUsername=\"root\",\n MasterUserPassword=\"hunter2_\",\n )\n\n client.describe_db_clusters()[\"DBClusters\"].should.have.length_of(2)\n\n", "new_string": " MasterUserPassword=\"hunter2_\",\n )\n\n cluster_arn = client.create_db_cluster(\n DBClusterIdentifier=\"cluster-id2\",\n Engine=\"aurora\",\n MasterUsername=\"root\",\n MasterUserPassword=\"hunter2_\",\n )[\"DBCluster\"][\"DBClusterArn\"]\n\n client.describe_db_clusters()[\"DBClusters\"].should.have.length_of(2)\n\n"}}</tool_call>
</action>
<reason>
I will update the test file tests/test_rds/test_rds_clusters.py to verify that describe_db_clusters works when filtering by ARN.
</reason>
<action>
<tool_call=str_replace_editor>{"name": "str_replace_editor", "parameters": {"file_path": "tests/test_rds/test_rds_clusters.py", "old_string": " \"DBClusters\"\n ].should.have.length_of(1)\n\n\n@mock_rds\ndef test_delete_db_cluster():\n", "new_string": " \"DBClusters\"\n ].should.have.length_of(1)\n\n client.describe_db_clusters(DBClusterIdentifier=cluster_arn)[\n \"DBClusters\"\n ].should.have.length_of(1)\n\n\n@mock_rds\ndef test_delete_db_cluster():\n"}}</tool_call>
</action>
<reason>
I have now applied the required modifications to both the implementation and test files. I will now complete the process.
</reason>
<action>
<tool_call=attempt_completion>{"name": "attempt_completion", "parameters": {}}</tool_call>
</action>
Cursor Format and Your Format
My first good experience with agents came from using Cursor. Hence, I call this the Cursor Format - you simply tell the agent exactly which files to read instead of making it hunt around. I love this format because it matches how I actually code - about 95% of the time, I already know which files need changing. No need for the agent to play detective... This approach is just faster and more focused. When you're building your own agents, think about how you actually work and shape your training data to match that flow. The format you choose makes a huge difference in how your agent learns to help you.
Roleplay Functions
Additionally, during the last stage of dataset creation, the entire format gets rewritten and extra roleplay functions are inserted:
<tool_call=ripgrep_search>{"name": "ripgrep_search", "parameters": {"pattern": "your_pattern", "path": "your_path", "file_pattern": "your_file_pattern"}}</tool_call>
<tool_call=list_directory>{"name": "list_directory", "parameters": {"path": "your_path", "recursive": "true_or_false"}}</tool_call>
The model already knows the solution but must follow a realistic problem-solving process rather than jumping straight to the answer. This creates more natural, educational datasets. When faced with a task like "Add a unit test for feature X," the model doesn't immediately open the correct file. Instead, it first explores the codebase as a human developer would - perhaps listing directories to locate the test folder structure before opening files in the tests/ directory. Similarly, when handling issues that reference specific code, the model simulates a search process. It "discovers" the relevant code through a search query, finds a match, and only then reads the file - mimicking how a real developer would track down code mentioned in an issue. This approach creates training data that teaches future models not just what to fix, but how to systematically approach problem-solving in a codebase.
Example Roleplay:
<tool_call=read_file>{"name": "read_file", "parameters": {"path": "tests/unit/remote/ssh/test_ssh.py"}}</tool_call>
...
<reason>
Before proceeding further, I will list the directory contents of the dvc/remote/ssh folder to verify its structure.
</reason>
<action>
<tool_call=list_directory>{"name": "list_directory", "parameters": {"path": "dvc/remote/ssh", "recursive": "true"}}</tool_call>
</action>
<reason>
I also want to search through the repository for "ssh_config" to see references that might need updates.
</reason>
<action>
<tool_call=ripgrep_search>{"name": "ripgrep_search", "parameters": {"pattern": "ssh_config", "path": ".", "file_pattern": "*"}}</tool_call>
</action>
<reason>
Let's insert the import for the os module in the SSH initialization file. This change is required for accessing the SSH config file.
</reason>
<action>
<tool_call=str_replace_editor>{"name": "str_replace_editor", "parameters": {"file_path": "dvc/remote/ssh/__init__.py", "old_string": "from __future__ import unicode_literals\n\nimport getpass\nimport logging\n\n", "new_string": "from __future__ import unicode_literals\n\nimport os\nimport getpass\nimport logging\n\n"}}</tool_call>
</action>
This slight variation in the data introduces a new skill: searching and listing.
About the Models
This work draws inspiration from numerous research papers, but is particularly influenced by the SWE-Gym project and its comprehensive dataset.

All Models
The GitVac series offers specialized models:
Model Name | Parameters | Link |
---|---|---|
Llama-GitVac-Turbo-3B | 3 Billion | 🤗 Hugging Face |
Llama-GitVac-Turbo-8B | 8 Billion | 🤗 Hugging Face |
GitVac-R-14B | 14 Billion | 🤗 Hugging Face |
GitVac-R-32B | 32 Billion | 🤗 Hugging Face |
Turbo Models
Turbo models are all about being fast and budget-friendly. These produce datasets without the thinking tags.
Model | Base | Context Size | Best For |
---|---|---|---|
Llama-GitVac-Turbo-3B | Llama 3.1 | 128k | CPU usage, edge devices |
Llama-GitVac-Turbo-8B | Llama 3.1 | 128k | Better quality, still fast |
When to Use Turbo Models
You can use these to quickly extract ordered GitHub patches with minimal reasoning. These can be useful to build other types of custom pipelines.
Reasoning Models
The reasoning models were tuned to do a little bit of thinking and planning around dataset creation.
Model | Base | Context Size | Max Output | Special Features |
---|---|---|---|---|
GitVac-R-14B | Qwen 2.5 | 128k | 32,000 tokens | Detailed reasoning |
GitVac-R-32B | Qwen 2.5 | 128k | 32,000 tokens | Most comprehensive analysis |
When to Use Reasoning Models
These models create pre-made actions that are higher quality than the turbo models. The original dataset was created without thinking tags. To distill the thinking process, I presented each of the 1.1k samples to Deepseek R1 and asked it to create an in-depth plan for how to write the solution. This helped capture the reasoning behind each fix.
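The exact distillation prompt isn't reproduced here, so treat the following as a hypothetical sketch of the idea: each verified sample is paired with its problem statement and sent back to R1 with a request for an in-depth plan, which later becomes the thinking content.

def build_thinking_prompt(problem_statement: str, solved_sample: str) -> str:
    # Hypothetical prompt asking R1 to reconstruct the plan behind a known fix.
    return (
        "You are reviewing a solved GitHub issue.\n\n"
        f"<problem_statement>\n{problem_statement}\n</problem_statement>\n\n"
        f"<solution>\n{solved_sample}\n</solution>\n\n"
        "Write an in-depth plan describing how a developer would arrive at this "
        "solution step by step."
    )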
Benchmarks
I started with 2,400 patches/issues.
- Only 1,100 problems could be solved by r1
- Each problem was attempted up to 3 times; earlier scripts allowed up to 10.
- The remaining 1,300 problems were ones that these top models failed to solve even after 9,000 total attempts
To evaluate the GitVac models, I randomly selected a sample from this unsolved set.
Performance Results
Models were tested against the remaining 1,300 problems that r1 could not pass. These samples were never seen during the models' training.
Model | Success Rate | Notes |
---|---|---|
O3 | 30% | used for eval |
R1 | 15% | used for eval/training |
GitVac-3B-Turbo | 42% | Fast but not accurate. Missing tags in outputs. |
Llama-GitVac-7B-Turbo | 90% | Sweet spot for speed/quality/cost. No think tags. |
GitVac-R-14B | 92% | Excellent reasoning capabilities for price/speed |
GitVac-R-32B | 98% | Accurate and tested up to 500 samples |
How to use
Start by gathering your patches and extracting all the necessary components - string replacements, file reads, terminal deletes, problem statements and any hints.
Combine the tool calls into a list and shuffle them randomly. This randomization turned out to be a crucial factor in improving dataset quality.
Initially, I presented the function calls in a fixed order (writes, reads, deletes). The models would blindly follow this pattern - making changes before reading files, which makes no logical sense. Simply instructing them to do otherwise in the prompt had no effect.
A breakthrough came when I randomized the function calls. This seemed to break the models out of their rigid patterns and activate more natural problem-solving behaviors. They started properly reading files before modifying them and demonstrated more realistic roleplay capabilities.
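Here is a minimal sketch of that preparation step. The helper name and the patch dict layout are my own illustration; adapt it to however you parse your patches (terminal deletes would be appended the same way).

import json
import random

def build_tool_calls(patch: dict) -> list[str]:
    # Turn one parsed GitHub patch into the pre-built tool-call strings.
    calls = []
    for path in patch["files_to_read"]:
        calls.append(
            "<tool_call=read_file>"
            + json.dumps({"name": "read_file", "parameters": {"path": path}})
            + "</tool_call>"
        )
    for edit in patch["edits"]:
        calls.append(
            "<tool_call=str_replace_editor>"
            + json.dumps({
                "name": "str_replace_editor",
                "parameters": {
                    "file_path": edit["file_path"],
                    "old_string": edit["old_string"],
                    "new_string": edit["new_string"],
                },
            })
            + "</tool_call>"
        )
    # The shuffle is what breaks the rigid writes/reads/deletes ordering.
    random.shuffle(calls)
    return calls

The joined result of this list is what gets dropped into the <functions> block of the prompt below.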
There is no system prompt - the full prompt below is sent as a single user message.
System Prompt Script
# Assumptions for this snippet: `client` is an OpenAI-compatible client, `instance` is one
# dataset row (with `problem_statement` and `hints_text`), and `final_calls` is the shuffled
# list of pre-built tool calls joined into a single string.
def create_roleplay_sample(client, model_name, instance, final_calls):
    prompt = f"""
Create a synthetic dataset based on the provided technical problem statement, following these instructions carefully:
The dataset must be accurate and realistic based on the problem statement. You may use other functions such as list_directory, ripgrep_search to add more context dataset.
Your goal is to use the provided tools to search and list directories, and then output a tool call that reflects your analysis and reasoning.
Problem Analysis:
<problem_statement>
"""+instance['problem_statement']+"""
</problem_statement>
Guidance Integration (if provided):
<hints>
"""+instance['hints_text']+"""
</hints>
Function Requirements:
Given to you in a random list for you to figure out the logical order.
<functions>
"""+final_calls+"""
</functions>
Available Code Interaction Tools (Use EXACTLY these formats):
<functions_to_roleplay_with>
To search for specific patterns in the codebase, use:
<tool_call=ripgrep_search>{"name": "ripgrep_search", "parameters": {"pattern": "your_pattern", "path": "your_path", "file_pattern": "your_file_pattern"}}</tool_call>
To list directories and understand the file structure, use:
<tool_call=list_directory>{"name": "list_directory", "parameters": {"path": "your_path", "recursive": "true_or_false"}}</tool_call>
</functions_to_roleplay_with>
Output Requirements:
First provide detailed technical reasoning
Then output ONE PRECISE tool call
At the end when the problem is done the last call should list the reason why and call <tool_call=attempt_completion>{"name": "attempt_completion", "parameters": {}}</tool_call>
Follow EXACT XML structure shown below:
<reason>
Analysis and reasoning for the action
</reason>
<action>
Tool call to be used
</action>
Important Notes:
Replace ALL placeholders with values relevant to problem statement
Parameters MUST reflect actual technical needs from the problem
Only output ONE action per response
Maintain strict JSON syntax within tool calls
You must use all functions provided in <functions_to_roleplay_with> as is within the answer.
Please create a roleplay dataset by using the provided tools to search and list directories, and then output a tool call that reflects your analysis and reasoning.
Think before you respond.
"""
    # Rough token estimate for logging (assumption: ~4 characters per token).
    estimated_tokens = len(prompt) // 4
    print(f"Total estimated tokens: {estimated_tokens}")
    response = client.chat.completions.create(
        model=model_name,  # Adjust model name if needed
        temperature=0.6,
        top_p=0.95,
        messages=[
            {"role": "user", "content": prompt}
        ],
        max_completion_tokens=32000,
        stream=True
    )
    # Collect the streamed response
    collected_messages = []
    for chunk in response:
        if chunk.choices[0].delta.content is not None:
            chunk_message = chunk.choices[0].delta.content
            # print(chunk_message, end="", flush=True)
            collected_messages.append(chunk_message)
    response_content = ''.join(collected_messages)
    return response_content
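A hypothetical driver loop over the dataset, reusing the helpers sketched earlier (`build_tool_calls`, `verify_sample`); the model id and the extra dict keys are placeholders, not part of the original pipeline:

from openai import OpenAI

client = OpenAI()  # any OpenAI-compatible endpoint serving the generator model
model_name = "r1"  # placeholder id for whatever serves r1
instances: list[dict] = []  # fill with parsed rows: problem_statement, hints_text, patch, expected_calls
accepted = []
for instance in instances:
    final_calls = "\n".join(build_tool_calls(instance["patch"]))
    output = create_roleplay_sample(client, model_name, instance, final_calls)
    if verify_sample(output, instance["expected_calls"]):
        accepted.append({"problem": instance["problem_statement"], "output": output})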
Example Input Prompt
Create a synthetic dataset based on the provided technical problem statement, following these instructions carefully:
The dataset must be accurate and realistic based on the problem statement. You may use other functions such as list_directory, ripgrep_search to add more context dataset.
Your goal is to use the provided tools to search and list directories, and then output a tool call that reflects your analysis and reasoning.
Problem Analysis:
<problem_statement>
alias `list` as `ls`?
I have been reaching for `dvc ls` out of habit instead of `dvc list`. Should we introduce an alias for `dvc list`?
</problem_statement>
Guidance Integration (if provided):
<hints>
</hints>
Function Requirements:
Given to you in a random list for you to figure out the logical order.
<functions>
<tool_call=read_file>{"name": "read_file", "parameters": {"path": "dvc/command/ls/__init__.py"}}</tool_call>
<tool_call=read_file>{"name": "read_file", "parameters": {"path": "tests/unit/command/ls/test_ls.py"}}</tool_call>
<tool_call=str_replace_editor>{"name": "str_replace_editor", "parameters": {"file_path": "dvc/command/ls/__init__.py", "old_string": " )\n list_parser = subparsers.add_parser(\n \"list\",\n parents=[parent_parser],\n description=append_doc_link(LIST_HELP, \"list\"),\n help=LIST_HELP,\n", "new_string": " )\n list_parser = subparsers.add_parser(\n \"list\",\n aliases=[\"ls\"],\n parents=[parent_parser],\n description=append_doc_link(LIST_HELP, \"list\"),\n help=LIST_HELP,\n"}}</tool_call>
<tool_call=str_replace_editor>{"name": "str_replace_editor", "parameters": {"file_path": "dvc/command/ls/__init__.py", "old_string": " )\n list_parser = subparsers.add_parser(\n \"list\",\n parents=[parent_parser],\n description=append_doc_link(LIST_HELP, \"list\"),\n help=LIST_HELP,\n", "new_string": " )\n list_parser = subparsers.add_parser(\n \"list\",\n aliases=[\"ls\"],\n parents=[parent_parser],\n description=append_doc_link(LIST_HELP, \"list\"),\n help=LIST_HELP,\n"}}</tool_call>
<tool_call=str_replace_editor>{"name": "str_replace_editor", "parameters": {"file_path": "tests/unit/command/ls/test_ls.py", "old_string": " \"\\x1b[01;34msrc\\x1b[0m\",\n \"\\x1b[01;32mrun.sh\\x1b[0m\",\n ]\n", "new_string": " \"\\x1b[01;34msrc\\x1b[0m\",\n \"\\x1b[01;32mrun.sh\\x1b[0m\",\n ]\n\n\ndef test_list_alias():\n cli_args = parse_args([\"ls\", \"local_dir\"])\n assert cli_args.func == CmdList\n"}}</tool_call>
<tool_call=read_file>{"name": "read_file", "parameters": {"path": "dvc/command/ls/__init__.py"}}</tool_call>
</functions>
Available Code Interaction Tools (Use EXACTLY these formats):
<functions_to_roleplay_with>
To search for specific patterns in the codebase, use:
<tool_call=ripgrep_search>{"name": "ripgrep_search", "parameters": {"pattern": "your_pattern", "path": "your_path", "file_pattern": "your_file_pattern"}}</tool_call>
To list directories and understand the file structure, use:
<tool_call=list_directory>{"name": "list_directory", "parameters": {"path": "your_path", "recursive": "true_or_false"}}</tool_call>
</functions_to_roleplay_with>
Output Requirements:
First provide detailed technical reasoning
Then output ONE PRECISE tool call
At the end when the problem is done the last call should list the reason why and call <tool_call=attempt_completion>{"name": "attempt_completion", "parameters": {}}</tool_call>
Follow EXACT XML structure shown below:
<reason>
Analysis and reasoning for the action
</reason>
<action>
Tool call to be used
</action>
Important Notes:
Replace ALL placeholders with values relevant to problem statement
Parameters MUST reflect actual technical needs from the problem
Only output ONE action per response
Maintain strict JSON syntax within tool calls
You must use all functions provided in <functions_to_roleplay_with> as is within the answer.
Please create a roleplay dataset by using the provided tools to search and list directories, and then output a tool call that reflects your analysis and reasoning.
Think before you respond.
Cost & Details
The total cost for this project was approximately $400, with the majority spent on inference. I used an automated script to handle the full training pipeline - from finetuning through evaluation across all model sizes up to 32B parameters. The hardware setup included an A100 80GB GPU and a rented H200 140GB+ GPU from RunPod. Training times varied from 1.5 hours for smaller models up to 8 hours for the largest ones. All reasoning models went through 3 epochs of training.
Post Processing
In the post-processing stage, we feed all this data to another model that actually runs through each tool call step. This model grabs the real files, executes the commands, and collects all the outputs. At this point you can also enrich the function calls with extra information, such as line numbers for the string replacements (a sketch of this replay step follows the list below).
This does a few things:
- Makes sure our tool calls actually work (we already confirmed this in the first step)
- Adds real file contents and command outputs to our training data
- Updates the reasoning to further improve its quality
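Here is a minimal sketch of that replay step (illustrative only; the real post-processing is driven by another model). It assumes a local checkout of the repo and simply executes reads and string replacements, returning the output that gets folded back into the training data:

from pathlib import Path

def replay_tool_call(repo_root: str, call: dict) -> str:
    # Execute one parsed tool call against a local checkout and return its output.
    params = call["parameters"]
    if call["name"] == "read_file":
        return (Path(repo_root) / params["path"]).read_text()
    if call["name"] == "str_replace_editor":
        target = Path(repo_root) / params["file_path"]
        text = target.read_text()
        if params["old_string"] not in text:
            raise ValueError(f"old_string not found in {target}")
        target.write_text(text.replace(params["old_string"], params["new_string"], 1))
        return f"Applied edit to {params['file_path']}"
    raise NotImplementedError(call["name"])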
With this dataset, we can fine-tune to get a base model. This model can then be further improved through RLHF (Reinforcement Learning from Human Feedback) and GRPO (Group Relative Policy Optimization) training, where it will continuously learn from new datasets generated by the pipeline. This creates a virtuous cycle of improvement, with each iteration building on the knowledge gained from previous runs. I should probably write up a whole separate post on this extended pipeline someday. For now, enjoy this repo!