Sentient Simulations AI Python 3.7 Decompiler - 6.7b - v0.9

1. Introduction

The Sentient Simulations AI Python Decompiler is a fine-tune of deepseek-ai/deepseek-coder-6.7b-base for one specific task: decompiling Python 3.7 bytecode back to its original Python source code.

Full model - LINK
GGUF - LINK

2. Data Preparation

The training data was built from Python 3.7 source code that was compiled to bytecode. The bytecode served as the input and the source code as the output, teaching the model to generate the original source code from Python bytecode. The steps to prepare the data are listed below.

Collect a large amount of Python code (for example from The Stack v2) and compile it with the version of Python you want to target.
Discard any code that does not compile under that version of Python.
Remove all comments from the code.
Format all the code with Black for consistency.
Format the bytecode in a way that reduces the token count and is easier for the AI to read. I initially tried custom tokens but got inconsistent results.
Generate input/output pairs for the training data.
Pack the samples with Axolotl sample packing at a constant context of 16k tokens.
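The first steps above (compile, discard failures, strip comments, pair bytecode with source) can be sketched roughly as follows. This is only an illustration, not the script used to build the dataset: `strip_comments` and `make_pair` are hypothetical names, the standard `dis` text output stands in for the custom token-reduced bytecode format, and Black formatting is left out. To target Python 3.7 bytecode, it would have to run under a Python 3.7 interpreter.

```python
import dis
import io
import tokenize
from typing import Optional


def strip_comments(source: str) -> str:
    """Remove comment tokens; '#' inside string literals is left alone."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    kept = [tok for tok in tokens if tok.type != tokenize.COMMENT]
    return tokenize.untokenize(kept)


def make_pair(source: str, filename: str = "<sample>") -> Optional[dict]:
    """Return {"input": bytecode text, "output": source}, or None to discard."""
    try:
        # Throw out anything the running interpreter cannot compile.
        compile(source, filename, "exec")
    except (SyntaxError, ValueError):
        return None

    cleaned = strip_comments(source)
    code = compile(cleaned, filename, "exec")

    buffer = io.StringIO()
    # dis.dis also disassembles nested functions and classes (Python 3.7+).
    dis.dis(code, file=buffer)
    return {"input": buffer.getvalue(), "output": cleaned}
```

Stripping comments this way leaves some trailing whitespace where the comments used to be, which a formatter such as Black would then normalize.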

4. Training

The model was trained for 4 days on three RTX 3090 GPUs using DeepSpeed ZeRO-3 Offload at 16k context.
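Training at a constant 16k context depends on the sample packing mentioned under Data Preparation, where several short samples share one context window. The sketch below shows the general idea with a simple greedy first-fit strategy. It is not Axolotl's actual algorithm; `pack_samples` is a hypothetical helper, and treating 16k as 16,384 tokens is an assumption.

```python
from typing import List


def pack_samples(lengths: List[int], context: int = 16_384) -> List[List[int]]:
    """Greedy first-fit packing: returns bins of sample indices."""
    bins: List[List[int]] = []
    free: List[int] = []  # remaining token budget for each bin

    # Placing the longest samples first tends to leave fewer gaps.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    for i in order:
        n = lengths[i]
        if n > context:
            continue  # sample cannot fit in any window, so it is dropped
        for b, room in enumerate(free):
            if n <= room:
                bins[b].append(i)
                free[b] -= n
                break
        else:
            # No existing bin had room; open a new one.
            bins.append([i])
            free.append(context - n)
    return bins
```

Each inner list holds the indices of the samples that would be concatenated into one training sequence.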

5. Prerequisites

Create a Python 3.7 environment to get the bytecode
Create a Python 3.10 environment to run the decompiler
Download the Python files and the GGUF file

conda create -n pydecompiler-37 python=3.7 -y
conda create -n pydecompiler-310 python=3.10 -y
conda activate pydecompiler-310
pip install huggingface-hub
mkdir pydecompiler
cd pydecompiler
huggingface-cli download GusPuffy/sentient-simulations-pydecompiler-3.7-6.7b-v0.9 --local-dir . --local-dir-use-symlinks False --include "*.py"
huggingface-cli download GusPuffy/sentient-simulations-pydecompiler-3.7-6.7b-v0.9-GGUF --local-dir . --local-dir-use-symlinks False --include "sentient-simulations-pydecompiler-3.7-6.7b-v0.9-q8_0.gguf"

Install llama cpp. During installation, make sure to use the prefix for the backend you want, depending on whether you have GPUs or only want to use the CPU.

6. Test Example

Convert a Python file to source and bytecode using Python 3.7

conda activate pydecompiler-37
python bytecode.py bytecode.py > bytecode-decompiled.pycb

The bytecode has been written to bytecode-decompiled.pycb. Now switch to the Python 3.10 environment to run the decompiler on the test example.

conda activate pydecompiler-310
python decompile.py bytecode-decompiled.pycb

Compare the AI-decompiled code in 'bytecode-decompiled.py' with the actual contents of 'bytecode.py'.
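Because comments and formatting are not expected to survive decompilation, a plain text diff can be noisy. One possible way to compare the two files is to check whether they parse to the same syntax tree, as in the sketch below. `same_structure` is a hypothetical helper and not part of the downloaded scripts.

```python
import ast


def same_structure(original_src: str, decompiled_src: str) -> bool:
    """True if both sources parse to the same AST (comments and layout ignored)."""
    try:
        original_tree = ast.parse(original_src)
        decompiled_tree = ast.parse(decompiled_src)
    except SyntaxError:
        # Output that does not parse cannot match the original.
        return False
    # ast.dump omits line and column numbers by default.
    return ast.dump(original_tree) == ast.dump(decompiled_tree)
```

To use it on the test example, read both files as text (for instance with `pathlib.Path("bytecode.py").read_text()`) and pass the two strings in. A `False` result only says the trees differ; it does not say the decompiled code behaves differently.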

7. Decompilation of a Directory of .pyc files

If you have many files you want to decompile, you can run the following commands to decompile the entire directory.

Convert the .pyc files to bytecode strings using Python 3.7

conda activate pydecompiler-37
python convert_pyc_to_bytecode.py directory_with_files
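This step has to run under Python 3.7 because a .pyc file can only be unmarshalled reliably by the interpreter version that wrote it. The sketch below shows how a code object can be read from a .pyc in general. It is not the contents of convert_pyc_to_bytecode.py, and `load_code_from_pyc` is a hypothetical name. The 16-byte header size applies to Python 3.7 and later.

```python
import importlib.util
import marshal
import types

# Python 3.7+ header: magic number (4 bytes), flags (4), mtime/size or hash (8).
PYC_HEADER_SIZE = 16


def load_code_from_pyc(path: str) -> types.CodeType:
    """Read a .pyc written by the running Python version and return its code object."""
    with open(path, "rb") as f:
        data = f.read()
    if data[:4] != importlib.util.MAGIC_NUMBER:
        raise ValueError(f"{path} was not compiled by this Python version")
    # Everything after the header is the marshalled module code object.
    return marshal.loads(data[PYC_HEADER_SIZE:])
```

The returned code object can then be passed to `dis.dis` or to a custom formatter to produce the bytecode text.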

Decompile the bytecode strings back to Python source code (This will take some time depending on how many files there are)

conda activate pydecompiler-310
python decompile.py directory_with_files

Note that files over 15k tokens are skipped. Files over 10k tokens will most likely be truncated due to the context limit.

The decompiled source code ends with a note indicating whether the AI decided the output was complete (stop) or the context limit was reached (length):

# Finish Reason: stop
# Finish Reason: length
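When decompiling a whole directory, it can help to find the files that were cut off. A minimal sketch, assuming the note is the last non-empty line of the output and has exactly the format shown above; `finish_reason` is a hypothetical helper, not part of the downloaded scripts.

```python
import re
from typing import Optional

FINISH_RE = re.compile(r"^# Finish Reason: (\w+)\s*$")


def finish_reason(decompiled_source: str) -> Optional[str]:
    """Return the reason from the last non-empty line, or None if it is absent."""
    for line in reversed(decompiled_source.splitlines()):
        if line.strip():
            match = FINISH_RE.match(line.strip())
            return match.group(1) if match else None
    return None
```

A result of `"length"` means the output was most likely truncated and the file deserves a manual look.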

8. Next Iteration

The dataset has shown very good results for decompiling files that fit within the 16k context, using only a 7b-class model.

I would like to train a longer-context model, or fine-tune the 33b version of DeepSeek Coder.

Let me know if you have compute available and you are interested in training a longer context version of this tool!

9. License

This is a fine-tune of DeepSeek Coder, so check out their license, LICENSE-MODEL, for more details.

10. Contact

If you have any questions, please raise an issue or find me on Discord.
