Model Card for DecLlama (a CodeLlama adapter for improving the readability of decompiled C code)

This model is designed to increase the readability of decompiled C code.

Model Details

Model Description

This model is designed to increase the readability of decompiled C code. It accepts as input C code decompiled with Hex-Rays or RetDec (the two decompilers it was tested on) and outputs C code whose structure, functionality, and variable and function names are close to the original source code.

  • Developed by: Kislov Konstantin Aleksandrovich, a student at NRNU MEPhI
  • Model type: transformer-based LLM
  • Language(s) (NLP): English; C source code
  • License: The model is based on Llama 2, so the terms of commercial use are those of the Llama 2 license. A custom commercial license is available at: https://ai.meta.com/resources/models-and-libraries/llama-downloads/
  • Finetuned from model: CodeLlama-7b-hf

Uses

The target audience is reverse engineers and information security specialists; the model is intended to help them and to simplify the analysis of decompiled C code.

Bias, Risks, and Limitations

The model may not work correctly if the recommended minimal prompt is not used, or if the input exceeds the limit of 4096 tokens.

Recommendations

For the model to work best, use the prompt that was used for fine-tuning (shown in the demonstration notebook). If the input is too large, split the code into several smaller segments, for example so that each contains a few functions; a chunking sketch is given below.
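
A minimal sketch of one way to follow this recommendation, assuming a naive brace-matching split into top-level functions and the public CodeLlama-7b-hf tokenizer; neither the heuristic nor the token budget below comes from the author's pipeline:

```python
# Group decompiled C functions into chunks that fit the 4096-token limit.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-hf")
MAX_TOKENS = 4096
PROMPT_OVERHEAD = 256  # rough reserve for the instruction prompt (assumption)

def split_functions(source: str) -> list[str]:
    """Split C source into top-level blocks by tracking brace depth.

    Naive: ignores braces inside string literals and comments.
    """
    blocks, depth, start = [], 0, 0
    for i, ch in enumerate(source):
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                blocks.append(source[start:i + 1].strip())
                start = i + 1
    tail = source[start:].strip()
    if tail:
        blocks.append(tail)
    return blocks

def chunk_source(source: str) -> list[str]:
    """Pack consecutive functions into chunks under the token budget."""
    budget = MAX_TOKENS - PROMPT_OVERHEAD
    chunks, current, current_len = [], [], 0
    for block in split_functions(source):
        n = len(tokenizer(block)["input_ids"])
        if current and current_len + n > budget:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(block)
        current_len += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Each chunk can then be passed to the model separately, with the fine-tuning prompt prepended.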

How to Get Started with the Model

You can start using the model with this notebook: https://www.kaggle.com/code/kislovka/example-of-using-decllama. The checkpoint is uploaded on the model's Hugging Face page: https://huggingface.co/KonstantinKislov/CodeLlama_adapter_for_solving_the_problem_of_increasing_the_readability_of_decompiled_C_code
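
For orientation, here is a minimal loading-and-inference sketch using transformers and PEFT. The generation settings and especially the prompt string are assumptions; the exact fine-tuning prompt should be taken from the demonstration notebook:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "codellama/CodeLlama-7b-hf"
ADAPTER = ("KonstantinKislov/CodeLlama_adapter_for_solving_the_problem"
           "_of_increasing_the_readability_of_decompiled_C_code")

tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(
    BASE, torch_dtype=torch.float16, device_map="auto"
)
model = PeftModel.from_pretrained(model, ADAPTER)  # attach the adapter
model.eval()

# Illustrative Hex-Rays-style input (not a real model transcript).
decompiled = """__int64 sub_401000(__int64 a1, int a2)
{
  __int64 v3 = 0LL;
  for ( int i = 0; i < a2; ++i )
    v3 += *(int *)(a1 + 4LL * i);
  return v3;
}"""

# Placeholder prompt: replace with the exact prompt from the notebook.
prompt = f"Rewrite this decompiled C code to be readable:\n{decompiled}\n"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=512, do_sample=False)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```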

Training Details

Training Data

  • Dataset for training: https://www.kaggle.com/datasets/kislovka/95131hr
  • Dataset for evaluation: https://www.kaggle.com/datasets/kislovka/decllama-evaluation-dataset

Training Procedure

Training Hyperparameters

  • Training regime: fp16 mixed precision (a configuration sketch follows below)
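
A hypothetical reconstruction of a PEFT/LoRA fine-tuning setup with fp16 mixed precision. Only fp16 is stated in this card; the rank, alpha, target modules, batch size, and learning rate below are illustrative assumptions, not the author's reported values:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments

model = AutoModelForCausalLM.from_pretrained("codellama/CodeLlama-7b-hf")

lora = LoraConfig(
    r=16,                                 # assumed rank
    lora_alpha=32,                        # assumed scaling
    lora_dropout=0.05,                    # assumed dropout
    target_modules=["q_proj", "v_proj"],  # assumed target layers
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)

args = TrainingArguments(
    output_dir="decllama-checkpoints",
    fp16=True,                         # the regime stated in this card
    per_device_train_batch_size=1,     # assumed
    gradient_accumulation_steps=8,     # assumed
    learning_rate=2e-4,                # assumed
)
```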

Speeds, Sizes, Times

During training, one example was processed in about 5 seconds on average. As of the last checkpoint, the model had processed about 70 percent of the training dataset, which took about 90 hours.

Evaluation

Testing Data, Factors & Metrics

Testing Data

Dataset for evaluation: https://www.kaggle.com/datasets/kislovka/decllama-evaluation-dataset

Metrics

Sentence BLEU, Corpus BLEU, AED, and a subjective assessment of the quality of the generated code on a scale from 0 to 2 (a computation sketch follows below).
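
A sketch of how the automatic metrics can be computed with NLTK, assuming whitespace tokenization and reading AED as an average normalized edit distance between outputs and references; the exact definitions used for evaluation are in the evaluation materials:

```python
from nltk.translate.bleu_score import sentence_bleu, corpus_bleu

def edit_distance(a: list[str], b: list[str]) -> int:
    """Levenshtein distance over token sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

def evaluate(outputs: list[str], references: list[str]) -> dict:
    hyp = [o.split() for o in outputs]
    ref = [r.split() for r in references]
    sent = sum(sentence_bleu([r], h) for r, h in zip(ref, hyp)) / len(hyp)
    corp = corpus_bleu([[r] for r in ref], hyp)
    aed = sum(edit_distance(h, r) / max(len(h), len(r), 1)
              for h, r in zip(hyp, ref)) / len(hyp)
    return {"sentence_bleu": sent, "corpus_bleu": corp, "aed": aed}
```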

Results

When testing with examples of code decompiled using each tool:

  • Hex-Rays (8.3.0.230608): Sentence BLEU = 44.74%, Corpus BLEU = 42.57%, AED = 0.67; subjective assessment: score 0 in 5% of cases, score 1 in 8%, score 2 in 87%; mean subjective score = 1.82.
  • RetDec (v5.0): Sentence BLEU = 37.02%, Corpus BLEU = 34.68%, AED = 0.71; subjective assessment: score 0 in 20% of cases, score 1 in 33%, score 2 in 47%; mean subjective score = 1.27.

Environmental Impact

  • Hardware Type: Tesla K80
  • Hours used: 87.5
  • Cloud Provider: Kaggle (the platform's default provider)
  • Compute Region: nearest to Moscow, Russia
  • Carbon Emitted: 7.09 kg of CO2eq (estimated)

Technical Specifications

Model Architecture and Objective

Llama 2 architecture; text-to-text generation objective.

Compute Infrastructure

A GPU T4 x2 instance on Kaggle was used.

Hardware

Tesla K80

Software

Kaggle, Hex-Rays (8.3.0.230608), RetDec (v5.0), GCC 11.4.0.

More Information

The model is an individual project carried out within the Artificial Intelligence course (2023/2024) of the Samsung IT Academy. Project topic: exploring the possibilities of LLM fine-tuning for solving the problem of improving the readability of decompiled C code.

Model Card Authors

Kislov Konstantin Aleksandrovich, a student at NRNU MEPhI.

Model Card Contact

Email: kostik_kislov@list.ru. GitHub: https://github.com/KislovKonstantin

Framework versions

  • PEFT 0.11.1