Model Card for DecLlama
This model is designed to increase the readability of decompiled C code.
Model Details
Model Description
This model is designed to increase the readability of decompiled C code. It accepts as input C code produced by the Hex-Rays or RetDec decompilers (the variants it was tested on). The model outputs C code whose structure, functionality, and variable and function names are close to the original source code.
- Developed by: Kislov Konstantin Aleksandrovich, a student at NRNU MEPhI
- Model type: Transformer-based LLM
- Language(s) (NLP): English; C source code
- License: The model is based on Llama 2, so the terms of commercial use are governed by the Llama 2 license. A custom commercial license is available at: https://ai.meta.com/resources/models-and-libraries/llama-downloads/
- Finetuned from model: CodeLlama-7b-hf
Model Sources
- Repository: https://github.com/KislovKonstantin/Samsung-project-DecLlama
- Demo: https://www.kaggle.com/code/kislovka/example-of-using-decllama/edit
Uses
The intended target audience is reverse engineers and information security specialists. The model should help them and simplify the process of analyzing decompiled C code.
Bias, Risks, and Limitations
The model may not work correctly if the input is given without the minimal prompt used during fine-tuning, or if the input exceeds the 4096-token limit.
Recommendations
For best results, use the same prompt that was used for fine-tuning (shown in the demonstration notebook). If the input is too large, split the code into several smaller segments, for example so that each contains a few functions.
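The splitting recommendation above can be sketched with a naive brace-depth scan. This is an illustrative helper, not part of the model or its tooling: it ignores braces inside string literals and comments, and the function names are invented for this example.

```python
def split_c_functions(source: str):
    """Split C source into top-level chunks, one per function definition.

    A chunk ends when the brace depth returns to zero after the first
    top-level '{'. Preprocessor lines and declarations between functions
    are attached to the following chunk. Naive: does not handle braces
    inside string literals or comments.
    """
    chunks, current, depth, seen_brace = [], [], 0, False
    for line in source.splitlines(keepends=True):
        current.append(line)
        for ch in line:
            if ch == "{":
                depth += 1
                seen_brace = True
            elif ch == "}":
                depth -= 1
        if seen_brace and depth == 0:
            chunks.append("".join(current))
            current, seen_brace = [], False
    if current:
        chunks.append("".join(current))
    return chunks


def batch_functions(chunks, per_segment=3):
    """Group function chunks so each model input holds a few functions."""
    return ["".join(chunks[i:i + per_segment])
            for i in range(0, len(chunks), per_segment)]
```

Each resulting segment can then be wrapped in the fine-tuning prompt and passed to the model separately, keeping every input under the 4096-token limit.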
How to Get Started with the Model
You can get started with the model using this notebook: https://www.kaggle.com/code/kislovka/example-of-using-decllama/edit. The checkpoint is available on the model's Hugging Face page: https://huggingface.co/KonstantinKislov/CodeLlama_adapter_for_solving_the_problem_of_increasing_the_readability_of_decompiled_C_code
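A minimal sketch of loading the checkpoint, assuming the `transformers` and `peft` packages are installed. The prompt text below is illustrative only; the exact fine-tuning prompt is shown in the demonstration notebook.

```python
def build_prompt(decompiled_code: str) -> str:
    # Illustrative wrapper only; for best results use the exact prompt
    # from the demonstration notebook.
    return ("Rewrite the following decompiled C code so that it is "
            "readable and close to the original source code:\n"
            f"{decompiled_code}\n")


def load_model():
    # Imported lazily so build_prompt works without these packages.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PeftModel

    base = "codellama/CodeLlama-7b-hf"
    adapter = ("KonstantinKislov/CodeLlama_adapter_for_solving_the_problem"
               "_of_increasing_the_readability_of_decompiled_C_code")
    # Downloads ~7B parameters; a GPU with sufficient memory is required.
    tokenizer = AutoTokenizer.from_pretrained(base)
    model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")
    model = PeftModel.from_pretrained(model, adapter)
    return tokenizer, model
```

After loading, tokenize `build_prompt(code)`, call `model.generate`, and decode the output as usual for a causal LM.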
Training Details
Training Data
Dataset for training: https://www.kaggle.com/datasets/kislovka/95131hr. Dataset for evaluation: https://www.kaggle.com/datasets/kislovka/decllama-evaluation-dataset
Training Procedure
Training Hyperparameters
- Training regime: fp16 mixed precision
Speeds, Sizes, Times
During training, one example was processed in about 5 seconds on average. By the last checkpoint, the model had processed about 70 percent of the dataset, which took about 90 hours.
Evaluation
Testing Data, Factors & Metrics
Testing Data
Dataset for evaluation: https://www.kaggle.com/datasets/kislovka/decllama-evaluation-dataset
Metrics
Sentence BLEU, Corpus BLEU, AED, subjective assessment of the quality of the generated code from 0 to 2.
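Assuming AED here denotes average edit distance between the generated code and the reference source (normalized Levenshtein distance, lower is better), it can be computed with a standard dynamic-programming sketch; the function names are illustrative:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def average_edit_distance(hypotheses, references):
    """Mean edit distance normalized by the longer string, in [0, 1]."""
    scores = [levenshtein(h, r) / max(len(h), len(r), 1)
              for h, r in zip(hypotheses, references)]
    return sum(scores) / len(scores)
```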
Results
When testing on code decompiled with Hex-Rays (8.3.0.230608):
- Sentence BLEU = 44.74%, Corpus BLEU = 42.57%, AED = 0.67
- Subjective assessment: score 0: 5%, score 1: 8%, score 2: 87%; mean subjective score = 1.82

When testing on code decompiled with RetDec (v5.0):
- Sentence BLEU = 37.02%, Corpus BLEU = 34.68%, AED = 0.71
- Subjective assessment: score 0: 20%, score 1: 33%, score 2: 47%; mean subjective score = 1.27
Environmental Impact
- Hardware Type: Tesla K80
- Hours used: 87.5
- Cloud Provider: Kaggle (default)
- Compute Region: nearest to Moscow, Russia
- Carbon Emitted: 7.09
Technical Specifications
Model Architecture and Objective
Llama 2 architecture; text-to-text generation.
Compute Infrastructure
A GPU T4 x2 instance on Kaggle was used.
Hardware
Tesla K80
Software
Kaggle, Hex-Rays (8.3.0.230608), RetDec (v5.0), GCC 11.4.0.
More Information
This model is an individual project completed within the Artificial Intelligence course (2023/2024) of the Samsung IT Academy. Project topic: exploring the possibilities of LLM fine-tuning for improving the readability of decompiled C code.
Model Card Authors
Kislov Konstantin Aleksandrovich, a student at NRNU MEPhI.
Model Card Contact
Email: kostik_kislov@list.ru. GitHub: https://github.com/KislovKonstantin
Framework versions
- PEFT 0.11.1