---
library_name: transformers
tags:
- Compiler
- LLVM
- Intermediate Representation
- IR
- Path
- Hot Path
datasets:
- zhaojer/compiler_hot_paths
language:
- en
base_model:
- google-bert/bert-base-uncased
---

# Model Card for BERT Hot Path Predictor

## Model Details

### Model Description

This BERT model performs hot path prediction: given a path (i.e., a sequence of LLVM IR instructions), it predicts whether the path is "hot" (1) or "cold" (0).
It was fine-tuned on the [hot paths dataset](https://huggingface.co/datasets/zhaojer/compiler_hot_paths) for 3 epochs; the hyperparameters are listed under Training Details.

- **Model type:** Binary Sequence Classification
- **Language(s) (NLP):** English, Compiler/LLVM
- **Finetuned from model:** google-bert/bert-base-uncased
- **Dataset used:** zhaojer/compiler_hot_paths

## Uses

The model predicts whether a path is hot or cold, which is useful information for compiler optimizations.

A typical prediction pipeline looks like this:

1. Given a program (written in C, C++, Fortran, or another language supported by LLVM), compile it into LLVM IR (e.g., `clang -S -emit-llvm program.c -o program.ll`).
2. Select a sequence of instructions (in units of basic blocks) from the IR file; use this as the input to the model.
3. Load this model and feed it the selected input; the model outputs either 0 (cold path) or 1 (hot path).

The model can be further fine-tuned using additional data. Please see the [zhaojer/compiler_hot_paths](https://huggingface.co/datasets/zhaojer/compiler_hot_paths) dataset card for more information on the data format expected for fine-tuning.

## How to Get Started with the Model

Use the code below to get started with the model.
```
from transformers import BertForSequenceClassification, BertTokenizer, pipeline

# Load the fine-tuned model and its tokenizer
saved_model = BertForSequenceClassification.from_pretrained("zhaojer/bert-hot-path-predictor")
saved_tokenizer = BertTokenizer.from_pretrained("zhaojer/bert-hot-path-predictor")

# Build a text-classification pipeline for predictions
classifier = pipeline("text-classification", model=saved_model, tokenizer=saved_tokenizer)

# Example prediction: the path is passed as a single string of LLVM IR instructions
new_path = "%26 = load i32, ptr %21, align 4\n%27 = load i32, ptr %11, align\n%28 = icmp slt i32 %26, %27\nbr i1 %28, label %29, label %59\n\nstore i32 0, ptr %22, align 4\nbr label %30"
prediction = classifier(new_path)
print(prediction)
```

## Training Details

### Training Data

The model was fine-tuned on the hot paths dataset: [zhaojer/compiler_hot_paths](https://huggingface.co/datasets/zhaojer/compiler_hot_paths)

The dataset is already split into train, validation, and test sets and contains the columns needed for fine-tuning. No further preprocessing was performed on the data.

The data (in the `path` column) were tokenized using the standard `BertTokenizer` for the `bert-base-uncased` model.

### Training Procedure

We defined accuracy and AUROC as evaluation metrics for the model.
The model was fine-tuned for 3 epochs with the hyperparameters listed below, which took about 10 minutes on an NVIDIA T4 GPU.

#### Detailed Training Hyperparameters

- `evaluation_strategy="epoch"`
- `logging_strategy="epoch"`
- `save_strategy="epoch"`
- `num_train_epochs=3`
- `per_device_train_batch_size=16`
- `per_device_eval_batch_size=16`
- `learning_rate=5e-5`
- `load_best_model_at_end=True`
- `metric_for_best_model="accuracy"`

Note: Any hyperparameter not listed above used its default value.

## Evaluation

### Testing Data

The testing data consist of 68 hot paths and 92 cold paths generated from 4 distinct C programs.
They are also from [zhaojer/compiler_hot_paths](https://huggingface.co/datasets/zhaojer/compiler_hot_paths); please see its dataset card for how the testing data were created.

The model had never seen these testing data previously.

### Metrics

We evaluated the model on the testing data using the following metrics:

- Loss (available by default)
- Accuracy
- AUROC
- Precision, Recall, F1 score
- Confusion matrix

### Results

| Loss   | Accuracy | AUROC  | Precision | Recall | F1   |
| ------ | -------- | ------ | --------- | ------ | ---- |
| 0.0620 | 0.9875   | 0.9952 | 1.0000    | 0.9706 | 0.99 |

|                | Actually Hot | Actually Cold |
| -------------- | ------------ | ------------- |
| Predicted Hot  | 66           | 0             |
| Predicted Cold | 2            | 92            |
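
As a sanity check, the reported accuracy, precision, recall, and F1 can be re-derived from the confusion matrix above. The snippet below is only an illustrative sketch, not the original evaluation code: the variable names are hypothetical, and it assumes "hot" is the positive class, which is consistent with the reported precision and recall.

```python
# Confusion-matrix counts copied from the table above.
# Assumption: "hot" is treated as the positive class.
tp = 66  # predicted hot, actually hot
fp = 0   # predicted hot, actually cold
fn = 2   # predicted cold, actually hot
tn = 92  # predicted cold, actually cold

total = tp + fp + fn + tn

# Standard definitions of the binary classification metrics.
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"total={total}")
print(f"accuracy={accuracy:.4f}")
print(f"precision={precision:.4f}")
print(f"recall={recall:.4f}")
print(f"f1={f1:.4f}")
```

The derived values match the results table: 158 of 160 test paths are classified correctly (accuracy 0.9875), and the derived F1 of roughly 0.9851 agrees with the reported 0.99 when rounded to two decimals.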