---
license: mit
---

# Dockerfile Commit Classification Model

This is a Logistic Regression model combined with a rule-based system for multi-label classification of Dockerfile-related commit messages. The classifier produces the initial label predictions, and domain-specific rules adjust them.

## Files

- `logistic_model.joblib`: Trained Logistic Regression model.
- `tfidf_vectorizer.joblib`: Fitted TF-IDF vectorizer that converts commit messages into feature vectors.
- `label_binarizer.joblib`: MultiLabelBinarizer for encoding label sets as binary vectors and decoding predictions back into label names.

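
The encoding and decoding that the label binarizer performs can be illustrated without loading any artifact. The sketch below is a plain-Python stand-in, not the shipped `MultiLabelBinarizer`: it assumes the five labels this model predicts and assumes the classes are kept in sorted order, which is how scikit-learn's `MultiLabelBinarizer` orders them by default. The `encode` and `decode` helpers are hypothetical names used only for illustration.

```python
# Assumed class list: the five labels named in this README, in sorted order.
CLASSES = sorted([
    "bug fix",
    "code refactoring",
    "feature addition",
    "maintenance/other",
    "Not enough information",
])


def encode(labels):
    """Turn a collection of label names into a 0/1 indicator vector."""
    wanted = set(labels)
    return [1 if name in wanted else 0 for name in CLASSES]


def decode(row):
    """Turn a 0/1 indicator vector back into a tuple of label names."""
    return tuple(name for name, flag in zip(CLASSES, row) if flag)


vector = encode(["bug fix", "code refactoring"])
print(vector)
print(decode(vector))
```

Because a vector can contain more than one `1`, a single commit message can carry several labels at once.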
## Features

- **Hybrid Approach**: Combines a learned classifier with rule-based adjustments to its predictions.
- **Dockerfile-Specific Labels**: Categorizes commit messages into predefined classes:
  - `bug fix`
  - `code refactoring`
  - `feature addition`
  - `maintenance/other`
  - `Not enough information`
- **Multi-Label Support**: Each commit message can belong to multiple categories.

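
As a rough illustration of the hybrid idea, the sketch below post-processes a set of predicted labels with keyword rules. The rules that actually accompany this model are not documented here, so `RULES`, `FALLBACK`, and `apply_rules` are hypothetical, and the keywords are assumptions chosen only to make the example concrete.

```python
# Hypothetical keyword rules; NOT the rules used by this model.
RULES = {
    "bug fix": ("fix", "bug", "error"),
    "feature addition": ("add", "introduce", "support"),
}
FALLBACK = "Not enough information"


def apply_rules(message, predicted):
    """Adjust the classifier's predicted labels with simple keyword rules."""
    labels = set(predicted)
    text = message.lower()
    # Add a label whenever one of its keywords occurs in the message.
    for label, keywords in RULES.items():
        if any(keyword in text for keyword in keywords):
            labels.add(label)
    if not labels:
        # Nothing predicted and no rule fired: use the fallback label.
        labels.add(FALLBACK)
    elif len(labels) > 1:
        # The fallback is redundant once a concrete label is present.
        labels.discard(FALLBACK)
    return sorted(labels)


print(apply_rules("Added multistage builds to reduce image size", ["code refactoring"]))
print(apply_rules("wip", []))
```

Plain substring matching keeps the sketch short, but it also fires on unrelated words that happen to contain a keyword; a real rule set would likely need word-boundary matching or more specific patterns.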
## How to Use

Load the three artifacts, vectorize your commit messages with the TF-IDF vectorizer, and decode the model's predictions with the label binarizer:

```python
from joblib import load

# Load the model and preprocessing artifacts
model = load("logistic_model.joblib")
tfidf_vectorizer = load("tfidf_vectorizer.joblib")
mlb = load("label_binarizer.joblib")

# Example usage
new_messages = [
    "Fixed an issue with the base image in Dockerfile",
    "Added multistage builds to reduce image size",
    "Updated Python version in Dockerfile to 3.10",
]
X_new_tfidf = tfidf_vectorizer.transform(new_messages)

# Predict the labels
predictions = model.predict(X_new_tfidf)
predicted_labels = mlb.inverse_transform(predictions)

# Print results
for msg, labels in zip(new_messages, predicted_labels):
    print(f"Message: {msg}")
    print(f"Predicted Labels: {', '.join(labels) if labels else 'No labels'}\n")