
Machine Learning Canvas

Designed for: NLBSE 2026
Designed by: Team Hopcroft
Date: 13/10/2025
Iteration: 1

PREDICTION TASK

The prediction task is multi-label classification aimed at identifying the technical skills required to resolve a specific software issue. The model's input is the set of features extracted from a GitHub pull request, including textual features (such as the issue description), code-context information, and other metadata. The output is a set of one or more skill labels, drawn from a predefined vocabulary of 217 skills, representing the technical domains and sub-domains (e.g., "database," "security," "UI") needed for the resolution.
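As a concrete illustration of the output format, the sketch below encodes skill sets as binary multi-label vectors with scikit-learn; the three-label vocabulary is illustrative, standing in for the full set of 217 skills.

    from sklearn.preprocessing import MultiLabelBinarizer

    # Hypothetical issues, each annotated with one or more skill labels
    # drawn from the predefined vocabulary (217 labels in the real task).
    issue_skills = [
        ["database", "security"],
        ["UI"],
        ["database", "UI", "security"],
    ]

    # Encode each skill set as a binary indicator vector, one column per label.
    mlb = MultiLabelBinarizer()
    Y = mlb.fit_transform(issue_skills)

    print(mlb.classes_)  # ['UI' 'database' 'security']
    print(Y)             # [[0 1 1] [1 0 0] [1 1 1]]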

DECISIONS

The predictions are used to make crucial operational decisions in software project management. The value for the end-user, such as a project manager or team lead, lies in the ability to automatically assign new issues to the most suitable developers—those who possess the skills identified by the model. This optimizes resource allocation, accelerates resolution times, and improves the overall efficiency of the development team.

VALUE PROPOSITION

The Machine Learning system is designed for project managers and developers, aiming to optimize task assignment. By automatically predicting the technical skills (domains and sub-domains) required to resolve GitHub issues, the system ensures that each task is assigned to the most qualified developer.
The primary value lies in a significant increase in the efficiency of the development process, leading to reduced resolution times and improved software quality.

DATA COLLECTION

The core data was collected by the competition organizers through a mining process on historical GitHub pull requests. This process involved sourcing the issue text and associated source code from tasks that were already completed and merged. Each issue in the dataset then underwent a rigorous, automated labeling protocol, where skill labels (domains and sub-domains) were annotated based on the specific API calls detected within the source code. Due to the nature of software development tasks, the resulting dataset faces a significant class imbalance issue, with certain skill labels appearing far more frequently than others.
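Because class imbalance drives later modeling choices, a quick way to quantify it is to count label frequencies across issues. The sketch below assumes a pandas frame with a "skills" column of label lists; the column names are hypothetical, not the actual schema.

    import pandas as pd

    # Hypothetical frame: one row per issue, 'skills' holding its annotated labels.
    df = pd.DataFrame({
        "issue_id": [1, 2, 3, 4],
        "skills": [["database"], ["database", "security"], ["UI"], ["database"]],
    })

    # Count how often each skill label occurs; a long-tailed distribution
    # here is the class imbalance described above.
    label_counts = df["skills"].explode().value_counts()
    print(label_counts)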

DATA SOURCES

The ML system will leverage the official NLBSE’26 Skill Classification dataset, a comprehensive corpus released by the competition organizers. This dataset is sourced from 11 popular Java repositories and comprises 7,245 merged pull requests annotated with 217 distinct skill labels. All foundational data is provided in a SQLite database (skillscope_data.db), with the nlbse_tool_competition_data_by_issue table serving as the primary source for model training. The competition framework also permits the use of external GitHub APIs for supplementary data.
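A minimal sketch of loading the training table: the database and table names come from the competition materials above, while the schema itself should be inspected rather than assumed.

    import sqlite3
    import pandas as pd

    # Read the primary training table from the provided SQLite database.
    conn = sqlite3.connect("skillscope_data.db")
    df = pd.read_sql_query(
        "SELECT * FROM nlbse_tool_competition_data_by_issue", conn
    )
    conn.close()

    print(df.shape)    # expected on the order of 7,245 rows
    print(df.columns)  # inspect the actual columns before feature engineering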

IMPACT SIMULATION

The model's impact is validated by outperforming the "SkillScope Random Forest + TF-IDF" baseline in precision, recall, or micro-F1 score. This evaluation uses the provided SQLite database of labeled pull requests as the ground truth, so that any improvement is measurable against a fixed reference.
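The comparison metrics can be computed with scikit-learn; the sketch below uses toy indicator matrices in place of the real ground truth and model predictions.

    import numpy as np
    from sklearn.metrics import precision_score, recall_score, f1_score

    # Toy multi-label matrices (rows = issues, columns = skills); in practice
    # y_true comes from the SQLite ground truth and y_pred from the model.
    y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
    y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 1, 1]])

    # Micro-averaging pools all label decisions before computing each score.
    print("precision:", precision_score(y_true, y_pred, average="micro"))
    print("recall:   ", recall_score(y_true, y_pred, average="micro"))
    print("micro-F1: ", f1_score(y_true, y_pred, average="micro"))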

MAKING PREDICTIONS

As soon as a new issue is created, the system analyzes it in real time to determine which technical skills are needed. Instead of waiting for a manual assignment, the system routes the task directly to the most suitable developer. Because the process is automated and near-instantaneous, the right expert can start working on the problem with minimal delay.
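A minimal routing sketch under two assumptions: a trained pipeline model that predicts skill vectors from raw issue text (with the mlb encoder from the earlier sketch), and a hypothetical in-memory map of developers to their skills.

    # Hypothetical developer profiles; in a real deployment these would come
    # from a team directory or past contribution history.
    developer_skills = {
        "alice": {"database", "security"},
        "bob": {"UI"},
    }

    def route_issue(issue_text, model, mlb):
        """Predict the skills a new issue needs and pick the best-matching developer."""
        y = model.predict([issue_text])
        required = set(mlb.inverse_transform(y)[0])
        # Assign to whoever covers the most predicted skills (ties broken arbitrarily).
        best = max(developer_skills, key=lambda d: len(developer_skills[d] & required))
        return required, best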

BUILDING MODELS

The ML system will start with the competition’s baseline multi-label classifier, which predicts the domains and sub-domains representing the skills needed for each issue. Model development will focus on iterative improvements to enhance the specified performance metrics.
A new model will be trained until it achieves a statistically significant improvement in precision, recall, or micro-F1 score over the initial baseline, without degradation in the other metrics. Training will occur offline, with computational needs scaling with model complexity and data volume.
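A sketch of a starting point that mirrors the named baseline components (TF-IDF features feeding a Random Forest); the hyperparameters are illustrative, not the baseline's actual settings.

    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.ensemble import RandomForestClassifier

    # TF-IDF text features into a Random Forest; scikit-learn's forest
    # accepts a binary multi-label target matrix directly.
    model = Pipeline([
        ("tfidf", TfidfVectorizer(max_features=5000)),
        ("clf", RandomForestClassifier(n_estimators=200, random_state=42)),
    ])

    # texts: issue title + description strings; Y: binary label matrix
    # model.fit(texts, Y)
    # Y_pred = model.predict(texts_heldout)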

FEATURES

Only the most important, reliably non-null, and directly predictive features will be selected. Textual data, such as the issue title and description, will be represented using established NLP techniques. We will also utilize numerical features, including the pull request number and the calculated issue duration. Skills will be encoded as binary multi-label vectors, and all features will be normalized to optimize model performance throughout iterative development cycles.
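One way to assemble these heterogeneous features is a ColumnTransformer; the column names below ("title", "body", "pr_number", "issue_duration") are assumptions about the schema, with normalization applied to the numeric columns as described above.

    from sklearn.compose import ColumnTransformer
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.preprocessing import StandardScaler

    # Combine TF-IDF text representations with scaled numeric features.
    # The column names are hypothetical placeholders for the real schema.
    features = ColumnTransformer([
        ("title_tfidf", TfidfVectorizer(), "title"),
        ("body_tfidf", TfidfVectorizer(), "body"),
        ("numeric", StandardScaler(), ["pr_number", "issue_duration"]),
    ])

    # X = features.fit_transform(df)  # df: one row per issue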

MONITORING

System quality will be assessed by comparing the model's skill predictions with the actual skills used by developers to resolve issues. Performance will be continuously monitored using key metrics (precision, recall, micro-F1 score). To detect data drift, the model will be periodically evaluated on new, recent data; a significant drop in these metrics will indicate the need for retraining. The system's value is measured according to the competition's criteria: the primary value is the increase in the micro-F1 score (∆micro-F1) over the baseline, without worsening precision and recall. Computational efficiency (runtime) serves as a secondary value metric.
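A sketch of the drift check: periodically score the model on a window of recent issues and flag retraining when micro-F1 falls too far below the score recorded at deployment. The reference score and tolerance here are hypothetical.

    from sklearn.metrics import f1_score

    BASELINE_MICRO_F1 = 0.70  # hypothetical micro-F1 recorded at deployment
    DRIFT_TOLERANCE = 0.05    # hypothetical acceptable drop before retraining

    def needs_retraining(y_recent_true, y_recent_pred):
        """Flag retraining when micro-F1 on recent issues drops significantly."""
        current = f1_score(y_recent_true, y_recent_pred, average="micro")
        return (BASELINE_MICRO_F1 - current) > DRIFT_TOLERANCE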