---
license: mit
language:
- en
pipeline_tag: text-classification
tags:
- pytorch
- mlflow
- ray
- fastapi
- nlp
---

## Scaling-ML

Scaling-ML is a project that classifies news headlines into 10 groups.
The core of the project is fine-tuning a [BERT](https://huggingface.co/allenai/scibert_scivocab_uncased)[1] model (the linked checkpoint is `allenai/scibert_scivocab_uncased`), combined with MLflow for experiment tracking, Ray for scaling and distributed computing, and MLOps components for managing machine learning workflows.

### Set Up

1. Clone the repository:
```bash
git clone https://github.com/your-username/scaling-ml.git
cd scaling-ml
```
2. Set up your virtual environment and install dependencies:
```bash
export PYTHONPATH=$PYTHONPATH:$PWD
pip install -r requirements.txt
```

### Scripts Overview

```bash
scripts
├── app.py
├── config.py
├── data.py
├── evaluate.py
├── model.py
├── predict.py
├── train.py
├── tune.py
└── utils.py
```

- `app.py` - FastAPI web service for serving a model.
- `config.py` - Configuration of logging settings, directory structure, and the MLflow registry.
- `data.py` - Functions and a class for data preprocessing.
- `evaluate.py` - Model evaluation: computes precision, recall, and F1 score.
- `model.py` - Fine-tuned language model with an added fully connected layer for classification.
- `predict.py` - `TorchPredictor` class for making predictions with a PyTorch-based model.
- `train.py` - Training process using Ray for distributed training.
- `tune.py` - Hyperparameter tuning for the language model using Ray Tune.
- `utils.py` - Utility functions for handling data, setting random seeds, saving and loading dictionaries, etc.

#### Dataset

For training, a small portion of the [News Category Dataset](https://www.kaggle.com/datasets/setseries/news-category-dataset) was used. The dataset contains numerous headlines and descriptions of various articles.

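
As a rough illustration of the kind of preprocessing `data.py` is described as doing, the sketch below cleans a headline and maps category names to integer labels. The function names `clean_text` and `encode_labels`, the `headline` and `category` field names, and the cleaning rules themselves are assumptions made for illustration, not the project's actual implementation.

```python
import re


def clean_text(text: str) -> str:
    """Lowercase the text and replace runs of non-alphanumeric characters with one space."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9]+", " ", text)
    return text.strip()


def encode_labels(categories: list[str]) -> dict[str, int]:
    """Map each distinct category name to an integer index, in sorted order."""
    return {name: index for index, name in enumerate(sorted(set(categories)))}


if __name__ == "__main__":
    # Hypothetical rows; the real dataset schema may differ.
    rows = [
        {"headline": "Airport Guide: Chicago O'Hare", "category": "TRAVEL"},
        {
            "headline": "Reboot Your Skin For Spring With These Facial Treatments",
            "category": "STYLE & BEAUTY",
        },
    ]
    class_to_index = encode_labels([row["category"] for row in rows])
    for row in rows:
        print(clean_text(row["headline"]), class_to_index[row["category"]])
```

Sorting the category names before indexing keeps the label mapping stable across runs, which matters when the same mapping has to be reused at inference time.
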
### How to Train

```bash
export DATASET_LOC="path/to/dataset"
export TRAIN_LOOP_CONFIG='{"dropout_p": 0.5, "lr": 1e-4, "lr_factor": 0.8, "lr_patience": 5}'
python3 scripts/train.py \
    --experiment_name "llm_train" \
    --dataset_loc "$DATASET_LOC" \
    --train_loop_config "$TRAIN_LOOP_CONFIG" \
    --num_workers 1 \
    --cpu_per_worker 1 \
    --gpu_per_worker 0 \
    --num_epochs 1 \
    --batch_size 128 \
    --results_fp results.json
```

- `experiment_name`: A name for the experiment or run, in this case `"llm_train"`.
- `dataset_loc`: The location of the training dataset; replace with the actual path.
- `train_loop_config`: The configuration for the training loop, passed as a JSON string; replace with the actual configuration.
- `num_workers`: The number of workers used for parallel processing. Adjust based on available CPU resources.
- `cpu_per_worker`: The number of CPU cores assigned to each worker. Adjust based on available CPU resources.
- `gpu_per_worker`: The number of GPUs assigned to each worker. Adjust based on available GPU resources.
- `num_epochs`: The number of training epochs.
- `batch_size`: The batch size used during training.
- `results_fp`: The file path to save the results to.

### How to Tune

```bash
export DATASET_LOC="path/to/dataset"
export INITIAL_PARAMS='{"dropout_p": 0.5, "lr": 1e-4, "lr_factor": 0.8, "lr_patience": 5}'
python3 scripts/tune.py \
    --experiment_name "llm_tune" \
    --dataset_loc "$DATASET_LOC" \
    --initial_params "$INITIAL_PARAMS" \
    --num_workers 1 \
    --cpu_per_worker 1 \
    --gpu_per_worker 0 \
    --num_runs 1 \
    --grace_period 1 \
    --num_epochs 1 \
    --batch_size 128 \
    --results_fp results.json
```

- `num_runs`: The number of tuning runs to perform.
- `grace_period`: The grace period for early stopping during hyperparameter tuning.

**Note**: adjust the values of the `--num_workers`, `--cpu_per_worker`, and `--gpu_per_worker` parameters above according to the resources available on your system.

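
Both `--train_loop_config` and `--initial_params` take a JSON string, so a malformed value only surfaces once the script starts. The sketch below shows one way such a string could be parsed and sanity-checked before launching a run. The helper name `parse_loop_config`, the idea that all four keys are required, and the value ranges are assumptions for illustration; the project's scripts may handle this differently.

```python
import json

# Keys taken from the example config above; treating them as required is an assumption.
EXPECTED_KEYS = {"dropout_p", "lr", "lr_factor", "lr_patience"}


def parse_loop_config(raw: str) -> dict:
    """Parse a JSON config string and run a few basic sanity checks on it."""
    config = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(config, dict):
        raise ValueError("config must be a JSON object")
    missing = EXPECTED_KEYS - set(config)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if not 0.0 <= config["dropout_p"] < 1.0:
        raise ValueError("dropout_p must be in [0, 1)")
    if config["lr"] <= 0:
        raise ValueError("lr must be positive")
    return config


if __name__ == "__main__":
    raw = '{"dropout_p": 0.5, "lr": 1e-4, "lr_factor": 0.8, "lr_patience": 5}'
    print(parse_loop_config(raw))
```

In a shell session the same string could be read from the exported variable with `os.environ["TRAIN_LOOP_CONFIG"]` instead of being hard-coded.
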
### Experiment Tracking with MLflow

```bash
mlflow server -h 0.0.0.0 -p 8080 --backend-store-uri /path/to/mlflow/folder
```

### Evaluation

```bash
export RUN_ID=YOUR_MLFLOW_EXPERIMENT_RUN_ID
python3 evaluate.py --run_id $RUN_ID --dataset_loc "path/to/dataset" --results_fp results.json
```

```json
{
  "timestamp": "January 22, 2024 09:57:12 AM",
  "precision": 0.9163323229539818,
  "recall": 0.9124083769633508,
  "f1": 0.9137224104301406,
  "num_samples": 1000.0
}
```

- `run_id`: ID of the specific MLflow run to load from.

### Inference

```bash
python3 predict.py --run_id $RUN_ID --headline "Airport Guide: Chicago O'Hare" --keyword "destination"
```

```json
[
  {
    "prediction": "TRAVEL",
    "probabilities": {
      "BUSINESS": 0.0024151806719601154,
      "ENTERTAINMENT": 0.002721842611208558,
      "FOOD & DRINK": 0.001193400239571929,
      "PARENTING": 0.0015436559915542603,
      "POLITICS": 0.0012392215430736542,
      "SPORTS": 0.0020724297501146793,
      "STYLE & BEAUTY": 0.0018642042996361852,
      "TRAVEL": 0.9841892123222351,
      "WELLNESS": 0.0013303911546245217,
      "WORLD NEWS": 0.0014305398799479008
    }
  }
]
```

### Application

```bash
python3 app.py --run_id $RUN_ID --num_cpus 2
```

Now we can send requests to the application:

```python
import json

import requests

headline = "Reboot Your Skin For Spring With These Facial Treatments"
keywords = "skin-facial-treatments"

# Serialize the payload and POST it to the running service.
json_data = json.dumps({"headline": headline, "keywords": keywords})
out = requests.post("http://127.0.0.1:8010/predict", data=json_data).json()
print(out["results"][0])
```

```json
{
  "prediction": "STYLE & BEAUTY",
  "probabilities": {
    "BUSINESS": 0.002265132963657379,
    "ENTERTAINMENT": 0.008689943701028824,
    "FOOD & DRINK": 0.0011296054581180215,
    "PARENTING": 0.002621663035824895,
    "POLITICS": 0.002141285454854369,
    "SPORTS": 0.0017548275645822287,
    "STYLE & BEAUTY": 0.9760453104972839,
    "TRAVEL": 0.0024237297475337982,
    "WELLNESS": 0.001382972695864737,
    "WORLD NEWS": 0.0015455639222636819
  }
}
```

### Testing the Code

To test the code against asserted inputs and outputs:

```bash
python3 -m pytest tests/code --verbose --disable-warnings
```

To test the model's behaviour:

```bash
python3 -m pytest --run-id $RUN_ID tests/model --verbose --disable-warnings
```

### Workload

To execute all stages of this project with a single command, a `workload.sh` script is provided. Change the resource parameters (cpu_nums, gpu_nums, etc.) to suit your needs.

```bash
bash workload.sh
```

### Extras

Makefile targets to format the scripts and clean the directories:

```bash
make style && make clean
```

Serve the documentation for functions and classes:

```bash
python3 -m mkdocs serve
```