
## Model Overview

This model performs abstractive summarization of Python data science code into English natural language. It is fine-tuned from google/flan-t5-small on a subset of the Meta Kaggle for Code dataset labeled by a 43B model.

## Model Architecture

This model was fine-tuned from google/flan-t5-small and shares its architecture and tokenizer.

## Training

Code cells were extracted from Jupyter notebooks, chunked into segments of roughly 500 tokens, and labeled by a 43B model with the prompt: "Think step by step and then provide a two or three sentence summary of what the code is doing for an audience who may not be familiar with machine learning. Focus on the problem the authors are trying to solve."
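
The exact labeling interface is not part of this card, so the sketch below uses a hypothetical `label_with_llm` callable standing in for the 43B labeler; the chunk size and prompt follow the description above.

```python
from transformers import AutoTokenizer

LABEL_PROMPT = (
    "Think step by step and then provide a two or three sentence summary of "
    "what the code is doing for an audience who may not be familiar with "
    "machine learning. Focus on the problem the authors are trying to solve."
)

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")

def chunk_code(code: str, max_tokens: int = 500) -> list[str]:
    """Split extracted code into ~500-token chunks with the model tokenizer."""
    ids = tokenizer.encode(code, add_special_tokens=False)
    return [
        tokenizer.decode(ids[i : i + max_tokens])
        for i in range(0, len(ids), max_tokens)
    ]

def build_examples(code: str, label_with_llm) -> list[dict]:
    """Pair each chunk with a summary from the (hypothetical) 43B labeler."""
    return [
        {"input": chunk, "target": label_with_llm(LABEL_PROMPT, chunk)}
        for chunk in chunk_code(code)
    ]
```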

## Datasets

All code was extracted from .ipynb files that are part of the Meta Kaggle for Code dataset.
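
Notebook files are plain JSON, so code cells can be pulled out with the standard library alone. A minimal sketch (the card does not describe the exact extraction pipeline):

```python
import json
from pathlib import Path

def extract_code_cells(ipynb_path: str) -> str:
    """Concatenate the source of every code cell in a notebook."""
    nb = json.loads(Path(ipynb_path).read_text(encoding="utf-8"))
    return "\n\n".join(
        "".join(cell["source"])
        for cell in nb.get("cells", [])
        if cell.get("cell_type") == "code"
    )
```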

## Tokenizer Construction

The tokenizer was not modified from the standard google/flan-t5-small tokenizer.

## How to Use this Model

The model can be loaded with the transformers library and used as a pre-trained checkpoint for inference or as a starting point for fine-tuning on another dataset; sketches of both follow below.


## Generating summaries with this model

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_checkpoint = "path/to/this-model"  # replace with this model's repo id or a local path

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

# Prefix with "summarize: " and wrap the code in a markdown fence (see Input below).
ipynb_string = "import pandas as pd\nimport numpy as np"
inputs = tokenizer(
    "summarize: ```" + ipynb_string + "```",
    return_tensors="pt", truncation=True, max_length=512,
)
output_tokens = model.generate(**inputs, max_length=128)
output_text = tokenizer.decode(output_tokens[0], skip_special_tokens=True)
print(output_text)
```
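
## Fine-tuning on another dataset

The card does not prescribe a fine-tuning recipe; the following is a minimal sketch using `Seq2SeqTrainer`, assuming a `datasets.Dataset` with `input` and `target` text columns (column names chosen here for illustration).

```python
from datasets import Dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model_checkpoint = "path/to/this-model"  # replace with this model's repo id
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

# Toy stand-in for your own data: pairs of code chunks and reference summaries.
train_dataset = Dataset.from_dict({
    "input": ["import pandas as pd\ndf = pd.read_csv('train.csv')"],
    "target": ["Loads a CSV file into a pandas DataFrame."],
})

def preprocess(batch):
    # Same input format as inference: "summarize: " prefix plus a code fence.
    texts = ["summarize: ```" + x + "```" for x in batch["input"]]
    enc = tokenizer(texts, truncation=True, max_length=512)
    enc["labels"] = tokenizer(
        text_target=batch["target"], truncation=True, max_length=128
    )["input_ids"]
    return enc

tokenized = train_dataset.map(
    preprocess, batched=True, remove_columns=train_dataset.column_names
)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="finetuned", num_train_epochs=3),
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```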

## Input

This model accepts up to 512 tokens from the associated tokenizer. Prefix the input with "summarize: " and wrap the code in a markdown code fence ("```"), as in the generation example above.
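
Longer notebooks exceed the 512-token window. One workable approach (not specified by the card) is to summarize chunk by chunk, reusing the `tokenizer` and `model` loaded in the generation example above:

```python
def summarize_long(code: str, chunk_tokens: int = 480) -> list[str]:
    """Summarize code longer than 512 tokens one chunk at a time.

    chunk_tokens stays below 512 to leave room for the "summarize: "
    prefix and the code-fence wrapper.
    """
    ids = tokenizer.encode(code, add_special_tokens=False)
    summaries = []
    for i in range(0, len(ids), chunk_tokens):
        chunk = tokenizer.decode(ids[i : i + chunk_tokens])
        enc = tokenizer(
            "summarize: ```" + chunk + "```",
            return_tensors="pt", truncation=True, max_length=512,
        )
        out = model.generate(**enc, max_length=128)
        summaries.append(tokenizer.decode(out[0], skip_special_tokens=True))
    return summaries
```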

## Output

This model produces short natural-language summaries of Python data science code.

## Limitations

The Flan-T5-Small architecture was chosen to maximize portability, but summaries may sometimes be repetitive, incomplete, or overly abstract. Because the model was fine-tuned on Kaggle notebooks, it performs best on code from that distribution.
