Model Overview
This model performs abstractive summarization of Python data science code into English natural language. It is finetuned from google/flan-t5-small on a subset of Meta Kaggle for Code labeled by a 43B model.
Model Architecture
This model was finetuned from google/flan-t5-small and shares its architecture and tokenizer.
Training
Code cells were extracted from Jupyter Notebooks, split into chunks of ~500 tokens, and labeled by a 43B model with the prompt: "Think step by step and then provide a two or three sentence summary of what the code is doing for an audience who may not be familiar with machine learning. Focus on the problem the authors' are trying to solve."
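The extraction and chunking step can be sketched as follows. This is a minimal illustration, not the pipeline actually used for training: the function names `extract_code_cells` and `chunk_cells` are hypothetical, cells are packed greedily (an assumption), and the default token counter is a whitespace split standing in for the real tokenizer.

```python
import json


def extract_code_cells(notebook_text):
    """Return the source of every non-empty code cell in an .ipynb JSON string."""
    notebook = json.loads(notebook_text)
    cells = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        # The notebook format stores source as one string or as a list of lines.
        if isinstance(source, list):
            source = "".join(source)
        if source.strip():
            cells.append(source)
    return cells


def chunk_cells(cells, max_tokens=500, count_tokens=None):
    """Greedily pack whole cells into chunks of roughly max_tokens tokens.

    A single cell longer than max_tokens is kept as its own oversized chunk.
    """
    if count_tokens is None:
        # Stand-in counter (assumption): whitespace tokens, not model tokens.
        count_tokens = lambda text: len(text.split())
    chunks, current, current_len = [], [], 0
    for cell in cells:
        n = count_tokens(cell)
        # Start a new chunk when adding this cell would exceed the budget.
        if current and current_len + n > max_tokens:
            chunks.append("\n".join(current))
            current, current_len = [], 0
        current.append(cell)
        current_len += n
    if current:
        chunks.append("\n".join(current))
    return chunks
```

To count real model tokens instead, one could pass something like `count_tokens=lambda text: len(tokenizer.encode(text))` with the flan-t5-small tokenizer loaded.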
Datasets
All code was extracted from .ipynb files that are part of the Meta Kaggle for Code dataset.
Tokenizer Construction
The tokenizer was not modified from the standard google/flan-t5-small tokenizer.
How to Use this Model
The model is available for use in the transformers library and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.
## Generating summaries with this model
```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Placeholder: replace with the Hub ID or local path of this model.
model_checkpoint = "<model-id-or-local-path>"
ipynb_string = "import pandas as pd\nimport numpy as np"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)
chunk_ids = tokenizer.encode("summarize: ```" + ipynb_string + "```", return_tensors="pt", truncation=True, padding="max_length", max_length=512)
output_tokens = model.generate(chunk_ids, max_length=128)
output_text = tokenizer.decode(output_tokens[0], skip_special_tokens=True)
print(output_text)
```
Input
This model accepts up to 512 tokens from the associated tokenizer. Prefix the input with `summarize: ` and wrap the code in a markdown code block (triple backticks).
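As a small sketch of this input format, the helper below builds the prompt string in the same way as the usage snippet above. The name `build_model_input` is hypothetical and not part of the model or the transformers library; the fence is assembled from single backticks only so that this example can itself sit inside a code block.

```python
# Three backticks, built programmatically to avoid a literal fence in this block.
FENCE = "`" * 3


def build_model_input(code):
    """Prefix the code with 'summarize: ' and wrap it in a markdown code fence."""
    return "summarize: " + FENCE + code + FENCE


prompt = build_model_input("import pandas as pd\nimport numpy as np")
```

The resulting `prompt` can then be passed to the tokenizer with `truncation=True` and `max_length=512`, so that inputs over the limit are cut off rather than rejected.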
Output
This model produces short natural-language summaries of Python data science code.
Limitations
The Flan-T5-Small architecture was chosen to maximize portability, but summaries may sometimes be repetitive, incomplete, or too abstract. Because the model was finetuned on Kaggle notebooks, it will perform better on code from that distribution.