metadata

license: apache-2.0
datasets:
  - yuan-tian/chartgpt-dataset
language:
  - en
metrics:
  - rouge
pipeline_tag: text2text-generation
base_model:
  - google/flan-t5-xl
new_version: yuan-tian/chartgpt-llama3

Model Card for ChartGPT

Model Details

Model Description

This model is used to generate charts from natural language. For more information, please refer to the paper.

Model type: Language model
Language(s) (NLP): English
License: Apache 2.0
Finetuned from model: FLAN-T5-XL
Research paper: ChartGPT: Leveraging LLMs to Generate Charts from Abstract Natural Language

Model Input Format

The model input at step x is formatted as follows, where each <...> is a separator token.

{table name}
<head> {column names}
<type> {column types}
<data> {data row 1} <line> {data row 2} <line>
<utterance> {NL utterance}
<ans>
<sep> {Step 1 prompt} {Answer 1}
...
<sep> {Step x-1 prompt} {Answer x-1}
<sep> {Step x prompt}

Given this input, the model outputs the answer for step x.

The prompts for steps 1 to 6 are:

Step 1. Select the columns:
Step 2. Add filter:
Step 3. Add aggregations: 
Step 4. Select chart type:
Step 5. Choose encoding:
Step 6. Add sort:
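
The format above can be assembled programmatically. The following is a minimal sketch, not the authors' preprocessing code: the function name `build_input` and its parameters are hypothetical, and the whitespace (single-line output, comma-joined values, a `<line>` token after every row) is an assumption that mirrors the single-line example string used later in this card. Whether the model is sensitive to other spacing is not stated here.

```python
from typing import Sequence, Tuple


def build_input(
    table_name: str,
    columns: Sequence[str],
    types: Sequence[str],
    rows: Sequence[Sequence[object]],
    utterance: str,
    previous: Sequence[Tuple[str, str]],
    step_prompt: str,
) -> str:
    """Serialize a table, an utterance, and the step history into one string.

    `previous` holds (step prompt, answer) pairs for steps 1..x-1 and
    `step_prompt` is the prompt for the step x being queried.
    """
    if len(columns) != len(types):
        raise ValueError("columns and types must have the same length")

    # Each data row is comma-joined and terminated by a <line> token.
    data = "".join(",".join(str(v) for v in row) + " <line> " for row in rows)

    text = (
        f"{table_name} <head> {','.join(columns)} "
        f"<type> {','.join(types)} "
        f"<data> {data} "
        f"<utterance> {utterance} <ans>"
    )

    # Earlier steps are appended as "<sep> {prompt} {answer}".
    for prompt, answer in previous:
        text += f" <sep> {prompt} {answer}"

    # The step being queried ends the input, with no answer after it.
    return text + f" <sep> {step_prompt}"
```

The step prompt is passed in verbatim, so the caller decides which prompt wording to send. For step 1, `previous` is an empty list.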

How to Get Started with the Model

Running the Model on a GPU

The example below uses a movie dataset with the utterance "What kinds of movies are the most popular?" and asks the model for the answer to step 1 (select columns). Run it to check that the model loads and generates output on your machine.

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load the tokenizer and model; device_map="auto" places the weights on the available device(s).
tokenizer = AutoTokenizer.from_pretrained("yuan-tian/chartgpt")
model = AutoModelForSeq2SeqLM.from_pretrained("yuan-tian/chartgpt", device_map="auto")

# Serialized table, utterance, and the step 1 prompt.
input_text = "movies <head> Title,Worldwide_Gross,Production_Budget,Release_Year,Content_Rating,Running_Time,Major_Genre,Creative_Type,Rotten_Tomatoes_Rating,IMDB_Rating <type> nominal,quantitative,quantitative,temporal,nominal,quantitative,nominal,nominal,quantitative,quantitative <data> From Dusk Till Dawn,25728961,20000000,1996,R,107,Horror,Fantasy,63,7.1 <line> Broken Arrow,148345997,65000000,1996,R,108,Action,Contemporary Fiction,55,5.8 <line>  <utterance> What kinds of movies are the most popular? <ans> <sep> Step 1. Select the columns:"

# Move the inputs to the model's device instead of hard-coding "cuda".
inputs = tokenizer(input_text, return_tensors="pt", padding=True).to(model.device)
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
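
Because the input for step x contains the prompts and answers of all earlier steps, the six steps can be run as a loop that feeds each answer back into the next input. The sketch below illustrates that loop under two assumptions: that the model's own answers are the ones fed back at inference time, and that parts are joined with single spaces as in the example string. The names `run_steps` and `answer_fn` are hypothetical, and `answer_fn` stands in for any callable that maps an input string to the model's decoded answer.

```python
from typing import Callable, List, Sequence


def run_steps(
    base_input: str,
    step_prompts: Sequence[str],
    answer_fn: Callable[[str], str],
) -> List[str]:
    """Query the steps in order, appending each answer to the running input.

    `base_input` is the serialized table and utterance, ending with "<ans>".
    """
    history = base_input
    answers: List[str] = []
    for prompt in step_prompts:
        # The query for this step ends with its prompt and no answer.
        query = f"{history} <sep> {prompt}"
        answer = answer_fn(query).strip()
        answers.append(answer)
        # The answer becomes part of the input for the next step.
        history = f"{query} {answer}"
    return answers
```

In practice, `answer_fn` could wrap the `tokenizer(...)`, `model.generate(...)`, and `tokenizer.decode(...)` calls from the example above, and `base_input` would be the part of `input_text` up to and including `<ans>`.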

Training Details

Training Data

This model is fine-tuned from FLAN-T5-XL on the chartgpt-dataset.

Training Procedure

The preprocessing and training procedure will be documented in a future update.

Citation

BibTeX:

@article{tian2024chartgpt,
  title={ChartGPT: Leveraging LLMs to Generate Charts from Abstract Natural Language},
  author={Tian, Yuan and Cui, Weiwei and Deng, Dazhen and Yi, Xinjing and Yang, Yurun and Zhang, Haidong and Wu, Yingcai},
  journal={IEEE Transactions on Visualization and Computer Graphics},
  year={2024},
  pages={1-15},
  doi={10.1109/TVCG.2024.3368621}
}