Understanding Column Mapping
Column mapping is a critical setup process in AutoTrain that informs the system about the roles of different columns in your dataset. Whether it’s a tabular dataset, text classification data, or another type, the need for precise column mapping ensures that AutoTrain processes each dataset element correctly.
How Column Mapping Works
AutoTrain has no way of knowing what the columns in your dataset represent. AutoTrain requires a clear understanding of each column’s function within your dataset to train models effectively. This is managed through a straightforward mapping system in the user interface, represented as a dictionary. Here’s a typical example:
{"text": "text", "label": "target"}
In this example, the text
column in your dataset corresponds to the text data
AutoTrain uses for processing, and the target
column is treated as the
label for training.
But let’s not get confused! AutoTrain has a way to understand what each column in your dataset represents. If your data is already in AutoTrain format, you dont need to change column mappings. If not, you can easily map the columns in your dataset to the correct AutoTrain format.
In the UI, you will see column mapping as a dictionary:
{"text": "text", "label": "target"}
Here, the column text
in your dataset is mapped to the AutoTrain column text
,
and the column target
in your dataset is mapped to the AutoTrain column label
.
Let’s say you are training a text classification model and your dataset has the following columns:
full_text, target_sentiment
"this movie is great", positive
"this movie is bad", negative
You can map these columns to the AutoTrain format as follows:
{"text": "full_text", "label": "target_sentiment"}
If your dataset has the columns: text
and label
, you don’t need to change the column mapping.
Let’s take a look at column mappings for each task:
LLM
Note: For all LLM tasks, if the text column(s) is not formatted i.e. if contains samples in chat format (dict or json), then you
should use chat_template
parameter. Read more about it in LLM Parameters Section.
SFT / Generic Trainer
{"text": "text"}
text
: The column in your dataset that contains the text data.
Reward Trainer
{"text": "text", "rejected_text": "rejected_text"}
text
: The column in your dataset that contains the text data.
rejected_text
: The column in your dataset that contains the rejected text data.
DPO / ORPO Trainer
{"prompt": "prompt", "text": "text", "rejected_text": "rejected_text"}
prompt
: The column in your dataset that contains the prompt data.
text
: The column in your dataset that contains the text data.
rejected_text
: The column in your dataset that contains the rejected text data.
Text Classification & Regression, Seq2Seq
For text classification and regression, the column mapping should be as follows:
{"text": "dataset_text_column", "label": "dataset_target_column"}
text
: The column in your dataset that contains the text data.
label
: The column in your dataset that contains the target variable.
Token Classification
{"text": "tokens", "label": "tags"}
text
: The column in your dataset that contains the tokens. These tokens must be a list of strings.
label
: The column in your dataset that contains the tags. These tags must be a list of strings.
For token classification, if you are using a CSV, make sure that the columns are stringified lists.
Tabular Classification & Regression
{"id": "id", "label": ["target"]}
id
: The column in your dataset that contains the unique identifier for each row.
label
: The column in your dataset that contains the target variable. This should be a list of strings.
For a single target column, you can pass a list with a single element.
For multiple target columns, e.g. a multi label classification task, you can pass a list with multiple elements.
Image Classification
For image classification, the column mapping should be as follows:
{"image": "image_column", "label": "label_column"}
Image classification requires column mapping only when you are using a dataset from Hugging Face Hub. For uploaded datasets, leave column mapping as it is.
Sentence Transformers
For all sentence transformers tasks, one needs to map columns to sentence1_column
, sentence2_column
, sentence3_column
& target_column
column.
Not all columns need to be mapped for all trainers of sentence transformers.
pair :
{"sentence1_column": "anchor", "sentence2_column": "positive"}
pair_class :
{"sentence1_column": "premise", "sentence2_column": "hypothesis", "target_column": "label"}
pair_score :
{"sentence1_column": "sentence1", "sentence2_column": "sentence2", "target_column": "score"}
triplet :
{"sentence1_column": "anchor", "sentence2_column": "positive", "sentence3_column": "negative"}
qa :
{"sentence1_column": "query", "sentence2_column": "answer"}
Extractive Question Answering
For extractive question answering, the column mapping should be as follows:
{"text": "context", "question": "question", "answer": "answers"}
where answer
is a dictionary with keys text
and answer_start
.
Ensuring Accurate Mapping
To ensure your model trains correctly:
Verify Column Names: Double-check that the names used in the mapping dictionary accurately reflect those in your dataset.
Format Appropriately: Especially in token classification, ensure your data format matches expectations (e.g., lists of strings).
Update Mappings for New Datasets: Each new dataset might require its unique mappings based on its structure and the task at hand.
By following these guidelines and using the provided examples as templates, you can effectively instruct AutoTrain on how to interpret and handle your data for various machine learning tasks. This process is fundamental for achieving optimal results from your model training endeavors.
< > Update on GitHub