AutoTrain documentation
Token Classification
Token Classification
Token classification is the task of classifying each token in a sequence. This can be used for Named Entity Recognition (NER), Part-of-Speech (POS) tagging, and more. Get your data ready in proper format and then with just a few clicks, your state-of-the-art model will be ready to be used in production.
Data Format
The data should be in the following CSV format:
tokens,tags
"['I', 'love', 'Paris']", "['O', 'O', 'B-LOC']"
"['I', 'live', 'in', 'New', 'York']", "['O', 'O', 'O', 'B-LOC', 'I-LOC']"
.
.
.or you can also use JSONL format:
{"tokens": ["I", "love", "Paris"], "tags": ["O", "O", "B-LOC"]}
{"tokens": ["I", "live", "in", "New", "York"], "tags": ["O", "O", "O", "B-LOC", "I-LOC"]}
.
.
.As you can see, we have two columns in the CSV file. One column is the tokens and the other is the tags. Both the columns are stringified lists! The tokens column contains the tokens of the sentence and the tags column contains the tags for each token.
If your CSV is huge, you can divide it into multiple CSV files and upload them separately. Please make sure that the column names are the same in all CSV files.
One way to divide the CSV file using pandas is as follows:
import pandas as pd
# Set the chunk size
chunk_size = 1000
i = 1
# Open the CSV file and read it in chunks
for chunk in pd.read_csv('example.csv', chunksize=chunk_size):
    # Save each chunk to a new file
    chunk.to_csv(f'chunk_{i}.csv', index=False)
    i += 1Columns
Your CSV/JSONL dataset must have two columns: tokens and tags.