Data Collator class to use for BLOOM

#238 opened by monta

Do we need to use DataCollatorForLanguageModeling with the EOS (end-of-sequence) token as the padding token for BLOOM?

The Causal language modeling guide says:

Now create a batch of examples using DataCollatorForLanguageModeling. It’s more efficient to dynamically pad the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length. Use the end-of-sequence token as the padding token and set mlm=False. This will use the inputs as labels shifted to the right by one element:

from transformers import DataCollatorForLanguageModeling

tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
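
For reference, here is a minimal sketch of what I understand the collator to do with mlm=False (bigscience/bloom-560m is used only as an example checkpoint):

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# Two features of different lengths, containing only the tokenizer outputs.
features = [
    tokenizer("Summarize: a short document"),
    tokenizer("Summarize: a somewhat longer document about something else"),
]

# input_ids and attention_mask are padded to the longest sequence in the batch;
# labels are created as a copy of input_ids with padded positions set to -100.
batch = data_collator(features)
print({key: value.shape for key, value in batch.items()})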

However, if I use DataCollatorForLanguageModeling, I get the error:

ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (`labels` in this case) have excessive nesting (inputs type `list` where type `int` is expected).
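
A stripped-down sketch that, as far as I can tell, triggers the same error (again with bigscience/bloom-560m only as an example checkpoint):

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# Each example already carries a "labels" list copied from input_ids,
# mirroring the tokenize_prompt function shown below.
features = []
for text in ["a short prompt", "a noticeably longer prompt with more tokens in it"]:
    encoded = tokenizer(text)
    encoded["labels"] = encoded["input_ids"].copy()
    features.append(encoded)

# The collator pads input_ids and attention_mask, but the pre-set "labels"
# lists keep their unequal lengths, which seems to be what the ValueError
# above is complaining about.
batch = data_collator(features)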

Environment

!cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 10 (buster)"

!transformers-cli env
- `transformers` version: 4.28.0
- Platform: Linux-4.14.309-231.529.amzn2.x86_64-x86_64-with-debian-10.6
- Python version: 3.7.10
- Huggingface_hub version: 0.13.4
- Safetensors version: not installed
- PyTorch version (GPU?): 1.13.1+cu117 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: YES
- Using distributed or parallel set-up in script?: <fill in>

Code for Tokenization

import re
from typing import Callable, Dict, List

from datasets import load_dataset

# template, tokenizer, and the MAX_* / NUM_CPUS constants are defined
# elsewhere in the notebook.
DATASET_STREAMING: bool = False
train = load_dataset("xsum", split="train", streaming=DATASET_STREAMING)

# --------------------------------------------------------------------------------
# Function to generate prompt from XSUM dataset
# --------------------------------------------------------------------------------
def get_convert_to_prompt(template: Template) -> Callable:
    def _convert_to_prompt(example: Dict[str, str]) -> Dict[str, str]:
        """Generate prompt as a dictionary:
        {
            "prompt": "Summarize: <document>\n<summary>"
        }

        Args:
            example: single {document, summary} pair to be able to apply template
        Returns: a dictionary of prompt
        """
        # assert isinstance(example, dict), f"expected dict but {type(example)}.\n{example}"
        assert isinstance(example['document'], str), f"expected str but {type(example['document'])}."

        prompt, response = template.apply(example=example, truncate=False)
        return {
            "prompt": " ".join(
                re.sub(r'[\s\'\"]+', ' ', prompt).split(' ')[:MAX_REQUEST_LENGTH-1]  # -1 for \n
            ) + "\n" + " ".join(
                re.sub(r'[\s\'\"]+', ' ', response).split(' ')[:MAX_RESPONSE_LENGTH-1]
            ) + "\n"
        }

    return _convert_to_prompt

convert_to_prompt: Callable = get_convert_to_prompt(template=template)

# --------------------------------------------------------------------------------
# Function to tokenize prompt
# --------------------------------------------------------------------------------
def tokenize_prompt(example):
    """Generate the model inputs in the dictionary with format:
    {
        "input_ids": List[int], 
        "attention_mask": List[int]",
        "labels": List[int]
    }
    
    Args:
        example:   a dictionary of format {
            "prompt": "Summarize:<document>\n<summary>\n",
        }
    """    
    assert isinstance(example['prompt'], str), f"expected str, got {type(example['prompt'])}"
    inputs: Dict[str, List[int]] = tokenizer(
        example['prompt'], 
        max_length=MAX_TOKEN_LENGTH,   
        truncation=True,
        # padding='max_length',
    )
    inputs["labels"] = inputs["input_ids"].copy()   # Casual LM get the same tokens as inputs and label
    
    return inputs

remove_column_names: List[str] = list(train.features.keys())

# --------------------------------------------------------------------------------
# Tokenization by applying function
# --------------------------------------------------------------------------------
tokenized_train = train.map(
    function=convert_to_prompt, 
    batched=False,
    remove_columns=remove_column_names,
    num_proc=NUM_CPUS
).map(
    function=tokenize_prompt, 
    batched=False,
    remove_columns=['prompt'],
    num_proc=NUM_CPUS
).shuffle(
    seed=42
).with_format(
    "torch"
)
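
Because no padding is applied at tokenization time, each example keeps its own length; a quick check (sketch):

# Each example has its own length, and "labels" mirrors "input_ids" exactly.
for i in range(3):
    sample = tokenized_train[i]
    print(len(sample["input_ids"]), len(sample["attention_mask"]), len(sample["labels"]))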

Training:

from transformers import Trainer, TrainingArguments

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,
    return_tensors='pt'
)

training_args = TrainingArguments(
    output_dir="bloom_finetuned",
    max_steps=MAX_STEPS,
    num_train_epochs=3,
    per_device_train_batch_size=1,
#    per_device_eval_batch_size=1,
    learning_rate=2e-5,
    weight_decay=0.01, 
    fp16=USE_FLOAT16,
    no_cuda=False,
#    evaluation_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)
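
To isolate the issue from the training loop, the collator can also be run by hand on a couple of tokenized samples (sketch); this is where the ValueError seems to surface, before the model is ever called:

# Sanity check: collate two samples directly with the data collator.
samples = [tokenized_train[i] for i in range(2)]
batch = data_collator(samples)
print({key: value.shape for key, value in batch.items()})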
