BioGptTokenizer, BioGptLMHeadModel don't exist yet in transformers

#1
by tdekelver - opened

Hi @kamalkraj ,

Thanks a lot for the contribution! However, it seems that BioGptTokenizer and BioGptLMHeadModel are not implemented in transformers yet. Is this expected?

Thanks in advance for the help,
Kind regards,
tdekelver

Hi @tdekelver ,

The PR has not been merged into the main branch yet. For experiments, you can install transformers directly from the PR: https://github.com/huggingface/transformers/pull/20420
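For example, you can install from the BioGPT branch of my fork (which backs that PR), along with sacremoses for the tokenizer:

pip install git+https://github.com/kamalkraj/transformers.git@BioGPT
pip install sacremoses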

Thanks,
Kamal

Hi Kamal,

Thanks! I just tried it out and wanted to fine-tune the model on my own dataset (2 classes), but I get an error when I try to train it. Can you help me?
My code is below:

! pip install git+https://github.com/kamalkraj/transformers.git@BioGPT
! pip install sacremoses

from transformers import BioGptTokenizer, BioGptForCausalLM, TrainingArguments, Trainer
import evaluate 

model = BioGptForCausalLM.from_pretrained("kamalkraj/biogpt", num_labels=2)
tokenizer = BioGptTokenizer.from_pretrained("kamalkraj/biogpt", use_fast=True)
clf_metrics = evaluate.combine(["accuracy", "f1", "precision", "recall"])

args = TrainingArguments(
    "biogpt-finetuned",
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=5,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model='f1',
    push_to_hub=False,
    report_to='mlflow'
)

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = predictions[:, 0]
    return clf_metrics.compute(predictions=predictions, references=labels)

trainer = Trainer(
    model,
    args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset['valid'],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
  )

trainer.train()

and the last line (the trainer.train() call) gives me the following error:

The following columns in the training set don't have a corresponding argument in `BioGptForCausalLM.forward` and have been ignored: text, abstract, title, BERT_txt, authors, journals, keywords, sources, file. If text, abstract, title, BERT_txt, authors, journals, keywords, sources, file are not expected by `BioGptForCausalLM.forward`,  you can safely ignore this message.
/usr/local/lib/python3.7/dist-packages/transformers/optimization.py:310: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  FutureWarning,
***** Running training *****
  Num examples = 2820
  Num Epochs = 5
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 4
  Gradient Accumulation steps = 1
  Total optimization steps = 3525
  Number of trainable parameters = 346763264

---------------------------------------------------------------------------

IndexError                                Traceback (most recent call last)

<ipython-input-20-3435b262f1ae> in <module>
----> 1 trainer.train()

5 frames

/usr/local/lib/python3.7/dist-packages/transformers/trainer.py in train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   1549             resume_from_checkpoint=resume_from_checkpoint,
   1550             trial=trial,
-> 1551             ignore_keys_for_eval=ignore_keys_for_eval,
   1552         )
   1553 

/usr/local/lib/python3.7/dist-packages/transformers/trainer.py in _inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
   1793                         tr_loss_step = self.training_step(model, inputs)
   1794                 else:
-> 1795                     tr_loss_step = self.training_step(model, inputs)
   1796 
   1797                 if (

/usr/local/lib/python3.7/dist-packages/transformers/trainer.py in training_step(self, model, inputs)
   2552 
   2553         with self.compute_loss_context_manager():
-> 2554             loss = self.compute_loss(model, inputs)
   2555 
   2556         if self.args.n_gpu > 1:

/usr/local/lib/python3.7/dist-packages/transformers/trainer.py in compute_loss(self, model, inputs, return_outputs)
   2584         else:
   2585             labels = None
-> 2586         outputs = model(**inputs)
   2587         # Save past state if it exists
   2588         # TODO: this needs to be fixed and made cleaner later.

/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   1128         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1129                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1130             return forward_call(*input, **kwargs)
   1131         # Do not call functions when jit is used
   1132         full_backward_hooks, non_full_backward_hooks = [], []

/usr/local/lib/python3.7/dist-packages/transformers/models/biogpt/modeling_biogpt.py in forward(self, input_ids, attention_mask, head_mask, inputs_embeds, past_key_values, labels, use_cache, output_attentions, output_hidden_states, return_dict)
    685             # we are doing next-token prediction; shift prediction scores and input ids by one
    686             shifted_prediction_scores = prediction_scores[:, :-1, :].contiguous()
--> 687             labels = labels[:, 1:].contiguous()
    688             loss_fct = CrossEntropyLoss()
    689             lm_loss = loss_fct(shifted_prediction_scores.view(-1, self.config.vocab_size), labels.view(-1))

IndexError: too many indices for tensor of dimension 1

Can you help me?

Hi @tdekelver ,

BioGptForCausalLM is not meant for sequence classification tasks; it is only for generating text. That is also why training fails: a causal LM expects token-level labels with the same shape as input_ids (batch_size, sequence_length) for next-token prediction, so the 1-D class labels in your dataset trigger the IndexError at labels[:, 1:].

Neither the original implementation nor the current HF port includes a sequence classification head. Once the original PR is merged, I will add support for it.
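For reference, here is a minimal text-generation sketch with the same checkpoint used in your snippet (kamalkraj/biogpt):

import torch
from transformers import BioGptTokenizer, BioGptForCausalLM

tokenizer = BioGptTokenizer.from_pretrained("kamalkraj/biogpt")
model = BioGptForCausalLM.from_pretrained("kamalkraj/biogpt")

# Encode a biomedical prompt and generate a continuation
inputs = tokenizer("COVID-19 is", return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=30)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))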

Thanks.

Ah okay, thanks!

tdekelver changed discussion status to closed
