Callbacks
Callbacks are objects that can customize the behavior of the training loop in the PyTorch Trainer (this feature is not yet implemented in TensorFlow). They can inspect the state of the training loop (for progress reporting, logging to TensorBoard or other ML platforms, etc.) and take decisions (like early stopping).
Callbacks are "read only" pieces of code: apart from the TrainerControl object they return, they cannot change anything in the training loop. For customizations that require changes in the training loop, you should subclass Trainer and override the methods you need (see trainer for examples).
By default, TrainingArguments.report_to is set to "all", so a Trainer will use the following callbacks.
- DefaultFlowCallback handles the default behavior for logging, saving and evaluation.
- PrinterCallback or ProgressCallback displays progress and prints the logs (the first one is used if you deactivate tqdm through TrainingArguments, otherwise the second one is used).
- TensorBoardCallback is used if TensorBoard is accessible (either through PyTorch >= 1.4 or tensorboardX).
- WandbCallback is used if wandb is installed.
- CometCallback is used if comet_ml is installed.
- MLflowCallback is used if mlflow is installed.
- NeptuneCallback is used if neptune is installed.
- AzureMLCallback is used if azureml-sdk is installed.
- CodeCarbonCallback is used if codecarbon is installed.
- ClearMLCallback is used if clearml is installed.
- DagsHubCallback is used if dagshub is installed.
- FlyteCallback is used if flyte is installed.
- DVCLiveCallback is used if dvclive is installed.
If a package is installed but you don't wish to use the accompanying integration, you can change TrainingArguments.report_to to a list of just those integrations you want to use (e.g. ["azure_ml", "wandb"]).
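For example, a minimal sketch that keeps only the Weights & Biases and TensorBoard integrations (the output_dir value is just a placeholder):

from transformers import TrainingArguments

# Only the integrations listed in report_to will be set up; pass "none" to disable all of them.
args = TrainingArguments(
    output_dir="my-model",  # placeholder path
    report_to=["wandb", "tensorboard"],
)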
The main class that implements callbacks is TrainerCallback. It gets the TrainingArguments used to instantiate the Trainer, can access that Trainer's internal state via TrainerState, and can take some actions on the training loop via TrainerControl.
Available Callbacks
Here is the list of the available TrainerCallback in the library:
A TrainerCallback that sends the logs to Comet ML.
Setup the optional Comet integration.
Environment:
- COMET_MODE (str, optional, defaults to get_or_create): Control whether to create and log to a new Comet experiment or append to an existing experiment. It accepts the following values:
  - get_or_create: Decides automatically depending on whether COMET_EXPERIMENT_KEY is set and whether an Experiment with that key already exists or not.
  - create: Always create a new Comet Experiment.
  - get: Always try to append to an existing Comet Experiment. Requires COMET_EXPERIMENT_KEY to be set.
  - ONLINE: deprecated, used to create an online Experiment. Use COMET_START_ONLINE=1 instead.
  - OFFLINE: deprecated, used to create an offline Experiment. Use COMET_START_ONLINE=0 instead.
  - DISABLED: deprecated, used to disable Comet logging. Use the --report_to flag to control the integrations used for logging results instead.
- COMET_PROJECT_NAME (str, optional): Comet project name for experiments.
- COMET_LOG_ASSETS (str, optional, defaults to TRUE): Whether or not to log training assets (tf event logs, checkpoints, etc.) to Comet. Can be TRUE or FALSE.
For a number of configurable items in the environment, see here.
A TrainerCallback that handles the default flow of the training loop for logs, evaluation and checkpoints.
A bare TrainerCallback that just prints the logs.
A TrainerCallback that displays the progress of training or evaluation.
You can modify max_str_len to control how long strings are truncated when logging.
class transformers.EarlyStoppingCallback
( early_stopping_patience: int = 1, early_stopping_threshold: typing.Optional[float] = 0.0 )
Parameters
- early_stopping_patience (int) — Use with metric_for_best_model to stop training when the specified metric worsens for early_stopping_patience evaluation calls.
- early_stopping_threshold (float, optional) — Use with TrainingArguments metric_for_best_model and early_stopping_patience to denote how much the specified metric must improve to satisfy early stopping conditions.
A TrainerCallback that handles early stopping.
This callback depends on the load_best_model_at_end functionality of TrainingArguments to set best_metric in TrainerState. Note that if the TrainingArguments argument save_steps differs from eval_steps, early stopping will not occur until the next save step.
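A minimal sketch of wiring this together; output_dir, model and the datasets are placeholders, and eval_strategy is called evaluation_strategy in older versions of Transformers:

from transformers import Trainer, TrainingArguments, EarlyStoppingCallback

args = TrainingArguments(
    output_dir="my-model",              # placeholder path
    eval_strategy="steps",              # evaluation_strategy in older versions
    eval_steps=500,
    save_steps=500,                     # keep aligned with eval_steps (see the note above)
    load_best_model_at_end=True,        # required so best_metric is tracked in TrainerState
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

trainer = Trainer(
    model=model,                        # assumed to be defined elsewhere
    args=args,
    train_dataset=train_dataset,        # placeholder datasets
    eval_dataset=eval_dataset,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)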
class transformers.integrations.TensorBoardCallback
( tb_writer = None )
A TrainerCallback that sends the logs to TensorBoard.
A TrainerCallback that logs metrics, media, and model checkpoints to Weights & Biases.
Setup the optional Weights & Biases (wandb) integration.
One can subclass and override this method to customize the setup if needed. Find more information here. You can also override the following environment variables:
Environment:
- WANDB_LOG_MODEL (str, optional, defaults to "false"): Whether to log model and checkpoints during training. Can be "end", "checkpoint" or "false". If set to "end", the model will be uploaded at the end of training. If set to "checkpoint", the checkpoint will be uploaded every args.save_steps. If set to "false", the model will not be uploaded. Use along with load_best_model_at_end() to upload the best model.
  Deprecated in 5.0: Setting WANDB_LOG_MODEL as bool will be deprecated in version 5 of 🤗 Transformers.
- WANDB_WATCH (str, optional, defaults to "false"): Can be "gradients", "all", "parameters", or "false". Set to "all" to log gradients and parameters.
- WANDB_PROJECT (str, optional, defaults to "huggingface"): Set this to a custom string to store results in a different project.
- WANDB_DISABLED (bool, optional, defaults to False): Whether to disable wandb entirely. Set WANDB_DISABLED=true to disable.
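For illustration, one way to set these variables from Python before building the Trainer; the project name is a placeholder and the other values are taken from the options above:

import os

os.environ["WANDB_PROJECT"] = "my-experiments"  # placeholder project name
os.environ["WANDB_LOG_MODEL"] = "checkpoint"    # upload a checkpoint every args.save_steps
os.environ["WANDB_WATCH"] = "gradients"         # log gradient histograms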
A TrainerCallback that sends the logs to MLflow. Can be disabled by setting environment variable DISABLE_MLFLOW_INTEGRATION = TRUE.
Setup the optional MLflow integration.
Environment:
- HF_MLFLOW_LOG_ARTIFACTS (str, optional): Whether to use MLflow's log_artifact() facility to log artifacts. This only makes sense if logging to a remote server, e.g. s3 or GCS. If set to True or 1, will copy each saved checkpoint on each save in TrainingArguments's output_dir to the local or remote artifact storage. Using it without a remote storage will just copy the files to your artifact location.
- MLFLOW_TRACKING_URI (str, optional): Whether to store runs at a specific path or remote server. Unset by default, which skips setting the tracking URI entirely.
- MLFLOW_EXPERIMENT_NAME (str, optional, defaults to None): Whether to use an MLflow experiment_name under which to launch the run. Defaults to None, which will point to the Default experiment in MLflow. Otherwise, it is a case-sensitive name of the experiment to be activated. If an experiment with this name does not exist, a new experiment with this name is created.
- MLFLOW_TAGS (str, optional): A string dump of a dictionary of key/value pairs to be added to the MLflow run as tags. Example: os.environ['MLFLOW_TAGS']='{"release.candidate": "RC1", "release.version": "2.2.0"}'.
- MLFLOW_NESTED_RUN (str, optional): Whether to use MLflow nested runs. If set to True or 1, will create a nested run inside the current run.
- MLFLOW_RUN_ID (str, optional): Allows reattaching to an existing run, which can be useful when resuming training from a checkpoint. When the MLFLOW_RUN_ID environment variable is set, start_run attempts to resume a run with the specified run ID and other parameters are ignored.
- MLFLOW_FLATTEN_PARAMS (str, optional, defaults to False): Whether to flatten the parameters dictionary before logging.
- MLFLOW_MAX_LOG_PARAMS (int, optional): Set the maximum number of parameters to log in the run.
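As a sketch, the same variables set from Python; the tracking URI and experiment name are placeholders:

import os

os.environ["MLFLOW_TRACKING_URI"] = "http://localhost:5000"  # placeholder tracking server
os.environ["MLFLOW_EXPERIMENT_NAME"] = "my-experiment"       # placeholder experiment name
os.environ["HF_MLFLOW_LOG_ARTIFACTS"] = "1"                  # also copy saved checkpoints to the artifact store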
A TrainerCallback that sends the logs to AzureML.
A TrainerCallback that tracks the CO2 emission of training.
class transformers.integrations.NeptuneCallback
( api_token: typing.Optional[str] = None, project: typing.Optional[str] = None, name: typing.Optional[str] = None, base_namespace: str = 'finetuning', run = None, log_parameters: bool = True, log_checkpoints: typing.Optional[str] = None, **neptune_run_kwargs )
Parameters
- api_token (str, optional) — Neptune API token obtained upon registration. You can leave this argument out if you have saved your token to the NEPTUNE_API_TOKEN environment variable (strongly recommended). See full setup instructions in the docs.
- project (str, optional) — Name of an existing Neptune project, in the form "workspace-name/project-name". You can find and copy the name in Neptune from the project settings -> Properties. If None (default), the value of the NEPTUNE_PROJECT environment variable is used.
- name (str, optional) — Custom name for the run.
- base_namespace (str, optional, defaults to "finetuning") — In the Neptune run, the root namespace that will contain all of the metadata logged by the callback.
- log_parameters (bool, optional, defaults to True) — If True, logs all Trainer arguments and model parameters provided by the Trainer.
- log_checkpoints (str, optional) — If "same", uploads checkpoints whenever they are saved by the Trainer. If "last", uploads only the most recently saved checkpoint. If "best", uploads the best checkpoint (among the ones saved by the Trainer). If None, does not upload checkpoints.
- run (Run, optional) — Pass a Neptune run object if you want to continue logging to an existing run. Read more about resuming runs in the docs.
- **neptune_run_kwargs (optional) — Additional keyword arguments to be passed directly to the neptune.init_run() function when a new run is created.
TrainerCallback that sends the logs to Neptune.
For instructions and examples, see the Transformers integration guide in the Neptune documentation.
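A minimal sketch of passing the callback explicitly instead of relying on report_to; the project and run names are placeholders, and model, training_args and the datasets are assumed to be defined elsewhere:

from transformers import Trainer
from transformers.integrations import NeptuneCallback

neptune_callback = NeptuneCallback(
    project="my-workspace/my-project",  # placeholder "workspace-name/project-name"
    name="bert-finetuning",             # placeholder run name
    log_checkpoints="best",             # upload only the best checkpoint saved by the Trainer
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    callbacks=[neptune_callback],
)

When passing the callback explicitly like this, you may also want to exclude "neptune" from report_to in TrainingArguments so the Trainer does not create a second Neptune run automatically.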
A TrainerCallback that sends the logs to ClearML.
Environment:
- CLEARML_PROJECT (str, optional, defaults to HuggingFace Transformers): ClearML project name.
- CLEARML_TASK (str, optional, defaults to Trainer): ClearML task name.
- CLEARML_LOG_MODEL (bool, optional, defaults to False): Whether to log models as artifacts during training.
A TrainerCallback that logs to DagsHub. Extends MLflowCallback.
Setup the DagsHub logging integration.
Environment:
- HF_DAGSHUB_LOG_ARTIFACTS (str, optional): Whether to save the data and model artifacts for the experiment. Defaults to False.
class transformers.integrations.FlyteCallback
( save_log_history: bool = True, sync_checkpoints: bool = True )
A TrainerCallback that sends the logs to Flyte. NOTE: This callback only works within a Flyte task.
class transformers.integrations.DVCLiveCallback
( live: typing.Optional[typing.Any] = None, log_model: typing.Union[typing.Literal['all'], bool, NoneType] = None, **kwargs )
Parameters
- live (dvclive.Live, optional, defaults to None) — Optional Live instance. If None, a new instance will be created using **kwargs.
- log_model (Union[Literal["all"], bool], optional, defaults to None) — Whether to use dvclive.Live.log_artifact() to log checkpoints created by Trainer. If set to True, the final checkpoint is logged at the end of training. If set to "all", the entire TrainingArguments's output_dir is logged at each checkpoint.
A TrainerCallback that sends the logs to DVCLive. Use the environment variables below in setup to configure the integration. To customize this callback beyond those environment variables, see here.
Setup the optional DVCLive integration. To customize this callback beyond the environment variables below, see here.
Environment:
- HF_DVCLIVE_LOG_MODEL (str, optional): Whether to use dvclive.Live.log_artifact() to log checkpoints created by Trainer. If set to True or 1, the final checkpoint is logged at the end of training. If set to all, the entire TrainingArguments's output_dir is logged at each checkpoint.
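A brief sketch of the two ways to enable checkpoint logging described above; model, training_args and train_dataset are assumed to be defined elsewhere:

import os
from transformers import Trainer
from transformers.integrations import DVCLiveCallback

# Option 1: configure through the environment and let report_to="dvclive" create the callback.
os.environ["HF_DVCLIVE_LOG_MODEL"] = "1"  # log the final checkpoint at the end of training

# Option 2: pass a configured callback instance explicitly.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    callbacks=[DVCLiveCallback(log_model="all")],  # log output_dir at every checkpoint
)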
TrainerCallback
class transformers.TrainerCallback
( )
Parameters
- args (TrainingArguments) — The training arguments used to instantiate the Trainer.
- state (TrainerState) — The current state of the Trainer.
- control (TrainerControl) — The object that is returned to the Trainer and can be used to make some decisions.
- model (PreTrainedModel or torch.nn.Module) — The model being trained.
- tokenizer (PreTrainedTokenizer) — The tokenizer used for encoding the data. This is deprecated in favour of processing_class.
- processing_class (PreTrainedTokenizer or BaseImageProcessor or ProcessorMixin or FeatureExtractionMixin) — The processing class used for encoding the data. Can be a tokenizer, a processor, an image processor or a feature extractor.
- optimizer (torch.optim.Optimizer) — The optimizer used for the training steps.
- lr_scheduler (torch.optim.lr_scheduler.LambdaLR) — The scheduler used for setting the learning rate.
- train_dataloader (torch.utils.data.DataLoader, optional) — The current dataloader used for training.
- eval_dataloader (torch.utils.data.DataLoader, optional) — The current dataloader used for evaluation.
- metrics (Dict[str, float]) — The metrics computed by the last evaluation phase. Those are only accessible in the event on_evaluate.
- logs (Dict[str, float]) — The values to log. Those are only accessible in the event on_log.
A class for objects that will inspect the state of the training loop at some events and take some decisions. At each of those events the following arguments are available:
The control object is the only one that can be changed by the callback, in which case the event that changes it should return the modified version.
The arguments args, state and control are positional for all events; all the others are grouped in kwargs.
You can unpack the ones you need in the signature of the event using them. As an example, see the code of the simple PrinterCallback.
Example:
class PrinterCallback(TrainerCallback):
    def on_log(self, args, state, control, logs=None, **kwargs):
        _ = logs.pop("total_flos", None)
        if state.is_local_process_zero:
            print(logs)
on_epoch_begin
( args: TrainingArguments, state: TrainerState, control: TrainerControl, **kwargs )
Event called at the beginning of an epoch.
on_epoch_end
( args: TrainingArguments, state: TrainerState, control: TrainerControl, **kwargs )
Event called at the end of an epoch.
on_evaluate
( args: TrainingArguments, state: TrainerState, control: TrainerControl, **kwargs )
Event called after an evaluation phase.
on_init_end
( args: TrainingArguments, state: TrainerState, control: TrainerControl, **kwargs )
Event called at the end of the initialization of the Trainer.
on_log
Event called after logging the last logs.
on_optimizer_step
( args: TrainingArguments, state: TrainerState, control: TrainerControl, **kwargs )
Event called after the optimizer step but before gradients are zeroed out. Useful for monitoring gradients.
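For instance, a hedged sketch of using this event to monitor gradients; the callback name is arbitrary and it relies on the model being passed to the event through kwargs:

from transformers import TrainerCallback

class GradientNormCallback(TrainerCallback):
    def on_optimizer_step(self, args, state, control, **kwargs):
        model = kwargs.get("model")
        if model is None:
            return
        # Gradients have not been zeroed yet at this point, so they can still be inspected.
        total_norm = 0.0
        for p in model.parameters():
            if p.grad is not None:
                total_norm += p.grad.detach().norm(2).item() ** 2
        print(f"step {state.global_step}: grad norm = {total_norm ** 0.5:.4f}")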
on_pre_optimizer_step
( args: TrainingArguments, state: TrainerState, control: TrainerControl, **kwargs )
Event called before the optimizer step but after gradient clipping. Useful for monitoring gradients.
on_predict
( args: TrainingArguments, state: TrainerState, control: TrainerControl, metrics, **kwargs )
Event called after a successful prediction.
on_prediction_step
( args: TrainingArguments, state: TrainerState, control: TrainerControl, **kwargs )
Event called after a prediction step.
on_save
Event called after a checkpoint save.
on_step_begin
( args: TrainingArguments, state: TrainerState, control: TrainerControl, **kwargs )
Event called at the beginning of a training step. If using gradient accumulation, one training step might take several inputs.
on_step_end
( args: TrainingArguments, state: TrainerState, control: TrainerControl, **kwargs )
Event called at the end of a training step. If using gradient accumulation, one training step might take several inputs.
on_substep_end
( args: TrainingArguments, state: TrainerState, control: TrainerControl, **kwargs )
Event called at the end of a substep during gradient accumulation.
on_train_begin
( args: TrainingArguments, state: TrainerState, control: TrainerControl, **kwargs )
Event called at the beginning of training.
on_train_end
( args: TrainingArguments, state: TrainerState, control: TrainerControl, **kwargs )
Event called at the end of training.
Here is an example of how to register a custom callback with the PyTorch Trainer:
class MyCallback(TrainerCallback):
    "A callback that prints a message at the beginning of training"

    def on_train_begin(self, args, state, control, **kwargs):
        print("Starting training")

trainer = Trainer(
    model,
    args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    callbacks=[MyCallback],  # We can either pass the callback class this way or an instance of it (MyCallback())
)
Another way to register a callback is to call trainer.add_callback():
trainer = Trainer(...)
trainer.add_callback(MyCallback)
# Alternatively, we can pass an instance of the callback class
trainer.add_callback(MyCallback())
TrainerState
class transformers.TrainerState
( epoch: typing.Optional[float] = None, global_step: int = 0, max_steps: int = 0, logging_steps: int = 500, eval_steps: int = 500, save_steps: int = 500, train_batch_size: int = None, num_train_epochs: int = 0, num_input_tokens_seen: int = 0, total_flos: float = 0, log_history: typing.List[typing.Dict[str, float]] = None, best_metric: typing.Optional[float] = None, best_model_checkpoint: typing.Optional[str] = None, is_local_process_zero: bool = True, is_world_process_zero: bool = True, is_hyper_param_search: bool = False, trial_name: str = None, trial_params: typing.Dict[str, typing.Union[str, float, int, bool]] = None, stateful_callbacks: typing.List[ForwardRef('TrainerCallback')] = None )
Parameters
- epoch (float, optional) — Only set during training, will represent the epoch the training is at (the decimal part being the percentage of the current epoch completed).
- global_step (int, optional, defaults to 0) — During training, represents the number of update steps completed.
- max_steps (int, optional, defaults to 0) — The number of update steps to do during the current training.
- logging_steps (int, optional, defaults to 500) — Log every X update steps.
- eval_steps (int, optional) — Run an evaluation every X steps.
- save_steps (int, optional, defaults to 500) — Save checkpoint every X update steps.
- train_batch_size (int, optional) — The batch size for the training dataloader. Only needed when auto_find_batch_size has been used.
- num_input_tokens_seen (int, optional, defaults to 0) — When tracking the input tokens, the number of tokens seen during training (number of input tokens, not the number of prediction tokens).
- total_flos (float, optional, defaults to 0) — The total number of floating operations done by the model since the beginning of training (stored as floats to avoid overflow).
- log_history (List[Dict[str, float]], optional) — The list of logs done since the beginning of training.
- best_metric (float, optional) — When tracking the best model, the value of the best metric encountered so far.
- best_model_checkpoint (str, optional) — When tracking the best model, the name of the checkpoint for the best model encountered so far.
- is_local_process_zero (bool, optional, defaults to True) — Whether or not this process is the local (e.g., on one machine if training in a distributed fashion on several machines) main process.
- is_world_process_zero (bool, optional, defaults to True) — Whether or not this process is the global main process (when training in a distributed fashion on several machines, this is only going to be True for one process).
- is_hyper_param_search (bool, optional, defaults to False) — Whether we are in the process of a hyperparameter search using Trainer.hyperparameter_search. This will impact the way data will be logged in TensorBoard.
- stateful_callbacks (List[StatefulTrainerCallback], optional) — Callbacks attached to the Trainer that should have their states saved or restored. Relevant callbacks should implement a state and from_state function.
A class containing the Trainer inner state that will be saved along the model and optimizer when checkpointing and passed to the TrainerCallback.
In all this class, one step is to be understood as one update step. When using gradient accumulation, one update step may require several forward and backward passes: if you use gradient_accumulation_steps=n, then one update step requires going through n batches.
load_from_json
Create an instance from the content of json_path.
save_to_json
Save the content of this instance in JSON format inside json_path.
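A small sketch of that JSON round trip (the file name is arbitrary):

from transformers import TrainerState

state = TrainerState(global_step=100, max_steps=1000)
state.save_to_json("trainer_state.json")                      # write the state to disk
restored = TrainerState.load_from_json("trainer_state.json")  # recreate it later
print(restored.global_step)  # 100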
TrainerControl
class transformers.TrainerControl
( should_training_stop: bool = False, should_epoch_stop: bool = False, should_save: bool = False, should_evaluate: bool = False, should_log: bool = False )
Parameters
- should_training_stop (bool, optional, defaults to False) — Whether or not the training should be interrupted. If True, this variable will not be set back to False. The training will just stop.
- should_epoch_stop (bool, optional, defaults to False) — Whether or not the current epoch should be interrupted. If True, this variable will be set back to False at the beginning of the next epoch.
- should_save (bool, optional, defaults to False) — Whether or not the model should be saved at this step. If True, this variable will be set back to False at the beginning of the next step.
- should_evaluate (bool, optional, defaults to False) — Whether or not the model should be evaluated at this step. If True, this variable will be set back to False at the beginning of the next step.
- should_log (bool, optional, defaults to False) — Whether or not the logs should be reported at this step. If True, this variable will be set back to False at the beginning of the next step.
A class that handles the Trainer control flow. This class is used by the TrainerCallback to activate some switches in the training loop.
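As an illustration, a sketch of a callback that flips these switches to force an extra evaluation and log at an arbitrary step (the class name and step number are made up):

from transformers import TrainerCallback

class ForceEvalCallback(TrainerCallback):
    def on_step_end(self, args, state, control, **kwargs):
        if state.global_step == 1000:
            control.should_evaluate = True
            control.should_log = True
        # Returning the (possibly modified) control object is how the change takes effect;
        # the Trainer resets these flags after acting on them.
        return control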