Hydra

Hydra is an open-source Python framework that simplifies the development of research and other complex applications. The key feature is the ability to dynamically create a hierarchical configuration by composition and override it through config files and the command line. The name Hydra comes from its ability to run multiple similar jobs, much like a Hydra with multiple heads.

Motivation

Until recently, all components in fairseq were configured through a shared args namespace that was created at application startup. Components declared their own add_args method to update the argparse parser, hoping that the names would not clash with arguments from other components. While this model works for smaller applications, it became problematic as fairseq grew and was integrated into other applications. In order to determine how to configure each component, one needed to a) examine what args were added by this component, and b) read the code to figure out which shared arguments, added in other places, it was using. Reproducing models involved sharing commands that often contained dozens of command line switches.

The model described above is still supported by fairseq for backward compatibility, but will be deprecated some time in the future.

New components in fairseq should now create a dataclass that encapsulates all parameters required to configure this component. The dataclass is registered along with the component, and fairseq takes care of constructing and providing this configuration object to the component's constructor. Note that sharing parameters can optionally still work, but one has to explicitly point to the "source of truth" (see inheritance example below). These changes make components in fairseq more independent and re-usable by other applications: all that is needed to create a component is to initialize its dataclass and overwrite some of the defaults.

While configuring fairseq through the command line (using either the legacy argparse-based or the new Hydra-based entry points) is still fully supported, you can now take advantage of configuring fairseq completely or piece-by-piece through hierarchical YAML configuration files. These files can also be shipped as examples that others can use to run an identically configured job.

Additionally, Hydra has a rich and growing library of plugins that provide functionality such as hyperparameter sweeping (including using Bayesian optimization through the Ax library), job launching across various platforms, and more.

Creating or migrating components

In general, each new (or updated) component should provide a companion dataclass. These dataclasses are typically located in the same file as the component and are passed as arguments to the register_*() functions. Top-level configs that should be present in every fairseq application are placed in the global config file and added to the FairseqConfig object.

Each dataclass is a plain-old-data object, similar to a NamedTuple. These classes are decorated with a @dataclass decorator, and typically inherit from FairseqDataclass (which adds some functionality for backward compatibility). Each field must have a type, and generally has metadata (such as a help string) and a default value. Only primitive types or other config objects are allowed as data types for each field.

Example:

from dataclasses import dataclass, field
from fairseq.dataclass import FairseqDataclass

@dataclass
class InteractiveConfig(FairseqDataclass):
    buffer_size: int = field(
        default=0,
        metadata={
            "help": "read this many sentences into a buffer before processing them"
        },
    )
    input: str = field(
        default="-",
        metadata={"help": "file to read from; use - for stdin"},
    )
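
As a rough illustration of how such a config object is used, the sketch below re-declares the same two fields on a plain standard-library dataclass, overrides one default, and reads the help strings back from the field metadata. The class name InteractiveConfigSketch is hypothetical, and it deliberately does not inherit from FairseqDataclass so that the snippet runs without fairseq installed.

```python
from dataclasses import dataclass, field, fields


# Stand-in for the InteractiveConfig above; it does not inherit from
# FairseqDataclass (an assumption made so the sketch needs only the stdlib).
@dataclass
class InteractiveConfigSketch:
    buffer_size: int = field(
        default=0,
        metadata={
            "help": "read this many sentences into a buffer before processing them"
        },
    )
    input: str = field(
        default="-",
        metadata={"help": "file to read from; use - for stdin"},
    )


# Initialize the dataclass and overwrite one default; the other field keeps its default.
cfg = InteractiveConfigSketch(buffer_size=16)

# The help strings live in each field's metadata mapping.
help_by_field = {f.name: f.metadata["help"] for f in fields(cfg)}

print(cfg)
print(help_by_field["input"])
```

Keeping the help text in the field metadata means that a single declaration can, in principle, serve both as the typed config and as the source for command line help.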

Inheriting values

Some components require sharing a value. For example, a learning rate scheduler and an optimizer may both need to know the initial learning rate value. One can declare a field that, by default, will inherit its value from another config node in the same hierarchy:

from typing import List
from omegaconf import II

@dataclass
class FairseqAdamConfig(FairseqDataclass):
    ...
    lr: List[float] = II("optimization.lr")
    ...

II("optimization.lr") is syntactic sugar for "${optimization.lr}", which is the value one can use in a YAML config file or through the command line to achieve the same effect. Note that this assumes that there is an "optimization" config object in the root config and that it has a field called "lr".
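
To make the interpolation idea concrete, here is a minimal toy resolver written with the standard library only. It is not how OmegaConf implements interpolation; the local II function merely mirrors the string form described above, and the helper names lookup and resolve as well as the example root config are hypothetical.

```python
import re
from typing import Any, Dict


def II(path: str) -> str:
    """Local stand-in mirroring the sugar described above: II("a.b") -> "${a.b}"."""
    return "${" + path + "}"


def lookup(root: Dict[str, Any], dotted: str) -> Any:
    """Follow a dotted path such as "optimization.lr" starting at the root config."""
    node: Any = root
    for key in dotted.split("."):
        node = node[key]  # raises KeyError if the referenced node does not exist
    return node


def resolve(root: Dict[str, Any], value: Any) -> Any:
    """Replace a whole-string "${path}" reference with the value it points to."""
    if isinstance(value, str):
        match = re.fullmatch(r"\$\{([^}]+)\}", value)
        if match:
            return resolve(root, lookup(root, match.group(1)))
    return value


# Hypothetical root config: the optimizer's lr points at optimization.lr.
root = {
    "optimization": {"lr": [0.25]},
    "optimizer": {"lr": II("optimization.lr")},
}

print(root["optimizer"]["lr"])                 # the unresolved reference string
print(resolve(root, root["optimizer"]["lr"]))  # the value it resolves to
```

In this toy version a reference to a missing node fails with a KeyError, which illustrates why the root config must actually contain the node that the reference names.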

Tasks and Models

Creating Tasks and Models works the same as before, except that legacy implementations now inherit from LegacyFairseq* base classes, while new components inherit from FairseqTask and FairseqModel and provide a dataclass to the register_*() functions.

Task example:

@dataclass
class LanguageModelingConfig(FairseqDataclass):
    data: Optional[str] = field(
        default=None, metadata={"help": "path to data directory"}
    )
    ...

@register_task("language_modeling", dataclass=LanguageModelingConfig)
class LanguageModelingTask(FairseqTask):
    ...
    @classmethod
    def setup_task(cls, cfg: LanguageModelingConfig):
        ...

Model example:

@dataclass
class TransformerLanguageModelConfig(FairseqDataclass):
    activation_fn: ChoiceEnum(utils.get_available_activation_fns()) = field(
        default="relu", metadata={"help": "activation function to use"}
    )
    dropout: float = field(default=0.1, metadata={"help": "dropout probability"})
    ...

@register_model("transformer_lm", dataclass=TransformerLanguageModelConfig)
class TransformerLanguageModel(FairseqLanguageModel):
    ...
    @classmethod
    def build_model(cls, cfg: TransformerLanguageModelConfig, task: FairseqTask):
        ...

Other components

Other components work as before, but they now take their configuration dataclass as the only constructor argument:

@dataclass
class MosesTokenizerConfig(FairseqDataclass):
    source_lang: str = field(default="en", metadata={"help": "source language"})
    ...

@register_tokenizer("moses", dataclass=MosesTokenizerConfig)
class MosesTokenizer(object):
    def __init__(self, cfg: MosesTokenizerConfig):
        ...
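
The registration pattern above can be sketched with a toy registry. This is only an illustration of the idea, not fairseq's actual registry code: register_tokenizer_sketch, build_tokenizer_sketch, TOKENIZER_REGISTRY and the *Sketch classes are all hypothetical names, and only the standard library is used.

```python
import dataclasses
from typing import Any, Callable, Dict, Tuple

# Hypothetical registry: name -> (component class, config dataclass).
TOKENIZER_REGISTRY: Dict[str, Tuple[type, type]] = {}


def register_tokenizer_sketch(name: str, dataclass: type) -> Callable[[type], type]:
    """Toy decorator that records a component together with its config class."""

    def wrapper(cls: type) -> type:
        if name in TOKENIZER_REGISTRY:
            raise ValueError(f"duplicate tokenizer name: {name}")
        TOKENIZER_REGISTRY[name] = (cls, dataclass)
        return cls

    return wrapper


def build_tokenizer_sketch(name: str, **overrides: Any) -> Any:
    """Construct the config from defaults plus overrides, then pass it as the only argument."""
    cls, config_cls = TOKENIZER_REGISTRY[name]
    cfg = config_cls(**overrides)
    return cls(cfg)


@dataclasses.dataclass
class MosesTokenizerConfigSketch:
    source_lang: str = dataclasses.field(
        default="en", metadata={"help": "source language"}
    )


@register_tokenizer_sketch("moses", dataclass=MosesTokenizerConfigSketch)
class MosesTokenizerSketch:
    def __init__(self, cfg: MosesTokenizerConfigSketch):
        self.cfg = cfg


tokenizer = build_tokenizer_sketch("moses", source_lang="de")
print(tokenizer.cfg.source_lang)
```

Because the component receives one typed config object, another application could build it directly by instantiating the dataclass, without going through any argument parser.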

Note that if you are adding a new registry for a new set of components, you need to add it to the FairseqConfig object in fairseq/dataclass/configs.py:

@dataclass
class FairseqConfig(object):
    ...
    my_new_registry: Any = None

Training with fairseq-hydra-train

To fully take advantage of configuration flexibility offered by Hydra, you may want to train new models using the fairseq-hydra-train entry point. Legacy CLI tools such as fairseq-train will remain supported for the foreseeable future but will be deprecated eventually.

On startup, Hydra will create a configuration object that contains a hierarchy of all the necessary dataclasses, populated with the default values defined in the code. These defaults are overwritten by values found in YAML files in the fairseq/config directory (which currently sets minimal defaults) and then further overwritten by values provided through command line arguments.
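
The layering order described above (defaults in code, then YAML files, then command line arguments) can be sketched with nested dictionaries. This is a simplified stdlib-only model, not Hydra's composition logic; the helper names deep_merge and apply_overrides are hypothetical, and the values are illustrative rather than fairseq's actual defaults.

```python
import copy
from typing import Any, Dict, List


def deep_merge(base: Dict[str, Any], overlay: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of base with overlay's values applied recursively."""
    merged = copy.deepcopy(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def apply_overrides(cfg: Dict[str, Any], overrides: List[str]) -> Dict[str, Any]:
    """Apply "a.b=value" strings; values stay strings in this toy version."""
    result = copy.deepcopy(cfg)
    for item in overrides:
        dotted, value = item.split("=", 1)
        *parents, leaf = dotted.split(".")
        node = result
        for key in parents:
            node = node.setdefault(key, {})
        node[leaf] = value
    return result


# Illustrative values only; they are not fairseq's actual defaults.
code_defaults = {"dataset": {"batch_size": 8}, "optimization": {"max_update": 0}}
yaml_values = {"optimization": {"max_update": 50000}}
cli_overrides = ["dataset.batch_size=2"]

# Later layers win: code defaults < YAML values < command line overrides.
final = apply_overrides(deep_merge(code_defaults, yaml_values), cli_overrides)
print(final)
```

Unlike this sketch, a real typed config would also convert the command line string to the declared field type.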

Some of the most common use cases are shown below:

1. Override default values through the command line:

$ fairseq-hydra-train \
    distributed_training.distributed_world_size=1 \
    dataset.batch_size=2 \
    task.data=data-bin \
    model=transformer_lm/transformer_lm_gpt \
    task=language_modeling \
    optimization.max_update=5000

Note that along with explicitly providing values for parameters such as dataset.batch_size, this also tells Hydra to overlay configuration found in fairseq/config/model/transformer_lm/transformer_lm_gpt.yaml over the default values in the dataclass. If you want to train a model without specifying a particular architecture you can simply specify model=transformer_lm. This only works for migrated tasks and models.

2. Replace bundled configs with an external config:

$ fairseq-hydra-train \
    --config-dir /path/to/external/configs \
    --config-name wiki103

where /path/to/external/configs/wiki103.yaml contains:

# @package _group_

model:
  _name: transformer_lm
distributed_training:
  distributed_world_size: 1
dataset:
  batch_size: 2
task:
  _name: language_modeling
  data: /path/to/data
  add_bos_token: false
  max_target_positions: 1024
optimization:
  max_update: 50000
  lr: [ 0.25 ]
criterion: cross_entropy
optimizer: adam
lr_scheduler:
  _name: cosine

Note that the bundled configs from the fairseq/config directory are not used here. However, the defaults from each dataclass will still be used (unless overwritten by your external config).

Additionally, you can break up your configs by creating a directory structure in the same location as your main config file. Name each directory after a top-level field (such as "model" or "dataset") and place in it config files with meaningful names that populate that section of your top-level config (for example, model/small_transformer_lm.yaml and model/big_transformer_lm.yaml). You can then select the desired configuration via the command line or via defaults in the main config, or even launch all of them as a sweep (see the Hydra documentation on how to do this).

3. Add an external config directory to Hydra search path:

This allows combining default configuration (including using any bundled config files), while specifying your own config files for some parts of the configuration.

$ fairseq-hydra-train \
    distributed_training.distributed_world_size=1 \
    dataset.batch_size=2 \
    task.data=/path/to/data/ \
    model=transformer_lm/2_layers \
    task=language_modeling \
    optimization.max_update=5000 \
    --config-dir /path/to/external/configs

where /path/to/external/configs has the following structure:

.
+-- model
|   +-- transformer_lm
|   |   +-- 2_layers.yaml

and 2_layers.yaml contains a copy of transformer_lm_gpt.yaml but with decoder_layers set to 2. You can add other configs to configure other components as well.