## Hydra

[Hydra](https://github.com/facebookresearch/hydra) is an open-source Python
framework that simplifies the development of research and other complex
applications. The key feature is the ability to dynamically create a
hierarchical configuration by composition and override it through config files
and the command line. The name Hydra comes from its ability to run multiple
similar jobs - much like a Hydra with multiple heads.

## Motivation

Until recently, all components in fairseq were configured through a shared
`args` namespace that was created at application startup. Components declared
their own `add_args` method to update the argparse parser, hoping that the names
would not clash with arguments from other components. While this model works for
smaller applications, as fairseq grew and became integrated into other
applications, this became problematic. In order to determine how to configure
each component, one needed to a) examine what args were added by this component,
and b) read the code to figure out what shared arguments it uses that were
added in other places. Reproducing models involved sharing commands that often
contained dozens of command line switches.

The model described above is still supported by fairseq for backward
compatibility, but will be deprecated some time in the future.

New components in fairseq should now create a dataclass that encapsulates all
parameters required to configure the component. The dataclass is registered
along with the component, and fairseq takes care of constructing and providing
this configuration object to the component's constructor. Parameters can
optionally still be shared, but one has to explicitly point to the "source of
truth" (see the inheritance example below). These changes make components in
fairseq more independent and re-usable by other applications: all that is
needed to create a component is to initialize its dataclass and overwrite some
of the defaults.

While configuring fairseq through the command line (using either the legacy
argparse-based or the new Hydra-based entry points) is still fully supported,
you can now configure fairseq completely or piece-by-piece through hierarchical
YAML configuration files. These files can also be shipped as examples that
others can use to run an identically configured job.

Additionally, Hydra has a rich and growing [library of
plugins](https://github.com/facebookresearch/hydra/tree/master/plugins) that
provide functionality such as hyperparameter sweeping (including Bayesian
optimization through the [Ax](https://github.com/facebook/Ax) library), job
launching across various platforms, and more.

## Creating or migrating components

In general, each new (or updated) component should provide a companion
[dataclass](https://www.python.org/dev/peps/pep-0557/). These dataclasses are
typically located in the same file as the component and are passed as arguments
to the `register_*()` functions. Top-level configs that should be present in
every fairseq application are placed in the
[global](fairseq/dataclass/configs.py) config file and added to the
`FairseqConfig` object.

Each dataclass is a plain-old-data object, similar to a `NamedTuple`. These
classes are decorated with a `@dataclass` decorator, and typically inherit from
`FairseqDataclass` (which adds some functionality for backward compatibility).
Each field must have a type, and generally has metadata (such as a help string)
and a default value. Only primitive types or other config objects are allowed as
data types for each field.

#### Example:

```python
from dataclasses import dataclass, field
from fairseq.dataclass import FairseqDataclass

@dataclass
class InteractiveConfig(FairseqDataclass):
    buffer_size: int = field(
        default=0,
        metadata={
            "help": "read this many sentences into a buffer before processing them"
        },
    )
    input: str = field(
        default="-",
        metadata={"help": "file to read from; use - for stdin"},
    )
```

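Since this is a regular dataclass, configuring the component outside of fairseq
amounts to instantiating the class and overriding some of the defaults (a
minimal sketch using the class defined above):

```python
# override one default; untouched fields keep their declared defaults
cfg = InteractiveConfig(buffer_size=10)
assert cfg.input == "-"
```
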
### Inheriting values

Some components require sharing a value. For example, a learning rate scheduler
and an optimizer may both need to know the initial learning rate value. One can
declare a field that, by default, will inherit its value from another config
node in the same hierarchy:

```python
from typing import List

from omegaconf import II

@dataclass
class FairseqAdamConfig(FairseqDataclass):
    ...
    lr: List[float] = II("optimization.lr")
    ...
```

`II("optimization.lr")` is syntactic sugar for `"${optimization.lr}"`, which is
the value one can use in a YAML config file or through the command line to
achieve the same effect. Note that this assumes that there is an "optimization"
config object in the root config and it has a field called "lr".

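For example, the same inheritance can be written directly in a YAML config file
(a sketch, assuming the root config has an "optimization" node with an "lr"
field, as described above):

```yaml
optimizer:
  _name: adam
  lr: ${optimization.lr}  # interpolated from the optimization node
optimization:
  lr: [ 0.25 ]
```
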
### Tasks and Models

Creating Tasks and Models works the same as before, except that legacy
implementations now inherit from `LegacyFairseq*` base classes, while new
components inherit from `FairseqTask` and `FairseqModel` and provide a dataclass
to the `register_*()` functions.

#### Task example:

```python
@dataclass
class LanguageModelingConfig(FairseqDataclass):
    data: Optional[str] = field(
        default=None, metadata={"help": "path to data directory"}
    )
    ...

@register_task("language_modeling", dataclass=LanguageModelingConfig)
class LanguageModelingTask(FairseqTask):
    ...
    @classmethod
    def setup_task(cls, cfg: LanguageModelingConfig):
        ...
```

#### Model example:

```python
@dataclass
class TransformerLanguageModelConfig(FairseqDataclass):
    activation_fn: ChoiceEnum(utils.get_available_activation_fns()) = field(
        default="relu", metadata={"help": "activation function to use"}
    )
    dropout: float = field(default=0.1, metadata={"help": "dropout probability"})
    ...

@register_model("transformer_lm", dataclass=TransformerLanguageModelConfig)
class TransformerLanguageModel(FairseqLanguageModel):
    ...
    @classmethod
    def build_model(cls, cfg: TransformerLanguageModelConfig, task: FairseqTask):
        ...
```

### Other components

Other components work as before, but they now take their configuration dataclass
as the only constructor argument:

```python
@dataclass
class MosesTokenizerConfig(FairseqDataclass):
    source_lang: str = field(default="en", metadata={"help": "source language"})
    ...

@register_tokenizer("moses", dataclass=MosesTokenizerConfig)
class MosesTokenizer(object):
    def __init__(self, cfg: MosesTokenizerConfig):
        ...
```

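Because the config is the only constructor argument, such a component can also
be built by hand in a downstream application (a minimal sketch reusing the
classes above):

```python
# instantiate the config directly, overriding one default,
# and pass it to the component's constructor
cfg = MosesTokenizerConfig(source_lang="de")
tokenizer = MosesTokenizer(cfg)
```
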
Note that if you are adding a new registry for a new set of components, you need
to add it to the `FairseqConfig` object in `fairseq/dataclass/configs.py`:

```python
@dataclass
class FairseqConfig(object):
    ...
    my_new_registry: Any = None
```

## Training with `fairseq-hydra-train`

To take full advantage of the configuration flexibility offered by Hydra, you
may want to train new models using the `fairseq-hydra-train` entry point. Legacy
CLI tools such as `fairseq-train` will remain supported for the foreseeable
future but will be deprecated eventually.

On startup, Hydra will create a configuration object that contains a hierarchy
of all the necessary dataclasses populated with the default values declared in
the code. The default values are overwritten by values found in YAML files in
the `fairseq/config` directory (which currently sets minimal defaults) and then
further overwritten by values provided through command line arguments.

Some of the most common use cases are shown below:

### 1. Override default values through command line:

```shell script
$ fairseq-hydra-train \
    distributed_training.distributed_world_size=1 \
    dataset.batch_size=2 \
    task.data=data-bin \
    model=transformer_lm/transformer_lm_gpt \
    task=language_modeling \
    optimization.max_update=5000
```

Note that along with explicitly providing values for parameters such as
`dataset.batch_size`, this also tells Hydra to overlay configuration found in
`fairseq/config/model/transformer_lm/transformer_lm_gpt.yaml` over the default
values in the dataclass. If you want to train a model without specifying a
particular architecture you can simply specify `model=transformer_lm`. This only
works for migrated tasks and models.

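For orientation, such a bundled model config is itself just a small YAML
overlay. A hedged sketch of the kind of content `transformer_lm_gpt.yaml`
carries (the field values here are illustrative GPT-style settings, not
authoritative; see the bundled file itself for the real values):

```yaml
# @package _group_
# illustrative excerpt, not the authoritative bundled file
activation_fn: gelu
decoder_layers: 12
decoder_embed_dim: 768
decoder_ffn_embed_dim: 3072
decoder_attention_heads: 12
```
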
### 2. Replace bundled configs with an external config:

```shell script
$ fairseq-hydra-train \
    --config-dir /path/to/external/configs \
    --config-name wiki103
```

where `/path/to/external/configs/wiki103.yaml` contains:

```yaml
# @package _group_

model:
  _name: transformer_lm
distributed_training:
  distributed_world_size: 1
dataset:
  batch_size: 2
task:
  _name: language_modeling
  data: /path/to/data
  add_bos_token: false
  max_target_positions: 1024
optimization:
  max_update: 50000
  lr: [ 0.25 ]
criterion: cross_entropy
optimizer: adam
lr_scheduler:
  _name: cosine
```

Note that the bundled configs from the `fairseq/config` directory are not used
here; however, the defaults from each dataclass will still be used (unless
overwritten by your external config).

Additionally, you can choose to break up your configs by creating a directory
structure in the same location as your main config file, with directories named
after the top-level fields (such as "model", "dataset", etc.), and placing
config files with meaningful names that would populate that specific section of
your top-level config file (for example, you might have
`model/small_transformer_lm.yaml`, `model/big_transformer_lm.yaml`, etc.). You
can then select the desired configuration via the command line, through
defaults in the main config (as sketched below), or even launch all of them as
a sweep (see the Hydra documentation on how to do this).

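As a sketch of the "defaults in the main config" option (assuming the
hypothetical `model/small_transformer_lm.yaml` from the example above), Hydra's
standard `defaults` list selects one file from such a directory:

```yaml
# excerpt of a main config; picks model/small_transformer_lm.yaml
defaults:
  - model: small_transformer_lm
```
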
### 3. Add an external config directory to Hydra search path:

This allows combining the default configuration (including any bundled config
files) with your own config files for selected parts of the configuration.

```shell script
$ fairseq-hydra-train \
    distributed_training.distributed_world_size=1 \
    dataset.batch_size=2 \
    task.data=/path/to/data/ \
    model=transformer_lm/2_layers \
    task=language_modeling \
    optimization.max_update=5000 \
    --config-dir /path/to/external/configs
```

where `/path/to/external/configs` has the following structure:

```
.
+-- model
|   +-- transformer_lm
|   |   +-- 2_layers.yaml
```

and `2_layers.yaml` contains a copy of `transformer_lm_gpt.yaml` but with
`decoder_layers` set to 2. You can add other configs to configure other
components as well.

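A hedged sketch of what `2_layers.yaml` could look like (everything except the
final line would be copied from `transformer_lm_gpt.yaml`; the other values
shown here are illustrative, not the bundled defaults):

```yaml
# @package _group_
# copy of transformer_lm_gpt.yaml with decoder_layers overridden
activation_fn: gelu
decoder_embed_dim: 768
decoder_ffn_embed_dim: 3072
decoder_attention_heads: 12
decoder_layers: 2  # the one override relative to transformer_lm_gpt.yaml
```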