osanseviero (HF staff) committed
Commit cbe1813
1 Parent(s): fc67275
This view is limited to 50 files because the commit contains too many changes. See the raw diff for the full changeset.
Files changed (50)
  1. CODE_OF_CONDUCT.md +0 -77
  2. CONTRIBUTING.md +0 -28
  3. LICENSE +0 -21
  4. gradiodemo.py → app.py +1 -1
  5. docs/Makefile +0 -20
  6. docs/_static/theme_overrides.css +0 -9
  7. docs/command_line_tools.rst +0 -85
  8. docs/conf.py +0 -134
  9. docs/criterions.rst +0 -31
  10. docs/data.rst +0 -58
  11. docs/docutils.conf +0 -2
  12. docs/fairseq.gif +0 -0
  13. docs/fairseq_logo.png +0 -0
  14. docs/getting_started.rst +0 -216
  15. docs/hydra_integration.md +0 -284
  16. docs/index.rst +0 -49
  17. docs/lr_scheduler.rst +0 -34
  18. docs/make.bat +0 -36
  19. docs/models.rst +0 -104
  20. docs/modules.rst +0 -9
  21. docs/optim.rst +0 -38
  22. docs/overview.rst +0 -74
  23. docs/requirements.txt +0 -2
  24. docs/tasks.rst +0 -61
  25. docs/tutorial_classifying_names.rst +0 -415
  26. docs/tutorial_simple_lstm.rst +0 -518
  27. examples/.gitignore +0 -2
  28. examples/__init__.py +0 -9
  29. examples/adaptive_span/README.md +0 -90
  30. examples/adaptive_span/__init__.py +0 -19
  31. examples/adaptive_span/adagrad_with_grad_clip.py +0 -128
  32. examples/adaptive_span/adaptive_span_attention.py +0 -160
  33. examples/adaptive_span/adaptive_span_loss.py +0 -106
  34. examples/adaptive_span/adaptive_span_model.py +0 -263
  35. examples/adaptive_span/adaptive_span_model_wrapper.py +0 -145
  36. examples/adaptive_span/truncated_bptt_lm_task.py +0 -1
  37. examples/backtranslation/README.md +0 -297
  38. examples/backtranslation/deduplicate_lines.py +0 -41
  39. examples/backtranslation/extract_bt_data.py +0 -72
  40. examples/backtranslation/prepare-de-monolingual.sh +0 -98
  41. examples/backtranslation/prepare-wmt18en2de.sh +0 -135
  42. examples/backtranslation/sacrebleu.sh +0 -37
  43. examples/backtranslation/tokenized_bleu.sh +0 -46
  44. examples/bart/README.glue.md +0 -99
  45. examples/bart/README.md +0 -228
  46. examples/bart/README.summarization.md +0 -102
  47. examples/bart/summarize.py +0 -100
  48. examples/byte_level_bpe/README.md +0 -88
  49. examples/byte_level_bpe/get_bitext.py +0 -254
  50. examples/byte_level_bpe/get_data.sh +0 -47
CODE_OF_CONDUCT.md DELETED
@@ -1,77 +0,0 @@
1
- # Code of Conduct
2
-
3
- ## Our Pledge
4
-
5
- In the interest of fostering an open and welcoming environment, we as
6
- contributors and maintainers pledge to make participation in our project and
7
- our community a harassment-free experience for everyone, regardless of age, body
8
- size, disability, ethnicity, sex characteristics, gender identity and expression,
9
- level of experience, education, socio-economic status, nationality, personal
10
- appearance, race, religion, or sexual identity and orientation.
11
-
12
- ## Our Standards
13
-
14
- Examples of behavior that contributes to creating a positive environment
15
- include:
16
-
17
- * Using welcoming and inclusive language
18
- * Being respectful of differing viewpoints and experiences
19
- * Gracefully accepting constructive criticism
20
- * Focusing on what is best for the community
21
- * Showing empathy towards other community members
22
-
23
- Examples of unacceptable behavior by participants include:
24
-
25
- * The use of sexualized language or imagery and unwelcome sexual attention or
26
- advances
27
- * Trolling, insulting/derogatory comments, and personal or political attacks
28
- * Public or private harassment
29
- * Publishing others' private information, such as a physical or electronic
30
- address, without explicit permission
31
- * Other conduct which could reasonably be considered inappropriate in a
32
- professional setting
33
-
34
- ## Our Responsibilities
35
-
36
- Project maintainers are responsible for clarifying the standards of acceptable
37
- behavior and are expected to take appropriate and fair corrective action in
38
- response to any instances of unacceptable behavior.
39
-
40
- Project maintainers have the right and responsibility to remove, edit, or
41
- reject comments, commits, code, wiki edits, issues, and other contributions
42
- that are not aligned to this Code of Conduct, or to ban temporarily or
43
- permanently any contributor for other behaviors that they deem inappropriate,
44
- threatening, offensive, or harmful.
45
-
46
- ## Scope
47
-
48
- This Code of Conduct applies within all project spaces, and it also applies when
49
- an individual is representing the project or its community in public spaces.
50
- Examples of representing a project or community include using an official
51
- project e-mail address, posting via an official social media account, or acting
52
- as an appointed representative at an online or offline event. Representation of
53
- a project may be further defined and clarified by project maintainers.
54
-
55
- ## Enforcement
56
-
57
- Instances of abusive, harassing, or otherwise unacceptable behavior may be
58
- reported by contacting the project team at <conduct@pytorch.org>. All
59
- complaints will be reviewed and investigated and will result in a response that
60
- is deemed necessary and appropriate to the circumstances. The project team is
61
- obligated to maintain confidentiality with regard to the reporter of an incident.
62
- Further details of specific enforcement policies may be posted separately.
63
-
64
- Project maintainers who do not follow or enforce the Code of Conduct in good
65
- faith may face temporary or permanent repercussions as determined by other
66
- members of the project's leadership.
67
-
68
- ## Attribution
69
-
70
- This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4,
71
- available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html
72
-
73
- [homepage]: https://www.contributor-covenant.org
74
-
75
- For answers to common questions about this code of conduct, see
76
- https://www.contributor-covenant.org/faq
77
-
 
CONTRIBUTING.md DELETED
@@ -1,28 +0,0 @@
1
- # Contributing to Facebook AI Research Sequence-to-Sequence Toolkit (fairseq)
2
- We want to make contributing to this project as easy and transparent as
3
- possible.
4
-
5
- ## Pull Requests
6
- We actively welcome your pull requests.
7
-
8
- 1. Fork the repo and create your branch from `master`.
9
- 2. If you've added code that should be tested, add tests.
10
- 3. If you've changed APIs, update the documentation.
11
- 4. Ensure the test suite passes.
12
- 5. Make sure your code lints.
13
- 6. If you haven't already, complete the Contributor License Agreement ("CLA").
14
-
15
- ## Contributor License Agreement ("CLA")
16
- In order to accept your pull request, we need you to submit a CLA. You only need
17
- to do this once to work on any of Facebook's open source projects.
18
-
19
- Complete your CLA here: <https://code.facebook.com/cla>
20
-
21
- ## Issues
22
- We use GitHub issues to track public bugs. Please ensure your description is
23
- clear and has sufficient instructions to be able to reproduce the issue.
24
-
25
- ## License
26
- By contributing to Facebook AI Research Sequence-to-Sequence Toolkit (fairseq),
27
- you agree that your contributions will be licensed under the LICENSE file in
28
- the root directory of this source tree.
 
LICENSE DELETED
@@ -1,21 +0,0 @@
1
- MIT License
2
-
3
- Copyright (c) Facebook, Inc. and its affiliates.
4
-
5
- Permission is hereby granted, free of charge, to any person obtaining a copy
6
- of this software and associated documentation files (the "Software"), to deal
7
- in the Software without restriction, including without limitation the rights
8
- to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
- copies of the Software, and to permit persons to whom the Software is
10
- furnished to do so, subject to the following conditions:
11
-
12
- The above copyright notice and this permission notice shall be included in all
13
- copies or substantial portions of the Software.
14
-
15
- THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
- IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
- FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
- AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
- LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
- OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
- SOFTWARE.
 
gradiodemo.py → app.py RENAMED
@@ -2,7 +2,7 @@ import gradio as gr
- description = "demo for HuBERT. To use it, simply add your audio or click one of the examples to load them. Read more at the links below."
+ description = "Demo for HuBERT. Add your audio or click one of the examples to load them. Read more at the links below."
  article = "<p style='text-align: center'><a href='https://arxiv.org/abs/2106.07447'>HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units</a> | <a href='https://github.com/pytorch/fairseq/tree/master/examples/hubert'>Github Repo</a></p>"

  gr.Interface.load("huggingface/facebook/hubert-large-ls960-ft",
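The rename of `gradiodemo.py` to `app.py` plus the reworded description above is the only functional change in this commit, and the diff is truncated at the `gr.Interface.load(` call. Below is a minimal sketch of how such a Gradio app is typically assembled; only `description`, `article`, and the model id come from the diff, while the keyword arguments passed to `load` and the final `launch()` call are assumptions for illustration and may not match the actual file.

```python
# Hypothetical sketch only -- not the verbatim contents of app.py.
# description, article, and the model id are taken from the diff above;
# everything else is assumed for illustration.
import gradio as gr

description = "Demo for HuBERT. Add your audio or click one of the examples to load them. Read more at the links below."
article = (
    "<p style='text-align: center'>"
    "<a href='https://arxiv.org/abs/2106.07447'>HuBERT: Self-Supervised Speech "
    "Representation Learning by Masked Prediction of Hidden Units</a> | "
    "<a href='https://github.com/pytorch/fairseq/tree/master/examples/hubert'>Github Repo</a></p>"
)

# gr.Interface.load (Gradio 2.x/3.x API) wraps a model hosted on the
# Hugging Face Hub in an Interface; extra keyword arguments override the
# generated interface's metadata.
demo = gr.Interface.load(
    "huggingface/facebook/hubert-large-ls960-ft",
    description=description,
    article=article,
)

if __name__ == "__main__":
    demo.launch()
```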
docs/Makefile DELETED
@@ -1,20 +0,0 @@
1
- # Minimal makefile for Sphinx documentation
2
- #
3
-
4
- # You can set these variables from the command line.
5
- SPHINXOPTS =
6
- SPHINXBUILD = python -msphinx
7
- SPHINXPROJ = fairseq
8
- SOURCEDIR = .
9
- BUILDDIR = _build
10
-
11
- # Put it first so that "make" without argument is like "make help".
12
- help:
13
- @$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
14
-
15
- .PHONY: help Makefile
16
-
17
- # Catch-all target: route all unknown targets to Sphinx using the new
18
- # "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
19
- %: Makefile
20
- @$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
 
docs/_static/theme_overrides.css DELETED
@@ -1,9 +0,0 @@
1
- .wy-table-responsive table td kbd {
2
- white-space: nowrap;
3
- }
4
- .wy-table-responsive table td {
5
- white-space: normal !important;
6
- }
7
- .wy-table-responsive {
8
- overflow: visible !important;
9
- }
 
docs/command_line_tools.rst DELETED
@@ -1,85 +0,0 @@
1
- .. _Command-line Tools:
2
-
3
- Command-line Tools
4
- ==================
5
-
6
- Fairseq provides several command-line tools for training and evaluating models:
7
-
8
- - :ref:`fairseq-preprocess`: Data pre-processing: build vocabularies and binarize training data
9
- - :ref:`fairseq-train`: Train a new model on one or multiple GPUs
10
- - :ref:`fairseq-generate`: Translate pre-processed data with a trained model
11
- - :ref:`fairseq-interactive`: Translate raw text with a trained model
12
- - :ref:`fairseq-score`: BLEU scoring of generated translations against reference translations
13
- - :ref:`fairseq-eval-lm`: Language model evaluation
14
-
15
-
16
- .. _fairseq-preprocess:
17
-
18
- fairseq-preprocess
19
- ~~~~~~~~~~~~~~~~~~
20
- .. automodule:: fairseq_cli.preprocess
21
-
22
- .. argparse::
23
- :module: fairseq.options
24
- :func: get_preprocessing_parser
25
- :prog: fairseq-preprocess
26
-
27
-
28
- .. _fairseq-train:
29
-
30
- fairseq-train
31
- ~~~~~~~~~~~~~
32
- .. automodule:: fairseq_cli.train
33
-
34
- .. argparse::
35
- :module: fairseq.options
36
- :func: get_training_parser
37
- :prog: fairseq-train
38
-
39
-
40
- .. _fairseq-generate:
41
-
42
- fairseq-generate
43
- ~~~~~~~~~~~~~~~~
44
- .. automodule:: fairseq_cli.generate
45
-
46
- .. argparse::
47
- :module: fairseq.options
48
- :func: get_generation_parser
49
- :prog: fairseq-generate
50
-
51
-
52
- .. _fairseq-interactive:
53
-
54
- fairseq-interactive
55
- ~~~~~~~~~~~~~~~~~~~
56
- .. automodule:: fairseq_cli.interactive
57
-
58
- .. argparse::
59
- :module: fairseq.options
60
- :func: get_interactive_generation_parser
61
- :prog: fairseq-interactive
62
-
63
-
64
- .. _fairseq-score:
65
-
66
- fairseq-score
67
- ~~~~~~~~~~~~~
68
- .. automodule:: fairseq_cli.score
69
-
70
- .. argparse::
71
- :module: fairseq_cli.score
72
- :func: get_parser
73
- :prog: fairseq-score
74
-
75
-
76
- .. _fairseq-eval-lm:
77
-
78
- fairseq-eval-lm
79
- ~~~~~~~~~~~~~~~
80
- .. automodule:: fairseq_cli.eval_lm
81
-
82
- .. argparse::
83
- :module: fairseq.options
84
- :func: get_eval_lm_parser
85
- :prog: fairseq-eval-lm
 
docs/conf.py DELETED
@@ -1,134 +0,0 @@
1
- #!/usr/bin/env python3
2
- # -*- coding: utf-8 -*-
3
- #
4
- # fairseq documentation build configuration file, created by
5
- # sphinx-quickstart on Fri Aug 17 21:45:30 2018.
6
- #
7
- # This file is execfile()d with the current directory set to its
8
- # containing dir.
9
- #
10
- # Note that not all possible configuration values are present in this
11
- # autogenerated file.
12
- #
13
- # All configuration values have a default; values that are commented out
14
- # serve to show the default.
15
-
16
- # If extensions (or modules to document with autodoc) are in another directory,
17
- # add these directories to sys.path here. If the directory is relative to the
18
- # documentation root, use os.path.abspath to make it absolute, like shown here.
19
-
20
- import os
21
- import sys
22
- from fairseq import __version__
23
-
24
-
25
- # source code directory, relative to this file, for sphinx-autobuild
26
- sys.path.insert(0, os.path.abspath(".."))
27
-
28
- source_suffix = [".rst"]
29
-
30
- # -- General configuration ------------------------------------------------
31
-
32
- # If your documentation needs a minimal Sphinx version, state it here.
33
- #
34
- # needs_sphinx = '1.0'
35
-
36
- # Add any Sphinx extension module names here, as strings. They can be
37
- # extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
38
- # ones.
39
- extensions = [
40
- "sphinx.ext.autodoc",
41
- "sphinx.ext.intersphinx",
42
- "sphinx.ext.viewcode",
43
- "sphinx.ext.napoleon",
44
- "sphinxarg.ext",
45
- ]
46
-
47
- # Add any paths that contain templates here, relative to this directory.
48
- templates_path = ["_templates"]
49
-
50
- # The master toctree document.
51
- master_doc = "index"
52
-
53
- # General information about the project.
54
- project = "fairseq"
55
- copyright = "Facebook AI Research (FAIR)"
56
- author = "Facebook AI Research (FAIR)"
57
-
58
- github_doc_root = "https://github.com/pytorch/fairseq/tree/master/docs/"
59
-
60
- # The version info for the project you're documenting, acts as replacement for
61
- # |version| and |release|, also used in various other places throughout the
62
- # built documents.
63
- #
64
- # The short X.Y version.
65
- version = __version__
66
- # The full version, including alpha/beta/rc tags.
67
- release = __version__
68
-
69
- # The language for content autogenerated by Sphinx. Refer to documentation
70
- # for a list of supported languages.
71
- #
72
- # This is also used if you do content translation via gettext catalogs.
73
- # Usually you set "language" from the command line for these cases.
74
- language = None
75
-
76
- # List of patterns, relative to source directory, that match files and
77
- # directories to ignore when looking for source files.
78
- # This patterns also effect to html_static_path and html_extra_path
79
- exclude_patterns = ["_build", "Thumbs.db", ".DS_Store"]
80
-
81
- # The name of the Pygments (syntax highlighting) style to use.
82
- pygments_style = "sphinx"
83
- highlight_language = "python"
84
-
85
- # If true, `todo` and `todoList` produce output, else they produce nothing.
86
- todo_include_todos = False
87
-
88
-
89
- # -- Options for HTML output ----------------------------------------------
90
-
91
- # The theme to use for HTML and HTML Help pages. See the documentation for
92
- # a list of builtin themes.
93
- #
94
- html_theme = "sphinx_rtd_theme"
95
-
96
- # Theme options are theme-specific and customize the look and feel of a theme
97
- # further. For a list of options available for each theme, see the
98
- # documentation.
99
- #
100
- # html_theme_options = {}
101
-
102
- # Add any paths that contain custom static files (such as style sheets) here,
103
- # relative to this directory. They are copied after the builtin static files,
104
- # so a file named "default.css" will overwrite the builtin "default.css".
105
- html_static_path = ["_static"]
106
-
107
- html_context = {
108
- "css_files": [
109
- "_static/theme_overrides.css", # override wide tables in RTD theme
110
- ],
111
- }
112
-
113
- # Custom sidebar templates, must be a dictionary that maps document names
114
- # to template names.
115
- #
116
- # This is required for the alabaster theme
117
- # refs: http://alabaster.readthedocs.io/en/latest/installation.html#sidebars
118
- # html_sidebars = {
119
- # '**': [
120
- # 'about.html',
121
- # 'navigation.html',
122
- # 'relations.html', # needs 'show_related': True theme option to display
123
- # 'searchbox.html',
124
- # 'donate.html',
125
- # ]
126
- # }
127
-
128
-
129
- # Example configuration for intersphinx: refer to the Python standard library.
130
- intersphinx_mapping = {
131
- "numpy": ("http://docs.scipy.org/doc/numpy/", None),
132
- "python": ("https://docs.python.org/", None),
133
- "torch": ("https://pytorch.org/docs/master/", None),
134
- }
 
docs/criterions.rst DELETED
@@ -1,31 +0,0 @@
1
- .. role:: hidden
2
- :class: hidden-section
3
-
4
- .. _Criterions:
5
-
6
- Criterions
7
- ==========
8
-
9
- Criterions compute the loss function given the model and batch, roughly::
10
-
11
- loss = criterion(model, batch)
12
-
13
- .. automodule:: fairseq.criterions
14
- :members:
15
-
16
- .. autoclass:: fairseq.criterions.FairseqCriterion
17
- :members:
18
- :undoc-members:
19
-
20
- .. autoclass:: fairseq.criterions.adaptive_loss.AdaptiveLoss
21
- :members:
22
- :undoc-members:
23
- .. autoclass:: fairseq.criterions.composite_loss.CompositeLoss
24
- :members:
25
- :undoc-members:
26
- .. autoclass:: fairseq.criterions.cross_entropy.CrossEntropyCriterion
27
- :members:
28
- :undoc-members:
29
- .. autoclass:: fairseq.criterions.label_smoothed_cross_entropy.LabelSmoothedCrossEntropyCriterion
30
- :members:
31
- :undoc-members:
 
docs/data.rst DELETED
@@ -1,58 +0,0 @@
1
- .. role:: hidden
2
- :class: hidden-section
3
-
4
- .. module:: fairseq.data
5
-
6
- Data Loading and Utilities
7
- ==========================
8
-
9
- .. _datasets:
10
-
11
- Datasets
12
- --------
13
-
14
- **Datasets** define the data format and provide helpers for creating
15
- mini-batches.
16
-
17
- .. autoclass:: fairseq.data.FairseqDataset
18
- :members:
19
- .. autoclass:: fairseq.data.LanguagePairDataset
20
- :members:
21
- .. autoclass:: fairseq.data.MonolingualDataset
22
- :members:
23
-
24
- **Helper Datasets**
25
-
26
- These datasets wrap other :class:`fairseq.data.FairseqDataset` instances and
27
- provide additional functionality:
28
-
29
- .. autoclass:: fairseq.data.BacktranslationDataset
30
- :members:
31
- .. autoclass:: fairseq.data.ConcatDataset
32
- :members:
33
- .. autoclass:: fairseq.data.ResamplingDataset
34
- :members:
35
- .. autoclass:: fairseq.data.RoundRobinZipDatasets
36
- :members:
37
- .. autoclass:: fairseq.data.TransformEosDataset
38
- :members:
39
-
40
-
41
- Dictionary
42
- ----------
43
-
44
- .. autoclass:: fairseq.data.Dictionary
45
- :members:
46
-
47
-
48
- Iterators
49
- ---------
50
-
51
- .. autoclass:: fairseq.data.CountingIterator
52
- :members:
53
- .. autoclass:: fairseq.data.EpochBatchIterator
54
- :members:
55
- .. autoclass:: fairseq.data.GroupedIterator
56
- :members:
57
- .. autoclass:: fairseq.data.ShardedIterator
58
- :members:
 
docs/docutils.conf DELETED
@@ -1,2 +0,0 @@
1
- [writers]
2
- option-limit=0
 
 
 
docs/fairseq.gif DELETED
Binary file (2.66 MB)
 
docs/fairseq_logo.png DELETED
Binary file (73 kB)
 
docs/getting_started.rst DELETED
@@ -1,216 +0,0 @@
1
- Evaluating Pre-trained Models
2
- =============================
3
-
4
- First, download a pre-trained model along with its vocabularies:
5
-
6
- .. code-block:: console
7
-
8
- > curl https://dl.fbaipublicfiles.com/fairseq/models/wmt14.v2.en-fr.fconv-py.tar.bz2 | tar xvjf -
9
-
10
- This model uses a `Byte Pair Encoding (BPE)
11
- vocabulary <https://arxiv.org/abs/1508.07909>`__, so we'll have to apply
12
- the encoding to the source text before it can be translated. This can be
13
- done with the
14
- `apply\_bpe.py <https://github.com/rsennrich/subword-nmt/blob/master/subword_nmt/apply_bpe.py>`__
15
- script using the ``wmt14.en-fr.fconv-cuda/bpecodes`` file. ``@@`` is
16
- used as a continuation marker and the original text can be easily
17
- recovered with e.g. ``sed s/@@ //g`` or by passing the ``--remove-bpe``
18
- flag to :ref:`fairseq-generate`. Prior to BPE, input text needs to be tokenized
19
- using ``tokenizer.perl`` from
20
- `mosesdecoder <https://github.com/moses-smt/mosesdecoder>`__.
21
-
22
- Let's use :ref:`fairseq-interactive` to generate translations interactively.
23
- Here, we use a beam size of 5 and preprocess the input with the Moses
24
- tokenizer and the given Byte-Pair Encoding vocabulary. It will automatically
25
- remove the BPE continuation markers and detokenize the output.
26
-
27
- .. code-block:: console
28
-
29
- > MODEL_DIR=wmt14.en-fr.fconv-py
30
- > fairseq-interactive \
31
- --path $MODEL_DIR/model.pt $MODEL_DIR \
32
- --beam 5 --source-lang en --target-lang fr \
33
- --tokenizer moses \
34
- --bpe subword_nmt --bpe-codes $MODEL_DIR/bpecodes
35
- | loading model(s) from wmt14.en-fr.fconv-py/model.pt
36
- | [en] dictionary: 44206 types
37
- | [fr] dictionary: 44463 types
38
- | Type the input sentence and press return:
39
- Why is it rare to discover new marine mammal species?
40
- S-0 Why is it rare to discover new marine mam@@ mal species ?
41
- H-0 -0.0643349438905716 Pourquoi est-il rare de découvrir de nouvelles espèces de mammifères marins?
42
- P-0 -0.0763 -0.1849 -0.0956 -0.0946 -0.0735 -0.1150 -0.1301 -0.0042 -0.0321 -0.0171 -0.0052 -0.0062 -0.0015
43
-
44
- This generation script produces three types of outputs: a line prefixed
45
- with *O* is a copy of the original source sentence; *H* is the
46
- hypothesis along with an average log-likelihood; and *P* is the
47
- positional score per token position, including the
48
- end-of-sentence marker which is omitted from the text.
49
-
50
- Other types of output lines you might see are *D*, the detokenized hypothesis,
51
- *T*, the reference target, *A*, alignment info, *E* the history of generation steps.
52
-
53
- See the `README <https://github.com/pytorch/fairseq#pre-trained-models>`__ for a
54
- full list of pre-trained models available.
55
-
56
- Training a New Model
57
- ====================
58
-
59
- The following tutorial is for machine translation. For an example of how
60
- to use Fairseq for other tasks, such as :ref:`language modeling`, please see the
61
- ``examples/`` directory.
62
-
63
- Data Pre-processing
64
- -------------------
65
-
66
- Fairseq contains example pre-processing scripts for several translation
67
- datasets: IWSLT 2014 (German-English), WMT 2014 (English-French) and WMT
68
- 2014 (English-German). To pre-process and binarize the IWSLT dataset:
69
-
70
- .. code-block:: console
71
-
72
- > cd examples/translation/
73
- > bash prepare-iwslt14.sh
74
- > cd ../..
75
- > TEXT=examples/translation/iwslt14.tokenized.de-en
76
- > fairseq-preprocess --source-lang de --target-lang en \
77
- --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
78
- --destdir data-bin/iwslt14.tokenized.de-en
79
-
80
- This will write binarized data that can be used for model training to
81
- ``data-bin/iwslt14.tokenized.de-en``.
82
-
83
- Training
84
- --------
85
-
86
- Use :ref:`fairseq-train` to train a new model. Here a few example settings that work
87
- well for the IWSLT 2014 dataset:
88
-
89
- .. code-block:: console
90
-
91
- > mkdir -p checkpoints/fconv
92
- > CUDA_VISIBLE_DEVICES=0 fairseq-train data-bin/iwslt14.tokenized.de-en \
93
- --optimizer nag --lr 0.25 --clip-norm 0.1 --dropout 0.2 --max-tokens 4000 \
94
- --arch fconv_iwslt_de_en --save-dir checkpoints/fconv
95
-
96
- By default, :ref:`fairseq-train` will use all available GPUs on your machine. Use the
97
- ``CUDA_VISIBLE_DEVICES`` environment variable to select specific GPUs and/or to
98
- change the number of GPU devices that will be used.
99
-
100
- Also note that the batch size is specified in terms of the maximum
101
- number of tokens per batch (``--max-tokens``). You may need to use a
102
- smaller value depending on the available GPU memory on your system.
103
-
104
- Generation
105
- ----------
106
-
107
- Once your model is trained, you can generate translations using
108
- :ref:`fairseq-generate` **(for binarized data)** or
109
- :ref:`fairseq-interactive` **(for raw text)**:
110
-
111
- .. code-block:: console
112
-
113
- > fairseq-generate data-bin/iwslt14.tokenized.de-en \
114
- --path checkpoints/fconv/checkpoint_best.pt \
115
- --batch-size 128 --beam 5
116
- | [de] dictionary: 35475 types
117
- | [en] dictionary: 24739 types
118
- | data-bin/iwslt14.tokenized.de-en test 6750 examples
119
- | model fconv
120
- | loaded checkpoint trainings/fconv/checkpoint_best.pt
121
- S-721 danke .
122
- T-721 thank you .
123
- ...
124
-
125
- To generate translations with only a CPU, use the ``--cpu`` flag. BPE
126
- continuation markers can be removed with the ``--remove-bpe`` flag.
127
-
128
- Advanced Training Options
129
- =========================
130
-
131
- Large mini-batch training with delayed updates
132
- ----------------------------------------------
133
-
134
- The ``--update-freq`` option can be used to accumulate gradients from
135
- multiple mini-batches and delay updating, creating a larger effective
136
- batch size. Delayed updates can also improve training speed by reducing
137
- inter-GPU communication costs and by saving idle time caused by variance
138
- in workload across GPUs. See `Ott et al.
139
- (2018) <https://arxiv.org/abs/1806.00187>`__ for more details.
140
-
141
- To train on a single GPU with an effective batch size that is equivalent
142
- to training on 8 GPUs:
143
-
144
- .. code-block:: console
145
-
146
- > CUDA_VISIBLE_DEVICES=0 fairseq-train --update-freq 8 (...)
147
-
148
- Training with half precision floating point (FP16)
149
- --------------------------------------------------
150
-
151
- .. note::
152
-
153
- FP16 training requires a Volta GPU and CUDA 9.1 or greater
154
-
155
- Recent GPUs enable efficient half precision floating point computation,
156
- e.g., using `Nvidia Tensor Cores
157
- <https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html>`__.
158
- Fairseq supports FP16 training with the ``--fp16`` flag:
159
-
160
- .. code-block:: console
161
-
162
- > fairseq-train --fp16 (...)
163
-
164
- Distributed training
165
- --------------------
166
-
167
- Distributed training in fairseq is implemented on top of ``torch.distributed``.
168
- The easiest way to launch jobs is with the `torch.distributed.launch
169
- <https://pytorch.org/docs/stable/distributed.html#launch-utility>`__ tool.
170
-
171
- For example, to train a large English-German Transformer model on 2 nodes each
172
- with 8 GPUs (in total 16 GPUs), run the following command on each node,
173
- replacing ``node_rank=0`` with ``node_rank=1`` on the second node and making
174
- sure to update ``--master_addr`` to the IP address of the first node:
175
-
176
- .. code-block:: console
177
-
178
- > python -m torch.distributed.launch --nproc_per_node=8 \
179
- --nnodes=2 --node_rank=0 --master_addr="192.168.1.1" \
180
- --master_port=12345 \
181
- $(which fairseq-train) data-bin/wmt16_en_de_bpe32k \
182
- --arch transformer_vaswani_wmt_en_de_big --share-all-embeddings \
183
- --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
184
- --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 \
185
- --lr 0.0005 \
186
- --dropout 0.3 --weight-decay 0.0 --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
187
- --max-tokens 3584 \
188
- --max-epoch 70 \
189
- --fp16
190
-
191
- On SLURM clusters, fairseq will automatically detect the number of nodes and
192
- GPUs, but a port number must be provided:
193
-
194
- .. code-block:: console
195
-
196
- > salloc --gpus=16 --nodes 2 (...)
197
- > srun fairseq-train --distributed-port 12345 (...).
198
-
199
- Sharding very large datasets
200
- ----------------------------
201
-
202
- It can be challenging to train over very large datasets, particularly if your
203
- machine does not have much system RAM. Most tasks in fairseq support training
204
- over "sharded" datasets, in which the original dataset has been preprocessed
205
- into non-overlapping chunks (or "shards").
206
-
207
- For example, instead of preprocessing all your data into a single "data-bin"
208
- directory, you can split the data and create "data-bin1", "data-bin2", etc.
209
- Then you can adapt your training command like so:
210
-
211
- .. code-block:: console
212
-
213
- > fairseq-train data-bin1:data-bin2:data-bin3 (...)
214
-
215
- Training will now iterate over each shard, one by one, with each shard
216
- corresponding to an "epoch", thus reducing system memory usage.
 
docs/hydra_integration.md DELETED
@@ -1,284 +0,0 @@
1
- ## Hydra
2
-
3
- [Hydra](https://github.com/facebookresearch/hydra) is an open-source Python
4
- framework that simplifies the development of research and other complex
5
- applications. The key feature is the ability to dynamically create a
6
- hierarchical configuration by composition and override it through config files
7
- and the command line. The name Hydra comes from its ability to run multiple
8
- similar jobs - much like a Hydra with multiple heads.
9
-
10
- ## Motivation
11
-
12
- Until recently, all components in fairseq were configured through a shared
13
- `args` namespace that was created at application startup. Components declared
14
- their own `add_args` method to update the argparse parser, hoping that the names
15
- would not clash with arguments from other components. While this model works for
16
- smaller applications, as fairseq grew and became integrated into other
17
- applications, this became problematic. In order to determine how to configure
18
- each component, one needed to a) examine what args were added by this component,
19
- and b) read the code to figure out what shared arguments it is using that were
20
- added in other places. Reproducing models involved sharing commands that often
21
- contained dozens of command line switches.
22
-
23
- The model described above is still supported by fairseq for backward
24
- compatibility, but will be deprecated some time in the future.
25
-
26
- New components in fairseq should now create a dataclass that encapsulates all
27
- parameters required to configure this component. The dataclass is registered
28
- along with the component, and fairseq takes care of constructing and providing
29
- this configuration object to the component's constructor. Note that sharing
30
- parameters can optionally still work, but one has to explicitly point to the
31
- "source of truth" (see inheritance example below). These changes make components
32
- in fairseq more independent and re-usable by other applications: all that is
33
- needed to create a component is to initialize its dataclass and overwrite some
34
- of the defaults.
35
-
36
- While configuring fairseq through command line (using either the legacy argparse
37
- based or the new Hydra based entry points) is still fully supported, you can now
38
- take advantage of configuring fairseq completely or piece-by-piece through
39
- hierarchical YAML configuration files. These files can also be shipped as
40
- examples that others can use to run an identically configured job.
41
-
42
- Additionally, Hydra has a rich and growing [library of
43
- plugins](https://github.com/facebookresearch/hydra/tree/master/plugins) that
44
- provide functionality such as hyperparameter sweeping (including using bayesian
45
- optimization through the [Ax](https://github.com/facebook/Ax) library), job
46
- launching across various platforms, and more.
47
-
48
- ## Creating or migrating components
49
-
50
- In general, each new (or updated) component should provide a companion
51
- [dataclass](https://www.python.org/dev/peps/pep-0557/). These dataclass are
52
- typically located in the same file as the component and are passed as arguments
53
- to the `register_*()` functions. Top-level configs that should be present in
54
- every fairseq application are placed in the
55
- [global](fairseq/dataclass/configs.py) config file and added to the
56
- `FairseqConfig` object.
57
-
58
- Each dataclass is a plain-old-data object, similar to a `NamedTuple`. These
59
- classes are decorated with a `@dataclass` decorator, and typically inherit from
60
- `FairseqDataclass` (which adds some functionality for backward compatibility).
61
- Each field must have a type, and generally has metadata (such as a help string)
62
- and a default value. Only primitive types or other config objects are allowed as
63
- data types for each field.
64
-
65
- #### Example:
66
-
67
- ```python
68
- from dataclasses import dataclass, field
69
- from fairseq.dataclass import FairseqDataclass
70
-
71
- @dataclass
72
- class InteractiveConfig(FairseqDataclass):
73
- buffer_size: int = field(
74
- default=0,
75
- metadata={
76
- "help": "read this many sentences into a buffer before processing them"
77
- },
78
- )
79
- input: str = field(
80
- default="-",
81
- metadata={"help": "file to read from; use - for stdin"},
82
- )
83
- ```
84
-
85
- ### Inherting values
86
-
87
- Some components require sharing a value. For example, a learning rate scheduler
88
- and an optimizer may both need to know the initial learning rate value. One can
89
- declare a field that, by default, will inherit its value from another config
90
- node in the same hierarchy:
91
-
92
- ```python
93
- @dataclass
94
- FairseqAdamConfig(FairseqDataclass):
95
- ...
96
- lr: List[float] = II("optimization.lr")
97
- ...
98
- ```
99
-
100
- `II("optimization.lr")` is syntactic sugar for `"${optimization.lr}"`, which is
101
- the value one can use in a YAML config file or through command line to achieve
102
- the same effect. Note that this assumes that there is an "optimization" config
103
- object in the root config and it has a field called "lr".
104
-
105
- ### Tasks and Models
106
-
107
- Creating Tasks and Models works same as before, except that legacy
108
- implementations now inherit from `LegacyFairseq*` base classes, while new
109
- components inherit from `FairseqTask` and `FairseqModel` and provide a dataclass
110
- to the `register_*()` functions.
111
-
112
- #### Task example:
113
-
114
- ```python
115
- @dataclass
116
- class LanguageModelingConfig(FairseqDataclass):
117
- data: Optional[str] = field(
118
- default=None, metadata={"help": "path to data directory"}
119
- )
120
- ...
121
-
122
- @register_task("language_modeling", dataclass=LanguageModelingConfig)
123
- class LanguageModelingTask(FairseqTask):
124
- ...
125
- @classmethod
126
- def setup_task(cls, cfg: LanguageModelingConfig):
127
- ...
128
- ```
129
-
130
- #### Model example:
131
-
132
- ```python
133
- @dataclass
134
- class TransformerLanguageModelConfig(FairseqDataclass):
135
- activation_fn: ChoiceEnum(utils.get_available_activation_fns()) = field(
136
- default="relu", metadata={"help": "activation function to use"}
137
- )
138
- dropout: float = field(default=0.1, metadata={"help": "dropout probability"})
139
- ...
140
-
141
- @register_model("transformer_lm", dataclass=TransformerLanguageModelConfig)
142
- class TransformerLanguageModel(FairseqLanguageModel):
143
- ...
144
- @classmethod
145
- def build_model(cls, cfg: TransformerLanguageModelConfig, task: FairseqTask):
146
- ...
147
- ```
148
-
149
- ### Other components
150
-
151
- Other components work as before, but they now take their configuration dataclass
152
- as the only constructor argument:
153
-
154
- ```python
155
- @dataclass
156
- class MosesTokenizerConfig(FairseqDataclass):
157
- source_lang: str = field(default="en", metadata={"help": "source language"})
158
- ...
159
-
160
- @register_tokenizer("moses", dataclass=MosesTokenizerConfig)
161
- class MosesTokenizer(object):
162
- def __init__(self, cfg: MosesTokenizerConfig):
163
- ...
164
- ```
165
-
166
- Note that if you are adding a new registry for a new set of components, you need
167
- to add it to the `FairseqConfig` object in `fairseq/dataclass/configs.py`:
168
-
169
- ```python
170
- @dataclass
171
- class FairseqConfig(object):
172
- ...
173
- my_new_registry: Any = None
174
- ```
175
-
176
- ## Training with `fairseq-hydra-train`
177
-
178
- To fully take advantage of configuration flexibility offered by Hydra, you may
179
- want to train new models using the `fairseq-hydra-train` entry point. Legacy CLI
180
- tools such as `fairseq-train` will remain supported for the foreseeable future
181
- but will be deprecated eventually.
182
-
183
- On startup, Hydra will create a configuration object that contains a hierarchy
184
- of all the necessary dataclasses populated with their default values in the
185
- code. The default values are overwritten by values found in YAML files in
186
- `fairseq/config` directory (which currently sets minimal defaults) and then
187
- further overwritten by values provided through command line arguments.
188
-
189
- Some of the most common use cases are shown below:
190
-
191
- ### 1. Override default values through command line:
192
-
193
- ```shell script
194
- $ fairseq-hydra-train \
195
- distributed_training.distributed_world_size=1 \
196
- dataset.batch_size=2 \
197
- task.data=data-bin \
198
- model=transformer_lm/transformer_lm_gpt \
199
- task=language_modeling \
200
- optimization.max_update=5000
201
- ```
202
-
203
- Note that along with explicitly providing values for parameters such as
204
- `dataset.batch_size`, this also tells Hydra to overlay configuration found in
205
- `fairseq/config/model/transformer_lm/transformer_lm_gpt.yaml` over the default
206
- values in the dataclass. If you want to train a model without specifying a
207
- particular architecture you can simply specify `model=transformer_lm`. This only
208
- works for migrated tasks and models.
209
-
210
- ### 2. Replace bundled configs with an external config:
211
-
212
- ```shell script
213
- $ fairseq-hydra-train \
214
- --config-dir /path/to/external/configs \
215
- --config-name wiki103
216
- ```
217
-
218
- where `/path/to/external/configs/wiki103.yaml` contains:
219
-
220
- ```yaml
221
- # @package _group_
222
-
223
- model:
224
- _name: transformer_lm
225
- distributed_training:
226
- distributed_world_size: 1
227
- dataset:
228
- batch_size: 2
229
- task:
230
- _name: language_modeling
231
- data: /path/to/data
232
- add_bos_token: false
233
- max_target_positions: 1024
234
- optimization:
235
- max_update: 50000
236
- lr: [ 0.25 ]
237
- criterion: cross_entropy
238
- optimizer: adam
239
- lr_scheduler:
240
- _name: cosine
241
- ```
242
-
243
- Note that here bundled configs from `fairseq/config` directory are not used,
244
- however the defaults from each dataclass will still be used (unless overwritten
245
- by your external config).
246
-
247
- Additionally you can choose to break up your configs by creating a directory
248
- structure in the same location as your main config file, with the names of the
249
- top-level fields (such as "model", "dataset", etc), and placing config files
250
- with meaningful names that would populate that specific section of your
251
- top-level config file (for example, you might have
252
- `model/small_transformer_lm.yaml`, `model/big_transformer_lm.yaml`, etc). You
253
- can then specify the correct configuration via command line, defaults in the
254
- main config, or even launch all of them as a sweep (see Hydra documentation on
255
- how to do this).
256
-
257
- ### 3. Add an external config directory to Hydra search path:
258
-
259
- This allows combining default configuration (including using any bundled config
260
- files), while specifying your own config files for some parts of the
261
- configuration.
262
-
263
- ```shell script
264
- $ fairseq-hydra-train \
265
- distributed_training.distributed_world_size=1 \
266
- dataset.batch_size=2 \
267
- task.data=/path/to/data/ \
268
- model=transformer_lm/2_layers \
269
- task=language_modeling \
270
- optimization.max_update=5000 \
271
- --config-dir /path/to/external/configs
272
- ```
273
-
274
- where `/path/to/external/configs` has the following structure:
275
- ```
276
- .
277
- +-- model
278
- | +-- transformer_lm
279
- | | +-- 2_layers.yaml
280
- ```
281
-
282
- and `2_layers.yaml` contains a copy of `transformer_lm_gpt.yaml` but with
283
- `decoder_layers` set to 2. You can add other configs to configure other
284
- components as well.
 
docs/index.rst DELETED
@@ -1,49 +0,0 @@
1
- .. fairseq documentation master file, created by
2
- sphinx-quickstart on Fri Aug 17 21:45:30 2018.
3
- You can adapt this file completely to your liking, but it should at least
4
- contain the root `toctree` directive.
5
-
6
- :github_url: https://github.com/pytorch/fairseq
7
-
8
-
9
- fairseq documentation
10
- =====================
11
-
12
- Fairseq is a sequence modeling toolkit written in `PyTorch
13
- <http://pytorch.org/>`_ that allows researchers and developers to
14
- train custom models for translation, summarization, language modeling and other
15
- text generation tasks.
16
-
17
- .. toctree::
18
- :maxdepth: 1
19
- :caption: Getting Started
20
-
21
- getting_started
22
- command_line_tools
23
-
24
- .. toctree::
25
- :maxdepth: 1
26
- :caption: Extending Fairseq
27
-
28
- overview
29
- tutorial_simple_lstm
30
- tutorial_classifying_names
31
-
32
- .. toctree::
33
- :maxdepth: 2
34
- :caption: Library Reference
35
-
36
- tasks
37
- models
38
- criterions
39
- optim
40
- lr_scheduler
41
- data
42
- modules
43
-
44
-
45
- Indices and tables
46
- ==================
47
-
48
- * :ref:`genindex`
49
- * :ref:`search`
 
docs/lr_scheduler.rst DELETED
@@ -1,34 +0,0 @@
1
- .. role:: hidden
2
- :class: hidden-section
3
-
4
- .. _Learning Rate Schedulers:
5
-
6
- Learning Rate Schedulers
7
- ========================
8
-
9
- Learning Rate Schedulers update the learning rate over the course of training.
10
- Learning rates can be updated after each update via :func:`step_update` or at
11
- epoch boundaries via :func:`step`.
12
-
13
- .. automodule:: fairseq.optim.lr_scheduler
14
- :members:
15
-
16
- .. autoclass:: fairseq.optim.lr_scheduler.FairseqLRScheduler
17
- :members:
18
- :undoc-members:
19
-
20
- .. autoclass:: fairseq.optim.lr_scheduler.cosine_lr_scheduler.CosineSchedule
21
- :members:
22
- :undoc-members:
23
- .. autoclass:: fairseq.optim.lr_scheduler.fixed_schedule.FixedSchedule
24
- :members:
25
- :undoc-members:
26
- .. autoclass:: fairseq.optim.lr_scheduler.inverse_square_root_schedule.InverseSquareRootSchedule
27
- :members:
28
- :undoc-members:
29
- .. autoclass:: fairseq.optim.lr_scheduler.reduce_lr_on_plateau.ReduceLROnPlateau
30
- :members:
31
- :undoc-members:
32
- .. autoclass:: fairseq.optim.lr_scheduler.triangular_lr_scheduler.TriangularSchedule
33
- :members:
34
- :undoc-members:
 
docs/make.bat DELETED
@@ -1,36 +0,0 @@
1
- @ECHO OFF
2
-
3
- pushd %~dp0
4
-
5
- REM Command file for Sphinx documentation
6
-
7
- if "%SPHINXBUILD%" == "" (
8
- set SPHINXBUILD=python -msphinx
9
- )
10
- set SOURCEDIR=.
11
- set BUILDDIR=_build
12
- set SPHINXPROJ=fairseq
13
-
14
- if "%1" == "" goto help
15
-
16
- %SPHINXBUILD% >NUL 2>NUL
17
- if errorlevel 9009 (
18
- echo.
19
- echo.The Sphinx module was not found. Make sure you have Sphinx installed,
20
- echo.then set the SPHINXBUILD environment variable to point to the full
21
- echo.path of the 'sphinx-build' executable. Alternatively you may add the
22
- echo.Sphinx directory to PATH.
23
- echo.
24
- echo.If you don't have Sphinx installed, grab it from
25
- echo.http://sphinx-doc.org/
26
- exit /b 1
27
- )
28
-
29
- %SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS%
30
- goto end
31
-
32
- :help
33
- %SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS%
34
-
35
- :end
36
- popd
 
docs/models.rst DELETED
@@ -1,104 +0,0 @@
1
- .. role:: hidden
2
- :class: hidden-section
3
-
4
- .. module:: fairseq.models
5
-
6
- .. _Models:
7
-
8
- Models
9
- ======
10
-
11
- A Model defines the neural network's ``forward()`` method and encapsulates all
12
- of the learnable parameters in the network. Each model also provides a set of
13
- named *architectures* that define the precise network configuration (e.g.,
14
- embedding dimension, number of layers, etc.).
15
-
16
- Both the model type and architecture are selected via the ``--arch``
17
- command-line argument. Once selected, a model may expose additional command-line
18
- arguments for further configuration.
19
-
20
- .. note::
21
-
22
- All fairseq Models extend :class:`BaseFairseqModel`, which in turn extends
23
- :class:`torch.nn.Module`. Thus any fairseq Model can be used as a
24
- stand-alone Module in other PyTorch code.
25
-
26
-
27
- Convolutional Neural Networks (CNN)
28
- -----------------------------------
29
-
30
- .. module:: fairseq.models.fconv
31
- .. autoclass:: fairseq.models.fconv.FConvModel
32
- :members:
33
- .. autoclass:: fairseq.models.fconv.FConvEncoder
34
- :members:
35
- :undoc-members:
36
- .. autoclass:: fairseq.models.fconv.FConvDecoder
37
- :members:
38
-
39
-
40
- Long Short-Term Memory (LSTM) networks
41
- --------------------------------------
42
-
43
- .. module:: fairseq.models.lstm
44
- .. autoclass:: fairseq.models.lstm.LSTMModel
45
- :members:
46
- .. autoclass:: fairseq.models.lstm.LSTMEncoder
47
- :members:
48
- .. autoclass:: fairseq.models.lstm.LSTMDecoder
49
- :members:
50
-
51
-
52
- Transformer (self-attention) networks
53
- -------------------------------------
54
-
55
- .. module:: fairseq.models.transformer
56
- .. autoclass:: fairseq.models.transformer.TransformerModel
57
- :members:
58
- .. autoclass:: fairseq.models.transformer.TransformerEncoder
59
- :members:
60
- .. autoclass:: fairseq.models.transformer.TransformerEncoderLayer
61
- :members:
62
- .. autoclass:: fairseq.models.transformer.TransformerDecoder
63
- :members:
64
- .. autoclass:: fairseq.models.transformer.TransformerDecoderLayer
65
- :members:
66
-
67
-
68
- Adding new models
69
- -----------------
70
-
71
- .. currentmodule:: fairseq.models
72
- .. autofunction:: fairseq.models.register_model
73
- .. autofunction:: fairseq.models.register_model_architecture
74
- .. autoclass:: fairseq.models.BaseFairseqModel
75
- :members:
76
- :undoc-members:
77
- .. autoclass:: fairseq.models.FairseqEncoderDecoderModel
78
- :members:
79
- :undoc-members:
80
- .. autoclass:: fairseq.models.FairseqEncoderModel
81
- :members:
82
- :undoc-members:
83
- .. autoclass:: fairseq.models.FairseqLanguageModel
84
- :members:
85
- :undoc-members:
86
- .. autoclass:: fairseq.models.FairseqMultiModel
87
- :members:
88
- :undoc-members:
89
- .. autoclass:: fairseq.models.FairseqEncoder
90
- :members:
91
- .. autoclass:: fairseq.models.CompositeEncoder
92
- :members:
93
- .. autoclass:: fairseq.models.FairseqDecoder
94
- :members:
95
-
96
-
97
- .. _Incremental decoding:
98
-
99
- Incremental decoding
100
- --------------------
101
-
102
- .. autoclass:: fairseq.models.FairseqIncrementalDecoder
103
- :members:
104
- :undoc-members:
 
docs/modules.rst DELETED
@@ -1,9 +0,0 @@
1
- Modules
2
- =======
3
-
4
- Fairseq provides several stand-alone :class:`torch.nn.Module` classes that may
5
- be helpful when implementing a new :class:`~fairseq.models.BaseFairseqModel`.
6
-
7
- .. automodule:: fairseq.modules
8
- :members:
9
- :undoc-members:
 
docs/optim.rst DELETED
@@ -1,38 +0,0 @@
1
- .. role:: hidden
2
- :class: hidden-section
3
-
4
- .. _optimizers:
5
-
6
- Optimizers
7
- ==========
8
-
9
- Optimizers update the Model parameters based on the gradients.
10
-
11
- .. automodule:: fairseq.optim
12
- :members:
13
-
14
- .. autoclass:: fairseq.optim.FairseqOptimizer
15
- :members:
16
- :undoc-members:
17
-
18
- .. autoclass:: fairseq.optim.adadelta.Adadelta
19
- :members:
20
- :undoc-members:
21
- .. autoclass:: fairseq.optim.adagrad.Adagrad
22
- :members:
23
- :undoc-members:
24
- .. autoclass:: fairseq.optim.adafactor.FairseqAdafactor
25
- :members:
26
- :undoc-members:
27
- .. autoclass:: fairseq.optim.adam.FairseqAdam
28
- :members:
29
- :undoc-members:
30
- .. autoclass:: fairseq.optim.fp16_optimizer.FP16Optimizer
31
- :members:
32
- :undoc-members:
33
- .. autoclass:: fairseq.optim.nag.FairseqNAG
34
- :members:
35
- :undoc-members:
36
- .. autoclass:: fairseq.optim.sgd.SGD
37
- :members:
38
- :undoc-members:
 
docs/overview.rst DELETED
@@ -1,74 +0,0 @@
1
- Overview
2
- ========
3
-
4
- Fairseq can be extended through user-supplied `plug-ins
5
- <https://en.wikipedia.org/wiki/Plug-in_(computing)>`_. We support five kinds of
6
- plug-ins:
7
-
8
- - :ref:`Models` define the neural network architecture and encapsulate all of the
9
- learnable parameters.
10
- - :ref:`Criterions` compute the loss function given the model outputs and targets.
11
- - :ref:`Tasks` store dictionaries and provide helpers for loading/iterating over
12
- Datasets, initializing the Model/Criterion and calculating the loss.
13
- - :ref:`Optimizers` update the Model parameters based on the gradients.
14
- - :ref:`Learning Rate Schedulers` update the learning rate over the course of
15
- training.
16
-
17
- **Training Flow**
18
-
19
- Given a ``model``, ``criterion``, ``task``, ``optimizer`` and ``lr_scheduler``,
20
- fairseq implements the following high-level training flow::
21
-
22
- for epoch in range(num_epochs):
23
- itr = task.get_batch_iterator(task.dataset('train'))
24
- for num_updates, batch in enumerate(itr):
25
- task.train_step(batch, model, criterion, optimizer)
26
- average_and_clip_gradients()
27
- optimizer.step()
28
- lr_scheduler.step_update(num_updates)
29
- lr_scheduler.step(epoch)
30
-
31
- where the default implementation for ``task.train_step`` is roughly::
32
-
33
- def train_step(self, batch, model, criterion, optimizer, **unused):
34
- loss = criterion(model, batch)
35
- optimizer.backward(loss)
36
- return loss
37
-
38
- **Registering new plug-ins**
39
-
40
- New plug-ins are *registered* through a set of ``@register`` function
41
- decorators, for example::
42
-
43
- @register_model('my_lstm')
44
- class MyLSTM(FairseqEncoderDecoderModel):
45
- (...)
46
-
47
- Once registered, new plug-ins can be used with the existing :ref:`Command-line
48
- Tools`. See the Tutorial sections for more detailed walkthroughs of how to add
49
- new plug-ins.
50
-
51
- **Loading plug-ins from another directory**
52
-
53
- New plug-ins can be defined in a custom module stored in the user system. In
54
- order to import the module, and make the plugin available to *fairseq*, the
55
- command line supports the ``--user-dir`` flag that can be used to specify a
56
- custom location for additional modules to load into *fairseq*.
57
-
58
- For example, assuming this directory tree::
59
-
60
- /home/user/my-module/
61
- └── __init__.py
62
-
63
- with ``__init__.py``::
64
-
65
- from fairseq.models import register_model_architecture
66
- from fairseq.models.transformer import transformer_vaswani_wmt_en_de_big
67
-
68
- @register_model_architecture('transformer', 'my_transformer')
69
- def transformer_mmt_big(args):
70
- transformer_vaswani_wmt_en_de_big(args)
71
-
72
- it is possible to invoke the :ref:`fairseq-train` script with the new architecture with::
73
-
74
- fairseq-train ... --user-dir /home/user/my-module -a my_transformer --task translation
 
docs/requirements.txt DELETED
@@ -1,2 +0,0 @@
1
- sphinx<2.0
2
- sphinx-argparse
 
 
 
docs/tasks.rst DELETED
@@ -1,61 +0,0 @@
1
- .. role:: hidden
2
- :class: hidden-section
3
-
4
- .. module:: fairseq.tasks
5
-
6
- .. _Tasks:
7
-
8
- Tasks
9
- =====
10
-
11
- Tasks store dictionaries and provide helpers for loading/iterating over
12
- Datasets, initializing the Model/Criterion and calculating the loss.
13
-
14
- Tasks can be selected via the ``--task`` command-line argument. Once selected, a
15
- task may expose additional command-line arguments for further configuration.
16
-
17
- Example usage::
18
-
19
- # setup the task (e.g., load dictionaries)
20
- task = fairseq.tasks.setup_task(args)
21
-
22
- # build model and criterion
23
- model = task.build_model(args)
24
- criterion = task.build_criterion(args)
25
-
26
- # load datasets
27
- task.load_dataset('train')
28
- task.load_dataset('valid')
29
-
30
- # iterate over mini-batches of data
31
- batch_itr = task.get_batch_iterator(
32
- task.dataset('train'), max_tokens=4096,
33
- )
34
- for batch in batch_itr:
35
- # compute the loss
36
- loss, sample_size, logging_output = task.get_loss(
37
- model, criterion, batch,
38
- )
39
- loss.backward()
40
-
41
-
42
- Translation
43
- -----------
44
-
45
- .. autoclass:: fairseq.tasks.translation.TranslationTask
46
-
47
- .. _language modeling:
48
-
49
- Language Modeling
50
- -----------------
51
-
52
- .. autoclass:: fairseq.tasks.language_modeling.LanguageModelingTask
53
-
54
-
55
- Adding new tasks
56
- ----------------
57
-
58
- .. autofunction:: fairseq.tasks.register_task
59
- .. autoclass:: fairseq.tasks.FairseqTask
60
- :members:
61
- :undoc-members:
 
docs/tutorial_classifying_names.rst DELETED
@@ -1,415 +0,0 @@
1
- Tutorial: Classifying Names with a Character-Level RNN
2
- ======================================================
3
-
4
- In this tutorial we will extend fairseq to support *classification* tasks. In
5
- particular we will re-implement the PyTorch tutorial for `Classifying Names with
6
- a Character-Level RNN <https://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html>`_
7
- in fairseq. It is recommended to quickly skim that tutorial before beginning
8
- this one.
9
-
10
- This tutorial covers:
11
-
12
- 1. **Preprocessing the data** to create dictionaries.
13
- 2. **Registering a new Model** that encodes an input sentence with a simple RNN
14
- and predicts the output label.
15
- 3. **Registering a new Task** that loads our dictionaries and dataset.
16
- 4. **Training the Model** using the existing command-line tools.
17
- 5. **Writing an evaluation script** that imports fairseq and allows us to
18
- interactively evaluate our model on new inputs.
19
-
20
-
21
- 1. Preprocessing the data
22
- -------------------------
23
-
24
- The original tutorial provides raw data, but we'll work with a modified version
25
- of the data that is already tokenized into characters and split into separate
26
- train, valid and test sets.
27
-
28
- Download and extract the data from here:
29
- `tutorial_names.tar.gz <https://dl.fbaipublicfiles.com/fairseq/data/tutorial_names.tar.gz>`_
30
-
31
- Once extracted, let's preprocess the data using the :ref:`fairseq-preprocess`
32
- command-line tool to create the dictionaries. While this tool is primarily
33
- intended for sequence-to-sequence problems, we're able to reuse it here by
34
- treating the label as a "target" sequence of length 1. We'll also output the
35
- preprocessed files in "raw" format using the ``--dataset-impl`` option to
36
- enhance readability:
37
-
38
- .. code-block:: console
39
-
40
- > fairseq-preprocess \
41
- --trainpref names/train --validpref names/valid --testpref names/test \
42
- --source-lang input --target-lang label \
43
- --destdir names-bin --dataset-impl raw
44
-
45
- After running the above command you should see a new directory,
46
- :file:`names-bin/`, containing the dictionaries for *inputs* and *labels*.
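-
- As a quick sanity check, the new directory should contain at least the two
- dictionary files that are loaded later in this tutorial (other files in
- :file:`names-bin/` are elided here):
-
- .. code-block:: console
-
-     > ls names-bin/
-     dict.input.txt  dict.label.txt  (...)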
47
-
48
-
49
- 2. Registering a new Model
50
- --------------------------
51
-
52
- Next we'll register a new model in fairseq that will encode an input sentence
53
- with a simple RNN and predict the output label. Compared to the original PyTorch
54
- tutorial, our version will also work with batches of data and GPU Tensors.
55
-
56
- First let's copy the simple RNN module implemented in the `PyTorch tutorial
57
- <https://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html#creating-the-network>`_.
58
- Create a new file named :file:`fairseq/models/rnn_classifier.py` with the
59
- following contents::
60
-
61
- import torch
62
- import torch.nn as nn
63
-
64
- class RNN(nn.Module):
65
-
66
- def __init__(self, input_size, hidden_size, output_size):
67
- super(RNN, self).__init__()
68
-
69
- self.hidden_size = hidden_size
70
-
71
- self.i2h = nn.Linear(input_size + hidden_size, hidden_size)
72
- self.i2o = nn.Linear(input_size + hidden_size, output_size)
73
- self.softmax = nn.LogSoftmax(dim=1)
74
-
75
- def forward(self, input, hidden):
76
- combined = torch.cat((input, hidden), 1)
77
- hidden = self.i2h(combined)
78
- output = self.i2o(combined)
79
- output = self.softmax(output)
80
- return output, hidden
81
-
82
- def initHidden(self):
83
- return torch.zeros(1, self.hidden_size)
84
-
85
- We must also *register* this model with fairseq using the
86
- :func:`~fairseq.models.register_model` function decorator. Once the model is
87
- registered we'll be able to use it with the existing :ref:`Command-line Tools`.
88
-
89
- All registered models must implement the :class:`~fairseq.models.BaseFairseqModel`
90
- interface, so we'll create a small wrapper class in the same file and register
91
- it in fairseq with the name ``'rnn_classifier'``::
92
-
93
- from fairseq.models import BaseFairseqModel, register_model
94
-
95
- # Note: the register_model "decorator" should immediately precede the
96
- # definition of the Model class.
97
-
98
- @register_model('rnn_classifier')
99
- class FairseqRNNClassifier(BaseFairseqModel):
100
-
101
- @staticmethod
102
- def add_args(parser):
103
- # Models can override this method to add new command-line arguments.
104
- # Here we'll add a new command-line argument to configure the
105
- # dimensionality of the hidden state.
106
- parser.add_argument(
107
- '--hidden-dim', type=int, metavar='N',
108
- help='dimensionality of the hidden state',
109
- )
110
-
111
- @classmethod
112
- def build_model(cls, args, task):
113
- # Fairseq initializes models by calling the ``build_model()``
114
- # function. This provides more flexibility, since the returned model
115
- # instance can be of a different type than the one that was called.
116
- # In this case we'll just return a FairseqRNNClassifier instance.
117
-
118
- # Initialize our RNN module
119
- rnn = RNN(
120
- # We'll define the Task in the next section, but for now just
121
- # notice that the task holds the dictionaries for the "source"
122
- # (i.e., the input sentence) and "target" (i.e., the label).
123
- input_size=len(task.source_dictionary),
124
- hidden_size=args.hidden_dim,
125
- output_size=len(task.target_dictionary),
126
- )
127
-
128
- # Return the wrapped version of the module
129
- return FairseqRNNClassifier(
130
- rnn=rnn,
131
- input_vocab=task.source_dictionary,
132
- )
133
-
134
- def __init__(self, rnn, input_vocab):
135
- super(FairseqRNNClassifier, self).__init__()
136
-
137
- self.rnn = rnn
138
- self.input_vocab = input_vocab
139
-
140
- # The RNN module in the tutorial expects one-hot inputs, so we can
141
- # precompute the identity matrix to help convert from indices to
142
- # one-hot vectors. We register it as a buffer so that it is moved to
143
- # the GPU when ``cuda()`` is called.
144
- self.register_buffer('one_hot_inputs', torch.eye(len(input_vocab)))
145
-
146
- def forward(self, src_tokens, src_lengths):
147
- # The inputs to the ``forward()`` function are determined by the
148
- # Task, and in particular the ``'net_input'`` key in each
149
- # mini-batch. We'll define the Task in the next section, but for
150
- # now just know that *src_tokens* has shape `(batch, src_len)` and
151
- # *src_lengths* has shape `(batch)`.
152
- bsz, max_src_len = src_tokens.size()
153
-
154
- # Initialize the RNN hidden state. Compared to the original PyTorch
155
- # tutorial we'll also handle batched inputs and work on the GPU.
156
- hidden = self.rnn.initHidden()
157
- hidden = hidden.repeat(bsz, 1) # expand for batched inputs
158
- hidden = hidden.to(src_tokens.device) # move to GPU
159
-
160
- for i in range(max_src_len):
161
- # WARNING: The inputs have padding, so we should mask those
162
- # elements here so that padding doesn't affect the results.
163
- # This is left as an exercise for the reader. The padding symbol
164
- # is given by ``self.input_vocab.pad()`` and the unpadded length
165
- # of each input is given by *src_lengths*.
166
-
167
- # One-hot encode a batch of input characters.
168
- input = self.one_hot_inputs[src_tokens[:, i].long()]
169
-
170
- # Feed the input to our RNN.
171
- output, hidden = self.rnn(input, hidden)
172
-
173
- # Return the final output state for making a prediction
174
- return output
175
-
176
- Finally let's define a *named architecture* with the configuration for our
177
- model. This is done with the :func:`~fairseq.models.register_model_architecture`
178
- function decorator. Thereafter this named architecture can be used with the
179
- ``--arch`` command-line argument, e.g., ``--arch pytorch_tutorial_rnn``::
180
-
181
- from fairseq.models import register_model_architecture
182
-
183
- # The first argument to ``register_model_architecture()`` should be the name
184
- # of the model we registered above (i.e., 'rnn_classifier'). The function we
185
- # register here should take a single argument *args* and modify it in-place
186
- # to match the desired architecture.
187
-
188
- @register_model_architecture('rnn_classifier', 'pytorch_tutorial_rnn')
189
- def pytorch_tutorial_rnn(args):
190
- # We use ``getattr()`` to prioritize arguments that are explicitly given
191
- # on the command-line, so that the defaults defined below are only used
192
- # when no other value has been specified.
193
- args.hidden_dim = getattr(args, 'hidden_dim', 128)
194
-
195
-
196
- 3. Registering a new Task
197
- -------------------------
198
-
199
- Now we'll register a new :class:`~fairseq.tasks.FairseqTask` that will load our
200
- dictionaries and dataset. Tasks can also control how the data is batched into
201
- mini-batches, but in this tutorial we'll reuse the batching provided by
202
- :class:`fairseq.data.LanguagePairDataset`.
203
-
204
- Create a new file named :file:`fairseq/tasks/simple_classification.py` with the
205
- following contents::
206
-
207
- import os
208
- import torch
209
-
210
- from fairseq.data import Dictionary, LanguagePairDataset
211
- from fairseq.tasks import FairseqTask, register_task
212
-
213
-
214
- @register_task('simple_classification')
215
- class SimpleClassificationTask(FairseqTask):
216
-
217
- @staticmethod
218
- def add_args(parser):
219
- # Add some command-line arguments for specifying where the data is
220
- # located and the maximum supported input length.
221
- parser.add_argument('data', metavar='FILE',
222
- help='file prefix for data')
223
- parser.add_argument('--max-positions', default=1024, type=int,
224
- help='max input length')
225
-
226
- @classmethod
227
- def setup_task(cls, args, **kwargs):
228
- # Here we can perform any setup required for the task. This may include
229
- # loading Dictionaries, initializing shared Embedding layers, etc.
230
- # In this case we'll just load the Dictionaries.
231
- input_vocab = Dictionary.load(os.path.join(args.data, 'dict.input.txt'))
232
- label_vocab = Dictionary.load(os.path.join(args.data, 'dict.label.txt'))
233
- print('| [input] dictionary: {} types'.format(len(input_vocab)))
234
- print('| [label] dictionary: {} types'.format(len(label_vocab)))
235
-
236
- return SimpleClassificationTask(args, input_vocab, label_vocab)
237
-
238
- def __init__(self, args, input_vocab, label_vocab):
239
- super().__init__(args)
240
- self.input_vocab = input_vocab
241
- self.label_vocab = label_vocab
242
-
243
- def load_dataset(self, split, **kwargs):
244
- """Load a given dataset split (e.g., train, valid, test)."""
245
-
246
- prefix = os.path.join(self.args.data, '{}.input-label'.format(split))
247
-
248
- # Read input sentences.
249
- sentences, lengths = [], []
250
- with open(prefix + '.input', encoding='utf-8') as file:
251
- for line in file:
252
- sentence = line.strip()
253
-
254
- # Tokenize the sentence, splitting on spaces
255
- tokens = self.input_vocab.encode_line(
256
- sentence, add_if_not_exist=False,
257
- )
258
-
259
- sentences.append(tokens)
260
- lengths.append(tokens.numel())
261
-
262
- # Read labels.
263
- labels = []
264
- with open(prefix + '.label', encoding='utf-8') as file:
265
- for line in file:
266
- label = line.strip()
267
- labels.append(
268
- # Convert label to a numeric ID.
269
- torch.LongTensor([self.label_vocab.add_symbol(label)])
270
- )
271
-
272
- assert len(sentences) == len(labels)
273
- print('| {} {} {} examples'.format(self.args.data, split, len(sentences)))
274
-
275
- # We reuse LanguagePairDataset since classification can be modeled as a
276
- # sequence-to-sequence task where the target sequence has length 1.
277
- self.datasets[split] = LanguagePairDataset(
278
- src=sentences,
279
- src_sizes=lengths,
280
- src_dict=self.input_vocab,
281
- tgt=labels,
282
- tgt_sizes=torch.ones(len(labels)), # targets have length 1
283
- tgt_dict=self.label_vocab,
284
- left_pad_source=False,
285
- # Since our target is a single class label, there's no need for
286
- # teacher forcing. If we set this to ``True`` then our Model's
287
- # ``forward()`` method would receive an additional argument called
288
- # *prev_output_tokens* that would contain a shifted version of the
289
- # target sequence.
290
- input_feeding=False,
291
- )
292
-
293
- def max_positions(self):
294
- """Return the max input length allowed by the task."""
295
- # The source should be less than *args.max_positions* and the "target"
296
- # has max length 1.
297
- return (self.args.max_positions, 1)
298
-
299
- @property
300
- def source_dictionary(self):
301
- """Return the source :class:`~fairseq.data.Dictionary`."""
302
- return self.input_vocab
303
-
304
- @property
305
- def target_dictionary(self):
306
- """Return the target :class:`~fairseq.data.Dictionary`."""
307
- return self.label_vocab
308
-
309
- # We could override this method if we wanted more control over how batches
310
- # are constructed, but it's not necessary for this tutorial since we can
311
- # reuse the batching provided by LanguagePairDataset.
312
- #
313
- # def get_batch_iterator(
314
- # self, dataset, max_tokens=None, max_sentences=None, max_positions=None,
315
- # ignore_invalid_inputs=False, required_batch_size_multiple=1,
316
- # seed=1, num_shards=1, shard_id=0, num_workers=0, epoch=1,
317
- # data_buffer_size=0, disable_iterator_cache=False,
318
- # ):
319
- # (...)
320
-
321
-
322
- 4. Training the Model
323
- ---------------------
324
-
325
- Now we're ready to train the model. We can use the existing :ref:`fairseq-train`
326
- command-line tool for this, making sure to specify our new Task (``--task
327
- simple_classification``) and Model architecture (``--arch
328
- pytorch_tutorial_rnn``):
329
-
330
- .. note::
331
-
332
- You can also configure the dimensionality of the hidden state by passing the
333
- ``--hidden-dim`` argument to :ref:`fairseq-train`.
334
-
335
- .. code-block:: console
336
-
337
- > fairseq-train names-bin \
338
- --task simple_classification \
339
- --arch pytorch_tutorial_rnn \
340
- --optimizer adam --lr 0.001 --lr-shrink 0.5 \
341
- --max-tokens 1000
342
- (...)
343
- | epoch 027 | loss 1.200 | ppl 2.30 | wps 15728 | ups 119.4 | wpb 116 | bsz 116 | num_updates 3726 | lr 1.5625e-05 | gnorm 1.290 | clip 0% | oom 0 | wall 32 | train_wall 21
344
- | epoch 027 | valid on 'valid' subset | valid_loss 1.41304 | valid_ppl 2.66 | num_updates 3726 | best 1.41208
345
- | done training in 31.6 seconds
346
-
347
- The model files should appear in the :file:`checkpoints/` directory.
348
-
349
-
350
- 5. Writing an evaluation script
351
- -------------------------------
352
-
353
- Finally we can write a short script to evaluate our model on new inputs. Create
354
- a new file named :file:`eval_classifier.py` with the following contents::
355
-
356
- from fairseq import checkpoint_utils, data, options, tasks
357
-
358
- # Parse command-line arguments for generation
359
- parser = options.get_generation_parser(default_task='simple_classification')
360
- args = options.parse_args_and_arch(parser)
361
-
362
- # Setup task
363
- task = tasks.setup_task(args)
364
-
365
- # Load model
366
- print('| loading model from {}'.format(args.path))
367
- models, _model_args = checkpoint_utils.load_model_ensemble([args.path], task=task)
368
- model = models[0]
369
-
370
- while True:
371
- sentence = input('\nInput: ')
372
-
373
- # Tokenize into characters
374
- chars = ' '.join(list(sentence.strip()))
375
- tokens = task.source_dictionary.encode_line(
376
- chars, add_if_not_exist=False,
377
- )
378
-
379
- # Build mini-batch to feed to the model
380
- batch = data.language_pair_dataset.collate(
381
- samples=[{'id': -1, 'source': tokens}], # bsz = 1
382
- pad_idx=task.source_dictionary.pad(),
383
- eos_idx=task.source_dictionary.eos(),
384
- left_pad_source=False,
385
- input_feeding=False,
386
- )
387
-
388
- # Feed batch to the model and get predictions
389
- preds = model(**batch['net_input'])
390
-
391
- # Print top 3 predictions and their log-probabilities
392
- top_scores, top_labels = preds[0].topk(k=3)
393
- for score, label_idx in zip(top_scores, top_labels):
394
- label_name = task.target_dictionary.string([label_idx])
395
- print('({:.2f})\t{}'.format(score, label_name))
396
-
397
- Now we can evaluate our model interactively. Note that we have included the
398
- original data path (:file:`names-bin/`) so that the dictionaries can be loaded:
399
-
400
- .. code-block:: console
401
-
402
- > python eval_classifier.py names-bin --path checkpoints/checkpoint_best.pt
403
- | [input] dictionary: 64 types
404
- | [label] dictionary: 24 types
405
- | loading model from checkpoints/checkpoint_best.pt
406
-
407
- Input: Satoshi
408
- (-0.61) Japanese
409
- (-1.20) Arabic
410
- (-2.86) Italian
411
-
412
- Input: Sinbad
413
- (-0.30) Arabic
414
- (-1.76) English
415
- (-4.08) Russian
 
docs/tutorial_simple_lstm.rst DELETED
@@ -1,518 +0,0 @@
1
- Tutorial: Simple LSTM
2
- =====================
3
-
4
- In this tutorial we will extend fairseq by adding a new
5
- :class:`~fairseq.models.FairseqEncoderDecoderModel` that encodes a source
6
- sentence with an LSTM and then passes the final hidden state to a second LSTM
7
- that decodes the target sentence (without attention).
8
-
9
- This tutorial covers:
10
-
11
- 1. **Writing an Encoder and Decoder** to encode/decode the source/target
12
- sentence, respectively.
13
- 2. **Registering a new Model** so that it can be used with the existing
14
- :ref:`Command-line tools`.
15
- 3. **Training the Model** using the existing command-line tools.
16
- 4. **Making generation faster** by modifying the Decoder to use
17
- :ref:`Incremental decoding`.
18
-
19
-
20
- 1. Building an Encoder and Decoder
21
- ----------------------------------
22
-
23
- In this section we'll define a simple LSTM Encoder and Decoder. All Encoders
24
- should implement the :class:`~fairseq.models.FairseqEncoder` interface and
25
- Decoders should implement the :class:`~fairseq.models.FairseqDecoder` interface.
26
- These interfaces themselves extend :class:`torch.nn.Module`, so FairseqEncoders
27
- and FairseqDecoders can be written and used in the same ways as ordinary PyTorch
28
- Modules.
29
-
30
-
31
- Encoder
32
- ~~~~~~~
33
-
34
- Our Encoder will embed the tokens in the source sentence, feed them to a
35
- :class:`torch.nn.LSTM` and return the final hidden state. To create our encoder
36
- save the following in a new file named :file:`fairseq/models/simple_lstm.py`::
37
-
38
- import torch.nn as nn
39
- from fairseq import utils
40
- from fairseq.models import FairseqEncoder
41
-
42
- class SimpleLSTMEncoder(FairseqEncoder):
43
-
44
- def __init__(
45
- self, args, dictionary, embed_dim=128, hidden_dim=128, dropout=0.1,
46
- ):
47
- super().__init__(dictionary)
48
- self.args = args
49
-
50
- # Our encoder will embed the inputs before feeding them to the LSTM.
51
- self.embed_tokens = nn.Embedding(
52
- num_embeddings=len(dictionary),
53
- embedding_dim=embed_dim,
54
- padding_idx=dictionary.pad(),
55
- )
56
- self.dropout = nn.Dropout(p=dropout)
57
-
58
- # We'll use a single-layer, unidirectional LSTM for simplicity.
59
- self.lstm = nn.LSTM(
60
- input_size=embed_dim,
61
- hidden_size=hidden_dim,
62
- num_layers=1,
63
- bidirectional=False,
64
- batch_first=True,
65
- )
66
-
67
- def forward(self, src_tokens, src_lengths):
68
- # The inputs to the ``forward()`` function are determined by the
69
- # Task, and in particular the ``'net_input'`` key in each
70
- # mini-batch. We discuss Tasks in the next tutorial, but for now just
71
- # know that *src_tokens* has shape `(batch, src_len)` and *src_lengths*
72
- # has shape `(batch)`.
73
-
74
- # Note that the source is typically padded on the left. This can be
75
- # configured by adding the `--left-pad-source "False"` command-line
76
- # argument, but here we'll make the Encoder handle either kind of
77
- # padding by converting everything to be right-padded.
78
- if self.args.left_pad_source:
79
- # Convert left-padding to right-padding.
80
- src_tokens = utils.convert_padding_direction(
81
- src_tokens,
82
- padding_idx=self.dictionary.pad(),
83
- left_to_right=True
84
- )
85
-
86
- # Embed the source.
87
- x = self.embed_tokens(src_tokens)
88
-
89
- # Apply dropout.
90
- x = self.dropout(x)
91
-
92
- # Pack the sequence into a PackedSequence object to feed to the LSTM.
93
- x = nn.utils.rnn.pack_padded_sequence(x, src_lengths, batch_first=True)
94
-
95
- # Get the output from the LSTM.
96
- _outputs, (final_hidden, _final_cell) = self.lstm(x)
97
-
98
- # Return the Encoder's output. This can be any object and will be
99
- # passed directly to the Decoder.
100
- return {
101
- # this will have shape `(bsz, hidden_dim)`
102
- 'final_hidden': final_hidden.squeeze(0),
103
- }
104
-
105
- # Encoders are required to implement this method so that we can rearrange
106
- # the order of the batch elements during inference (e.g., beam search).
107
- def reorder_encoder_out(self, encoder_out, new_order):
108
- """
109
- Reorder encoder output according to `new_order`.
110
-
111
- Args:
112
- encoder_out: output from the ``forward()`` method
113
- new_order (LongTensor): desired order
114
-
115
- Returns:
116
- `encoder_out` rearranged according to `new_order`
117
- """
118
- final_hidden = encoder_out['final_hidden']
119
- return {
120
- 'final_hidden': final_hidden.index_select(0, new_order),
121
- }
122
-
123
-
124
- Decoder
125
- ~~~~~~~
126
-
127
- Our Decoder will predict the next word, conditioned on the Encoder's final
128
- hidden state and an embedded representation of the previous target word -- which
129
- is sometimes called *teacher forcing*. More specifically, we'll use a
130
- :class:`torch.nn.LSTM` to produce a sequence of hidden states that we'll project
131
- to the size of the output vocabulary to predict each target word.
132
-
133
- ::
134
-
135
- import torch
- import torch.nn as nn
- from fairseq.models import FairseqDecoder
137
-
138
- class SimpleLSTMDecoder(FairseqDecoder):
139
-
140
- def __init__(
141
- self, dictionary, encoder_hidden_dim=128, embed_dim=128, hidden_dim=128,
142
- dropout=0.1,
143
- ):
144
- super().__init__(dictionary)
145
-
146
- # Our decoder will embed the inputs before feeding them to the LSTM.
147
- self.embed_tokens = nn.Embedding(
148
- num_embeddings=len(dictionary),
149
- embedding_dim=embed_dim,
150
- padding_idx=dictionary.pad(),
151
- )
152
- self.dropout = nn.Dropout(p=dropout)
153
-
154
- # We'll use a single-layer, unidirectional LSTM for simplicity.
155
- self.lstm = nn.LSTM(
156
- # For the first layer we'll concatenate the Encoder's final hidden
157
- # state with the embedded target tokens.
158
- input_size=encoder_hidden_dim + embed_dim,
159
- hidden_size=hidden_dim,
160
- num_layers=1,
161
- bidirectional=False,
162
- )
163
-
164
- # Define the output projection.
165
- self.output_projection = nn.Linear(hidden_dim, len(dictionary))
166
-
167
- # During training Decoders are expected to take the entire target sequence
168
- # (shifted right by one position) and produce logits over the vocabulary.
169
- # The *prev_output_tokens* tensor begins with the end-of-sentence symbol,
170
- # ``dictionary.eos()``, followed by the target sequence.
171
- def forward(self, prev_output_tokens, encoder_out):
172
- """
173
- Args:
174
- prev_output_tokens (LongTensor): previous decoder outputs of shape
175
- `(batch, tgt_len)`, for teacher forcing
176
- encoder_out (Tensor, optional): output from the encoder, used for
177
- encoder-side attention
178
-
179
- Returns:
180
- tuple:
181
- - the last decoder layer's output of shape
182
- `(batch, tgt_len, vocab)`
183
- - the last decoder layer's attention weights of shape
184
- `(batch, tgt_len, src_len)`
185
- """
186
- bsz, tgt_len = prev_output_tokens.size()
187
-
188
- # Extract the final hidden state from the Encoder.
189
- final_encoder_hidden = encoder_out['final_hidden']
190
-
191
- # Embed the target sequence, which has been shifted right by one
192
- # position and now starts with the end-of-sentence symbol.
193
- x = self.embed_tokens(prev_output_tokens)
194
-
195
- # Apply dropout.
196
- x = self.dropout(x)
197
-
198
- # Concatenate the Encoder's final hidden state to *every* embedded
199
- # target token.
200
- x = torch.cat(
201
- [x, final_encoder_hidden.unsqueeze(1).expand(bsz, tgt_len, -1)],
202
- dim=2,
203
- )
204
-
205
- # Using PackedSequence objects in the Decoder is harder than in the
206
- # Encoder, since the targets are not sorted in descending length order,
207
- # which is a requirement of ``pack_padded_sequence()``. Instead we'll
208
- # feed nn.LSTM directly.
209
- initial_state = (
210
- final_encoder_hidden.unsqueeze(0), # hidden
211
- torch.zeros_like(final_encoder_hidden).unsqueeze(0), # cell
212
- )
213
- output, _ = self.lstm(
214
- x.transpose(0, 1), # convert to shape `(tgt_len, bsz, dim)`
215
- initial_state,
216
- )
217
- x = output.transpose(0, 1) # convert to shape `(bsz, tgt_len, hidden)`
218
-
219
- # Project the outputs to the size of the vocabulary.
220
- x = self.output_projection(x)
221
-
222
- # Return the logits and ``None`` for the attention weights
223
- return x, None
224
-
225
-
226
- 2. Registering the Model
227
- ------------------------
228
-
229
- Now that we've defined our Encoder and Decoder we must *register* our model with
230
- fairseq using the :func:`~fairseq.models.register_model` function decorator.
231
- Once the model is registered we'll be able to use it with the existing
232
- :ref:`Command-line Tools`.
233
-
234
- All registered models must implement the
235
- :class:`~fairseq.models.BaseFairseqModel` interface. For sequence-to-sequence
236
- models (i.e., any model with a single Encoder and Decoder), we can instead
237
- implement the :class:`~fairseq.models.FairseqEncoderDecoderModel` interface.
238
-
239
- Create a small wrapper class in the same file and register it in fairseq with
240
- the name ``'simple_lstm'``::
241
-
242
- from fairseq.models import FairseqEncoderDecoderModel, register_model
243
-
244
- # Note: the register_model "decorator" should immediately precede the
245
- # definition of the Model class.
246
-
247
- @register_model('simple_lstm')
248
- class SimpleLSTMModel(FairseqEncoderDecoderModel):
249
-
250
- @staticmethod
251
- def add_args(parser):
252
- # Models can override this method to add new command-line arguments.
253
- # Here we'll add some new command-line arguments to configure dropout
254
- # and the dimensionality of the embeddings and hidden states.
255
- parser.add_argument(
256
- '--encoder-embed-dim', type=int, metavar='N',
257
- help='dimensionality of the encoder embeddings',
258
- )
259
- parser.add_argument(
260
- '--encoder-hidden-dim', type=int, metavar='N',
261
- help='dimensionality of the encoder hidden state',
262
- )
263
- parser.add_argument(
264
- '--encoder-dropout', type=float, default=0.1,
265
- help='encoder dropout probability',
266
- )
267
- parser.add_argument(
268
- '--decoder-embed-dim', type=int, metavar='N',
269
- help='dimensionality of the decoder embeddings',
270
- )
271
- parser.add_argument(
272
- '--decoder-hidden-dim', type=int, metavar='N',
273
- help='dimensionality of the decoder hidden state',
274
- )
275
- parser.add_argument(
276
- '--decoder-dropout', type=float, default=0.1,
277
- help='decoder dropout probability',
278
- )
279
-
280
- @classmethod
281
- def build_model(cls, args, task):
282
- # Fairseq initializes models by calling the ``build_model()``
283
- # function. This provides more flexibility, since the returned model
284
- # instance can be of a different type than the one that was called.
285
- # In this case we'll just return a SimpleLSTMModel instance.
286
-
287
- # Initialize our Encoder and Decoder.
288
- encoder = SimpleLSTMEncoder(
289
- args=args,
290
- dictionary=task.source_dictionary,
291
- embed_dim=args.encoder_embed_dim,
292
- hidden_dim=args.encoder_hidden_dim,
293
- dropout=args.encoder_dropout,
294
- )
295
- decoder = SimpleLSTMDecoder(
296
- dictionary=task.target_dictionary,
297
- encoder_hidden_dim=args.encoder_hidden_dim,
298
- embed_dim=args.decoder_embed_dim,
299
- hidden_dim=args.decoder_hidden_dim,
300
- dropout=args.decoder_dropout,
301
- )
302
- model = SimpleLSTMModel(encoder, decoder)
303
-
304
- # Print the model architecture.
305
- print(model)
306
-
307
- return model
308
-
309
- # We could override the ``forward()`` if we wanted more control over how
310
- # the encoder and decoder interact, but it's not necessary for this
311
- # tutorial since we can inherit the default implementation provided by
312
- # the FairseqEncoderDecoderModel base class, which looks like:
313
- #
314
- # def forward(self, src_tokens, src_lengths, prev_output_tokens):
315
- # encoder_out = self.encoder(src_tokens, src_lengths)
316
- # decoder_out = self.decoder(prev_output_tokens, encoder_out)
317
- # return decoder_out
318
-
319
- Finally let's define a *named architecture* with the configuration for our
320
- model. This is done with the :func:`~fairseq.models.register_model_architecture`
321
- function decorator. Thereafter this named architecture can be used with the
322
- ``--arch`` command-line argument, e.g., ``--arch tutorial_simple_lstm``::
323
-
324
- from fairseq.models import register_model_architecture
325
-
326
- # The first argument to ``register_model_architecture()`` should be the name
327
- # of the model we registered above (i.e., 'simple_lstm'). The function we
328
- # register here should take a single argument *args* and modify it in-place
329
- # to match the desired architecture.
330
-
331
- @register_model_architecture('simple_lstm', 'tutorial_simple_lstm')
332
- def tutorial_simple_lstm(args):
333
- # We use ``getattr()`` to prioritize arguments that are explicitly given
334
- # on the command-line, so that the defaults defined below are only used
335
- # when no other value has been specified.
336
- args.encoder_embed_dim = getattr(args, 'encoder_embed_dim', 256)
337
- args.encoder_hidden_dim = getattr(args, 'encoder_hidden_dim', 256)
338
- args.decoder_embed_dim = getattr(args, 'decoder_embed_dim', 256)
339
- args.decoder_hidden_dim = getattr(args, 'decoder_hidden_dim', 256)
340
-
341
-
342
- 3. Training the Model
343
- ---------------------
344
-
345
- Now we're ready to train the model. We can use the existing :ref:`fairseq-train`
346
- command-line tool for this, making sure to specify our new Model architecture
347
- (``--arch tutorial_simple_lstm``).
348
-
349
- .. note::
350
-
351
- Make sure you've already preprocessed the data from the IWSLT example in the
352
- :file:`examples/translation/` directory.
353
-
354
- .. code-block:: console
355
-
356
- > fairseq-train data-bin/iwslt14.tokenized.de-en \
357
- --arch tutorial_simple_lstm \
358
- --encoder-dropout 0.2 --decoder-dropout 0.2 \
359
- --optimizer adam --lr 0.005 --lr-shrink 0.5 \
360
- --max-tokens 12000
361
- (...)
362
- | epoch 052 | loss 4.027 | ppl 16.30 | wps 420805 | ups 39.7 | wpb 9841 | bsz 400 | num_updates 20852 | lr 1.95313e-05 | gnorm 0.218 | clip 0% | oom 0 | wall 529 | train_wall 396
363
- | epoch 052 | valid on 'valid' subset | valid_loss 4.74989 | valid_ppl 26.91 | num_updates 20852 | best 4.74954
364
-
365
- The model files should appear in the :file:`checkpoints/` directory. While this
366
- model architecture is not very good, we can use the :ref:`fairseq-generate` script to
367
- generate translations and compute our BLEU score over the test set:
368
-
369
- .. code-block:: console
370
-
371
- > fairseq-generate data-bin/iwslt14.tokenized.de-en \
372
- --path checkpoints/checkpoint_best.pt \
373
- --beam 5 \
374
- --remove-bpe
375
- (...)
376
- | Translated 6750 sentences (153132 tokens) in 17.3s (389.12 sentences/s, 8827.68 tokens/s)
377
- | Generate test with beam=5: BLEU4 = 8.18, 38.8/12.1/4.7/2.0 (BP=1.000, ratio=1.066, syslen=139865, reflen=131146)
378
-
379
-
380
- 4. Making generation faster
381
- ---------------------------
382
-
383
- While autoregressive generation from sequence-to-sequence models is inherently
384
- slow, our implementation above is especially slow because it recomputes the
385
- entire sequence of Decoder hidden states for every output token (i.e., it is
386
- ``O(n^2)``). We can make this significantly faster by instead caching the
387
- previous hidden states.
388
-
389
- In fairseq this is called :ref:`Incremental decoding`. Incremental decoding is a
390
- special mode at inference time where the Model only receives a single timestep
391
- of input corresponding to the immediately previous output token (for teacher
392
- forcing) and must produce the next output incrementally. Thus the model must
393
- cache any long-term state that is needed about the sequence, e.g., hidden
394
- states, convolutional states, etc.
395
-
396
- To implement incremental decoding we will modify our model to implement the
397
- :class:`~fairseq.models.FairseqIncrementalDecoder` interface. Compared to the
398
- standard :class:`~fairseq.models.FairseqDecoder` interface, the incremental
399
- decoder interface allows ``forward()`` methods to take an extra keyword argument
400
- (*incremental_state*) that can be used to cache state across time-steps.
401
-
402
- Let's replace our ``SimpleLSTMDecoder`` with an incremental one::
403
-
404
- import torch
- import torch.nn as nn
- from fairseq import utils
- from fairseq.models import FairseqIncrementalDecoder
406
-
407
- class SimpleLSTMDecoder(FairseqIncrementalDecoder):
408
-
409
- def __init__(
410
- self, dictionary, encoder_hidden_dim=128, embed_dim=128, hidden_dim=128,
411
- dropout=0.1,
412
- ):
413
- # This remains the same as before.
414
- super().__init__(dictionary)
415
- self.embed_tokens = nn.Embedding(
416
- num_embeddings=len(dictionary),
417
- embedding_dim=embed_dim,
418
- padding_idx=dictionary.pad(),
419
- )
420
- self.dropout = nn.Dropout(p=dropout)
421
- self.lstm = nn.LSTM(
422
- input_size=encoder_hidden_dim + embed_dim,
423
- hidden_size=hidden_dim,
424
- num_layers=1,
425
- bidirectional=False,
426
- )
427
- self.output_projection = nn.Linear(hidden_dim, len(dictionary))
428
-
429
- # We now take an additional kwarg (*incremental_state*) for caching the
430
- # previous hidden and cell states.
431
- def forward(self, prev_output_tokens, encoder_out, incremental_state=None):
432
- if incremental_state is not None:
433
- # If the *incremental_state* argument is not ``None`` then we are
434
- # in incremental inference mode. While *prev_output_tokens* will
435
- # still contain the entire decoded prefix, we will only use the
436
- # last step and assume that the rest of the state is cached.
437
- prev_output_tokens = prev_output_tokens[:, -1:]
438
-
439
- # This remains the same as before.
440
- bsz, tgt_len = prev_output_tokens.size()
441
- final_encoder_hidden = encoder_out['final_hidden']
442
- x = self.embed_tokens(prev_output_tokens)
443
- x = self.dropout(x)
444
- x = torch.cat(
445
- [x, final_encoder_hidden.unsqueeze(1).expand(bsz, tgt_len, -1)],
446
- dim=2,
447
- )
448
-
449
- # We will now check the cache and load the cached previous hidden and
450
- # cell states, if they exist, otherwise we will initialize them to
451
- # zeros (as before). We will use the ``utils.get_incremental_state()``
452
- # and ``utils.set_incremental_state()`` helpers.
453
- initial_state = utils.get_incremental_state(
454
- self, incremental_state, 'prev_state',
455
- )
456
- if initial_state is None:
457
- # first time initialization, same as the original version
458
- initial_state = (
459
- final_encoder_hidden.unsqueeze(0), # hidden
460
- torch.zeros_like(final_encoder_hidden).unsqueeze(0), # cell
461
- )
462
-
463
- # Run one step of our LSTM.
464
- output, latest_state = self.lstm(x.transpose(0, 1), initial_state)
465
-
466
- # Update the cache with the latest hidden and cell states.
467
- utils.set_incremental_state(
468
- self, incremental_state, 'prev_state', latest_state,
469
- )
470
-
471
- # This remains the same as before
472
- x = output.transpose(0, 1)
473
- x = self.output_projection(x)
474
- return x, None
475
-
476
- # The ``FairseqIncrementalDecoder`` interface also requires implementing a
477
- # ``reorder_incremental_state()`` method, which is used during beam search
478
- # to select and reorder the incremental state.
479
- def reorder_incremental_state(self, incremental_state, new_order):
480
- # Load the cached state.
481
- prev_state = utils.get_incremental_state(
482
- self, incremental_state, 'prev_state',
483
- )
484
-
485
- # Reorder batches according to *new_order*.
486
- reordered_state = (
487
- prev_state[0].index_select(1, new_order), # hidden
488
- prev_state[1].index_select(1, new_order), # cell
489
- )
490
-
491
- # Update the cached state.
492
- utils.set_incremental_state(
493
- self, incremental_state, 'prev_state', reordered_state,
494
- )
495
-
496
- Finally, we can rerun generation and observe the speedup:
497
-
498
- .. code-block:: console
499
-
500
- # Before
501
-
502
- > fairseq-generate data-bin/iwslt14.tokenized.de-en \
503
- --path checkpoints/checkpoint_best.pt \
504
- --beam 5 \
505
- --remove-bpe
506
- (...)
507
- | Translated 6750 sentences (153132 tokens) in 17.3s (389.12 sentences/s, 8827.68 tokens/s)
508
- | Generate test with beam=5: BLEU4 = 8.18, 38.8/12.1/4.7/2.0 (BP=1.000, ratio=1.066, syslen=139865, reflen=131146)
509
-
510
- # After
511
-
512
- > fairseq-generate data-bin/iwslt14.tokenized.de-en \
513
- --path checkpoints/checkpoint_best.pt \
514
- --beam 5 \
515
- --remove-bpe
516
- (...)
517
- | Translated 6750 sentences (153132 tokens) in 5.5s (1225.54 sentences/s, 27802.94 tokens/s)
518
- | Generate test with beam=5: BLEU4 = 8.18, 38.8/12.1/4.7/2.0 (BP=1.000, ratio=1.066, syslen=139865, reflen=131146)
 
examples/.gitignore DELETED
@@ -1,2 +0,0 @@
1
- !*/*.sh
2
- !*/*.md
 
 
 
examples/__init__.py DELETED
@@ -1,9 +0,0 @@
1
- # Copyright (c) Facebook, Inc. and its affiliates.
2
- #
3
- # This source code is licensed under the MIT license found in the
4
- # LICENSE file in the root directory of this source tree.
5
-
6
- try:
7
- from fairseq.version import __version__ # noqa
8
- except ImportError:
9
- pass
 
examples/adaptive_span/README.md DELETED
@@ -1,90 +0,0 @@
1
- # Adaptive Span
2
-
3
- Adaptive Span is a novel self-attention mechanism that can learn its optimal
- attention span. This allows us to significantly extend the maximum context size
- used in Transformers, while maintaining control over memory footprint and
- computational time. It uses the Truncated BPTT technique for training,
- as in [transformerXL](https://github.com/pytorch/fairseq/blob/master/examples/truncated_bptt/README.md).
-
- Adaptive Span was introduced in the paper
- [Adaptive Attention Span in Transformers](https://arxiv.org/abs/1905.07799),
- which achieved state-of-the-art language modeling results at the time of publication.
-
- We manage to reproduce their result in fairseq and keep most of the
- [original implementation](https://github.com/facebookresearch/adaptive-span) untouched.
- You can also refer to their sweep file if any combination of hyperparameters is unclear.
16
-
17
- ##### 0. Setup
18
-
19
- First you need to process the Enwik8 dataset. We use the pre-tokenized dataset
- from the [adaptive span paper](https://github.com/facebookresearch/adaptive-span/blob/master/get_data.sh).
- You can download the dataset and then run:
22
- ```bash
23
- fairseq-preprocess --only-source --trainpref ~/data/enwik8/train.txt \
24
- --validpref ~/data/enwik8/valid.txt --testpref ~/data/enwik8/test.txt \
25
- --destdir ~/data/enwik8/data-bin/ --joined-dictionary --workers 20
26
- ```
27
-
28
- ##### 1. Train a Adaptive Span model on Enwik8
29
-
30
- We will train a 12-layer Adaptive Span model following the [hyperparameters
31
- used in the original
32
- paper](https://github.com/facebookresearch/adaptive-span/blob/master/experiments/enwik8.sh).
33
-
34
- The following command assumes 4 GPUs, so that the total batch size is 64
35
- sequences (4 x 16). Training should take 2-3 days on 4 V100 GPUs:
36
- ```bash
37
- CUDA_VISIBLE_DEVICES=0,1,2,3 fairseq-train \
38
- --user-dir examples/adaptive_span \
39
- --data ~/data/enwik8/data-bin/ \
40
- --fp16 --fp16-no-flatten-grads --max-update 600000 \
41
- --task truncated_bptt_lm --tokens-per-sample 512 --arch adaptive_span \
42
- --n-layer 12 --d-model 512 --n-head 8 --d-inner 2048 --dropout 0.3 \
43
- --attn-span 8192 --optimizer adagrad_with_grad_clip --adagrad-clip 0.03 \
44
- --validate-interval-updates 1000 \
45
- --lr-scheduler fixed --warmup-updates 32000 --batch-size-valid 32 \
46
- --lr 0.07 --criterion adaptive_span_loss --batch-size 16 --update-freq 1 \
47
- --seed 2 --log-format json --log-interval 25 --aux-loss-scaler 5e-07
48
- ```
49
- This should land around 1.05 on validation and 1.03 on test. You can lower
- `--aux-loss-scaler` for better performance (a longer span); it gives a ~0.03 bpc
- improvement over the transformerXL baseline here.
- If training on a single GPU, set `--update-freq=4` to accumulate 4x gradients
- and simulate training on 4 GPUs.
- You can also reproduce the transformerXL result on enwik8 using this code base.
- It should land around 1.06 on test, matching the [original paper](https://github.com/kimiyoung/transformer-xl/blob/master/pytorch/run_enwik8_base.sh).
- You can try it with:
57
- ```bash
58
- CUDA_VISIBLE_DEVICES=0,1,2,3 fairseq-train \
59
- --user-dir examples/truncated_bptt \
60
- ~/data/enwik8/data-bin/ \
61
- --task truncated_bptt_lm --fp16 --max-update 400000 \
62
- --tokens-per-sample 512 --arch transformer_xl --n-layer 12 \
63
- --d-model 512 --n-head 8 --d-head 64 --d-inner 2048 --dropout 0.1 \
64
- --dropatt 0.0 --mem-len 512 --optimizer adam --clip-norm 0.25 \
65
- --lr-scheduler cosine --warmup-updates 0 \
66
- --lr 0.0 --lr 0.00025 --batch-size 15 \
67
- --update-freq 1 --seed 2 --log-format json --log-interval 25 \
68
- --fp16
69
- ```
70
-
71
- ##### 2. Evaluate
72
- For Adaptive Span:
73
- ```bash
74
- fairseq-eval-lm ~/data/enwik8/data-bin/ --path model/checkpoint_best.pt \
75
- --user-dir examples/adaptive_span \
76
- --task truncated_bptt_lm --batch-size 8 --tokens-per-sample 512 --gen-subset test
77
- ```
78
- For Transformer-XL evaluation:
79
- ```bash
80
- fairseq-eval-lm ~/data/enwik8/data-bin/ --path model/checkpoint_best.pt \
81
- --user-dir examples/truncated_bptt/ --task truncated_bptt_lm --batch-size 8 \
82
- --tokens-per-sample 80 \
83
- --model-overrides '{"mem_len":2100,"clamp_len":820,"same_length":True}' \
84
- --gen-subset valid
85
- ```
86
-
87
- *Note:* During training the model saw 512 tokens of context
88
- (``--tokens-per-sample=512``), with batch size 8. These settings match the evaluation
89
- settings from [the original
90
- paper](https://github.com/facebookresearch/adaptive-span/blob/master/experiments/enwik8.sh).
 
examples/adaptive_span/__init__.py DELETED
@@ -1,19 +0,0 @@
1
- # Copyright (c) Facebook, Inc. and its affiliates.
2
- #
3
- # This source code is licensed under the MIT license found in the
4
- # LICENSE file in the root directory of this source tree.
5
-
6
- import importlib
7
- import os
8
-
9
- # automatically import any Python files in the current directory
10
- cur_dir = os.path.dirname(__file__)
11
- for file in os.listdir(cur_dir):
12
- path = os.path.join(cur_dir, file)
13
- if (
14
- not file.startswith("_")
15
- and not file.startswith(".")
16
- and (file.endswith(".py") or os.path.isdir(path))
17
- ):
18
- mod_name = file[: file.find(".py")] if file.endswith(".py") else file
19
- module = importlib.import_module(__name__ + "." + mod_name)
 
examples/adaptive_span/adagrad_with_grad_clip.py DELETED
@@ -1,128 +0,0 @@
1
- # Copyright (c) Facebook, Inc. and its affiliates.
2
- #
3
- # This source code is licensed under the MIT license found in the
4
- # LICENSE file in the root directory of this source tree.
5
-
6
- from torch.optim import Adagrad
7
-
8
- from fairseq.optim import LegacyFairseqOptimizer, register_optimizer
9
-
10
-
11
- @register_optimizer("adagrad_with_grad_clip")
12
- class FairseqAdagradWithGradClip(LegacyFairseqOptimizer):
13
- def __init__(self, args, params):
14
- super().__init__(args)
15
- self._optimizer = AdagradWithGradClip(params, **self.optimizer_config)
16
-
17
- @staticmethod
18
- def add_args(parser):
19
- """Add optimizer-specific arguments to the parser."""
20
- # fmt: off
21
- parser.add_argument('--weight-decay', '--wd', default=0.0, type=float, metavar='WD',
22
- help='weight decay')
23
- parser.add_argument('--adagrad-clip', default=0.0, type=float, metavar='D',
24
- help='internal grad clip')
25
- # fmt: on
26
-
27
- @property
28
- def optimizer_config(self):
29
- """
30
- Return a kwarg dictionary that will be used to override optimizer
31
- args stored in checkpoints. This allows us to load a checkpoint and
32
- resume training using a different set of optimizer args, e.g., with a
33
- different learning rate.
34
- """
35
- return {
36
- "lr": self.args.lr[0],
37
- "weight_decay": self.args.weight_decay,
38
- "grad_clip": self.args.adagrad_clip,
39
- }
40
-
41
- @property
42
- def supports_flat_params(self):
43
- return False
44
-
45
-
46
- def _clip_grad(clr, grad, group_grad_clip):
47
- if group_grad_clip > 0:
48
- norm = grad.norm(2).item()
49
- if norm > group_grad_clip:
50
- clr *= group_grad_clip / (norm + 1e-10)
51
- return clr
52
-
53
-
54
- class AdagradWithGradClip(Adagrad):
55
- """Adagrad algorithm with custom gradient clipping"""
56
-
57
- def __init__(
58
- self,
59
- params,
60
- lr=1e-2,
61
- lr_decay=0,
62
- weight_decay=0,
63
- initial_accumulator_value=0,
64
- grad_clip=0,
65
- ):
66
- Adagrad.__init__(
67
- self,
68
- params,
69
- lr=lr,
70
- lr_decay=lr_decay,
71
- weight_decay=weight_decay,
72
- initial_accumulator_value=initial_accumulator_value,
73
- )
74
- self.defaults["grad_clip"] = grad_clip
75
- self.param_groups[0].setdefault("grad_clip", grad_clip)
76
-
77
- def step(self, closure=None):
78
- loss = None
79
- if closure is not None:
80
- loss = closure()
81
-
82
- for group in self.param_groups:
83
- for p in group["params"]:
84
- if p.grad is None:
85
- continue
86
-
87
- grad = p.grad.data
88
- state = self.state[p]
89
-
90
- state["step"] += 1
91
-
92
- if group["weight_decay"] != 0:
93
- if p.grad.data.is_sparse:
94
- raise RuntimeError(
95
- "weight_decay option is "
96
- "not compatible with sparse "
97
- "gradients"
98
- )
99
- grad = grad.add(group["weight_decay"], p.data)
100
-
101
- clr = group["lr"] / (1 + (state["step"] - 1) * group["lr_decay"])
102
-
103
- # clip
104
- clr = _clip_grad(clr=clr, grad=grad, group_grad_clip=group["grad_clip"])
105
-
106
- if grad.is_sparse:
107
- # the update is non-linear so indices must be unique
108
- grad = grad.coalesce()
109
- grad_indices = grad._indices()
110
- grad_values = grad._values()
111
- size = grad.size()
112
-
113
- def make_sparse(values):
114
- constructor = grad.new
115
- if grad_indices.dim() == 0 or values.dim() == 0:
116
- return constructor().resize_as_(grad)
117
- return constructor(grad_indices, values, size)
118
-
119
- state["sum"].add_(make_sparse(grad_values.pow(2)))
120
- std = state["sum"]._sparse_mask(grad)
121
- std_values = std._values().sqrt_().add_(1e-10)
122
- p.data.add_(-clr, make_sparse(grad_values / std_values))
123
- else:
124
- state["sum"].addcmul_(1, grad, grad)
125
- std = state["sum"].sqrt().add_(1e-10)
126
- p.data.addcdiv_(-clr, grad, std)
127
-
128
- return loss
 
examples/adaptive_span/adaptive_span_attention.py DELETED
@@ -1,160 +0,0 @@
1
- # Copyright (c) Facebook, Inc. and its affiliates.
2
- #
3
- # This source code is licensed under the MIT license found in the
4
- # LICENSE file in the root directory of this source tree.
5
- import math
6
-
7
- import torch
8
- import torch.nn as nn
9
- import torch.nn.functional as F
10
-
11
-
12
- class AdaptiveMask(nn.Module):
13
- """Soft masking function for adaptive size.
14
- It masks out the last K values of an input. The masking value
15
- goes from 1 to 0 gradually, so K can be learned with
16
- back-propagation.
17
- Args:
18
- max_size: maximum size (i.e. input dimension)
19
- ramp_size: size of the ramp going from 0 to 1
20
- init_val: initial size proportion not to be masked out
21
- shape: learn multiple sizes independent of each other
22
- """
23
-
24
- def __init__(self, max_size, ramp_size, init_val=0, shape=(1,)):
25
- nn.Module.__init__(self)
26
- self._max_size = max_size
27
- self._ramp_size = ramp_size
28
- self.current_val = nn.Parameter(torch.zeros(*shape) + init_val)
29
- mask_template = torch.linspace(1 - max_size, 0, steps=max_size)
30
- self.register_buffer("mask_template", mask_template)
31
-
32
- def forward(self, x):
33
- mask = self.mask_template.float() + self.current_val.float() * self._max_size
34
- mask = mask / self._ramp_size + 1
35
- mask = mask.clamp(0, 1)
36
- if x.size(-1) < self._max_size:
37
- # the input could have been trimmed beforehand to save computation
38
- mask = mask.narrow(-1, self._max_size - x.size(-1), x.size(-1))
39
- x = (x * mask).type_as(x)
40
- return x
41
-
42
- def get_current_max_size(self, include_ramp=True):
43
- current_size = math.ceil(self.current_val.max().item() * self._max_size)
44
- if include_ramp:
45
- current_size += self._ramp_size
46
- current_size = max(0, min(self._max_size, current_size))
47
- return current_size
48
-
49
- def get_current_avg_size(self, include_ramp=True):
50
- current_size = math.ceil(
51
- self.current_val.float().mean().item() * self._max_size
52
- )
53
- if include_ramp:
54
- current_size += self._ramp_size
55
- current_size = max(0, min(self._max_size, current_size))
56
- return current_size
57
-
58
- def clamp_param(self):
59
- """this need to be called after each update"""
60
- self.current_val.data.clamp_(0, 1)
61
-
62
-
63
- class AdaptiveSpan(nn.Module):
64
- """Adaptive attention span for Transformerself.
65
- This module learns an attention span length from data for each
66
- self-attention head.
67
- Args:
68
- attn_span: maximum attention span
69
- adapt_span_loss: loss coefficient for the span length
70
- adapt_span_ramp: length of the masking ramp
71
- adapt_span_init: initial size ratio
72
- adapt_span_cache: adapt cache size to reduce memory usage
73
- """
74
-
75
- def __init__(
76
- self,
77
- attn_span,
78
- adapt_span_ramp,
79
- adapt_span_init,
80
- n_head,
81
- adapt_span_layer,
82
- **kargs
83
- ):
84
- nn.Module.__init__(self)
85
- self._max_span = attn_span
86
- self._n_head = n_head
87
- self._adapt_span_layer = adapt_span_layer
88
- if self._adapt_span_layer:
89
- self._mask = AdaptiveMask(
90
- max_size=self._max_span,
91
- ramp_size=adapt_span_ramp,
92
- init_val=adapt_span_init,
93
- )
94
- else:
95
- self._mask = AdaptiveMask(
96
- max_size=self._max_span,
97
- ramp_size=adapt_span_ramp,
98
- init_val=adapt_span_init,
99
- shape=(n_head, 1, 1),
100
- )
101
-
102
- def forward(self, attn, normalize=True):
103
- """mask attention with the right span"""
104
- # batch and head dimensions are merged together, so separate them first
105
- self.clamp_param()
106
- if self._adapt_span_layer:
107
- attn = self._mask(attn)
108
- else:
109
- B = attn.size(0) # batch size
110
- M = attn.size(1) # block size
111
- attn = attn.reshape(B // self._n_head, self._n_head, M, -1)
112
- attn = self._mask(attn)
113
- attn = attn.view(B, M, -1)
114
- return attn
115
-
116
- def get_trim_len(self):
117
- """how much of memory can be trimmed to reduce computation"""
118
- L = self._max_span
119
- trim_len = min(L - 1, L - self._mask.get_current_max_size())
120
- # too fine granularity might be bad for the memory management
121
- trim_len = math.floor(trim_len / 64) * 64
122
- return trim_len
123
-
124
- def trim_memory(self, query, key, value, key_pe):
125
- """trim out unnecessary memory beforehand to reduce computation"""
126
- trim_len = self.get_trim_len()
127
- cache_size = key.size(1) - query.size(1)
128
- trim_len_cache = trim_len - (self._max_span - cache_size)
129
- if trim_len_cache > 0:
130
- key = key[:, trim_len_cache:, :]
131
- value = value[:, trim_len_cache:, :]
132
- elif trim_len_cache < 0:
133
- # cache is too short! this happens when validation resumes
134
- # after a lot of updates.
135
- key = F.pad(key, [0, 0, -trim_len_cache, 0])
136
- value = F.pad(value, [0, 0, -trim_len_cache, 0])
137
- if trim_len > 0:
138
- if key_pe is not None:
139
- key_pe = key_pe[:, :, trim_len:]
140
- return key, value, key_pe
141
-
142
- def get_cache_size(self):
143
- """determine how long the cache should be"""
144
- trim_len = self.get_trim_len()
145
- # give a buffer of 64 steps since a span might increase
146
- # in future updates
147
- return min(self._max_span, self._max_span - trim_len + 64)
148
-
149
- def get_loss(self):
150
- """a loss term for regularizing the span length"""
151
- return self._max_span * self._mask.current_val.float().mean()
152
-
153
- def get_current_max_span(self):
154
- return self._mask.get_current_max_size()
155
-
156
- def get_current_avg_span(self):
157
- return self._mask.get_current_avg_size()
158
-
159
- def clamp_param(self):
160
- self._mask.clamp_param()
 
examples/adaptive_span/adaptive_span_loss.py DELETED
@@ -1,106 +0,0 @@
1
- # Copyright (c) Facebook, Inc. and its affiliates.
2
- #
3
- # This source code is licensed under the MIT license found in the
4
- # LICENSE file in the root directory of this source tree.
5
-
6
- import math
7
- from dataclasses import dataclass
8
-
9
- import torch.nn.functional as F
10
- from fairseq import metrics, utils
11
- from fairseq.criterions import register_criterion
12
- from fairseq.criterions.cross_entropy import CrossEntropyCriterion
13
- from fairseq.dataclass import FairseqDataclass
14
- from omegaconf import II
15
-
16
-
17
- @dataclass
18
- class AdaptiveSpanCriterionConfig(FairseqDataclass):
19
- sentence_avg: bool = II("optimization.sentence_avg")
20
-
21
-
22
- @register_criterion("adaptive_span_loss", dataclass=AdaptiveSpanCriterionConfig)
23
- class AdaptiveSpanCriterion(CrossEntropyCriterion):
24
- def __init__(self, task, sentence_avg):
25
- super().__init__(task, sentence_avg)
26
-
27
- def forward(self, model, sample, reduce=True):
28
- """Compute the loss for the given sample.
29
-
30
- Returns a tuple with three elements:
31
- 1) the loss here is summed, different from the adaptive span code
32
- 2) the sample size, which is used as the denominator for the gradient
33
- 3) logging outputs to display while training
34
- """
35
- net_output = model(**sample["net_input"])
36
- loss, aux_loss, avg_span, max_span = self.compute_loss(
37
- model, net_output, sample, reduce=reduce
38
- )
39
- sample_size = (
40
- sample["target"].size(0) if self.sentence_avg else sample["ntokens"]
41
- )
42
- loss /= sample_size
43
- total_loss = loss + aux_loss
44
- sample_size = 1
45
-
46
- logging_output = {
47
- "loss": loss.data,
48
- "ntokens": sample["ntokens"],
49
- "nsentences": sample["target"].size(0),
50
- "sample_size": sample_size,
51
- "total_loss": total_loss.data,
52
- "avg_span": avg_span * sample_size,
53
- "max_span": max_span * sample_size,
54
- }
55
- return total_loss, sample_size, logging_output
56
-
57
- def compute_loss(self, model, net_output, sample, reduce=True):
58
- loss, _ = super().compute_loss(model, net_output, sample, reduce)
59
- aux_loss = model.get_aux_loss()
60
- avg_span = model.get_current_avg_span()
61
- max_span = model.get_current_max_span()
62
- return loss, aux_loss, avg_span, max_span
63
-
64
- @staticmethod
65
- def reduce_metrics(logging_outputs) -> None:
66
- """Aggregate logging outputs from data parallel training."""
67
- loss_sum = sum(log.get("loss", 0) for log in logging_outputs)
68
- ntokens = sum(log.get("ntokens", 0) for log in logging_outputs)
69
- sample_size = sum(log.get("sample_size", 0) for log in logging_outputs)
70
- total_loss_sum = sum(log.get("total_loss", 0) for log in logging_outputs)
71
- avg_span_sum = sum(log.get("avg_span", 0) for log in logging_outputs)
72
- max_span_sum = sum(log.get("max_span", 0) for log in logging_outputs)
73
-
74
- # we divide by log(2) to convert the loss from base e to base 2
75
- metrics.log_scalar(
76
- "loss", loss_sum / sample_size / math.log(2), sample_size, round=3
77
- )
78
- metrics.log_scalar("avg_span", avg_span_sum / sample_size, sample_size, round=3)
79
- metrics.log_scalar("max_span", max_span_sum / sample_size, sample_size, round=3)
80
- # total loss contains the L1 norm on adaptive-span
81
- metrics.log_scalar(
82
- "total_loss",
83
- total_loss_sum / sample_size / math.log(2),
84
- sample_size,
85
- round=3,
86
- )
87
- if sample_size != ntokens:
88
- metrics.log_scalar(
89
- "nll_loss", loss_sum / ntokens / math.log(2), ntokens, round=3
90
- )
91
- metrics.log_derived(
92
- "ppl", lambda meters: utils.get_perplexity(meters["nll_loss"].avg)
93
- )
94
- else:
95
- metrics.log_derived(
96
- "ppl", lambda meters: utils.get_perplexity(meters["loss"].avg)
97
- )
98
-
99
- @staticmethod
100
- def logging_outputs_can_be_summed() -> bool:
101
- """
102
- Whether the logging outputs returned by `forward` can be summed
103
- across workers prior to calling `reduce_metrics`. Setting this
104
-         to True will improve distributed training speed.
105
- """
106
- return True
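
As a quick numeric sketch of the base conversion done in `reduce_metrics` above: the values below are invented, and the last line assumes fairseq's `utils.get_perplexity` exponentiates the base-2 loss (i.e. `2 ** avg`).

```python
import math

# Invented aggregated values, for illustration only.
loss_sum = 693.1   # summed cross-entropy in nats (natural-log base)
ntokens = 1000

# reduce_metrics divides by math.log(2) to report the loss in bits per token.
nll_loss_bits = loss_sum / ntokens / math.log(2)

# Perplexity is then 2 ** (loss in bits), which equals e ** (loss in nats).
ppl = 2 ** nll_loss_bits

print(round(nll_loss_bits, 3), round(ppl, 3))  # 1.0 2.0
```
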
 
examples/adaptive_span/adaptive_span_model.py DELETED
@@ -1,263 +0,0 @@
1
- # Copyright (c) Facebook, Inc. and its affiliates.
2
- # All rights reserved.
3
- #
4
- # This source code is licensed under the license found in the
5
- # LICENSE file in the root directory of this source tree.
6
-
7
- import math
8
-
9
- import torch
10
- import torch.nn as nn
11
- import torch.nn.functional as F
12
-
13
- from fairseq.modules.layer_norm import LayerNorm
14
-
15
- from .adaptive_span_attention import AdaptiveSpan
16
-
17
- # Size notations:
18
- # B = batch_size, H = d_model, M = block_size, L = attn_span
19
-
20
-
21
- def _skew(X, pad_value):
22
- """shift every row 1 step to right"""
23
- # X = B x M x L
24
- B, M, L = X.size()
25
- X = F.pad(X, (0, M + 1), value=pad_value) # B x M x (L+M+1)
26
- X = X.view(B, -1) # B x ML+MM+M
27
- X = X[:, :-M] # B x ML+MM
28
- X = X.view(B, M, M + L) # B x M x L+M
29
- return X
30
-
31
-
32
- def _unskew(X):
33
- """reverse _skew operation"""
34
- # X = B x M x L+M
35
- B, M, L = X.size()
36
- L -= M
37
- X = X.view(B, -1) # B x ML+MM
38
- X = F.pad(X, (0, M)) # B x ML+MM+M
39
- X = X.view(B, M, M + L + 1) # B x M x L+M+1
40
- X = X[:, :, :L] # B x M x L
41
- return X
42
-
43
-
44
- class SeqAttention(nn.Module):
45
- """Sequential self-attention layer.
46
- Each token will attend to its previous fixed number of steps.
47
- Note that attention doesn't include the current step itself.
48
- """
49
-
50
- def __init__(self, d_model, n_head, attn_span, dropout, adapt_span_layer, **kargs):
51
- nn.Module.__init__(self)
52
- self.dropout = nn.Dropout(dropout)
53
- self.d_model = d_model # size of a single head
54
- self.attn_span = attn_span
55
- self.adaptive_span = AdaptiveSpan(
56
- attn_span=attn_span,
57
- n_head=n_head,
58
- adapt_span_layer=adapt_span_layer,
59
- **kargs
60
- )
61
-
62
- def forward(self, query, key, value, key_pe):
63
- # query size = B x M x H
64
- # key, value sizes = B x (M+L) x H
65
-
66
- key, value, key_pe = self.adaptive_span.trim_memory(query, key, value, key_pe)
67
-
68
- # compute attention from context
69
- # B x M (dest) x (M+L) (src)
70
- attn_cont = torch.matmul(query, key.transpose(-1, -2))
71
- attn_cont = _unskew(attn_cont) # B x M x L
72
-
73
- # compute the effect of position embedding
74
- attn_pos = torch.matmul(query, key_pe) # B x M x L_pos
75
- attn = attn_cont + attn_pos
76
-
77
- attn = attn / math.sqrt(self.d_model) # B x M X L_pos
78
-
79
- attn = F.softmax(attn.float(), dim=-1).type_as(attn)
80
-
81
- # trim attention lengths according to the learned span
82
- attn = self.adaptive_span(attn)
83
-
84
- attn = self.dropout(attn) # B x M X L_pos
85
-
86
- attn_cont = _skew(attn, 0) # B x M X (L+M)
87
- out = torch.matmul(attn_cont, value) # B x M x H
88
- return out
89
-
90
- def get_cache_size(self):
91
- return self.adaptive_span.get_cache_size()
92
-
93
-
94
- class MultiHeadSeqAttention(nn.Module):
95
- def __init__(self, d_model, n_head, **kargs):
96
- nn.Module.__init__(self)
97
- assert d_model % n_head == 0
98
- self.n_head = n_head
99
- self.head_dim = d_model // n_head
100
- self.attn = SeqAttention(d_model=self.head_dim, n_head=n_head, **kargs)
101
- self.proj_query = nn.Linear(d_model, d_model, bias=False)
102
- nn.init.xavier_normal_(self.proj_query.weight)
103
- self.proj_out = nn.Linear(d_model, d_model, bias=False)
104
- nn.init.xavier_normal_(self.proj_out.weight)
105
- self.proj_val = nn.Linear(d_model, d_model, bias=False)
106
- nn.init.xavier_normal_(self.proj_val.weight)
107
- self.proj_key = nn.Linear(d_model, d_model, bias=False)
108
- nn.init.xavier_normal_(self.proj_key.weight)
109
-
110
- def head_reshape(self, x):
111
- K = self.n_head
112
- D = self.head_dim
113
- x = x.view(x.size()[:-1] + (K, D)) # B x (M+L) x K x D
114
- x = x.transpose(1, 2).contiguous() # B x K x (M+L) x D
115
- x = x.view(-1, x.size(-2), x.size(-1)) # B_K x (M+L) x D
116
- return x
117
-
118
- def forward(self, query, key, value, key_pe):
119
- B = query.size(0)
120
- K = self.n_head
121
- D = self.head_dim
122
- M = query.size(1)
123
-
124
- query = self.proj_query(query)
125
- query = self.head_reshape(query)
126
- value = self.proj_val(value)
127
- value = self.head_reshape(value)
128
- key = self.proj_key(key)
129
- key = self.head_reshape(key)
130
-
131
- out = self.attn(query, key, value, key_pe) # B_K x M x D
132
- out = out.view(B, K, M, D) # B x K x M x D
133
- out = out.transpose(1, 2).contiguous() # B x M x K x D
134
- out = out.view(B, M, -1) # B x M x K_D
135
- out = self.proj_out(out)
136
- return out
137
-
138
-
139
- class FeedForwardLayer(nn.Module):
140
- def __init__(self, d_model, d_inner, dropout, **kargs):
141
- nn.Module.__init__(self)
142
- self.fc1 = nn.Linear(d_model, d_inner)
143
- self.fc2 = nn.Linear(d_inner, d_model)
144
- nn.init.xavier_uniform_(self.fc1.weight)
145
- nn.init.xavier_uniform_(self.fc2.weight)
146
- self.dropout = nn.Dropout(dropout)
147
-
148
- def forward(self, h):
149
- h1 = F.relu(self.fc1(h))
150
- h1 = self.dropout(h1)
151
- h2 = self.fc2(h1)
152
- return h2
153
-
154
-
155
- class TransformerSeqLayer(nn.Module):
156
- def __init__(self, d_model, **kargs):
157
- nn.Module.__init__(self)
158
- self.attn = MultiHeadSeqAttention(d_model=d_model, **kargs)
159
- self.norm1 = LayerNorm(d_model)
160
- self.ff = FeedForwardLayer(d_model=d_model, **kargs)
161
- self.norm2 = LayerNorm(d_model)
162
-
163
- def forward(self, h, h_cache, key_pe):
164
- # h = B x M x H
165
- # h_cache = B x L x H
166
- h_all = torch.cat([h_cache, h], dim=1) # B x (M+L) x H
167
- attn_out = self.attn(h, h_all, h_all, key_pe)
168
- h = self.norm1(h + attn_out) # B x M x H
169
- if self.ff is not None:
170
- ff_out = self.ff(h)
171
- out = self.norm2(h + ff_out) # B x M x H
172
- else:
173
- out = h
174
- return out
175
-
176
- def get_cache_size(self):
177
- return self.attn.attn.get_cache_size()
178
-
179
-
180
- class TransformerSeq(nn.Module):
181
- def __init__(
182
- self,
183
- vocab_size,
184
- d_model,
185
- n_head,
186
- n_layer,
187
- attn_span,
188
- emb_dropout,
189
- aux_loss_scaler,
190
- adapt_span_layer,
191
- **kargs
192
- ):
193
- nn.Module.__init__(self)
194
- # token embeddings
195
- self.in_emb = nn.Embedding(vocab_size, d_model)
196
- nn.init.normal_(self.in_emb.weight, mean=0, std=d_model ** -0.5)
197
- self.out_emb = nn.Linear(d_model, vocab_size)
198
- self.aux_loss_scaler = aux_loss_scaler
199
- if emb_dropout > 0:
200
- self.emb_dropout = nn.Dropout(emb_dropout)
201
- else:
202
- self.emb_dropout = None
203
- # position embeddings
204
- self.key_pe = nn.Parameter(torch.randn(1, d_model // n_head, attn_span))
205
-
206
- self.layers = nn.ModuleList()
207
- self.layers.extend(
208
- TransformerSeqLayer(
209
- d_model=d_model,
210
- n_head=n_head,
211
- attn_span=attn_span,
212
- adapt_span_layer=adapt_span_layer,
213
- **kargs
214
- )
215
- for _ in range(n_layer)
216
- )
217
-
218
- def forward(self, x, h_cache, target=None):
219
- # x size = B x M
220
- block_size = x.size(1)
221
- h = self.in_emb(x) # B x M x H
222
- if self.emb_dropout is not None:
223
- h = self.emb_dropout(h)
224
-
225
- h_cache_next = []
226
- for l, layer in enumerate(self.layers):
227
- cache_size = layer.attn.attn.get_cache_size()
228
- if cache_size > block_size:
229
- h_cache_next_l = torch.cat(
230
- [h_cache[l][:, -cache_size + block_size :, :], h], dim=1
231
- ).detach()
232
- else:
233
- h_cache_next_l = h[:, -cache_size:, :].detach()
234
- h_cache_next.append(h_cache_next_l)
235
- h = layer(h, h_cache[l], self.key_pe) # B x M x H
236
-
237
- if self.emb_dropout is not None:
238
- h = self.emb_dropout(h)
239
-
240
- out = F.log_softmax(self.out_emb(h).float(), dim=-1).type_as(h)
241
- dummy_loss = None
242
-
243
- return out, h_cache_next, dummy_loss
244
-
245
- def get_aux_loss(self):
246
- loss = 0.0
247
- for layer in self.layers:
248
- loss += layer.attn.attn.adaptive_span.get_loss()
249
- return self.aux_loss_scaler * loss
250
-
251
- def get_current_max_span(self):
252
- max_span = 0.0
253
- for layer in self.layers:
254
- max_span = max(
255
- max_span, layer.attn.attn.adaptive_span.get_current_max_span()
256
- )
257
- return max_span
258
-
259
- def get_current_avg_span(self):
260
- avg_span = 0.0
261
- for layer in self.layers:
262
- avg_span += layer.attn.attn.adaptive_span.get_current_avg_span()
263
- return avg_span / len(self.layers)
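
To make the `_skew`/`_unskew` helpers near the top of this file easier to follow, here is a small self-contained demo on a toy 1 x 2 x 3 tensor; the logic mirrors the deleted functions above and the values are arbitrary.

```python
import torch
import torch.nn.functional as F


def _skew(X, pad_value):
    """Shift the i-th row of each B x M x L block i steps to the right."""
    B, M, L = X.size()
    X = F.pad(X, (0, M + 1), value=pad_value)  # B x M x (L+M+1)
    X = X.view(B, -1)                          # B x (ML+MM+M)
    X = X[:, :-M]                              # B x (ML+MM)
    return X.view(B, M, M + L)                 # B x M x (L+M)


def _unskew(X):
    """Reverse the _skew operation."""
    B, M, L = X.size()
    L -= M
    X = X.view(B, -1)                          # B x (ML+MM)
    X = F.pad(X, (0, M))                       # B x (ML+MM+M)
    X = X.view(B, M, M + L + 1)                # B x M x (L+M+1)
    return X[:, :, :L]                         # B x M x L


X = torch.arange(6.0).view(1, 2, 3)  # B=1, M=2, L=3
skewed = _skew(X, pad_value=0)
print(skewed)
# tensor([[[0., 1., 2., 0., 0.],
#          [0., 3., 4., 5., 0.]]])
assert torch.equal(_unskew(skewed), X)  # the two operations are inverses
```
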
 
examples/adaptive_span/adaptive_span_model_wrapper.py DELETED
@@ -1,145 +0,0 @@
1
- # Copyright (c) Facebook, Inc. and its affiliates.
2
- #
3
- # This source code is licensed under the MIT license found in the
4
- # LICENSE file in the root directory of this source tree.
5
-
6
- import logging
7
- from dataclasses import dataclass
8
- from typing import Dict, List, Optional
9
-
10
- import torch
11
- from fairseq.dataclass import FairseqDataclass
12
- from fairseq.models import (
13
- FairseqIncrementalDecoder,
14
- FairseqLanguageModel,
15
- register_model,
16
- )
17
- from .adaptive_span_model import TransformerSeq as AdaptiveSpanTransformerModel
18
-
19
-
20
- logger = logging.getLogger(__name__)
21
-
22
-
23
- @dataclass
24
- class AdaptiveSpanSmallConfig(FairseqDataclass):
25
- # defaults come from https://github.com/facebookresearch/adaptive-span/blob/master/experiments/enwik8_small.sh
26
- vocab_size: int = 50
27
- d_model: int = 256
28
- n_head: int = 4
29
- d_inner: int = 1024
30
- n_layer: int = 8
31
- attn_span: int = 1024
32
- dropout: float = 0.0
33
- emb_dropout: float = 0.0
34
- adapt_span_ramp: int = 32
35
- adapt_span_init: float = 0.0
36
- aux_loss_scaler: float = 0.000002
37
- adapt_span_layer: bool = False
38
-
39
-
40
- @register_model("adaptive_span", dataclass=AdaptiveSpanSmallConfig)
41
- class AdaptiveSpanTransformer(FairseqLanguageModel):
42
- @classmethod
43
- def build_model(cls, cfg: AdaptiveSpanSmallConfig, task):
44
- return cls(AdaptiveSpanDecoder(cfg, task))
45
-
46
- def get_aux_loss(self):
47
- return self.decoder.get_aux_loss()
48
-
49
- def get_current_max_span(self):
50
- return self.decoder.get_current_max_span()
51
-
52
- def get_current_avg_span(self):
53
- return self.decoder.get_current_avg_span()
54
-
55
-
56
- class AdaptiveSpanDecoder(FairseqIncrementalDecoder):
57
- def __init__(self, cfg, task):
58
-
59
- super().__init__(task.target_dictionary)
60
-
61
- self.config = cfg
62
- config = AdaptiveSpanSmallConfig(
63
- vocab_size=len(task.target_dictionary),
64
- d_model=cfg.d_model,
65
- n_head=cfg.n_head,
66
- d_inner=cfg.d_inner,
67
- n_layer=cfg.n_layer,
68
- attn_span=cfg.attn_span,
69
- dropout=cfg.dropout,
70
- emb_dropout=cfg.emb_dropout,
71
- adapt_span_ramp=cfg.adapt_span_ramp,
72
- adapt_span_init=cfg.adapt_span_init,
73
- aux_loss_scaler=cfg.aux_loss_scaler,
74
- adapt_span_layer=cfg.adapt_span_layer,
75
- )
76
- logger.info(config)
77
- self.model = AdaptiveSpanTransformerModel(**config.__dict__)
78
-
79
- self._mems = None
80
-
81
- def forward(
82
- self,
83
- src_tokens,
84
- incremental_state: Optional[Dict[str, List[torch.Tensor]]] = None,
85
- encoder_out=None,
86
- ):
87
- bsz = src_tokens.size(0)
88
- if incremental_state is not None: # used during inference
89
- mems = self.get_incremental_state("mems")
90
- src_tokens = src_tokens[:, -1:] # only keep the most recent token
91
- else:
92
- mems = self._mems
93
-
94
- if mems is None:
95
- # first time init
96
- mems = self.init_hid_cache(bsz)
97
- output = self.model(x=src_tokens, h_cache=mems,)
98
- if incremental_state is not None:
99
- self.set_incremental_state(incremental_state, "mems", output[1])
100
- else:
101
- self._mems = output[1]
102
- return (output[0],)
103
-
104
- def max_positions(self):
105
- return self.config.attn_span
106
-
107
- def init_hid_cache(self, batch_sz):
108
- hid = []
109
- for layer in self.model.layers:
110
- param = next(self.model.parameters())
111
- h = torch.zeros(
112
- batch_sz,
113
- layer.get_cache_size(),
114
- self.config.d_model,
115
- dtype=param.dtype,
116
- device=param.device,
117
- )
118
- hid.append(h)
119
- return hid
120
-
121
- def get_aux_loss(self):
122
- return self.model.get_aux_loss()
123
-
124
- def get_current_max_span(self):
125
- return self.model.get_current_max_span()
126
-
127
- def get_current_avg_span(self):
128
- return self.model.get_current_avg_span()
129
-
130
- def reorder_incremental_state(
131
- self,
132
- incremental_state: Dict[str, Dict[str, Optional[torch.Tensor]]],
133
- new_order: torch.Tensor,
134
- ):
135
- """Reorder incremental state.
136
-
137
- This will be called when the order of the input has changed from the
138
- previous time step. A typical use case is beam search, where the input
139
- order changes between time steps based on the selection of beams.
140
- """
141
- raise NotImplementedError("This is required for generation/beam search")
142
- # mems = self.get_incremental_state(incremental_state, "mems")
143
- # if mems is not None:
144
- # new_mems = [mems_i.index_select(1, new_order) for mems_i in mems]
145
- # self.set_incremental_state(incremental_state, "mems", new_mems)
 
examples/adaptive_span/truncated_bptt_lm_task.py DELETED
@@ -1 +0,0 @@
1
- ../truncated_bptt/truncated_bptt_lm_task.py
 
 
examples/backtranslation/README.md DELETED
@@ -1,297 +0,0 @@
1
- # Understanding Back-Translation at Scale (Edunov et al., 2018)
2
-
3
- This page includes pre-trained models from the paper [Understanding Back-Translation at Scale (Edunov et al., 2018)](https://arxiv.org/abs/1808.09381).
4
-
5
- ## Pre-trained models
6
-
7
- Model | Description | Dataset | Download
8
- ---|---|---|---
9
- `transformer.wmt18.en-de` | Transformer <br> ([Edunov et al., 2018](https://arxiv.org/abs/1808.09381)) <br> WMT'18 winner | [WMT'18 English-German](http://www.statmt.org/wmt18/translation-task.html) | [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/wmt18.en-de.ensemble.tar.gz) <br> See NOTE in the archive
10
-
11
- ## Example usage (torch.hub)
12
-
13
- We require a few additional Python dependencies for preprocessing:
14
- ```bash
15
- pip install subword_nmt sacremoses
16
- ```
17
-
18
- Then to generate translations from the full model ensemble:
19
- ```python
20
- import torch
21
-
22
- # List available models
23
- torch.hub.list('pytorch/fairseq') # [..., 'transformer.wmt18.en-de', ... ]
24
-
25
- # Load the WMT'18 En-De ensemble
26
- en2de_ensemble = torch.hub.load(
27
- 'pytorch/fairseq', 'transformer.wmt18.en-de',
28
- checkpoint_file='wmt18.model1.pt:wmt18.model2.pt:wmt18.model3.pt:wmt18.model4.pt:wmt18.model5.pt',
29
- tokenizer='moses', bpe='subword_nmt')
30
-
31
- # The ensemble contains 5 models
32
- len(en2de_ensemble.models)
33
- # 5
34
-
35
- # Translate
36
- en2de_ensemble.translate('Hello world!')
37
- # 'Hallo Welt!'
38
- ```
39
-
40
- ## Training your own model (WMT'18 English-German)
41
-
42
- The following instructions can be adapted to reproduce the models from the paper.
43
-
44
-
45
- #### Step 1. Prepare parallel data and optionally train a baseline (English-German) model
46
-
47
- First download and preprocess the data:
48
- ```bash
49
- # Download and prepare the data
50
- cd examples/backtranslation/
51
- bash prepare-wmt18en2de.sh
52
- cd ../..
53
-
54
- # Binarize the data
55
- TEXT=examples/backtranslation/wmt18_en_de
56
- fairseq-preprocess \
57
- --joined-dictionary \
58
- --source-lang en --target-lang de \
59
- --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
60
- --destdir data-bin/wmt18_en_de --thresholdtgt 0 --thresholdsrc 0 \
61
- --workers 20
62
-
63
- # Copy the BPE code into the data-bin directory for future use
64
- cp examples/backtranslation/wmt18_en_de/code data-bin/wmt18_en_de/code
65
- ```
66
-
67
- (Optionally) Train a baseline model (English-German) using just the parallel data:
68
- ```bash
69
- CHECKPOINT_DIR=checkpoints_en_de_parallel
70
- fairseq-train --fp16 \
71
- data-bin/wmt18_en_de \
72
- --source-lang en --target-lang de \
73
- --arch transformer_wmt_en_de_big --share-all-embeddings \
74
- --dropout 0.3 --weight-decay 0.0 \
75
- --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
76
- --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
77
- --lr 0.001 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
78
- --max-tokens 3584 --update-freq 16 \
79
- --max-update 30000 \
80
- --save-dir $CHECKPOINT_DIR
81
- # Note: the above command assumes 8 GPUs. Adjust `--update-freq` if you have a
82
- # different number of GPUs.
83
- ```
84
-
85
- Average the last 10 checkpoints:
86
- ```bash
87
- python scripts/average_checkpoints.py \
88
- --inputs $CHECKPOINT_DIR \
89
- --num-epoch-checkpoints 10 \
90
- --output $CHECKPOINT_DIR/checkpoint.avg10.pt
91
- ```
92
-
93
- Evaluate BLEU:
94
- ```bash
95
- # tokenized BLEU on newstest2017:
96
- bash examples/backtranslation/tokenized_bleu.sh \
97
- wmt17 \
98
- en-de \
99
- data-bin/wmt18_en_de \
100
- data-bin/wmt18_en_de/code \
101
- $CHECKPOINT_DIR/checkpoint.avg10.pt
102
- # BLEU4 = 29.57, 60.9/35.4/22.9/15.5 (BP=1.000, ratio=1.014, syslen=63049, reflen=62152)
103
- # compare to 29.46 in Table 1, which is also for tokenized BLEU
104
-
105
- # generally it's better to report (detokenized) sacrebleu though:
106
- bash examples/backtranslation/sacrebleu.sh \
107
- wmt17 \
108
- en-de \
109
- data-bin/wmt18_en_de \
110
- data-bin/wmt18_en_de/code \
111
- $CHECKPOINT_DIR/checkpoint.avg10.pt
112
- # BLEU+case.mixed+lang.en-de+numrefs.1+smooth.exp+test.wmt17+tok.13a+version.1.4.3 = 29.0 60.6/34.7/22.4/14.9 (BP = 1.000 ratio = 1.013 hyp_len = 62099 ref_len = 61287)
113
- ```
114
-
115
-
116
- #### Step 2. Back-translate monolingual German data
117
-
118
- Train a reverse model (German-English) to do the back-translation:
119
- ```bash
120
- CHECKPOINT_DIR=checkpoints_de_en_parallel
121
- fairseq-train --fp16 \
122
- data-bin/wmt18_en_de \
123
- --source-lang de --target-lang en \
124
- --arch transformer_wmt_en_de_big --share-all-embeddings \
125
- --dropout 0.3 --weight-decay 0.0 \
126
- --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
127
- --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
128
- --lr 0.001 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
129
- --max-tokens 3584 --update-freq 16 \
130
- --max-update 30000 \
131
- --save-dir $CHECKPOINT_DIR
132
- # Note: the above command assumes 8 GPUs. Adjust `--update-freq` if you have a
133
- # different number of GPUs.
134
- ```
135
-
136
- Let's evaluate the back-translation (BT) model to make sure it is well trained:
137
- ```bash
138
- bash examples/backtranslation/sacrebleu.sh \
139
- wmt17 \
140
- de-en \
141
- data-bin/wmt18_en_de \
142
- data-bin/wmt18_en_de/code \
143
- $CHECKPOINT_DIR/checkpoint_best.py
144
- # BLEU+case.mixed+lang.de-en+numrefs.1+smooth.exp+test.wmt17+tok.13a+version.1.4.3 = 34.9 66.9/41.8/28.5/19.9 (BP = 0.983 ratio = 0.984 hyp_len = 63342 ref_len = 64399)
145
- # compare to the best system from WMT'17 which scored 35.1: http://matrix.statmt.org/matrix/systems_list/1868
146
- ```
147
-
148
- Next prepare the monolingual data:
149
- ```bash
150
- # Download and prepare the monolingual data
151
- # By default the script samples 25M monolingual sentences, which after
152
- # deduplication should be just over 24M sentences. These are split into 25
153
- # shards, each with 1M sentences (except for the last shard).
154
- cd examples/backtranslation/
155
- bash prepare-de-monolingual.sh
156
- cd ../..
157
-
158
- # Binarize each shard of the monolingual data
159
- TEXT=examples/backtranslation/wmt18_de_mono
160
- for SHARD in $(seq -f "%02g" 0 24); do \
161
- fairseq-preprocess \
162
- --only-source \
163
- --source-lang de --target-lang en \
164
- --joined-dictionary \
165
- --srcdict data-bin/wmt18_en_de/dict.de.txt \
166
- --testpref $TEXT/bpe.monolingual.dedup.${SHARD} \
167
- --destdir data-bin/wmt18_de_mono/shard${SHARD} \
168
- --workers 20; \
169
- cp data-bin/wmt18_en_de/dict.en.txt data-bin/wmt18_de_mono/shard${SHARD}/; \
170
- done
171
- ```
172
-
173
- Now we're ready to perform back-translation over the monolingual data. The
174
- following command generates via sampling, but it's possible to use greedy
175
- decoding (`--beam 1`), beam search (`--beam 5`),
176
- top-k sampling (`--sampling --beam 1 --sampling-topk 10`), etc.:
177
- ```bash
178
- mkdir backtranslation_output
179
- for SHARD in $(seq -f "%02g" 0 24); do \
180
- fairseq-generate --fp16 \
181
- data-bin/wmt18_de_mono/shard${SHARD} \
182
- --path $CHECKPOINT_DIR/checkpoint_best.pt \
183
- --skip-invalid-size-inputs-valid-test \
184
- --max-tokens 4096 \
185
- --sampling --beam 1 \
186
- > backtranslation_output/sampling.shard${SHARD}.out; \
187
- done
188
- ```
189
-
190
- After BT, use the `extract_bt_data.py` script to re-combine the shards, extract
191
- the back-translations and apply length ratio filters:
192
- ```bash
193
- python examples/backtranslation/extract_bt_data.py \
194
- --minlen 1 --maxlen 250 --ratio 1.5 \
195
- --output backtranslation_output/bt_data --srclang en --tgtlang de \
196
- backtranslation_output/sampling.shard*.out
197
-
198
- # Ensure lengths are the same:
199
- # wc -l backtranslation_output/bt_data.{en,de}
200
- # 21795614 backtranslation_output/bt_data.en
201
- # 21795614 backtranslation_output/bt_data.de
202
- # 43591228 total
203
- ```
204
-
205
- Binarize the filtered BT data and combine it with the parallel data:
206
- ```bash
207
- TEXT=backtranslation_output
208
- fairseq-preprocess \
209
- --source-lang en --target-lang de \
210
- --joined-dictionary \
211
- --srcdict data-bin/wmt18_en_de/dict.en.txt \
212
- --trainpref $TEXT/bt_data \
213
- --destdir data-bin/wmt18_en_de_bt \
214
- --workers 20
215
-
216
- # We want to train on the combined data, so we'll symlink the parallel + BT data
217
- # in the wmt18_en_de_para_plus_bt directory. We link the parallel data as "train"
218
- # and the BT data as "train1", so that fairseq will combine them automatically
219
- # and so that we can use the `--upsample-primary` option to upsample the
220
- # parallel data (if desired).
221
- PARA_DATA=$(readlink -f data-bin/wmt18_en_de)
222
- BT_DATA=$(readlink -f data-bin/wmt18_en_de_bt)
223
- COMB_DATA=data-bin/wmt18_en_de_para_plus_bt
224
- mkdir -p $COMB_DATA
225
- for LANG in en de; do \
226
- ln -s ${PARA_DATA}/dict.$LANG.txt ${COMB_DATA}/dict.$LANG.txt; \
227
- for EXT in bin idx; do \
228
- ln -s ${PARA_DATA}/train.en-de.$LANG.$EXT ${COMB_DATA}/train.en-de.$LANG.$EXT; \
229
- ln -s ${BT_DATA}/train.en-de.$LANG.$EXT ${COMB_DATA}/train1.en-de.$LANG.$EXT; \
230
- ln -s ${PARA_DATA}/valid.en-de.$LANG.$EXT ${COMB_DATA}/valid.en-de.$LANG.$EXT; \
231
- ln -s ${PARA_DATA}/test.en-de.$LANG.$EXT ${COMB_DATA}/test.en-de.$LANG.$EXT; \
232
- done; \
233
- done
234
- ```
235
-
236
-
237
- #### 3. Train an English-German model over the combined parallel + BT data
238
-
239
- Finally we can train a model over the parallel + BT data:
240
- ```bash
241
- CHECKPOINT_DIR=checkpoints_en_de_parallel_plus_bt
242
- fairseq-train --fp16 \
243
- data-bin/wmt18_en_de_para_plus_bt \
244
- --upsample-primary 16 \
245
- --source-lang en --target-lang de \
246
- --arch transformer_wmt_en_de_big --share-all-embeddings \
247
- --dropout 0.3 --weight-decay 0.0 \
248
- --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
249
- --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
250
- --lr 0.0007 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
251
- --max-tokens 3584 --update-freq 16 \
252
- --max-update 100000 \
253
- --save-dir $CHECKPOINT_DIR
254
- # Note: the above command assumes 8 GPUs. Adjust `--update-freq` if you have a
255
- # different number of GPUs.
256
- ```
257
-
258
- Average the last 10 checkpoints:
259
- ```bash
260
- python scripts/average_checkpoints.py \
261
- --inputs $CHECKPOINT_DIR \
262
- --num-epoch-checkpoints 10 \
263
- --output $CHECKPOINT_DIR/checkpoint.avg10.pt
264
- ```
265
-
266
- Evaluate BLEU:
267
- ```bash
268
- # tokenized BLEU on newstest2017:
269
- bash examples/backtranslation/tokenized_bleu.sh \
270
- wmt17 \
271
- en-de \
272
- data-bin/wmt18_en_de \
273
- data-bin/wmt18_en_de/code \
274
- $CHECKPOINT_DIR/checkpoint.avg10.pt
275
- # BLEU4 = 32.35, 64.4/38.9/26.2/18.3 (BP=0.977, ratio=0.977, syslen=60729, reflen=62152)
276
- # compare to 32.35 in Table 1, which is also for tokenized BLEU
277
-
278
- # generally it's better to report (detokenized) sacrebleu:
279
- bash examples/backtranslation/sacrebleu.sh \
280
- wmt17 \
281
- en-de \
282
- data-bin/wmt18_en_de \
283
- data-bin/wmt18_en_de/code \
284
- $CHECKPOINT_DIR/checkpoint.avg10.pt
285
- # BLEU+case.mixed+lang.en-de+numrefs.1+smooth.exp+test.wmt17+tok.13a+version.1.4.3 = 31.5 64.3/38.2/25.6/17.6 (BP = 0.971 ratio = 0.971 hyp_len = 59515 ref_len = 61287)
286
- ```
287
-
288
-
289
- ## Citation
290
- ```bibtex
291
- @inproceedings{edunov2018backtranslation,
292
- title = {Understanding Back-Translation at Scale},
293
- author = {Edunov, Sergey and Ott, Myle and Auli, Michael and Grangier, David},
294
- booktitle = {Conference of the Association for Computational Linguistics (ACL)},
295
- year = 2018,
296
- }
297
- ```
 
examples/backtranslation/deduplicate_lines.py DELETED
@@ -1,41 +0,0 @@
1
- #!/usr/bin/python3
2
- # Copyright (c) Facebook, Inc. and its affiliates.
3
- #
4
- # This source code is licensed under the MIT license found in the
5
- # LICENSE file in the root directory of this source tree.
6
-
7
- import argparse
8
- import fileinput
9
- import hashlib
10
- import sys
11
- from multiprocessing import Pool
12
-
13
-
14
- def get_hashes_and_lines(raw_line):
15
- hash = hashlib.md5(raw_line).hexdigest()
16
- return hash, raw_line
17
-
18
-
19
- def main():
20
- parser = argparse.ArgumentParser()
21
- parser.add_argument("--workers", type=int, default=10)
22
- parser.add_argument("files", nargs="*", help="input files")
23
- args = parser.parse_args()
24
-
25
- seen = set()
26
- with fileinput.input(args.files, mode="rb") as h:
27
- pool = Pool(args.workers)
28
- results = pool.imap_unordered(get_hashes_and_lines, h, 1000)
29
- for i, (hash, raw_line) in enumerate(results):
30
- if hash not in seen:
31
- seen.add(hash)
32
- sys.stdout.buffer.write(raw_line)
33
- if i % 1000000 == 0:
34
- print(i, file=sys.stderr, end="", flush=True)
35
- elif i % 100000 == 0:
36
- print(".", file=sys.stderr, end="", flush=True)
37
- print(file=sys.stderr, flush=True)
38
-
39
-
40
- if __name__ == "__main__":
41
- main()
 
examples/backtranslation/extract_bt_data.py DELETED
@@ -1,72 +0,0 @@
1
- #!/usr/bin/env python
2
- # Copyright (c) Facebook, Inc. and its affiliates.
3
- #
4
- # This source code is licensed under the MIT license found in the
5
- # LICENSE file in the root directory of this source tree.
6
-
7
- import argparse
8
- import fileinput
9
-
10
- from tqdm import tqdm
11
-
12
-
13
- def main():
14
- parser = argparse.ArgumentParser(
15
- description=(
16
- "Extract back-translations from the stdout of fairseq-generate. "
17
- "If there are multiply hypotheses for a source, we only keep the first one. "
18
- )
19
- )
20
- parser.add_argument("--output", required=True, help="output prefix")
21
- parser.add_argument(
22
- "--srclang", required=True, help="source language (extracted from H-* lines)"
23
- )
24
- parser.add_argument(
25
- "--tgtlang", required=True, help="target language (extracted from S-* lines)"
26
- )
27
- parser.add_argument("--minlen", type=int, help="min length filter")
28
- parser.add_argument("--maxlen", type=int, help="max length filter")
29
- parser.add_argument("--ratio", type=float, help="ratio filter")
30
- parser.add_argument("files", nargs="*", help="input files")
31
- args = parser.parse_args()
32
-
33
- def validate(src, tgt):
34
- srclen = len(src.split(" ")) if src != "" else 0
35
- tgtlen = len(tgt.split(" ")) if tgt != "" else 0
36
- if (
37
- (args.minlen is not None and (srclen < args.minlen or tgtlen < args.minlen))
38
- or (
39
- args.maxlen is not None
40
- and (srclen > args.maxlen or tgtlen > args.maxlen)
41
- )
42
- or (
43
- args.ratio is not None
44
- and (max(srclen, tgtlen) / float(min(srclen, tgtlen)) > args.ratio)
45
- )
46
- ):
47
- return False
48
- return True
49
-
50
- def safe_index(toks, index, default):
51
- try:
52
- return toks[index]
53
- except IndexError:
54
- return default
55
-
56
- with open(args.output + "." + args.srclang, "w") as src_h, open(
57
- args.output + "." + args.tgtlang, "w"
58
- ) as tgt_h:
59
- for line in tqdm(fileinput.input(args.files)):
60
- if line.startswith("S-"):
61
- tgt = safe_index(line.rstrip().split("\t"), 1, "")
62
- elif line.startswith("H-"):
63
- if tgt is not None:
64
- src = safe_index(line.rstrip().split("\t"), 2, "")
65
- if validate(src, tgt):
66
- print(src, file=src_h)
67
- print(tgt, file=tgt_h)
68
- tgt = None
69
-
70
-
71
- if __name__ == "__main__":
72
- main()
 
examples/backtranslation/prepare-de-monolingual.sh DELETED
@@ -1,98 +0,0 @@
1
- #!/bin/bash
2
-
3
- SCRIPTS=mosesdecoder/scripts
4
- TOKENIZER=$SCRIPTS/tokenizer/tokenizer.perl
5
- NORM_PUNC=$SCRIPTS/tokenizer/normalize-punctuation.perl
6
- REM_NON_PRINT_CHAR=$SCRIPTS/tokenizer/remove-non-printing-char.perl
7
- BPEROOT=subword-nmt/subword_nmt
8
-
9
-
10
- BPE_CODE=wmt18_en_de/code
11
- SUBSAMPLE_SIZE=25000000
12
- LANG=de
13
-
14
-
15
- OUTDIR=wmt18_${LANG}_mono
16
- orig=orig
17
- tmp=$OUTDIR/tmp
18
- mkdir -p $OUTDIR $tmp
19
-
20
-
21
- URLS=(
22
- "http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2007.de.shuffled.gz"
23
- "http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2008.de.shuffled.gz"
24
- "http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2009.de.shuffled.gz"
25
- "http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2010.de.shuffled.gz"
26
- "http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2011.de.shuffled.gz"
27
- "http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2012.de.shuffled.gz"
28
- "http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2013.de.shuffled.gz"
29
- "http://www.statmt.org/wmt15/training-monolingual-news-crawl-v2/news.2014.de.shuffled.v2.gz"
30
- "http://data.statmt.org/wmt16/translation-task/news.2015.de.shuffled.gz"
31
- "http://data.statmt.org/wmt17/translation-task/news.2016.de.shuffled.gz"
32
- "http://data.statmt.org/wmt18/translation-task/news.2017.de.shuffled.deduped.gz"
33
- )
34
- FILES=(
35
- "news.2007.de.shuffled.gz"
36
- "news.2008.de.shuffled.gz"
37
- "news.2009.de.shuffled.gz"
38
- "news.2010.de.shuffled.gz"
39
- "news.2011.de.shuffled.gz"
40
- "news.2012.de.shuffled.gz"
41
- "news.2013.de.shuffled.gz"
42
- "news.2014.de.shuffled.v2.gz"
43
- "news.2015.de.shuffled.gz"
44
- "news.2016.de.shuffled.gz"
45
- "news.2017.de.shuffled.deduped.gz"
46
- )
47
-
48
-
49
- cd $orig
50
- for ((i=0;i<${#URLS[@]};++i)); do
51
- file=${FILES[i]}
52
- if [ -f $file ]; then
53
- echo "$file already exists, skipping download"
54
- else
55
- url=${URLS[i]}
56
- wget "$url"
57
- fi
58
- done
59
- cd ..
60
-
61
-
62
- if [ -f $tmp/monolingual.${SUBSAMPLE_SIZE}.${LANG} ]; then
63
- echo "found monolingual sample, skipping shuffle/sample/tokenize"
64
- else
65
- gzip -c -d -k $(for FILE in "${FILES[@]}"; do echo $orig/$FILE; done) \
66
- | shuf -n $SUBSAMPLE_SIZE \
67
- | perl $NORM_PUNC $LANG \
68
- | perl $REM_NON_PRINT_CHAR \
69
- | perl $TOKENIZER -threads 8 -a -l $LANG \
70
- > $tmp/monolingual.${SUBSAMPLE_SIZE}.${LANG}
71
- fi
72
-
73
-
74
- if [ -f $tmp/bpe.monolingual.${SUBSAMPLE_SIZE}.${LANG} ]; then
75
- echo "found BPE monolingual sample, skipping BPE step"
76
- else
77
- python $BPEROOT/apply_bpe.py -c $BPE_CODE \
78
- < $tmp/monolingual.${SUBSAMPLE_SIZE}.${LANG} \
79
- > $tmp/bpe.monolingual.${SUBSAMPLE_SIZE}.${LANG}
80
- fi
81
-
82
-
83
- if [ -f $tmp/bpe.monolingual.dedup.${SUBSAMPLE_SIZE}.${LANG} ]; then
84
- echo "found deduplicated monolingual sample, skipping deduplication step"
85
- else
86
- python deduplicate_lines.py $tmp/bpe.monolingual.${SUBSAMPLE_SIZE}.${LANG} \
87
- > $tmp/bpe.monolingual.dedup.${SUBSAMPLE_SIZE}.${LANG}
88
- fi
89
-
90
-
91
- if [ -f $OUTDIR/bpe.monolingual.dedup.00.de ]; then
92
- echo "found sharded data, skipping sharding step"
93
- else
94
- split --lines 1000000 --numeric-suffixes \
95
- --additional-suffix .${LANG} \
96
- $tmp/bpe.monolingual.dedup.${SUBSAMPLE_SIZE}.${LANG} \
97
- $OUTDIR/bpe.monolingual.dedup.
98
- fi
 
examples/backtranslation/prepare-wmt18en2de.sh DELETED
@@ -1,135 +0,0 @@
1
- #!/bin/bash
2
- # Adapted from https://github.com/facebookresearch/MIXER/blob/master/prepareData.sh
3
-
4
- echo 'Cloning Moses github repository (for tokenization scripts)...'
5
- git clone https://github.com/moses-smt/mosesdecoder.git
6
-
7
- echo 'Cloning Subword NMT repository (for BPE pre-processing)...'
8
- git clone https://github.com/rsennrich/subword-nmt.git
9
-
10
- SCRIPTS=mosesdecoder/scripts
11
- TOKENIZER=$SCRIPTS/tokenizer/tokenizer.perl
12
- CLEAN=$SCRIPTS/training/clean-corpus-n.perl
13
- NORM_PUNC=$SCRIPTS/tokenizer/normalize-punctuation.perl
14
- REM_NON_PRINT_CHAR=$SCRIPTS/tokenizer/remove-non-printing-char.perl
15
- BPEROOT=subword-nmt/subword_nmt
16
- BPE_TOKENS=32000
17
-
18
- URLS=(
19
- "http://statmt.org/wmt13/training-parallel-europarl-v7.tgz"
20
- "http://statmt.org/wmt13/training-parallel-commoncrawl.tgz"
21
- "http://data.statmt.org/wmt18/translation-task/training-parallel-nc-v13.tgz"
22
- "http://data.statmt.org/wmt18/translation-task/rapid2016.tgz"
23
- "http://data.statmt.org/wmt17/translation-task/dev.tgz"
24
- "http://statmt.org/wmt14/test-full.tgz"
25
- )
26
- FILES=(
27
- "training-parallel-europarl-v7.tgz"
28
- "training-parallel-commoncrawl.tgz"
29
- "training-parallel-nc-v13.tgz"
30
- "rapid2016.tgz"
31
- "dev.tgz"
32
- "test-full.tgz"
33
- )
34
- CORPORA=(
35
- "training/europarl-v7.de-en"
36
- "commoncrawl.de-en"
37
- "training-parallel-nc-v13/news-commentary-v13.de-en"
38
- "rapid2016.de-en"
39
- )
40
-
41
- if [ ! -d "$SCRIPTS" ]; then
42
- echo "Please set SCRIPTS variable correctly to point to Moses scripts."
43
- exit 1
44
- fi
45
-
46
- OUTDIR=wmt18_en_de
47
-
48
- src=en
49
- tgt=de
50
- lang=en-de
51
- prep=$OUTDIR
52
- tmp=$prep/tmp
53
- orig=orig
54
-
55
- mkdir -p $orig $tmp $prep
56
-
57
- cd $orig
58
-
59
- for ((i=0;i<${#URLS[@]};++i)); do
60
- file=${FILES[i]}
61
- if [ -f $file ]; then
62
- echo "$file already exists, skipping download"
63
- else
64
- url=${URLS[i]}
65
- wget "$url"
66
- if [ -f $file ]; then
67
- echo "$url successfully downloaded."
68
- else
69
- echo "$url not successfully downloaded."
70
- exit 1
71
- fi
72
- if [ ${file: -4} == ".tgz" ]; then
73
- tar zxvf $file
74
- elif [ ${file: -4} == ".tar" ]; then
75
- tar xvf $file
76
- fi
77
- fi
78
- done
79
- cd ..
80
-
81
- echo "pre-processing train data..."
82
- for l in $src $tgt; do
83
- rm $tmp/train.tags.$lang.tok.$l
84
- for f in "${CORPORA[@]}"; do
85
- cat $orig/$f.$l | \
86
- perl $NORM_PUNC $l | \
87
- perl $REM_NON_PRINT_CHAR | \
88
- perl $TOKENIZER -threads 8 -a -l $l >> $tmp/train.tags.$lang.tok.$l
89
- done
90
- done
91
-
92
- echo "pre-processing test data..."
93
- for l in $src $tgt; do
94
- if [ "$l" == "$src" ]; then
95
- t="src"
96
- else
97
- t="ref"
98
- fi
99
- grep '<seg id' $orig/test-full/newstest2014-deen-$t.$l.sgm | \
100
- sed -e 's/<seg id="[0-9]*">\s*//g' | \
101
- sed -e 's/\s*<\/seg>\s*//g' | \
102
- sed -e "s/\’/\'/g" | \
103
- perl $TOKENIZER -threads 8 -a -l $l > $tmp/test.$l
104
- echo ""
105
- done
106
-
107
- echo "splitting train and valid..."
108
- for l in $src $tgt; do
109
- awk '{if (NR%100 == 0) print $0; }' $tmp/train.tags.$lang.tok.$l > $tmp/valid.$l
110
- awk '{if (NR%100 != 0) print $0; }' $tmp/train.tags.$lang.tok.$l > $tmp/train.$l
111
- done
112
-
113
- TRAIN=$tmp/train.de-en
114
- BPE_CODE=$prep/code
115
- rm -f $TRAIN
116
- for l in $src $tgt; do
117
- cat $tmp/train.$l >> $TRAIN
118
- done
119
-
120
- echo "learn_bpe.py on ${TRAIN}..."
121
- python $BPEROOT/learn_bpe.py -s $BPE_TOKENS < $TRAIN > $BPE_CODE
122
-
123
- for L in $src $tgt; do
124
- for f in train.$L valid.$L test.$L; do
125
- echo "apply_bpe.py to ${f}..."
126
- python $BPEROOT/apply_bpe.py -c $BPE_CODE < $tmp/$f > $tmp/bpe.$f
127
- done
128
- done
129
-
130
- perl $CLEAN -ratio 1.5 $tmp/bpe.train $src $tgt $prep/train 1 250
131
- perl $CLEAN -ratio 1.5 $tmp/bpe.valid $src $tgt $prep/valid 1 250
132
-
133
- for L in $src $tgt; do
134
- cp $tmp/bpe.test.$L $prep/test.$L
135
- done
 
examples/backtranslation/sacrebleu.sh DELETED
@@ -1,37 +0,0 @@
1
- #!/bin/bash
2
-
3
- if [ $# -ne 5 ]; then
4
- echo "usage: $0 [dataset=wmt14/full] [langpair=en-de] [databin] [bpecode] [model]"
5
- exit
6
- fi
7
-
8
-
9
- DATASET=$1
10
- LANGPAIR=$2
11
- DATABIN=$3
12
- BPECODE=$4
13
- MODEL=$5
14
-
15
- SRCLANG=$(echo $LANGPAIR | cut -d '-' -f 1)
16
- TGTLANG=$(echo $LANGPAIR | cut -d '-' -f 2)
17
-
18
-
19
- BPEROOT=examples/backtranslation/subword-nmt/subword_nmt
20
- if [ ! -e $BPEROOT ]; then
21
- BPEROOT=subword-nmt/subword_nmt
22
- if [ ! -e $BPEROOT ]; then
23
- echo 'Cloning Subword NMT repository (for BPE pre-processing)...'
24
- git clone https://github.com/rsennrich/subword-nmt.git
25
- fi
26
- fi
27
-
28
-
29
- sacrebleu -t $DATASET -l $LANGPAIR --echo src \
30
- | sacremoses tokenize -a -l $SRCLANG -q \
31
- | python $BPEROOT/apply_bpe.py -c $BPECODE \
32
- | fairseq-interactive $DATABIN --path $MODEL \
33
- -s $SRCLANG -t $TGTLANG \
34
- --beam 5 --remove-bpe --buffer-size 1024 --max-tokens 8000 \
35
- | grep ^H- | cut -f 3- \
36
- | sacremoses detokenize -l $TGTLANG -q \
37
- | sacrebleu -t $DATASET -l $LANGPAIR
 
examples/backtranslation/tokenized_bleu.sh DELETED
@@ -1,46 +0,0 @@
1
- #!/bin/bash
2
-
3
- if [ $# -ne 5 ]; then
4
- echo "usage: $0 [dataset=wmt14/full] [langpair=en-de] [databin] [bpecode] [model]"
5
- exit
6
- fi
7
-
8
-
9
- DATASET=$1
10
- LANGPAIR=$2
11
- DATABIN=$3
12
- BPECODE=$4
13
- MODEL=$5
14
-
15
- SRCLANG=$(echo $LANGPAIR | cut -d '-' -f 1)
16
- TGTLANG=$(echo $LANGPAIR | cut -d '-' -f 2)
17
-
18
-
19
- BPEROOT=examples/backtranslation/subword-nmt/subword_nmt
20
- if [ ! -e $BPEROOT ]; then
21
- BPEROOT=subword-nmt/subword_nmt
22
- if [ ! -e $BPEROOT ]; then
23
- echo 'Cloning Subword NMT repository (for BPE pre-processing)...'
24
- git clone https://github.com/rsennrich/subword-nmt.git
25
- fi
26
- fi
27
-
28
-
29
- TMP_REF=$(mktemp)
30
-
31
- sacrebleu -t $DATASET -l $LANGPAIR --echo ref -q \
32
- | sacremoses normalize -l $TGTLANG -q \
33
- | sacremoses tokenize -a -l $TGTLANG -q \
34
- > $TMP_REF
35
-
36
- sacrebleu -t $DATASET -l $LANGPAIR --echo src -q \
37
- | sacremoses normalize -l $SRCLANG -q \
38
- | sacremoses tokenize -a -l $SRCLANG -q \
39
- | python $BPEROOT/apply_bpe.py -c $BPECODE \
40
- | fairseq-interactive $DATABIN --path $MODEL \
41
- -s $SRCLANG -t $TGTLANG \
42
- --beam 5 --remove-bpe --buffer-size 1024 --max-tokens 8000 \
43
- | grep ^H- | cut -f 3- \
44
- | fairseq-score --ref $TMP_REF
45
-
46
- rm -f $TMP_REF
 
examples/bart/README.glue.md DELETED
@@ -1,99 +0,0 @@
1
- # Fine-tuning BART on GLUE tasks
2
-
3
- ### 1) Download the data from GLUE website (https://gluebenchmark.com/tasks) using following commands:
4
- ```bash
5
- wget https://gist.githubusercontent.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e/raw/17b8dd0d724281ed7c3b2aeeda662b92809aadd5/download_glue_data.py
6
- python download_glue_data.py --data_dir glue_data --tasks all
7
- ```
8
-
9
- ### 2) Preprocess GLUE task data (same as RoBERTa):
10
- ```bash
11
- ./examples/roberta/preprocess_GLUE_tasks.sh glue_data <glue_task_name>
12
- ```
13
- `glue_task_name` is one of the following:
14
- `{ALL, QQP, MNLI, QNLI, MRPC, RTE, STS-B, SST-2, CoLA}`
15
- Use `ALL` for preprocessing all the glue tasks.
16
-
17
- ### 3) Fine-tuning on GLUE task:
18
- Example fine-tuning cmd for `RTE` task
19
- ```bash
20
- TOTAL_NUM_UPDATES=2036 # 10 epochs through RTE for bsz 16
21
- WARMUP_UPDATES=61 # 6 percent of the number of updates
22
- LR=1e-05 # Peak LR for polynomial LR scheduler.
23
- NUM_CLASSES=2
24
- MAX_SENTENCES=16 # Batch size.
25
- BART_PATH=/path/to/bart/model.pt
26
-
27
- CUDA_VISIBLE_DEVICES=0,1 fairseq-train RTE-bin/ \
28
- --restore-file $BART_PATH \
29
- --batch-size $MAX_SENTENCES \
30
- --max-tokens 4400 \
31
- --task sentence_prediction \
32
- --add-prev-output-tokens \
33
- --layernorm-embedding \
34
- --share-all-embeddings \
35
- --share-decoder-input-output-embed \
36
- --reset-optimizer --reset-dataloader --reset-meters \
37
- --required-batch-size-multiple 1 \
38
- --init-token 0 \
39
- --arch bart_large \
40
- --criterion sentence_prediction \
41
- --num-classes $NUM_CLASSES \
42
- --dropout 0.1 --attention-dropout 0.1 \
43
- --weight-decay 0.01 --optimizer adam --adam-betas "(0.9, 0.98)" --adam-eps 1e-08 \
44
- --clip-norm 0.0 \
45
- --lr-scheduler polynomial_decay --lr $LR --total-num-update $TOTAL_NUM_UPDATES --warmup-updates $WARMUP_UPDATES \
46
- --fp16 --fp16-init-scale 4 --threshold-loss-scale 1 --fp16-scale-window 128 \
47
- --max-epoch 10 \
48
- --find-unused-parameters \
49
- --best-checkpoint-metric accuracy --maximize-best-checkpoint-metric;
50
- ```
51
-
52
- For each of the GLUE task, you will need to use following cmd-line arguments:
53
-
54
- Model | MNLI | QNLI | QQP | RTE | SST-2 | MRPC | CoLA | STS-B
55
- ---|---|---|---|---|---|---|---|---
56
- `--num-classes` | 3 | 2 | 2 | 2 | 2 | 2 | 2 | 1
57
- `--lr` | 5e-6 | 1e-5 | 1e-5 | 1e-5 | 5e-6 | 2e-5 | 2e-5 | 2e-5
58
- `bsz` | 128 | 32 | 32 | 32 | 128 | 64 | 64 | 32
59
- `--total-num-update` | 30968 | 33112 | 113272 | 1018 | 5233 | 1148 | 1334 | 1799
60
- `--warmup-updates` | 1858 | 1986 | 6796 | 61 | 314 | 68 | 80 | 107
61
-
62
- For `STS-B` additionally add `--regression-target --best-checkpoint-metric loss` and remove `--maximize-best-checkpoint-metric`.
63
-
64
- **Note:**
65
-
66
- a) `--total-num-update` is used by the `polynomial_decay` LR scheduler and is calculated for `--max-epoch=10` and `--batch-size=32/64/128`, depending on the task.
67
-
68
- b) The above cmd-args and hyperparams were tested on an Nvidia `V100` GPU with `32gb` of memory for each task. Depending on the GPU memory available to you, you can increase `--update-freq` and reduce `--batch-size`.
69
-
70
- ### Inference on GLUE task
71
- After training the model as described in the previous step, you can perform inference with the checkpoints in the `checkpoints/` directory using the following Python code snippet:
72
-
73
- ```python
74
- from fairseq.models.bart import BARTModel
75
-
76
- bart = BARTModel.from_pretrained(
77
- 'checkpoints/',
78
- checkpoint_file='checkpoint_best.pt',
79
- data_name_or_path='RTE-bin'
80
- )
81
-
82
- label_fn = lambda label: bart.task.label_dictionary.string(
83
- [label + bart.task.label_dictionary.nspecial]
84
- )
85
- ncorrect, nsamples = 0, 0
86
- bart.cuda()
87
- bart.eval()
88
- with open('glue_data/RTE/dev.tsv') as fin:
89
- fin.readline()
90
- for index, line in enumerate(fin):
91
- tokens = line.strip().split('\t')
92
- sent1, sent2, target = tokens[1], tokens[2], tokens[3]
93
- tokens = bart.encode(sent1, sent2)
94
- prediction = bart.predict('sentence_classification_head', tokens).argmax().item()
95
- prediction_label = label_fn(prediction)
96
- ncorrect += int(prediction_label == target)
97
- nsamples += 1
98
- print('| Accuracy: ', float(ncorrect)/float(nsamples))
99
- ```
 
examples/bart/README.md DELETED
@@ -1,228 +0,0 @@
1
- # BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
2
-
3
- [https://arxiv.org/abs/1910.13461](https://arxiv.org/abs/1910.13461)
4
-
5
- ## Introduction
6
-
7
- BART is a sequence-to-sequence model trained with a denoising pretraining objective. We show that this pretraining objective is more generic: BART matches [RoBERTa](../roberta) results on SQuAD and GLUE and achieves state-of-the-art results on summarization (XSum, CNN dataset), long-form generative question answering (ELI5) and dialog response generation (ConvAI2). See the associated paper for more details.
8
-
9
- ## Pre-trained models
10
-
11
- Model | Description | # params | Download
12
- ---|---|---|---
13
- `bart.base` | BART model with 6 encoder and decoder layers | 140M | [bart.base.tar.gz](https://dl.fbaipublicfiles.com/fairseq/models/bart.base.tar.gz)
14
- `bart.large` | BART model with 12 encoder and decoder layers | 400M | [bart.large.tar.gz](https://dl.fbaipublicfiles.com/fairseq/models/bart.large.tar.gz)
15
- `bart.large.mnli` | `bart.large` finetuned on `MNLI` | 400M | [bart.large.mnli.tar.gz](https://dl.fbaipublicfiles.com/fairseq/models/bart.large.mnli.tar.gz)
16
- `bart.large.cnn` | `bart.large` finetuned on `CNN-DM` | 400M | [bart.large.cnn.tar.gz](https://dl.fbaipublicfiles.com/fairseq/models/bart.large.cnn.tar.gz)
17
- `bart.large.xsum` | `bart.large` finetuned on `Xsum` | 400M | [bart.large.xsum.tar.gz](https://dl.fbaipublicfiles.com/fairseq/models/bart.large.xsum.tar.gz)
18
-
19
- ## Results
20
-
21
- **[GLUE (Wang et al., 2019)](https://gluebenchmark.com/)**
22
- _(dev set, single model, single-task finetuning)_
23
-
24
- Model | MNLI | QNLI | QQP | RTE | SST-2 | MRPC | CoLA | STS-B
25
- ---|---|---|---|---|---|---|---|---
26
- `roberta.large` | 90.2 | 94.7 | 92.2 | 86.6 | 96.4 | 90.9 | 68.0 | 92.4
27
- `bart.large` | 89.9 | 94.9 | 92.5 | 87.0 | 96.6 | 90.4 | 62.8 | 91.2
28
-
29
- **[SQuAD (Rajpurkar et al., 2018)](https://rajpurkar.github.io/SQuAD-explorer/)**
30
- _(dev set, no additional data used)_
31
-
32
- Model | SQuAD 1.1 EM/F1 | SQuAD 2.0 EM/F1
33
- ---|---|---
34
- `roberta.large` | 88.9/94.6 | 86.5/89.4
35
- `bart.large` | 88.8/94.6 | 86.1/89.2
36
-
37
- **[CNN/Daily Mail](http://nlpprogress.com/english/summarization.html)**
38
- _(test set, no additional data used)_
39
-
40
- Model | R1 | R2 | RL
41
- ---|---|---|---
42
- `BERTSUMEXTABS` | 42.13 | 19.60 | 39.18
43
- `bart.large` | 44.16 | 21.28 | 40.90
44
-
45
- ## Example usage
46
-
47
- ##### Load BART from torch.hub (PyTorch >= 1.1):
48
- ```python
49
- import torch
50
- bart = torch.hub.load('pytorch/fairseq', 'bart.large')
51
- bart.eval() # disable dropout (or leave in train mode to finetune)
52
- ```
53
-
54
- ##### Load BART (for PyTorch 1.0 or custom models):
55
- ```python
56
- # Download bart.large model
57
- wget https://dl.fbaipublicfiles.com/fairseq/models/bart.large.tar.gz
58
- tar -xzvf bart.large.tar.gz
59
-
60
- # Load the model in fairseq
61
- from fairseq.models.bart import BARTModel
62
- bart = BARTModel.from_pretrained('/path/to/bart.large', checkpoint_file='model.pt')
63
- bart.eval() # disable dropout (or leave in train mode to finetune)
64
- ```
65
-
66
- ##### Apply Byte-Pair Encoding (BPE) to input text:
67
- ```python
68
- tokens = bart.encode('Hello world!')
69
- assert tokens.tolist() == [0, 31414, 232, 328, 2]
70
- bart.decode(tokens) # 'Hello world!'
71
- ```
72
-
73
- ##### Extract features from BART:
74
- ```python
75
- # Extract the last layer's features
76
- last_layer_features = bart.extract_features(tokens)
77
- assert last_layer_features.size() == torch.Size([1, 5, 1024])
78
-
79
- # Extract all layer's features from decoder (layer 0 is the embedding layer)
80
- all_layers = bart.extract_features(tokens, return_all_hiddens=True)
81
- assert len(all_layers) == 13
82
- assert torch.all(all_layers[-1] == last_layer_features)
83
- ```
84
-
85
- ##### Use BART for sentence-pair classification tasks:
86
- ```python
87
- # Download BART already finetuned for MNLI
88
- bart = torch.hub.load('pytorch/fairseq', 'bart.large.mnli')
89
- bart.eval() # disable dropout for evaluation
90
-
91
- # Encode a pair of sentences and make a prediction
92
- tokens = bart.encode('BART is a seq2seq model.', 'BART is not sequence to sequence.')
93
- bart.predict('mnli', tokens).argmax() # 0: contradiction
94
-
95
- # Encode another pair of sentences
96
- tokens = bart.encode('BART is denoising autoencoder.', 'BART is version of autoencoder.')
97
- bart.predict('mnli', tokens).argmax() # 2: entailment
98
- ```
99
-
100
- ##### Register a new (randomly initialized) classification head:
101
- ```python
102
- bart.register_classification_head('new_task', num_classes=3)
103
- logprobs = bart.predict('new_task', tokens)
104
- ```
105
-
106
- ##### Batched prediction:
107
- ```python
108
- import torch
109
- from fairseq.data.data_utils import collate_tokens
110
-
111
- bart = torch.hub.load('pytorch/fairseq', 'bart.large.mnli')
112
- bart.eval()
113
-
114
- batch_of_pairs = [
115
- ['BART is a seq2seq model.', 'BART is not sequence to sequence.'],
116
- ['BART is denoising autoencoder.', 'BART is version of autoencoder.'],
117
- ]
118
-
119
- batch = collate_tokens(
120
- [bart.encode(pair[0], pair[1]) for pair in batch_of_pairs], pad_idx=1
121
- )
122
-
123
- logprobs = bart.predict('mnli', batch)
124
- print(logprobs.argmax(dim=1))
125
- # tensor([0, 2])
126
- ```
127
-
128
- ##### Using the GPU:
129
- ```python
130
- bart.cuda()
131
- bart.predict('new_task', tokens)
132
- ```
133
-
134
- #### Filling masks:
135
-
136
- BART can be used to fill multiple `<mask>` tokens in the input.
137
- ```python
138
- bart = torch.hub.load('pytorch/fairseq', 'bart.base')
139
- bart.eval()
140
- bart.fill_mask(['The cat <mask> on the <mask>.'], topk=3, beam=10)
141
- # [[('The cat was on the ground.', tensor(-0.6183)), ('The cat was on the floor.', tensor(-0.6798)), ('The cat sleeps on the couch.', tensor(-0.6830))]]
142
- ```
143
-
144
- Note that by default we enforce the output length to match the input length.
145
- This can be disabled by setting ``match_source_len=False``:
146
- ```
147
- bart.fill_mask(['The cat <mask> on the <mask>.'], topk=3, beam=10, match_source_len=False)
148
- # [[('The cat was on the ground.', tensor(-0.6185)), ('The cat was asleep on the couch.', tensor(-0.6276)), ('The cat was on the floor.', tensor(-0.6800))]]
149
- ```
150
-
151
- Example code to fill masks for a batch of sentences using GPU
152
- ```
153
- bart.cuda()
154
- bart.fill_mask(['The cat <mask> on the <mask>.', 'The dog <mask> on the <mask>.'], topk=3, beam=10)
155
- # [[('The cat was on the ground.', tensor(-0.6183)), ('The cat was on the floor.', tensor(-0.6798)), ('The cat sleeps on the couch.', tensor(-0.6830))], [('The dog was on the ground.', tensor(-0.6190)), ('The dog lay on the ground.', tensor(-0.6711)),
156
- ('The dog was asleep on the couch', tensor(-0.6796))]]
157
- ```
158
-
159
- #### Evaluating the `bart.large.mnli` model:
160
-
161
- Example python code snippet to evaluate accuracy on the MNLI `dev_matched` set.
162
- ```python
163
- label_map = {0: 'contradiction', 1: 'neutral', 2: 'entailment'}
164
- ncorrect, nsamples = 0, 0
165
- bart.cuda()
166
- bart.eval()
167
- with open('glue_data/MNLI/dev_matched.tsv') as fin:
168
- fin.readline()
169
- for index, line in enumerate(fin):
170
- tokens = line.strip().split('\t')
171
- sent1, sent2, target = tokens[8], tokens[9], tokens[-1]
172
- tokens = bart.encode(sent1, sent2)
173
- prediction = bart.predict('mnli', tokens).argmax().item()
174
- prediction_label = label_map[prediction]
175
- ncorrect += int(prediction_label == target)
176
- nsamples += 1
177
- print('| Accuracy: ', float(ncorrect)/float(nsamples))
178
- # Expected output: 0.9010
179
- ```
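-
- The loop above scores one pair at a time. A batched variant is usually faster on GPU; the following sketch reuses `collate_tokens` from the batched-prediction example above (the batch size of 32 is an arbitrary choice):
- ```python
- from fairseq.data.data_utils import collate_tokens
-
- label_map = {0: 'contradiction', 1: 'neutral', 2: 'entailment'}
- ncorrect, nsamples = 0, 0
- pairs, targets = [], []
-
- def score_batch(pairs, targets):
-     # pad_idx=1 matches the batched-prediction example above
-     batch = collate_tokens(pairs, pad_idx=1)
-     preds = bart.predict('mnli', batch).argmax(dim=1).tolist()
-     return sum(int(label_map[p] == t) for p, t in zip(preds, targets))
-
- with open('glue_data/MNLI/dev_matched.tsv') as fin:
-     fin.readline()  # skip the TSV header
-     for line in fin:
-         cols = line.strip().split('\t')
-         pairs.append(bart.encode(cols[8], cols[9]))
-         targets.append(cols[-1])
-         if len(pairs) == 32:
-             ncorrect += score_batch(pairs, targets)
-             nsamples += len(pairs)
-             pairs, targets = [], []
-     if pairs:  # score the last partial batch
-         ncorrect += score_batch(pairs, targets)
-         nsamples += len(pairs)
- print('| Accuracy: ', float(ncorrect) / float(nsamples))
- ```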
180
-
181
- #### Evaluating the `bart.large.cnn` model:
182
- - Follow the instructions [here](https://github.com/abisee/cnn-dailymail) to download and process the data into files such that `test.source` and `test.target` each have one line per non-tokenized sample.
183
- - For simpler preprocessing, you can also `wget https://cdn-datasets.huggingface.co/summarization/cnn_dm_v2.tgz`, although identical scores are not guaranteed.
184
- - `huggingface/transformers` has a simpler interface that supports [single-gpu](https://github.com/huggingface/transformers/blob/master/examples/legacy/seq2seq/run_eval.py) and [multi-gpu](https://github.com/huggingface/transformers/blob/master/examples/legacy/seq2seq/run_distributed_eval.py) beam search.
185
- In `huggingface/transformers`, the BART models' paths are `facebook/bart-large-cnn` and `facebook/bart-large-xsum`.
186
-
187
- In `fairseq`, summaries can be generated using:
188
-
189
- ```bash
190
- cp data-bin/cnn_dm/dict.source.txt checkpoints/
191
- python examples/bart/summarize.py \
192
- --model-dir pytorch/fairseq \
193
- --model-file bart.large.cnn \
194
- --src cnn_dm/test.source \
195
- --out cnn_dm/test.hypo
196
- ```
197
-
198
- To calculate ROUGE, install `files2rouge` from [here](https://github.com/pltrdy/files2rouge).
199
-
200
- ```bash
201
- export CLASSPATH=/path/to/stanford-corenlp-full-2016-10-31/stanford-corenlp-3.7.0.jar
202
-
203
- # Tokenize hypothesis and target files.
204
- cat test.hypo | java edu.stanford.nlp.process.PTBTokenizer -ioFileList -preserveLines > test.hypo.tokenized
205
- cat test.target | java edu.stanford.nlp.process.PTBTokenizer -ioFileList -preserveLines > test.hypo.target
206
- files2rouge test.hypo.tokenized test.hypo.target
207
- # Expected output: (ROUGE-2 Average_F: 0.21238)
208
- ```
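-
- If installing Stanford CoreNLP and `files2rouge` is inconvenient, an approximate sanity check can be run in Python with the `rouge-score` package (`pip install rouge-score`); note that the numbers will not exactly match the official tokenization-based scores above:
- ```python
- from rouge_score import rouge_scorer
-
- scorer = rouge_scorer.RougeScorer(['rouge2'], use_stemmer=True)
- scores = []
- with open('test.hypo') as hypos, open('test.target') as refs:
-     for hypo, ref in zip(hypos, refs):
-         scores.append(scorer.score(ref.strip(), hypo.strip())['rouge2'].fmeasure)
- print('Approximate ROUGE-2 F1:', sum(scores) / len(scores))
- ```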
209
-
210
-
211
- ## Finetuning
212
-
213
- - [Finetuning on GLUE](README.glue.md)
214
- - [Finetuning on CNN-DM](README.summarization.md)
215
-
216
- ## Citation
217
-
218
- ```bibtex
219
- @article{lewis2019bart,
220
- title = {BART: Denoising Sequence-to-Sequence Pre-training for Natural
221
- Language Generation, Translation, and Comprehension},
222
- author = {Mike Lewis and Yinhan Liu and Naman Goyal and Marjan Ghazvininejad and
223
- Abdelrahman Mohamed and Omer Levy and Veselin Stoyanov
224
- and Luke Zettlemoyer },
225
- journal={arXiv preprint arXiv:1910.13461},
226
- year = {2019},
227
- }
228
- ```
 
examples/bart/README.summarization.md DELETED
@@ -1,102 +0,0 @@
1
- # Fine-tuning BART on CNN-Dailymail summarization task
2
-
3
- ### 1) Download the CNN and Daily Mail data and preprocess it into data files with non-tokenized cased samples.
4
-
5
- Follow the instructions [here](https://github.com/abisee/cnn-dailymail) to download the original CNN and Daily Mail datasets. To preprocess the data, refer to the pointers in [this issue](https://github.com/pytorch/fairseq/issues/1391) or check out the code [here](https://github.com/artmatsak/cnn-dailymail).
6
-
7
- Follow the instructions [here](https://github.com/EdinburghNLP/XSum) to download the original Extreme Summarization (XSum) dataset, or check out the code [here](https://github.com/EdinburghNLP/XSum/tree/master/XSum-Dataset). Keep the dataset raw: no tokenization and no BPE should be applied at this stage.
8
-
9
- ### 2) BPE preprocess:
10
-
11
- ```bash
12
- wget -N 'https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/encoder.json'
13
- wget -N 'https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/vocab.bpe'
14
- wget -N 'https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/dict.txt'
15
-
16
- TASK=cnn_dm
17
- for SPLIT in train val
18
- do
19
- for LANG in source target
20
- do
21
- python -m examples.roberta.multiprocessing_bpe_encoder \
22
- --encoder-json encoder.json \
23
- --vocab-bpe vocab.bpe \
24
- --inputs "$TASK/$SPLIT.$LANG" \
25
- --outputs "$TASK/$SPLIT.bpe.$LANG" \
26
- --workers 60 \
27
- --keep-empty;
28
- done
29
- done
30
- ```
31
-
32
- ### 3) Binarize dataset:
33
- ```bash
34
- fairseq-preprocess \
35
- --source-lang "source" \
36
- --target-lang "target" \
37
- --trainpref "${TASK}/train.bpe" \
38
- --validpref "${TASK}/val.bpe" \
39
- --destdir "${TASK}-bin/" \
40
- --workers 60 \
41
- --srcdict dict.txt \
42
- --tgtdict dict.txt;
43
- ```
44
-
45
- ### 4) Fine-tuning on CNN-DM summarization task:
46
- Example fine-tuning command for CNN-DM:
47
- ```bash
48
- TOTAL_NUM_UPDATES=20000
49
- WARMUP_UPDATES=500
50
- LR=3e-05
51
- MAX_TOKENS=2048
52
- UPDATE_FREQ=4
53
- BART_PATH=/path/to/bart/model.pt
54
-
55
- CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 fairseq-train cnn_dm-bin \
56
- --restore-file $BART_PATH \
57
- --max-tokens $MAX_TOKENS \
58
- --task translation \
59
- --source-lang source --target-lang target \
60
- --truncate-source \
61
- --layernorm-embedding \
62
- --share-all-embeddings \
63
- --share-decoder-input-output-embed \
64
- --reset-optimizer --reset-dataloader --reset-meters \
65
- --required-batch-size-multiple 1 \
66
- --arch bart_large \
67
- --criterion label_smoothed_cross_entropy \
68
- --label-smoothing 0.1 \
69
- --dropout 0.1 --attention-dropout 0.1 \
70
- --weight-decay 0.01 --optimizer adam --adam-betas "(0.9, 0.999)" --adam-eps 1e-08 \
71
- --clip-norm 0.1 \
72
- --lr-scheduler polynomial_decay --lr $LR --total-num-update $TOTAL_NUM_UPDATES --warmup-updates $WARMUP_UPDATES \
73
- --fp16 --update-freq $UPDATE_FREQ \
74
- --skip-invalid-size-inputs-valid-test \
75
- --find-unused-parameters;
76
- ```
77
- The above is expected to run on `1` node with `8` 32GB V100 GPUs.
78
- Expected training time is about `5 hours`. Training time can be reduced with distributed training on `4` nodes and `--update-freq 1`.
79
-
80
- Use `TOTAL_NUM_UPDATES=15000` and `UPDATE_FREQ=2` for the XSum task.
81
-
82
- ### Inference on the CNN-DM test data using the above trained checkpoint
83
- After training the model as described in the previous step, you can perform inference with the checkpoints in the `checkpoints/` directory using `summarize.py`, for example:
84
-
85
- ```bash
86
- cp data-bin/cnn_dm/dict.source.txt checkpoints/
87
- python examples/bart/summarize.py \
88
- --model-dir checkpoints \
89
- --model-file checkpoint_best.pt \
90
- --src cnn_dm/test.source \
91
- --out cnn_dm/test.hypo
92
- ```
93
- For XSum, which uses `beam=6, lenpen=1.0, max_len_b=60, min_len=10`:
94
- ```bash
95
- cp data-bin/cnn_dm/dict.source.txt checkpoints/
96
- python examples/bart/summarize.py \
97
- --model-dir checkpoints \
98
- --model-file checkpoint_best.pt \
99
- --src cnn_dm/test.source \
100
- --out cnn_dm/test.hypo \
101
- --xsum-kwargs
102
- ```
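-
- For quick spot checks without writing a hypothesis file, the fine-tuned checkpoint can also be loaded directly in Python. The sketch below mirrors what `summarize.py` does; the generation kwargs are the CNN-DM defaults from that script, and `cnn_dm-bin` is the binarized directory produced in step 3:
- ```python
- import torch
- from fairseq.models.bart import BARTModel
-
- bart = BARTModel.from_pretrained(
-     'checkpoints',                        # directory containing checkpoint_best.pt
-     checkpoint_file='checkpoint_best.pt',
-     data_name_or_path='cnn_dm-bin',       # holds the source/target dictionaries
- )
- bart.eval()
- if torch.cuda.is_available():
-     bart = bart.cuda().half()
-
- with open('cnn_dm/test.source') as f:
-     sources = [next(f).strip() for _ in range(4)]  # just a few test articles
- for hypo in bart.sample(sources, beam=4, lenpen=2.0, max_len_b=140,
-                         min_len=55, no_repeat_ngram_size=3):
-     print(hypo)
- ```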
 
examples/bart/summarize.py DELETED
@@ -1,100 +0,0 @@
1
- # Copyright (c) Facebook, Inc. and its affiliates.
2
- #
3
- # This source code is licensed under the MIT license found in the
4
- # LICENSE file in the root directory of this source tree.
5
-
6
- import torch
7
- from fairseq.models.bart import BARTModel
8
- import argparse
9
-
10
- XSUM_KWARGS = dict(beam=6, lenpen=1.0, max_len_b=60, min_len=10, no_repeat_ngram_size=3)
11
- CNN_KWARGS = dict(beam=4, lenpen=2.0, max_len_b=140, min_len=55, no_repeat_ngram_size=3)
12
-
13
-
14
- @torch.no_grad()
15
- def generate(bart, infile, outfile="bart_hypo.txt", bsz=32, n_obs=None, **eval_kwargs):
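- # Summarize `infile` line by line: batch up to `bsz` source lines at a time,
- # call bart.sample() on each batch, and write one hypothesis per input line to `outfile`.
- # `n_obs`, if given, caps the number of input lines that are summarized.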
16
- count = 1
17
-
18
- # if n_obs is not None: bsz = min(bsz, n_obs)
19
-
20
- with open(infile) as source, open(outfile, "w") as fout:
21
- sline = source.readline().strip()
22
- slines = [sline]
23
- for sline in source:
24
- if n_obs is not None and count > n_obs:
25
- break
26
- if count % bsz == 0:
27
- hypotheses_batch = bart.sample(slines, **eval_kwargs)
28
- for hypothesis in hypotheses_batch:
29
- fout.write(hypothesis + "\n")
30
- fout.flush()
31
- slines = []
32
-
33
- slines.append(sline.strip())
34
- count += 1
35
-
36
- if slines != []:
37
- hypotheses_batch = bart.sample(slines, **eval_kwargs)
38
- for hypothesis in hypotheses_batch:
39
- fout.write(hypothesis + "\n")
40
- fout.flush()
41
-
42
-
43
- def main():
44
- """
45
- Usage::
46
-
47
- python examples/bart/summarize.py \
48
- --model-dir $HOME/bart.large.cnn \
49
- --model-file model.pt \
50
- --src $HOME/data-bin/cnn_dm/test.source
51
- """
52
- parser = argparse.ArgumentParser()
53
- parser.add_argument(
54
- "--model-dir",
55
- required=True,
56
- type=str,
57
- default="bart.large.cnn/",
58
- help="path containing model file and src_dict.txt",
59
- )
60
- parser.add_argument(
61
- "--model-file",
62
- default="checkpoint_best.pt",
63
- help="where in model_dir are weights saved",
64
- )
65
- parser.add_argument(
66
- "--src", default="test.source", help="text to summarize", type=str
67
- )
68
- parser.add_argument(
69
- "--out", default="test.hypo", help="where to save summaries", type=str
70
- )
71
- parser.add_argument("--bsz", default=32, help="where to save summaries", type=int)
72
- parser.add_argument(
73
- "--n", default=None, help="how many examples to summarize", type=int
74
- )
75
- parser.add_argument(
76
- "--xsum-kwargs",
77
- action="store_true",
78
- default=False,
79
- help="if true use XSUM_KWARGS else CNN_KWARGS",
80
- )
81
- args = parser.parse_args()
82
- eval_kwargs = XSUM_KWARGS if args.xsum_kwargs else CNN_KWARGS
83
- if args.model_dir == "pytorch/fairseq":
84
- bart = torch.hub.load("pytorch/fairseq", args.model_file)
85
- else:
86
- bart = BARTModel.from_pretrained(
87
- args.model_dir,
88
- checkpoint_file=args.model_file,
89
- data_name_or_path=args.model_dir,
90
- )
91
- bart = bart.eval()
92
- if torch.cuda.is_available():
93
- bart = bart.cuda().half()
94
- generate(
95
- bart, args.src, bsz=args.bsz, n_obs=args.n, outfile=args.out, **eval_kwargs
96
- )
97
-
98
-
99
- if __name__ == "__main__":
100
- main()
 
examples/byte_level_bpe/README.md DELETED
@@ -1,88 +0,0 @@
1
- # Neural Machine Translation with Byte-Level Subwords
2
-
3
- https://arxiv.org/abs/1909.03341
4
-
5
- We provide an implementation of byte-level byte-pair encoding (BBPE), taking IWSLT 2017 Fr-En translation as
6
- an example.
7
-
8
- ## Data
9
- Get data and generate fairseq binary dataset:
10
- ```bash
11
- bash ./get_data.sh
12
- ```
13
-
14
- ## Model Training
15
- Train Transformer model with Bi-GRU embedding contextualization (implemented in `gru_transformer.py`):
16
- ```bash
17
- # VOCAB=bytes
18
- # VOCAB=chars
19
- VOCAB=bbpe2048
20
- # VOCAB=bpe2048
21
- # VOCAB=bbpe4096
22
- # VOCAB=bpe4096
23
- # VOCAB=bpe16384
24
- ```
25
- ```bash
26
- fairseq-train "data/bin_${VOCAB}" --task translation --user-dir examples/byte_level_bpe/gru_transformer \
27
- --arch gru_transformer --encoder-layers 2 --decoder-layers 2 --dropout 0.3 --share-all-embeddings \
28
- --optimizer adam --adam-betas '(0.9, 0.98)' \
29
- --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
30
- --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
31
- --log-format 'simple' --log-interval 100 --save-dir "checkpoints/${VOCAB}" \
32
- --batch-size 100 --max-update 100000 --update-freq 2
33
- ```
34
-
35
- ## Generation
36
- `fairseq-generate` requires the bytes (BBPE) decoder to convert the byte-level representation back to characters:
37
- ```bash
38
- # BPE=--bpe bytes
39
- # BPE=--bpe characters
40
- BPE=--bpe byte_bpe --sentencepiece-model-path data/spm_bbpe2048.model
41
- # BPE=--bpe sentencepiece --sentencepiece-model data/spm_bpe2048.model
42
- # BPE=--bpe byte_bpe --sentencepiece-model-path data/spm_bbpe4096.model
43
- # BPE=--bpe sentencepiece --sentencepiece-model data/spm_bpe4096.model
44
- # BPE=--bpe sentencepiece --sentencepiece-model data/spm_bpe16384.model
45
- ```
46
-
47
- ```bash
48
- fairseq-generate "data/bin_${VOCAB}" --task translation --user-dir examples/byte_level_bpe/gru_transformer \
49
- --source-lang fr --gen-subset test --sacrebleu --path "checkpoints/${VOCAB}/checkpoint_last.pt" \
50
- --tokenizer moses --moses-target-lang en ${BPE}
51
- ```
52
- When using `fairseq-interactive`, the bytes (BBPE) encoder/decoder is required to tokenize the input data and detokenize model predictions:
53
- ```bash
54
- fairseq-interactive "data/bin_${VOCAB}" --task translation --user-dir examples/byte_level_bpe/gru_transformer \
55
- --path "checkpoints/${VOCAB}/checkpoint_last.pt" --input data/test.fr --tokenizer moses --moses-source-lang fr \
56
- --moses-target-lang en ${BPE} --buffer-size 1000 --max-tokens 10000
57
- ```
58
-
59
- ## Results
60
- | Vocabulary | Model | BLEU |
61
- |:-------------:|:-------------:|:-------------:|
62
- | Joint BPE 16k ([Kudo, 2018](https://arxiv.org/abs/1804.10959)) | 512d LSTM 2+2 | 33.81 |
63
- | Joint BPE 16k | Transformer base 2+2 (w/ GRU) | 36.64 (36.72) |
64
- | Joint BPE 4k | Transformer base 2+2 (w/ GRU) | 35.49 (36.10) |
65
- | Joint BBPE 4k | Transformer base 2+2 (w/ GRU) | 35.61 (35.82) |
66
- | Joint BPE 2k | Transformer base 2+2 (w/ GRU) | 34.87 (36.13) |
67
- | Joint BBPE 2k | Transformer base 2+2 (w/ GRU) | 34.98 (35.43) |
68
- | Characters | Transformer base 2+2 (w/ GRU) | 31.78 (33.30) |
69
- | Bytes | Transformer base 2+2 (w/ GRU) | 31.57 (33.62) |
70
-
71
-
72
- ## Citation
73
- ```
74
- @misc{wang2019neural,
75
- title={Neural Machine Translation with Byte-Level Subwords},
76
- author={Changhan Wang and Kyunghyun Cho and Jiatao Gu},
77
- year={2019},
78
- eprint={1909.03341},
79
- archivePrefix={arXiv},
80
- primaryClass={cs.CL}
81
- }
82
- ```
83
-
84
-
85
- ## Contact
86
- Changhan Wang ([changhan@fb.com](mailto:changhan@fb.com)),
87
- Kyunghyun Cho ([kyunghyuncho@fb.com](mailto:kyunghyuncho@fb.com)),
88
- Jiatao Gu ([jgu@fb.com](mailto:jgu@fb.com))
 
examples/byte_level_bpe/get_bitext.py DELETED
@@ -1,254 +0,0 @@
1
- # Copyright (c) Facebook, Inc. and its affiliates.
2
- #
3
- # This source code is licensed under the MIT license found in the
4
- # LICENSE file in the root directory of this source tree.
5
-
6
-
7
- import argparse
8
- import os
9
- import os.path as op
10
- from collections import namedtuple
11
- from multiprocessing import cpu_count
12
- from typing import List, Optional
13
-
14
- import sentencepiece as sp
15
- from fairseq.data.encoders.byte_bpe import ByteBPE
16
- from fairseq.data.encoders.byte_utils import byte_encode
17
- from fairseq.data.encoders.bytes import Bytes
18
- from fairseq.data.encoders.characters import Characters
19
- from fairseq.data.encoders.moses_tokenizer import MosesTokenizer
20
- from fairseq.data.encoders.sentencepiece_bpe import SentencepieceBPE
21
-
22
-
23
- SPLITS = ["train", "valid", "test"]
24
-
25
-
26
- def _convert_xml(in_path: str, out_path: str):
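- # Extract plain text from IWSLT XML dev/test files: keep only <seg ...> lines
- # and strip the surrounding markup.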
27
- with open(in_path) as f, open(out_path, "w") as f_o:
28
- for s in f:
29
- ss = s.strip()
30
- if not ss.startswith("<seg"):
31
- continue
32
- ss = ss.replace("</seg>", "").split('">')
33
- assert len(ss) == 2
34
- f_o.write(ss[1].strip() + "\n")
35
-
36
-
37
- def _convert_train(in_path: str, out_path: str):
38
- with open(in_path) as f, open(out_path, "w") as f_o:
39
- for s in f:
40
- ss = s.strip()
41
- if ss.startswith("<"):
42
- continue
43
- f_o.write(ss.strip() + "\n")
44
-
45
-
46
- def _get_bytes(in_path: str, out_path: str):
47
- with open(in_path) as f, open(out_path, "w") as f_o:
48
- for s in f:
49
- f_o.write(Bytes.encode(s.strip()) + "\n")
50
-
51
-
52
- def _get_chars(in_path: str, out_path: str):
53
- with open(in_path) as f, open(out_path, "w") as f_o:
54
- for s in f:
55
- f_o.write(Characters.encode(s.strip()) + "\n")
56
-
57
-
58
- def pretokenize(in_path: str, out_path: str, src: str, tgt: str):
59
- Args = namedtuple(
60
- "Args",
61
- [
62
- "moses_source_lang",
63
- "moses_target_lang",
64
- "moses_no_dash_splits",
65
- "moses_no_escape",
66
- ],
67
- )
68
- args = Args(
69
- moses_source_lang=src,
70
- moses_target_lang=tgt,
71
- moses_no_dash_splits=False,
72
- moses_no_escape=False,
73
- )
74
- pretokenizer = MosesTokenizer(args)
75
- with open(in_path) as f, open(out_path, "w") as f_o:
76
- for s in f:
77
- f_o.write(pretokenizer.encode(s.strip()) + "\n")
78
-
79
-
80
- def _convert_to_bchar(in_path_prefix: str, src: str, tgt: str, out_path: str):
81
- with open(out_path, "w") as f_o:
82
- for lang in [src, tgt]:
83
- with open(f"{in_path_prefix}.{lang}") as f:
84
- for s in f:
85
- f_o.write(byte_encode(s.strip()) + "\n")
86
-
87
-
88
- def _get_bpe(in_path: str, model_prefix: str, vocab_size: int):
89
- arguments = [
90
- f"--input={in_path}",
91
- f"--model_prefix={model_prefix}",
92
- f"--model_type=bpe",
93
- f"--vocab_size={vocab_size}",
94
- "--character_coverage=1.0",
95
- "--normalization_rule_name=identity",
96
- f"--num_threads={cpu_count()}",
97
- ]
98
- sp.SentencePieceTrainer.Train(" ".join(arguments))
99
-
100
-
101
- def _apply_bbpe(model_path: str, in_path: str, out_path: str):
102
- Args = namedtuple("Args", ["sentencepiece_model_path"])
103
- args = Args(sentencepiece_model_path=model_path)
104
- tokenizer = ByteBPE(args)
105
- with open(in_path) as f, open(out_path, "w") as f_o:
106
- for s in f:
107
- f_o.write(tokenizer.encode(s.strip()) + "\n")
108
-
109
-
110
- def _apply_bpe(model_path: str, in_path: str, out_path: str):
111
- Args = namedtuple("Args", ["sentencepiece_model"])
112
- args = Args(sentencepiece_model=model_path)
113
- tokenizer = SentencepieceBPE(args)
114
- with open(in_path) as f, open(out_path, "w") as f_o:
115
- for s in f:
116
- f_o.write(tokenizer.encode(s.strip()) + "\n")
117
-
118
-
119
- def _concat_files(in_paths: List[str], out_path: str):
120
- with open(out_path, "w") as f_o:
121
- for p in in_paths:
122
- with open(p) as f:
123
- for r in f:
124
- f_o.write(r)
125
-
126
-
127
- def preprocess_iwslt17(
128
- root: str,
129
- src: str,
130
- tgt: str,
131
- bpe_size: Optional[int],
132
- need_chars: bool,
133
- bbpe_size: Optional[int],
134
- need_bytes: bool,
135
- ):
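- # Pipeline: extract the raw bitext from the IWSLT17 archives, pre-tokenize it with Moses,
- # then optionally produce BPE, byte, character and byte-level BPE (BBPE) versions of every
- # split, depending on which vocabulary options were requested.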
136
- # extract bitext
137
- in_root = op.join(root, f"{src}-{tgt}")
138
- for lang in [src, tgt]:
139
- _convert_train(
140
- op.join(in_root, f"train.tags.{src}-{tgt}.{lang}"),
141
- op.join(root, f"train.{lang}"),
142
- )
143
- _convert_xml(
144
- op.join(in_root, f"IWSLT17.TED.dev2010.{src}-{tgt}.{lang}.xml"),
145
- op.join(root, f"valid.{lang}"),
146
- )
147
- _convert_xml(
148
- op.join(in_root, f"IWSLT17.TED.tst2015.{src}-{tgt}.{lang}.xml"),
149
- op.join(root, f"test.{lang}"),
150
- )
151
- # pre-tokenize
152
- for lang in [src, tgt]:
153
- for split in SPLITS:
154
- pretokenize(
155
- op.join(root, f"{split}.{lang}"),
156
- op.join(root, f"{split}.moses.{lang}"),
157
- src,
158
- tgt,
159
- )
160
- # tokenize with BPE vocabulary
161
- if bpe_size is not None:
162
- # learn vocabulary
163
- concated_train_path = op.join(root, "train.all")
164
- _concat_files(
165
- [op.join(root, "train.moses.fr"), op.join(root, "train.moses.en")],
166
- concated_train_path,
167
- )
168
- bpe_model_prefix = op.join(root, f"spm_bpe{bpe_size}")
169
- _get_bpe(concated_train_path, bpe_model_prefix, bpe_size)
170
- os.remove(concated_train_path)
171
- # apply
172
- for lang in [src, tgt]:
173
- for split in SPLITS:
174
- _apply_bpe(
175
- bpe_model_prefix + ".model",
176
- op.join(root, f"{split}.moses.{lang}"),
177
- op.join(root, f"{split}.moses.bpe{bpe_size}.{lang}"),
178
- )
179
- # tokenize with bytes vocabulary
180
- if need_bytes:
181
- for lang in [src, tgt]:
182
- for split in SPLITS:
183
- _get_bytes(
184
- op.join(root, f"{split}.moses.{lang}"),
185
- op.join(root, f"{split}.moses.bytes.{lang}"),
186
- )
187
- # tokenize with characters vocabulary
188
- if need_chars:
189
- for lang in [src, tgt]:
190
- for split in SPLITS:
191
- _get_chars(
192
- op.join(root, f"{split}.moses.{lang}"),
193
- op.join(root, f"{split}.moses.chars.{lang}"),
194
- )
195
- # tokenize with byte-level BPE vocabulary
196
- if bbpe_size is not None:
197
- # learn vocabulary
198
- bchar_path = op.join(root, "train.bchar")
199
- _convert_to_bchar(op.join(root, "train.moses"), src, tgt, bchar_path)
200
- bbpe_model_prefix = op.join(root, f"spm_bbpe{bbpe_size}")
201
- _get_bpe(bchar_path, bbpe_model_prefix, bbpe_size)
202
- os.remove(bchar_path)
203
- # apply
204
- for lang in [src, tgt]:
205
- for split in SPLITS:
206
- _apply_bbpe(
207
- bbpe_model_prefix + ".model",
208
- op.join(root, f"{split}.moses.{lang}"),
209
- op.join(root, f"{split}.moses.bbpe{bbpe_size}.{lang}"),
210
- )
211
-
212
-
213
- def main():
214
- parser = argparse.ArgumentParser()
215
- parser.add_argument("--root", type=str, default="data")
216
- parser.add_argument(
217
- "--bpe-vocab",
218
- default=None,
219
- type=int,
220
- help="Generate tokenized bitext with BPE of size K."
221
- "Default to None (disabled).",
222
- )
223
- parser.add_argument(
224
- "--bbpe-vocab",
225
- default=None,
226
- type=int,
227
- help="Generate tokenized bitext with BBPE of size K."
228
- "Default to None (disabled).",
229
- )
230
- parser.add_argument(
231
- "--byte-vocab",
232
- action="store_true",
233
- help="Generate tokenized bitext with bytes vocabulary",
234
- )
235
- parser.add_argument(
236
- "--char-vocab",
237
- action="store_true",
238
- help="Generate tokenized bitext with chars vocabulary",
239
- )
240
- args = parser.parse_args()
241
-
242
- preprocess_iwslt17(
243
- args.root,
244
- "fr",
245
- "en",
246
- args.bpe_vocab,
247
- args.char_vocab,
248
- args.bbpe_vocab,
249
- args.byte_vocab,
250
- )
251
-
252
-
253
- if __name__ == "__main__":
254
- main()
 
examples/byte_level_bpe/get_data.sh DELETED
@@ -1,47 +0,0 @@
1
- #!/bin/bash
2
-
3
- # Copyright (c) Facebook, Inc. and its affiliates.
4
- #
5
- # This source code is licensed under the MIT license found in the
6
- # LICENSE file in the root directory of this source tree.
7
-
8
- PY_BIN_ROOT=
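- # PY_BIN_ROOT: optional prefix pointing at a Python environment's bin/ directory
- # (keep the trailing slash), e.g. PY_BIN_ROOT=~/venvs/fairseq/bin/ ;
- # leave empty to use whatever pip/python/fairseq-preprocess is on PATH.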
9
-
10
- # PyPI dependency
11
- ${PY_BIN_ROOT}pip install sentencepiece sacremoses
12
-
13
- # Get data
14
- if [ ! -d "data" ]; then
15
- mkdir data
16
- fi
17
-
18
- if [ ! -f "data/fr-en.tgz" ]; then
19
- wget https://wit3.fbk.eu/archive/2017-01-trnted/texts/fr/en/fr-en.tgz -P data
20
- tar xvf data/fr-en.tgz -C data
21
- fi
22
- ${PY_BIN_ROOT}python get_bitext.py --bpe-vocab 16384 --byte-vocab --char-vocab
23
- for VOCAB_SIZE in 2048 4096; do
24
- ${PY_BIN_ROOT}python get_bitext.py --bpe-vocab ${VOCAB_SIZE} --bbpe-vocab ${VOCAB_SIZE}
25
- done
26
- rm -r data/fr-en data/fr-en.tgz
27
-
28
- # Generate binary dataset
29
- ${PY_BIN_ROOT}fairseq-preprocess --source-lang fr --target-lang en --destdir data/bin_bpe16384 --joined-dictionary \
30
- --workers "$(nproc)" --trainpref data/train.moses.bpe16384 --validpref data/valid.moses.bpe16384 \
31
- --testpref data/test.moses.bpe16384
32
-
33
- ${PY_BIN_ROOT}fairseq-preprocess --source-lang fr --target-lang en --destdir data/bin_bytes --joined-dictionary \
34
- --workers "$(nproc)" --trainpref data/train.moses.bytes --validpref data/valid.moses.bytes \
35
- --testpref data/test.moses.bytes
36
-
37
- ${PY_BIN_ROOT}fairseq-preprocess --source-lang fr --target-lang en --destdir data/bin_chars --joined-dictionary \
38
- --workers "$(nproc)" --trainpref data/train.moses.chars --validpref data/valid.moses.chars \
39
- --testpref data/test.moses.chars
40
-
41
- for VOCAB_SIZE in 2048 4096; do
42
- for TYPE in bbpe bpe; do
43
- ${PY_BIN_ROOT}/fairseq-preprocess --source-lang fr --target-lang en --destdir "data/bin_${TYPE}${VOCAB_SIZE}" \
44
- --joined-dictionary --workers "$(nproc)" --trainpref "data/train.moses.${TYPE}${VOCAB_SIZE}" \
45
- --validpref "data/valid.moses.${TYPE}${VOCAB_SIZE}" --testpref "data/test.moses.${TYPE}${VOCAB_SIZE}"
46
- done
47
- done