Beam Datasets
=============
Some datasets are too large to be processed on a single machine. Instead, you can process them with `Apache Beam <https://beam.apache.org/>`_, a library for parallel data processing. The processing pipeline is executed on a distributed processing backend such as `Apache Flink <https://flink.apache.org/>`_, `Apache Spark <https://spark.apache.org/>`_, or `Google Cloud Dataflow <https://cloud.google.com/dataflow>`_.
We have already created Beam pipelines for some of the larger datasets like `wikipedia <https://huggingface.co/datasets/wikipedia>`_ and `wiki40b <https://huggingface.co/datasets/wiki40b>`_. You can load these normally with :func:`datasets.load_dataset`.
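For example, a minimal sketch, assuming the ``wikipedia`` dataset with the ``20200501.en`` configuration used in the steps below:

.. code-block::

    from datasets import load_dataset

    # Loads the already-processed dataset built with the Beam pipeline
    wiki = load_dataset("wikipedia", "20200501.en", split="train")

But if you want to run your own Beam pipeline with Dataflow, here is how: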
1. Specify the dataset and configuration you want to process:
.. code-block::

    DATASET_NAME=your_dataset_name  # ex: wikipedia
    CONFIG_NAME=your_config_name    # ex: 20200501.en
2. Input your Google Cloud Platform information:
.. code-block::

    PROJECT=your_project
    BUCKET=your_bucket
    REGION=your_region
3. Specify your Python requirements:
.. code-block::

    echo "datasets" > /tmp/beam_requirements.txt
    echo "apache_beam" >> /tmp/beam_requirements.txt
4. Run the pipeline:
.. code-block::

    datasets-cli run_beam datasets/$DATASET_NAME \
    --name $CONFIG_NAME \
    --save_infos \
    --cache_dir gs://$BUCKET/cache/datasets \
    --beam_pipeline_options=\
    "runner=DataflowRunner,project=$PROJECT,job_name=$DATASET_NAME-gen,"\
    "staging_location=gs://$BUCKET/binaries,temp_location=gs://$BUCKET/temp,"\
    "region=$REGION,requirements_file=/tmp/beam_requirements.txt"
.. tip::

    When you run your pipeline, you can adjust the parameters to change the runner (Flink or Spark), output location (S3 bucket or HDFS), and the number of workers.
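The runner and its options can also be set from Python rather than on the command line. The sketch below assumes your release of ``datasets`` still exposes the ``beam_runner`` and ``beam_options`` parameters of :func:`datasets.load_dataset`, and uses placeholder Flink settings:

.. code-block::

    from apache_beam.options.pipeline_options import PipelineOptions
    from datasets import load_dataset

    # Placeholder options for a Flink cluster; swap in Spark options if you use SparkRunner
    flink_options = PipelineOptions(["--flink_master=localhost:8081", "--parallelism=8"])

    wiki = load_dataset(
        "wikipedia",
        "20200501.en",
        beam_runner="FlinkRunner",   # or "DirectRunner" to process on the local machine
        beam_options=flink_options,
    )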