Beam Datasets

Some datasets are too large to be processed on a single machine. Instead, you can process them with Apache Beam, a library for parallel data processing. The processing pipeline is executed on a distributed processing backend such as Apache Flink, Apache Spark, or Google Cloud Dataflow.

Beam pipelines already exist for some of the larger datasets, such as wikipedia and wiki40b, and you can load these as usual with load_dataset(). To run your own Beam pipeline with Dataflow instead, follow these steps:

Specify the dataset and configuration you want to process:

DATASET_NAME=your_dataset_name  # ex: wikipedia
CONFIG_NAME=your_config_name    # ex: 20220301.en

Enter your Google Cloud Platform information:

PROJECT=your_project
BUCKET=your_bucket
REGION=your_region

Specify your Python requirements:

echo "datasets" > /tmp/beam_requirements.txt
echo "apache_beam" >> /tmp/beam_requirements.txt

Run the pipeline:

datasets-cli run_beam datasets/$DATASET_NAME \
--name $CONFIG_NAME \
--save_infos \
--cache_dir gs://$BUCKET/cache/datasets \
--beam_pipeline_options=\
"runner=DataflowRunner,project=$PROJECT,job_name=$DATASET_NAME-gen,"\
"staging_location=gs://$BUCKET/binaries,temp_location=gs://$BUCKET/temp,"\
"region=$REGION,requirements_file=/tmp/beam_requirements.txt"
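
The command above depends on five shell variables, and an empty one silently produces a malformed path such as gs:///cache/datasets. As a minimal sketch, a pre-flight check along these lines could catch that before a job is submitted. The function name check_beam_env is hypothetical and is not part of datasets-cli.

```shell
# Hypothetical helper (not part of datasets-cli): report every variable
# used by the run_beam command that is unset or empty.
check_beam_env() (
  status=0
  for name in DATASET_NAME CONFIG_NAME PROJECT BUCKET REGION; do
    # Indirectly read the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing variable: $name" >&2
      status=1
    fi
  done
  # The function body runs in a subshell, so this only sets its return status.
  exit "$status"
)

check_beam_env || echo "Set the missing variables before running datasets-cli run_beam."
```

The function returns a non-zero status when anything is missing, so it can also guard the command directly, for example check_beam_env && datasets-cli run_beam ...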

When you run your pipeline, you can adjust these parameters: the runner in --beam_pipeline_options (for example, Flink or Spark instead of Dataflow), the output location in --cache_dir (for example, an S3 bucket or HDFS), and the number of workers.
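
As a rough illustration of changing the runner and the worker count, the sketch below assembles the comma-separated string passed to --beam_pipeline_options. The helper build_beam_options is hypothetical, the Flink address localhost:8081 is a placeholder, and the worker counts are arbitrary. The option names runner, flink_master, num_workers, and max_num_workers are standard Apache Beam pipeline options, but which ones a job needs depends on your backend.

```shell
# Hypothetical helper: join its arguments with commas, which is the
# format the run_beam command above uses for --beam_pipeline_options.
build_beam_options() (
  IFS=','
  printf '%s\n' "$*"
)

# Assumption: a Flink cluster reachable at localhost:8081 (placeholder address).
FLINK_OPTS=$(build_beam_options \
  "runner=FlinkRunner" \
  "flink_master=localhost:8081")

# Dataflow with explicit worker counts (values are illustrative only).
# The staging, temp, and requirements options from the original command
# would still be needed for a real job.
DATAFLOW_OPTS=$(build_beam_options \
  "runner=DataflowRunner" \
  "project=${PROJECT:-your_project}" \
  "region=${REGION:-your_region}" \
  "num_workers=4" \
  "max_num_workers=16")

# Print the flags instead of launching a job.
echo "--beam_pipeline_options=$FLINK_OPTS"
echo "--beam_pipeline_options=$DATAFLOW_OPTS"
```

Either string can be substituted for the quoted options in the run_beam command. The output location is changed separately through --cache_dir.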