Some datasets are too large to be processed on a single machine. Instead, you can process them with Apache Beam, a library for parallel data processing. The processing pipeline is executed on a distributed processing backend such as Apache Flink, Apache Spark, or Google Cloud Dataflow.
We have already created Beam pipelines for some of the larger datasets, such as wikipedia and wiki40b. You can load these normally with load_dataset(). If you want to run your own Beam pipeline with Dataflow, follow these steps:
- Specify the dataset and configuration you want to process:
DATASET_NAME=your_dataset_name  # ex: wikipedia
CONFIG_NAME=your_config_name  # ex: 20220301.en
- Input your Google Cloud Platform information:
PROJECT=your_project
BUCKET=your_bucket
REGION=your_region
- Specify your Python requirements:
echo "datasets" > /tmp/beam_requirements.txt
echo "apache_beam" >> /tmp/beam_requirements.txt
- Run the pipeline:
datasets-cli run_beam datasets/$DATASET_NAME \
--name $CONFIG_NAME \
--save_infos \
--cache_dir gs://$BUCKET/cache/datasets \
--beam_pipeline_options=\
"runner=DataflowRunner,project=$PROJECT,job_name=$DATASET_NAME-gen,"\
"staging_location=gs://$BUCKET/binaries,temp_location=gs://$BUCKET/temp,"\
"region=$REGION,requirements_file=/tmp/beam_requirements.txt"
When you run your pipeline, you can adjust the parameters to change the runner (Flink or Spark), output location (S3 bucket or HDFS), and the number of workers.
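As a rough sketch of adjusting these parameters, you can build the comma-separated option string in a shell variable and inspect it before launching a job. The variable names BEAM_OPTIONS and MAX_WORKERS are illustrative, and the max_num_workers key is an assumption: it is a standard Beam worker option, but check that your versions of datasets and Beam pass it through as expected.

```shell
# Placeholder values; replace with your own settings.
PROJECT=your_project
BUCKET=your_bucket
REGION=your_region
DATASET_NAME=your_dataset_name

# Assumption: cap the job at 8 workers via Beam's max_num_workers option.
MAX_WORKERS=8

# Build the comma-separated option string piece by piece.
BEAM_OPTIONS="runner=DataflowRunner,project=$PROJECT,job_name=$DATASET_NAME-gen"
BEAM_OPTIONS="$BEAM_OPTIONS,staging_location=gs://$BUCKET/binaries,temp_location=gs://$BUCKET/temp"
BEAM_OPTIONS="$BEAM_OPTIONS,region=$REGION,requirements_file=/tmp/beam_requirements.txt"
BEAM_OPTIONS="$BEAM_OPTIONS,max_num_workers=$MAX_WORKERS"

# Print the result so it can be checked before it is passed to the CLI.
echo "$BEAM_OPTIONS"
```

You could then pass the string to the command above as --beam_pipeline_options="$BEAM_OPTIONS".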