fil committed on
Commit
e94b7da
1 Parent(s): d01dd84
Files changed (1)
  1. docs/index.md +1 -1
docs/index.md CHANGED
@@ -2,7 +2,7 @@
 
 A new fascinating dataset just dropped on 🤗. [French public domain newspapers](https://huggingface.co/datasets/PleIAs/French-PD-Newspapers) references about 3 million newspapers and periodicals with their full text OCR’ed and some meta-data.
 
-The data is stored in 320 chunks weighting about 700MB each, each continaing about 7,500 texts.
+The data is stored in 320 chunks weighting about 700MB each, each containing about 7,500 texts.
 
 The data loader for this Observable project uses DuckDB to read these 320 parquet files (altogether about 200GB) and combines a minimal subset of their metadata — title and year of publication, most importantly without the text contents —, into a single parquet file. It takes only about 1 minute to run in a hugging-face Space.
 
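
For readers of this diff, here is a rough sketch of the kind of DuckDB statement the data loader described in `docs/index.md` might issue. It is not the project's actual loader: the chunk file names, the column names `title` and `year`, and the output name `metadata.parquet` are all assumptions made for illustration. The sketch only builds the SQL text (using DuckDB's `read_parquet` and `COPY ... TO` syntax) and checks the rough size figures quoted in the text, so it runs without DuckDB installed.

```python
def build_metadata_query(chunk_paths, columns, output_path):
    """Build a DuckDB SQL statement that reads only the given columns from
    many parquet files and writes them to a single parquet file."""
    file_list = ", ".join(f"'{path}'" for path in chunk_paths)
    column_list = ", ".join(columns)
    return (
        f"COPY (SELECT {column_list} FROM read_parquet([{file_list}])) "
        f"TO '{output_path}' (FORMAT PARQUET);"
    )


# Figures quoted in the document.
NUM_CHUNKS = 320
MB_PER_CHUNK = 700
TEXTS_PER_CHUNK = 7_500

# Back-of-the-envelope totals: roughly 224 GB and 2.4 million texts, in line
# with the "about 200GB" and "about 3 million" orders of magnitude in the text.
approx_total_gb = NUM_CHUNKS * MB_PER_CHUNK / 1000
approx_total_texts = NUM_CHUNKS * TEXTS_PER_CHUNK

# Hypothetical file names; the real dataset uses its own naming scheme.
chunk_paths = [f"chunk_{i:03d}.parquet" for i in range(NUM_CHUNKS)]

# Hypothetical column names; selecting only these avoids reading the text column.
query = build_metadata_query(chunk_paths, ["title", "year"], "metadata.parquet")

print(f"{approx_total_gb:.0f} GB, {approx_total_texts:,} texts")
print(query[:80] + " ...")
```

Because parquet is a columnar format, selecting only the metadata columns lets a reader skip the bulky text column, which would be consistent with the short run time mentioned above.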