typo
docs/index.md +1 -1
docs/index.md CHANGED
@@ -2,7 +2,7 @@
 
 A new fascinating dataset just dropped on 🤗. [French public domain newspapers](https://huggingface.co/datasets/PleIAs/French-PD-Newspapers) references about 3 million newspapers and periodicals with their full text OCR'ed and some meta-data.
 
-The data is stored in 320 chunks weighting about 700MB each, each
+The data is stored in 320 chunks weighing about 700MB each, each containing about 7,500 texts.
 
 The data loader for this Observable project uses DuckDB to read these 320 parquet files (altogether about 200GB) and combines a minimal subset of their metadata — title and year of publication, most importantly without the text contents — into a single parquet file. It takes only about 1 minute to run in a Hugging Face Space.
 
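
As a rough illustration of the DuckDB step described above, here is a minimal sketch in Python. It is not the project's actual Observable data loader; the glob path and the `title`/`date` column names are assumptions about the dataset's schema and should be adjusted to the real files.

```python
# Minimal sketch (assumed paths and column names, not the project's loader):
# read the parquet chunks, keep only lightweight metadata columns (no OCR text),
# and write everything out as a single small parquet file.
import duckdb

con = duckdb.connect()

con.sql("""
    COPY (
        SELECT title, date            -- assumed metadata columns; the full text is deliberately excluded
        FROM read_parquet('French-PD-Newspapers/*.parquet')
    ) TO 'metadata.parquet' (FORMAT parquet)
""")
```

Because only the metadata columns are selected, the output stays small even though the source chunks total roughly 200GB, which is what makes the combined file quick to produce and cheap to serve.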