fil committed
Commit 66bfd23 · unverified · 2 parents: 13b32a7 c9c3380

uses eleventy's cache location for now

Files changed (2)
  1. docs/data/presse.parquet.sh +25 -22
  2. docs/index.md +1 -1
docs/data/presse.parquet.sh CHANGED
@@ -1,27 +1,30 @@
-# install duckdb if not already present
-export PATH=.cache:$PATH
-command -v duckdb || $(
-  mkdir -p .cache
-  curl --location --output .cache/duckdb.zip \
-    https://github.com/duckdb/duckdb/releases/download/v0.10.0/duckdb_cli-linux-amd64.zip && \
-  unzip -qq .cache/duckdb.zip && chmod +x .cache/duckdb
-)
-
-export TMPDIR="dist"
+# Use "eleventy" .cache to store our temp files
+export TMPDIR=".cache"
 mkdir -p $TMPDIR
 
-echo """
-CREATE TABLE presse AS (
-  SELECT title
-       , author
-       , LPAD((REGEXP_EXTRACT(date, '1[0-9][0-9][0-9]') || '-01-01'), 10, '0')::DATE AS year
-  FROM read_parquet(
-    [('https://huggingface.co/datasets/PleIAs/French-PD-Newspapers/resolve/main/gallica_presse_{:d}.parquet').format(n) for n in range(1, 321)])
-  ORDER BY title, author, year
-);
+if [ ! -f "$TMPDIR/presse.parquet" ]; then
+
+# install duckdb if not already present
+export PATH=.cache:$PATH
+command -v duckdb || $(
+  curl --location --output duckdb.zip \
+    https://github.com/duckdb/duckdb/releases/download/v0.10.0/duckdb_cli-linux-amd64.zip && \
+  unzip -qq duckdb.zip && chmod +x duckdb && mkdir -p .cache && mv duckdb .cache/
+)
+
+echo """
+CREATE TABLE presse AS (
+  SELECT title
+       , author
+       , LPAD((REGEXP_EXTRACT(date, '1[0-9][0-9][0-9]') || '-01-01'), 10, '0')::DATE AS year
+  FROM read_parquet(
+    [('https://huggingface.co/datasets/PleIAs/French-PD-Newspapers/resolve/main/gallica_presse_{:d}.parquet').format(n) for n in range(1, 321)])
+  ORDER BY title, author, year
+);
 
-COPY presse TO '$TMPDIR/presse.parquet' (COMPRESSION 'ZSTD', row_group_size 10000000);
-""" | duckdb
+COPY presse TO '$TMPDIR/presse.parquet' (COMPRESSION 'ZSTD', row_group_size 10000000);
+""" | duckdb
+fi
 
 # isatty
 if [ -t 1 ]; then
@@ -29,5 +32,5 @@ if [ -t 1 ]; then
   echo "duckdb -csv :memory: \"SELECT * FROM '$TMPDIR/presse.parquet'\""
 else
   cat $TMPDIR/presse.parquet
-  rm $TMPDIR/presse.parquet
+  #rm $TMPDIR/presse.parquet
 fi
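
Note: with the new guard, the full scan of the source files only happens when `.cache/presse.parquet` is absent; later builds reuse the cached file. As a minimal sketch of sanity-checking that cached output — assuming the duckdb binary the script installs into .cache/ and the `year` column its CREATE TABLE produces — one could run:

# Sketch: query the cached parquet produced by the loader above.
# Assumes .cache/duckdb exists (installed by the script) and that the
# loader has already written .cache/presse.parquet with a "year" column.
export PATH=.cache:$PATH
duckdb -csv :memory: \
  "SELECT year, COUNT(*) AS n FROM '.cache/presse.parquet' GROUP BY year ORDER BY year LIMIT 5"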
docs/index.md CHANGED
@@ -25,7 +25,7 @@
 
 <p class=signature>by <a href="https://observablehq.com/@fil">Fil</a>
 
-This new and fascinating dataset just dropped on Hugging Face&nbsp;: [French public domain newspapers](https://huggingface.co/datasets/PleIAs/French-PD-Newspapers) 🤗 references about **3&nbsp;million newspapers and periodicals** with their full text OCR’ed and some meta-data.
+This new fascinating dataset just dropped on Hugging Face&nbsp;: [French public domain newspapers](https://huggingface.co/datasets/PleIAs/French-PD-Newspapers) 🤗 references about **3&nbsp;million newspapers and periodicals** with their full text OCR’ed and some meta-data.
 
 The data is stored in 320 large parquet files. The data loader for this [Observable framework](https://observablehq.com/framework) project uses [DuckDB](https://duckdb.org/) to read these files (altogether about 200GB) and combines a minimal subset of their metadata — title and year of publication, most importantly without the text contents&nbsp;—, into a single highly optimized parquet file. This takes only about 1 minute to run in a hugging-face Space.
 
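
To make the loader's first step concrete, here is a hypothetical one-liner (not part of the project) for peeking at the schema of a single source file. It assumes a local duckdb CLI that can load the httpfs extension (autoloaded in recent DuckDB releases):

# Sketch: describe one of the 320 source parquet files without downloading
# it in full; DuckDB only needs the parquet footer/metadata over HTTP.
duckdb :memory: \
  "DESCRIBE SELECT * FROM read_parquet('https://huggingface.co/datasets/PleIAs/French-PD-Newspapers/resolve/main/gallica_presse_1.parquet')"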