fil committed
Commit e9de1b9 · 1 Parent(s): bdd103a

improvements suggested by Eric Mauvière; the database is now 2.5MB, less than 1 byte per row!

Files changed (3):
  1. docs/data/presse.parquet.sh +3 -1
  2. docs/index.md +2 -2
  3. docs/resistance.md +18 -14
docs/data/presse.parquet.sh CHANGED
@@ -5,8 +5,10 @@ SELECT title
 , LPAD((REGEXP_EXTRACT(date, '1[0-9][0-9][0-9]') || '-01-01'), 10, '0')::DATE AS year
 FROM read_parquet(
 [('https://huggingface.co/datasets/PleIAs/French-PD-Newspapers/resolve/main/gallica_presse_{:d}.parquet').format(n) for n in range(1, 321)])
+ORDER BY year, author
 );
-COPY presse TO '$TMPDIR/presse.parquet' (FORMAT 'parquet', COMPRESSION 'GZIP');
+
+COPY presse TO '$TMPDIR/presse.parquet' (COMPRESSION 'ZSTD');
 """ | duckdb
 
 # isatty
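
The two loader changes work together: `ORDER BY year, author` clusters repeated values into long runs, which parquet's run-length and dictionary encodings exploit, and ZSTD then compresses the encoded pages better than GZIP did on unsorted rows. A hypothetical micro-benchmark (table, column and file names are mine, not the repo's) sketches the effect in DuckDB:

```sql
-- Hypothetical micro-benchmark: sorting before COPY shrinks the parquet file.
-- All names here are illustrative, not from the repository.
CREATE TABLE demo AS
SELECT 'title ' || (i % 50)::VARCHAR AS title,
       1850 + (i % 100) AS year
FROM range(1000000) t(i);

-- Same data, same ZSTD compression; only the row order differs.
COPY (SELECT * FROM demo) TO 'unsorted.parquet' (COMPRESSION 'ZSTD');
COPY (SELECT * FROM demo ORDER BY year, title) TO 'sorted.parquet' (COMPRESSION 'ZSTD');
-- Comparing the two files on disk, the sorted copy should be markedly smaller.
```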
docs/index.md CHANGED
@@ -29,7 +29,7 @@ This new fascinating dataset just dropped on Hugging Face : [French public
 
 The data is stored in 320 large parquet files. The data loader for this [Observable framework](https://observablehq.com/framework) project uses [DuckDB](https://duckdb.org/) to read these files (altogether about 200GB) and combines a minimal subset of their metadata — title and year of publication, most importantly without the text contents&nbsp;—, into a single highly optimized parquet file. This takes only about 1 minute to run in a Hugging Face Space.
 
-The resulting file is small enough (about 8MB) that we can load it in the browser and create “live” charts with [Observable Plot](https://observablehq.com/plot).
+The resulting file is small enough (and almost incredibly so: about 2.5MB, _less than 1 byte per row!_) that we can load it in the browser and create “live” charts with [Observable Plot](https://observablehq.com/plot).
 
 In this project, I’m exploring two aspects of the dataset:
 
@@ -65,7 +65,7 @@ Plot.plot({
 });
 ```
 
-<p class="small note" style="margin-top: 3em;" label=Thanks>Radamés Ajna, Sylvain Lesage and the 🤗 team helped me set up the Dockerfile. Éric Mauvière suggested using STRIP_ACCENTS to normalize the query.
+<p class="small note" style="margin-top: 3em;" label=Thanks>Radamés Ajna, Sylvain Lesage and the 🤗 team helped me set up the Dockerfile. Éric Mauvière suggested many performance improvements.
 
 <style>
 
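
At less than one byte per row, the 2.5MB file necessarily holds more than 2.5 million rows. If you want to check the ratio yourself, a one-line DuckDB query does it (the local path is an assumption about where the loader's output lands):

```sql
-- Sanity check for the "less than 1 byte per row" claim; the path is assumed.
-- Divide the file's size in bytes (about 2.5 million) by the row count.
SELECT COUNT(*) AS rows FROM read_parquet('data/presse.parquet');
```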
docs/resistance.md CHANGED
@@ -29,9 +29,14 @@ const db = DuckDBClient.of({ presse: FileAttachment("data/presse.parquet") });
 ```
 
 ```js echo
-const letemps = db.query(
-  "SELECT year FROM presse WHERE title = 'Le Temps' AND year > '1000'"
-);
+const letemps = db.query(`
+  SELECT year
+  , count(*) "count"
+  FROM presse
+  WHERE title = 'Le Temps'
+  AND year > DATE '1000-01-01'
+  GROUP BY ALL
+`);
 ```
 
 ```js echo
@@ -42,10 +47,7 @@ display(
     y: { grid: true },
     marks: [
       Plot.ruleY([0]),
-      Plot.rectY(
-        letemps,
-        Plot.binX({ y: "count" }, { x: "year", interval: "year" })
-      ),
+      Plot.rectY(letemps, { y: "count", x: "year", interval: "year" }),
     ],
   })
 );
@@ -110,19 +112,21 @@ Let’s focus on the ${start1944.length} publications that started publishing in
 ```js echo
 const start1944 = db.query(`
   SELECT title
-  , CASE WHEN author='None' THEN '' ELSE author END AS author
-  , DATE_PART('year', MIN(year)) AS start
-  , DATE_PART('year', MAX(year)) AS end
+  , IFNULL(NULLIF(author, 'None'), '') AS author
+  , YEAR(MIN(year)) AS start
+  , YEAR(MAX(year)) AS end
   , COUNT(*) AS issues
-  FROM presse
-  GROUP BY 1, 2
-  HAVING DATE_PART('year', MIN(year)) = 1944
+  FROM presse
+  GROUP BY ALL
+  HAVING start = 1944
   ORDER BY issues DESC
 `);
 ```
 
 ```js
-display(Inputs.table(start1944));
+display(
+  Inputs.table(start1944, { format: { start: (d) => d, end: (d) => d } })
+);
 ```
 
 Going through these titles, one gets a pretty impressive picture of the publishing activity in this extreme historic period.
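
Two of these rewrites are worth unpacking. First, the per-year counting now happens in DuckDB (`GROUP BY ALL` groups by every non-aggregated column, here `year`), so `Plot.rectY` draws a few hundred pre-aggregated rows instead of having `Plot.binX` bin every issue in the browser. Second, `IFNULL(NULLIF(author, 'None'), '')` is a terser equivalent of the old `CASE` expression; a standalone sketch (the inline sample values are mine) shows the two agree:

```sql
-- NULLIF(author, 'None') yields NULL when author = 'None';
-- IFNULL(..., '') then maps that NULL to '', matching the old CASE.
-- The inline sample values are illustrative only.
SELECT author
     , CASE WHEN author = 'None' THEN '' ELSE author END AS old_form
     , IFNULL(NULLIF(author, 'None'), '') AS new_form
FROM (VALUES ('None'), ('Albert Camus')) AS t(author);
```

The `format: { start: (d) => d, end: (d) => d }` option passed to `Inputs.table` is presumably there so the year columns render as plain `1944` rather than a locale-formatted `1,944`.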