fil committed
Commit e9de1b9 · 1 Parent(s): bdd103a

improvements suggested by Eric Mauvière; the database is now 2.5MB, less than 1 byte per row!

Files changed (3):
  1. docs/data/presse.parquet.sh +3 -1
  2. docs/index.md +2 -2
  3. docs/resistance.md +18 -14
docs/data/presse.parquet.sh CHANGED
@@ -5,8 +5,10 @@ SELECT title
 , LPAD((REGEXP_EXTRACT(date, '1[0-9][0-9][0-9]') || '-01-01'), 10, '0')::DATE AS year
 FROM read_parquet(
 [('https://huggingface.co/datasets/PleIAs/French-PD-Newspapers/resolve/main/gallica_presse_{:d}.parquet').format(n) for n in range(1, 321)])
+ORDER BY year, author
 );
-COPY presse TO '$TMPDIR/presse.parquet' (FORMAT 'parquet', COMPRESSION 'GZIP');
+
+COPY presse TO '$TMPDIR/presse.parquet' (COMPRESSION 'ZSTD');
 """ | duckdb
 
 # isatty
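
The two loader changes work together: `ORDER BY year, author` clusters repeated values into long runs, which parquet's run-length and dictionary encodings exploit, and ZSTD then compresses the encoded pages better than GZIP did on unsorted rows. A hypothetical micro-benchmark (table, column and file names are mine, not the repo's) sketches the effect in DuckDB:

```sql
-- Hypothetical micro-benchmark: sorting before COPY shrinks the parquet file.
-- All names here are illustrative, not from the repository.
CREATE TABLE demo AS
SELECT 'title ' || (i % 50)::VARCHAR AS title,
       1850 + (i % 100) AS year
FROM range(1000000) t(i);

-- Same data, same ZSTD compression; only the row order differs.
COPY (SELECT * FROM demo) TO 'unsorted.parquet' (COMPRESSION 'ZSTD');
COPY (SELECT * FROM demo ORDER BY year, title) TO 'sorted.parquet' (COMPRESSION 'ZSTD');
-- Comparing the two files on disk, the sorted copy should be markedly smaller.
```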
docs/index.md CHANGED
@@ -29,7 +29,7 @@ This new fascinating dataset just dropped on Hugging Face : [French public
 
 The data is stored in 320 large parquet files. The data loader for this [Observable framework](https://observablehq.com/framework) project uses [DuckDB](https://duckdb.org/) to read these files (altogether about 200GB) and combines a minimal subset of their metadata — title and year of publication, most importantly without the text contents&nbsp;—, into a single highly optimized parquet file. This takes only about 1 minute to run in a Hugging Face Space.
 
-The resulting file is small enough (about 8MB) that we can load it in the browser and create “live” charts with [Observable Plot](https://observablehq.com/plot).
+The resulting file is small enough (and almost incredibly so: about 2.5MB, _less than 1 byte per row!_) that we can load it in the browser and create “live” charts with [Observable Plot](https://observablehq.com/plot).
 
 In this project, I’m exploring two aspects of the dataset:
 
@@ -65,7 +65,7 @@ Plot.plot({
 });
 ```
 
-<p class="small note" style="margin-top: 3em;" label=Thanks>Radamés Ajna, Sylvain Lesage and the 🤗 team helped me set up the Dockerfile. Éric Mauvière suggested using STRIP_ACCENTS to normalize the query.
+<p class="small note" style="margin-top: 3em;" label=Thanks>Radamés Ajna, Sylvain Lesage and the 🤗 team helped me set up the Dockerfile. Éric Mauvière suggested many performance improvements.
 
 <style>
 
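
At less than one byte per row, the 2.5MB file necessarily holds more than 2.5 million rows. If you want to check the ratio yourself, a one-line DuckDB query does it (the local path is an assumption about where the loader's output lands):

```sql
-- Sanity check for the "less than 1 byte per row" claim; the path is assumed.
-- Divide the file's size in bytes (about 2.5 million) by the row count.
SELECT COUNT(*) AS rows FROM read_parquet('data/presse.parquet');
```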
docs/resistance.md CHANGED
@@ -29,9 +29,14 @@ const db = DuckDBClient.of({ presse: FileAttachment("data/presse.parquet") });
 ```
 
 ```js echo
-const letemps = db.query(
-  "SELECT year FROM presse WHERE title = 'Le Temps' AND year > '1000'"
-);
+const letemps = db.query(`
+  SELECT year
+  , count(*) "count"
+  FROM presse
+  WHERE title = 'Le Temps'
+  AND year > DATE '1000-01-01'
+  GROUP BY ALL
+`);
 ```
 
 ```js echo
@@ -42,10 +47,7 @@ display(
     y: { grid: true },
     marks: [
       Plot.ruleY([0]),
-      Plot.rectY(
-        letemps,
-        Plot.binX({ y: "count" }, { x: "year", interval: "year" })
-      ),
+      Plot.rectY(letemps, { y: "count", x: "year", interval: "year" }),
     ],
   })
 );
@@ -110,19 +112,21 @@ Let’s focus on the ${start1944.length} publications that started publishing in
 ```js echo
 const start1944 = db.query(`
   SELECT title
-  , CASE WHEN author='None' THEN '' ELSE author END AS author
-  , DATE_PART('year', MIN(year)) AS start
-  , DATE_PART('year', MAX(year)) AS end
+  , IFNULL(NULLIF(author, 'None'), '') AS author
+  , YEAR(MIN(year)) AS start
+  , YEAR(MAX(year)) AS end
   , COUNT(*) AS issues
-  FROM presse
-  GROUP BY 1, 2
-  HAVING DATE_PART('year', MIN(year)) = 1944
+  FROM presse
+  GROUP BY ALL
+  HAVING start = 1944
   ORDER BY issues DESC
 `);
 ```
 
 ```js
-display(Inputs.table(start1944));
+display(
+  Inputs.table(start1944, { format: { start: (d) => d, end: (d) => d } })
+);
 ```
 
 Going through these titles, one gets a pretty impressive picture of the publishing activity in this extreme historic period.
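
Two of these rewrites are worth unpacking. First, the per-year counting now happens in DuckDB (`GROUP BY ALL` groups by every non-aggregated column, here `year`), so `Plot.rectY` draws a few hundred pre-aggregated rows instead of having `Plot.binX` bin every issue in the browser. Second, `IFNULL(NULLIF(author, 'None'), '')` is a terser equivalent of the old `CASE` expression; a standalone sketch (the inline sample values are mine) shows the two agree:

```sql
-- NULLIF(author, 'None') yields NULL when author = 'None';
-- IFNULL(..., '') then maps that NULL to '', matching the old CASE.
-- The inline sample values are illustrative only.
SELECT author
     , CASE WHEN author = 'None' THEN '' ELSE author END AS old_form
     , IFNULL(NULLIF(author, 'None'), '') AS new_form
FROM (VALUES ('None'), ('Albert Camus')) AS t(author);
```

The `format: { start: (d) => d, end: (d) => d }` option passed to `Inputs.table` is presumably there so the year columns render as plain `1944` rather than a locale-formatted `1,944`.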