improvements suggested by Eric Mauvière; the database is now 2.5MB, less than 1 byte per row!
Files changed:
- docs/data/presse.parquet.sh +3 -1
- docs/index.md +2 -2
- docs/resistance.md +18 -14
docs/data/presse.parquet.sh

````diff
@@ -5,8 +5,10 @@ SELECT title
 , LPAD((REGEXP_EXTRACT(date, '1[0-9][0-9][0-9]') || '-01-01'), 10, '0')::DATE AS year
 FROM read_parquet(
   [('https://huggingface.co/datasets/PleIAs/French-PD-Newspapers/resolve/main/gallica_presse_{:d}.parquet').format(n) for n in range(1, 321)])
+ORDER BY year, author
 );
-
+
+COPY presse TO '$TMPDIR/presse.parquet' (COMPRESSION 'ZSTD');
 """ | duckdb
 
 # isatty
````
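For readers who want to sanity-check the loader's logic outside DuckDB, here is a small Python sketch (an illustration, not part of the repo; `year_to_date` is a hypothetical helper name) of the two expressions above: the 320 shard URLs that the `read_parquet()` list comprehension expands to, and the `REGEXP_EXTRACT`/`LPAD` date normalization.

```python
import re

# The read_parquet() call above expands to these 320 shard URLs on the Hub.
BASE = "https://huggingface.co/datasets/PleIAs/French-PD-Newspapers/resolve/main"
urls = [f"{BASE}/gallica_presse_{n}.parquet" for n in range(1, 321)]

def year_to_date(date_str):
    """Mimic REGEXP_EXTRACT(date, '1[0-9][0-9][0-9]') || '-01-01':
    pull the first 1000-1999 year from the raw date string and pin it
    to January 1st. (The LPAD to 10 chars in the SQL is defensive
    padding; a four-digit year already yields a 10-char date.)"""
    m = re.search(r"1[0-9][0-9][0-9]", date_str or "")
    return f"{m.group(0)}-01-01" if m else None
```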
docs/index.md

````diff
@@ -29,7 +29,7 @@ This new fascinating dataset just dropped on Hugging Face : [French public
 
 The data is stored in 320 large parquet files. The data loader for this [Observable framework](https://observablehq.com/framework) project uses [DuckDB](https://duckdb.org/) to read these files (altogether about 200GB) and combines a minimal subset of their metadata — title and year of publication, most importantly without the text contents — into a single highly optimized parquet file. This takes only about 1 minute to run in a Hugging Face Space.
 
-The resulting file is small enough (about
+The resulting file is small enough (and almost incredibly small: about 2.5MB, _less than 1 byte per row!_) that we can load it in the browser and create “live” charts with [Observable Plot](https://observablehq.com/plot).
 
 In this project, I’m exploring two aspects of the dataset:
 
@@ -65,7 +65,7 @@ Plot.plot({
 });
 ```
 
-<p class="small note" style="margin-top: 3em;" label=Thanks>Radamés Ajna, Sylvain Lesage and the 🤗 team helped me set up the Dockerfile. Éric Mauvière suggested
+<p class="small note" style="margin-top: 3em;" label=Thanks>Radamés Ajna, Sylvain Lesage and the 🤗 team helped me set up the Dockerfile. Éric Mauvière suggested many performance improvements.
 
 <style>
````
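A quick back-of-envelope check of the "less than 1 byte per row" claim. The exact row count isn't stated on this page, so the figure below is a hypothetical value chosen only to illustrate the arithmetic:

```python
# The optimized file weighs about 2.5 MB; with an ASSUMED row count of
# 3 million (hypothetical -- the commit doesn't state the exact number),
# each row costs well under one byte on average.
file_bytes = 2.5e6
assumed_rows = 3_000_000
bytes_per_row = file_bytes / assumed_rows
```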
docs/resistance.md

````diff
@@ -29,9 +29,14 @@ const db = DuckDBClient.of({ presse: FileAttachment("data/presse.parquet") });
 ```
 
 ```js echo
-const letemps = db.query(
-
-)
+const letemps = db.query(`
+  SELECT year
+  , count(*) "count"
+  FROM presse
+  WHERE title = 'Le Temps'
+  AND year > DATE '1000-01-01'
+  GROUP BY ALL
+`);
 ```
 
 ```js echo
@@ -42,10 +47,7 @@ display(
     y: { grid: true },
     marks: [
       Plot.ruleY([0]),
-      Plot.rectY(
-        letemps,
-        Plot.binX({ y: "count" }, { x: "year", interval: "year" })
-      ),
+      Plot.rectY(letemps, { y: "count", x: "year", interval: "year" }),
     ],
   })
 );
@@ -110,19 +112,21 @@ Let’s focus on the ${start1944.length} publications that started publishing in
 ```js echo
 const start1944 = db.query(`
   SELECT title
-  ,
-  ,
-  ,
+  , IFNULL(NULLIF(author, 'None'), '') AS author
+  , YEAR(MIN(year)) AS start
+  , YEAR(MAX(year)) AS end
   , COUNT(*) AS issues
-
-  GROUP BY
-
+  FROM presse
+  GROUP BY ALL
+  HAVING start = 1944
   ORDER BY issues DESC
 `);
 ```
 
 ```js
-display(
+display(
+  Inputs.table(start1944, { format: { start: (d) => d, end: (d) => d } })
+);
 ```
 
 Going through these titles, one gets a pretty impressive picture of the publishing activity in this extreme historic period.
````
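The rewritten `start1944` query packs the interesting logic into SQL: author values of `'None'` are normalized to empty strings, each (title, author) group gets its first and last year, and `HAVING start = 1944` keeps only titles whose first issue appeared in 1944. A Python stand-in for that logic, using made-up rows purely for illustration (`start_1944` is a hypothetical function name, not code from the repo):

```python
from collections import defaultdict

def start_1944(rows):
    """Emulate the start1944 SQL over (title, author, year) tuples."""
    groups = defaultdict(list)
    for title, author, year in rows:
        # IFNULL(NULLIF(author, 'None'), '')
        author = "" if author in (None, "None") else author
        groups[(title, author)].append(year)
    out = [
        {"title": t, "author": a, "start": min(ys), "end": max(ys), "issues": len(ys)}
        for (t, a), ys in groups.items()
        if min(ys) == 1944  # HAVING start = 1944
    ]
    return sorted(out, key=lambda d: -d["issues"])  # ORDER BY issues DESC
```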