fil committed
Commit
8abbe81
1 Parent(s): 31b3927
Files changed (1)
  1. docs/index.md +2 -2
docs/index.md CHANGED
@@ -2,9 +2,9 @@
 
  I'm fascinated by this new dataset https://huggingface.co/datasets/PleIAs/French-PD-Newspapers that just dropped on :hugging_face:.
 
- It references about 3 million French newspapers, with full-text. ("only" 130 files of about 700MB each :slightly_smiling_face:).
+ It references about 3 million French newspapers, with full-text. ("only" 320 files of about 700MB each :slightly_smiling_face:).
 
- I wrote a data loader that outputs a single parquet file combining all their metadata (that is, without the text contents). It takes only about 5 minutes to run, thanks to parquet magic (and fiber internet); not sure how much of the ~100GB I downloaded for that. Now I've started exploring the metadata in an observable project.
+ I wrote a data loader that outputs a single parquet file combining all their metadata (that is, without the text contents). It takes only about 5 minutes to run, thanks to parquet magic (and fiber internet); not sure how much of the ~200+GB I downloaded for that. Now I've started exploring the metadata in an observable project.
 
  In one query, I see that A LOT of publications stopped publishing in 1944/1945, and conversely a large number of newspapers started publishing between 1941 and 1946. This probably includes both collaborationist publications and resistance publications—it would be interesting to find a ML way to separate them.
 
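
Note: the query mentioned in the last paragraph (publications that stopped in 1944/1945, and publications that started between 1941 and 1946) might look roughly like the sketch below. It is not the author's query. The rows, the titles, and the (title, year) layout are hypothetical, since the diff does not show the dataset's actual metadata columns.

```python
from collections import Counter

# Hypothetical metadata rows: (publication title, year of one issue).
# The real dataset's column names and values are not shown in the diff.
rows = [
    ("Journal A", 1938), ("Journal A", 1944),
    ("Journal B", 1941), ("Journal B", 1944),
    ("Journal C", 1945), ("Journal C", 1952),
    ("Journal D", 1930), ("Journal D", 1939),
]


def first_last_years(rows):
    """Map each title to its (first, last) year of publication."""
    spans = {}
    for title, year in rows:
        first, last = spans.get(title, (year, year))
        spans[title] = (min(first, year), max(last, year))
    return spans


spans = first_last_years(rows)

# Count how many titles end, and how many begin, in each year.
ended = Counter(last for _, last in spans.values())
started = Counter(first for first, _ in spans.values())

stopped_1944_45 = ended[1944] + ended[1945]
started_1941_46 = sum(started[year] for year in range(1941, 1947))

print(stopped_1944_45, started_1941_46)
```

On real data the same grouping would run over the combined metadata file instead of an in-memory list; only the per-title min/max step matters here.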