“French public domain newspapers” example #1499

4 changes: 4 additions & 0 deletions examples/loader-huggingface/.gitignore
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
.DS_Store
/dist/
node_modules/
yarn-error.log
7 changes: 7 additions & 0 deletions examples/loader-huggingface/README.md
@@ -0,0 +1,7 @@
[Framework examples →](../)

# Hugging Face data loader with DuckDB

View live: <https://observablehq.observablehq.cloud/framework-example-loader-hugging-face/>

This Observable Framework example demonstrates a DuckDB data loader that downloads datasets hosted on Hugging Face and converts them into a minimized, compressed Parquet file.
4 changes: 4 additions & 0 deletions examples/loader-huggingface/observablehq.config.js
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
export default {
  root: "src",
  sidebar: true
};
20 changes: 20 additions & 0 deletions examples/loader-huggingface/package.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,20 @@
{
  "type": "module",
  "private": true,
  "scripts": {
    "clean": "rimraf src/.observablehq/cache",
    "build": "rimraf dist && observable build",
    "dev": "observable preview",
    "deploy": "observable deploy",
    "observable": "observable"
  },
  "dependencies": {
    "@observablehq/framework": "latest"
  },
  "devDependencies": {
    "rimraf": "^5.0.5"
  },
  "engines": {
    "node": ">=18"
  }
}
1 change: 1 addition & 0 deletions examples/loader-huggingface/src/.gitignore
@@ -0,0 +1 @@
/.observablehq/cache/
11 changes: 11 additions & 0 deletions examples/loader-huggingface/src/data/presse.parquet.sh
@@ -0,0 +1,11 @@
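duckdb :memory: -c "
  -- Keep only the title, author and year of each record, not the full text.
  CREATE TABLE presse AS (
    SELECT title
         , author
         -- Extract a year between 1000 and 1999. Without a match the value becomes 0000-01-01.
         , LPAD((REGEXP_EXTRACT(date, '1[0-9][0-9][0-9]') || '-01-01'), 10, '0')::DATE AS year
      FROM read_parquet(
        [('https://huggingface.co/datasets/PleIAs/French-PD-Newspapers/resolve/main/gallica_presse_{:d}.parquet').format(n) for n in range(1, 321)])
     ORDER BY title, author, year
  );
  -- Write the sorted table to standard output as ZSTD-compressed Parquet.
  COPY presse TO STDOUT (FORMAT 'parquet', COMPRESSION 'ZSTD', row_group_size 10000000);
"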
65 changes: 65 additions & 0 deletions examples/loader-huggingface/src/gazette.md
@@ -0,0 +1,65 @@
---
sql:
  presse: data/presse.parquet
---

# Gazette

Explore 3 million newspapers **by title**. Type in words such as “jeune”, “révolution”, “république”, “matin”, “soir”, “humanité”, “nouvelle”, “moderne”, “femme”, “paysan”, “ouvrier”, “social”, “résistance” etc. to see different historical trends.

```js
const search = view(
  Inputs.text({ type: "search", value: "gazette", submit: true })
);
```

```js
const chart = Plot.plot({
  x: { type: "utc", nice: true },
  y: {
    label: `Share of titles matching ${search}`,
    tickFormat: "%",
  },
  marks: [
    Plot.ruleY([0, 0.01], { stroke: ["currentColor"] }),
    Plot.areaY(base, {
      x: "year",
      y: ({ year, total }) => gazette.get(year) / total,
      fillOpacity: 0.2,
      curve: "step",
    }),
    Plot.lineY(base, {
      x: "year",
      y: ({ year, total }) => gazette.get(year) / total,
      curve: "step",
    }),
  ],
});

display(chart);
```

I called this page “Gazette” because I was surprised that most of the corpus in the earlier years had a title containing this word. The query uses the [REGEXP_MATCHES](https://duckdb.org/docs/archive/0.9.2/sql/functions/patternmatching) function, made case-insensitive and accent-insensitive, to count occurrences; you can query, for example, “socialis[tm]e” to match both “socialiste” and “socialisme”.

```sql id=results
SELECT year
, COUNT() c
FROM presse
WHERE REGEXP_MATCHES(STRIP_ACCENTS(title), STRIP_ACCENTS(${search}), 'i')
GROUP BY year
```
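
The matching rule in the query above can be approximated in JavaScript. This is only a sketch: the helper names `stripAccents` and `matchesTitle` and the sample titles are made up for illustration, and JavaScript's regular-expression engine is not the one DuckDB uses, so edge cases may differ.

```javascript
// Hypothetical helper: decompose characters (NFD) and drop the combining
// marks, roughly what STRIP_ACCENTS does in the SQL query.
function stripAccents(text) {
  return text.normalize("NFD").replace(/\p{M}/gu, "");
}

// Hypothetical helper: case-insensitive regular-expression test on
// accent-stripped strings, approximating REGEXP_MATCHES(..., 'i').
function matchesTitle(title, query) {
  return new RegExp(stripAccents(query), "i").test(stripAccents(title));
}

// Made-up sample titles, for illustration only.
const sampleTitles = ["Le Socialiste", "Le Socialisme", "La Révolution", "Le Matin"];
const socialism = sampleTitles.filter((t) => matchesTitle(t, "socialis[tm]e"));
const revolution = sampleTitles.filter((t) => matchesTitle(t, "revolution"));
console.log(socialism, revolution);
```

Stripping accents on both sides is what lets an unaccented query such as “revolution” find “La Révolution”.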

```js
// A Map for fast retrieval—precisely an InternMap, indexed by Date
const gazette = new d3.InternMap(Array.from(results, ({ year, c }) => [year, c]));
```
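
Why an InternMap rather than a plain `Map`? A `Map` compares object keys by identity, so two `Date` objects representing the same instant are different keys. The sketch below shows the problem and a workaround that needs no library (keying by the numeric timestamp); `d3.InternMap` applies a similar idea by interning keys through `valueOf`. The sample counts are made up.

```javascript
// Made-up sample: number of matching titles per year.
const sampleRows = [
  { year: new Date(Date.UTC(1789, 0, 1)), c: 12 },
  { year: new Date(Date.UTC(1790, 0, 1)), c: 7 },
];

// A plain Map keyed by Date objects only finds the exact same object.
const byDateObject = new Map(sampleRows.map(({ year, c }) => [year, c]));
const missed = byDateObject.get(new Date(Date.UTC(1789, 0, 1)));

// Keying by the primitive timestamp lets equal dates hit the same entry.
const byTimestamp = new Map(sampleRows.map(({ year, c }) => [+year, c]));
const found = byTimestamp.get(+new Date(Date.UTC(1789, 0, 1)));

console.log(missed, found); // undefined 12
```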

```sql id=base
-- base (denominator: count by year) --
SELECT year
, COUNT(*)::int total
FROM presse
WHERE year > DATE '1000-01-01'
GROUP BY year
ORDER BY year
```
65 changes: 65 additions & 0 deletions examples/loader-huggingface/src/index.md
@@ -0,0 +1,65 @@
---
sql:
  presse: data/presse.parquet
---

# French public domain newspapers

## A quick glance at 3&nbsp;million periodicals

<p class=signature>by <a href="https://observablehq.com/@fil">Fil</a>

This fascinating new dataset just dropped on Hugging Face: [French public domain newspapers](https://huggingface.co/datasets/PleIAs/French-PD-Newspapers) 🤗 references about **3&nbsp;million newspapers and periodicals** with their full text OCR’ed and some metadata.

The data is stored in 320 large Parquet files. The data loader for this [Observable Framework](https://observablehq.com/framework) project uses [DuckDB](https://duckdb.org/) to read these files (about 200&nbsp;GB altogether) and combines a minimal subset of their metadata (title, author, and year of publication; most importantly, not the text contents) into a single, highly optimized Parquet file. This takes only about 1 minute to run in a Hugging Face Space.
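
As an illustration of how the 320 source files are addressed, here is a small JavaScript sketch that builds the same list of URLs the loader generates in SQL. The variable names `baseUrl` and `parquetUrls` are hypothetical; the URL pattern and the numbering from 1 to 320 come from the loader script.

```javascript
// URL pattern taken from the data loader; files are numbered 1 to 320.
const baseUrl =
  "https://huggingface.co/datasets/PleIAs/French-PD-Newspapers/resolve/main";

// Build one URL per source file.
const parquetUrls = Array.from(
  { length: 320 },
  (_, i) => `${baseUrl}/gallica_presse_${i + 1}.parquet`
);

console.log(parquetUrls.length, parquetUrls[0]);
```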

The resulting file is small enough (incredibly so: it weighs about 560&nbsp;kB, _only 1.5 bits per row!_) that we can load it in the browser and create “live” charts with [Observable Plot](https://observablehq.com/plot).
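
The “1.5 bits per row” figure can be checked with quick arithmetic. This sketch uses the rounded numbers quoted in the text (about 560 kB and about 3 million rows) and assumes 1 kB means 1000 bytes, so the result is only approximate.

```javascript
// Rounded figures quoted in the text; both are approximations.
const fileSizeBytes = 560_000; // about 560 kB, assuming 1 kB = 1000 bytes
const rowCount = 3_000_000; // about 3 million records

// Convert bytes to bits, then divide by the number of rows.
const bitsPerRow = (fileSizeBytes * 8) / rowCount;

console.log(bitsPerRow.toFixed(2)); // "1.49"
```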

In this project, I’m exploring two aspects of the dataset:

- As I played with the titles, I saw that the word “gazette” was quite frequent in the 17th century. An exploration of the words used in the titles is on the page [gazette](gazette).

- A lot of publications stopped or started publishing during the Second World War. This is explored in [resistance](resistance).

This page summarizes the time distribution of the data:

```sql id=dates echo
-- dates --
SELECT year FROM presse WHERE year >= DATE '1000-01-01'
```

_Note: due to the date pattern matching I’m using, unknown years are marked as 0000. Hence the filter above._
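
To make the note concrete, here is a JavaScript sketch of the year extraction performed in the data loader. The function name `toYearString` and the sample inputs are made up for illustration, and the sketch assumes that a failed pattern match yields an empty string, as the note implies.

```javascript
// Hypothetical JavaScript equivalent of the loader's SQL expression:
// LPAD(REGEXP_EXTRACT(date, '1[0-9][0-9][0-9]') || '-01-01', 10, '0')
function toYearString(date) {
  const match = date.match(/1[0-9][0-9][0-9]/);
  const year = match ? match[0] : ""; // assumed: no match gives an empty string
  // "-01-01" alone is 6 characters, so padding to 10 prepends "0000".
  return (year + "-01-01").padStart(10, "0");
}

console.log(toYearString("1897-03-12"), toYearString("s.d."));
// 1897-01-01 0000-01-01
```

Because the pattern only accepts four-digit runs starting with 1, any date outside the years 1000 to 1999 also ends up as 0000.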

The chart below indicates that the bulk of the contents collected in this database was published between 1850 and 1950. Obviously, this is not because the _presse_ stopped after 1950, but because most of the printed word after that threshold year is still out of reach of researchers, as it is “protected” by copyright or _droit d’auteur._

${Plot.rectY(dates, Plot.binX({ y: "count" }, { x: "year", interval: "5 years" })).plot({ marginLeft: 60 })}

```js echo run=false
Plot.plot({
  marks: [
    Plot.rectY(
      dates,
      Plot.binX({ y: "count" }, { x: "year", interval: "5 years" })
    ),
  ],
});
```

<p class="small note" style="margin-top: 3em;" label=Thanks>Radamés Ajna, Sylvain Lesage and the 🤗 team helped me set up the Dockerfile. Éric Mauvière suggested many performance improvements.

<style>

.signature a[href] {
  color: var(--theme-foreground)
}

.signature {
  text-align: right;
  font-size: small;
}

.signature::before {
  content: "◼︎ ";
}

</style>
107 changes: 107 additions & 0 deletions examples/loader-huggingface/src/resistance.md
@@ -0,0 +1,107 @@
---
sql:
  presse: data/presse.parquet
---

# Résistance

During the Second World War, the Nazis occupied the northern half of France, and the collaborationist government of Pétain was left to rule over the southern half (the “[zone libre](https://fr.wikipedia.org/wiki/Zone_libre)”). Many newspapers were closed at that time; others submitted to the occupiers, and some even collaborated enthusiastically. At the same time, a range of clandestine publications started circulating, often associated with the resistance movements. When the country was liberated in 1944, the most outrageously collaborationist press was dismantled; other newspapers changed their names and were sometimes taken over by new teams of resistance journalists. The most famous case is “Le Temps,” a daily newspaper that had been [publishing since 1861](<https://fr.wikipedia.org/wiki/Le_Temps_(quotidien_fran%C3%A7ais,_1861-1942)>) and had closed in 1942. Although not a collaborationist newspaper, it was not allowed to reopen, and its assets were transferred to create “Le Monde” on 19&nbsp;December 1944, under Hubert Beuve-Méry.

```sql id=letemps echo
-- letemps --
SELECT year
, count(*) "count"
FROM presse
WHERE title = 'Le Temps'
AND year > DATE '1000-01-01'
GROUP BY ALL
```

```js echo
display(
  Plot.plot({
    caption: "Number of issues of Le Temps in the dataset, per year",
    x: { nice: true },
    y: { grid: true },
    marks: [
      Plot.ruleY([0]),
      Plot.rectY(letemps, { y: "count", x: "year", interval: "year" }),
    ],
  })
);
```

(Unfortunately, “Le Monde” is not part of the dataset.)

The number of titles that stopped or started publishing exploded in those fateful years. Note that many of these publications were short-lived, such as this example picked at random from the dataset: [Au-devant de la vie. Organe de l'Union des jeunes filles patriotes (UJFP), Région parisienne](https://gallica.bnf.fr/ark:/12148/bpt6k76208732?rk=21459;2). While the UJFP (a resistance organisation of communist young women) published several titles during the war, only one issue was distributed under this title.

```sql id=years echo
-- years --
SELECT title
, MIN(year) AS start
, MAX(year) AS end
FROM presse
GROUP BY 1
```
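
The aggregation above reduces each title to its first and last year of publication. A minimal JavaScript sketch of the same grouping might look like this; the sample rows are made up, the second title is a placeholder, and years are plain numbers here whereas the real `year` column holds dates.

```javascript
// Made-up sample rows; the real table has one row per issue.
const sampleIssues = [
  { title: "Le Temps", year: 1861 },
  { title: "Le Temps", year: 1942 },
  { title: "Le Temps", year: 1900 },
  { title: "Titre d'exemple", year: 1944 },
];

// Track the earliest and latest year seen for each title,
// like MIN(year) and MAX(year) with GROUP BY title.
const spans = new Map();
for (const { title, year } of sampleIssues) {
  const span = spans.get(title);
  if (!span) {
    spans.set(title, { start: year, end: year });
  } else {
    span.start = Math.min(span.start, year);
    span.end = Math.max(span.end, year);
  }
}

console.log(spans.get("Le Temps"), spans.get("Titre d'exemple"));
```

A title with a single issue gets identical `start` and `end` values, which is why short-lived publications appear in both the “started” and “ended” counts for the same year.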

```js echo
display(
  Plot.plot({
    color: { legend: true },
    marks: [
      Plot.rectY(
        years,
        Plot.binX(
          { y: "count" },
          {
            filter: (d) =>
              d.start?.getUTCFullYear() >= 1930 &&
              d.start?.getUTCFullYear() <= 1955,
            x: "start",
            fill: () => "started",
            interval: "year",
          }
        )
      ),
      Plot.rectY(
        years,
        Plot.binX(
          { y: "count" },
          {
            filter: (d) =>
              d.end?.getUTCFullYear() >= 1930 &&
              d.end?.getUTCFullYear() <= 1955,
            x: "end",
            fill: () => "ended",
            mixBlendMode: "multiply",
            interval: "year",
          }
        )
      ),
      Plot.ruleY([0]),
    ],
  })
);
```

Let’s focus on the ${start1944.length} publications that started publishing in 1944, and extract their titles and authors:

```sql id=start1944 echo
SELECT title
, IFNULL(NULLIF(author, 'None'), '') AS author
, YEAR(MIN(year)) AS start
, YEAR(MAX(year)) AS end
, COUNT(*) AS issues
FROM presse
GROUP BY ALL
HAVING start = 1944
ORDER BY issues DESC
```

```js
display(
  Inputs.table(start1944, { format: { start: (d) => d, end: (d) => d } })
);
```

Going through these titles, one gets a pretty impressive picture of the publishing activity in this extreme historic period.
35 changes: 35 additions & 0 deletions examples/loader-huggingface/src/source.md
@@ -0,0 +1,35 @@
# Source code

This project relies on a **data loader** that reads all the source files and outputs a single summary file, minimized to contain only a subset of the source information:

```js
import hljs from "npm:highlight.js";
```

`data/presse.parquet.sh`

```js
const pre = display(document.createElement("pre"));
FileAttachment("data/presse.parquet.sh")
  .text()
  .then(
    (text) => (pre.innerHTML = hljs.highlight(text, { language: "bash" }).value)
  );
```

This is the file that the other pages reference in the front matter:

```yaml
---
sql:
  presse: data/presse.parquet
---
```

and process with [sql](https://observablehq.com/framework/sql) code blocks:

````sql run=false
```sql
SELECT COUNT() FROM presse
```
````