# Methods

## Data collection

To retrieve authoritative data about individuals and their works, we used Wikidata as a gateway to the domain of online catalogs. Wikidata combines collaborative and automated components; the latter extract information from disparate catalogs and harmonize this knowledge. For example, a notable figure such as Plato is directly linked, through unique identifiers in Wikidata, to over 200 catalogs worldwide. Wikidata also has a powerful ontology system, queryable through SPARQL, that facilitates the extraction of specific information. We used this system to identify cultural producers (CPs). Any new ontology entry (for example, the creation of a category such as "painter" or "geographer") must be validated by the Wikidata community to ensure the validity of the information. The ontology consists of a hierarchy of categories and subcategories: for example, "painter" is a subcategory of "visual artist", which is a subcategory of "artist". This system of assigning occupations ensures that we build the dataset without any particular hypothesis in mind, leaving some of the methodological decisions to the contributors and administrators of the Wikidata community. To cover a wide range of creative occupations, we selected individuals whose occupation falls under the subcategories of "writer", "scientist", or "artist" as defined in the Wikidata ontology. In addition, we limited our selection to individuals born before 1850, in line with the timeframe of interest for the current study. We obtained 1,636 artistic occupations, 1,047 scientific occupations, and 2,215 occupations related to writers. After deduplication (two occupations can be subcategories of the same class; in the Wikidata ontology, for example, an "astronomer" is both a "writer" and a "scientist"), we ended up with 3,727 unique occupations (each CP has one or more occupations). Next, we collected various metadata about the individuals, where available.
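As an illustration of this ontology-based selection, the sketch below builds the kind of SPARQL query one can send to the Wikidata Query Service. The property identifiers (P106 "occupation", P279 "subclass of", P569 "date of birth") and Q36180 ("writer") are real Wikidata identifiers, but the query is a simplified reconstruction for illustration, not the exact query used in this study.

```python
# Sketch: build a SPARQL query selecting individuals whose occupation (P106)
# falls anywhere under a root occupation class via the subclass-of tree (P279),
# restricted to births before a cutoff year. Illustrative only.

def build_occupation_query(root_qid: str, max_birth_year: int = 1850) -> str:
    """Return a SPARQL query for people whose occupation is a subclass of root_qid."""
    return f"""
    SELECT DISTINCT ?person ?personLabel WHERE {{
      ?occupation wdt:P279* wd:{root_qid} .   # occupation anywhere in the subcategory tree
      ?person wdt:P106 ?occupation .          # person holds that occupation
      ?person wdt:P569 ?birth .               # date of birth
      FILTER(YEAR(?birth) < {max_birth_year})
      SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
    }}
    """

query = build_occupation_query("Q36180")  # Q36180 = "writer"
```

In practice, this string would be posted to the public endpoint at query.wikidata.org; the same template can be reused with the root classes for "scientist" and "artist".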
This included place of birth, place of death and their geographic coordinates, gender, nationality and its geographic location, and date of birth. In addition, we retrieved details about the online catalogs to which they belonged, along with the associated identifiers. We also obtained additional metadata about the catalogs themselves, such as each catalog's country of origin. Our database includes individuals who belong to one or more of 1,595 online catalogs located in 60 countries on every continent.

## Creating regions and associating individuals with regions

Several levels of geographic granularity are possible. Because we were interested in the long-term dynamics of large societies, our delineation of cultural regions is mostly organized around specific languages (Arabic, Persian, Chinese, Sanskrit, Italian, etc.) and varies over time according to the geographic evolution of each language (for example, Egypt belongs to Greek culture during the Hellenistic and Roman periods, but not during the Arab period). By using a strong cultural marker like language to create the regions, our index is aligned with historical and cultural movements that often transcend smaller political boundaries. To account for the fact that CPs may have traveled to the place where they worked, a cultural producer contributes to the region in which they died. This is technically convenient because the place of death can be linked to geographic coordinates in Wikidata. Thus, even when the name of the place of death no longer exists today (Byzantium, for example), it is easily identifiable through its geographic coordinates and can be assimilated to a region or city that exists today (Istanbul). If the place of death was not available, we used the "place of birth" and "nationality" instead.
We then mapped the obtained coordinates to modern countries (see Supplementary Materials for the modern countries included in each cultural region) and assigned cultural regions to individuals. For example, in the Latin world, we include individuals who died between -300 and 500 in Italy, and between -100 and 500 in Tunisia, Algeria, Morocco, Romania, Croatia, Serbia, Bosnia and Herzegovina, Slovenia, France, United Kingdom, Germany, Switzerland, Austria, Spain, and Portugal. When we needed to divide a country into north and south, we used the coordinates to delineate the regions. For example, we divided Japan into North Japan and South Japan at longitude 138.

## Location in time

We located individuals in time using their date of birth plus 35 years, taking the age of maximum productivity as a reference. If the date of birth was not available, we used the date of death. If an individual died before turning 35, we also used the date of death. We assumed that a region has high cultural productivity at a specific time when numerous individuals are working in that place. Hence, we had to remove individuals who could not be geolocated or dated, as both items are necessary to create the cultural index. We also removed individuals with no references in any online catalog, to avoid "fake" individuals that would not exist in any national or international online catalog. Overall, we obtained a dataset of 158,373 individuals born before 1850 and active up to 1880. It should be highlighted that on Wikidata, certain CPs have "rounded" birthdates due to approximation, such as being recorded simply as born in the 10th or 9th century. This can lead to disproportionately high peaks in specific decades. For example, about 6% of Chinese CPs are recorded with such rounded dates, while in Japan the figure is approximately 14%.
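The dating rules above can be summarized in a small helper function. This is a minimal sketch of the stated logic (birth year plus 35, with the two fallbacks to the death year); the function name and signature are hypothetical, not from the study's codebase.

```python
def locate_in_time(birth_year=None, death_year=None, offset=35):
    """Assign a reference year to an individual:
    birth + 35, falling back to the death year when the birth date is
    missing or when the individual died before turning 35."""
    if birth_year is None:
        return death_year                      # no birth date: use death date
    if death_year is not None and death_year - birth_year < offset:
        return death_year                      # died before turning 35
    return birth_year + offset

locate_in_time(1700, 1770)   # -> 1735 (birth + 35)
locate_in_time(1700, 1720)   # -> 1720 (died at 20)
locate_in_time(None, 1750)   # -> 1750 (no birth date)
```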
In France, it is around 10%, but significantly higher in the Greek world at 53% and, similarly, 51% in the Latin world. To account for this effect, we used a loess regression to smooth the index and remove these artifacts. We also set the birth year at the midpoint of the century: if an individual's birth is recorded only as "16th century", we set the birth year to 1550.

## Extracting the works of cultural producers

Given the technical limitations in directly extracting works from the Wikidata pages of notable individuals, we extracted all the works contained in the Wikidata database and linked them to their creators in our database. Specifically, we targeted works falling under subcategories of "Architectural structure", "Work of art", "Tool", "Archaeological site", or "Infrastructure". These categories are sufficiently broad to cover a wide range of works. For instance, a "Painting" is a subcategory of "Work of art", while a "Building" is a subcategory of "Architectural structure". After linking the works to notable individuals in our database, we obtained a dataset containing 665,108 distinct works. Within this dataset, 24% of individuals in our database were associated with at least one documented work. The median number of works per individual is 3.

## Environmental Variables

For GDP (Gross Domestic Product) per capita, we used the most widely accepted dataset, namely the Maddison Project. To calculate per capita cultural production, we used the population estimates of the Atlas of World Population. This work offers only imperfect approximations of population size. For China, in particular, the evolution of the population is still debated for the early modern period [89, 62]. However, alternative estimates lead to similar trends: for China, using Deng & Sun's data (2017) would yield a similar peak during the Ming dynasty. The only exception is medieval Japan, for which more recent estimates differ substantially from the Atlas of World Population.
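The century-to-midpoint rule can be sketched as follows (the function name is hypothetical; only the mapping itself, e.g. 16th century to 1550, is from the text):

```python
def century_midpoint(century: int) -> int:
    """Map a birthdate recorded only as a century to that century's midpoint.
    The Nth century covers years (N-1)*100 + 1 through N*100, so its
    midpoint is taken as (N-1)*100 + 50; e.g. the 16th century -> 1550."""
    return (century - 1) * 100 + 50

century_midpoint(16)   # -> 1550
century_midpoint(10)   # -> 950
```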
In this case, we used the more recent estimates. Because it is hard to strictly isolate the GDP per capita of a single modern country, and because we wanted the GDP series to cover as much of the period as the CPI in order to reach a statistically acceptable sample size, we estimated the GDP per capita of whole regions from the known GDP of a specific country. For instance, we used the GDP of Sweden to estimate the GDP of the Nordic countries, and the GDP per capita of Poland to estimate that of Eastern Europe.

## Differential detection: the Generalized Chao Estimator

To monitor aspects of biodiversity, such as species richness, ecological research typically depends on capture-recapture surveys [147]: during these bio-registration campaigns, field workers apply miscellaneous trapping devices (such as cameras) to collect statistics on the abundance of species in a particular geographic area over a specific period of time. The resulting data record how often individual species have been observed and can be used to derive useful quantities about an observed assemblage, such as the number of "singletons" ($f_1$), species sighted exactly once, and "doubletons" ($f_2$), species observed exactly twice. Such count data, however, typically suffer from incomplete observations, because many species are hard to observe (because they are rare, shy, well camouflaged, etc.). Due to such "unseen species", the observed species richness will always be an underestimation of the true ecological diversity. In response, ecology has developed a rich tradition of statistical methods for bias correction in this domain [148], in particular for the problem of estimating $f_0$, the number of unseen species, on the basis of the observed data. Chao1 [46] is a commonly used method which estimates a lower bound $\hat{f}_0$ on $f_0$ using the equation $\hat{f}_0 = f_1^2 / (2 f_2)$. To correct for the observation bias in a sample and estimate the true species richness $\hat{N}$, $\hat{f}_0$ can be added to $n$ (the number of observed species in the assemblage). Here, we go one step further and demonstrate how covariates (representing historical eras and regions) can be included in these models to obtain differential loss rates. This approach builds on empirical work in computational sociology [149], in particular the field of statistical criminology, which deals with unreported crime rates ("dark numbers"), for instance related to drug abuse or domestic violence [150]. To obtain analogous abundance statistics for historic CPs, we use the number of works recorded for each individual (see Supplementary Materials for the top types of works in the database), as well as the approximate time and region in which they were active. The number of works per CP is then treated as the number of sightings of a species in ecology, so that we can establish abundance counts. The assumption becomes that CPs must be considered unseen species if no works survive that can be attributed to them. (Note that we do assume that all CPs present in our database authored at least one work, even if none is explicitly listed.) In this setup, Chao1 enables us to estimate a lower bound on the number of historic CPs who are not recorded at all. The availability of a lower-bound estimate of the true cultural richness in heavily undersampled assemblages constitutes a major step forward in historical analysis, where such information was previously inconceivable [38]. While Chao1 is useful to estimate the number of unobserved CPs, it cannot model the factors driving the under-detection. Because the detection rates of historic CPs can be expected to vary appreciably across regions and periods, we turn to a generalized variant of Chao1, derived from the statistical literature in criminology.
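The classic Chao1 bound described above can be written in a few lines. This is a minimal sketch (the mapping from CPs to work counts is hypothetical data, and the zero-doubleton fallback uses the standard bias-corrected variant of Chao1):

```python
from collections import Counter

def chao1(abundances):
    """Chao1 lower-bound estimate of total richness: N = n + f1^2 / (2 * f2).

    `abundances` maps each observed unit (here: a CP) to its sighting
    count (here: the number of attested works)."""
    freq = Counter(abundances.values())
    n = len(abundances)                 # observed richness
    f1, f2 = freq[1], freq[2]           # singletons and doubletons
    if f2 == 0:
        return n + f1 * (f1 - 1) / 2    # bias-corrected fallback when f2 = 0
    return n + f1 ** 2 / (2 * f2)

# 4 singletons, 2 doubletons, 1 well-attested CP: N = 7 + 16/4 = 11
sample = {"a": 1, "b": 1, "c": 1, "d": 1, "e": 2, "f": 2, "g": 9}
chao1(sample)   # -> 11.0
```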
This method enables the inclusion of CP-level covariates in the estimator, thus providing more accurate estimates across different categories of CPs. For this, we propose an approach that uses diachronic B-splines in a hierarchical Bayesian model to predict the attestation frequencies of CPs over time. Although any probabilistic classifier would do, our implementation uses a Bayesian generalized linear model. For each CP with a sighting frequency $y \in \{1, 2\}$, this model predicts the probability of the CP being seen twice (rather than once) as a Bernoulli-realized dependent variable, on the basis of two predictors: time and region. Because we work with long-range time series, we use B-splines (basis splines) to model the effect of time. Such splines take the form of a smooth curve with an arbitrary, so-called "wiggly" shape [151]: such a model avoids the naive assumption of a linear, monotonic relationship between time and the dependent variable, which is unrealistic. We adopt a cubic spline (i.e., third-degree basis functions) with 10 pivot points or "knots", placed along evenly spaced quantiles of the date predictor. Importantly, the parameters associated with these component splines are fitted locally: their behavior in a specific time range does not directly affect the curve's behavior in other date ranges. Finally, to model the influence of cultural geography, we let the splines vary per region as a random effect. As such, this model ultimately aims to capture the intuition that the detection rates of historic CPs can vary across historical periods and regions. Chao's original estimator does not provide insight into the drivers of the observation or detection process: in many assemblages, it can be expected that there are serious differences in detection rates across categories, related to the sorts of survival biases mentioned above [44].
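To make the B-spline machinery concrete, the sketch below evaluates B-spline basis functions with the Cox-de Boor recursion in pure Python. This is an illustration of the basis functions themselves, not the authors' hierarchical Bayesian implementation; the uniform knot vector is an assumption for the example (the text places knots on quantiles of the date variable).

```python
def bspline_basis(i, k, t, knots):
    """Cox-de Boor recursion: value of the i-th B-spline basis function
    of degree k at position t, for a given non-decreasing knot vector."""
    if k == 0:
        return 1.0 if knots[i] <= t < knots[i + 1] else 0.0
    left = 0.0
    if knots[i + k] != knots[i]:
        left = ((t - knots[i]) / (knots[i + k] - knots[i])
                * bspline_basis(i, k - 1, t, knots))
    right = 0.0
    if knots[i + k + 1] != knots[i + 1]:
        right = ((knots[i + k + 1] - t) / (knots[i + k + 1] - knots[i + 1])
                 * bspline_basis(i + 1, k - 1, t, knots))
    return left + right

# With 12 uniform knots there are 8 cubic (degree-3) basis functions; on the
# interior of the domain they are non-negative and sum to one, which is what
# makes the fitted curve a local, "wiggly" weighted average of coefficients.
knots = list(range(12))
values = [bspline_basis(i, 3, 5.0, knots) for i in range(8)]
```

Each basis function is nonzero only over a few adjacent knot spans, which is why the spline parameters are fitted locally: a coefficient only influences the curve where its basis function is nonzero.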
Böhning and colleagues have proposed an extension of Chao's method, the Generalized Chao Estimator, that allows the inclusion of covariates to help model differential detection rates among species [149]. The original Chao1 estimator only considers $f_1$ and $f_2$, i.e., the respective counts of singleton and doubleton species in the assemblage. In this view, the observation frequency $y$ of a species can be modeled as a Poisson-realized variable, parametrized by a single rate parameter $\lambda$:

$$y \sim \text{Poisson}(\lambda)$$

The count data under scrutiny, however, arise from a truncated Poisson distribution, because $y \in \{1, 2\}$. Böhning and colleagues have shown that, using the Horvitz-Thompson estimator, the Poisson rate parameter can in this case be estimated as $\lambda = 2p/(1-p)$. To estimate $\hat{p}$, Böhning et al. propose a logistic regression of the conventional form:

$$y_i \sim \text{Bernoulli}(p_i)$$
$$\text{logit}(p_i) = \alpha$$

In this model, $\alpha$ represents the intercept; for now, we omit priors from the model specification. The dependent variable takes the form of a Bernoulli-distributed variable that models the probability $p$ that a species will be a doubleton, rather than a singleton, in the truncated distribution. To this conventional formulation (with a logit link function), arbitrary numeric covariates can be added as predictors, e.g.:

$$y_i \sim \text{Bernoulli}(p_i)$$
$$\text{logit}(p_i) = \alpha + \beta_x x_i$$

Here, $\beta_x$ represents the coefficient for the arbitrary predictor $x_i$, which is available for the $i$-th species. The Generalized Chao Estimator thus requires fitting a binary classification model to the set of low-frequency species (singletons and doubletons) in the observed data, predicting whether such an instance will be a singleton (negative class) or a doubleton (positive class) on the basis of species-level covariates. The fitted model can then be used to compute a vector of instance-level probabilities $\hat{p}$.
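The relation $\lambda = 2p/(1-p)$ follows from the truncated Poisson: conditional on $y \in \{1, 2\}$, the probability of a doubleton is $p = \frac{\lambda^2 e^{-\lambda}/2}{\lambda e^{-\lambda} + \lambda^2 e^{-\lambda}/2} = \frac{\lambda}{\lambda + 2}$, which inverts to $\lambda = 2p/(1-p)$. A short numeric check of this round trip (a sketch, with hypothetical function names):

```python
import math

def p_doubleton(lam):
    """P(y = 2 | y in {1, 2}) under Poisson(lam); analytically lam / (lam + 2)."""
    p1 = lam * math.exp(-lam)             # Poisson pmf at y = 1
    p2 = lam ** 2 / 2 * math.exp(-lam)    # Poisson pmf at y = 2
    return p2 / (p1 + p2)

def lam_from_p(p):
    """Invert the conditional doubleton probability: lambda = 2p / (1 - p)."""
    return 2 * p / (1 - p)

# Round trip: the rate is exactly recovered from the doubleton probability
lam = 1.7
recovered = lam_from_p(p_doubleton(lam))   # -> 1.7 (up to float error)
```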
For each pˆi, the corresponding λi can then be calculated as: λˆi=2 pˆi (1) 1 − pˆ i Finally, Nˆ, the new lower bound on the true population size, can be estimated as follows: Note that, in the absence of covariates, the intercept-only variant of the model reduces to the regular Chao1 estimator. Crucially, the estimator can be applied to any meaningful subset of the available pˆi, for instance to compute Nˆ for specific subgroups or covariate ranges in the data. This allows us to compute the detection rate (n/Nˆ) for different sets of species, for instance, based on combinations of selected covariate levels and ranges. The Generalized estimator can moreover reduce the bias of Chao1 (as a lower bound) and provide a more reliable approximation of the true species diversity. Exploratory simulations on synthetic data support this claim [149]. Moreover, the Generalized estimator does more more justice to the complex reality that detection rates are differential across different CP categories.
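The generalized estimator can be sketched end-to-end in a few lines, using the form $\hat{N} = n + \sum_{i: y_i = 1} 1/\hat{\lambda}_i$ with $\hat{\lambda}_i = 2\hat{p}_i/(1-\hat{p}_i)$. This is an illustration, not the study's implementation: the per-unit probabilities are passed in directly rather than produced by the fitted Bayesian spline model, and in the intercept-only case ($\hat{p} = f_2/(f_1 + f_2)$ for all units) the result reduces to classic Chao1, as stated above.

```python
def generalized_chao(counts, p_hat):
    """Generalized Chao estimate N = n + sum over singletons of 1 / lambda_i,
    with lambda_i = 2 * p_i / (1 - p_i).

    counts: observed sighting frequency per unit (works per CP).
    p_hat:  per-unit model probability of being a doubleton rather than
            a singleton (e.g. from a logistic regression with covariates)."""
    n = len(counts)
    unseen = sum((1 - p_hat[i]) / (2 * p_hat[i])     # = 1 / lambda_i
                 for i, y in enumerate(counts) if y == 1)
    return n + unseen

# Intercept-only check: with f1 = 4, f2 = 2, p = f2/(f1+f2) = 1/3, the
# estimate equals Chao1's n + f1^2/(2*f2) = 7 + 4 = 11.
counts = [1, 1, 1, 1, 2, 2, 5]
estimate = generalized_chao(counts, [1 / 3] * len(counts))   # -> 11.0
```

With covariates, `p_hat` varies per unit, so regions and periods with many "barely attested" CPs (low $\hat{p}_i$, hence low $\hat{\lambda}_i$) contribute larger unseen counts, which is precisely the differential-detection behavior the text describes.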