BigLAM: BigScience Libraries, Archives and Museums

non-profit

https://github.com/bigscience-workshop/lam

AI & ML interests

🤗 Hugging Face x 🌸 BigScience initiative to create open source community resources for LAMs.

📚 BigLAM

A community-run home for machine-learning-ready datasets from libraries, archives, and museums.

Most cultural-heritage data was not prepared with ML workflows in mind. It lives in catalogue systems, IIIF endpoints, METS/MODS records, and idiosyncratic formats that vary from one institution to the next. BigLAM is a place where those datasets get repackaged into formats ML practitioners can load and work with directly, contributed by the people who know the source material best.

The org started as a datasets hackathon inside the BigScience project in 2022 and has grown into a standing community for cultural-heritage ML.

What's here

The org is datasets-first: 46+ image, text, and tabular collections from libraries, archives, and museums, prepared so they load cleanly with the Hugging Face datasets library. A handful of models and Spaces live here too, mostly early experiments from the BigScience-era hackathon.
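As a rough illustration of what "loads cleanly" can look like in practice, the sketch below streams a few records from a dataset hosted under the org. The helper names `biglam_repo_id` and `peek` are hypothetical, and the `train` split is an assumption; check each dataset card for the splits and configurations it actually exposes.

```python
from typing import Any


def biglam_repo_id(name: str) -> str:
    """Return a full Hub repo id, prefixing the biglam org when none is given."""
    return name if "/" in name else f"biglam/{name}"


def peek(name: str, n: int = 3, split: str = "train") -> list[dict[str, Any]]:
    """Stream the first n records of a dataset without downloading all of it.

    The default split name is an assumption, not something every dataset
    is guaranteed to provide.
    """
    # Imported lazily so the helper above works without the library installed.
    from datasets import load_dataset  # pip install datasets

    stream = load_dataset(biglam_repo_id(name), split=split, streaming=True)
    return list(stream.take(n))


# Example call (needs network access):
# records = peek("muninn-ww1-documents")
```

Streaming avoids pulling a full image collection to disk just to inspect its columns, which is often the first thing worth doing with an unfamiliar dataset.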

For task-specific, deployable models built on top of these datasets, see the sibling org small-models-for-glam.

Contributing a dataset

If you've prepared a LAM dataset that other researchers might use, the best home is usually your institution's own Hugging Face organisation (e.g. NationalLibraryOfScotland). Institutional ownership signals authority over the data and makes long-term maintenance easier. Setting up a new org on the Hub is free and quick.

If your institution isn't on the Hub yet, or you'd prefer to host the dataset here, open a discussion and we'll help get it set up under BigLAM. Useful additions are typically datasets where the format conversion (METS/ALTO → parquet, IIIF manifest → loadable image splits, etc.) has already been done and the licensing is clear enough for open release.

Already have a dataset here that should sit under your institution's org? Open a discussion or issue on the dataset repo and we'll be happy to transfer ownership.

More than 60 people have contributed over the years. Day-to-day maintenance is light-touch; for help with a contribution, open a discussion and someone will see it.