LibRAG / dataset-documentation /DATASETDOC-fa24.md
Daniel Foley

Project Information

Dataset Information

  • Our data is contained on the SCC at /projectnb/sparkgrp/ml-bpl-rag-data
    • /vectorstore/final_embeddings/metadata_index - faiss index for the metadata
    • /vectorstore/final_embeddings/fulltext_index - faiss index for the OCR text
    • /full_data/bpl_data.json - metadata
    • /full_data/clean_ft.json - fulltext
  • We did not have formal datasets; instead, we used the Digital Commonwealth API and created embeddings from the records it returns. No data dictionary is needed beyond the Digital Commonwealth API itself.
  • What keywords or tags would you attach to the data set?
    • Domain(s) of Application: Natural Language Processing, Library Science
    • Civic tech
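
The storage layout listed at the start of this section can be written down as paths under the data root. This is a minimal sketch rather than the project's own loader: the helper names are hypothetical, the structure inside the JSON files is not assumed, and `load_faiss_index` assumes each index path is a single file readable by `faiss.read_index` (if the indexes were saved as directories by a wrapper library, that library's loader would be needed instead).

```python
import json
from pathlib import Path

# Data root on the SCC, as listed above.
DATA_ROOT = Path("/projectnb/sparkgrp/ml-bpl-rag-data")


def dataset_paths(root=DATA_ROOT):
    """Return the four documented locations, keyed by a short (hypothetical) name."""
    root = Path(root)
    embeddings = root / "vectorstore" / "final_embeddings"
    return {
        "metadata_index": embeddings / "metadata_index",  # FAISS index for the metadata
        "fulltext_index": embeddings / "fulltext_index",  # FAISS index for the OCR text
        "metadata_json": root / "full_data" / "bpl_data.json",
        "fulltext_json": root / "full_data" / "clean_ft.json",
    }


def load_json(path):
    """Load one of the JSON files; nothing is assumed about its internal structure."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def load_faiss_index(path):
    """Assumption: the index is a single file written with faiss.write_index."""
    import faiss  # imported lazily so the path helpers work without faiss installed

    return faiss.read_index(str(path))
```

Passing a different `root` to `dataset_paths` lets the same layout be used against a local copy of the data.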

The following questions pertain to the datasets you used in your project.

Motivation

  • We needed to create embeddings of the Digital Commonwealth's data in order to perform retrieval

Composition

  • Each entry in the Digital Commonwealth API represents an object in their repository; the objects vary in format.
  • There were ~1.3 million total objects when we last checked, about 147,000 of which (roughly 11%) contain full text from OCR'd documents.
  • Our data was a comprehensive snapshot at the time of collection; the API continues to be updated.
  • Each field from the API represents a metadata classification.
  • Data is publicly accessible and non-confidential

Collection Process

  • We collected data from an API endpoint.
  • No sampling was performed
  • This data was collected in October 2024
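
Collecting every record without sampling amounts to walking the API's result pages until none remain. The sketch below illustrates that loop under stated assumptions: the endpoint URL, the `page` and `per_page` parameter names, and the `"data"` key in the response are assumptions to check against the Digital Commonwealth API, and the function names are hypothetical rather than taken from the project's ingestion code.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the search endpoint; confirm against the Digital Commonwealth API.
BASE_URL = "https://www.digitalcommonwealth.org/search.json"


def page_url(page, per_page=100, base_url=BASE_URL):
    """Build the URL for one page of results (parameter names are assumed)."""
    return f"{base_url}?{urlencode({'page': page, 'per_page': per_page})}"


def fetch_page(page, per_page=100):
    """Fetch one page and return its list of records.

    Assumption: the response is JSON with the records under a "data" key.
    """
    with urlopen(page_url(page, per_page), timeout=30) as resp:
        payload = json.load(resp)
    return payload.get("data", [])


def collect_all(fetch=fetch_page, start_page=1, max_pages=None):
    """Walk pages in order until one comes back empty, keeping every record."""
    records = []
    page = start_page
    while max_pages is None or page < start_page + max_pages:
        batch = fetch(page)
        if not batch:
            break
        records.extend(batch)
        page += 1
    return records
```

Because `collect_all` takes the fetch function as a parameter, the paging logic can be exercised with a stub before pointing it at the live API; `max_pages` gives a way to cap a trial run.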

Preprocessing/cleaning/labeling

  • Very limited character correction was performed on the fulltext data.
  • No transformations were applied outside of embedding.
  • The raw data is saved in ml-bpl-rag-data/full_data/ as bpl_data.json (metadata) and clean_ft.json (fulltext).
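
The exact character corrections applied to the fulltext are not recorded here. As an illustration only, a light cleanup pass over OCR text might look like the following; `clean_fulltext` and its three rules are hypothetical and should not be read as the rules the project actually used.

```python
import re
import unicodedata


def clean_fulltext(text):
    """Hypothetical light cleanup for OCR text; not the project's actual rules."""
    # Normalize compatibility forms (for example, the "fi" ligature becomes "fi").
    text = unicodedata.normalize("NFKC", text)
    # Drop control characters other than newline and tab.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )
    # Collapse runs of spaces and tabs into one space, leaving line breaks intact.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()
```

A pass this small leaves the wording of the text untouched, which is in keeping with the statement that no transformations other than embedding were applied.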

Uses

  • Embedding for retrieval
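
To make "embedding for retrieval" concrete: a query is embedded into the same vector space as the documents, and the nearest document vectors are returned. The brute-force sketch below stands in for what a vector index does at scale. The choice of cosine similarity is an assumption (the metric used by the project's indexes is not stated here), and the function names and toy vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, doc_vecs, k=3):
    """Return (doc_id, score) pairs for the k vectors most similar to the query.

    doc_vecs maps a document id to its embedding vector.
    """
    scored = [
        (doc_id, cosine_similarity(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy two-dimensional vectors standing in for real embeddings.
docs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
nearest = top_k([1.0, 0.1], docs, k=2)
```

Comparing the query against every vector is only practical for small collections; at the scale described in this document, an index structure is what keeps lookups fast.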

Distribution

  • This data is free for subsequent students of our project to access and use.

Maintenance

There is currently no system in place for cleanly updating the data, though the instructions in WRITEUP.md include a way to ingest your own data from the API and embed it.