Viewing samples of individual documents

#1
by ola13 - opened
BigScience Data org

Hi there, I would like to view samples of individual documents from the corpus along with their metadata. Is this possible with the existing tools? So far I have only been able to get insights at the dataset level, but perhaps I'm missing something obvious. Thanks!

BigScience Data org
edited Jul 19, 2022

Hi Ola! That's currently possible with this tool: https://huggingface.co/spaces/bigscience-catalogue-lm-data/dataset-explorer

EDIT: The tool I linked doesn't currently display metadata, but it should be straightforward to have it display the meta field, which is enforced to be present in every dataset of the corpus.

BigScience Data org

Do you know how I can work on the tool locally to add the extra fields I need? I haven't used Spaces before, so I'm not sure how the development process works.

BigScience Data org

You can git clone https://huggingface.co/spaces/bigscience-catalogue-lm-data/bigscience-corpus for a local copy; the app.py file is a Streamlit app. You could also create your own private Space with the same files, hosted in your own namespace, if you'd like to experiment without a local environment.

BigScience Data org
edited Jul 20, 2022

@cakiki thanks for the pointers! A follow-up question: I'm now browsing data samples along with their metadata, but the metadata field comes in the following format:

'meta': "{'file': 'en/2003/isba/9/fc/2.xml'}"

How can I fetch the metadata file? Thanks!

BigScience Data org

This will be specific to the dataset you're looking at, but I suppose the file itself won't be accessible unless you go to the original dataset and download its raw files.

The metadata field is also dataset-specific (we didn't enforce a schema) and its content won't always be available; it will sometimes be empty.
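
As a side note on the format shown above: the meta value in that sample looks like a string holding a Python dict literal rather than a dict. A minimal sketch of turning it into a dict, assuming the string is always either empty or a Python literal (other datasets in the corpus may store it differently, since no schema was enforced). The parse_meta helper is a hypothetical name used only for illustration, not something that exists in the Space's app.py:

```python
import ast


def parse_meta(meta):
    """Return the meta field as a dict.

    Hypothetical helper: falls back to an empty dict when the field is
    missing, empty, or not a Python dict literal.
    """
    if isinstance(meta, dict):
        return meta
    if not meta:
        return {}
    try:
        # literal_eval only evaluates literals, so it is safer than eval()
        parsed = ast.literal_eval(meta)
    except (ValueError, SyntaxError):
        return {}
    return parsed if isinstance(parsed, dict) else {}


# The sample value quoted earlier in this thread
sample = {"meta": "{'file': 'en/2003/isba/9/fc/2.xml'}"}
meta = parse_meta(sample["meta"])
print(meta.get("file"))  # prints en/2003/isba/9/fc/2.xml
```

This only recovers the path recorded in the metadata; the file it points to would still have to be downloaded from the original dataset.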

christopher changed discussion status to closed
