Viewing samples of individual documents

by ola13 - opened Jul 19, 2022

BigScience Data org Jul 19, 2022

Hi there, I would like to view samples of individual documents from the corpus together with their metadata. Is this possible with the existing tools? So far I have only been able to get insights at the dataset level, but perhaps I'm missing something obvious. Thanks!

christopher

BigScience Data org Jul 19, 2022 • edited Jul 19, 2022

Hi Ola! That's currently possible with this tool: https://huggingface.co/spaces/bigscience-catalogue-lm-data/dataset-explorer

EDIT: The tool I linked doesn't currently display metadata, but it should be straightforward to have it display the meta field, which is enforced to be present in every dataset of the corpus.

ola13

BigScience Data org Jul 19, 2022

Do you know how I can work on the tool itself locally to add the extra fields I need? I haven't used Spaces before, so I'm not sure how the development process works.

christopher

BigScience Data org Jul 19, 2022

You can git clone https://huggingface.co/spaces/bigscience-catalogue-lm-data/bigscience-corpus for a local copy; the app.py file is a Streamlit app. You could also create your own private Space with the same files, hosted in your own namespace, if you'd like to experiment without a local environment.
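For local experiments, here is a minimal sketch of how extra fields could be surfaced in a Streamlit app. It is not taken from the Space's actual app.py: `select_fields`, `render`, and the `text` field name are hypothetical, and only `meta` comes from this thread. `st.json` is a standard Streamlit call for displaying a dict.

```python
def select_fields(record, fields=("text", "meta")):
    """Keep only the fields we want to show; missing fields become None.

    The field names are assumptions for illustration ("meta" is the one
    mentioned in this thread, "text" is hypothetical).
    """
    return {name: record.get(name) for name in fields}


def render(records):
    """Show each record's selected fields in a Streamlit page."""
    # Imported lazily so select_fields can be used without Streamlit installed.
    import streamlit as st

    for record in records:
        st.json(select_fields(record))
```

If something like this were wired into app.py, running `streamlit run app.py` from the cloned directory would serve the app locally.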

ola13

BigScience Data org Jul 20, 2022 • edited Jul 20, 2022

@cakiki thanks for the pointers! A follow-up question: I'm now browsing data samples along with their metadata, but the metadata field comes in the following format:

'meta': "{'file': 'en/2003/isba/9/fc/2.xml'}"

How can I fetch the metadata file? Thanks!

christopher

BigScience Data org Jul 21, 2022

This will be specific to the dataset you're looking at. The file itself won't be accessible unless you go to the original dataset and download its raw files, I suppose.

The metadata field will also be dataset-specific (we didn't enforce a schema) and won't always be available; it will sometimes be empty.
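Since the 'meta' value quoted earlier in this thread is a string holding a Python-style dict, one way to read it is `ast.literal_eval` from the standard library. This is only a sketch: `parse_meta` is a hypothetical helper, and it assumes the string is a plain dict literal, which may not hold for every dataset. Empty or unparseable values fall back to an empty dict.

```python
import ast


def parse_meta(raw):
    """Return the meta field as a dict, or {} if it is empty or not a dict literal."""
    if isinstance(raw, dict):
        return raw
    if not raw:
        return {}
    try:
        # literal_eval only evaluates literals, so it is safe on untrusted strings.
        value = ast.literal_eval(raw)
    except (ValueError, SyntaxError, TypeError):
        return {}
    return value if isinstance(value, dict) else {}


# The sample value quoted earlier in this thread.
sample = {"meta": "{'file': 'en/2003/isba/9/fc/2.xml'}"}
meta = parse_meta(sample["meta"])
print(meta.get("file"))  # en/2003/isba/9/fc/2.xml
```

The resulting path is still relative to the original dataset's raw files, so it only helps once those have been downloaded.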

christopher changed discussion status to closed Jul 2
