Viewing samples of individual documents
Hi there, I would like to view samples of individual documents from the corpus with accompanying metadata. Is this possible with the existing tools? So far I have only been able to gain insights at the dataset level, but perhaps I'm missing something obvious. Thanks!
Hi Ola! That's currently possible with this tool: https://huggingface.co/spaces/bigscience-catalogue-lm-data/dataset-explorer
EDIT: The tool I linked doesn't currently display metadata, but it should be straightforward to have it display the meta field, which is enforced to be present in every dataset of the corpus.
Do you know how I can play with the tool itself to add the extra fields I need locally? I haven't used Spaces before so not sure how the development process works.
You can git clone https://huggingface.co/spaces/bigscience-catalogue-lm-data/bigscience-corpus for a local copy; the app.py file is a Streamlit app. You could also create your own private Space with the same files and have that hosted in your own namespace if you'd like to experiment without a local env.
@cakiki thanks for the pointers! A follow-up question: I'm browsing data samples along with metadata, but the metadata field comes in the following format:
'meta': "{'file': 'en/2003/isba/9/fc/2.xml'}"
How can I fetch the metadata file? Thanks!
This will be specific to the dataset you're looking at, but the file itself won't be accessible unless you look at the original dataset and download its raw files I suppose.
The metadata field will also be dataset-specific (we didn't enforce a schema) and won't always be available. (It will sometimes be empty.)
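If it helps, here is a minimal sketch for turning that meta string into a dict you can work with. It assumes the value is a Python-style dict literal (single quotes, as in the example above), so it uses ast.literal_eval rather than a JSON parser. The helper name parse_meta and the sample record are hypothetical, and the file key only applies to datasets that happen to provide it.

```python
import ast


def parse_meta(raw):
    """Return the meta value as a dict, or {} if it is empty or not a dict literal.

    Hypothetical helper: assumes meta is stored as the string form of a Python dict.
    """
    if isinstance(raw, dict):
        # Already parsed, nothing to do.
        return raw
    if not raw:
        # Covers None and the empty string (meta is sometimes empty).
        return {}
    try:
        parsed = ast.literal_eval(raw)
    except (ValueError, SyntaxError, TypeError):
        # Not a valid Python literal; treat as missing metadata.
        return {}
    return parsed if isinstance(parsed, dict) else {}


# Hypothetical sample record; only the meta value is taken from the thread.
sample = {"text": "...", "meta": "{'file': 'en/2003/isba/9/fc/2.xml'}"}
meta = parse_meta(sample["meta"])
print(meta.get("file"))  # en/2003/isba/9/fc/2.xml
```

The resulting path is only a reference into the original dataset's raw files, so you would still need to download those separately to open the file itself.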