in-the-stack / app.py

lewtun HF staff

Relax "LLMs" to "machine learning models"

845f83e almost 2 years ago


	from datasets import load_dataset
	import streamlit as st
	from huggingface_hub import hf_hub_download
	import gzip
	import json


	@st.cache(allow_output_mutation=True)
	def load_all_usernames():
	    filepath = hf_hub_download(repo_id="bigcode/the-stack-username-to-repo", filename="username_to_repo.json.gz", repo_type="dataset")

	    with gzip.open(filepath, 'r') as f:
	        usernames = json.loads(f.read().decode('utf-8'))
	    return usernames

	st.image("./banner.png", use_column_width=True)

	st.markdown("_The Stack is an open governance interface between the AI and open source communities._")
	st.title("Am I in The Stack?")
	st.markdown("As part of the BigCode project, we released and maintain [The Stack](https://huggingface.co/datasets/bigcode/the-stack), a 3.1 TB dataset of permissively licensed source code in 30 programming languages. One of our goals in this project is to give the people who wrote this source code a choice as to whether or not it can be employed to develop and evaluate machine learning models, as we acknowledge that not all developers may wish to have their data used for that purpose.")

	st.markdown("This tool lets you check if a repository under a given username is part of The Stack dataset. Would you like to have your data removed from future versions of The Stack? You can opt out by following the instructions [here](https://www.bigcode-project.org/docs/about/the-stack/#how-can-i-request-that-my-data-be-removed-from-the-stack).")

	usernames = load_all_usernames()
	username = st.text_input("Your GitHub Username:")

	if st.button("Check!"):
	    if username in usernames:
	        repos = usernames[username]
	        repo_word = "repository" if len(repos) == 1 else "repositories"
	        st.markdown(f"Yes, there is code from {len(repos)} {repo_word} in The Stack:")
	        for repo_name in repos:
	            st.markdown(f"`{repo_name}`")
	    else:
	        st.markdown("No, your code is not in The Stack.")
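
The lookup above can also be exercised without Streamlit or a network download. The following is a minimal sketch, assuming the data file is a gzip-compressed JSON object mapping each username to a list of repository names (which is how `app.py` uses it). The helper names `load_username_map`, `describe_membership`, and `write_sample_map`, as well as the sample usernames and repositories, are hypothetical and are not part of the app.

```python
import gzip
import json
import os
import tempfile


def load_username_map(path):
    """Read a gzip-compressed JSON file mapping username -> list of repo names."""
    # Text mode ("rt") lets json.load decode UTF-8 directly.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return json.load(f)


def describe_membership(username, username_to_repos):
    """Return the lines the app would display for a username (hypothetical helper)."""
    if username not in username_to_repos:
        return ["No, your code is not in The Stack."]
    repos = username_to_repos[username]
    repo_word = "repository" if len(repos) == 1 else "repositories"
    lines = [f"Yes, there is code from {len(repos)} {repo_word} in The Stack:"]
    lines.extend(f"`{repo_name}`" for repo_name in repos)
    return lines


def write_sample_map(path):
    """Write a small, made-up mapping in the assumed file format."""
    sample = {
        "octocat": ["octocat/hello-world"],
        "alice": ["alice/a", "alice/b"],
    }
    with gzip.open(path, "wt", encoding="utf-8") as f:
        json.dump(sample, f)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp_dir:
        sample_path = os.path.join(tmp_dir, "username_to_repo.json.gz")
        write_sample_map(sample_path)
        mapping = load_username_map(sample_path)
    for name in ("octocat", "alice", "nobody"):
        print("\n".join(describe_membership(name, mapping)))
```

Keeping the message formatting in a plain function like this makes the singular/plural wording and the "not found" branch easy to check in isolation; the Streamlit script would then only need to pass each returned line to `st.markdown`.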