Replicating StarCoder Search Index

#84
by Brendan - opened

Hi all, I saw that the StarCoder search space has been removed. I'm interested in reconstructing the search index on my own using the starcoderdata repo. Is there code available somewhere for replicating this, and is there a rough estimate of the compute requirements for the Elasticsearch index? I would not be serving the index to outside requesters, but I would want to be able to complete a few hundred to a few thousand searches in an hour. I've used the Elasticsearch index utilities in HF datasets before, but I don't know how that particular index was created, or whether any configuration would be necessary to get things running efficiently.

Hi Brendan! The search space actually lives here: https://huggingface.co/spaces/bigcode/search (We're currently migrating the backend, so search isn't available right now, but it will be again very soon.)

If you're pressed for time, the indexing code lives here, but it isn't really optimized yet, and you'd most likely need to tweak things to fit your needs.

Most pressing hardware requirement would be disk space (the index needs to fit on disk), and then after that deciding how to shard the data, and whether to tune for indexing or search:
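To make the indexing-versus-search trade-off concrete, here is a minimal sketch, assuming the index is built on Elasticsearch as the question implies. It is not the code used for the original index. The helper name `index_settings` and the shard and replica counts are hypothetical, chosen only for illustration; `number_of_shards`, `number_of_replicas`, and `refresh_interval` are standard Elasticsearch index settings.

```python
import json


def index_settings(num_shards: int, bulk_loading: bool, search_replicas: int = 1) -> dict:
    """Build an index-settings body (hypothetical helper, illustrative values).

    bulk_loading=True favors indexing throughput; False favors search.
    """
    if num_shards < 1:
        raise ValueError("num_shards must be >= 1")
    if search_replicas < 0:
        raise ValueError("search_replicas must be >= 0")

    # The primary shard count is chosen when the index is created.
    index = {"number_of_shards": num_shards}

    if bulk_loading:
        # Disable periodic refreshes and replicas while loading data,
        # so the cluster spends its effort on writing segments.
        index["refresh_interval"] = "-1"
        index["number_of_replicas"] = 0
    else:
        # Restore periodic refreshes (1s is the Elasticsearch default)
        # and replicas; on a single node, 0 replicas may be appropriate.
        index["refresh_interval"] = "1s"
        index["number_of_replicas"] = search_replicas

    return {"settings": {"index": index}}


if __name__ == "__main__":
    # Settings one might pass when creating the index before a bulk load.
    print(json.dumps(index_settings(num_shards=4, bulk_loading=True), indent=2))
```

The inner `index` block for the search case could then be applied as a settings update once loading finishes. The primary shard count would be left out of that update, since it cannot be changed on an existing index.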

PS: Where did you find the space link you mentioned? Is it displayed somewhere?

Ah I see, thanks for the help, these are great!

Brendan changed discussion status to closed

Ah, sorry, I had missed your question. It was on this page: https://huggingface.co/bigcode/starcoder#attribution--other-requirements
