Replicating StarCoder Search Index
Hi all, I saw the StarCoder search space has been removed. I'm interested in reconstructing the search index myself from the starcoderdata repo. Is there code available somewhere for replicating it, and is there a rough estimate of the compute requirements for the Elasticsearch index? I would not be serving the index to outside requesters, but I would want to be able to complete a few hundred to a few thousand searches per hour. I've used the Elasticsearch index utilities in HF datasets before, but I don't know how that particular index was created or whether any specific configuration is needed to get things running efficiently.
Hi Brendan! The search space actually lives here: https://huggingface.co/spaces/bigcode/search (We're migrating the backend, so search is temporarily unavailable, but it will be back very soon.)
If you're pressed for time, the indexing code lives here. It is not really optimized yet, and you'd most likely need to tweak things to fit your needs.
The most pressing hardware requirement is disk space (the index needs to fit on disk). After that, you need to decide how to shard the data and whether to tune for indexing speed or search speed:
- https://www.elastic.co/guide/en/elasticsearch/reference/current/size-your-shards.html
- https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-search-speed.html
- https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-indexing-speed.html
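In case it helps with planning, here is a rough sketch of the arithmetic behind those guides. It is not the code used for the BigCode index. It assumes the general Elastic guidance of keeping shards somewhere between 10 GB and 50 GB, and the common bulk-loading approach of disabling refresh and replicas while indexing and restoring them afterwards. The function names, the 30 GB target, and the 600 GB index size are all placeholders for illustration; measure your own index size on a sample before committing to a shard count.

```python
import math


def estimate_primary_shards(index_size_gb: float, target_shard_gb: float = 30.0) -> int:
    """Return a primary shard count so each shard is roughly target_shard_gb.

    target_shard_gb is an assumed midpoint of the 10-50 GB range suggested
    in the Elastic shard sizing guide; adjust it for your hardware.
    """
    if index_size_gb <= 0 or target_shard_gb <= 0:
        raise ValueError("sizes must be positive")
    return max(1, math.ceil(index_size_gb / target_shard_gb))


def bulk_load_settings(num_shards: int) -> dict:
    """Index settings tuned for a one-off bulk load (indexing speed)."""
    return {
        "index": {
            "number_of_shards": num_shards,  # static: must be set at index creation
            "number_of_replicas": 0,         # no replica writes while loading
            "refresh_interval": "-1",        # disable periodic refresh while loading
        }
    }


def serving_settings(num_replicas: int = 0) -> dict:
    """Dynamic settings to apply once the bulk load has finished."""
    return {
        "index": {
            "number_of_replicas": num_replicas,
            "refresh_interval": "1s",  # Elasticsearch's default refresh interval
        }
    }


def queries_per_second(searches_per_hour: float) -> float:
    """Convert an hourly search budget into an average query rate."""
    return searches_per_hour / 3600.0


if __name__ == "__main__":
    # 600 GB is a made-up placeholder, not a measurement of starcoderdata.
    shards = estimate_primary_shards(600.0)
    print("primary shards:", shards)
    print("bulk settings:", bulk_load_settings(shards))
    print("serving settings:", serving_settings())
    print("average QPS at 2000 searches/hour:", queries_per_second(2000))
```

A few thousand searches per hour averages out to under one query per second, so for a private index like this the bulk-load settings and disk space are likely to matter more than search-side tuning.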
PS: Where did you find the space link you mentioned? Is it displayed somewhere?
Ah I see, thanks for the help, these are great!
Ah sorry, I had missed your question. It was on this page: https://huggingface.co/bigcode/starcoder#attribution--other-requirements