How does the embedding model perform on code text?

#18
by chonghao-cc - opened

Hi Community, first of all, thanks for developing this very nice model. I wonder whether the model has been trained on code files and how well it does at representing code. If not, do you have any recommendations for embedding code files?

I've tried to use instructor-large for code search, but it did not seem to work too well.

In the case of code, what worked better for me is parsing it with tree-sitter, then indexing the resulting code constructs in a meaningful way. You don't need embeddings for that, only an SQL server or a classic full-text search indexing solution.
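A minimal sketch of that approach, using Python's built-in `ast` module as a stand-in for tree-sitter and an in-memory SQLite table with `LIKE` queries as a stand-in for a full-text index (the sample source and all names are illustrative, not from AskYourCode):

```python
import ast
import sqlite3

# Illustrative source file to index.
SOURCE = '''
def add(a, b):
    """Return the sum of a and b."""
    return a + b

class Greeter:
    def greet(self, name):
        """Say hello."""
        return "Hello, " + name
'''

# Parse the source and collect named constructs (functions and classes).
tree = ast.parse(SOURCE)
constructs = []
for node in ast.walk(tree):
    if isinstance(node, (ast.FunctionDef, ast.ClassDef)):
        constructs.append(
            (node.name, type(node).__name__, ast.get_docstring(node) or "")
        )

# Index the constructs; a plain LIKE query stands in for full-text search.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE code (name TEXT, kind TEXT, doc TEXT)")
conn.executemany("INSERT INTO code VALUES (?, ?, ?)", constructs)

rows = conn.execute(
    "SELECT name, kind FROM code WHERE name LIKE ? OR doc LIKE ?",
    ("%greet%", "%greet%"),
).fetchall()
print(rows)
```

No embedding model is involved at any point; the search quality comes from the structure the parser recovers, not from vector similarity.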

You can still embed the documentation files (README.md, .txt files, PDF and DOC contents, etc.) and all the comments (parsed out by tree-sitter) from the source code files, and provide better search based on that.
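Extracting the comments for embedding can be sketched with the standard-library `tokenize` module (again standing in for tree-sitter; the sample source is illustrative, and the resulting strings would be fed to whatever embedding model you choose):

```python
import io
import tokenize

# Illustrative source file whose comments we want to embed.
SOURCE = '''# Utility helpers for parsing.
def parse(line):
    # Normalize whitespace before splitting.
    return line.strip().split(",")
'''

# Collect all comment tokens; these strings are what you would embed.
comments = [
    tok.string.lstrip("# ")
    for tok in tokenize.generate_tokens(io.StringIO(SOURCE).readline)
    if tok.type == tokenize.COMMENT
]
print(comments)
```

The same pipeline applies to docstrings and prose documentation files; only the natural-language text goes through the embedding model, while the code itself stays in the structural index.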

My project is the AskYourCode plugin for ChatGPT: https://askyourcode.ai

@viktor-ferenczi One issue here could be that the T5 models don't have brackets in their vocab.
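A toy illustration of that failure mode, using a whitespace tokenizer with a fixed vocabulary (not the actual T5/SentencePiece tokenizer): every token missing from the vocab collapses to the same `<unk>` id, so bracket-heavy code loses most of its structure before the model ever sees it.

```python
# Toy vocabulary with no bracket or punctuation tokens.
VOCAB = {"def": 0, "return": 1, "x": 2, "<unk>": 3}

def toy_tokenize(text):
    # Map each whitespace-separated token to its id, or <unk> if absent.
    return [VOCAB.get(tok, VOCAB["<unk>"]) for tok in text.split()]

ids = toy_tokenize("def f ( x ) : return x")
print(ids)
```

Here the brackets, the colon, and the function name all map to the same `<unk>` id, so the embedding cannot distinguish `def f(x)` from any other definition with the same keywords.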

Thanks Banso for the excellent insight; that would explain the difficulties. I've replaced embedding-based search over the code with simple free-text search, which is very fast and a good way to find anything not indexed by the tree-sitter pass. It is not ideal for source code comments, but that's a trade-off. Producing those embedding vectors required a GPU, which made hosting the backend of my plugin expensive.

That's a great invitation to try out https://embaas.io :)

It shouldn't be hard to add the brackets to the vocabulary and train on a small sample. I will try it out.
