Asankhaya Sharma (codelion)

AI & ML interests: AI/ML, Dev Tools and Application Security


codelion's activity

posted an update 3 days ago
WorkerSafetyQAEval: A new benchmark to evaluate question answering in the worker safety domain

Happy to share a new question-answering benchmark for the worker safety domain. The benchmark and leaderboard are available at codelion/worker-safety-qa-eval

We evaluate popular general-purpose chatbots like ChatGPT and HuggingChat on WorkerSafetyQAEval and compare them with a domain-specific RAG bot, Securade.ai Safety Copilot (codelion/safety-copilot). The comparison highlights the importance of domain-specific knowledge for critical domains like worker safety that require high accuracy. Securade.ai Safety Copilot achieves ~97% on the benchmark, setting a new SOTA.
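For anyone who wants to reproduce this kind of comparison, here is a minimal sketch of an evaluation loop, assuming the benchmark is published as a Hugging Face dataset; the split and field names ("question", "answer") are assumptions, so check the codelion/worker-safety-qa-eval page for the actual schema and scoring method.

```python
# Hedged sketch: score a chatbot on the benchmark.
# Assumptions: split name and "question"/"answer" fields.
from datasets import load_dataset

ds = load_dataset("codelion/worker-safety-qa-eval", split="test")

def ask_bot(question: str) -> str:
    # Placeholder: call the chatbot under evaluation
    # (ChatGPT, HuggingChat, Safety Copilot, ...) here.
    raise NotImplementedError

correct = 0
for row in ds:
    prediction = ask_bot(row["question"])
    # Naive substring match; the actual benchmark likely uses a
    # more robust scoring method.
    if row["answer"].strip().lower() in prediction.strip().lower():
        correct += 1

print(f"Accuracy: {correct / len(ds):.2%}")
```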

You can read more about the Safety Copilot at https://securade.ai/blog/how-securade-ai-safety-copilot-transforms-worker-safety.html
posted an update 18 days ago
After the announcements yesterday, I got a chance to try the new gemini-1.5-flash model from @goog1e. It is almost as good as gpt-4o on StaticAnalysisEval (patched-codes/static-analysis-eval), and it is also a bit faster than gpt-4o and much cheaper.

I did run into a recitation flag with one example in the dataset, where the API refused to fix the vulnerability and flagged the input as using copyrighted content. This is something you cannot disable even with the safety filters, and it seems to be an existing bug: https://issuetracker.google.com/issues/331677495
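For reference, here is a minimal sketch of how the flag surfaces with the google-generativeai Python SDK (API surface as of mid-2024; treat the details as assumptions). Even with every configurable safety category relaxed to BLOCK_NONE, the response can still come back with a RECITATION finish reason, because recitation is not one of the configurable categories.

```python
import google.generativeai as genai
from google.generativeai.types import HarmBlockThreshold, HarmCategory

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-flash-latest")

response = model.generate_content(
    "Fix the vulnerability in this code: ...",  # offending example elided
    # Relaxing all configurable categories does not help: recitation
    # is a separate, non-configurable check.
    safety_settings={
        HarmCategory.HARM_CATEGORY_HARASSMENT: HarmBlockThreshold.BLOCK_NONE,
        HarmCategory.HARM_CATEGORY_HATE_SPEECH: HarmBlockThreshold.BLOCK_NONE,
        HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT: HarmBlockThreshold.BLOCK_NONE,
        HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: HarmBlockThreshold.BLOCK_NONE,
    },
)

candidate = response.candidates[0]
if candidate.finish_reason.name == "RECITATION":
    # No text is returned; the input was flagged as copyrighted content.
    print("Blocked for recitation; no safety setting disables this.")
else:
    print(response.text)
```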

But overall you get gpt-4o-level performance for 7% of the price, so we are thinking of making it the default in patchwork - https://github.com/patched-codes/patchwork. You can use the google_api_key and model options to choose gemini-1.5-flash-latest when running patchwork.
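For example, a run could look like this (AutoFix is used as an illustrative patchflow name; substitute whichever patchflow you want to run):

```bash
patchwork AutoFix model=gemini-1.5-flash-latest google_api_key=YOUR_GEMINI_API_KEY
```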
replied to their post 19 days ago

At the moment we do not have any multimodal examples in the benchmark. The focus has been on vulnerability remediation, and I cannot think of a way to utilize multimodality in coding-related tasks. Do you have any ideas on how multimodality could be exploited for something like coding?

posted an update 19 days ago
The new gpt-4o model seems to be a very good coder. OpenAI reported a 90+ score on openai_humaneval

We tried the new model on our patched-codes/static-analysis-eval, which evaluates models on vulnerability remediation. gpt-4o has reclaimed the top spot on our leaderboard (from meta-llama/Meta-Llama-3-70B-Instruct).

You can now use the new model with our open-source framework PatchWork - https://github.com/patched-codes/patchwork - by passing model=gpt-4o on the CLI.
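For example (AutoFix is an illustrative patchflow name):

```bash
patchwork AutoFix model=gpt-4o openai_api_key=YOUR_OPENAI_API_KEY
```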
replied to Jaward's post about 1 month ago

Great, thanks - I would love to see the kind of output it produces directly. We have been trying to automate agentic workflows using an open-source framework called patchwork - https://github.com/patched-codes/patchwork

It is more deterministic, and since we are focusing only on specific workflows, we would love to compare it with something like Devin.

posted an update about 1 month ago
Happy to announce patchwork, an open-source framework to turbocharge DevOps - https://github.com/patched-codes/patchwork

You can use it to build patchflows - workflows that use LLMs for software development tasks like bug fixing, pull request review, library migration and documentation.
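To give a flavor of the CLI, invocations look roughly like the following; the patchflow names below are illustrative, so check the README for the actual bundled patchflows and their options.

```bash
patchwork AutoFix openai_api_key=YOUR_KEY         # fix bugs and vulnerabilities
patchwork PRReview openai_api_key=YOUR_KEY        # review a pull request
patchwork GenerateREADME openai_api_key=YOUR_KEY  # generate documentation
```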

Supports any LLM of your choice including our own MoE model - patched-codes/patched-mix-4x7B

Give it a try!
replied to Jaward's post about 1 month ago

Can you share the apps that it created?

replied to WizardLM's post about 2 months ago

The weights seem to have been taken down?

posted an update about 2 months ago
We just released a new MoE model (meraGPT/mera-mix-4x7B) that is half as large as Mixtral-8x7B while still being competitive with it across different benchmarks. mera-mix-4x7B achieves 76.37 on the Open LLM Leaderboard eval.

You can check out mera-mix-4x7B on HF here - meraGPT/mera-mix-4x7B
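A minimal loading sketch with transformers is below; the dtype, device placement, and prompt are assumptions, so check the model card for the recommended usage.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meraGPT/mera-mix-4x7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # assumption: bf16 weights fit your GPU(s)
    device_map="auto",
)

inputs = tokenizer("The key to worker safety is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```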