BigCodeBench: Benchmarking Large Language Models on Solving Practical and Challenging Programming Tasks
Track, rank and evaluate open LLMs' CoT quality
View how beam search decoding works, in detail!
Jailbreak the LLM and privacy guardrails