WorkArena++: Towards Compositional Planning and Reasoning-based Common Knowledge Work Tasks Paper • 2407.05291 • Published Jul 7, 2024 • 2
RepLiQA: A Question-Answering Dataset for Benchmarking LLMs on Unseen Reference Content Paper • 2406.11811 • Published Jun 17, 2024 • 16
Regions of Reliability in the Evaluation of Multivariate Probabilistic Forecasts Paper • 2304.09836 • Published Apr 19, 2023
Capture the Flag: Uncovering Data Insights with Large Language Models Paper • 2312.13876 • Published Dec 21, 2023 • 1
Lag-Llama: Towards Foundation Models for Probabilistic Time Series Forecasting Paper • 2310.08278 • Published Oct 12, 2023 • 3
WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks? Paper • 2403.07718 • Published Mar 12, 2024 • 1
byteSteady: Fast Classification Using Byte-Level n-Gram Embeddings Paper • 2106.13302 • Published Jun 24, 2021