VibeBench/VibeSearchBench

Most search and retrieval benchmarks are embarrassingly easy to game: clean queries, single-turn interactions, obvious ground truth. This project pushes back hard. Its 200 tasks are designed to mimic how real users actually search: vague initial intent, progressive disclosure across turns, persona-driven context shifts. Scoring is schema-free knowledge-graph evaluation with triplet F1, which makes it harder to cheat with surface-level string matching. The multi-turn and proactive-search angles are the differentiators: if your retrieval agent collapses when the user says something ambiguous on turn one, this benchmark will surface that failure mode quickly. The honest reservation: 200 tasks is a meaningful set but not a massive one, and the benchmark is newly launched, so community calibration and baseline comparisons are still thin. Worth watching how the leaderboard fills in over the next month. -> Best for: AI engineers building or evaluating a RAG pipeline or agentic search product
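
For readers unfamiliar with triplet F1, here is a minimal sketch of the general idea, not the benchmark's actual scorer. It assumes answers are reduced to (subject, relation, object) triplets and that two triplets match only if they are identical after lowercasing and whitespace cleanup. That exact-match rule is a simplifying assumption; the write-up above does not say how the benchmark's schema-free matching decides that two triplets agree. The names `normalize` and `triplet_f1` are hypothetical.

```python
from typing import Iterable, Tuple

Triplet = Tuple[str, str, str]  # (subject, relation, object)


def normalize(triplet: Triplet) -> Triplet:
    """Lowercase each field and collapse runs of whitespace."""
    subj, rel, obj = (" ".join(part.lower().split()) for part in triplet)
    return (subj, rel, obj)


def triplet_f1(predicted: Iterable[Triplet], gold: Iterable[Triplet]) -> float:
    """F1 over sets of triplets, using exact match after normalization.

    Assumption: exact matching stands in for whatever looser,
    schema-free matching a real scorer would apply.
    """
    pred_set = {normalize(t) for t in predicted}
    gold_set = {normalize(t) for t in gold}

    true_positives = len(pred_set & gold_set)
    if true_positives == 0:
        return 0.0  # also covers empty predictions or empty gold

    precision = true_positives / len(pred_set)
    recall = true_positives / len(gold_set)
    return 2 * precision * recall / (precision + recall)


# Illustrative data only, not taken from the benchmark.
gold = [
    ("Ada Lovelace", "born in", "London"),
    ("Ada Lovelace", "worked with", "Charles Babbage"),
]
predicted = [
    ("ada lovelace", "born in", "london"),       # matches after normalization
    ("Ada Lovelace", "worked with", "Alan Turing"),  # wrong object, no match
]

score = triplet_f1(predicted, gold)
```

Because the score is computed over structured facts rather than raw text, an answer that merely echoes keywords from the query earns nothing unless it yields the right triplets, which is the property the review credits with resisting surface-level string matching.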