[2511.19829] A Unified Evaluation-Instructed Framework for Query-Dependent Prompt Optimization
Abstract page for arXiv paper 2511.19829: A Unified Evaluation-Instructed Framework for Query-Dependent Prompt Optimization
America Forever Bytes
Other
Abstract page for arXiv paper 2511.20409: NormEval: A Unified Multi-Metric Framework for Evaluating Semantic Fidelity in Text Normalization
Abstract page for arXiv paper 2606.01811:
Abstract page for arXiv paper 2606.01973: A Closer Look at In-Distribution vs. Out-of-Distribution Accuracy for Open-Set Test-time Adaptation
Abstract page for arXiv paper 2606.02016: Evaluating Real-World Generalizability of Algorithm Selection Models
Abstract page for arXiv paper 2606.02095: Testing Decision Makers without Counterfactuals
Abstract page for arXiv paper 2606.02302: SeClaw: Spec-Driven Security Task Synthesis for Evaluating Autonomous Agents
Abstract page for arXiv paper 2504.11972: Reassessing Extractive QA Datasets at Scale: LLM-as-a-Judge and In-Depth Analyses
Abstract page for arXiv paper 2605.31545: Preference-Aware Rubric Learning for Personalized Evaluation
Abstract page for arXiv paper 2605.31186: How well does Classification Accuracy capture Concept Drift Detection Quality? An overview of Concept Drift Detection ...
Abstract page for arXiv paper 2605.30557: Seeing Isn't Knowing: Do VLMs Know When Not to Answer Spatial Questions (and Why)?
An application response I wrote! Feel free to leave feedback! What are you most concerned about when it comes to risks from AI? …
Mallika Rao explains how evaluation debt silently triggers regressions in distributed AI systems. She shares a five-layer evaluation stack to align metrics dire...
Abstract page for arXiv paper 2605.28966: The Trust Paradox: How CS Researchers Engage LLM Leaderboards
Abstract page for arXiv paper 2605.29786: Croissant Tasks: A Metadata Format for Reproducible Machine Learning Evaluations
Abstract page for arXiv paper 2605.28882: GrowLoop: Self-Evolving Conversation Evaluation Seeded by Human
Abstract page for arXiv paper 2510.22016: Cost-Sensitive Evaluation for Binary Classifiers
Abstract page for arXiv paper 2605.27463: When prompt perturbations break your A/B test: A valid statistical test for generative surveying
Abstract page for arXiv paper 2605.28565: Verified Misguidance: Measuring Structural Citation Failures in Search-Augmented LLMs
Abstract page for arXiv paper 2605.28591: Models That Know How Evaluations Are Designed Score Safer
Abstract page for arXiv paper 2605.28602: Satisfiability Solving with LLMs: A Matched-Pair Evaluation of Reasoning Capability
Abstract page for arXiv paper 2605.28616: Measuring Form and Function in Language Models
Abstract page for arXiv paper 2605.28710: Towards Reliable Multilingual LLMs-as-a-Judge: An Empirical Study