Other

evaluation

Tracked in 23 AFBytes stories. First seen May 28, 2026. Last seen Jun 02, 2026.

Recent coverage

arxiv.org · Jun 2, 2026 04:00 UTC

[2511.19829] A Unified Evaluation-Instructed Framework for Query-Dependent Prompt Optimization

Abstract page for arXiv paper 2511.19829: A Unified Evaluation-Instructed Framework for Query-Dependent Prompt Optimization

science tech

arxiv.org · Jun 2, 2026 04:00 UTC

[2511.20409] NormEval: A Unified Multi-Metric Framework for Evaluating Semantic Fidelity in Text Normalization

Abstract page for arXiv paper 2511.20409: NormEval: A Unified Multi-Metric Framework for Evaluating Semantic Fidelity in Text Normalization

science tech

arxiv.org · Jun 2, 2026 04:00 UTC

[2606.01811] "I've Seen How This Goes": Characterizing Diversity via Progressive Conditional Surprise

Abstract page for arXiv paper 2606.01811: "I've Seen How This Goes": Characterizing Diversity via Progressive Conditional Surprise

science tech

arxiv.org · Jun 2, 2026 04:00 UTC

[2606.01973] A Closer Look at In-Distribution vs. Out-of-Distribution Accuracy for Open-Set Test-time Adaptation

Abstract page for arXiv paper 2606.01973: A Closer Look at In-Distribution vs. Out-of-Distribution Accuracy for Open-Set Test-time Adaptation

science tech

arxiv.org · Jun 2, 2026 04:00 UTC

[2606.02016] Evaluating Real-World Generalizability of Algorithm Selection Models

Abstract page for arXiv paper 2606.02016: Evaluating Real-World Generalizability of Algorithm Selection Models

science tech

arxiv.org · Jun 2, 2026 04:00 UTC

[2606.02095] Testing Decision Makers without Counterfactuals

Abstract page for arXiv paper 2606.02095: Testing Decision Makers without Counterfactuals

science tech

arxiv.org · Jun 2, 2026 04:00 UTC

[2606.02302] SeClaw: Spec-Driven Security Task Synthesis for Evaluating Autonomous Agents

Abstract page for arXiv paper 2606.02302: SeClaw: Spec-Driven Security Task Synthesis for Evaluating Autonomous Agents

science tech

arxiv.org · Jun 1, 2026 04:00 UTC

[2504.11972] Reassessing Extractive QA Datasets at Scale: LLM-as-a-Judge and In-Depth Analyses

Abstract page for arXiv paper 2504.11972: Reassessing Extractive QA Datasets at Scale: LLM-as-a-Judge and In-Depth Analyses

science tech

arxiv.org · Jun 1, 2026 04:00 UTC

[2605.31545] Preference-Aware Rubric Learning for Personalized Evaluation

Abstract page for arXiv paper 2605.31545: Preference-Aware Rubric Learning for Personalized Evaluation

science tech

arxiv.org · Jun 1, 2026 04:00 UTC

[2605.31186] How well does Classification Accuracy capture Concept Drift Detection Quality? An overview of Concept Drift Detection evaluation

Abstract page for arXiv paper 2605.31186: How well does Classification Accuracy capture Concept Drift Detection Quality? An overview of Concept Drift Detection ...

science tech

arxiv.org · Jun 1, 2026 04:00 UTC

[2605.30557] Seeing Isn't Knowing: Do VLMs Know When Not to Answer Spatial Questions (and Why)?

Abstract page for arXiv paper 2605.30557: Seeing Isn't Knowing: Do VLMs Know When Not to Answer Spatial Questions (and Why)?

science tech

lesswrong.com · May 31, 2026 19:31 UTC

Why I think evals are pretty important and most worth working on (for me) — LessWrong

An application response I wrote! Feel free to leave feedback! • • What are you most concerned about when it comes to risks from AI? …

science

infoq.com · May 29, 2026 12:00 UTC

Building Evals for AI Adoption: From Principles to Practice

Mallika Rao explains how evaluation debt silently triggers regressions in distributed AI systems. She shares a five-layer evaluation stack to align metrics dire...

tech

arxiv.org · May 29, 2026 04:00 UTC

[2605.28966] The Trust Paradox: How CS Researchers Engage LLM Leaderboards

Abstract page for arXiv paper 2605.28966: The Trust Paradox: How CS Researchers Engage LLM Leaderboards

science tech

arxiv.org · May 29, 2026 04:00 UTC

[2605.29786] Croissant Tasks: A Metadata Format for Reproducible Machine Learning Evaluations

Abstract page for arXiv paper 2605.29786: Croissant Tasks: A Metadata Format for Reproducible Machine Learning Evaluations

science tech

arxiv.org · May 29, 2026 04:00 UTC

[2605.28882] GrowLoop: Self-Evolving Conversation Evaluation Seeded by Human

Abstract page for arXiv paper 2605.28882: GrowLoop: Self-Evolving Conversation Evaluation Seeded by Human

science tech

arxiv.org · May 28, 2026 04:00 UTC

[2510.22016] Cost-Sensitive Evaluation for Binary Classifiers

Abstract page for arXiv paper 2510.22016: Cost-Sensitive Evaluation for Binary Classifiers

science

arxiv.org · May 28, 2026 04:00 UTC

[2605.27463] When prompt perturbations break your A/B test: A valid statistical test for generative surveying

Abstract page for arXiv paper 2605.27463: When prompt perturbations break your A/B test: A valid statistical test for generative surveying

science

arxiv.org · May 28, 2026 04:00 UTC

[2605.28565] Verified Misguidance: Measuring Structural Citation Failures in Search-Augmented LLMs

Abstract page for arXiv paper 2605.28565: Verified Misguidance: Measuring Structural Citation Failures in Search-Augmented LLMs

science

arxiv.org · May 28, 2026 04:00 UTC

[2605.28591] Models That Know How Evaluations Are Designed Score Safer

Abstract page for arXiv paper 2605.28591: Models That Know How Evaluations Are Designed Score Safer

science tech

arxiv.org · May 28, 2026 04:00 UTC

[2605.28602] Satisfiability Solving with LLMs: A Matched-Pair Evaluation of Reasoning Capability

Abstract page for arXiv paper 2605.28602: Satisfiability Solving with LLMs: A Matched-Pair Evaluation of Reasoning Capability

science tech

arxiv.org · May 28, 2026 04:00 UTC

[2605.28616] Measuring Form and Function in Language Models

Abstract page for arXiv paper 2605.28616: Measuring Form and Function in Language Models

science tech

arxiv.org · May 28, 2026 04:00 UTC

[2605.28710] Towards Reliable Multilingual LLMs-as-a-Judge: An Empirical Study

Abstract page for arXiv paper 2605.28710: Towards Reliable Multilingual LLMs-as-a-Judge: An Empirical Study

science tech

Related entities

ai · other
prompt engineering · other
research · other
nlp · technology
diversity · other
arxiv · other
surprise · other
metrics · other
Machine Learning · technology
adaptation · other
algorithms · other
ml · other
