Company

benchmark

Tracked in 13 AFBytes stories. First seen May 28, 2026. Last seen Jun 02, 2026.

Recent coverage

arxiv.org · Jun 2, 2026 04:00 UTC

[2606.01936] What to Format and How: A Benchmark and Workflow Approach for Document Formatting

Abstract page for arXiv paper 2606.01936: What to Format and How: A Benchmark and Workflow Approach for Document Formatting

science tech

arxiv.org · Jun 2, 2026 04:00 UTC

[2606.02082] Overview of the ClinicalSkillQA 2026 Shared Task on Continuous Perception and Procedural Reasoning in Clinical Skill Assessment

Abstract page for arXiv paper 2606.02082: Overview of the ClinicalSkillQA 2026 Shared Task on Continuous Perception and Procedural Reasoning in Clinical Skill A...

science tech

arxiv.org · Jun 2, 2026 04:00 UTC

[2606.02246] Ego-METAS: Egocentric online Multimodal Energy-efficient Temporal Action Segmentation benchmark

Abstract page for arXiv paper 2606.02246: Ego-METAS: Egocentric online Multimodal Energy-efficient Temporal Action Segmentation benchmark

science tech

arxiv.org · Jun 2, 2026 04:00 UTC

[2606.02404] K-BrowseComp: A Web Browsing Agent Benchmark Grounded in Korean Contexts

Abstract page for arXiv paper 2606.02404: K-BrowseComp: A Web Browsing Agent Benchmark Grounded in Korean Contexts

science tech

arxiv.org · Jun 2, 2026 04:00 UTC

[2606.02443] PaSBench-Video: A Streaming Video Benchmark for Proactive Safety Warning

Abstract page for arXiv paper 2606.02443: PaSBench-Video: A Streaming Video Benchmark for Proactive Safety Warning

science tech

arxiv.org · Jun 1, 2026 04:00 UTC

[2605.31351] A Visually Impaired Assistance Benchmark for VLM-as-a-Judge Evaluation

Abstract page for arXiv paper 2605.31351: A Visually Impaired Assistance Benchmark for VLM-as-a-Judge Evaluation

science tech

arxiv.org · Jun 1, 2026 04:00 UTC

[2605.31113] TSM-Bench: Detecting LLM-Generated Text in Real-World Wikipedia Editing Practices

Abstract page for arXiv paper 2605.31113: TSM-Bench: Detecting LLM-Generated Text in Real-World Wikipedia Editing Practices

science tech

ventureburn.com · May 29, 2026 15:22 UTC

Pax AI Cuts Crime 27% and Raises $40M Seed Funding

Pax, the frontier AI public safety company, cut crime by 27% in six months and raised $40M in seed funding from Greenoaks and Benchmark.

tech business

arxiv.org · May 29, 2026 04:00 UTC

[2605.29893] Redundant or Necessary? A Benchmark for Detecting Redundant Steps in Agent Trajectories

Abstract page for arXiv paper 2605.29893: Redundant or Necessary? A Benchmark for Detecting Redundant Steps in Agent Trajectories

science tech

arxiv.org · May 29, 2026 04:00 UTC

[2605.30284] ProjectionBench: Evaluating Scientific Hypothesis Generation in LLMs Under Progressive Information Disclosure

Abstract page for arXiv paper 2605.30284: ProjectionBench: Evaluating Scientific Hypothesis Generation in LLMs Under Progressive Information Disclosure

science tech

arxiv.org · May 29, 2026 04:00 UTC

[2605.29462] Benchmarking Large Vision-Language Models on CFMME: A Comprehensive Chinese Financial Multimodal Evaluation Dataset

Abstract page for arXiv paper 2605.29462: Benchmarking Large Vision-Language Models on CFMME: A Comprehensive Chinese Financial Multimodal Evaluation Dataset

science tech

arxiv.org · May 28, 2026 04:00 UTC

[2604.00913] Benchmarking and Mechanistic Analysis of Vision-Language Models for Cross-Depiction Assembly Instruction Alignment

Abstract page for arXiv paper 2604.00913: Benchmarking and Mechanistic Analysis of Vision-Language Models for Cross-Depiction Assembly Instruction Alignment

science tech

arxiv.org · May 28, 2026 04:00 UTC

[2605.28721] LiveBrowseComp: Are Search Agents Searching, or Just Verifying What They Already Know?

Abstract page for arXiv paper 2605.28721: LiveBrowseComp: Are Search Agents Searching, or Just Verifying What They Already Know?

science tech

Related entities

arxiv · other
tech · other
productivity · other
healthcare-ai · other
research · other
action segmentation · other
egocentric · other
research paper · other
agents · other
ai · other
safety · other
