[2606.01936] What to Format and How: A Benchmark and Workflow Approach for Document Formatting
Abstract page for arXiv paper 2606.01936: What to Format and How: A Benchmark and Workflow Approach for Document Formatting
Abstract page for arXiv paper 2606.02082: Overview of the ClinicalSkillQA 2026 Shared Task on Continuous Perception and Procedural Reasoning in Clinical Skill A...
Abstract page for arXiv paper 2606.02246: Ego-METAS: Egocentric online Multimodal Energy-efficient Temporal Action Segmentation benchmark
Abstract page for arXiv paper 2606.02404: K-BrowseComp: A Web Browsing Agent Benchmark Grounded in Korean Contexts
Abstract page for arXiv paper 2606.02443: PaSBench-Video: A Streaming Video Benchmark for Proactive Safety Warning
Abstract page for arXiv paper 2605.31351: A Visually Impaired Assistance Benchmark for VLM-as-a-Judge Evaluation
Abstract page for arXiv paper 2605.31113: TSM-Bench: Detecting LLM-Generated Text in Real-World Wikipedia Editing Practices
Pax, the frontier AI public safety company, cut crime by 27% in six months and raised $40M in seed funding from Greenoaks and Benchmark.
Abstract page for arXiv paper 2605.29893: Redundant or Necessary? A Benchmark for Detecting Redundant Steps in Agent Trajectories
Abstract page for arXiv paper 2605.30284: ProjectionBench: Evaluating Scientific Hypothesis Generation in LLMs Under Progressive Information Disclosure
Abstract page for arXiv paper 2605.29462: Benchmarking Large Vision-Language Models on CFMME: A Comprehensive Chinese Financial Multimodal Evaluation Dataset
Abstract page for arXiv paper 2604.00913: Benchmarking and Mechanistic Analysis of Vision-Language Models for Cross-Depiction Assembly Instruction Alignment
Abstract page for arXiv paper 2605.28721: LiveBrowseComp: Are Search Agents Searching, or Just Verifying What They Already Know?