LURE: Alignment Evaluations to Reduce Evaluation Awareness — LessWrong
TLDR:
* Frontier models can detect when they're being evaluated and change their behavior, which risks compromising safety benchmarks.
* We introdu…
Article / 1st Jun 2026: Constraining LLMs Just Like Users. This post accompanies my recent video on this topic. Large Language Models (LLMs) - often called "AI...
Suppose you are a technical AI safety researcher who has done some research, but has not yet landed a full-time job. In this post, I argue that takin…
Summary Safe deployment of an AI system requires that we can make confident claims about its behaviour on out-of-distribution deployment inputs on th…
At the risk of embarrassing myself, I’ll share a confession. …
Abstract page for arXiv paper 2605.30085: Conformal Certification of Reasoning Trace Prefixes
Abstract page for arXiv paper 2605.29427: FinGuard: Detecting Financial Regulatory Non-Compliance in LLM Interactions
Abstract page for arXiv paper 2605.28467: Mitigating Adaptive Attacks against Reasoning Models with Activation Consistency Training
Abstract page for arXiv paper 2605.28588: Technical Report: Exploring the Emerging Threats of the Agent Skill Ecosystem
Abstract page for arXiv paper 2605.28591: Models That Know How Evaluations Are Designed Score Safer
Abstract page for arXiv paper 2605.28597: Position: Retire the "Positive Backdoor" Label -- Secret Alignment Requires Strict and Systematic Evaluation