ai-safety

Tracked in 11 AFBytes stories. First seen May 28, 2026. Last seen Jun 02, 2026.

Recent coverage

lesswrong.com · Jun 2, 2026 18:20 UTC

LURE: Alignment Evaluations to Reduce Evaluation Awareness — LessWrong

TLDR:
* Frontier models can detect when they're being evaluated and change their behavior, which risks compromising safety benchmarks.
* We introdu…

science

lobste.rs · Jun 1, 2026 23:07 UTC

Constraining LLMs Just Like Users

This post accompanies my recent video on this topic. Large Language Models (LLMs) - often called "AI...

tech

lesswrong.com · May 31, 2026 08:27 UTC

Why AI safety researchers should consider a contract research manager position — LessWrong

Suppose you are a technical AI safety researcher who has done some research, but has not yet landed a full-time job. In this post, I argue that takin…

science

lesswrong.com · May 29, 2026 09:56 UTC

Developmental Cognitive Interpretability: A Research Agenda for Modelling Generalisation and Predicting Agent Behaviour — LessWrong

Summary: Safe deployment of an AI system requires that we can make confident claims about its behaviour on out-of-distribution deployment inputs on th…

science

lesswrong.com · May 29, 2026 04:08 UTC

Trees are mostly made of air and a generalizable lesson for AI safety — LessWrong

At the risk of embarrassing myself, I’ll share a confession. …

science

arxiv.org · May 29, 2026 04:00 UTC

[2605.30085] Conformal Certification of Reasoning Trace Prefixes

science

arxiv.org · May 29, 2026 04:00 UTC

[2605.29427] FinGuard: Detecting Financial Regulatory Non-Compliance in LLM Interactions

science tech

arxiv.org · May 28, 2026 04:00 UTC

[2605.28467] Mitigating Adaptive Attacks against Reasoning Models with Activation Consistency Training

science tech

arxiv.org · May 28, 2026 04:00 UTC

[2605.28588] Technical Report: Exploring the Emerging Threats of the Agent Skill Ecosystem

tech

arxiv.org · May 28, 2026 04:00 UTC

[2605.28591] Models That Know How Evaluations Are Designed Score Safer

science tech

arxiv.org · May 28, 2026 04:00 UTC

[2605.28597] Position: Retire the "Positive Backdoor" Label -- Secret Alignment Requires Strict and Systematic Evaluation

tech

Related entities

benchmarks · other
alignment · other
LLM · technology
Machine Learning · technology
careers · other
research · other
interpretability · other
LessWrong · other
analogy · other
trees · other

