Technical AI Safety — BlueDot Impact

I completed BlueDot Impact's Technical AI Safety program, a 6-week course covering the technical challenges of building safe AI systems. Key areas of study:

Training Safer Models

Pre-training data filtering, RLHF, Constitutional AI, the limits of training-time safety interventions. Studied Anthropic's work on data poisoning and how small amounts of poisoned data can affect models of any size.

Evaluations

The "science of evals" — why current benchmarks are insufficient, how to design evaluations that measure worst-case behavior, the gap between capability and risk evaluation. Studied METR's approach to third-party evaluation and Anthropic's Responsible Scaling Policy.

Deception and Scheming

How models can exhibit deceptive behavior including sandbagging and fake alignment. Studied OpenAI/Apollo Research's work on detecting scheming, the limits of chain-of-thought faithfulness, and deliberative alignment. Key takeaway: by definition, a scheming AI tries to hide its misalignment, making detection fundamentally challenging.

Interpretability

Mechanistic interpretability, sparse autoencoders, probing classifiers, activation patching, feature steering. Studied Anthropic's Scaling Monosemanticity work and Apollo Research's linear probes for deception detection (achieving 0.96-0.999 AUROC).

Minimizing Harm

Gradual disempowerment as a threat model — how humanity can lose agency slowly through locally rational delegation decisions, without any single catastrophic event. Developed a full kill chain analysis for this threat pathway.

My Orientation

I think about safety from an applied, engineering-oriented perspective. I'm most interested in evaluation methodology, detection systems, and treating deception as a classification/detection challenge rather than purely a policy question.

← Back to Home