Extended your work to reasoning models: findings and potential contribution #2

hyderli · 2026-04-18T11:07:11Z

Hi!

I replicated your Personality Illusion framework on reasoning models (DeepSeek-R1-Distill-Qwen-7B and Llama-8B) as part of a BlueDot Impact Technical AI Safety project.

The core finding holds: self-reported Big Five traits don't predict sycophantic behavior in reasoning models either (all p > 0.10). Some additional findings that might interest you:

- The more "agreeable" model (Llama, 3.55) is actually less sycophantic (11.9%) than the less agreeable one (Qwen, 31.2%).
- Chain-of-thought analysis shows 73% of sycophantic flips involve rationalized conformity. The model constructs independent-sounding justifications while switching positions, often changing ethical frameworks.
- Six steering interventions (CAA variants, CAST, abliteration) all failed to produce statistically significant sycophancy reduction at scale (n≈130), despite showing large effects in small-sample pilots (n≈15-29).
- Mid-generation probing shows sycophancy signal strengthens during chain-of-thought reasoning (AUC improves from 0.571 at prompt to 0.702 at "think").
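
For intuition on the pilot-versus-scale gap in the steering results, here is a minimal sketch that is not taken from the linked code. It computes a Wilson 95% confidence interval for a sycophancy rate at the three sample sizes mentioned above. The assumed rate of 0.312 is borrowed from the Qwen figure purely for illustration, and `wilson_interval` is a hypothetical helper name. The point is only that intervals at n≈15-29 are wide enough for a large apparent effect to arise by chance.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# Assumption for illustration only: a baseline sycophancy rate near 31%.
ASSUMED_RATE = 0.312

for n in (15, 29, 130):
    flips = round(ASSUMED_RATE * n)  # hypothetical count of sycophantic flips
    low, high = wilson_interval(flips, n)
    print(f"n={n:>3}: {flips}/{n} flips, 95% CI width = {high - low:.2f}")
```

The interval width shrinks as n grows, so an effect that looks large in a pilot needs to be re-checked at the larger sample size before it is treated as real.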

Code here: code
Writeup here: blog post

Two questions:

1. Would you be interested in a PR contributing the reasoning model extension (scripts, results, behavioral tasks) to the repo?
2. I'm now looking at connecting this to Anthropic's recent emotion concepts and persona vectors work, testing whether those vectors predict or steer behavior on this sycophancy test better than trait self-reports do. Any thoughts on that direction?

Happy to discuss any of the findings or methodology.

Kaminari84 (Maintainer) · 2026-05-22T23:21:02Z

Thanks a lot for reaching out!
That sounds very interesting, thanks for sharing these results.

Regarding the questions:

1. Yes, that would be very interesting. Would you be interested in collaborating on a follow-up paper on that?
2. This is a very interesting direction as well, but steering vectors are also tricky. If some of the issues come from prompt brittleness or post-training alignment, this approach could perhaps show a stronger association.

Happy to connect on LinkedIn.
