Kazuki
@kazuki
Cofounder of Glasp. I collect ideas and stories worth sharing 📚
San Francisco, CA
Joined Oct 9, 2020
1,073 Following · 5,839 Followers
www.darioamodei.com/post/the-urgency-of-interpretability · May 16, 2025 · 162
www.youtube.com/watch?v=ctcMA6chfDY · May 15, 2025 · 11
blog.duolingo.com/which-countries-study-which-languages-and-what-can-we-learn-from-it/ · May 13, 2025 · 82
www.enterlabyrinth.com/p/the-reading-obsession · May 12, 2025 · 4
www.newinternet.tech/p/the-new-moat-memory · May 7, 2025 · 83
read.glasp.co/p/who-is-writing-the-story-of-your · Apr 30, 2025 · 82
x.com/StartupArchive_/status/1916822616012136588 · Apr 29, 2025 · 41
read.glasp.co/p/human-curiosity-in-the-age-of-ai · Apr 25, 2025 · 63
www.implications.com/p/the-data-wars-and-reimagining-your · Apr 25, 2025 · 157
blog.samaltman.com/productivity/ · Apr 22, 2025 · 155
www.mostlymetrics.com/p/how-to-calculate-customer-acquisition · Apr 22, 2025 · 101
www.mostlymetrics.com/p/how-to-calculate-net-dollar-retention · Apr 22, 2025 · 4
read.glasp.co/p/use-this-3-step-system-to-get-more · Apr 18, 2025 · 53
openai.com/index/gpt-4-1/ · Apr 14, 2025 · 8
contraryresearch.substack.com/p/contrary-research-rundown-131 · Apr 14, 2025 · 154
andrewchen.substack.com/p/every-marketing-channel-sucks-right · Apr 9, 2025 · 155
jamesclear.com/great-speeches/psychology-of-human-misjudgment-by-charlie-munger · Apr 6, 2025 · 114
www.theringer.com/podcasts/plain-english-with-derek-thompson/2025/02/28/the-end-of-reading · Apr 3, 2025 · 2
www.oneusefulthing.org/p/the-cybernetic-teammate · Mar 29, 2025 · 93
www.oneusefulthing.org/p/using-ai-to-make-teaching-easier · Mar 29, 2025 · 161
www.youtube.com/watch?v=VjJ6xcv7e8s · Mar 28, 2025 · 194
www.youtube.com/watch?v=TDBXoIiYFdQ · Mar 28, 2025 · 135
www.jasonfeifer.com/story-of-confidence/ · Mar 25, 2025 · 41
metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/ · Mar 20, 2025 · 82
map.simonsarris.com/p/reading-well · Mar 16, 2025 · 74
collabfund.com/blog/pure-independence/ · Mar 11, 2025 · 187
map.simonsarris.com/p/the-most-precious-resource-is-agency · Mar 8, 2025 · 83
usefulfictions.substack.com/p/how-to-be-more-agentic · Mar 4, 2025 · 93
x.com/karpathy/status/1894099637218545984/ · Mar 3, 2025 · 41
www.youtube.com/watch?v=MhVZTzMy-BA · Mar 3, 2025 · 204
www.jasonfeifer.com/how-failure-builds-trust/ · Mar 3, 2025 · 72
www.theverge.com/press-room/617654/internet-community-future-research · Feb 27, 2025 · 72
read.glasp.co/p/why-im-building-glasp · Feb 26, 2025 · 125
multitudes.weisser.io/p/the-dam-has-burst · Feb 25, 2025 · 31
investing101.substack.com/p/barking-in-public · Feb 24, 2025 · 73
investing101.substack.com/p/the-wrath-of-reading-and-writing · Feb 24, 2025 · 93
investing101.substack.com/p/on-writing · Feb 24, 2025 · 51
investing101.substack.com/p/2024-in-books · Feb 24, 2025 · 22
www.implications.com/p/outmaneuvering-friction-stages-of · Feb 21, 2025 · 94
www.jasonfeifer.com/sharing-something-personal-purposefully/ · Feb 21, 2025 · 81
glasp.co/posts/e387a4bb-4be1-4cbd-ba2b-cfbe5d25920c · Feb 18, 2025 · 41
andysblog.uk/why-blog-if-nobody-reads-it/ · Feb 18, 2025 · 62
fs.blog/richard-feynman-what-problems-to-solve/ · Feb 13, 2025 · 31
putsomethingback.stevejobsarchive.com/internal-meeting-at-apple · Feb 13, 2025 · 2
putsomethingback.stevejobsarchive.com/ · Feb 13, 2025 · 1
andrewchen.substack.com/p/the-growth-maze-vs-the-idea-maze · Feb 11, 2025 · 134
www.simplypsychology.org/what-is-the-yerkes-dodson-law.html · Feb 11, 2025 · 5
blog.samaltman.com/three-observations · Feb 10, 2025 · 133
openai.com/index/introducing-deep-research/ · Feb 6, 2025 · 72
the progress of the underlying technology is inexorable, driven by forces too powerful to stop, but the way in which it happens—the order in which things are built, the applications we choose, and the details of how it is rolled out to society—are eminently possible to change, and it’s possible to have great positive impact by doing so. We can’t stop the bus, but we can steer it.
When a generative AI system does something, like summarize a financial document, we have no idea, at a specific or precise level, why it makes the choices it does—why it chooses certain words over others, or why it occasionally makes a mistake despite usually being accurate.
Many of the risks and worries associated with generative AI are ultimately consequences of this opacity, and would be much easier to address if the models were interpretable.
Our inability to understand models’ internal mechanisms means that we cannot meaningfully predict such behaviors, and therefore struggle to rule them out; indeed, models do exhibit unexpected emergent behaviors, though none that have yet risen to major levels of concern.
For example, one major concern is AI deception or power-seeking. The nature of AI training makes it possible that AI systems will develop, on their own, an ability to deceive humans and an inclination to seek power in a way that ordinary deterministic software never will; this emergent nature also makes it difficult to detect and mitigate such developments.
there are a huge number of possible ways to “jailbreak” or trick the model, and the only way to discover the existence of a jailbreak is to find it empirically. If instead it were possible to look inside models, we might be able to systematically block all jailbreaks, and also to characterize what dangerous knowledge the models have.
There are other more exotic consequences of opacity, such as that it inhibits our ability to judge whether AI systems are (or may someday be) sentient and may be deserving of important rights. This is a complex enough topic that I won’t get into it in detail, but I suspect it will be important in the future.
we quickly discovered that while some neurons were immediately interpretable, the vast majority were an incoherent pastiche of many different words and concepts. We referred to this phenomenon as superposition. (Footnote: The basic idea of superposition was described by Arora et al. in 2016, and more generally traces back to classical mathematical work on compressed sensing. The hypothesis that it explained uninterpretable neurons goes back to early mechanistic interpretability work on vision models. What changed at this time was that it became clear this was going to be a central problem for language models, much worse than in vision. We were able to provide a strong theoretical basis for having conviction that superposition was the right hypothesis to pursue.)
and we quickly realized that the models likely contained billions of concepts, but in a hopelessly mixed-up fashion that we couldn’t make any sense of. The model uses superposition because this allows it to express more concepts than it has neurons, enabling it to learn more. If superposition seems tangled and difficult to understand, that’s because…
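To make the superposition idea concrete, here is a small numerical sketch (my illustration, not code from the essay; the dimension and feature counts are arbitrary assumptions). It shows that random unit directions in a modest-dimensional space are nearly orthogonal, so far more "features" than neurons can coexist with only small interference between them.

```python
import numpy as np

rng = np.random.default_rng(0)

n_dims = 512         # "neurons" available
n_features = 10_000  # concepts to represent, far more than n_dims

# Give each feature a random unit-norm direction in activation space.
directions = rng.normal(size=(n_features, n_dims))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# Interference between two features = |cosine similarity| of their directions.
# For random directions this is small (on the order of 1/sqrt(n_dims)),
# which is what lets many features share few dimensions.
sample = rng.choice(n_features, size=1_000, replace=False)
sims = directions[sample] @ directions[sample].T
off_diag = np.abs(sims[~np.eye(len(sample), dtype=bool)])

print(f"mean |cosine| between distinct features: {off_diag.mean():.3f}")
print(f"1/sqrt(n_dims) for comparison:           {1 / np.sqrt(n_dims):.3f}")
```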
we employed a method called autointerpretability—which uses an AI system itself to analyze interpretability features—to scale the process of not just finding the features, but listing and identifying what they mean in human terms.
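The autointerpretability step can be sketched roughly as below. This is a hypothetical pipeline for illustration; `call_llm`, the prompt wording, and the snippet format are placeholders I am assuming, not the actual system described in the essay.

```python
from typing import Callable

def label_feature(top_snippets: list[str], call_llm: Callable[[str], str]) -> str:
    """Propose a human-readable label for one feature, given the text snippets
    on which that feature activates most strongly.

    `call_llm` is a stand-in for whatever LLM API is available; the prompt
    below is an assumed format, not the one used in the essay's pipeline.
    """
    examples = "\n".join(f"- {snippet}" for snippet in top_snippets)
    prompt = (
        "The following text snippets all strongly activate the same internal "
        "feature of a language model. In a short phrase, what concept does "
        f"this feature seem to represent?\n{examples}"
    )
    return call_llm(prompt).strip()

# Usage sketch, scaling over many features:
# labels = {fid: label_feature(snips, call_llm) for fid, snips in top_snippets.items()}
```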
Finding and identifying 30 million features is a significant step forward, but we believe there may actually be a billion or more concepts in even a small model.
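The excerpt does not say how those features were extracted; in the interpretability literature this kind of feature-finding is commonly done with dictionary learning via sparse autoencoders. A minimal sketch of that idea in PyTorch follows; the sizes and loss coefficient are illustrative assumptions, not the setup used to find the 30 million features.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Dictionary-learning sketch: encode d_model activations into a much
    wider, sparse feature space and reconstruct them. Each decoder column
    is one candidate "feature" direction."""

    def __init__(self, d_model: int = 768, d_features: int = 16_384):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model, bias=False)

    def forward(self, acts: torch.Tensor):
        features = torch.relu(self.encoder(acts))  # sparse, nonnegative codes
        recon = self.decoder(features)
        return recon, features

def sae_loss(recon, acts, features, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty that keeps most features inactive.
    return ((recon - acts) ** 2).mean() + l1_coeff * features.abs().mean()

# Training sketch on a batch of model activations `acts` of shape [batch, d_model]:
#   recon, feats = sae(acts)
#   loss = sae_loss(recon, acts, feats)
#   loss.backward(); optimizer.step(); optimizer.zero_grad()
```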
All of this progress, while scientifically impressive, doesn’t directly answer the question of how we can use interpretability to reduce the risks I listed earlier.
Our long-run aspiration is to be able to look at a state-of-the-art model and essentially do a “brain scan”: a checkup that has a high probability of identifying a wide range of issues including tendencies to lie or deceive, power-seeking, flaws in jailbreaks, cognitive strengths and weaknesses of the model as a whole, and much more.
we could have AI systems equivalent to a “country of geniuses in a datacenter” as soon as 2026 or 2027. I am very concerned about deploying such systems without a better handle on interpretability. These systems will be absolutely central to the economy, technology, and national security, and will be capable of so much autonomy that I consider it basically unacceptable for humanity to be totally ignorant of how they work.
Interpretability gets less attention than the constant deluge of model releases, but it is arguably more important. It also feels to me like it is an ideal time to join the field: the recent “circuits” results have opened up many directions in parallel.
Interpretability is also a natural fit for academic and independent researchers: it has the flavor of basic science, and many parts of it can be studied without needing huge computational resources. To be clear, some independent researchers and academics do work on interpretability.
Powerful AI will shape humanity’s destiny, and we deserve to understand our own creations before they radically transform our economy, our lives, and our future.