
DeepMind Was Two Steps Ahead, AGAIN!
For years, Demis Hassabis has
treated AGI as a long-term scientific
mission to recreate the human mind, not
just a race to build a profitable
software product. When the world thought
Go was essentially unsolvable for AI,
DeepMind went all in on AlphaGo,
investing heavily with no immediate
return. Right now, history is repeating
itself. DeepMind is pouring resources
into another seemingly intractable
problem that nobody else wants to touch.
Depending on how this plays out, they
are either executing an extraordinary
vision or one of the biggest
miscalculations in modern history. To
understand what DeepMind is building, we
need to look at the blueprint of modern
AI. It is held up by four main pillars:
transformer-based, autoregressive,
pre-trained, and generative. Most
researchers agree that this
specific combination has enormous
potential, but also has theoretical
limitations that prevent it from
reaching human-level intelligence. And
that's exactly why DeepMind is already
moving beyond it. The first piece of the
puzzle is replacing pure transformer
models. DeepMind has been working on a
number of serious modifications to the
transformer, seen in efforts like the
Griffin architecture, RecurrentGemma,
Titans, and even Gemma 4. The
transformer is a very powerful idea.
It's basically a stack of layers where
each token repeatedly looks at all other
tokens in the sequence, decides which
ones matter most, and uses that to
update its own meaning in context. This
mechanism works very well because it can
dynamically decide how each token
relates to every other token in the
sequence, which makes it very powerful
for building contextual representations.
But the trade-off is that it does not
scale cheaply. As the sequence gets
longer, every token has to compare
itself with every other token. So, the
number of interactions grows roughly
with the square of the sequence length,
which becomes expensive very quickly.
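
To make the quadratic cost concrete, here is a minimal sketch of a single attention step in plain Python. It is a simplification under stated assumptions: one head, no learned query, key, or value projections, and the function names are made up for illustration. The point is only that the score table has one entry for every pair of tokens.

```python
import math

def softmax(row):
    # numerically stable softmax over one row of scores
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Toy single-head self-attention without learned projections (an assumption).
    tokens: list of n vectors, each of length d."""
    n, d = len(tokens), len(tokens[0])
    scale = math.sqrt(d)
    # n x n score table: every token is compared with every other token
    scores = [[sum(q * k for q, k in zip(tokens[i], tokens[j])) / scale
               for j in range(n)] for i in range(n)]
    weights = [softmax(row) for row in scores]
    # each output vector is a weighted mix of all token vectors
    outputs = [[sum(weights[i][j] * tokens[j][c] for j in range(n))
                for c in range(d)] for i in range(n)]
    return outputs, weights

def pairwise_comparisons(n):
    # size of the score table for a sequence of n tokens
    return n * n

if __name__ == "__main__":
    for n in (1_000, 2_000, 4_000):
        print(n, "tokens ->", pairwise_comparisons(n), "comparisons")
```

Doubling the sequence length quadruples the number of comparisons, which is the scaling problem the efficient variants try to control.
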
While modern systems still use
attention, they rarely use fully naive
all-to-all attention. Instead, they
adopt a growing range of efficient
approximations, sparse patterns, and
hybrid architectures to control the
quadratic cost while preserving most of
the modeling power. Most leading labs
have largely avoided experimenting with
entirely new foundations, but DeepMind
is actively exploring alternatives and
extensions to the transformer, including
recurrent and state-space inspired
architectures. You might remember back
in December 2024, Google released the
famous Titans paper, "Learning to
Memorize at Test Time." The team has
since gone on to publish Nested Learning
and, more recently, "Titans + MIRAS:
Helping AI Have Long-Term Memory." The
idea behind this approach is: instead of
remembering everything all the time,
what if we develop a specific strategy
to keep only part of the input? Titans
showed that the model can be more
selective about what it keeps by using a
surprise mechanism. They taught the
model to memorize unexpected things, and
it was able to become more effective on
common sense reasoning, genomics, and
time series tasks. This approach seems
like a promising direction for
complementing or extending transformers,
and it could even help with continual
learning since part of the adaptation
happens at inference time. So far, it is
perhaps the most notable approach to
challenge pure transformer
architectures, and almost all of the
research behind it is still coming from
DeepMind. But in a lesser-known release,
Google introduced "RecurrentGemma:
Moving Past Transformers for Efficient
Open Language Models," based on the
Griffin architecture. The Griffin
architecture mixes, and I'll explain
these fancy words in a second, gated
linear recurrences with local attention.
If we were to compare standard
attention to Griffin, it's something
like this. The transformer performs
what we call global attention. Imagine
you start reading a book. While you are
reading the first page, there is
virtually no difference between Griffin
and standard attention. But when you
get to the second page, you have to tear
out the first one and keep it right in
front of you. If you keep reading to page
100, by which point you are carrying 99
previous pages, this method becomes an
unbearable liability. That stack of
annoying pages you had to carry all the
way is the previous context. In standard
attention, it is called the KV cache.
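
To put rough numbers on that stack of pages, here is a minimal sketch of how a KV cache grows with sequence length. The layer count, head count, head size, 2-byte values, and the 500 tokens per page are all hypothetical placeholders, not the configuration of any particular model.

```python
def kv_cache_bytes(seq_len, n_layers=24, n_kv_heads=8, head_dim=128, bytes_per_value=2):
    """Approximate KV cache size for one sequence. All defaults are hypothetical."""
    # factor 2: one key vector and one value vector per token, per layer, per head
    return 2 * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_value

if __name__ == "__main__":
    for pages in (1, 10, 100):
        tokens = pages * 500  # assumption: roughly 500 tokens per page
        print(pages, "pages ->", kv_cache_bytes(tokens) / 1e6, "MB")
```

The cache grows linearly with the number of tokens kept, so page 100 costs a hundred times more memory than page 1.
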
But Griffin says, instead of having the
entire thing carried over, introduce an
index card that is constantly rewritten
for every page. And instead of global
attention, the model has only local
attention, something like 2,000 tokens,
to understand the context of the words
within the current page. The AI
synthesizes the core meaning of the
input into a fixed size state. There is
only one word left out, which is the
gated part, and that's the trickiest
part. The Griffin architecture relies on
a set of matrices called gate weights
that learn, during training, the best
strategies for retaining information.
For example, they might learn that bold
words, specific numbers, or the start of
a new chapter are more important and
need to be remembered. Realistically,
they learn much more complicated
strategies for deciding what is worth
remembering.
And finally, when something is decided
to enter the index card, there is a
recurrence gate. This part looks at the
past information and the new information
and decides how best it can integrate
them. Hypothetically, if you learn that
the main character in the book is tall on
the first page and has blue eyes on
page 65, the index card will contain a
unified representation of a main
character who is tall and has blue
eyes. So, the index card is an evolving
state of past information, not a list
of items. RecurrentGemma was a jump
on benchmarks, especially those related
to long context. It is one of the rare
transformer challengers that was given a
decent shot at proving itself. DeepMind
then squeezed the golden juice out of
these experiments to build a model with
maximum compute and memory efficiency.
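
The index card idea from the Griffin discussion can be sketched as a gated recurrence over a fixed-size state. This is a simplified illustration, not the exact recurrent unit from the Griffin paper: the per-channel scalar gate weights and the function names are assumptions, and in a real model the gate weights are learned matrices.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_recurrence(inputs, w_input, w_recur, state_size):
    """Fold a sequence of vectors into one fixed-size state (the 'index card').
    w_input, w_recur: hypothetical per-channel gate weights, normally learned."""
    state = [0.0] * state_size
    for x in inputs:
        new_state = []
        for c in range(state_size):
            i_gate = sigmoid(w_input[c] * x[c])  # input gate: how much new info to admit
            a = sigmoid(w_recur[c] * x[c])       # recurrence gate: how much old state to keep
            # blend old state and gated new input into a single updated value
            new_state.append(a * state[c] + (1.0 - a) * i_gate * x[c])
        state = new_state
    return state

if __name__ == "__main__":
    short = gated_recurrence([[1.0, 2.0]] * 3, [0.5, 0.5], [0.5, 0.5], 2)
    long = gated_recurrence([[1.0, 2.0]] * 300, [0.5, 0.5], [0.5, 0.5], 2)
    print(len(short), len(long))  # the state has the same size either way
```

However long the input is, the state stays the same size, which is the contrast with the KV cache that keeps growing.
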
Gemma 4, first of all, uses some sparsity
methods to make the model much lighter.
For example, the 4 billion
parameter model actually has about 25
billion parameters, but Gemma has two
categories, A and E, that reduce the
number of active or effective parameters
using mixture-of-experts or per-layer
embedding methods, which we won't
get into because they are a bit off
topic. Beyond that, the real
innovation is that this tiny model, which
runs on edge devices, is not only
multimodal, accepting raw video and
audio, but also has a long context of up
to 256K tokens. The Gemma team
achieved this using a local sliding
window attention for some layers and
global attention for others. Imagine
you're reading a paragraph inside a huge
book. You first put intense attention
into that single paragraph. That's your
local sliding window. But then after
that, you can go back and read the
relevant pages again, so you understand
the paragraph in the full context of the
book. In a document of, let's say, 256K
tokens, some layers only look at a
sliding window of 1,000 tokens and
ensure local coherence, while a limited
number of other layers attend globally
over the entire sequence, thereby
connecting the local context to
the entire document. These are just
some of the better-known, more mature
attempts at optimizing, extending, or
outright replacing the transformer.
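
The mix of local and global layers described for Gemma can be sketched with attention masks. This is an illustrative sketch only: the causal masking, the function names, and the `global_every` interleaving pattern are assumptions, not the published Gemma configuration.

```python
def sliding_window_mask(seq_len, window):
    # token i may attend to token j only if j is itself or one of the
    # previous (window - 1) tokens
    return [[0 <= i - j < window for j in range(seq_len)] for i in range(seq_len)]

def global_causal_mask(seq_len):
    # token i may attend to every earlier token and itself
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

def allowed_pairs(mask):
    # number of (query, key) pairs the layer actually computes
    return sum(sum(row) for row in mask)

def layer_kinds(n_layers, global_every):
    # hypothetical interleaving: every `global_every`-th layer is global
    return ["global" if (k + 1) % global_every == 0 else "local"
            for k in range(n_layers)]

if __name__ == "__main__":
    n = 4096
    print("local pairs: ", allowed_pairs(sliding_window_mask(n, 1000)))
    print("global pairs:", allowed_pairs(global_causal_mask(n)))
    print(layer_kinds(6, 3))
```

The local layers do work that grows linearly with sequence length, and only the few global layers pay the full quadratic price.
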
These attempts might seem like consumer
products, but more than anything, they
are public experiments serving
DeepMind's core strategy towards AGI.
>> We need one or two more big
breakthroughs before we'll get to AGI.
And I think they're along the lines of
things like continual learning, better
memory, longer context windows, or or
perhaps more efficient context windows
would be the right way to say it. So,
don't store everything, just store the
important things. That would be a lot
more efficient. That's what the brain
does.
>> Right alongside the transformer, though,
is another key idea called
autoregression. As you probably know,
the generation process in GPTs is
iterative next-token prediction. That
process is so well-known that some
people might assume that transformers,
and even LLMs in general, are
inherently next-token predictors. But
even the original transformer that was
invented for translation wasn't a pure
autoregressive model. It was only turned
into a decoder-only transformer that is
fully autoregressive by Alec Radford, to
go from the originally intended machine
translation function to a more general
self-supervised next-token predictor.
The main invention of the transformer
was the self-attention mechanism that we
talked about, and that doesn't require
next-token prediction or
autoregression. So there is a whole area of
transformer-based but not autoregressive
models. Diffusion language models are
the best example of that. Listen to
Professor Stefano Ermon, one of the
pioneers of diffusion models.
>> All existing or most of the existing
LLMs are auto regressive, meaning that
they generate text or code
one word, one token at a time, left to
right. On the other hand, some of the
best generative models for images,
video, music are diffusion based where
the object is generated by an iterative
refinement process where you start with
a rough guess what the answer should be
and then you keep refining it.
And crucially, this refinement process
is highly parallel. The neural network
is able to modify multiple tokens at the
same time.
>> Why is that better?
>> This generative process is significantly
more efficient. It's faster, it's
cheaper and because there is also
built-in error correction, the network
is trained to fix mistakes and that's
how model is learning at inference time.
What we see is it's the future where all
LLMs are going to be eventually
diffusion based because this is
uh
far superior approach.
>> There are three main advantages to
diffusion models, and we'll go through
them very briefly here. First, speed and
efficiency. Although diffusion models
loop as well, they usually need far fewer
iterations to reach a final answer
because the process is much more
parallel, which GPUs
love. A reasonable number of steps for
language is often in the tens or at most
low hundreds as opposed to thousands or
even tens of thousands of steps for
auto-regressive models. That's why
diffusion language models are already
about 10 times faster in practice.
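
The difference in step counts can be illustrated with a toy that only counts forward passes. This is a sketch under heavy assumptions: there is no real model here, the "prediction" is a stand-in that copies from a known target, and the function names are made up. A real diffusion language model predicts tokens and typically chooses which positions to commit based on confidence.

```python
import math

def autoregressive_decode(target):
    out, passes = [], 0
    for tok in target:      # one forward pass per generated token
        out.append(tok)     # stand-in for "predict the next token"
        passes += 1
    return out, passes

def parallel_refine_decode(target, steps):
    n = len(target)
    out = [None] * n        # start from a fully masked sequence
    per_step = math.ceil(n / steps)
    passes = 0
    while None in out:
        masked = [i for i, t in enumerate(out) if t is None]
        for i in masked[:per_step]:  # several positions filled in one pass
            out[i] = target[i]       # stand-in for "denoise these positions"
        passes += 1
    return out, passes

if __name__ == "__main__":
    target = list(range(1000))
    print("autoregressive passes:", autoregressive_decode(target)[1])
    print("parallel passes:      ", parallel_refine_decode(target, steps=20)[1])
```

Note that each parallel pass processes the whole sequence, so a pass is not necessarily cheaper than an autoregressive one. The saving comes from needing far fewer sequential passes.
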
Smarter outputs. Auto-regressive LLMs
generate text from left to right. Once a
token is generated, there is no
opportunity to revise it. Diffusion
models on the other hand can more
naturally correct mistakes by working on
the entire response holistically. Third,
flexibility. While auto-regressive
models expect the prompt to be a prefix,
diffusion LLMs allow the prompt to sit
at any arbitrary position. This
especially makes a lot of sense for
writing code. A diffusion model is
naturally trained to see the entire code
and complete the missing part. But
auto-regressive models struggle because
they don't naturally see what comes
after the edit. They have to engage in
this awkward computation and often
rewrite the parts that they are not
supposed to touch. It is not obvious
that scale will not just wash over all
of these differences, and it
might not be worth going back and
starting from the beginning. But although
everyone else is essentially ignoring
diffusion models, a very small lab out
of Stanford called Inception Labs has
already introduced the first reasoning
LLM powered by diffusion. It is
ridiculously faster than models of its
size while being competitive in quality.
This is the sort of jump that forces
reimagining the stack. Google DeepMind
is the only major lab working on a
diffusion language model and they
introduced Gemini Diffusion in mid-2025.
The jury is still out on the importance
of diffusion language modeling. It might
even be that a hybrid approach could
work really well. But that's just one
aspect of diffusion. Meanwhile, OpenAI
and Anthropic are 100% focused on
auto-regressive transformer models, and
OpenAI even recently shut down Sora 2, a
diffusion-based model, letting go of a
billion-dollar contract with Disney
because of the decision to double down
on GPT technology.
>> But the thing that's important to
realize is technologically that the Sora
models, which are incredible models, by
the way,
are different branch of the tech tree
than the core reasoning GPT series. Mhm.
They're just built in a very different
way. And to some extent, we're really
saying that pursuing both branches is
very hard for us to do.
>> Google DeepMind has lots of different
projects built on different stacks.
>> And their mindset is apparent in the
diverse projects they maintain. We'll
talk about the importance of these
projects, not just as cool media
generators, but as part of the core
strategy towards AGI. They have Veo and
Genie, Gemini Diffusion, the Imagen
lineup, which is diffusion-based image
generation, versus the Nano Banana line
that is based on Gemini, the Gemma
family, the AlphaFold/AlphaGenome line,
and many more projects. Even though
these models
consume massive amounts of compute and
research capacity,
>> And so I think there's something about
if you're if you branch too far and you
have two different artifacts, that is
very hard to sustain in a world where
there is limited compute.
>> DeepMind is betting they are necessary
for the next stage.
>> This is one advantage we have of having
such a deep and rich research bench. We
can go after both of those things at
maximum with maximum force, both, you
know, scaling up the current paradigms
and ideas and
and then really new blue sky ideas for
new architectures and things, you know,
the kinds of things we've invented over
the last 10 years as Google and
DeepMind.
>> So far, it might seem that DeepMind
is just keeping every possible research
direction open, but that is not quite
the case. They are actually making a
very specific bet on a very specific
vision of what AGI should look like. And
the picture becomes clearer in the
next two pillars, generative and
pre-trained. Here is Demis Hassabis
explaining how he disagrees with Yann
LeCun and why he's doubling down on a
particular vision of AGI.
>> Now, it remains to be seen whether just
sort of scaling up existing ideas and
technologies will be enough to do that
or we need one or two more
really big insightful innovations. I'm
probably if you were to push me I would
say I would be in the latter camp. But I
think no matter what camp you're in,
we're going to need large foundation
models as the key component of the final
AGI systems. Of that I'm sure. So I
don't I'm not a subscriber to someone
like Yann LeCun who thinks you know that
they're just sort of a some kind of dead
end. I think the only debate in my mind
is are they a key component or the only
component?
>> A lot of people might automatically side
with Demis because he's so popular and
is doing very practical research, but
you'll also easily get Yann's point when
it's laid out in simple terms. He's one
of the foundational architects of the
modern deep learning era. His work on
convolutional neural networks in the
1980s and 90s directly shaped modern
computer vision and neural network
research in general. Yann explains that
the self-supervised method of predicting
the next token works in language because
language itself is a constrained
notation we made to communicate.
>> So in the case of text, that's a very
simple problem because you only have a
finite number of words in the
dictionary. And so you can never predict
exactly what word follows a sequence,
but you can
predict a probability distribution over
all words in the dictionary.
Um and that's good enough. Uh, you can
represent uncertainty in your
prediction.
>> The model has only about 50,000 to
100,000 possible tokens for the output,
and it can assign a probability to each.
But when it comes to something like
video, a natural modality, training a
self-supervised model to predict video
frames this way is impossible, because
the number of possible frames for full
HD video dwarfs the number of atoms in
the observable universe. The space of
possible futures is mathematically
intractable.
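
A quick back-of-the-envelope check of that comparison, assuming 8-bit RGB at 1920x1080 and the commonly quoted rough figure of about 10^80 atoms in the observable universe (both are assumptions of this sketch, not numbers from the talk):

```python
import math

LOG10_ATOMS_IN_UNIVERSE = 80  # rough, commonly quoted order of magnitude

def log10_possible_frames(width=1920, height=1080, channels=3, bits_per_channel=8):
    # every channel value is independent, so the count is 2 ** (total bits);
    # we work with log10 because the number itself is far too large to print
    total_bits = width * height * channels * bits_per_channel
    return total_bits * math.log10(2)

def log10_vocab(vocab_size=100_000):
    return math.log10(vocab_size)

if __name__ == "__main__":
    print("tokens: 10^%.0f possibilities per step" % log10_vocab())
    print("frames: 10^%.0f possibilities per frame" % log10_possible_frames())
    print("atoms:  10^%d" % LOG10_ATOMS_IN_UNIVERSE)
```

A single frame already has a count with roughly fifteen million digits, against a vocabulary with six, so a softmax over all possible frames is not an option.
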
>> You can't do this with video. We do not
know how to represent appropriate
probability distribution over the set of
all images or video frames or video
segment. It's actually a
mathematically intractable problem.
So, it's not just a question of like we
don't have big enough computers. It's
just like intrinsically intractable.
>> But, this argument is actually a bit of
a straw man because no one is trying to
predict every pixel on the screen. This
is evidenced by all of these successful
AI video generators, such as Sora,
Seedance 2.0, or Veo. The debate is
actually far more nuanced than that. And
this is probably the most fascinating
argument at the center of AI right now.
One of those defining disagreements that
people will be still writing books about
a decade from now.
>> In a normal video diffusion model,
researchers don't jump straight into
predicting pixels. Instead, the process
is more like this. First, you use an
encoder to move the image into a latent
space. You're essentially building a
compressed visual language, a compact
representation that still allows for
reconstruction. And this word is very
important: reconstruction. Second, you
train another model, like a diffusion
model, inside this frozen latent space.
You're
essentially teaching it to write in this
compressed visual language. It learns to
denoise random noise into a valid
latent, which you then decode back into
pixels. Let me rephrase all of that
outside the technical jargon and go into
the philosophy of it. Instead of
predicting inside the space of all
possible future frames, modern video
models do something much smarter. A cat
walking across a room is not just any
possible image. There are physics,
geometry, object permanence, lighting,
motion continuity, and camera
constraints that massively reduce the
space of plausible futures. So, the
encoder compresses videos in a way that
preserves the underlying regularities of
the real world. It exploits the fact
that real-world videos occupy an
extremely tiny, structured subset of all
possible frames, making the diffusion
process vastly more tractable. But, here
is the trillion-dollar question. Who
cares? Why would Google DeepMind jump
through so many hoops just to create a
realistic-looking video? Strangely
enough, this slot machine may be the
clearest path we currently know toward
AGI. In this part of the video, I have
to bring so many different ideas
together, so just bear with me. The
biggest flaw in current state-of-the-art
language models is a phenomenon
called jagged intelligence. Today's
models can win a gold medal at the Math
Olympiad and solve open mathematical
conjectures, but still tell you to walk
to a car wash because it is close
enough. Other labs are publicly
accepting this jaggedness, focusing on
building narrow but useful and
profitable tools.
>> We are going to have AGI within the next
couple years in a way that
it's still going to be jagged, but that
the the floor of task will just be
almost for any intellectual task of how
you use your computer. The AI will be
able to do that.
>> Um, but it's one thing that I think
Anthropic have fully focused on. You
know, they don't make image models,
multimodal models, world models. They
just do, you know, coding and language
models. And um, they're very, very good
at that.
>> But if the goal is AGI, we must solve
this. The issue is that spoken language,
and especially written language, is just
an artifact of the human mind.
Language is a super lossy compression of
our inner model of the world, and it is
not a fundamental part of human
intelligence. Think about the difference
between someone who has read a thousand
books on Formula 1 racing and Lewis
Hamilton. The reader knows all the facts
and physics, but the driver has
developed such a deep and robust
understanding of how the machine works
that the car essentially feels like an
extension of his own body. The human
mind relies on a wordless internal
representation of reality. That is the
core of intelligence. Generating the
words to approximate that internal
understanding is the trivial part in
comparison. And this leads us to the
holy grail of modern AI, the world
model. The first step to building it is
teaching machines to predict basic
physics across the primary natural human
modalities: vision, sound, and touch.
>> So, if you show a video to a computer,
um and train some big neural net to
predict what's going to happen next in
the video,
if the system is capable of learning
this and doing a good job at that
prediction, it will probably have
understood a lot about the underlying
nature of the physical world.
Things that, you know, objects move
according to
uh particular laws, right? So, animate
objects can move in things that are more
unpredictable, but still, you know,
satisfying some constraints.
>> Now, Demis Hassabis' core argument is
that generative video can force AI to
develop a physics-based model of the
natural world.
>> But, how is an image generator close to
AGI?
>> Oh, well, of course. Look, let's take
image generators, but also uh let's talk
about our video generator Veo, which is
the state of the art in video
generation. I think that's even more
interesting and from an AGI perspective.
You know, you can think of a video model
that can generate you 10 seconds, 20
seconds of a realistic scene. It's sort
of a model of the physical world.
Intuitive physics, we sometimes call it
in physics land. And it's sort of
intuitively understood how uh liquids
and and and and and objects behave in
the world. And that's um and obviously
one way to exhibit understanding is to
be able to generate it at least to the
to the to the human eye being accurate
enough to to be satisfying to the human
eye. Obviously, it's not completely
accurate from a physics point of view
and we're getting it we're we're going
to improve that. But it's it's it's
steps towards having this idea of a
world model a system that can understand
the world and the mechanics and the
causality of the world. And then of
course that would be I think essential
for AGI because that would allow these
systems to plan long-term plan in the
real world.
>> But although we now know empirically
that even generative models have to
develop some model of the world, Yann
argues that this is not really a world
model. The point of contention is that
the whole encoder-diffusion-decoder
stack we talked about is trained with
one purpose: to reconstruct the original
image. Yann's view is that if the training
objective is reconstruction at the pixel
level, then the model is wasting the
vast majority of its compute trying to
predict things like the random movements
of leaves on a tree. That's not only
useless, but impossible to predict. The
joint embedding predictive architectures
(JEPA) that Yann is advocating for
do something very different. JEPA is not
a generative model, and right now it is
one of the few serious alternatives to
purely generative approaches. The latest
implementation of this architecture
consists of three main parts: a context
encoder, a target encoder, and a
predictor. The target encoder processes
the whole image. The context encoder
sees the visible part of the image. Then
the predictor tries to predict the
target representation of the missing
patch from the context representation.
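
The three parts just described can be sketched as a single loss computation. This is a toy illustration under stated assumptions: the encoders and predictor are plain linear maps with hand-picked weights, the image is a flat list of numbers, and the target is the embedding of the whole image rather than patch-level embeddings. The function names are hypothetical and no training loop is shown.

```python
def linear(weights, vec):
    # apply a weight matrix (list of rows) to a vector
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]

def jepa_loss(image, visible_idx, ctx_enc, tgt_enc, predictor):
    # context encoder sees only the visible pixels (hidden ones are zeroed)
    visible = [p if i in visible_idx else 0.0 for i, p in enumerate(image)]
    ctx_repr = linear(ctx_enc, visible)
    # target encoder processes the whole image
    tgt_repr = linear(tgt_enc, image)
    # predictor guesses the target representation from the context representation
    pred = linear(predictor, ctx_repr)
    # the loss is measured in embedding space; no pixels are reconstructed
    return sum((p - t) ** 2 for p, t in zip(pred, tgt_repr))

if __name__ == "__main__":
    identity = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]
    image = [1.0, 2.0, 3.0, 4.0]
    print(jepa_loss(image, {0, 1}, identity, identity, identity))
```

Training would adjust the encoders and the predictor to shrink this embedding-space gap, which is a different objective from reconstructing the original pixels.
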
The most important thing to understand
is that the learning signal comes from
the pressure for the context and target
encoders to agree on a similar latent
representation because both are derived
from the same underlying scene. So this
whole generative versus JEPA argument
comes down to this very nuanced point.
Both architectures learn some model of
the world, but the key difference is
this: in a generative setup, the final
objective is usually to keep enough
information to reconstruct the original
input as accurately as possible. In
JEPA, the objective is to predict a
compatible representation of a missing
part. The model only needs to match
representations in embedding space.
With video, the same principle applies.
If a ball rolls behind a wall, JEPA
must predict a compatible future
representation where the ball emerges on
the other side. Over enough examples,
the model can develop an implicit
understanding of object permanence
without ever generating the video
itself. Now, this raises an obvious
question. If JEPA sounds so reasonable,
why isn't everyone just switching?
Theoretically, architectures like this
should eventually outperform purely
generative systems, but empirically,
generative models are still performing
much better across many important
domains. Yann argues that predictive
world models are the more scalable
long-term path toward human-level
intelligence. Meanwhile, DeepMind is
saying generative training may already
contain enough pressure to learn
surprisingly deep world models
implicitly. I know it has already been
pretty hard to follow this part, but it
is incomplete without this final point.
The generative diffusion model itself is
not the final product for DeepMind. The
approach is to build a hyperrealistic
and grounded simulation of the world
using generative diffusion models. Then,
use that simulated world to train the
multimodal AI model Gemini to develop an
explicit evolving model of reality
internally. Essentially, a multimodal,
probably autoregressive model gathering
experience in diffusion-based simulated
worlds. That's Gemini Omni, by the way.
It isn't just another diffusion-based
video generator. It is an early
iteration of DeepMind's world model, an
autoregressive multimodal core built for
anything-to-anything capabilities,
starting with video. Finally, the only
big idea remaining is larger-scale
pre-training as the foundation of the
whole system, which DeepMind has been a
really big fan of. But, highly respected
researchers like Ilya Sutskever and
Richard Sutton suggest that larger-scale
pre-training is probably a dead end. The
core issue is that larger-scale
pre-training almost by definition
already assumes an inefficient learner.
Ilya Sutskever's and Richard Sutton's
argument can be divided into two parts.
First, learning from experience. The
physical world is a messy environment
full of unorganized goals, very subtle
learning signals, and super dynamic
interactions. Performing all of the
training in an isolated lab with rigid
arbitrary goals will never result in a
truly intelligent and robust system. The
model should require only minimal
pre-training paired with an extremely
efficient learning algorithm. Argument
number two is biological plausibility.
>> The thing
that happened with AGI and pre-training
is that in some sense they overshot the
target. If you think about the term AGI,
you will realize, and especially in the
context of pre-training, you will
realize that a human being is not an AGI
because a human being Yes, there is
definitely a foundation of skills. A
human being lacks a huge amount of
knowledge.
Instead, we rely on continual learning.
It's a process as opposed to you drop
the finished thing.
>> The thing you're pointing out with
superintelligence
is not some finished
mind which knows how to do every single
job in the economy cuz the way, say,
the original, I think, OpenAI charter or
whatever defines AGI is like it can do
every single job that a every single
thing a human can do.
You're proposing instead
a mind which can learn to do any single
every single job.
>> Yes.
>> And that is superintelligence. And then
but once you have the learning
algorithm,
it gets deployed into the world the same
way a human laborer might join an
organization.
>> Most current learning systems operate
with extremely rigid objectives. Reach
the target and you receive a reward.
Fail and you get punished. Learning
becomes a harsh binary process. Human
learning works very differently. You do
not need to crash the car to understand
danger. You do not need endless miles of
driving before receiving feedback. The
moment you sit behind the wheel, you
already have a sense of what optimal
should feel like. You have a sense of
balance, control, confidence,
uncertainty, and whether something feels
wrong. The brain is constantly
evaluating itself long before any
catastrophic outcome occurs because the
learning signals are so rich. Humans
learn with astonishing efficiency and
are surprisingly robust. Google
DeepMind, however, has been advocating
for larger-scale pre-training as part of
the core recipe for AGI.
>> I think by far, probably my betting
would be the quickest way to get to AGI
and the most likely plausible way is to
use all the knowledge that's existing in
the world right now on things like the
web, ingesting all of that information,
and I don't see why you wouldn't start
with a model as a kind of prior or or to
build on and to make predictions that
helps bootstrap your learning. I just
think it
doesn't make sense not to make use of
that. So my my betting would be is that
you know, the final AGI system will have
these large multi-modal
models as part of the the overall
solution, but probably won't be enough
on their own. You will need this
additional planning search on top.
>> This is the same philosophy that built
AlphaGo. That breakthrough proved deep
learning could solve near infinite
problems using deeply efficient pattern
recognition. It gave DeepMind the
conviction that AGI, despite seeming
impossibly vast, can be solved using
those same principles. From one angle,
Google DeepMind is doubling down on
generative AI and massive pre-trained
models. From another, they are actively
pushing research far beyond the standard
transformers and auto-regressive models.
It is a highly specific bet on exactly
what AGI should look like. Fortunately,
it will only take a couple more years
for us to understand who is right.
Honestly, what an odd time to be alive.
Thank you for watching and I see you in
the next one.