DeepMind Was Two Steps Ahead, AGAIN!

Pourya Kordi · 31:07

Transcript

0:00

For years, Demis Hassabis has

treated AGI as a long-term scientific

mission to recreate the human mind, not

just a race to build a profitable

software product. When the world thought

Go was essentially unsolvable for AI,

DeepMind went all in on AlphaGo,

investing heavily with no immediate

return. Right now, history is repeating

itself. DeepMind is pouring resources

into another seemingly intractable

problem that nobody else wants to touch.

Depending on how this plays out, they

are either executing an extraordinary

vision or one of the biggest

miscalculations in modern history. To

understand what DeepMind is building, we

need to look at the blueprint of modern

AI. It is held up by four main pillars:

transformer based, autoregressive,

pre-trained, and generative. Most

researchers agree that this

specific combination has enormous

potential, but also has theoretical

limitations that prevent it from

reaching human-level intelligence. And

that's exactly why DeepMind is already

moving beyond it. The first piece of the

puzzle is replacing pure transformer

models. DeepMind has been working on

serious modifications to the

transformer, seen in efforts like the

Griffin architecture, RecurrentGemma,

Titans, and even Gemma 4. The

transformer is a very powerful idea.

It's basically a stack of layers where

each token repeatedly looks at all other

tokens in the sequence, decides which

ones matter most, and uses that to

update its own meaning in context. This

mechanism works very well because it can

dynamically decide how each token

relates to every other token in the

sequence, which makes it very powerful

for building contextual representations.

But the trade-off is that it does not

scale cheaply. As the sequence gets

longer, every token has to compare

itself with every other token. So, the

number of interactions grows roughly

with the square of the sequence length,

which becomes expensive very quickly.
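To make the quadratic cost concrete, here is a minimal sketch of naive all-to-all self-attention in plain Python. It is only an illustration: the toy token vectors are made up, and real models use learned query, key, and value projections plus multiple heads. The `comparisons` counter shows that every token is scored against every other token.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Naive all-to-all attention over lists of equal-length vectors."""
    dim = len(queries[0])
    outputs = []
    comparisons = 0
    for q in queries:
        scores = []
        for k in keys:
            # Scaled dot product between this token and every token.
            scores.append(sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim))
            comparisons += 1
        weights = softmax(scores)
        # Weighted mix of all value vectors becomes the token's new meaning.
        out = [sum(w * v[d] for w, v in zip(weights, values))
               for d in range(len(values[0]))]
        outputs.append(out)
    return outputs, comparisons


# Toy 2-dimensional "token" vectors (made up for illustration).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
outputs, comparisons = self_attention(tokens, tokens, tokens)
print(comparisons)  # 16 comparisons for 4 tokens
```

Doubling the sequence to 8 tokens raises the count to 64, which is the square-of-length growth described above.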

While modern systems still use

attention, they rarely use fully naive

all-to-all attention. Instead, they

adopt a growing range of efficient

approximations, sparse patterns, and

hybrid architectures to control the

quadratic cost while preserving most of

the modeling power. Most leading labs

have largely avoided experimenting with

entirely new foundations, but DeepMind

is actively exploring alternatives and

extensions to the transformer, including

recurrent and state-space inspired

architectures. You might remember back

in December 2024, Google released the

famous Titans paper, "Learning to

Memorize at Test Time." The team has

since moved on to publish Nested Learning

and, more recently, "Titans + MIRAS:

Helping AI have long-term memory." The

idea for this approach is, instead of

remembering everything all the time,

what if we develop a specific strategy

to keep only part of the input? Titans

showed that the model can be more

selective about what it keeps by using a

surprise mechanism. They taught the

model to memorize unexpected things, and

it was able to become more effective on

common sense reasoning, genomics, and

time series tasks. This approach seems

like a promising direction for

complementing or extending transformers,

and it could even help with continual

learning since part of the adaptation

happens at inference time. So far, it is

perhaps the most notable approach to

challenge pure transformer

architectures, and almost all of the

research behind it is still coming from

DeepMind. But in a lesser-known release,

Google introduced RecurrentGemma ("Moving

Past Transformers for Efficient Open

Language Models"), based on the

Griffin architecture. The Griffin

architecture mixes, and I'll explain

these fancy words in a second, gated

linear recurrences with local attention.

If we were to compare the standard

attention to Griffin, it's something

like this. The transformer is performing

what we call global attention. Imagine

you start reading a book. When you are

reading the first page, there is

virtually no difference between Griffin

and standard attention. But when you

get to the second page, you have to tear

out the first one and keep it right in front

of you. If you keep reading to page

100, by which point you are carrying 99 previous

pages, this method becomes an unbearable

liability. That stack of annoying

pages that you had to carry all the way

is the previous context. It is called

the KV cache in standard attention.
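A rough back-of-the-envelope sketch of how that stack of pages grows. The model shape below (layer count, KV heads, head size, 2 bytes per value) is a hypothetical example and not the configuration of any specific model. The point is only that the cache grows linearly with the number of tokens carried.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Approximate KV cache size: one key and one value vector per token, per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical model shape, chosen only for illustration.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

for seq_len in (1_000, 100_000):
    gib = kv_cache_bytes(seq_len, LAYERS, KV_HEADS, HEAD_DIM) / 2**30
    print(f"{seq_len:>7} tokens -> {gib:.2f} GiB")
```

Under these assumptions each extra token adds 128 KiB to the cache, so 100 times more context means 100 times more memory to carry.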

But Griffin says, instead of having the

entire thing carried over, introduce an

index card that is constantly rewritten

for every page. And instead of global

attention, the model has only local

attention. Something like 2,000 tokens

to understand the context of the words

within the current page. The AI

synthesizes the core meaning of the

input into a fixed size state. There is

only one word left out, which is the

gated part, and that's the trickiest

part. The Griffin architecture relies on

a bunch of matrices called gate weights

that learn, during training, the best

strategies for retaining information. For example,

they might learn that bold words, specific

numbers, or the start of a new chapter are

more important and need to be

remembered. Realistically, though, they learn much

more complicated strategies for deciding what

makes something worth remembering or not.

And finally, when something is decided

to enter the index card, there is a

recurrence gate. This part looks at the

past information and the new information

and decides how best it can integrate

them. Hypothetically, if you learn that the

main character in the book is tall on

the first page and has blue eyes on

page 65, the index card will contain a

unified representation of a main

character who is tall and has blue

eyes. So, the index card is an evolving

state of past information, not a list

of items. RecurrentGemma was a jump

on benchmarks, especially those related to

long context. It is one of the rare

transformer challengers that was given a

decent shot at proving itself. DeepMind

then squeezed the golden juice out of

these experiments to build a model with

maximum compute and memory efficiency.
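The index-card idea described for Griffin above can be sketched as a gated linear recurrence. This is a simplified illustration, not Griffin's actual layer: the per-channel gate weights `w_input` and `w_recur` are hypothetical stand-ins for learned matrices, and the blend rule is a plain convex mix rather than the exact update used in the paper.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gated_recurrence(inputs, w_input, w_recur):
    """Run a simplified element-wise gated linear recurrence.

    inputs: list of vectors, one per token.
    w_input, w_recur: per-channel gate weights (hypothetical stand-ins
    for learned parameters).
    Returns the final fixed-size state (the "index card").
    """
    state = [0.0] * len(inputs[0])
    for x in inputs:
        new_state = []
        for h, xi, wi, wr in zip(state, x, w_input, w_recur):
            i_gate = sigmoid(wi * xi)  # how much of the new input to admit
            a_gate = sigmoid(wr * xi)  # how much of the old state to keep
            new_state.append(a_gate * h + (1.0 - a_gate) * i_gate * xi)
        state = new_state
    return state


short_doc = [[1.0, -2.0]] * 5
long_doc = [[1.0, -2.0]] * 500
print(len(gated_recurrence(short_doc, [1.0, 1.0], [1.0, 1.0])))
print(len(gated_recurrence(long_doc, [1.0, 1.0], [1.0, 1.0])))
```

Both calls return a state of the same size, which is the contrast with the KV cache: the memory carried forward does not grow with the number of pages read.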

Gemma 4 first of all uses some sparsity

methods to make the model much more

lightweight. For example, the 4 billion

parameter model is actually about 25

billion parameters, but Gemma has two

categories of A and E that reduce the

number of active or effective parameters

using the mixture of experts or per

layer embedding methods, which we don't

get into because they are a bit off

topic. But after that, the real

innovation is how this tiny model that

runs on edge devices is not only

multimodal and accepts raw video and

audio. It also has a long context of up

to 256K tokens. The Gemma team

achieved this using a local sliding

window attention for some layers and

global attention for others. Imagine

you're reading a paragraph inside a huge

book. You first put intense attention

into that single paragraph. That's your

local sliding window. But then after

that, you can go back and read the

relevant pages again, so you understand

the paragraph in the full context of the

book. In a document of let's say 256K

tokens, some layers only look at a

sliding window of 1,000 tokens and

ensure local coherence, while a limited

number of other layers attend to the entire

document with global attention and

therefore connect the local context to

the entire document. These are just a

number of more well-known, more mature

attempts at optimizing, extending, or

outright replacing the transformer.

These attempts might seem like consumer

products, but more than anything, they

are public experiments serving

DeepMind's core strategy towards AGI.
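The mix of sliding-window and global layers described for Gemma 4 above can be sketched with attention masks. This is an illustration under assumptions: the sequence length, window size, and the five-local-to-one-global layer pattern are hypothetical values picked to keep the numbers small, not the model's published configuration.

```python
def causal_mask(seq_len, window=None):
    """mask[i][j] is True when token i may attend to token j.

    With window=None the layer is global (all earlier tokens are visible).
    With a window, only the most recent `window` tokens are visible.
    """
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            visible = j <= i
            if window is not None:
                visible = visible and (i - j) < window
            row.append(visible)
        mask.append(row)
    return mask


def count_pairs(mask):
    """Number of token pairs a layer has to score."""
    return sum(sum(row) for row in mask)


SEQ_LEN, WINDOW = 16, 4
# Hypothetical layer pattern: five local layers followed by one global layer.
layer_pattern = ["local"] * 5 + ["global"]

for kind in layer_pattern:
    mask = causal_mask(SEQ_LEN, WINDOW if kind == "local" else None)
    print(kind, count_pairs(mask))
```

Local layers score far fewer pairs than the global layer, and the gap widens as the document gets longer, because the local cost grows linearly with length while the global cost grows quadratically.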

>> We need one or two more big

breakthroughs before we'll get to AGI.

And I think they're along the lines of

things like continual learning, better

memory, longer context windows, or or

perhaps more efficient context windows

would be the right way to say it. So,

don't store everything, just store the

important things. That would be a lot

more efficient. That's what the brain

does.

>> Right alongside the transformer, though,

is another key idea called

autoregression. As you probably know, the

generation process in GPTs is next token

prediction in an iterative process. That

process is so well-known that some

people might assume that the transformer

and even the entire LLM space are

inherently next token predictors. But

even the original transformer that was

invented for translation wasn't a pure

autoregressive model. It was only turned

into a decoder only transformer that is

fully autoregressive by Alec Radford to

go from the original intended machine

translation function to a more general

self-supervised next token predictor.

The main invention of the transformer

was the self-attention mechanism that we

talked about. And that doesn't require

next token prediction or

autoregression. So there is a whole area of

transformer-based but not autoregressive

models. Diffusion language models are

the best example of that. Listen to

Professor Stefano Ermon, one of the

pioneers of diffusion models.

>> All existing or most of the existing

LLMs are auto regressive, meaning that

they generate text or code

one word, one token at a time, left to

right. On the other hand, some of the

best generative models for images,

video, music are diffusion based where

the object is generated by an iterative

refinement process where you start with

a rough guess what the answer should be

and then you keep refining it.

And crucially, this refinement process

is highly parallel. The neural network

is able to modify multiple tokens at the

same time.

>> Why is that better?

>> This generative process is significantly

more efficient. It's faster, it's

cheaper and because there is also

built-in error correction, the network

is trained to fix mistakes and that's

how model is learning at inference time.

What we see is it's the future where all

LLMs are going to be eventually

diffusion based because this is

far superior approach.

>> There are three main advantages to

diffusion models and we talk about them

very briefly here. Speed and efficiency.

Although diffusion models loop as well,

they usually need far fewer iterations

to reach a final answer because the

process is much more parallel which GPUs

love. A reasonable number of steps for

language is often in the tens or at most

low hundreds as opposed to thousands or

even tens of thousands of steps for

auto-regressive models. That's why

diffusion language models are already

about 10 times faster in practice.
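The difference in loop counts can be sketched with two toy generators. Both use a stand-in "model" that always produces the right token, so this only counts sequential model calls and says nothing about quality. The 1,000-token length and the 32 tokens finalized per step are made-up numbers.

```python
def generate_autoregressive(target):
    """One sequential model call per generated token, left to right."""
    out, calls = [], 0
    for tok in target:
        out.append(tok)  # stand-in for sampling the next token
        calls += 1
    return out, calls


def generate_by_refinement(target, tokens_per_step):
    """Each step fills in several still-masked positions in parallel."""
    out = [None] * len(target)
    calls = 0
    while None in out:
        calls += 1
        masked = [i for i, tok in enumerate(out) if tok is None]
        for i in masked[:tokens_per_step]:
            out[i] = target[i]  # stand-in for the model's parallel update
    return out, calls


target = list(range(1000))
ar_out, ar_calls = generate_autoregressive(target)
diff_out, diff_calls = generate_by_refinement(target, tokens_per_step=32)
print(ar_calls, diff_calls)
```

Fewer sequential calls does not translate one-to-one into wall-clock speed, since each refinement step processes the whole sequence, but it shows why the parallel process suits GPUs.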

Smarter outputs. Auto-regressive LLMs

generate text from left to right. Once a

token is generated, there is no

opportunity to revise it. Diffusion

models on the other hand can more

naturally correct mistakes by working on

the entire response holistically. Third,

flexibility. While auto-regressive

models expect the prompt to be a prefix,

diffusion LLMs allow the prompt to sit

at any arbitrary position. This

especially makes a lot of sense for

writing code. Diffusion model is

naturally trained to see the entire code

and complete the missing part. But

auto-regressive models struggle because

they don't naturally see what comes

after the edit. They have to engage in

this awkward computation and often

rewrite the parts that they are not

supposed to touch. It is not obvious

that scale is not going to just wash

over all of these differences. And it

might not be worth it to go back and

start from the beginning. But although

everyone else is essentially ignoring

diffusion models, a super small lab out

of Stanford called Inception Labs has

already introduced the first reasoning

LLM powered by diffusion. It is

ridiculously faster than models of its

size while being competitive in quality.

This is the sort of jump that forces

reimagining the stack. Google DeepMind

is the only major lab working on a

diffusion language model and they

introduced Gemini Diffusion in mid-2025.

The jury is still out on the importance of

diffusion language modeling. It might

even be that a hybrid approach could

work really well. But that's just one

aspect of diffusion. While OpenAI and

Anthropic are 100% focused on

auto-regressive transformer models and

OpenAI even recently shut down Sora 2, a

diffusion-based model, letting go of a

billion-dollar contract with Disney

because of the decision to double down

on the GPT technology.

>> But the thing that's important to

realize is technologically that the Sora

models, which are incredible models, by

the way,

are different branch of the tech tree

than the core reasoning GPT series. Mhm.

They're just built in a very different

way. And to some extent, we're really

saying that pursuing both branches is

very hard for us to do.

>> Google DeepMind has lots of different

projects built on different stacks.

>> And their mindset is apparent in the

diverse projects they maintain. We'll

talk about the importance of these

projects, not just as cool media

generators, but as part of the core

strategy towards AGI. They have Veo and

Genie, Gemini Diffusion, the Imagen

lineup, which is diffusion-based image generation,

versus the Nano Banana line that is based

on Gemini, the Gemma family, the

AlphaFold/AlphaGenome line, and many more

projects. Even though these models

consume massive amounts of compute and

research capacity,

>> And so I think there's something about

if you're if you branch too far and you

have two different artifacts, that is

very hard to sustain in a world where

there is limited compute.

>> DeepMind is betting they are for the

next stage.

>> This is one advantage we have of having

such a deep and rich research bench. We

can go after both of those things at

maximum with maximum force, both, you

know, scaling up the current paradigms

and ideas and

and then really new blue sky ideas for

new architectures and things, you know,

the kinds of things we've invented over

the last 10 years as Google and

DeepMind.

>> So far, it might seem like DeepMind

is just keeping every possible research

direction open, but that is not quite

the case. They are actually making a

very specific bet on a very specific

vision of what AGI should look like. And

the picture becomes more clear in the

next two pillars, generative and

pre-trained. I've been playing around

with some new AI tools lately, and

honestly, it is getting scary good. I

feel like most people still think AI

music is where it was about a year ago,

where it sounded cool for like 10

seconds and then completely fell apart.

But I just generated this entire song,

the beats, the vocals, everything.

>> It's a lonely [singing] little heaven,

but it's peaceful and it will do.

[music]

>> Using Mureka AI and it sounds absolutely

incredible. Mureka is basically an AI

music creation tool that can generate

full studio quality songs,

instrumentals, and background tracks.

The interface is really easy to get

into. You can switch between an easy

mode and a custom mode, depending on how

much control you want. Custom mode gives

you a lot more flexibility to fine-tune

the results. There are two main models,

O2 that is built for generating a

standalone full song,

and V9, which offers precise control,

studio-level mixing, enhanced

multilingual support, and stronger

instrumental music effects. [music] It

is highly flexible and efficient for

custom generation and remixing in

styles.

>> [music]

>> As a creator, finding background music

that actually fits your videos is

extremely annoying. Even if you pay for

premium subscriptions. I used Mureka's

V9 model to generate a custom background

track for this video's intro, and I can

even use the new soundtrack feature for

educational content. You can upload any

video and automatically generate a track

that fits the composition. The music

[music] is royalty-free and cleared for

commercial use, which makes the whole

process way easier. Just as a side note,

one of my favorite use cases is remixing

Persian classical music into different

languages and styles. Let me play two

tracks that I've had on repeat lately.

>> Kiss me, my love. [singing]

>> [music]

>> Kiss me, my love.

>> [music]

>> Just for the very last time.

>> [music]

>> I am [singing] the harvest

that sleeps in

the sea.

>> For

>> [singing]

>> I am

the morning breaking the dark.

>> Check out Mureka and give your videos

their own custom AI generated

soundtrack. Link in the description.

Thanks Mureka for sponsoring this part

of the video. Here is Demis Hassabis

explaining how he disagrees with Yann

LeCun and why he's doubling down on a

particular vision of AGI.

>> Now, it remains to be seen whether just

sort of scaling up existing ideas and

technologies will be enough to do that

or we need one or two more

really big insightful innovations. I'm

probably if you were to push me I would

say I would be in the latter camp. But I

think no matter what camp you're in,

we're going to need large foundation

models as the key component of the final

AGI systems. Of that I'm sure. So I

don't I'm not a subscriber to someone

like Yann LeCun who thinks you know that

they're just sort of a some kind of dead

end. I think the only debate in my mind

is are they a key component or the only

component?

>> A lot of people might automatically side

with Demis because he's so popular and

is doing very practical research, but

you'll also easily get Yann's point when

it's laid out in simple terms. He's one

of the foundational architects of the modern

deep learning era. His work on

convolutional neural networks in the

1980s and 90s directly shaped modern

computer vision and neural network

research in general. Yann explains that

the self-supervised method of predicting

the next token works in language because

language itself is a constrained

notation we made to communicate.

>> So in the case of text, that's a very

simple problem because you only have a

finite number of words in the

dictionary. And so you can never predict

exactly what word follows a sequence,

but you can

predict a probability distribution over

all words in the dictionary.

Um and that's good enough. Uh, you can

represent uncertainty in your

prediction.

>> The model has only about 50,000 to 100,000

possible tokens for the output, and it

can assign probabilities to them. But,

when it comes to something like video, a

natural modality, training a model on

self-supervised video frames is

impossible because the number of

possible frames for a full HD video

dwarfs the number of atoms in the

observable universe. The number of

possible futures is mathematically

intractable.
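The size gap is easy to check with a little arithmetic. The sketch below assumes a 1920x1080 frame with 3 color channels and 256 levels per channel, and uses the common order-of-magnitude estimate of 10^80 atoms in the observable universe. It works with base-10 logarithms, because the raw frame count is far too large to print.

```python
import math

VOCAB_SIZE = 100_000  # upper end of the token range mentioned above

# Assumed frame format: full HD, 3 color channels, 8 bits per channel.
WIDTH, HEIGHT, CHANNELS, LEVELS = 1920, 1080, 3, 256

# log10 of LEVELS ** (WIDTH * HEIGHT * CHANNELS), i.e. the number of
# decimal digits in the count of distinct possible frames.
frame_count_log10 = WIDTH * HEIGHT * CHANNELS * math.log10(LEVELS)

ATOMS_LOG10 = 80  # common order-of-magnitude estimate

print(f"token choices:   10^{math.log10(VOCAB_SIZE):.0f}")
print(f"possible frames: 10^{frame_count_log10:,.0f}")
print(f"atoms:           10^{ATOMS_LOG10}")
```

The number of possible frames has roughly 15 million digits, while the token vocabulary has six, which is why an explicit probability over every possible frame is not an option.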

>> You can't do this with video. We do not

know how to represent appropriate

probability distribution over the set of

all images or video frames or video

segment. It's actually a

mathematically intractable problem.

So, it's not just a question of like we

don't have big enough computers. It's

just like intrinsically intractable.

>> But, this argument is actually a bit of

a straw man because no one is trying to

predict every pixel on the screen. This

is evidenced by all of these successful

AI video generators, such as Sora,

Seedance 2.0, or Veo. The debate is

actually far more nuanced than that. And

this is probably the most fascinating

argument at the center of AI right now.

One of those defining disagreements that

people will be still writing books about

a decade from now.

>> In a normal video diffusion model,

researchers don't jump straight into

predicting pixels. Instead, the process

is more like this. First, you use an encoder to

move the image into the latent space.

You're essentially building a compressed

visual language, a compact

representation that still allows for

reconstruction. And this word is very

important, reconstruction. Second, you train

a second model, like a diffusion model,

inside this frozen latent space. You're

essentially teaching it to write in this

compressed visual language. It learns to

denoise random noise into a valid

latent, which you then decode back into

pixels. Let me rephrase all of that

outside the technical jargon and go into

the philosophy of it. Instead of

predicting inside the space of all

possible future frames, modern video

models do something much smarter. A cat

walking across a room is not just any

possible image. There are physics,

geometry, object permanence, lighting,

motion continuity, and camera

constraints that massively reduce the

space of plausible futures. So, the

encoder compresses videos in a way that

preserves the underlying regularities of

the real world. It exploits the fact

that real-world videos occupy an

extremely tiny, structured subset of all

possible frames, making the diffusion

process vastly more tractable. But, here

is the trillion-dollar question. Who

cares? Why would Google DeepMind jump

through so many hoops just to create a

realistic-looking video? Strangely

enough, this slot machine may be the

clearest path we currently know toward

AGI. In this part of the video, I have

to bring so many different ideas

together, so just bear with me. The

biggest flaw in the current

state-of-the-art language models is a phenomenon

called jagged intelligence. Today's

models can achieve a gold medal at the math

olympiad and solve open mathematical

conjectures, but still tell you to walk

to a car wash because it is close

enough. Other labs are publicly

accepting this jaggedness, focusing on

building narrow, but useful and profitable

tools.

>> We are going to have AGI within the next

couple years in a way that

it's still going to be jagged, but that

the the floor of task will just be

almost for any intellectual task of how

you use your computer. The AI will be

able to do that.

>> Um, but it's one thing that I think

Anthropic have fully focused on. You

know, they don't make image models,

multimodal models, world models. They

just do, you know, coding and language

models. And um, they're very, very good

at that.

>> But, if the goal is AGI, we must solve

this. The issue is that spoken language, and

especially written language, is just an

artificial artifact of the human mind.

Language is a super lossy compression of

our inner model of the world, and it is

not a fundamental part of human

intelligence. Think about the difference

between someone who has read a thousand

books on Formula 1 racing and Lewis

Hamilton. The reader knows all the facts

and physics, but the driver has

developed such a deep and robust

understanding of how the machine works

that the car essentially feels like an

extension of his own body. The human

mind relies on a wordless internal

representation of reality. That is the

core of intelligence. Generating the

words to approximate that internal

understanding is the trivial part in

comparison. And this leads us to the

holy grail of modern AI, the world

model. The first step to building it,

teaching machines to predict basic

physics across primary natural human

modalities, vision, sound, and touch.

>> So, if you show a video to a computer,

um and train some big neural net to

predict what's going to happen next in

the video,

if the system is capable of learning

this and doing a good job at that

prediction, it will probably have

understood a lot about the underlying

nature of the physical world.

Things that, you know, objects move

according to

uh particular laws, right? So, animate

objects can move in things that are more

unpredictable, but still, you know,

satisfying some constraints.

>> Now, Demis Hassabis' core argument is

that generative AI video can force AI to

develop a physics-based natural world

model.

>> But, how is an image generator close to

AGI?

>> Oh, well, of course. Look, let's take

image generators, but also uh let's talk

about our video generator Veo, which is

the state of the art in video

generation. I think that's even more

interesting and from an AGI perspective.

You know, you can think of a video model

that can generate you 10 seconds, 20

seconds of a realistic scene. It's sort

of a model of the physical world.

Intuitive physics, we sometimes call it

in physics land. And it's sort of

intuitively understood how uh liquids

and and and and and objects behave in

the world. And that's um and obviously

one way to exhibit understanding is to

be able to generate it at least to the

to the to the human eye being accurate

enough to to be satisfying to the human

eye. Obviously, it's not completely

accurate from a physics point of view

and we're getting it we're we're going

to improve that. But it's it's it's

steps towards having this idea of a

world model a system that can understand

the world and the mechanics and the

causality of the world. And then of

course that would be I think essential

for AGI because that would allow these

systems to plan long-term plan in the

real world.

>> But although we now know empirically

that even generative models have to

develop some model of the world, Yann

argues that this is not really a world

model. The point of contention is that

the whole encoder-diffusion-decoder

stack we talked about is trained with

one purpose: to reconstruct the original

image. And Yann's view is that if the training

objective is reconstruction at pixel

level, then the model is wasting the

vast majority of its compute trying to

predict the random movements of leaves

on a tree. That's not only useless, but

impossible to predict. Joint embedding

architectures that Yann is advocating for

do something very different. JEPA is not

a generative model, and right now it is

one of the few serious alternatives to

purely generative approaches. The latest

implementation of this architecture

consists of three main parts: a context

encoder, a target encoder, and a

predictor. The target encoder processes

the whole image. The context encoder

sees the visible part of the image. Then

the predictor tries to predict the

target representation of the missing

patch from the context representation.
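A heavily simplified sketch of that three-part setup. The encoders and the predictor here are tiny hand-written linear maps, and the target encoder only looks at the masked patch instead of the whole image, so treat every name and shape as a hypothetical illustration rather than the real JEPA implementation. The thing to notice is where the loss is computed: between two embeddings, with no pixels reconstructed.

```python
def matvec(matrix, vec):
    """Multiply a small matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def mean_vectors(vectors):
    """Average a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def jepa_loss(patches, masked_index, ctx_enc, tgt_enc, predictor):
    """Squared distance between predicted and target embeddings."""
    # Context encoder sees only the visible patches.
    visible = [p for i, p in enumerate(patches) if i != masked_index]
    context_repr = mean_vectors([matvec(ctx_enc, p) for p in visible])
    # Target encoder embeds the hidden patch (simplified: patch only).
    target_repr = matvec(tgt_enc, patches[masked_index])
    # Predictor guesses the target embedding from the context embedding.
    predicted = matvec(predictor, context_repr)
    return sum((a - b) ** 2 for a, b in zip(predicted, target_repr))


IDENTITY = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for learned weights
patches = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
loss = jepa_loss(patches, masked_index=2, ctx_enc=IDENTITY,
                 tgt_enc=IDENTITY, predictor=IDENTITY)
print(loss)
```

Training would adjust the encoder and predictor weights to push this loss down, which rewards representations that make the hidden part predictable from the visible part.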

The most important thing to understand

is that the learning signal comes from

the pressure for the context and target

encoders to agree on a similar latent

representation because both are derived

from the same underlying scene. So this

whole generative versus JEPA argument

comes down to this very nuanced point.

Both architectures learn some model of

the world, but the key difference is in

a generative setup, the final objective

is usually to keep enough information to

reconstruct the original input as

accurately as possible. In JEPA, the

objective is to predict a compatible

representation of a missing part. The

model only needs to match

representations in embedding space.

With video, the same principle applies.

If a ball rolls behind a wall, JEPA

must predict a compatible future

representation where the ball emerges on

the other side. Over enough examples,

the model can develop an implicit

understanding of object permanence

without ever generating the video

itself. Now, this raises an obvious

question. If JEPA sounds so reasonable,

why isn't everyone just switching?

Theoretically, architectures like this

should eventually outperform purely

generative systems, but empirically,

generative models are still performing

much better across many important

domains. Yann argues that predictive

world models are the more scalable

long-term path toward human-level

intelligence. Meanwhile, DeepMind is

saying generative training may already

contain enough pressure to learn

surprisingly deep world models

implicitly. I know it has already been

pretty hard to follow this part, but it

is incomplete without this final point.

The generative diffusion model itself is

not the final product for DeepMind. The

approach is to build a hyperrealistic

and grounded simulation of the world

using generative diffusion models. Then,

use that simulated world to train the

multimodal AI model Gemini to develop an

explicit evolving model of reality

internally. Essentially, a multimodal,

probably autoregressive model gathering

experience in diffusion-based simulated

worlds. That's Gemini Omni, by the way.

It isn't just another diffusion-based

video generator. It is an early

iteration of DeepMind's world model, an

autoregressive multimodal core built for

anything to anything capabilities

starting with video. Finally, the only

big idea remaining is larger-scale

pre-training as the foundation of the

whole system, which DeepMind has been a

really big fan of. But, highly respected

researchers like Ilya Sutskever and

Richard Sutton suggest that larger-scale

pre-training is probably a dead end. The

core issue is that larger-scale

pre-training almost by definition

already assumes an inefficient learner.

Ilya Sutskever's and Richard Sutton's

argument can be divided into two parts.

First, learning from experience. The

physical world is a messy environment

full of unorganized goals, very subtle

learning signals, and super dynamic

interactions. Performing all of the

training in an isolated lab with rigid

arbitrary goals will never result in a

truly intelligent and robust system. The

model should require only minimal

pre-training paired with an extremely

efficient learning algorithm. Argument

number two is biological plausibility.

>> The thing

that happened with AGI and pre-training

is that in some sense they overshot the

target. If you think about the term AGI,

you will realize, and especially in the

context of pre-training, you will

realize that a human being is not an AGI

because a human being Yes, there is

definitely a foundation of skills. A

human being lacks a huge amount of

knowledge.

Instead, we rely on continual learning.

It's a process as opposed to you drop

the finished thing.

>> The thing you're pointing out with

superintelligence

is not some finished

mind which knows how to do every single

job in the economy cuz the way, say,

the original, I think, OpenAI charter or

whatever defines AGI is like it can do

every single job that a every single

thing a human can do.

You're proposing instead

a mind which can learn to do any single

every single job.

>> Yes.

>> And that is superintelligence. And then

but once you have the learning

algorithm,

it gets deployed into the world the same

way a human laborer might join an

organization.

>> Most current learning systems operate

with extremely rigid objectives. Reach

the target and you receive a reward.

Fail and you get punished. Learning

becomes a harsh binary process. Human

learning works very differently. You do

not need to crash the car to understand

danger. You do not need endless miles of

driving before receiving feedback. The

moment you sit behind the wheel, you

already have a sense of what optimal

should feel like. You have a sense of

balance, control, confidence,

uncertainty, and whether something feels

wrong. The brain is constantly

evaluating itself long before any

catastrophic outcome occurs because the

learning signals are so rich. Humans

learn with astonishing efficiency and

are surprisingly robust. Google

DeepMind, however, has been advocating

for larger-scale pre-training as part of

the core recipe for AGI.

>> I think by far, probably my betting

would be the quickest way to get to AGI

and the most likely plausible way is to

use all the knowledge that's existing in

the world right now on things like the

web, ingesting all of that information,

and I don't see why you wouldn't start

with a model as a kind of prior or or to

build on and to make predictions that

helps bootstrap your learning. I just

think it

doesn't make sense not to make use of

that. So my my betting would be is that

you know, the final AGI system will have

these large multi-modal

models as part of the the overall

solution, but probably won't be enough

on their own. You will need this

additional planning search on top.

>> This is the same philosophy that built

AlphaGo. That breakthrough proved deep

learning could solve near infinite

problems using deeply efficient pattern

recognition. It gave DeepMind the

conviction that AGI, despite seeming

impossibly vast, can be solved using

those same principles. From one angle,

Google DeepMind is doubling down on

generative AI and massive pre-trained

models. From another, they are actively

pushing research far beyond the standard

transformers and auto-regressive models.

It is a highly specific bet on exactly

what AGI should look like. Fortunately,

it will only take a couple more years

for us to understand who is right.

Honestly, what an odd time to be alive.

Thank you for watching, and I'll see you in

the next one.