Google DeepMind CEO demonstrates Genie 2, world-building AI model that could train robots
This week, 60 Minutes correspondent Scott Pelley reported on a massive leap forward in artificial intelligence technology led by Google DeepMind, the company's AI research hub.
Pelley took Astra, Google DeepMind's AI assistant that can see and hear with cameras and microphones, for a test drive on the streets of London, near DeepMind's headquarters in the United Kingdom.
"What can you tell me about this building I'm looking at?" he asked, wearing a pair of black glasses equipped with Astra, microphones, and a camera.
"This is the Coal Drops Yard, a shopping and dining district," the AI agent replied.
In a gallery filled with artwork chosen by 60 Minutes, Pelley raised a smartphone and asked Astra what painting he was standing in front of.
The AI agent recognized the painting as "Automat" by Edward Hopper.
Pelley asked Astra what emotions were being expressed by the painting's subject, a woman sitting alone in a cafeteria.
Astra said she "appears pensive and contemplative," and that her expression suggests "a sense of solitude."
And with a little push, Astra could do even more: it created a story around the painting.
"It's a chilly evening in the city. A Tuesday, perhaps. The woman, perhaps named Eleanor, sits alone in the diner, enjoying a warm cup of coffee," Astra said.
"She has found herself thinking about the future, wondering if she should pursue her dreams."
In an interview with Google DeepMind CEO and co-founder Demis Hassabis, Pelley asked if there were moments when an AI agent did something unexpected.
"That has happened many times…since the beginning of DeepMind," he told Pelley.
"[With] recent systems like Astra… being able to be that good at understanding the physical world was not something we were expecting it to be that good at that quickly."
While reporting this story, 60 Minutes learned more about advancements in generative AI models that produce images, video, and even interactive 3D environments.
Two years ago, Pelley and a 60 Minutes team saw a demo for an AI model that could produce short videos using simple text commands.
After the team typed in a text prompt to generate a "golden retriever with wings," several images appeared on screen showing a winged, golden-haired puppy walking through grass. The images were fairly blurry and distorted.
Two years later, the technology has made astounding progress.
Director of product development Tom Hume showed 60 Minutes associate producer Katie Brennan a demonstration of Veo 2, a video-generating AI model.
A similar but more detailed prompt produced a photorealistic video of a golden retriever puppy with wings running through a field of grass and flowers.
Sunlight shone through its birdlike wings that flapped as it ran. It looked like a live-action scene filmed with a movie camera, sharp and detailed.
Hassabis and DeepMind research scientist Jack Parker-Holder showed Pelley an AI model called Genie 2.
Genie 2 can turn a single static image into a 3D world that a human player or an AI agent can explore.
Parker-Holder pointed to an employee's photograph on a screen: the view from the top of a waterfall in California, looking out at the horizon.
"So, we prompt the model with this image, which is not game-like, and Genie converts it into a game-like world that you can then interact in," he explained.
Suddenly, a video played of what looked like a first-person video game, starting at the top of the waterfall in the photograph.
The avatar walked around the pool at the top of the waterfall, water droplets misting into the air. As it turned right, a landscape that wasn't in the original photograph appeared.
In another example, a paper plane soared through a Western landscape, new features coming into view as it flew ahead.
"Every subsequent frame is generated by the AI," Parker-Holder explained.
Hassabis and Parker-Holder told Pelley that these simulated 3D environments can also be used to train AI "agents" that can perform tasks.
An image of a knight with a torch standing in front of three doorways came on the screen. The doorway on the right leads to a flight of stairs.
Parker-Holder explained that they took one of their "most capable AI agents" and asked it to go up the staircase.
The AI-controlled knight walked up the stairs, blue light pouring over the staircase and new walls appearing around him.
"The Genie world model is creating the world around it on the fly and sort of imagining what's up there," Parker-Holder explained.
Pelley asked Hassabis what the practical implications of this technology would be.
"There's lots of implications for entertainment, and generating games and videos," Hassabis said.
"But actually, the bigger goal… is building a world model, a model that can understand our world."
Hassabis said future versions of this technology could create an infinite variety of simulated environments, where AI agents could learn new skills, perform tasks, and interact with people and objects.
Hassabis said this training could also work for robots.
"It's much harder to collect data in the real world, much more expensive, much slower. For example, robotics data," Hassabis explained.
"You can only collect a small amount of that in the real world. But in simulated worlds, you can collect almost an unlimited amount. So, you'd learn first in simulated worlds with the robot, as a simulated robot. And then you would fine tune it at the end on a little bit of real-world data."
Pelley wondered if Google's trove of geographic data, collected for Google Earth, Google Maps and Google Street View, could also be used to train AI.
"That's what we're exploring at the moment actually… potentially using Street View kind of data to give real-world understanding and geographical understanding to our AI systems," Hassabis said.
"On the other hand, you can imagine things like… bringing to life static images of real places, whether it's your own holiday photos or actually Street View…[and] making them interactive and 3D, so you can look around."
The video above was produced by Will Croxton. It was edited by Sarah Shafer.
Artwork courtesy of the Heirs of Josephine N. Hopper / ARS, NY