Founded at the Massachusetts Institute of Technology in 1899, MIT Technology Review is a world-renowned, independent media company whose insight, analysis, reviews, interviews and live events explain the newest technologies and their commercial, social and political impact.
The AI program Sora generated a video featuring this artificial woman based on a text prompt
Sora/OpenAI
OpenAI has unveiled its latest artificial intelligence system, a program called Sora that can transform text descriptions into photorealistic videos. The video generation model is spurring excitement about advancing AI technology, along with growing concerns over how artificial deepfake videos worsen misinformation and disinformation during a pivotal election year worldwide.
The Sora AI model can currently create videos up to 60 seconds long using either text instructions alone or text combined with an image. One demonstration video starts with a text prompt that describes how “a stylish woman walks down a Tokyo street filled with warm glowing neon and animated city signage”. Other examples include a dog frolicking in the snow, vehicles driving along roads and more fantastical scenarios such as sharks swimming in midair between city skyscrapers.
“As with other techniques in generative AI, there is no reason to believe that text-to-video will not continue to rapidly improve – moving us closer and closer to a time when it will be difficult to distinguish the fake from the real,” says Hany Farid at the University of California, Berkeley. “This technology, if combined with AI-powered voice cloning, could open up an entirely new front when it comes to creating deepfakes of people saying and doing things they never did.”
Sora is based in part on OpenAI’s preexisting technologies, such as the image generator DALL-E and the GPT large language models. Text-to-video AI models have lagged somewhat behind those other technologies in terms of realism and accessibility, but the Sora demonstration is an “order of magnitude more believable and less cartoonish” than what has come before, says Rachel Tobac, co-founder of SocialProof Security, a white-hat hacking organisation focused on social engineering.
To achieve this higher level of realism, Sora combines two different AI approaches. The first is a diffusion model similar to those used in AI image generators such as DALL-E. These models learn to gradually convert randomised image pixels into a coherent image. The second AI technique is called “transformer architecture” and is used to contextualise and piece together sequential data. For example, large language models use transformer architecture to assemble words into generally comprehensible sentences. In this case, OpenAI broke down video clips into visual “spacetime patches” that Sora’s transformer architecture could process.
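To make the idea of “spacetime patches” concrete, here is a minimal Python sketch. It is not OpenAI’s code. It assumes a patch is simply a small block of pixels spanning a few consecutive frames, flattened into one list that a transformer could treat much as a language model treats a word. The function name `spacetime_patches`, the patch sizes and the toy greyscale clip are all hypothetical; the article does not say how Sora actually cuts up or encodes its video.

```python
from typing import List

Video = List[List[List[float]]]  # indexed as video[frame][row][col]


def spacetime_patches(video: Video, pt: int, ph: int, pw: int) -> List[List[float]]:
    """Cut a (frames x height x width) clip into non-overlapping blocks of
    pt frames x ph rows x pw columns, flattening each block into one token.

    Hypothetical illustration only; not how Sora is known to work.
    """
    frames, height, width = len(video), len(video[0]), len(video[0][0])
    if frames % pt or height % ph or width % pw:
        raise ValueError("video dimensions must be divisible by the patch size")

    tokens = []
    for t0 in range(0, frames, pt):          # step through time
        for y0 in range(0, height, ph):      # step down the image
            for x0 in range(0, width, pw):   # step across the image
                token = [
                    video[t][y][x]
                    for t in range(t0, t0 + pt)
                    for y in range(y0, y0 + ph)
                    for x in range(x0, x0 + pw)
                ]
                tokens.append(token)
    return tokens


# Toy greyscale clip: 4 frames of 4x4 pixels, every pixel given a unique number.
clip = [[[t * 16 + y * 4 + x for x in range(4)] for y in range(4)] for t in range(4)]
tokens = spacetime_patches(clip, pt=2, ph=2, pw=2)
print(len(tokens), len(tokens[0]))  # prints "8 8": eight patches of eight pixels each
```

Each token here mixes pixels from two neighbouring frames, so a model reading the sequence sees motion as well as layout. A real system would work on colour video and would likely compress it first, which this sketch leaves out.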
Sora’s videos still contain plenty of mistakes, such as a walking human’s left and right legs swapping places, a chair randomly floating in midair or a bitten cookie magically having no bite mark. Still, Jim Fan, a senior research scientist at NVIDIA, took to the social media platform X to praise Sora as a “data-driven physics engine” that can simulate worlds.
The fact that Sora’s videos still display some strange glitches when depicting complex scenes with lots of movement suggests that such deepfake videos will be detectable for now, says Arvind Narayanan at Princeton University. But he also cautioned that in the long run “we will need to find other ways to adapt as a society”.
OpenAI has held off on making Sora publicly available while it performs “red team” exercises where experts try to break the AI model’s safeguards in order to assess its potential for misuse. The select group of people currently testing Sora are “domain experts in areas like misinformation, hateful content and bias”, says an OpenAI spokesperson.
This testing is vital because artificial videos could let bad actors generate false footage in order to, for instance, harass someone or sway a political election. Misinformation and disinformation fuelled by AI-generated deepfakes rank as a major concern for leaders in academia, business, government and other sectors, as well as for AI experts.
“Sora is absolutely capable of creating videos that could trick everyday folks,” says Tobac. “Video does not need to be perfect to be believable as many people still don’t realise that video can be manipulated as easily as pictures.”
AI companies will need to collaborate with social media networks and governments to handle the scale of misinformation and disinformation likely to occur once Sora becomes open to the public, says Tobac. Defences could include implementing unique identifiers, or “watermarks”, for AI-generated content.
When asked if OpenAI has any plans to make Sora more widely available in 2024, the OpenAI spokesperson described the company as “taking several important safety steps ahead of making Sora available in OpenAI’s products”. For instance, the company already uses automated processes aimed at preventing its commercial AI models from generating depictions of extreme violence, sexual content, hateful imagery and real politicians or celebrities. With more people than ever before participating in elections this year, those safety steps will be crucial.
We already know that OpenAI’s chatbots can pass the bar exam without going to law school. Now, just in time for the Oscars, a new OpenAI app called Sora hopes to master cinema without going to film school. For now a research product, Sora is going out to a few select creators and a number of security experts who will red-team it for safety vulnerabilities. OpenAI plans to make it available to all wannabe auteurs at some unspecified date, but it decided to preview it in advance.
Other companies, from giants like Google to startups like Runway, have already revealed text-to-video AI projects. But OpenAI says that Sora is distinguished by its striking photorealism—something I haven’t seen in its competitors—and its ability to produce longer clips than the brief snippets other models typically do, up to one minute. The researchers I spoke to won’t say how long it takes to render all that video, but when pressed, they described it as more in the “going out for a burrito” ballpark than “taking a few days off.” If the hand-picked examples I saw are to be believed, the effort is worth it.
OpenAI didn’t let me enter my own prompts, but it shared four instances of Sora’s power. (None approached the purported one-minute limit; the longest was 17 seconds.) The first came from a detailed prompt that sounded like an obsessive screenwriter’s setup: “Beautiful, snowy Tokyo city is bustling. The camera moves through the bustling city street, following several people enjoying the beautiful snowy weather and shopping at nearby stalls. Gorgeous sakura petals are flying through the wind along with snowflakes.”
AI-generated video made with OpenAI’s Sora.
Courtesy of OpenAI
The result is a convincing view of what is unmistakably Tokyo, in that magic moment when snowflakes and cherry blossoms coexist. The virtual camera, as if affixed to a drone, follows a couple as they slowly stroll through a streetscape. One of the passersby is wearing a mask. Cars rumble by on a riverside roadway to their left, and to the right shoppers flit in and out of a row of tiny shops.
It’s not perfect. Only when you watch the clip a few times do you realize that the main characters—a couple strolling down the snow-covered sidewalk—would have faced a dilemma had the virtual camera kept running. The sidewalk they occupy seems to dead-end; they would have had to step over a small guardrail to a weird parallel walkway on their right. Despite this mild glitch, the Tokyo example is a mind-blowing exercise in world-building. Down the road, production designers will debate whether it’s a powerful collaborator or a job killer. Also, the people in this video—who are entirely generated by a digital neural network—aren’t shown in close-up, and they don’t do any emoting. But the Sora team says that in other instances they’ve had fake actors showing real emotions.
The other clips are also impressive, notably one asking for “an animated scene of a short fluffy monster kneeling beside a red candle,” along with some detailed stage directions (“wide eyes and open mouth”) and a description of the desired vibe of the clip. Sora produces a Pixar-esque creature that seems to have DNA from a Furby, a Gremlin, and Sulley in Monsters, Inc. I remember that when that film came out, Pixar made a huge deal of how difficult it was to create the ultra-complex texture of a monster’s fur as the creature moved around. It took all of Pixar’s wizards months to get it right. OpenAI’s new text-to-video machine … just did it.
“It learns about 3D geometry and consistency,” says Tim Brooks, a research scientist on the project, of that accomplishment. “We didn’t bake that in—it just entirely emerged from seeing a lot of data.”
AI-generated video made with the prompt, “animated scene features a close-up of a short fluffy monster kneeling beside a melting red candle. the art style is 3d and realistic, with a focus on lighting and texture. the mood of the painting is one of wonder and curiosity, as the monster gazes at the flame with wide eyes and open mouth. its pose and expression convey a sense of innocence and playfulness, as if it is exploring the world around it for the first time. the use of warm colors and dramatic lighting further enhances the cozy atmosphere of the image.”
Courtesy of OpenAI
While the scenes are certainly impressive, the most startling of Sora’s capabilities are those that it has not been trained for. Powered by a version of the diffusion model used by OpenAI’s DALL-E 3 image generator as well as the transformer-based engine of GPT-4, Sora does not merely churn out videos that fulfill the demands of the prompts, but does so in a way that shows an emergent grasp of cinematic grammar.
That translates into a flair for storytelling. In another video, created from a prompt for “a gorgeously rendered papercraft world of a coral reef, rife with colorful fish and sea creatures,” Bill Peebles, another researcher on the project, notes that Sora created a narrative thrust through its camera angles and timing. “There’s actually multiple shot changes—these are not stitched together, but generated by the model in one go,” he says. “We didn’t tell it to do that, it just automatically did it.”
AI-generated video made with the prompt, “a gorgeously rendered papercraft world of a coral reef, rife with colorful fish and sea creatures.”
Courtesy of OpenAI
In another example I didn’t view, Sora was prompted to give a tour of a zoo. “It started off with the name of the zoo on a big sign, gradually panned down, and then had a number of shot changes to show the different animals that live at the zoo,” says Peebles. “It did it in a nice and cinematic way that it hadn’t been explicitly instructed to do.”
One feature in Sora that the OpenAI team didn’t show, and may not release for quite a while, is the ability to generate videos from a single image or a sequence of frames. “This is going to be another really cool way to improve storytelling capabilities,” says Brooks. “You can draw exactly what you have on your mind and then animate it to life.” OpenAI is aware that this feature also has the potential to produce deepfakes and misinformation. “We’re going to be very careful about all the safety implications for this,” Peebles adds.