By day I study the potential of AI systems. By night I dabble in writing for TV. These worlds collide in video generation technology. So this is a cross-post between my newsletter The Stark Way (mainly about AI, science, & philosophy) and unless pictures post office (the Substack for the production company I’m affiliated with).
The most impressive video generation is text-to-video: type in a prompt, just as you would for Midjourney (text-to-image) or ChatGPT (text-to-text), and the AI generates a video sequence, either animation or realistic-ish film footage. From a few words, a whole new world.
When OpenAI’s Sora was previewed this February, it seemed to be the first example of plausible video generation. Immediately, figures in Hollywood and Silicon Valley began speculating that it was the future of filmmaking. One day soon, they prophesied, a kid in a basement will type in a single-sentence prompt and a feature film will pop out: “Arthouse romantic comedy about a neurotic writer falling in love with their biggest critic; Wes Anderson meets Miranda July.” Technology that good would surely be, like the film idea just given, the end of culture.
Or would it be a new renaissance?
Tech companies push the democratisation line. With text-to-video, anyone can become a Hollywood auteur or prestige TV showrunner. Goodbye nepo babies. The world will be flooded with new voices. Communities locked out of the culture industry will benefit from the removal of all barriers.
This is disingenuous for multiple reasons. It’s especially rich given these models are trained on creators’ IP. But is it realistic? What’s the full potential of text-to-video?
It’s bigger than kids in basements…
In this post I cover (I) the current tech, extrapolate to (II) the near future, consider (III) the limits of the technology, and finally weigh (IV) the ultimate stakes of easily generated content.
I. Text-to-video today
There are already dozens of low-end apps out there, some of them free, which can help you make a short video with a simple text prompt. Just as Midjourney or DALL-E allow you to type in a description, in natural language, and then magic up an image, apps like Pictory or Lumen5 can do it for “video”.
There are also some that are just video tools rather than full-blown video generation from scratch. E.g. tools where you upload two images and the app bridges them with animation (image-to-video); or where you upload live-action footage and it converts it to animation (one of many kinds of video-to-video). Such tools are already being integrated into film editing software.[1]
But they aren’t creating the next Palme d’Or winner or even the next episode of South Park.[2] These simple apps use templates, pre-made assets, and AI avatars, and then cobble together stock footage and graphics based on a script you enter. Some of them scaffold the work with a storyboard tool: step 1 generates a storyboard, which you tinker with, and step 2 turns that into video. They can easily build a 45-second explainer video for your website or, with AI-generated voiceover, a news item for TikTok. If you spend five minutes perusing the demos, you’ll see that (at this stage) they look a little crappy, a little corporate. But if you make 1,000 videos a month like this and upload them to short-form platforms (which, following the TikTokification of all media, is every platform), some of them might get lots of views with little effort. Now imagine thousands of creators making thousands of videos like this, and you have an acceleration of existing trends in AI-generated news and children’s content flooding YouTube, Facebook, and TikTok.
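To picture what these apps are doing under the hood, here’s a minimal sketch in Python. Every function name is a hypothetical stand-in, not any real app’s API; the point is just the two-step shape: script in, storyboard out; storyboard in, stitched-together video out.

```python
# Hypothetical sketch of the two-step template-app workflow.
# Nothing here is a real API; the functions stand in for what
# apps like Pictory or Lumen5 do behind the scenes.

def make_storyboard(script: str) -> list[dict]:
    """Step 1: split the script into scenes and pair each with a
    stock clip, template graphic, or AI avatar."""
    scenes = [s.strip() for s in script.split(".") if s.strip()]
    return [{"text": s, "asset": "stock_or_template"} for s in scenes]

def render_video(storyboard: list[dict], voiceover: bool = True) -> str:
    """Step 2: stitch the storyboard into a video file, layering
    AI-generated narration on top if asked for."""
    suffix = "_narrated" if voiceover else ""
    return f"explainer_{len(storyboard)}_scenes{suffix}.mp4"

script = "Our product saves you time. It works anywhere. Try it free."
storyboard = make_storyboard(script)  # this is the bit you tinker with
print(render_video(storyboard))       # -> explainer_3_scenes_narrated.mp4
```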
The hype in Hollywood and Silicon Valley, however, is over the genuinely generative video platforms like OpenAI’s Sora, Runway, and The Simulation (and now the Chinese platform Kling). Their CEOs envisage a day when you type in a simple natural language prompt — no storyboarding — and it generates a longer video, with sound, music, characters, and dialogue all rendered either in some animation style or imitating live action.
Currently, the best models can produce a few seconds of coherent footage. Then problems associated with generative AI creep in, like people having extra fingers or people’s faces changing between shots. Maintaining continuity between shots, let alone scenes, is so far impossible. But, early days.
The economic fallout has begun. Tyler Perry reportedly pulled the plug on a planned $800 million studio extension at a fairly late stage, after seeing Sora demoed in February. And OpenAI have money to pay creators to become early adopters. Just a few weeks ago, a paid ad session at the Tribeca Film Festival featured five shorts made using Sora. I don’t know if they’re any good. But short films just got cheaper, and everyone but the key creatives just got a little less employable.[3] 😬
II. The next 3–5 years
It should be clear that a simple extrapolation from where we are now gets you to a place where these video generation tools are as standard as 3D animation, game design, and VFX software are today.
Likely scenario: picture a studio that employs five people to work the AI tools, just as an animation studio employs a bunch of animators. Out of respect for actual animators, let’s call these people generators. A script is written by a human writer (maybe with some AI help, but it needs to be a good script, not generic sludge). From the script, a storyboard is produced, maybe even by a human storyboard artist. And then the director oversees the five generators as each works on a different shot. The shots look photorealistic, like they were recorded by a camera. But they’re entirely fabricated, in seconds, by an algorithm.
The shots are made consistent because in the first stage of production, text-to-image AI was used to generate the characters, key locations, etc. They’re saved as assets. Individual shots are then produced by dropping assets into an image (i.e. compositing), perhaps a first and last frame, with the intervening frames generated. If a shot is a person walking through an underground carpark, the carpark and the person might have been pre-generated; but the full shot of them walking is created and refined by one of the generators. They might need to touch up some details, remove some fingers, and apply various post effects. But it means the shot is done in minutes, without the need to go out and film it, without the need to animate it, and it’s hi-def photorealism.[4]
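To make that workflow concrete, here’s a minimal sketch in Python. Every function is a hypothetical stand-in (no such library exists); what matters is the shape of the pipeline: reusable assets, composited keyframes, generated in-between motion.

```python
# Hypothetical sketch of the asset-first pipeline described above.
# Each function stands in for whatever text-to-image or
# image-to-video model a studio actually licenses.

from dataclasses import dataclass

@dataclass
class Asset:
    name: str
    image: bytes  # a pre-generated character or location still

def generate_asset(prompt: str) -> Asset:
    """Stage 1: text-to-image, run once. Saving the asset is what
    keeps characters and locations consistent across shots."""
    return Asset(name=prompt, image=b"")  # placeholder pixels

def composite_keyframe(assets: list[Asset], layout: str) -> bytes:
    """Stage 2: drop saved assets into a single frame (compositing)."""
    return b""  # placeholder frame

def interpolate(first: bytes, last: bytes,
                seconds: float, fps: int = 24) -> list[bytes]:
    """Stage 3: image-to-video. The model invents the motion between
    the two keyframes; a human generator then touches up the result."""
    return [b""] * int(seconds * fps)  # placeholder frames

# One generator's job for one shot:
carpark = generate_asset("empty underground carpark, sodium lighting")
walker = generate_asset("man in a grey raincoat, mid-40s")

start = composite_keyframe([carpark, walker], "he enters frame left")
end = composite_keyframe([carpark, walker], "he reaches the stairwell")

shot = interpolate(start, end, seconds=4.0)  # 96 frames, no film crew
```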
So now you can make a “live action” feature in the way you’d do an animation — or you can make animation, but quicker/cheaper. This doesn’t mean some random outsider makes a film on their laptop that wins an Oscar. That won’t be possible for a long time, if ever. But it means no cast, crew, art department, costume, catering, location, or stunts.
In this scenario, you still need a writer, a director, and a few other key decision-makers: perhaps an editor, a producer, a sound designer (AI-generated sound effects are already possible, as is AI music, but someone needs to generate them and make the selections). But they’ll be making creative decisions about which AI-generated content to use and how to modify it.
There are degrees of effort and quality even in this world of generated video. You could really ramp up the speed and lower the cost by pumping out hours of children’s cartoons. Or you could have more generators taking care over every shot and make something more slick but still far cheaper than live action. The most attractive proposition, though, will be making more franchise content for less.
Some genres are more easily generated than others. Fans of those genres will split in two: some will be chuffed that there’s now an endless supply of their drug of choice; others will be appalled that what they were consuming was so formulaic, so iterable, so identikit that it could be created by a computing paradigm based on next-word prediction.[5]
Legal challenges might slow this down. Lawsuits over theft of IP are happening already.[6] I know nothing about the law, but recent deals struck by the guilds suggest it’s possible to hold studios to standards on the use of actors’ likenesses, etc.
But there’s a counter-trend: media companies see the writing on the wall and partner with the likes of OpenAI in return for help making their product cheaper. (E.g. the folks at Time just gave OpenAI all their content in return for voice-generation help, so their articles can be read aloud, and presumably for help generating articles written in the house style.)
Porn will probably be flooded early, especially animated porn. And because porn is something of a grey market, the displacement of existing porn creators probably won’t produce much public outcry.
Video generation might not even need to rely on IP. Runway think the key to video generation is having the AI learn a general world model. Once you train it on enough free and open-source video from YouTube (or get every employee to wear a GoPro all day and capture everything they see), you have sufficient visual evidence of how the world works: how objects fall, how liquids and solids behave, how people move when they walk, object permanence, different lighting conditions, etc. At some point the algorithm gets a general idea of the physics of stuff on the surface of our planet and the way natural and human-made objects look, move, and transform. Then it can produce plausible renderings of new sequences of reality.
Eventually, text-to-video will evolve into text-to-game and text-to-VR.
III. Beyond 5 years
Who knows? I don’t do forecasts. But I do predictions. The difference: while I don’t know what will actually happen or when (a forecast), I’m willing to say what is possible or impossible (a prediction), based on my best theories of how AI, filmmaking, and economics work.
Prediction: video generation, even with coherent and suspenseful plotlines, consistent and deep characters, and beautiful and innovative visuals, must be possible for AI to achieve.
So long as the meaning of sounds and images is conventional (based on patterns of use by humans), there should be no physical or computational limit to how well AI can learn the conventions of cinema, speech, art, sound design, music, facial expression, and narrative to craft very well-honed work. Further, the AI should be able to recombine these conventions in arresting and profound ways. This is already starting to happen, very occasionally, with AI-generated text and images. Certainly with human feedback and human prompting, there shouldn’t be any upper creative limit on AI-generated cinema and TV.
I have no idea when or if this level of creativity will be met. But I’m confident it’s technically feasible because there is no scientific reason, that I can see, that would prohibit it. (For the details of why I think this you’ll need to trawl through my posts about AI at The Stark Way. Here’s an interesting one.)
I find even this speculative and extreme future much less concerning than the near-term prospects that are basically already happening. Because if true AI cinema is invented, it will mean AI is producing not only more time-killing content but time-enriching art. Humans can do it. So systems trained on far more information than any human, and supplemented with human feedback, must eventually be capable of the same.
Before then, we have real problems.
IV. Real stakes
Between Hollywood and Silicon Valley, which are five hours apart by road, there’s a cultural chasm. It’s evident in what I call the zombie irony of AI companies that name their products after bad things from speculative fiction.[7] When they reference Skynet (from Terminator) killing us all, they’re sort of in on the joke. But then Fable Studio called their generative video platform The Simulation.[8] Are they being knowingly ironic? Or do they simply not realise that normal people might want a different future to them? After closely scrutinising AI and Big Tech for the last few years, I’m sure it’s the latter.
The tech people also love the idea of maximum content: quantity is quality. This is anathema to most people in the arts and media. But quantity over quality has characterised mass culture from the start. Most content is mid. There have always been genres dedicated to dishing up the highly consumable, not the life-changing: penny dreadfuls, fanfics, soap operas, reality TV, iPhone games, etc.
This sounds snarky or elitist. Didn’t we decide, back in the ’90s, that there was no distinction between high culture and low culture? And didn’t we all do affirmations during COVID that there were no more “guilty pleasures”, no more shame, just pleasure in whatever we want to watch?
Although English teachers and reviewers still struggle to articulate the difference between the two kinds of entertainment, I think it’s simple, if subtle. Some entertainment, in any medium, is for using up time, making it go faster. And some entertainment returns us to time and slows it down. An example is the difference between ambient TV and mindful TV, as outlined by Meg O’Connell on this very Substack.
The mark of real art was worked out by some ancient Greeks[9] who called it aisthesis, the root of our word “aesthetics”. The original meaning had to do with perception. The theory was that art should slow down our perception. It should make the flow of time or the experience of existence palpable. Often this means taking something familiar and defamiliarising it. It breaks us out of default perception, where everything goes along as we’d expect.
Ambient TV and other forms of time-using or time-killing entertainment do the opposite. They’re frictionless. Our brains don’t have to work and suddenly we’ve watched five episodes and it’s time to go to sleep for real now, during which our surprise-starved brains will amuse themselves by generating improbable new worlds (dreams: a nothing-to-VR technology).
Text-to-video sludge, like any form of time-killing entertainment, is an assault on human lucidity. It amplifies a pre-existing problem: any tech that makes addictive content easier to produce or consume is a new threat to attention. The tech giants are openly vying to grab every second of attention because it is immensely profitable. The TikTokification of media also means Big Tech (in which I now include the likes of Netflix and Spotify, as well as the Chinese government) have unprecedented data on precisely what makes us swipe or keep watching. They know our micro-tastes much better than we ever could.
This is terrifying if you dwell on it. Lots of people want some narcotising entertainment requiring little engagement: something they can watch that pushes all the right buttons at the right timings to hack their dopaminergic rhythms. Slot machines do this. This is not so far off having an opioid addiction.
When we combine AI-generated video with data scraping, wearables, the metaverse, and brain-computer interfaces, a banal but real dystopia looms: the prospect of a personalised, automatically generated content stream optimised to capture and hold your attention.[10]
This fentanyl of the mind might not be “harmful” but it is evil. It’s not harmful to keep someone entertained with their consent. But anyone who makes money by sapping people’s time is unmistakably evil. What else could we call it?
True artists will find ways to free our attention. The resistance will include astonishing new art, enabled by AI, in a medium we can barely yet imagine. In the meantime, we need to get better at protecting our attention.
Also, everything here applies to sound design, music production, graphic design, photography, writing, etc. All electronic media types can now be generated, and multimodal models are being developed that are anything-to-anything.
[2] The irony-blind team behind The Simulation released a paper describing their generated “episode” of South Park. Here’s a quotation that captures the vapidity of some of these dreamers: “Maybe people are still upset about the last season of Game of Thrones. Imagine if you could ask your A.I. to make a new ending that goes a different way and maybe even put yourself in there as a main character or something.”
[3] [EDIT 10-07-24]: I forgot to mention, as important context to all this, that video generation (like image, music, and text generation) is actually one of the only areas of the economy where generative AI will make a substantial difference. As a reference and learning tool, it can help in education, therapy, coding, etc. But it’s really only the creative industries, sadly, where it’s a job-killer. Freelance illustrators and copywriters were the first victims.
[4] Basically, we’ll eventually see generation platforms that are “full-stack”: they’ll combine existing tools for image generation, animation, upscaling, 3D, NeRF, editing, and digital effects (as well as voice and music generation). So you get a Final Cut Pro or After Effects style software package, except you can start with nothing, generate and edit everything in there, and have it come out looking like live action (if you want it to).
[5] Similarly, I hope the things we use ChatGPT for now are instructive. Even people who are sceptical of or hostile to AI use ChatGPT to generate the fluff they need for a job reference, a grant proposal, a form letter. As these things become solely AI-generated, the exercise should become meaningless. Much of the writing requested by bureaucracies, HR departments, and teachers will be unmasked as what it is: pointless busywork.
[6] Here’s an interview with Mira Murati, OpenAI’s CTO, struggling with questions about whether Sora is trained on copyrighted material. Skip to 00:56 for that. We don’t yet have an AI that can convincingly infer human motivations; nor do we yet have a tech leader who would be any good at poker.
[7] See the classic Torment Nexus meme.
[8] The name comes from their CEO, Edward Saatchi. Yes, he’s a nepo baby, so there is something in common with Hollywood.
[9] Anyone wanting to know more will have to do serious research, sadly. The classicist James Porter is one of the only people to have worked on this. He’s excavated the aesthetic theories of the pre-Socratic philosophers and reinterpreted later philosophers, like Aristotle, in light of them.
[10] This is an old trope in sci-fi and fantasy: the seductive but fake world. Again, observe the zombie irony of The Simulation and their bizarre winking at the idea of their technology leading to some Matrix-like hell.
Sorry for all the footnotes!