Only one thing can save us from predatory AIs: our own data
As Hollywood recycles comic-book remakes with funding from China, the once-great cultural capital appears to be in steep decline. But in one area, at least, the movie industry is offering a bold and inspiring vision for the future. The writers and actors who went on strike to prevent studio owners from using artificial intelligence (AI) to exploit and expropriate their work are on the front lines of a new data war triggered by the explosive growth of AI. This past July, Fran Drescher, former star of the popular sitcom The Nanny and now president of the SAG-AFTRA union, described the stakes: “If we don’t stand tall right now, we are all going to be in trouble, we are all going to be in jeopardy of being replaced by machines.”
Drescher is right. Across the U.S. and the global economy, corporations are strip-mining people’s data and then using it to train the AI systems that will put those very people out of work. Two days before this article was published, the Writers Guild of America reached a tentative agreement with the studios on a new three-year deal. According to Variety, “details of language around the use of generative AI in content production was one of the last items that the sides worked on before closing the pact.” With the writers bringing their five-month strike to a close, the actors guild could soon reach its own deal. While the Hollywood shutdown has undoubtedly been painful for many of the people involved, we believe that workers who do not manage to secure protections against AI will be in for far worse outcomes. Doomsday scenarios about fully sentient AI, aside from being good marketing gimmicks for the tech industry, distract from the real and immediate danger posed by the technology: not the rise of Terminator robots, but the prospect that tens or hundreds of millions of people will suddenly lose their jobs.
Though AI technologies have been around for a while, they have only recently begun to work in ways that everyday people can understand. ChatGPT isn’t the superintelligence of science fiction, but across a wide range of tasks it delivers results on par with those of a well-educated or skilled human being. As a result, the thousands of new AI companies launched over the last couple of years have collectively drawn billions of unique monthly visitors to their sites. And as the immense power of Facebook, Twitter, and other social networking sites has shown, platforms capable of attracting billions of visitors are extraordinarily valuable.
Both stand-alone AIs like ChatGPT and those built into other applications, where they can do everything from automating medical screenings to “socializing” with lonely people, are poised to become the most valuable technological artifacts ever built, potentially worth tens of trillions of dollars over the coming decades. The problem is that AIs are data gluttons. Because there is a strong correlation between the amount of data used to train an AI and the quality of its outputs (enough data even lets you ignore many quality issues), the companies running them need access to vast stores of data that are constantly replenished.
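To make the data-scaling claim concrete: empirical “scaling law” studies of large language models find that a model’s error tends to fall as a power law in the amount of training data, improving steadily but with diminishing returns. The Python sketch below illustrates that shape only; the function name and every constant in it are invented for illustration and are not drawn from any published result.

```python
# Illustrative only: a toy power-law model of how more training data
# tends to improve an AI model, in the spirit of published scaling-law
# studies. All constants here are made up for demonstration.

def expected_loss(dataset_tokens: float,
                  irreducible: float = 1.7,
                  coeff: float = 400.0,
                  exponent: float = 0.28) -> float:
    """Toy estimate of model loss as a function of training-data size.

    Loss falls as a power law in data: more tokens yield a better
    model, but each additional order of magnitude helps less.
    """
    return irreducible + coeff / (dataset_tokens ** exponent)

for tokens in (1e9, 1e10, 1e11, 1e12):
    print(f"{tokens:>8.0e} tokens -> estimated loss {expected_loss(tokens):.3f}")
```

Run it and the diminishing returns are plain: going from a billion to a trillion tokens keeps lowering the estimated loss, which is why the companies building these systems are so hungry for ever-larger corpora.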
Until recently, it was relatively easy to acquire vast quantities of data. AI developers and platforms acted like prospectors on an unregulated frontier, hauling publicly available information off the open internet with abandon. Untold reams of data were simply screen-scraped, downloaded from questionable sources, or obtained under academic-access terms. Now that AIs have proven their commercial worth, however, with trillions of dollars in potential revenue on the line, gates are going up at a rapid clip.
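For readers unfamiliar with the mechanics, a screen-scraper can be startlingly simple. The Python sketch below fetches a single public page and strips it down to the kind of raw text that ends up in training corpora. It is a minimal illustration, not any company’s actual pipeline: the URL is a placeholder, the `requests` and `beautifulsoup4` libraries are common choices rather than anything specific firms are known to use, and real crawlers also handle robots.txt, rate limits, deduplication, and thousands of formats.

```python
# A minimal sketch of what "screen-scraping" means in practice:
# download a public web page and reduce it to plain training text.
# Assumes the common third-party libraries `requests` and
# `beautifulsoup4`; the URL below is a placeholder.
import requests
from bs4 import BeautifulSoup

def scrape_page_text(url: str) -> str:
    """Download one page and return its visible text."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # Drop script and style tags so only human-readable prose remains.
    for tag in soup(["script", "style"]):
        tag.decompose()
    return soup.get_text(separator=" ", strip=True)

if __name__ == "__main__":
    print(scrape_page_text("https://example.com")[:500])
```

Point a loop like this at millions of pages and you have, in essence, the “prospecting” described above.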
Already, Twitter, Reddit, and other big sites are restricting access to the information collected from their users in order to prevent companies from harvesting and then monetizing that data. Twitter introduced a rate limit, setting a cap on the number of posts that users can see per day. These moves have triggered a backlash from users who view such policies as intrusive and shortsighted. However, the truth is that sites that don’t build such barriers are inviting other companies to treat them as a free resource and training ground. Even the AI sites themselves, from OpenAI to Midjourney, are putting limits in place to prevent their products from being used to train other AIs. Tech companies are also signing small data-licensing deals as part of a strategy to build a legal case showing that the unlicensed use of their data damages their business. Finally, a growing number of copyright suits have been filed against the larger AI companies by rights holders ranging from big firms like Getty Images to individuals like comedian Sarah Silverman. In her lawsuit, Silverman’s attorneys contend that OpenAI and Meta acquired her book, along with countless others, through illegal torrenting sites and then used her intellectual property to train their AI models.
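To see what a cap like Twitter’s amounts to in practice, here is a deliberately simplified Python sketch of a per-user daily view limit. Everything in it is hypothetical, including the function names and the cap of 600; a real platform would enforce limits in distributed infrastructure, not in one process’s memory.

```python
# A toy version of a per-user daily cap on posts viewed, of the kind
# described above. The names and the limit are hypothetical and do
# not reflect Twitter's actual implementation.
from collections import defaultdict
from datetime import date

DAILY_VIEW_CAP = 600  # hypothetical cap on posts viewed per day

_views: dict[tuple[str, date], int] = defaultdict(int)

def allow_view(user_id: str) -> bool:
    """Return True if this user may view one more post today."""
    key = (user_id, date.today())
    if _views[key] >= DAILY_VIEW_CAP:
        return False  # over the cap: the platform refuses the request
    _views[key] += 1
    return True

# Usage: gate every timeline fetch behind the cap.
for _ in range(3):
    print(allow_view("alice"))
```

The design point is simple: whoever controls a counter like this controls who gets to read the data at scale, which is exactly why scrapers and platforms are now at war over it.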