Tow Center

Op-Ed: AI’s Most Pressing Ethics Problem

AI trained on synthetic data has the potential to devolve into its own dangerous feedback loop

April 23, 2024
Image: Crocodile in Museum Meermanno, MMW, 10 B 25, f. 12v.

Recent New York Times investigative reporting has shed new light on the ethics of developing artificial intelligence systems at OpenAI, Microsoft, Google, and Meta. It revealed that, in creating the latest generative AI systems, these companies changed their own privacy policies and considered flouting copyright law in order to ingest the trillions of words available on the internet.

More important, the reporting reiterated claims from current industry leaders, like Sam Altman—OpenAI’s notorious CEO—that the main problem facing the development of more advanced AI is that these systems will soon run out of available data to devour. Thus, the largest AI companies in the world are increasingly turning to “synthetic data,” or information generated by AI itself, rather than humans, to continue to train their systems.

As a tech policy expert, I believe the use of synthetic data presents one of the greatest ethical issues facing the future of AI. Using it to train new AI only compounds the problems of bias from the past. And, coupled with generative AI’s tendency to create false information, the use of synthetic data could cause AI to devolve into its own dangerous feedback loop.

AI is only as good as the data that it is trained on. Or, as the computer science saying goes: garbage in, garbage out. Years before the release of ChatGPT, groundbreaking internet studies scholar and endowed professor Safiya Noble argued that early search algorithms displayed bias based on “data discrimination” that produced racist and sexist results. And AI policy pioneer and civil rights lawyer Rashida Richardson wrote that based on historical practices of segregation and discrimination, the training databases available to early predictive AI systems were often full of “dirty data” or information that is “inaccurate, skewed, or systemically biased.”

Computer scientist and AI researcher Timnit Gebru warned that the racist and misogynistic views prevalent on the internet were “overrepresented in the training data” of early AI language models. She predicted that by “encoding bias,” generative AI would be set up “to further amplify biases and harms.” Gebru was ousted from Google over this research, but it proved prescient as more powerful generative AI was unleashed on the world in 2022.

Last month, one year after the release of Copilot Designer, Microsoft’s AI image generator, an engineer named Shane Jones confirmed that Noble, Richardson, and Gebru’s foresight had become fact and urged the company to remove the product from public use. Jones said that while testing the AI system, he found that without much prompting, he was easily able to generate volumes of racist, misogynistic, and violent content. He also said the ease of creating these images gave him “insight into what the training dataset probably was.”

AI systems are not only consuming and replicating bias; AI trained on biased data also has a tendency to “hallucinate,” or generate incomplete or wholly inaccurate information. Recently, AI chatbots have invented nonexistent court cases to cite as legal precedent and fabricated academic citations, complete with authors, dates, and journal names. AI chatbots have also encouraged business owners to break the law and offered fictitious discounts to airline customers. These hallucinations are so widespread that the Washington Post observed that they “have come to seem more like a feature than a bug.”

Current generative AI systems have already shown that, given their original training data, they replicate bias and create false information. Training new systems with synthetic data would mean constantly feeding those biased and inaccurate outputs back into the system as new training data. Without intervention, this cycle ensures that the system will only double down on its own biases and inaccuracies. One only needs to look at the echo chamber of hate speech and misinformation created by far less sophisticated social media technology to understand where such an infinite loop can lead.

I believe that, now more than ever, it’s time for people to organize and demand that AI companies pause their advance toward deploying more powerful systems and work to fix the technology’s current failures. While it may seem like a far-fetched idea, in February, Google suspended its Gemini chatbot’s ability to generate images of people after the feature was enveloped in a public scandal. And just last month, in the wake of reporting about a rise in scams using the cloned voices of loved ones to solicit ransom, OpenAI announced it would not be releasing its new AI voice generator, citing its “potential for synthetic voice misuse.”

But I believe that society can’t just rely on the promises of American tech companies that have a history of putting profits and power above people. That’s why I argue that Congress needs to create an agency to regulate the industry. In the realm of AI, this agency should address potential harms by prohibiting the use of synthetic data and by requiring companies to audit and clean the original training data being used by their systems.

AI is now an omnipresent part of our lives. If we pause to fix the mistakes of the past and create new ethical guidelines and guardrails, it doesn’t have to become an existential threat to our future.

Anika Collier Navaroli is a senior fellow at the Tow Center for Digital Journalism at Columbia University and a Public Voices Fellow on technology in the public interest with the OpEd Project. She previously held senior policy positions at Twitter and Twitch.

