Text-image generators are a handy way to produce arresting images. But which combinations of words create images that are art, and which generate only dull or banal ones?
Last month, OpenAI, the company behind the image generator DALL-E (named after the Spanish surrealist Salvador Dalí and the Pixar robot WALL-E, from the eponymous 2008 film), announced that users were creating over two million AI-generated images per day. The company added that it had fine-tuned its filters to reject violent or sexual content and other images that violate its policies.
But given the ease of access and increased sophistication of text-image generators, many experts predict that it won’t be long before the technology becomes yet another weapon in the arsenal of those looking to spread disinformation and propaganda. The technology already raises serious questions about copyright and the commercial use of artificially generated images.
Getty Images, for instance, unlike some of its competitors, banned the sale of AI-generated illustrations on its site in September because of uncertainty around the legality of such images, while also announcing a partnership with a site that uses similar technology to enable substantial and creative editing of existing images. The distinction being drawn is between image generation and image editing, even if the effect of the editing is to create an entirely different image.
In a recently released report, Democracy Reporting International observed that this “combination of a text model and a synthetic image creator raises the prospect that we will see a shift in disinformation strategies, moving from manipulation of existing content to the creation of new realities.” For the researchers, the application of AI technology goes “beyond the manipulation of existing media” to the “production of fully synthetic content… eventually allowing for the quick and easy generation of fake visual evidence as a direct complement to false (news) narratives.”
Another significant concern, critics say, is that AI technology will continue to reproduce stereotypes and biases that already exist in our society, since it draws on existing images online when generating pictorial responses to text prompts. That would make it easier for those who want to create visual “evidence” to display alongside falsified narratives targeting marginalized communities.
Democracy Reporting International does offer recommendations on how to prepare for and respond to the growing mass of AI-created content. It argues that widespread digital literacy is essential if people are to recognize false narratives and disinformation. The researchers also suggest prebunking, that is, proactively countering falsified images and text rather than merely reacting to them.
I spoke with Beatriz Almeida Saab, co-author of the report, about the threat text-image generators represent and how best to mitigate potential damage. This conversation has been edited for length and clarity.
While preparing the report, what came up unexpectedly for you in your research?
The threat is not the technology itself but access to it. The technology to manipulate media has always been there; it’s just a matter of how easily and quickly you can do it. Plus, we’ve seen that people believe far less sophisticated forms of manipulation. Our whole point is that malicious actors will soon have easy access to this technology, and using it will be effortless. This kind of technology is open access, meaning it’s available to everybody. And there’s no regulation in place. If we are not discussing it at a policy level, how will we be prepared for the consequences?
What would be your nightmare scenario with text-to-image generation?
A malicious actor creates a false headline, builds a story around it, and uses artificial intelligence (AI), specifically text-to-image generation models, to create an image that perfectly supports the false narrative, manufacturing realistic fake evidence. That narrative becomes harder to verify and debunk, and people will not change their minds, because a piece of fake evidence supports the story and there is no room for questioning an image.
How does text-to-image generation differ from “deep fakes” that already exist?
Deep fakes are typically used as an umbrella term for all forms of audio-visual manipulation — video, audio, or both. They are highly sophisticated manipulations using AI-driven technology, enabling those aiming to spread disinformation to make it seem that someone said or did something they did not, or that an event took place that never actually occurred. The main difference between deep fakes and text-prompt-generated images is that deep fakes are sophisticated manipulations of existing audio-visual content. Text-to-image creation is novel in that it moves from manipulating pre-existing media to generating entirely new media, to creating an image that reflects the desired reality.
Who is most directly impacted by the implementation of this technology? What responsibility do people on the frontlines of this new tech have?
On one level, everyone is impacted. The way we consume information, images, and everything online will change. We need to learn how to discern what is true from what is false online, which is very hard. A researcher we interviewed for the report pointed out that your brain processes information just by consuming it, whether it’s true or not. Your subconscious will process it, and it will stay with you. It also impacts what we call provenance technology stakeholders, who work to verify media authenticity. So it impacts the way you debunk; it affects the way you fact-check. It involves all these stakeholders, because creating fake evidence to support a false narrative is very serious.