Connecting you with nature
Am I Real? Real Background, Wildlife Created With Adobe Photoshop (Beta, Generative AI module)
I’m selective, which also means careful, about how much news I watch, read, or listen to, and where it comes from. There’s a lot of hype and headline-only coverage out there. So, when I recently started seeing a lot in the news about Artificial Intelligence (AI), and specifically Generative Artificial Intelligence (Generative AI), including a meeting President Biden just had with tech leaders to discuss regulating AI, I thought I should do some more reading on the topic.
This blog post summarizes my recent research on issues and topics in Generative AI, including my discovery that over 140 of my copyrighted images were “scraped” from my website and used by a Generative AI tech company. As a photographer who holds active copyright registrations, sells my work, and regularly publishes copyright-protected images on the Internet, I find the alarm bells regarding Generative AI deafening. Because Generative AI involves much more than improper copying of photographs from the web, you don’t have to be a professional photographer to be concerned and alarmed by what Generative AI has morphed into, and what the industry has done. Please take the time to read the rest of this post.
What is Generative AI?
Generative AI is a computer science discipline where computers are “trained” on “vast quantities of preexisting human authored works.” When a user types some words in a text prompt, the trained model generates new content based on the patterns it learned from that material. The resulting output may be text (words), visual (photographs or other images), or audio (music, speech), and is determined by the AI model based on its design and the material it has been trained on.
Generative AI models learn the patterns and structure of their input training data, and then generate new data that has similar characteristics. There are several Generative AI models, or systems, including ChatGPT (and its variant Bing Chat), a chatbot built by OpenAI using their GPT-3 and GPT-4 foundational large language models, and Bard, a chatbot built by Google using their LaMDA foundation model. Other generative AI models include artificial intelligence art systems such as Stable Diffusion, Midjourney, and DALL-E.
Understanding “training data” is critical to understanding where one of the loudest alarms is ringing in the field of Generative AI. In the context of Generative AI, “web scraping” is a common method to gather large amounts of training data from the internet in order to “train” Generative AI models. Web scraping sounds bad and it can be bad. However, it’s legal when done according to the rules. Scraping the web refers to the automated process of extracting data from public websites. It involves using software tools or scripts to access web pages, retrieve their content, and extract specific information such as text, images, or structured data. Generative AI models, including Stable Diffusion, Midjourney, DALL-E, and ChatGPT, make wide-ranging use of content scraped from the Internet. Web scraping does not involve asking for your permission or notifying you that your website has been “scraped.” Hear the alarm bells?
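To make the mechanics concrete, here is a minimal Python sketch of the extraction step described above: given the HTML of a page, it collects each image's address and its alt-text caption. The sample page, the ImageCollector class, the collect_images function, and the example.com addresses are all made up for illustration. This is not the code any particular company uses, and it deliberately skips the downloading step, so it never touches a real website.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageCollector(HTMLParser):
    """Collects (image URL, alt text) pairs from one page of HTML."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        src = attributes.get("src")
        if src:
            # Turn relative paths into full web addresses.
            full_url = urljoin(self.base_url, src)
            self.images.append((full_url, attributes.get("alt") or ""))


# Hypothetical page content, standing in for a downloaded gallery page.
SAMPLE_PAGE = """
<html><body>
  <h1>Gallery</h1>
  <img src="/photos/owl.jpg" alt="Great grey owl in snow">
  <img src="https://example.com/photos/fox.jpg" alt="Red fox at dawn">
  <img alt="No source address, so this one is skipped">
</body></html>
"""


def collect_images(html, base_url):
    """Parse one page and return its list of (image URL, caption) pairs."""
    parser = ImageCollector(base_url)
    parser.feed(html)
    return parser.images


for url, caption in collect_images(SAMPLE_PAGE, "https://example.com/gallery"):
    print(url, "|", caption)
```

Pairs of an image address and a caption like these are the raw material of image training sets. Run across millions of pages, the same few lines of logic add up to billions of entries.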
Why Does This Matter?
Any web search for “benefits of Generative AI” will tell you something about how Generative AI will revolutionize and transform the world, while also saving money and time, reducing barriers to learning, enhancing creativity, and so much more. In a nutshell, there are certainly benefits of Generative AI, and that matters.
What also matters is that we don’t allow the incredible hype of Generative AI to distract from its real risks. The risk I’m personally familiar with is web scraping in order to get training data for Generative AI companies. The Congressional Research Service (CRS) reported in May 2023 (https://crsreports.congress.gov/, Report number R47569, Generative Artificial Intelligence and Data Privacy: A Primer, May 23, 2023),
“…. such models [web scraping] rely on privacy-invasive methods for mass data collection, typically without the consent or compensation of the original user, creator, or owner. Additionally, some models may be trained on sensitive data and reveal personal information to users. In a company blog post, Google AI researchers noted, “Because these datasets can be large (hundreds of gigabytes) and pull from a range of sources, they can sometimes contain sensitive data, including personally identifiable information (PII)—names, phone numbers, addresses, etc., even if trained on public data.”
CRS also reported:
“Generative AI datasets can include information posted on publicly available internet sites, including PII and sensitive and copyrighted content. They may also include publicly available content that is erroneous, pornographic, or potentially harmful.”
These CRS findings point to serious and potentially unlawful activities. Hear the alarm bells?
The CRS report also discussed a new tool for artists and others to identify and report content that’s been scraped and found its way into these Generative AI training datasets called “HaveIBeenTrained.” HaveIBeenTrained is an organization created by artists that provides anyone an opportunity to opt-out or opt-in if they discover their content has been scraped up into specific AI training sets. As of June 24, 2023, the landing page for the HaveIBeenTrained website states, “Over 1.4 billion images opted out and counting.”
I plugged my website into HaveIBeenTrained and discovered that over 140 photographs from my site had been scraped and were in the LAION-5B training data set. AI researchers download a subset of the LAION-5B data to train Generative AI image synthesis models such as Stable Diffusion and Google Imagen. Although all work on my website is copyrighted, and I didn’t grant permission, give consent, or receive compensation, the Generative AI companies decided anyway to copy my work for their use and profit. When users find their work has been scraped, HaveIBeenTrained provides an opt-out or opt-in option. Since I own the domain for my website, I opted out for my entire domain. With that said, I have no way of knowing whether the opt-out was honored in practice. There’s no entity enforcing or overseeing any of this voluntary behavior. I don’t have access to the training data sets where my images were copied. Given the complete lack of good faith and appearance of unlawful behavior by some Generative AI companies, I’m approaching this situation with a large dose of healthy skepticism. I encourage anyone impacted in this way to do the same. Hear the alarm bells ringing?
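For readers curious what a lookup like this involves: LAION-5B is published as a very large list of image web addresses paired with captions, so checking for a site comes down to matching domains. The Python sketch below is a hypothetical illustration only, not HaveIBeenTrained's actual code. The sample records, the domain names, and the find_entries_for_domain function are invented for the example.

```python
from urllib.parse import urlparse


def find_entries_for_domain(records, domain):
    """Return the records whose image URL is hosted on `domain` or a subdomain."""
    domain = domain.lower()
    matches = []
    for image_url, caption in records:
        host = (urlparse(image_url).hostname or "").lower()
        # Match the bare domain and subdomains such as "www.", but not
        # unrelated sites whose names merely end with the same letters.
        if host == domain or host.endswith("." + domain):
            matches.append((image_url, caption))
    return matches


# Hypothetical stand-in for a few rows of an image-and-caption dataset.
SAMPLE_RECORDS = [
    ("https://www.example-photos.com/img/owl.jpg", "grey owl on a branch"),
    ("https://cdn.example.org/a/123.png", "city skyline at night"),
    ("https://example-photos.com/img/fox.jpg", "red fox in snow"),
    ("https://notexample-photos.com/img/x.jpg", "unrelated site"),
]

found = find_entries_for_domain(SAMPLE_RECORDS, "example-photos.com")
print(len(found), "matching entries")
```

A search like this can only tell you that your images are listed in a data set. It says nothing about whether an opt-out request is later respected, which is why the skepticism above still applies.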
Is it Legal?
That’s a very big and critical question. I address two of the many legal issues in this area: (1) the legality of Generative AI companies scraping copyrighted content from the web and using it in their training datasets, and (2) the legality of copyrighting content generated by AI systems (i.e., not human-authored content), whether text, image, or audio.
First, as mentioned earlier, web scraping is legal when done according to the rules. However, AI companies are seeing a growing number of lawsuits from artists and others concerning web scraping copyrighted content and using it in Generative AI training data sets. In January 2023, Getty Images, a stock photo company, initiated legal action in the United Kingdom against Stability AI, the Generative AI company behind Stable Diffusion. The basis of the lawsuit is Getty Images’ belief that Stability AI “unlawfully copied and processed millions of images protected by copyright” to train its software. The following month, in February 2023, Getty Images filed a second action in the United States, also against Stability AI, alleging it copied more than 12 million photographs from Getty Images’ collection, along with the associated captions and metadata, without permission from or compensation to Getty Images, as part of its efforts to build a competing business. Getty Images said, “As part of its unlawful scheme, Stability AI has removed or altered Getty Images’ copyright management information, provided false copyright management information, and infringed Getty Images’ famous trademarks.”
Another lawsuit against Stability AI and two other companies, Midjourney and DeviantArt, was also filed in 2023 by a group of three artists. The artists (Sarah Andersen, Kelly McKernan, and Karla Ortiz) allege that these organizations have infringed the rights of “millions of artists” by training their AI tools on five billion images scraped from the web “without the consent of the original artists.”
Shortly after I published this blog article, in June 2023, two new class action lawsuits were filed against the Generative AI company OpenAI. More information on these actions is found here: https://clarksonlawfirm.com/togetheronai/ and https://www.saverilawfirm.com/chatgpt-language-model-litigation. There are other current lawsuits against Generative AI companies, including another filed in July 2023 by accomplished comedian Sarah Silverman, https://www.documentcloud.org/documents/23869693-silverman-openai-complaint. It’s important to watch these cases, particularly for future litigation, including class actions.
A second issue that’s come up with Generative AI is whether the content these tools produce can be copyrighted. Generative AI starts with entering a text prompt, sort of like doing a Google search. For example, if you wanted a photo of a grey owl against a snowy background, you would simply type that in the prompt and the AI will generate it for you. Is that copyrightable? This issue has come before the U.S. Copyright Office, and will likely continue to come before the Office. The Office’s current policy position, expressed in a March 2023 Federal Register Notice (Copyright Registration Guidance: Works Containing Material Generated by Artificial Intelligence, Federal Register /Vol. 88, No. 51 /Thursday, March 16, 2023 /Rules and Regulations, U.S. Copyright Office, Library of Congress), reads:
“In the Office’s view, it is well established that copyright can protect only material that is the product of human creativity [not Generative AI products]. Most fundamentally, the term ‘‘author,’’ which is used in both the Constitution and the Copyright Act, excludes non-humans. The Office’s registration policies and regulations reflect statutory and judicial guidance on this issue.”
While anyone can mark anything with a copyright symbol, that doesn’t mean it’s registered with the U.S. Copyright Office, or that it could be registered. Official copyright registration (which I have on my published photographs) is what secures the ability to take legal action for copyright infringement. The Federal Register notice provided examples of cases where the U.S. Copyright Office has refused to register works generated by AI.
Matters surrounding Generative AI and copyright protections are currently very active policy issues with the U.S. Copyright Office. In spring 2023, the Office hosted four virtual listening sessions on the use of artificial intelligence to generate works in creative fields. “Copyright Office staff asked participants to discuss their hopes, concerns, and questions about generative AI and copyright law. The sessions were fully remote and focused on literary works, including print journalism and software; visual arts; audiovisual works; and music and sound recordings.”
In June 2023, the U.S. Copyright Office hosted another virtual event exploring guidance for registration of works containing Generative AI content. More events are on the way -- on July 26, 2023 the Office will host a virtual discussion on global perspectives on copyright and AI. Event description -- "Leading international experts will discuss how other countries are approaching copyright questions such as authorship, training, exceptions and limitations, and infringement. They will provide an overview of legislative developments in other regions and highlight possible areas of convergence and divergence involving generative AI." Sign up for notices about all U.S. Copyright Office events here, https://www.copyright.gov/events/
Anticipate more to come on this.
Is it Ethical?
The use of AI raises ethical questions because an AI system will reinforce what it has already learned. This becomes a problem because the kind of machine learning that underpins many of the most advanced AI tools is only as smart, fair, accurate, and balanced as the data it’s trained on. Because humans select the data used to train an AI program, the potential for bias in what the machine has learned is a risk and must be monitored closely.
Another ethical issue is that Generative AI makes it difficult to determine the authenticity of media, including images and artwork. This erodes trust in people (artists, photographers, authors, students, creators, journalists) and in their industries or avocations (journalism, art, photography, education, etc.), and it leads to confusion about the truth. Hear the alarm bells ringing?
What’s the Government Response?
Some good news -- although the federal government doesn’t appear to move as fast as the tech industry, the government has issued guidance, executive orders, a “Blueprint for an AI Bill of Rights,” and has taken other action related to the growth and growing use of AI. The U.S. Senate Committee on Homeland Security & Governmental Affairs held a hearing in March 2023, on “Artificial Intelligence: Risks and Opportunities.” In June 2023, President Biden met with technology professionals in California to discuss rapid developments in AI. The goal of the meetings was to have an in-depth discussion about how AI should be regulated in the future so that its economic and security potential can be fully realized.
Reading the tea leaves indicates that U.S. regulation is coming. Many leaders in the industry are calling for it. Those familiar with the regulatory process know it’s anything but fast, but a signal that government regulations are coming often prompts action from those who know they’ll be coming under regulation.
The AI Bill of Rights Blueprint is an important document, released by the White House in October 2022. It suggests ways to make AI more transparent, less discriminatory, and safer to use. There are many important aspects of this Blueprint, some of which will likely be reflected in future regulations. I highlight a few of the many important provisions of this Bill of Rights below, as they reflect the values and intent of government decisions regarding AI:
“You should be protected from unsafe or ineffective systems. Automated systems should be developed with consultation from diverse communities, stakeholders, and domain experts to identify concerns, risks, and potential impacts of the system. Systems should undergo pre-deployment testing, risk identification and mitigation, and ongoing monitoring that demonstrate they are safe and effective based on their intended use, mitigation of unsafe outcomes including those beyond the intended use, and adherence to domain-specific standards.”
“You should be protected from abusive data practices via built-in protections and you should have agency over how data about you is used. You should be protected from violations of privacy through design choices that ensure such protections are included by default, including ensuring that data collection conforms to reasonable expectations and that only data strictly necessary for the specific context is collected. Designers, developers, and deployers of automated systems should seek your permission and respect your decisions regarding collection, use, access, transfer, and deletion of your data in appropriate ways and to the greatest extent possible; where not possible, alternative privacy by design safeguards should be used. Systems should not employ user experience and design decisions that obfuscate user choice or burden users with defaults that are privacy invasive.”
“You should know that an automated system is being used and understand how and why it contributes to outcomes that impact you. Designers, developers, and deployers of automated systems should provide generally accessible plain language documentation including clear descriptions of the overall system functioning and the role automation plays, notice that such systems are in use, the individual or organization responsible for the system, and explanations of outcomes that are clear, timely, and accessible.”
Final Thoughts (for now)
There will be much more to come on Generative AI. It’s my goal to update this blog piece with new information as things will undoubtedly evolve, change, and improve.
Here are a few final thoughts for now:
Be a Responsible User of AI
Photography Competitions and Art Show Promoters – Review your guidelines, criteria, and jury process
Stay Informed and Protect Your Work
I’ve provided links and references to the sources I’ve used in preparing this blog. I encourage everyone to review these for themselves and to keep up with developments in AI. It is here to stay and will only grow. Here are a few of the key documents and readings:
Other Sources, Information, and Latest News
https://www.copyright.gov/events/
https://www.cbsnews.com/news/chatgpt-judge-fines-lawyers-who-used-ai/
https://glaze.cs.uchicago.edu/
https://www.hsgac.senate.gov/hearings/artificial-intelligence-risks-and-opportunities/
https://www.youtube.com/watch?v=o9t3XS1XtRE
https://zapier.com/blog/ai-art-generator/
https://time.com/6266606/how-to-spot-deepfake-pope/
https://www.scientificamerican.com/article/how-my-ai-image-won-a-major-photography-competition/
https://www.thephoblographer.com/2023/02/24/ai-generated-visuals-dont-deserve-to-be-called-photos/
https://generated.photos/humans
https://www.theartnewspaper.com/2023/04/20/are-ai-photographs-actually-photographs
https://en.wikipedia.org/wiki/Generative_artificial_intelligence
https://www.lexology.com/library/detail.aspx?g=1b32633f-91ea-482c-bed9-4cf8ed9bba80
https://sports.yahoo.com/lawsuits-over-stability-ais-stable-164923270.html?guccounter=1
https://petapixel.com/2023/04/26/photo-contests-are-woefully-unprepared-for-ai/