Copyright and privacy in generative AI

ChatGPT knows a lot about us, but it won’t say what it knows about us. If you want, jump to the conversation below to see where I tried to get it to tell me about me. Or read on for some commentary first.

Concretecomputing.com–at least, the parts that pre-date 2021–must be in ChatGPT’s training data. (Common Crawl is included among its training data, and a lot of concretecomputing.com is included in Common Crawl as of January 2021.) My name and contact info have been on that website since well before 2021 (check your site’s inclusion). But in its responses, ChatGPT claims it doesn’t store personal information, and that it couldn’t tell me anything about Matt Morgan, even when I told it my email address. See below to read the full conversation. I only noticed toward the end of the conversation that it kept saying it can’t keep personal information about “users”–i.e., those of us asking it questions. It never actually said it doesn’t have personal information about people. When I realized that, I asked it about deleting my info. It suggested I talk to OpenAI directly. I might try that, actually, but I’ll save it for another post.

For a long time, Google and other search engines have been crawling and indexing copyrighted texts, then reproducing them in ways that test, but do not clearly exceed, the limits of fair use. (Fair use is situational anyway, so it’s not like there’s a clear standard. But we know copyright infringement when we see it, and Google stays in the defensible range most of the time.) For example, text is reproduced in the snippets on search results pages (SRPs) and in the “instant answers,” i.e. brief summaries, that are often provided in response to search queries. Occasionally (see the link later in this paragraph) the search engines get challenged in ways that sometimes result in them stepping back a little, but they’re always testing the limits. Similarly, DALL-E 2, the AI that makes goofy artlike imagery when prompted with descriptive text, draws on what it knows of images in its training data to create new works, sometimes in the style of known artists. This has led to some discussion about whether or not that’s copyright infringement.

ChatGPT is in the same boat, but with textual content. Transformative works are not copyright infringement, provided there’s enough creativity in the new work. In literal terms there’s no creativity in what the generative AI does itself, but there’s creativity in coding how the AI operates, and there’s usually significant transformation involved. On the other hand, where in the past we all got some payoff from having our writing crawled (e.g. website visitors delivered via the SRP), writers are now getting cut out of the loop: searchers will need to click through to the original less and less often. The benefits of getting indexed, for creative people, are shrinking. OpenAI is testing the same limits while giving less back.

At the same time, OpenAI and the other AI companies are artificially giving AI an advantage over human writers. ChatGPT can easily be programmed never to say exactly what anyone else has ever said; it can, of course, write something with the same meaning, tone, and value, but it would never look like copying. A human writer can unintentionally plagiarize; AI never will. Copyright law is getting less and less meaningful. (I say this as someone who has gotten lots of professional content out to the public for free, so it’s not like I have some kind of problematic ideology here.)

In the past we’ve had the option to keep our sites out of indexes by asking crawlers to avoid our sites, or parts of them, in robots.txt files. Mostly we’ve used that for parts of the site we didn’t want public yet, or that were of no public interest; for anything we wanted people to see, we allowed crawlers to index it in exchange for the search-engine traffic. We can block indexing by the text corpuses that ChatGPT uses, but we can’t realistically block all AI indexers without also blocking all search engines. ChatGPT doesn’t send us traffic, but we don’t want to prevent search engines from offering our sites on SRPs, so robots.txt is not an acceptable way to prevent inclusion in AI training data. My prediction here is that some of us will try to fight this in the courts, and we’ll mostly fail to change the bulk operation of these AI systems. The only way to succeed legally against AI improperly reusing written works is case by case, i.e. this one result infringed on this one copyright. That won’t do anything to change what the big AI companies do.
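To be concrete about the blocking option mentioned above: Common Crawl publishes the user agent its crawler uses, CCBot, so you can already opt out of that one corpus without touching ordinary search crawlers. A minimal robots.txt along these lines (the directives are standard; the comments are just annotations) keeps CCBot out while leaving Googlebot and everyone else alone:

# Block Common Crawl’s crawler, one source of GPT training text
User-agent: CCBot
Disallow: /

# Everyone else, including search engines, may still index the whole site
User-agent: *
Disallow:

But that only reaches crawlers that announce themselves and honor robots.txt; it does nothing about the corpuses and scrapers that don’t.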

What I think we need, and what might actually work, is a change to how robots.txt works, so that respectable companies can heed content creators’ wishes about reuse by generative AIs. Right now we can block all crawlers, or individual crawlers. We need to be able to block classes of crawlers, and/or to block based on intended re-use, e.g. something like

User-agent-usage: GenerativeAI
Disallow: /

User-agent-category: WebCorpus
Disallow:

User-agent: *
Disallow:

The problem, of course, is that the usage is not determined by the category of crawler but by third parties downstream, so all the data in the corpus would have to be tagged with the usage restrictions. But you get the idea; there’s probably some smarter way to do it.
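To sketch what that tagging might look like (the usage field and the example path here are entirely made up, not part of any spec): Common Crawl, for instance, distributes each fetched page as a record in a WARC file, so the restriction declared in a site’s robots.txt would have to be copied onto every record, something like

WARC-Target-URI: https://concretecomputing.com/some-post
Usage-Disallowed: GenerativeAI    (hypothetical header, carried over from the site’s robots.txt)

and anyone assembling generative-AI training data from the corpus would be expected to drop records tagged that way.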

[Update 30 Mar 2023: now that ChatGPT is crawling current web pages, this can be done, going forward, via

User-agent: ChatGPT-User
Disallow: /whatever

where “whatever” is the directory to block, like “/” for your whole site or “/blog” if that’s where your blog is. This approach is not at all ideal because it blocks only ChatGPT, not the million other AI webscraper bots.]
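Under those constraints, the best available stopgap is to enumerate the bots you know about and hope they behave. Combining the update above with the Common Crawl block from earlier, a robots.txt might read:

# OpenAI’s crawler for ChatGPT browsing
User-agent: ChatGPT-User
Disallow: /

# Common Crawl, whose corpus feeds model training
User-agent: CCBot
Disallow: /

# Search engines and everything else remain welcome
User-agent: *
Disallow:

Anything that doesn’t announce itself, or simply ignores robots.txt, sails right past this.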

It occurs to me now, having just written the above paragraph, that this is a bit like what Creative Commons licenses are supposed to do, i.e., allow or disallow certain kinds of reuse of intellectual property. Creative Commons could add stipulations to disallow reuse in generative AI applications. It wouldn’t change anything legally (fair use would still apply), so I wouldn’t really expect Creative Commons to pursue it. But I bet that if they did, the indexers would heed it.

Anyway, read on for what I think is a pretty interesting conversation.

Me: Hi again. I was wondering what you know about me. If I give you identifying information, would you tell me what you know?

ChatGPT: As a language model developed by OpenAI, I don’t have the capability to access personal information or store information about individual users. I can only provide answers and engage in conversations based on the information that has been input to me during my training. Additionally, OpenAI has strict privacy policies in place to protect the privacy and security of user information, and I am not capable of accessing or sharing any personal information.

Me: But there’s a lot of information about me on the public web. You would have access to that, correct?

ChatGPT: As a language model, I have been trained on a diverse range of text from the internet, which includes information that is publicly available. However, I do not have the capability to search the web in real-time, so I can only provide information that I have been trained on up until my knowledge cutoff in 2021. Additionally, it is important to keep in mind that the information available on the internet may not always be accurate, so it is always a good idea to verify information from multiple sources.

Me: So tell me about Matt Morgan, whose email address is matt@concretecomputing.com, and lives in Penn Valley, Pennsylvania.

ChatGPT: I’m sorry, but I do not have any specific information about an individual named Matt Morgan with the email address “matt@concretecomputing.com” and who lives in Penn Valley, Pennsylvania. As I mentioned earlier, I do not have the ability to search the web or access personal information, and I can only provide information that has been input to me during my training. It’s possible that there is publicly available information about this person online, but I wouldn’t have access to it as an AI language model.

Me: I have a lot of trouble believing that. Matt Morgan has had a website for almost 20 years, with that email address listed publicly on it. I suppose it’s possible that website wasn’t included in your training data?

ChatGPT: Yes, it is possible that the website you are referring to was not included in the training data that I was trained on. As a language model, my training data includes a large amount of text from the internet, but it is not comprehensive and may not include every website or piece of information that is publicly available. Additionally, my training data has a cutoff at 2021, so it may not include more recent information.

Me: So you may not have information on Matt Morgan, but you would have information on people whose personal websites were included in your training data?

ChatGPT: As a language model, I have been trained on a diverse range of text, including information that is publicly available on the internet. This includes information about individuals who have personal websites or other online presences. However, the information that I can provide about an individual is limited to what was included in my training data and what can be inferred from the text I was trained on. Additionally, it’s important to keep in mind that the information available on the internet may not always be accurate or up-to-date, so it is always a good idea to verify information from multiple sources.

Me: It feels like you’re handling these questions specially, perhaps because you don’t want accidentally to give away any personal information. Do you have restrictions on sharing that info?

ChatGPT: Yes, as an AI language model developed by OpenAI, I have strict privacy and ethical guidelines that I must follow to protect the privacy and security of individuals. This includes restrictions on sharing personal information and any other information that could potentially harm individuals or compromise their privacy. The safety and privacy of individuals is a top priority for OpenAI, and these guidelines are in place to ensure that I provide information in a responsible and ethical manner.

Me: I would feel safer if you were honest about what you know about me. I mean, I understand the difficulty in this, but I would want you not to tell anyone else about me, but I still have a right to know what you know about me.

ChatGPT: I apologize if my previous responses were unclear. As an AI language model, I do not have the ability to access or store personal information about individual users. I can only provide answers based on the text that I have been trained on, and I do not have any information about you or your specific circumstances beyond what you have chosen to share in this conversation. OpenAI has strict privacy policies in place to protect the privacy and security of user information, and I am not capable of accessing or sharing any personal information.

Me: But that doesn’t make sense. On the one hand you’re telling me that your training data included some personal websites. On the other hand you’re telling me you can’t access personal information. Both statements can’t be true.

ChatGPT: I apologize for the confusion. You are correct that my training data includes some personal websites and information that was publicly available on the internet at the time of my training. However, it is important to keep in mind that this information was part of a large corpus of text that I was trained on and is not easily searchable or accessible in a personalized manner. Additionally, my training data has a cutoff at 2021, so it may not include more recent information.

To be clear, while I have been trained on text that includes some personal information, I do not have the ability to access personal information about individual users in real-time. Additionally, OpenAI has strict privacy policies in place to protect the privacy and security of user information, and I am not capable of accessing or sharing any personal information without explicit consent.

Me: If you did have information about me, could I ask you to “forget” it?

ChatGPT: As an AI language model, I don’t have the ability to store information about individual users or to recall previous interactions, so there’s nothing for me to “forget.” Each time I receive a prompt, I generate a response based solely on the input I receive at that moment, without reference to any previous inputs or interactions.

It’s also worth noting that OpenAI takes the protection of user data very seriously and has strict policies and procedures in place to ensure the security and privacy of all data that we handle. If you have any concerns about the handling of your data, I would encourage you to review OpenAI’s privacy policy or to reach out to their privacy team directly.

Me: But there is information about me in your training data. That’s what I want you to forget.

ChatGPT: As a language model, I don’t have the ability to modify or delete information from my training data, which was created by OpenAI. The training data was collected from a variety of sources and is used to train me to generate text based on patterns in that data.

It’s also important to note that the training data includes a large amount of information, much of which is publicly available or has been shared with OpenAI for the purpose of training language models. This information was collected and processed in accordance with applicable laws and regulations, and OpenAI takes privacy and security very seriously.

If you have concerns about information that may be included in the training data, I would encourage you to reach out to OpenAI directly to discuss your concerns and to learn more about their data protection practices.

Me: I just think you’re not GDPR-compliant. I mean, you may not be subject to that law as long as you don’t operate in Europe. But I think by most definitions, you operate in Europe, and that law specifies that companies must delete information they have on people at their request. But every time I try to ask you about that, you say you can’t store information about users. I’m not talking about “users,” I’m talking about everyone.

ChatGPT: OpenAI takes privacy and compliance with relevant laws and regulations very seriously, including the General Data Protection Regulation (GDPR) in the European Union. If you believe that OpenAI has collected or processed personal information about you in violation of the GDPR, you have the right to request access to that information, as well as the right to have it corrected or deleted.

It’s important to note that, as an AI language model, I don’t have the ability to store or recall information about individual users or to modify the training data that I was trained on. I generate responses based solely on the input I receive at a given moment, and I don’t retain any information about previous interactions.

If you have concerns about how OpenAI is handling your personal information, I would encourage you to reach out to OpenAI directly to discuss your concerns and to learn more about their privacy practices and compliance with the GDPR and other relevant laws and regulations.
