AI and privacy: A primer for developers

For privacy lawyers, ChatGPT’s ban in Italy was a ‘told-ya’ moment. Large language models (LLMs) like ChatGPT are trained on huge volumes of data. This includes publicly available data sourced from the Internet, such as open Reddit threads or news pieces; datasets licensed or purchased from relevant providers, for instance, datasets procured from data vendors; and data about users of the AI models, which is used to further train the model. A lot of this could be personal information about individuals. But, when collecting and using it, do developers abide by privacy norms?

ChatGPT is back up in Italy, with some changes. (See our piece on Italy’s concerns and lessons for India here.) To meet the conditions set by Italy’s data protection authority, OpenAI modified its privacy policy to specifically call out the use of data for training AI models, changed its user interface to let users opt out of their data being used for training, and added an age-gating mechanism to verify users’ age at sign-up.

Regulators around the world are watching closely. The UK’s ICO has released guidance for developers to consider while developing or using generative AI (including identifying a legal basis for data processing, preparing data protection impact assessments, ensuring transparency, etc.). France’s CNIL has a four-part action plan for regulating generative AI. The US FTC has also released guidance on marketing AI products and ensuring fairness while using AI tools.

Law or no law, with AI discourse at an inflection point, data privacy can’t be an afterthought for developers of AI. We discuss what developers can do to minimise privacy risks.

Identify ‘personal data’: Privacy norms extend to ‘personal data’, that is, data that is about or relates to an individual. Privacy rules don’t extend to non-personal or anonymised data, as such data doesn’t involve or relate to individuals. The rule of thumb is to ask whether the data you deal with can identify an individual or relates to an individual. For example, to train a bank’s customer service chatbot, you feed it real conversations between customer service reps and customers. This could include personal details about customers, and so it is personal data. But to train AI algorithms in a crop monitoring system, you use data about soil patterns. This doesn’t relate to individuals and is not personal data. This exercise helps you identify the category of data you must worry about.
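As a rough illustration of this rule of thumb, a developer might run a lightweight check over candidate training records to flag fields that look like personal identifiers. The patterns below are hypothetical and nowhere near exhaustive; real projects typically rely on dedicated PII-detection tooling and legal review.

```python
import re

# Hypothetical, non-exhaustive patterns for common personal identifiers.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\+?\d[\d\s-]{8,}\d"),
    "card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def looks_like_personal_data(record: dict) -> bool:
    """Flag a record if any field matches a known identifier pattern."""
    for value in record.values():
        text = str(value)
        if any(pattern.search(text) for pattern in PII_PATTERNS.values()):
            return True
    return False

# A chatbot training record containing customer details is flagged...
print(looks_like_personal_data(
    {"transcript": "Hi, this is Asha, my email is asha@example.com"}
))  # True

# ...while soil-sensor readings for a crop-monitoring model are not.
print(looks_like_personal_data(
    {"soil_ph": 6.4, "moisture_pct": 31.2}
))  # False
```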

Identify whose data you have: To train your AI model, you could have data about your users (individuals who use your service or platform) and non-users (individuals who don’t use your service, but whose data you may have scooped up from the Internet or other sources for training your AI model). While you directly interact with users (when they sign up), you may need to explore whether you can inform non-users and meaningfully offer them choices about their data (see opt-out below). For instance, OpenAI made changes to inform non-users (for example, public figures whose data is freely available on the Internet) about the way their data is used and gave them an option to object to the processing of their data.

Anonymise data: Evaluate if you really need personal data. For instance, in the customer service example, while you need to know how customer service representatives typically respond to grievances (the tone, the style, the level of detail), your AI model doesn’t need to know names, addresses, or other personal identifiers of the customer.

If you don’t need personal data, consider seeking and collecting only anonymised data. Of the three sources we discussed above, if you’re getting training datasets from a third-party source, say CT scans from a radiology lab to train your AI model to detect cancerous growths, you could ask the source to anonymise the data before sharing it with you. Typically, this would mean contractual clauses requiring the data provider to anonymise the data before allowing you access. When getting data from public sources, you could consider removing personal data from the training dataset, using data anonymisation techniques or synthetic data.
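A minimal sketch of what this could look like for the customer-service example: scrubbing obvious identifiers from transcripts before they enter the training set. The regexes and placeholder tokens here are illustrative assumptions; production pipelines generally use purpose-built anonymisation libraries plus manual review.

```python
import re

# Illustrative patterns only; real anonymisation needs far more robust tooling.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE_RE = re.compile(r"\+?\d[\d\s-]{8,}\d")
NAME_RE = re.compile(r"(?i)\b(my name is|this is)\s+[A-Z][a-z]+")

def scrub_transcript(text: str) -> str:
    """Replace obvious personal identifiers with placeholder tokens."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    text = NAME_RE.sub(lambda m: m.group(0).rsplit(" ", 1)[0] + " [NAME]", text)
    return text

raw = "Hello, my name is Asha Rao. Reach me at asha@example.com or +91 98765 43210."
print(scrub_transcript(raw))
# Hello, my name is [NAME] Rao. Reach me at [EMAIL] or [PHONE].
# Note: the surname slips through, which is exactly why regex-only
# scrubbing should not be treated as true anonymisation.
```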

Tell people what you’re doing: Publish a privacy policy, explaining how you collect and use personal data. While OpenAI’s privacy policy earlier did say it uses personal data to improve its services, after Italy’s ban the policy now specifically calls out the use of data to train its AI models.

Include your privacy notice in your sign-up workflow: Just having a privacy notice on your website may not be enough. When a user signs up on your platform, give them a link to your privacy policy to read before continuing to sign up. (Of course, this only covers the scenario where users’ data is used to further train the AI model, not where training data is gathered by scraping the Internet or from third-party sources.)

Opt-in or opt-out mechanisms: A key tenet of privacy is to allow individuals to make meaningful choices about how their data is used. So, while using data to train AI algorithms, consider whether you can allow individuals to permit or object to your use of their personal data. Users can be given this choice during the sign-up workflow. For non-users, you may consider an opt-out, similar to how OpenAI now allows individuals to fill in a form objecting to the processing of their data. Italy’s DPA also asked OpenAI to conduct a mass media campaign to tell individuals that their data may have been used to train OpenAI’s algorithms and that they could visit the website to exercise their right to opt out. Taking a cue, developers must communicate opt-out options effectively and make them easily discoverable.
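For users, the choice can be captured in the sign-up record itself. The sketch below assumes a hypothetical sign-up handler and simply stores an explicit training-consent flag alongside the account, defaulting to not using the data for training until the user opts in.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class UserAccount:
    email: str
    # Explicit flag recorded at sign-up; default is *not* to use data for training.
    allow_training_use: bool = False
    consent_recorded_at: datetime | None = None

def sign_up(email: str, allow_training_use: bool) -> UserAccount:
    """Create an account and log the user's training-data choice with a timestamp."""
    return UserAccount(
        email=email,
        allow_training_use=allow_training_use,
        consent_recorded_at=datetime.now(timezone.utc),
    )

account = sign_up("user@example.com", allow_training_use=False)
print(account.allow_training_use)  # False: this user's data stays out of training sets
```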

Implementing opt-outs: While an opt-in only requires you to maintain a log, implementing opt-outs can be a complex exercise, especially when non-users make such requests. Opt-outs are reactive, so you need to ‘undo’ something already done, which can be difficult. To start with, it is difficult to accurately identify and trace an individual’s data throughout training datasets, especially if the individuals are not profiled and their information is incidentally included in the training dataset. Training datasets may also evolve and update over time and be spread across systems, so processing opt-out requests in real time, across multiple versions and systems, can be complex. User preferences may change over time and individuals may revoke their opt-out requests, adding another layer of complexity. Further, it can be challenging to accurately trace an individual’s data if the data has been anonymised or aggregated. A few things to consider while implementing opt-outs for LLMs (a simple sketch follows the list below):

(a) Ask non-users to provide unique identifiers (like email addresses or phone numbers), which can help you process their request. Clearview AI requires individuals to submit their photo to process a data deletion request, since it does not maintain any other information on the individuals. OpenAI requests non-users to provide evidence that the model has knowledge of such individuals, including the relevant prompts. So, depending on your AI model, you may need to seek information from the individuals to process their requests.

(b) On receiving an opt-out request from a non-user, developers could consider asking the individual to identify whether they are a public figure. Given the quantum of data available on public figures, it might be difficult to provide an effective opt-out. OpenAI asks non-users to identify the country whose law applies; this is useful to ascertain whether the law mandates you to process the request.

(c) Explain how the opt-out works and its limitations. This brings transparency and helps manage individuals’ expectations.

(d) Consider third-party audits to periodically verify that the opted-out data is not being used in the training process. This may be relevant if you routinely deal with sensitive personal information like health or financial information.
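To make points (a) to (d) more concrete, here is a minimal sketch of an opt-out registry that filters records out of a training set before each training run. Storing hashed identifiers and filtering at dataset-build time are assumptions made for illustration; they sidestep, rather than solve, the harder problem of removing data from already-trained models.

```python
import hashlib

def _hash_identifier(identifier: str) -> str:
    """Store a one-way hash rather than the raw email/phone supplied with the request."""
    return hashlib.sha256(identifier.strip().lower().encode()).hexdigest()

class OptOutRegistry:
    def __init__(self):
        self._opted_out: set[str] = set()

    def record_opt_out(self, identifier: str) -> None:
        self._opted_out.add(_hash_identifier(identifier))

    def is_opted_out(self, identifier: str) -> bool:
        return _hash_identifier(identifier) in self._opted_out

def build_training_set(records: list[dict], registry: OptOutRegistry) -> list[dict]:
    """Drop records tied to opted-out individuals before each training run."""
    return [r for r in records if not registry.is_opted_out(r.get("subject_id", ""))]

registry = OptOutRegistry()
registry.record_opt_out("asha@example.com")

records = [
    {"subject_id": "asha@example.com", "text": "..."},
    {"subject_id": "ravi@example.com", "text": "..."},
]
print(len(build_training_set(records, registry)))  # 1: the opted-out record is excluded
```

Note that this only keeps opted-out data out of future training runs; it does not, by itself, address data already baked into a trained model, which is where the audits mentioned in (d) become relevant.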

Age-gate to avoid scooping up children’s data: Given the heightened sensitivities in using children’s data, most privacy laws around the world place additional restrictions on organisations collecting and using children’s data. Developers could consider age-gating methods to avoid gathering children’s data to train AI models. Further, while collecting data from public sources, developers could consider filters to exclude content or websites known to contain children’s personal information. Developers can also give users the option to report instances where the AI model’s responses contain children’s data. This feedback can be used to reduce the use of children’s data in training AI models.
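A self-declared date-of-birth check at sign-up is one (imperfect) starting point for age-gating. The threshold and helper below are illustrative assumptions; the legal age varies by jurisdiction, and stronger verification may be required depending on the applicable law.

```python
from datetime import date

MIN_AGE = 18  # illustrative; the legal threshold varies by jurisdiction

def passes_age_gate(date_of_birth: date, today: date | None = None) -> bool:
    """Self-declared age check at sign-up; easily circumvented, so treat it as a first layer only."""
    today = today or date.today()
    age = today.year - date_of_birth.year - (
        (today.month, today.day) < (date_of_birth.month, date_of_birth.day)
    )
    return age >= MIN_AGE

print(passes_age_gate(date(2012, 5, 1)))  # False: block sign-up, collect no further data
print(passes_age_gate(date(1990, 5, 1)))  # True
```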

Conclusion

Ensuring privacy while developing AI models can be a complex exercise, involving legal, technical and process-level changes. It requires buy-in and involvement across teams. As a starting point, AI companies must sensitise their teams on data protection laws and privacy principles. This can help them identify and address privacy risks early on while developing the models. For example, well-informed teams could proactively build means to enable individuals to exercise their rights or minimise the amount of personal information in training datasets. Retro-fitting these elements at a later stage may require greater efforts and resources. 

This piece is authored by Mayank Takawane (Associate) and Sreenidhi Srinivasan (Partner), with inputs from Anirudh Rastogi (Managing Partner).

Image credits: Pixabay

We welcome feedback and inputs. Please reach out at sreenidhi@ikigailaw.com
