Apple explains how it builds LLMs while protecting privacy

One of the biggest criticisms of generative AI is that much of the data used to train these systems was obtained without the explicit consent of the content creators or, in some cases, allegedly through the outright theft of intellectual property. Apple says it takes a different approach, although how this fits in with Apple’s plans to use third-party LLMs for Apple Intelligence remains to be seen.

In an article posted by Apple’s Machine Learning Research team, the company explains how it uses differential privacy. By using large pools of data that users opt in to sharing, Apple can statistically discern usage patterns while limiting the exposure of data linked to a specific individual.
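
To make the idea concrete, here is a minimal sketch of randomised response, one classic way to achieve differential privacy. It is an illustration only and not Apple’s actual mechanism; the function names, the privacy parameter `epsilon` and the simulated 30% usage rate are all assumptions made for the example. Each simulated user sometimes lies about a yes/no fact, so no single report can be trusted, yet the overall rate can still be estimated from the pool.

```python
import math
import random


def randomized_response(truth: bool, epsilon: float, rng: random.Random) -> bool:
    """Report the true answer with probability e^eps / (e^eps + 1), otherwise flip it."""
    p_truth = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    return truth if rng.random() < p_truth else not truth


def estimate_true_rate(reports: list[bool], epsilon: float) -> float:
    """Remove the known bias from the noisy reports to estimate the population rate."""
    p = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    observed = sum(reports) / len(reports)
    # observed = p * rate + (1 - p) * (1 - rate); solve for rate
    return (observed - (1.0 - p)) / (2.0 * p - 1.0)


def simulate(n_users: int = 100_000, true_rate: float = 0.30,
             epsilon: float = 1.0, seed: int = 0) -> float:
    """Hypothetical population: each user privately holds one yes/no fact."""
    rng = random.Random(seed)
    truths = [rng.random() < true_rate for _ in range(n_users)]
    reports = [randomized_response(t, epsilon, rng) for t in truths]
    return estimate_true_rate(reports, epsilon)


if __name__ == "__main__":
    print(f"Estimated rate from noisy reports: {simulate():.3f}")
```

A smaller `epsilon` means more answers are flipped, which gives stronger privacy for each person but a noisier estimate for the group.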

A simple example is how we talk about demographic data without talking about a specific person. We might say “Millennials are the most educated generation ever”. That doesn’t mean every millennial is highly educated, because we don’t know about every individual specifically. But the pool of data enables us to make generalisations about the group. Apple uses the data collected through its Device Analytics tool, which users can opt in to, to create pools of aggregated data.

If we think about how generative AI works in text-based applications, it is basically a massively complex probability engine that looks at the frequency and proximity of different words and makes statistical estimations about what words belong together.
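
A toy version of that idea is a bigram model, which simply counts which word follows which and turns the counts into probabilities. This is a deliberately simplified sketch: real LLMs use neural networks over far longer contexts, and the tiny corpus and function names here are made up for illustration.

```python
from collections import Counter, defaultdict


def train_bigram(text: str) -> dict[str, Counter]:
    """Count how often each word is immediately followed by each other word."""
    words = text.lower().split()
    counts: dict[str, Counter] = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts


def next_word_probs(counts: dict[str, Counter], word: str) -> dict[str, float]:
    """Turn the follower counts for `word` into probabilities."""
    followers = counts.get(word.lower(), Counter())
    total = sum(followers.values())
    return {w: c / total for w, c in followers.items()} if total else {}


# Hypothetical training text, chosen only to keep the arithmetic easy to follow.
corpus = "the cat sat on the mat the cat ate the fish"
model = train_bigram(corpus)
probs = next_word_probs(model, "the")
print(probs)  # "cat" follows "the" in 2 of 4 cases, "mat" and "fish" in 1 each
```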

Where users opt to share Device Analytics (you’ll find the option under Settings | Privacy & Security | Analytics & Improvements | Share iPhone & Watch Analytics), they provide data to Apple that can be used, when aggregated with other users’ data, to train Apple’s LLMs.

Apple also uses synthetic data that is created “to mimic the format and important properties of user data, but do not contain any actual user generated content.”

Apple’s goal with synthetic data is to produce synthetic sentences or emails that are similar enough in topic or style to the real thing to help improve its models for summarisation without Apple collecting emails from the device.

Apple provides an example in the article about how this works, using the question “Would you like to play tennis tomorrow at 11:30AM?” and creating synthetic variants to determine which variant is most often chosen by users who have opted in to sharing Device Analytics.
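
The selection step can be sketched roughly as follows, assuming each participating device simply reports which synthetic variant is closest to a message it holds locally. Everything here is hypothetical: the variants, the sample messages and the word-overlap similarity are stand-ins chosen for readability, and the privacy noise a real system would add to each report is left out.

```python
from collections import Counter


def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity between two sentences (a simple stand-in measure)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def closest_variant(local_message: str, variants: list[str]) -> int:
    """Runs on the device: return only the index of the most similar variant."""
    return max(range(len(variants)), key=lambda i: jaccard(local_message, variants[i]))


# Hypothetical synthetic variants generated centrally.
variants = [
    "would you like to play tennis tomorrow at 11:30am",
    "want to grab lunch on friday",
    "can we move the meeting to monday",
]

# Hypothetical local messages; these never leave the simulated devices.
device_messages = [
    "do you want to play tennis tomorrow",
    "shall we play tennis at noon",
    "can we move the meeting to tuesday",
]

# The server only ever sees a tally of variant indices, not the messages.
votes = Counter(closest_variant(m, variants) for m in device_messages)
most_common_index, _ = votes.most_common(1)[0]
print(variants[most_common_index])
```

The point of the design is that only an index is shared: the tally tells the model builder which synthetic sentence is most representative without the underlying messages being collected.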

As the world comes to grips with the cost of AI – there are many computing power, intellectual property and privacy issues still to be resolved – understanding where the data comes from to train these models and how user privacy is protected will become increasingly important.

And while Apple is at the back of the pack when it comes to AI, perhaps its slower approach will reap benefits in the longer term. I suspect, though, that many people either no longer care about, or have simply given up on, transparency in how AI models are built and enhanced.

Anthony Caruana

Anthony is the founder of Australian Apple News. He is a long-time Apple user and former editor of Australian Macworld. He has contributed to many technology magazines and newspapers as well as appearing regularly on radio and occasionally on TV.