Google shows how not to launch an AI feature
Google’s AI Overviews feature offers lessons for anyone wanting to use LLMs with their data stores, via an almost literal example of the old programming adage: “garbage in, garbage out.”
Welcome to FILED Newsletter, your round-up of the latest news and views at the intersection of data privacy, data security, and governance.
This month:
- Meta faces criticism for training its AI models on our posts.
- Microsoft will turn off its AI “Recall” feature by default after everyone raised privacy concerns.
- Everyone thinks they are excellent at identifying phishing emails (they aren’t).
But first: Google’s AI Overviews feature offers lessons for those who wish to bring LLMs into their data stores.
If you only read one thing:
Garbage in, garbage out
Like a pizza chef adding glue to their sauce to deal with a structural integrity issue, Google has demonstrated the dangers of a poor, ill-considered diet.
In this case, we’re of course discussing information diets, in reference to the company’s new AI Overviews feature, available now in the US and rolling out worldwide by the end of the year. The feature draws on data obtained through a $60 million/year deal that allows Google to scrape Reddit’s archives and feed them to its AI models.
Publishers have raised concerns that the feature will decimate search traffic, though the web giant has reassured them that links included in the summaries get more traffic than traditional web listings. We will see how that plays out, but what is the immediate result for users?
Mostly: anodyne AI summaries, rewritten from sentences no doubt contained in the top search results, punctuated by the occasional nonsensical, potentially dangerous answer to a niche search query.
You have by now all seen examples of the platform confidently spouting nonsense:
- Doctors recommend pregnant women smoke 2-3 cigarettes per day.
- Dogs have played in the NHL and the NBA.
- You can use gasoline to make a spicy spaghetti dish.
- The US has had 42 presidents, 17 of whom were white.
And of course, the pizza glue concept, where Google’s LLMs seem to have ingested an 11-year-old Reddit comment, presumably (hopefully!) meant as a joke. (If you’re wondering, one writer tried the recipe and it worked as advertised, though there were concerns about fumes.)
Thanks to all these examples, in the week following the launch Google announced it would pare back the feature, reducing how often it appears in certain searches.
If in doubt, leave it out
Could this have been avoided? At Google’s scale, and with that data set, it is likely inevitable that some weird and wonderful content will enter the output, even when handled carefully. But AI Overviews is a spectacular example of what can happen when you turn an LLM loose on a dataset without first pruning it of nonsense.
Reddit’s value proposition is that discussions of all types happen on its platform, with a lot of contextual understanding often needed to separate genuine commentary from irony (and, not coincidentally, a lot of AI-generated rubbish). As capable as LLMs are across varied tasks, the old adage holds true: garbage in, garbage out (or perhaps that should be “glue, rocks, cigarettes, and gasoline in; garbage out”).
Google no doubt tried very hard to reduce the garbage in their data set, and even then, at this scale, there was enough to sustain many search-optimized articles about the subject.
The lesson for organizations under pressure to use LLMs on their data: the first step should be to remove, on the one hand, any sensitive or confidential data, and on the other, anything redundant, obsolete, or trivial (ROT). Any of these categories of data can cause issues.
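What might that pruning look like in practice? Here is a minimal sketch in Python of a pre-processing pass that screens documents before they reach an LLM pipeline. The Document shape and the filtering heuristics are hypothetical stand-ins, not production classifiers; a real deployment would lean on proper records management and DLP tooling.

```python
import re
from dataclasses import dataclass

@dataclass
class Document:
    doc_id: str
    text: str
    last_modified_year: int

# Crude regex screens for sensitive content; real systems would use
# a dedicated PII/DLP classifier rather than patterns like these.
SENSITIVE_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),    # US SSN-shaped strings
    re.compile(r"\b(?:\d[ -]?){13,16}\b"),   # card-number-shaped strings
]

def looks_sensitive(doc: Document) -> bool:
    return any(p.search(doc.text) for p in SENSITIVE_PATTERNS)

def looks_rot(doc: Document, current_year: int = 2024) -> bool:
    """Flag redundant, obsolete, or trivial (ROT) content with simple heuristics."""
    too_old = current_year - doc.last_modified_year > 7  # obsolete
    too_trivial = len(doc.text.split()) < 20             # trivial
    return too_old or too_trivial

def prune_corpus(docs: list[Document]) -> list[Document]:
    # Keep only documents that pass both screens.
    return [d for d in docs if not looks_sensitive(d) and not looks_rot(d)]
```

Even screens this crude keep the most obvious glue out of the sauce; the real work is tuning the filters to your own records.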
RAGtime
But you can do more. As well as removing sensitive or garbage data, organizations are using a technique called Retrieval-Augmented Generation (RAG) to give LLMs domain expertise and steer them away from whatever garbage data remains. While LLMs get their strength from their understanding of how humans generally use language, RAG offers specificity. If you are using LLMs in a legal or medical context, you could connect a RAG system to relevant legal or medical texts; the model could then cite its sources, footnote-style, for each statement it makes. To take the Google example: if a given statement had cited the satirical magazine The Onion, you would be safe ignoring it.
The gold standard when using LLMs is a human in the loop, fact-checking each statement, but when that isn’t practical, a RAG system can help reduce what shall hereby be known as the “glue in the pizza” phenomenon.
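To make the idea concrete, here is a minimal, self-contained RAG sketch in Python. The tiny bag-of-words retriever and the two-entry corpus are made-up placeholders standing in for a real embedding index over vetted sources; what matters is the shape of the flow: retrieve from curated material, then force the prompt to cite it.

```python
import math
import re
from collections import Counter

# Hypothetical curated store: in practice this would be a vetted
# document index, not two hard-coded strings.
CORPUS = {
    "medline:smoking-pregnancy": "There is no safe level of smoking during pregnancy.",
    "cookbook:carbonara": "Classic carbonara uses eggs, cheese, guanciale, and black pepper.",
}

def _tokens(text: str) -> Counter:
    return Counter(re.findall(r"[a-z']+", text.lower()))

def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(count * b[token] for token, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, k: int = 1) -> list[tuple[str, str]]:
    """Rank corpus entries by similarity to the query; a stand-in for embedding search."""
    qv = _tokens(query)
    ranked = sorted(CORPUS.items(), key=lambda item: _cosine(qv, _tokens(item[1])), reverse=True)
    return ranked[:k]

def build_prompt(query: str) -> str:
    # Ground the model in retrieved passages and demand citations,
    # so every claim can be traced back to a source you can audit.
    sources = "\n".join(f"[{sid}] {text}" for sid, text in retrieve(query))
    return (
        "Answer using ONLY the sources below, citing the source id for every claim.\n"
        "If the sources do not cover the question, say you don't know.\n\n"
        f"{sources}\n\nQuestion: {query}"
    )

# The prompt would then go to whatever LLM you use; because answers
# arrive with citations, a statement sourced to, say, a satirical
# magazine can simply be thrown out.
print(build_prompt("Is it safe to smoke while pregnant?"))
```

The design choice that matters here is the citation requirement: retrieval alone doesn’t stop a model from improvising, but auditable citations give you (or your human in the loop) something concrete to check.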
🕵️ Privacy & governance
Meta faces criticism for its plans to use public Facebook and Instagram posts and images from UK, EU, and Australian users to train its AI models, under a privacy policy update that takes effect on 26 June (though the scraping is already happening in the US).
77% of Australian consumers value privacy over personalization when it comes to online services, according to a new study.
Microsoft will turn off Recall by default following a backlash from privacy advocates and security professionals.
A new report from the Australian Competition and Consumer Commission (ACCC) says Australians don’t know and can’t control how data brokers spread their personal information, despite laws against ‘data enrichment’ services.
The Securities and Exchange Commission (SEC) has adopted amendments to Regulation S-P, which impose new privacy-related protections and obligations, including the adoption of an incident response program and customer notification requirements.
🔐 Security
Cyber-attacks against Europe, many linked to Russia, have doubled in recent months in the lead-up to the European elections and the Paris Olympics.
Data analytics firm Snowflake is working on its security strategy and asking customers to adopt stronger security controls, following suspected data breaches at companies including Live Nation and Advance Auto Parts.
Workers are still overconfident in their ability to spot phishing emails. Remember, this has gotten harder thanks to the advent of generative AI.
The latest from RecordPoint
📖 Read
When you understand your data, you can make the right decisions, which results in improved risk posture, compliance with relevant privacy and records laws, lower costs, and improved efficiency. Here's a guide to understanding your data.
Related: What is data security posture management?
And if you are based in Australia, you have no doubt heard about the coming amendments to the Privacy Act. Read our guide to preparing for the changes, expected to be tabled in August.
🎧 Listen
We have three(!) episodes of FILED for you this month.
Most recently, we published a special edition of the podcast covering the news of new legal action brought against Medibank by the OAIC, which carries a theoretical maximum penalty of AU$21 trillion.
Then SolCyber CEO Scott McCrady joined Kris and me to discuss the intersection of cybersecurity and data governance. We cover how businesses are struggling to prioritize security and privacy measures, the shift from perimeter-based security to identity-based security, and the future of cybersecurity and AI.
And finally, Civic Data director Chris Brinkworth joined the pair to dive deep into privacy regulation, following the announcement that the reform of Australia’s Privacy Act will be brought forward to August. They cover the challenges of third-party data, the “fair and reasonable” test, and why companies responding to privacy law must focus on “privacy by design.”