What happens when we collect too much data?
|Sep 23|| 9|
In the beginning, writes Jay Kreps, co-founder of Confluent and co-creator of Kafka*, there was the log.
A log is just a sequence of records ordered by time. It’s configured to allow more and more records to be appended at the end, like this:
Logs keep track of anything and everything. There are all kinds of logs in computing environments:
The ones that are important are server logs, which keep track of the computers that access the content and apps you have on the internet.
When your computer accesses a website, the server hosting that website gets - and keeps - a bunch of details from your computer, including which resources (web pages) the computer accessed, the time the computer accessed those resources, and the IP address of the computer that accessed them.
188.8.131.52 - - [07/Mar/2004:16:05:49 -0800] "GET /twiki/bin/edit/Main/Double_bounce_sender?topicparent= HTTP/1.1" 401 12846
184.108.40.206 - - [07/Mar/2004:16:06:51 -0800] "GET /twiki/bin/rdiff/TWiki/NewUserTemplate?rev1=1.3&rev2=1.2 HTTP/1.1" 200 4523
220.127.116.11 - - [07/Mar/2004:16:10:02 -0800] "GET /mailman/listinfo/hsdivision HTTP/1.1" 200 6291
This may not look like much, but from these logs you can extrapolate the profile of the person who’s accessing the assets, how they browse through your website, tie them to a specific geographic location, and much, much more.
This is collection of user tracking logs is known as clickstream data. For consumer-facing internet companies like Facebook and Netflix, clickstream logs are their lifeblood. As early as 2010, Facebook was hoovering these up and using Flume (an open-source log streaming solution for Hadoop) to collect and stream them to various systems for analysis. Every company does stuff with logs: Uber,Airbnb, Netflix, and almost every single e-commerce firm.
Today, the company who collects the most logs wins because, ostensibly, studying these logs, like so many internet entrails, allows them to understand what their users do, when they do it, and tweak sites to allow their users to do more. Completing more purchases. Finishing more MOOC courses. Converting free users to paid users. Figuring out how to get more users to push the like button. And so on. These companies can then write nice blog posts about all the neat-o data engineering platforms they build to collect these logs, and the data science they’re able to do on top of those platforms to analyze them, resulting in 15% more lift on their platform (for a very broad definition of lift.)
Who really wins behind the scenes are companies that participate in processing log data. For example, Jay’s company, Confluent, powers Kafka, a stream processing solution that’s really taken off over the past five years or so. But there are hundreds more companies that specialize in every component of the clickstream processing tool chain. A whole industry has grown around the need to collect, store, and analyze log data. Just look at this year’s data landscape. Poor Matt Turck is going to get carpal tunnel next year, trying to make the inevitable 15 billion ore logos fit in here.
All of this (the log collection, the data science, the tool monetization) has been going amazingly for both the companies collecting the log data and the companies building tools to collect the log data, until very, very recently, when a couple things happened.
First, the Cambridge Analytica scandal somehow did the unthinkable and turned at least part of the mainstream sentiment against Facebook. This means that the media has been reporting an increasing number of negative articles about the tech giants lately, which in turn has led to negative rumblings among lawmakers. For example, even as little as two years ago, it would have been mind boggling to see anyone talking about breaking up the tech giants, much less as a positive part of a presidential platform.
Far and away, this Act is the strongest privacy legislation enacted in any state at the moment, giving more power to consumers in regards to their private data. With a variety of major tech giants based in California, including Google and Facebook (both of which have recently suffered data breaches), AB 375 is poised to have far-reaching effects on data privacy. AB 375 will go into full effect on January 1st, 2020.
When it does, companies operating in California will essentially have to be able to fully notify consumers what they’re collecting on them, and allow them to opt out by deleting all of their data. That means deleting hundreds upon thousands of logs, and figuring out how to re-platform the log collection system to be able to delete data.
The genius of CCPA is that, if a large company is operating in California, chances are it’s operating in every other state, too. And boy is it hard to segregate co-mingled log data on a state/jurisdiction level, which means either companies will relocate their headquarters, or have to comply with CCPA by complying with stricter regulations for all of their data.
Logs are a funny thing: on the one hand, they’re enormously useful. On the other, because they grow exponentially by design and there’s never any less of them, and they seem to be everywhere, like crumbs that you just can’t get rid of. They are a huge hassle to keep track of, store, clean, tie to other data, and, just as important, sample for data science purposes.
CCPA by its very nature puts pressure on this system of log storage and analysis.
What we’re going to see, in my opinion, as a result, is that collecting more logs is bad. The more you keep, the more you have to delete. The more you have to provide back to customers. The more liability there is for breaches of the kind that GDPR is exposing.
Maciej, one of my Favorite Internet People, predicted this three years ago in his keynote at Strata, an important data conference.
I reference this talk all the time since the first time I heard it, but it’s becoming more and more relevant. It’s called Haunted by Data, and he says:
The terminology around Big Data is surprisingly bucolic. Data flows through streams into the data lake, or else it's captured in logs. A data silo stands down by the old data warehouse, where granddaddy used to sling bits.
And high above it all floats the Cloud. Then this stuff presumably flows into the digital ocean.
I would like to challenge this picture, and ask you to imagine data not as a pristine resource, but as a waste product, a bunch of radioactive, toxic sludge that we don’t know how to handle.
And it’s true. Companies and engineers are still talking about complicated ways to collect and analyze logs, and Hacker News is full of discussion around distributed streaming collection and analysis systems. But, the mainstream media is starting to talk about something else: how log collection is impacting us socially, and how to break up tech companies that do this log collection.
Logs are not yet a form of liability, but will be soon. And it’s this, more than any complicated streaming architecture, that’s worth for companies to think very hard about.
The really, really big question here (from my perspective as a Paid Data $cientist) is - what does this mean for Data Science? The explosion of data science to date has so far been based on the art of divining logs to get at user behavior. Does it mean data science, and the tools ecosystem that supports it, is going away?
I don’t think so, but I think what data science does going forward in the next five to ten years will be fundamentally different from the first ten.
If the first ten years of data science were all about collecting and analyzing everything, the second ten will be about how to be deliberate and selective about collected and analyzed data.
There’s two threads I want to mention here as a starting point, and explore a lot further in future newsletters: the art of sampling, and the art of deleting and obscuring user data.
First, sampling. In an amazing, very underrated article all the way back from 2000, Jakob Nielsen talks about why you only need five users to perform tests. At first glance, this seems insane. How can you possibly extrapolate what all one billion users of Facebook, with their geographic, economic, and ethnic diversity, are going to do on the site? I don’t know that five is enough, but the guiding principle behind the article, that once you grow past a certain number of users, the data that you collect is just additional noise, holds true. The true challenge will be how to collect just enough data to be statistically valid, but not one log more.
Second, the ability to delete and obscure data will become even more important. I haven’t seen any discussion yet about how to correctly configure systems to delete data on an incremental basis. But this will become really important, because, if you never collect it, you can never give up the data. Snapchat had the right idea, and I’m (optimistically) expecting a lot more ephemeral data collection tools to arise.
What I am actually seeing a lot more of is discussion around things like differential privacy, or the practice of obscuring user data when using it for statistical analysis - is really taking off at Google (who has a lot vested in this). Differential privacy essentially adds white noise - fake data -into real data sets, up to the point where the real data is still statistically valid, but you can’t infer any one single real user back from it.
Be on the lookout for more of both as CCPA comes into play and companies start to deal with the problem of logs.
Art: A merchant making up the account, Katsushika Hokusai
What I’m reading lately:
This newsletter about Russia!
Just found out about this YouTube channel summarizing deep learning papers
Are graphing calculators ever going to go away?
This thread on hard SQL queries:
Please read this entire thread and wonder at Leavenworth.
About the Author and Newsletter
I’m a data scientist in Philadelphia. This newsletter is about tech and everything around tech. Most of my free time is spent wrangling a preschooler and a newborn, reading, and writing bad tweets. I also have longer opinions on things. Find out more here or follow me on Twitter.
If you like this newsletter, forward it to friends!