The data lake overfloweth

Delete this tweet, actually all of them

Art: The Second Skin, Friedensreich Hundertwasser, 1986

A while ago, Tim created a Twitter bot called Your Old Tweets (sadly no longer in service due to something Twitter claims was a spam policy violation) that would search through your tweets on a daily basis and fetch everything you’d tweeted on that same day every year you were on Twitter. 

It was wonderful because it allowed me both to reminisce, and to delete all the stupid crap I’ve tweeted since 2009. 

My Old Tweets

Twitter in 2009 was a wild and small place, entirely full of strangers using their real names to tweet stuff like what the weather was like in Washington, D.C. Back then, it felt more like hanging out on your front porch and chatting with the neighbors.

But today, due to context collapse and the size of the platform, as well as the inherent conflict and animosity built in, it feels like every tweet is a minefield. Amplifying all of this are the social and economic dynamics of cancel culture (if you only read one thing about it, make it this one).

It reminded me of this amazing 2003 paper/talk that I came across just recently, “A Group Is Its Own Worst Enemy,” which I think everyone doing anything online should read. In it, Clay Shirky argues that we are still in the very early stages of understanding social software, and that in any social software setting, people will behave the way people in groups always behave.

And people in a group both act in their own self-interest, and in that of a forming group. Shirky lists several behaviors, based on analyzing previous real-world psychology studies, that people in a group will always engage in once the group forms, that thwart the forward momentum of the group: 

  • The group conceives of its purpose as the hosting of flirtatious or salacious talk or emotions passing between pairs of members

  • The group starts identifying and vilifying external enemies (“Nothing causes a group to galvanize like an external enemy.”)

  • The group starts engaging in religious veneration: the nomination and worship of a religious icon or a set of religious tenets. The religious pattern is, essentially, we have nominated something that’s beyond critique (e.g. “Lord of the Rings cannot be criticized in a Lord of the Rings fan website”)

Once you know these patterns, it’s so obvious that Twitter, and really every social platform these days, is meant to amplify all of them. That makes cohesion impossible, which means every single thing you tweet could be controversial to someone.

Let’s face it. Social online is just plain hard, and even though I very rarely agree with pg anymore, I agree with him on this one: 

Having thought long and hard about this, and inspired by Chris’s tweets, I’ve been having this idea that I want to delete all my old tweets. 

Previously, I was doing this manually. Every time Your Old Tweets would show me the day’s retro, I would go through and delete anything that I thought could remotely be considered controversial, insensitive, or just plain dumb today. Every day for the better part of two years, I read and cleaned up my old tweets, agonizing over what to keep and what to delete. 

Because these tweets have been my diary. It’s sad, but those tweets are the main markers I have of being young and just-married and living in Washington, DC, what I cared about then, what I thought, and also of what Twitter was back in 2009, which was a nice, small chatroom for you and your closest online friends (doesn’t matter that most of my tweets had zero interaction.)

It’s also fun to see my programming skills progress throughout the years. 

The Past is a Foreign Country Made of JSON

But no matter how much I love coming back to these, more and more, I’ve been thinking that our past is, particularly in today’s climate, an enormous liability. There is a quote in Outlander, one of my absolute favorite books: “The past is a foreign country. They do things differently there.” And there is no way I can compare the online atmosphere in 2020 to what it was like in 2009, which is radically different than what it was in 2000. 

The only thing that is the same is our streams of log data, quietly following us around and changing their context over time, so that at times they might be benign and at times they might be extremely worthy of cancellation. 

So I downloaded my Twitter archive (which you can also do pretty easily!) and started looking through it to get an idea of how to get rid of it. What you get is a WHOLE bunch of stuff. 

There is a very friendly README file that tells you everything included in the download, and then you are on your own with .js files.

I won’t bore you with the technical details, although I probably will do a separate technical post on this stuff, but basically, in order to access your archive, you need to understand: 

  • HTML

  • JavaScript variables

  • JSON

  • How a README works

  • And a lot of terminology around tracking web events: 
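To make the "JavaScript variables plus JSON" point concrete, here is a minimal Python sketch of reading one of those .js files. It assumes the file is a single JavaScript assignment of a JSON array to a variable (something like `window.YTD.tweet.part0 = [...]`), and that each entry holds `id_str` and `full_text` fields, possibly wrapped in a `tweet` key. Those names and the sample data are assumptions for illustration, so check them against your own archive and its README.

```python
import json


def parse_archive_js(text):
    """Return the JSON payload from a 'some.variable = <json>' style file."""
    # Everything up to the first "=" is the JavaScript variable name.
    payload = text[text.index("=") + 1:].lstrip()
    # raw_decode stops at the end of the JSON value, so a trailing
    # semicolon or newline after the array does not break parsing.
    data, _end = json.JSONDecoder().raw_decode(payload)
    return data


def tweet_summaries(entries):
    """Yield (id, text) pairs; the field names here are assumptions."""
    for entry in entries:
        tweet = entry.get("tweet", entry)  # tolerate wrapped or bare entries
        yield tweet["id_str"], tweet["full_text"]


# Made-up sample standing in for the contents of an archive file.
sample = (
    'window.YTD.tweet.part0 = [ '
    '{"tweet": {"id_str": "1", "full_text": "sunny in DC"}}, '
    '{"tweet": {"id_str": "2", "full_text": "learning Python"}} ];'
)

tweets = list(tweet_summaries(parse_archive_js(sample)))
for tweet_id, full_text in tweets:
    print(tweet_id, full_text)
```

Once the variable assignment is stripped off, the rest is ordinary JSON, which is why the README alone is not enough to get you to your own tweets.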

And, further, to truly have fine-grained control over deleting your tweets, you need to either use Chris’s script or build your own app. Either way you need to: 

  • Know a general-purpose programming language with access to Twitter’s API

  • Understand secret tokens

  • Understand API throttling

  • Have some way to test tweet deletion before you do so

  • Have access to GitHub
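Here is a rough Python sketch of how the token, throttling, and testing pieces of that checklist fit together. It is not Chris’s script and it does not call Twitter’s API: `delete_fn` is a hypothetical stand-in for whatever client call actually deletes a tweet, `TWITTER_API_TOKEN` is a made-up environment variable name, and the fixed pause is only the simplest possible way to stay under a rate limit (real limits vary and are documented by the API).

```python
import os
import time


def load_token(env=os.environ, name="TWITTER_API_TOKEN"):
    """Read the secret from the environment so it never lands in a repo.

    The variable name is an assumption for this sketch.
    """
    token = env.get(name)
    if not token:
        raise RuntimeError(f"set {name} before running")
    return token


def delete_tweets(tweet_ids, delete_fn, dry_run=True,
                  pause_seconds=1.0, sleep=time.sleep):
    """Delete tweets one at a time, pausing between calls.

    delete_fn is a hypothetical callable taking a tweet id. With
    dry_run=True (the default) nothing is deleted; the ids that
    would be deleted are only printed.
    """
    handled = []
    for tweet_id in tweet_ids:
        if dry_run:
            print(f"[dry run] would delete {tweet_id}")
        else:
            delete_fn(tweet_id)
            sleep(pause_seconds)  # crude throttling between API calls
        handled.append(tweet_id)
    return handled
```

Defaulting to a dry run means the first thing you see is the list of tweets that would disappear, which is the moment to change your mind, since deletion cannot be undone.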

For me, all this stuff is as familiar as breathing air. Logs, JSON, parsing data, deleting embarrassing joke content, etc. 

The Data Lake is Full

As I was researching all of this, a couple thoughts came to me. First, only people who work in software or know someone who does can go about doing this (unless you use some website to delete your tweets, but the website might not give you the fine-grained control you want). 

And second, that, for most people swimming in the enormous cesspool that is the data lake of our digital lives today, realistically, it’s easier to work through security by obscurity than to have to curate all of this stuff. Security by obscurity is a concept from... you guessed it... computer security. The basic theory behind it is that if you hide your most important stuff out in the open in an unassuming way, it only becomes vulnerable when someone realizes that it’s important, and even then they have to dig through all the content to find that thing.

This is where we all are now, I think: figuring out which content to hide and which content to keep, which content is a liability and which content is an asset. And even if we’re not thinking about it, as most people are not, we experience it every time there is, say, a security breach that could leak tweets or private messages, or a data breach in the systems of companies like Marriott or Anthem or LinkedIn or Quest Diagnostics or Labcorp. Or…

And I was just taking a look at my tweets. But what about my Facebook posts? All of my Google activity (which you can also download through Takeout, and which Sundar now says we can delete, in theory)?

All of my cell phone data that’s since been passed countless times to marketers? All of my medical data, floating out there somewhere in the ether? All of my chats? My private messages across 10-11 different social platforms and 10+ years? Even if I squash one thing, it will be impossible to take care of the rest of them, and you and I and all of our friends and family will keep generating more data until we flip the Big Bit from one to zero, so to speak. 

I could go on and on and on. But it’s impossible to delete and control all of this stuff, and logs are a liability, and they’re growing exponentially. 

The easiest way to mitigate this today is, if you can’t delete it, generate a lot of data, until either (hopefully) regulations start telling companies not to collect more and we become more empowered over our own data, or until the log-filled waters of the data lake overflow. 

What I’m reading lately:

  1. Reading recs:

  2. Even more reading recs

  3. Tech interviews are bad and cause anxiety

  4. Has anyone ever used clustering?

  5. Really great replies here:

  6. This is a really nifty video and you don’t even have to know much about ML:


The Newsletter:

This newsletter’s M.O. is takes on tech news that are rooted in humanism, nuance, context, rationality, and a little fun. It goes out once a week to free subscribers, and once more to paid subscribers. If you like it, forward it to friends and tell them to subscribe!

Swag: Stickers. Mug. Notepad.

The Author:
I’m a machine learning engineer. Most of my free time is spent wrangling a preschooler and a baby, reading, and writing bad tweets. Find out more here or follow me on Twitter.