I wanna be in the black box where it happens

Prying open Google's algorithm


In the popular musical Hamilton, a singing Aaron Burr, left out of negotiations on the Compromise of 1790, laments in “The Room Where it Happens,”

No one else was in
The room where it happened
The room where it happened
The room where it happened
No one else was in
The room where it happened
The room where it happened
The room where it happened
No one really knows how the game is played
The art of the trade
How the sausage gets made
We just assume that it happens
But no one else is in
The room where it happens

This song is so good because it applies to so many different situations, but lately when I’ve been listening to it, I’ve been thinking about how B2C (consumer-facing) internet sites work.

For example, if Google decides to change its search ranking algorithm, on which thousands of sites (and millions of dollars) of the online economy depend, it doesn’t say how it does so, just that it’s done so.

If your business depends on the Google algorithm and it now starts suddenly doing worse, well, you’re out of luck. “Just keep building great content, bro!”

Google also routinely kills products that people rely on (I’m still bitter about Reader), also without any insight into how or why.

And Google isn’t the only one. Facebook’s numerous incarnations of NewsFeed, Twitter’s site redesign (which I wrote about last week), Instagram’s algorithm reshuffle, changes to surge pricing at Uber, and much, much more happen to us Internet People on an extremely regular basis.

All of these pieces of software affect our daily lives online, and push us to make specific decisions regarding our purchases, our business practices and, sometimes, our lives (when we Google for “signs of heart attack”, for example).

And we are not privy to any of the decisions about how these pieces of software are created and changed.

How does any one of these algorithms get updated? I joked, when I saw news this week that the Australian government is recommending audits of Google and Facebook:

This is a simplistic reduction for the sake of Twitter likes (what isn’t, really?), but it’s not too far from the truth.

Usually what happens is that some C-suite executives declare some top-level metrics that need to be met, either to appease the company’s venture capital overlords, or the public markets, or the company’s board. Those metrics trickle down to middle managers, who dole out the tasks to product managers, who tell a team of developers to work on feature X or Y. Feature X gets parsed out into hundreds of Jira tickets and is eventually cobbled back together into production as hundreds of pieces of code.

Then comes a shiny announcement from the company, saying they’re “thrilled” to roll out X or Y for a Better Customer Experience. If you’re lucky. If you’re not, there will be a tweet somewhere, and good luck finding it. And that’s the end of it.

We, the public, consumers of these algorithms, the people whose lives are driven by them, are never in the room where it happens, where these decisions get made, when the Jira tickets get allocated, when designers come in, when the data is parsed and chopped up. We are not consulted.

Our data goes in, and internet products come out, worked on by teams of engineers and product managers who are so far removed from their users that they might just as well work out of Juneau as Mountain View.

This lack of transparency has been an issue for as long as private companies have been around. After all, corporations aren’t democracies for a reason.

But whereas companies once dictated what kind of car we had (“any color as long as it’s black”), today’s companies are in charge of our search history, our email, our restaurant preferences, the way we communicate with friends and family, and even our heartbeats and intimate medical details.

And we have a plethora of consumer-safety regulations for cars.

But there is nothing out there for tech company products.

Finally, however, both the public media and regulators are waking up to the danger of these decisions being made in black boxes.

A couple of recent incidents make me really optimistic that we’re getting closer to not only algorithmic but also organizational transparency.

First, in Australia this week, regulators released a report on the big tech companies.

The Australian government directed the commission in late 2017 to hold the inquiry, with a mind to modernize the country’s media and privacy laws.

Among the 23 recommendations is a call for the government to set up an office in the commission to scrutinize the algorithms used by Google and Facebook to rank news and advertising. The report said the office would have the power to order Facebook, Google and other tech giants to hand information over to regulators.

Second, the U.S. Justice Department is gearing up for an antitrust investigation of Google.

Google may soon face an antitrust investigation from the US Department of Justice pertaining to its search business and potentially other aspects of the company’s sprawling software and services empire, according to a late Friday evening report from The Wall Street Journal.

Both of these things are, in my opinion, great, and come closer to putting the public back in the room where it happens, it being the algorithms and business decisions.

But, it’s a shame it had to come to this for Google. Because, in the beginning, Google was super open about all of this.

For example, if you want to look for the original PageRank, the paper is out there online. But what is much, much cooler is that there is this paper, which talks in depth about the business implementation of Google as well.

It covers Google’s design goals - “improved search quality” - and data collection:

Another important design goal was to build systems that reasonable numbers of people can actually use. Usage was important to us because we think some of the most interesting research will involve leveraging the vast amount of usage data that is available from modern web systems. For example, there are many tens of millions of searches performed every day. However, it is very difficult to get this data, mainly because it is considered commercially valuable.

It also covers the technical details, programming languages, and server size.

But then it gets to some really interesting details in the appendices. For example, take a look at Appendix A: Advertising and Mixed Motives:

Currently, the predominant business model for commercial search engines is advertising. The goals of the advertising business model do not always correspond to providing quality search to users. For example, in our prototype search engine one of the top results for cellular phone is "The Effect of Cellular Phone Use Upon Driver Attention", a study which explains in great detail the distractions and risk associated with conversing on a cell phone while driving. This search result came up first because of its high importance as judged by the PageRank algorithm, an approximation of citation importance on the web [Page, 98]. It is clear that a search engine which was taking money for showing cellular phone ads would have difficulty justifying the page that our system returned to its paying advertisers. For this type of reason and historical experience with other media [Bagdikian 83], we expect that advertising funded search engines will be inherently biased towards the advertisers and away from the needs of the consumers.

And centralized indexing:

As the capabilities of computers increase, it becomes possible to index a very large amount of text for a reasonable cost. Of course, other more bandwidth intensive media such as video is likely to become more pervasive. But, because the cost of production of text is low compared to media like video, text is likely to remain very pervasive. Also, it is likely that soon we will have speech recognition that does a reasonable job converting speech into text, expanding the amount of text available. All of this provides amazing possibilities for centralized indexing. Here is an illustrative example. We assume we want to index everything everyone in the US has written for a year.

This paper, written at the very beginning of Google, when there was no motive to be corporately obtuse, when Google was not making billions of dollars from advertising, is more illustrative than anything else Google could have put out, and will probably be more illustrative of the company’s market and ambitions than any research that comes out of the government probes.

I actually kind of can’t believe it’s still online because it’s so honest. This is the closest look we have at Google’s thinking and actions, not yet tied to monetary motivations, government lobbying, and insane ambition. Like I said before, when a company tells you who they are, believe them.

We’ll never get this kind of transparency from any other company today.

Unless someone outside of the black box interferes and pushes their way in. And I’m hopeful that these starts from the Australian and American governments are it.

Art: Composition A, Piet Mondrian, 1923

What I’m reading lately:

  1. Why/how Zoom became more popular than WebEx

  2. Great newsletter about monopoly. This edition is about Google in Russia.

  3. This:

  4. Millennial parents are using less and less social media.

  5. This:


About the Author and Newsletter

I’m a data scientist in Philadelphia. This newsletter is about tech and everything around tech. Most of my free time is spent wrangling a preschooler and a newborn, reading, and writing bad tweets. I also have longer opinions on things. Find out more here or follow me on Twitter.

If you like this newsletter, forward it to friends!