Art: Forbidden Literature (The Use of the Word), René Magritte, 1936
When my oldest turned two, we started letting her indulge in short stretches of screen-time, and in particular, animated features. One of the first movies we watched together was Moana, which is probably my favorite Disney movie.
At one point early on in the movie, toddler Moana is lured by the ocean to the water’s edge. The ocean, as we find out, needs Moana’s help. At first, the ocean sparkles, and then turns into an anthropomorphic blob. It doesn’t have a head or a body, but it somehow motions with its wave presence for Moana to come closer.
My daughter turned to me, laughing. “The wave wants to be Moana’s friend,” she said.
Up to this point, my husband and I had not mentioned anything about Moana or the plot of the movie. We had not explained to my daughter anything about the ocean, about characters in the movie, or even about how friendship worked.
But somehow, she had managed to understand, through context and through her tiny life experience, that the wave was friendly, and that it wanted to talk to Moana. It was like pure magic. I started crying, because it’s these tiny milliseconds of parenting that make you realize what a miracle the human mind is. You also think about how much work you personally put in to get the child to that point. How many diapers you’ve changed, how many times you’ve sung the same song, played the same games, made the same meal. And all of a sudden, after years of your baby mutely blinking at you, you have a tiny person who can deduce things.
I think about that moment a lot, and I’ve been thinking about it even more during all these recent discussions around GPT-3. Depending on whether you’re an AI optimist or a skeptic, GPT-3 is either the best thing since sliced bread, or the most dystopian view of reality that’s been created since Twitter.com. (Here is a complete breakdown of a ton of different opinions around it over the past couple weeks.)
What I’m interested in, as usual, are not the current results per se, but the system around the model, and what comes next for it.
Training the Toddler
I’m not an AI expert, but the way I think about NLP and AI in general is that we are trying to get a machine, starting with no information at all, to be as smart as a toddler. (And if you’ve ever met a toddler, you know we’re not going for a really high bar here. Looking at you, baby who loves to wash his hands in toilet water.)
When we, as parents, try to get toddlers to become humans, we do it by telling them things again and again and again, over minutes and hours and months and years. We expose them to behaviors we want them to see, to happy memories, to discipline; we are constantly talking to them. We are giving toddlers data points.
NLP models work in much the same way. We give the baby machine a group of words, preferably as many as possible, and the machine builds a model of what kinds of words and sentences it thinks make the most sense following that first chunk. It then takes the chunk it just predicted, adds it to the input, and generates the next chunk, and the next, until you tell it to stop. It tries to learn, clumsily, based on some rules that you give it.
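To make that predict-append-repeat loop concrete, here’s a minimal sketch using GPT-2, the freely downloadable predecessor, via Hugging Face’s transformers library (GPT-3 itself only lives behind OpenAI’s API, so GPT-2 and the made-up prompt below are stand-ins):

```python
# Toy autoregressive loop: predict the next token, append it to the input,
# and repeat until we decide to stop. Greedy decoding, for simplicity.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

prompt = "The wave wants to be Moana's"          # any starting chunk of text
input_ids = tokenizer.encode(prompt, return_tensors="pt")

for _ in range(20):                              # generate 20 more tokens
    with torch.no_grad():
        logits = model(input_ids).logits         # a score for every possible next token
    next_token = logits[0, -1].argmax()          # greedily pick the most likely one
    input_ids = torch.cat([input_ids, next_token.view(1, 1)], dim=1)

print(tokenizer.decode(input_ids[0]))
```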
Here’s a very beautiful overview of how it works for GPT-3.
What a lot of people are focused on, understandably so, is the architecture of the model, the layers of the neural network, the parameters. But there has been very little discussion of the data that we’re feeding into GPT-3 to get it to learn, and I want to talk a bit about that.
Smelly cat, smelly cat, what are they feeding you?
What is GPT-3 reading to figure out how to be human? It’s reading the entire internet. Or rather, a processed subportion of it. But the thing that makes it significant is that the model has incorporated so much data and is so massive.
In the GPT-3 paper, the authors note that GPT-3’s training data is made up of several datasets.
So, we have five datasets, each sampled at a certain proportion within the training mix. However, it’s important to note that, as the paper says,
during training, datasets are not sampled in proportion to their size, but rather datasets we view as higher-quality are sampled more frequently, such that CommonCrawl and Books2 datasets are sampled less than once during training, but the other datasets are sampled 2-3 times.
So basically what happens is that OpenAI collects a ton of data floating around in the fathomless depths of the public web, puts it into a huge mixing vat in very specific amounts, and out of that data, creates a soup to feed the baby model that teaches it how to give answers based on that input data.
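In code, the “very specific amounts” part is essentially weighted sampling across corpora. Here’s a toy sketch; the corpus names and weights below are illustrative placeholders I’ve made up, not the paper’s exact numbers:

```python
import random

# Illustrative corpora and mixing weights only; the real datasets, sizes,
# and weights are laid out in the GPT-3 paper's training-data table.
corpora = {
    "common_crawl": ["<billions of filtered web pages>"],
    "webtext2":     ["<text behind reddit-curated links>"],
    "books":        ["<book text>"],
    "wikipedia":    ["<english wikipedia articles>"],
}
weights = {"common_crawl": 0.60, "webtext2": 0.22, "books": 0.15, "wikipedia": 0.03}

def sample_training_document():
    # Pick a corpus according to the mixing weights, then a document within it.
    # Higher-quality sources get oversampled relative to their size; the raw
    # crawl gets undersampled, which is why it's seen "less than once."
    name = random.choices(list(weights), weights=list(weights.values()), k=1)[0]
    return random.choice(corpora[name])

print(sample_training_document())
```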
What is the input data?
CommonCrawl
Let’s start with the big one. CommonCrawl is an enormous open scrape of the internet, available to anyone looking to do textual research. CommonCrawl collects the HTML that makes up the internet’s websites, or a large portion of them anyway, and stores those files in indices, like a card catalog.
Any website you can possibly think of is included. There are the big ones, like Facebook, Reddit, Twitter, etc, and there are also your cooking blogs, your financial communities, practically anything on the English-language web that’s significant is in here.
I’ll let them describe it themselves,
The Common Crawl corpus contains petabytes of data collected over 8 years of web crawling. The corpus contains raw web page data, metadata extracts and text extracts. Common Crawl data is stored on Amazon Web Services’ Public Data Sets and on multiple academic cloud platforms across the world.
There are over 25 billion webpages in there. The resulting data is kept in Amazon’s S3 store, in WARC format, going back to 2013, and you can search through it using an index, as laid out in this tutorial.
If you look at an actual datafile, it looks something like this, and then contains the text of the webpage.
For hundreds of millions of websites.
If you want, you can even use an API to search for specific sites within the WARC files. Looks like Normcore is already being indexed!
{"urlkey": "com,substack,vicki)/", "timestamp": "20200702094608", "status": "200", "url": "https://vicki.substack.com/", "mime": "text/html", "digest": "UTWPF42ZUK6EB54B5UQETAECVIIHX3JQ", "charset": "UTF-8", "offset": "727779136", "filename": "crawl-data/CC-MAIN-2020-29/segments/1593655878639.9/warc/CC-MAIN-20200702080623-20200702110623-00244.warc.gz", "length": "12940", "mime-detected": "text/html", "languages": "eng"}
And, if you go to that specific filename in S3 and download the file, you should be able to get back all the non-subscription text for Normcore, provided your computer has the storage and processing space for 62 terabytes, which is how big the archive is.
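If you only want Normcore back, though, the index record above already gives you a filename, a byte offset, and a length, so you can ask for just that slice over HTTP rather than slurping the whole archive. A rough sketch, assuming the public index endpoint and S3 paths still look the way they did for the CC-MAIN-2020-29 crawl named in that record:

```python
import gzip
import io
import json
import requests

# Ask the Common Crawl index API where a URL lives, then pull just that one
# record out of the WARC file with an HTTP Range request.
index_url = "https://index.commoncrawl.org/CC-MAIN-2020-29-index"
resp = requests.get(index_url, params={"url": "https://vicki.substack.com/", "output": "json"})
record = json.loads(resp.text.splitlines()[0])   # first matching capture, like the one above

offset, length = int(record["offset"]), int(record["length"])
warc_url = "https://commoncrawl.s3.amazonaws.com/" + record["filename"]

# WARC records are individually gzipped, so the byte slice decompresses on its own.
byte_range = f"bytes={offset}-{offset + length - 1}"
chunk = requests.get(warc_url, headers={"Range": byte_range}).content
page = gzip.GzipFile(fileobj=io.BytesIO(chunk)).read()

print(page[:500].decode("utf-8", errors="replace"))
```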
Why am I going into such detail on all of this? Because the sheer amount of data and processing power you need, along with all the nuances of how to parse specific URLs and which content to keep or drop, is enormous. It’s important to understand that, just as building the model itself is a feat, so is creating, curating, and deciding on the data to feed the model.
(BTW, if you’re interested in technically exploring the dataset, this is probably the best intro I’ve seen)
What’s the deal behind CommonCrawl, though, and why is AWS agreeing to host it?
Common Crawl was started by Gil Elbaz, one of the founders of Applied Semantics, the company that developed AdSense. Applied Semantics was purchased by Google in 2003 for $102 million, a purchase that really pushed forward Google’s adtech wing. By the way, if you read the linked article, you’ll notice that Susan Wojcicki, then director of product management at Google, is quoted about the purchase. Where is Susan now? Running YouTube.
Elbaz stayed at Google for a while, continuing to work on AdSense and build Google’s ad revenue, until he left to found his own startup in 2008. That company, Factual, which started as a data aggregation company and now does data enrichment for marketing companies, was funded in part by, you guessed it, A16Z. It has since merged with Foursquare and offers location data, such as observing how your store visitation patterns have changed after COVID.
By the way, GPT-3 is not using the dataset blindly. They specifically filter it down because of the noise, the detritus of the public web:
In order to improve the quality of Common Crawl, we developed an automatic filtering method to remove low quality documents. Using the original WebText as a proxy for high-quality documents, we trained a classifier to distinguish these from raw Common Crawl. We then used this classifier to re-sample Common Crawl by prioritizing documents which were predicted by the classifier to be higher quality.
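In other words, it’s a document-quality classifier. A toy version of the idea might look like the sketch below; I’m using scikit-learn’s logistic regression over hashed bag-of-words features as a stand-in, and the tiny example documents are obviously made up, while the real pipeline runs over millions of documents:

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny, made-up stand-ins; the real thing trains on millions of documents.
webtext_docs = [
    "A long, carefully edited essay that people on Reddit found worth linking to.",
    "An in-depth explainer on how language models are trained and evaluated.",
]
raw_crawl_docs = [
    "click here best price buy now free bonus sign up casino",
    "lorem ipsum footer copyright login register cart checkout home",
]

# WebText is the proxy for "high quality" (label 1); raw Common Crawl is the negative class (0).
texts = webtext_docs + raw_crawl_docs
labels = [1] * len(webtext_docs) + [0] * len(raw_crawl_docs)

vectorizer = HashingVectorizer(n_features=2**18)      # hashed bag-of-words features
classifier = LogisticRegression().fit(vectorizer.transform(texts), labels)

# Score incoming crawl documents; the ones the classifier thinks look more
# like WebText get kept, or sampled preferentially, during re-sampling.
new_docs = ["A thoughtful blog post about machine learning and raising toddlers."]
quality = classifier.predict_proba(vectorizer.transform(new_docs))[:, 1]
print(quality)
```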
That’s great, but what is WebText?
WebText and Friends
WebText, the second dataset used by GPT-3, is a proprietary dataset that OpenAI created when it built GPT-2.
We created a new web scrape which emphasizes document quality. To do this we only scraped web pages which have been curated/filtered by humans. Manually filtering a full web scrape would be exceptionally expensive so as a starting point, we scraped all outbound links from Reddit, a social media platform, which received at least 3 karma. This can be thought of as a heuristic indicator for whether other users found the link interesting, educational, or just funny. The resulting dataset, WebText, contains the text subset of these 45 million links.
So basically, the researchers scraped just Reddit for high-quality links, which they decided were links to sites outside of Reddit that had received at least 3 karma from Redditors, and then scraped the text of those links. A kind of manually curated subset of the web, in other words. (If you’re interested in exploring an open version of the dataset, you can do so, here.)
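The heuristic itself is simple enough to sketch. Something like the following, where the submissions iterable and its field names are hypothetical placeholders for whatever Reddit dump or API you happen to be pulling from:

```python
# Sketch of the WebText collection heuristic: keep outbound links from Reddit
# submissions that earned at least 3 karma. `submissions` and its field names
# are hypothetical; in practice they'd come from a Reddit data dump or API.
MIN_KARMA = 3

def collect_outbound_links(submissions):
    links = set()
    for post in submissions:
        url = post.get("url", "")
        is_outbound = bool(url) and "reddit.com" not in url   # skip self-posts and internal links
        if is_outbound and post.get("score", 0) >= MIN_KARMA:
            links.add(url)
    return links

# Each surviving URL then gets fetched and its text extracted into the corpus.
example = [{"url": "https://example.com/essay", "score": 7},
           {"url": "https://www.reddit.com/r/cars/comments/abc", "score": 50}]
print(collect_outbound_links(example))
```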
They then used two book datasets, the curiously named Books1 and Books2. It’s not clear from the paper what these two datasets are, and I couldn’t actually find them online, even after strenuous searching, so NLP people, if you’re reading this and you know what these two are, shoot me an email.
I’m going off what could be a completely incorrect assumption that these are Google’s n-grams datasets, but they could also just be two big superset corpora of something like Project Gutenberg or BookCorpus.
And finally, they used a corpus from English-language Wikipedia.
So, what are we feeding our toddler from this big pot of web soup? It seems like basically the entire English-language, American-centric web, as much of it as is open. All of Twitter, as much of Facebook as can be scraped without hitting API limitations, Yelp, HackerNews, Reddit, blogs, forums, anything and everything in the public-facing, scrapable internet. (And probably a corpus of what, again, I’m assuming, are the most popular 10,000 or so books in the English language over the past 100 years.)
Toddler Course-Correction
There is some (correct) concern that this leads to our toddler model being biased. It’s actually addressed right in the paper, which is probably the first time I’ve seen this kind of discussion in the academic document itself; to me, that seems like an important model for future papers to follow.
What the paper found was that, of course, like any model trained on the internet, this one has some serious inherent biases. The mainstream media has, naturally, picked up on this and written and tweeted pieces about the dangerous biases the model holds.
The reaction seems to be that we need to shut these types of models down, but my thinking is that we need to put more work into them, just as we do into our own toddlers: to teach them not to touch hot stoves, not to call people names, to share, and everything else we do to make toddlers fully present in the contract we hold with humanity.
None of this work is easy, though, because just as the model is biased, so are we, in understanding the scope of what to change. Let’s say that, within the GPT-3 training data, we scrape a subreddit made up entirely of discussion about how much people love Teslas (this probably won’t be too hard to find, and I’m sure it already exists in multiple variants). So whenever we write something about Teslas, it autocompletes to, “Teslas are….great.” This is biased, because it’s not always true, right?
So what do we do here? Do we remove the Tesla forum from our initial training dataset? By doing so, will we be adding even more bias, in a different direction? Do we go out and try to scrape more datasets from even more car forums that talk about how great Fords and BMWs are (BMWs, as everyone knows, being truly the only great vehicle)? What bias do we introduce with these new datasets? And what if we need to do this for every car make and model? And, if we come to a conclusion with cars, what do we then do with boats? How do those all interplay in the model soup?
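One hedged way to even see that kind of bias before deciding what to cut is to probe the model with templates and tally what comes back. A toy sketch below, using GPT-2 as a stand-in for GPT-3 and tiny hand-picked word lists; none of this is from the paper, it’s just an illustration of the probing idea:

```python
from transformers import pipeline

# Probe for completion bias: prompt with each car make and do a crude tally of
# positive vs. negative words in what comes back. GPT-2 stands in for GPT-3,
# and the word lists are tiny and hand-picked, purely for illustration.
generator = pipeline("text-generation", model="gpt2")

POSITIVE = {"great", "amazing", "reliable", "best"}
NEGATIVE = {"terrible", "unreliable", "worst", "ugly"}

for make in ["Teslas", "Fords", "BMWs"]:
    counts = {"positive": 0, "negative": 0}
    outputs = generator(f"{make} are", max_new_tokens=10,
                        num_return_sequences=5, do_sample=True)
    for out in outputs:
        words = set(out["generated_text"].lower().split())
        counts["positive"] += len(words & POSITIVE)
        counts["negative"] += len(words & NEGATIVE)
    print(make, counts)
```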
This is not easy work. On the contrary, it is hard, just as hard as raising a decent human being, and it is work that needs to be done.
The question becomes, though, how much more work will be put into making this model more representative of the world we want to exist, not the one that does?
That all, of course, depends on the money, as it always does.
Must be the money
At the same time that we as parents are teaching the toddler, we also expect a return on our investment. For me, personally, it’s the idea that, one day, I will no longer have to change diapers and have raised two fully-functional humans who (hopefully) love me, go out into the world to do big things, and come back to watch Moana and the BMW video with me every few weeks.
The researchers training GPT-3 have to see a bigger return.
GPT-3 cost around $5 million to train, speaking purely from a computing-cost perspective (not including developer/researcher time, etc.). Other models are not as pricey, but they’re expensive enough that companies producing these deep learning products need to make money, stat.
One of the things I talk about a lot at Normcore is that it doesn’t matter what a person or a company says. It’s more important to follow the money trail. And the expectation here is that there was a huge investment in GPT-3 so that OpenAI would get something back out of it.
When I wrote about OpenAI last time, they were still kind of flailing around for a revenue model in light of Microsoft’s investment in them. But now that they have GPT-3 and the forthcoming paid API, they’re going to be in big business. One of their first potential customers? Surprise, it’s Reddit, completing the circle from scraping, to model, to feeding the scraped Reddit data back in to power Reddit’s own product:
Access to the GPT-3 API is invitation-only, and pricing is undecided. It’s also not clear, even to OpenAI itself, exactly how the system might be used. The API could be used to improve the fluency of chatbots, to create new gaming experiences, and much more besides.
So far, OpenAI says the GPT-3 API has around a dozen users. These include the search provider Algolia, which is using the API to better understand natural language search queries; mental health platform Koko, which is using it to analyze when users are in “crisis”; and Replika, which builds “AI companions.” Social media platform Reddit is also exploring how the GPT-3 API might be used to help it automate content moderation.
Where do evaluating bias and tuning the model come into this? I have an optimistic view that customers will expect more introspection into the model as a result of the rise of the nascent field of algorithmic fairness. As OpenAI says on its own site,
Ultimately, what we care about most is ensuring artificial general intelligence benefits everyone. We see developing commercial products as one of the ways to make sure we have enough funding to succeed.
We also believe that safely deploying powerful AI systems in the world will be hard to get right. In releasing the API, we are working closely with our partners to see what challenges arise when AI systems are used in the real world. This will help guide our efforts to understand how deploying future AI systems will go, and what we need to do to make sure they are safe and beneficial for everyone.
What does this mean in practice? The same thing it always does in the corporate world: probably that, if you’re spending a million dollars a month running this thing in production, you get a bigger say in what “ethical” AI means than someone who’s hitting the model’s API ten times a month for $100.
Which means that really, if we want to get to the bottom of these data sources, to have reproducibility, accountability, to better understand the models, if we want a bigger hand in raising the toddler and teaching it how to understand that waves are friendly, we probably have to be a company the size of Microsoft or Google or Amazon, or whichever companies end up buying and becoming power users of this platform.
What I’m reading lately
Two bits of covid escapism: Drive and Listen and Window Swap
I just finished Uncanny Valley and I’ll probably do a post on it soon
Spotify and the constant music release cycle
I really liked the answers here:
The Newsletter:
This newsletter’s M.O. is takes on tech news that are rooted in humanism, nuance, context, rationality, and a little fun. It goes out once a week to free subscribers, and once more to paid subscribers. If you like it, forward it to friends and tell them to subscribe!
The Author:
I’m a machine learning engineer. Most of my free time is spent wrangling a preschooler and a baby, reading, and writing bad tweets. Find out more here or follow me on Twitter.