Starting a collaboration

I’ve been itching to do some work outside of compilers. During the past three years, I did not have a lot of time for creative coding. This year, I had a little more time on my hands and decided to reach out to a friend about a possible collaboration. He introduced me to Eleni, who was working on a submission for the Athens Digital Art Festival. Her piece is titled TIRESIAS and references the blind prophet of the same name from Greek literature. In her piece, Tiresias is represented by a robot resembling a snake. (In fact, this class of robots resembles a snake so much that Wikipedia lists them as snake-arm robots.)

I was asked to extend her work by allowing Tiresias to provide predictions about the future. (Now, I want to make this absolutely clear: we are not actually predicting the future, we are just generating random text that sounds like future predictions.) So, here is what creating a computer program that generates text looked like:

First step: defining the task

Here is the part where I would like to sound authoritative and tell you that I’m an expert in machine learning. In reality, the little knowledge I had about machine learning is now mostly forgotten. In light of this, I feel unqualified to use machine learning jargon, so I will explain what I did in terms that I can fully understand. All I did was follow some publicly available tutorials on how to train (or fine-tune?) GPT-2 on a custom body of text.

What this concretely means is that I downloaded around 50,000 tweets (the custom body of text) and used Google’s Colaboratory notebook to create a program that generates random text that somehow reflects that body of text. The sentences generated by the program should not be identical to the tweets that are given to the program, but should mimic their style and vocabulary.

Ask Twitter for permission to download thousands and thousands of tweets

Even though Twitter is a very noisy source of data (people there constantly misspell words, use emojis, and share URLs), we decided that generating Tiresias’ predictions from Twitter was a good concept for the art piece. It focuses on what people are talking about and may highlight pieces of modern culture. So, in order to download 50,000 tweets, one needs to apply for access to the Twitter API.

Twitter provides access to a lot of information, and even more if a Twitter user decides to use your application. However, here we are not asking Twitter users for permission to use their tweets, since we are only using tweets that are already public. We use the Twitter API to ask for a lot of tweets without getting banned by Twitter, and to use the search API. Normally, Twitter is wary of robots that just download every single tweet, because it costs Twitter money to send that information over the Internet. So, unless you signed up for the Twitter API, Twitter might just stop sending information to your computer at any moment. The search API is also convenient because it allows us to focus on tweets about certain topics. For example, we want Tiresias to be able to generate predictions about climate change, so we searched for tweets containing the keywords “climate change”.

In order to get access to the Twitter API, you need a Twitter account and a brief description of your application and what you intend to do with the data you request. After a couple of minutes, Twitter accepted my application and I was granted access to download publicly available tweets.

Now, how do you translate this access into something you can use? The Twitter API only allows you to request information from the Twitter website and receive an object in a machine-readable format (JSON). (Sure, JSON is somewhat human-readable too… but the point I’m trying to make is that I want to avoid manually querying the API and instead create a program that can fetch the tweets we want.) Here is what using the Twitter API manually would look like, according to their website:

$ curl --request GET \
  --url 'https://api.twitter.com/1.1/search/tweets.json?q=from%3Atwitterdev&result_type=mixed&count=2' \
  --header 'authorization: OAuth oauth_consumer_key="consumer-key-for-app", oauth_nonce="generated-nonce", oauth_signature="generated-signature", oauth_signature_method="HMAC-SHA1", oauth_timestamp="generated-timestamp", oauth_token="access-token-for-authed-user", oauth_version="1.0"'

This command basically says: please GET me the contents of the website api.twitter.com/… and I want the tweets from the twitterdev account, both recent and popular ones, two tweets per page.

Instead of using the Twitter API directly, many people have developed libraries that let you ask for this information in a friendlier manner. I chose the TwitterSearch library, which allows one to use Python to talk to Twitter instead of crafting raw web requests by hand. It is a little outdated (it was last updated on January 1, 2018), but it is easy to install and works well for my use case.

Here is what talking to the Twitter API basically looks like:

from TwitterSearch import *
try:
    tso = TwitterSearchOrder() # create a TwitterSearchOrder object
    tso.set_keywords(['Guttenberg', 'Doktorarbeit']) # let's define all words we would like to have a look for
    tso.set_language('de') # we want to see German tweets only
    tso.set_include_entities(False) # and don't give us all those entity information

    ts = TwitterSearch( # placeholder credentials: use the keys Twitter gives you
        consumer_key = 'consumer-key',
        consumer_secret = 'consumer-secret',
        access_token = 'access-token',
        access_token_secret = 'access-token-secret'
    )

    # this is where the fun actually starts :)
    for tweet in ts.search_tweets_iterable(tso):
        print( '@%s tweeted: %s' % ( tweet['user']['screen_name'], tweet['text'] ) )

except TwitterSearchException as e: # take care of errors from the library
    print(e)

In my opinion, this is so much easier to use. (I have replaced the credential strings with placeholders so the code stays comfortable to read.) Here we are basically looking for German tweets that include the words Guttenberg and Doktorarbeit.

This is pretty much what we need, except that we had to change the keywords and the language. A few other small details also needed changing. I think this library was developed before Twitter allowed 280 characters per tweet, so some tweets appeared truncated: how can we look beyond the original 140 characters that the API provides? Fortunately, other people had already asked this question and provided a simple solution. From there, it was trivial to write a small program that lets you type what you want to search for and uses the library to fetch matching tweets from Twitter.
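
From memory, the fix amounts to asking the API for the extended (280-character) representation and reading a different field. Here is a rough sketch of what my search script looked like; the credentials are placeholders, and the tweet_mode tweak is how I recall the workaround others suggested:

from TwitterSearch import *

try:
    tso = TwitterSearchOrder()
    tso.set_keywords(input('Search for: ').split()) # e.g. "climate change"
    tso.set_language('en')
    tso.set_include_entities(False)
    # workaround for truncated tweets: ask for the extended representation
    tso.arguments.update({'tweet_mode': 'extended'})

    ts = TwitterSearch(
        consumer_key = 'consumer-key',         # placeholder credentials
        consumer_secret = 'consumer-secret',
        access_token = 'access-token',
        access_token_secret = 'access-token-secret'
    )

    for tweet in ts.search_tweets_iterable(tso):
        # extended tweets carry their text in 'full_text' rather than 'text'
        print(tweet.get('full_text', tweet['text']))

except TwitterSearchException as e:
    print(e)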

Cleaning the tweets

Great, so now we have downloaded 50k tweets (it took several days, due to Twitter’s rate limits and my own time availability). Next, we need to clean the tweets. There are several guides online on how to clean text data, but due to time constraints I only did the following (a small sketch of these steps follows the list):

  1. Remove hashtags.
  2. Remove excessive whitespace.
  3. Remove emojis.
  4. Remove URLs.
  5. Remove mentions.
  6. Lowercase everything.
  7. Remove commas (for other implementation reasons).
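
For the curious, here is a minimal sketch of what those steps can look like in Python. This is not my exact script; the regular expressions are approximations (the emoji removal in particular is crude, keeping only word characters and basic punctuation):

import re

def clean_tweet(text):
    text = re.sub(r'https?://\S+', '', text)    # remove URLs
    text = re.sub(r'@\w+', '', text)            # remove mentions
    text = re.sub(r'#\w+', '', text)            # remove hashtags
    text = re.sub(r'[^\w\s.!?\'"-]', '', text)  # crude emoji/symbol removal (also drops commas)
    text = re.sub(r'\s+', ' ', text).strip()    # collapse excessive whitespace
    return text.lower()                         # lowercase everything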

This is not the ideal way to clean textual data. Looking back, I think it would have been better not just to remove these elements from tweets, but to remove the whole tweet if it contains any of them. Why? Because people write differently than they speak when they have access to this type of non-verbal language (i.e., emojis, hashtags, URLs…). This cleaning was simple and not a lot of work, but given more time I could have made sure the data was as clean as possible and generated better predictions than the ones we already had.

Choosing a publicly available machine learning model

We really live in the future. Academics are creating neural networks and making them publicly available for free. Furthermore, we have access to a lot of computing power at home, and as long as you know where to look… you might also have access to computing resources that a few years ago you could only dream of. But back to choosing a machine learning model.

I needed something that was easy for me to use, since I have no interest in developing a neural network from scratch. I saw two promising leads: textgenrnn (which is based on char-rnn) and GPT-2. Now, again, I don’t want to misrepresent myself or anyone. What I’m trying to say is that there are people (mostly academics) who develop these machine learning models that produce great results; then there are people who make this academic work usable by ordinary people; and then there are people who just use this work. Here I chose the intellectual work of the OpenAI group, GPT-2, mostly because GPT-3 had just been released and I wanted to play with something similar. And I chose the user-friendly package gpt-2-simple, which lets me interact with the intellectual beast that is GPT-2.

Using free compute power from Google

Training a machine learning model on a laptop can be a slow process. For a point of comparison, I trained textgenrnn on my machine one night with 1,000 tweets and it took 4 hours to finish training. I then trained on the free compute provided by Google and it took about 5 minutes. And this was textgenrnn, which I believe is less computationally expensive than GPT-2. Scaling this to 50,000 tweets on my own computer would be a challenge I was not ready to take.

Fortunately, Google provides free compute power to hobbyists. It allows people without GPUs to use them through the Colaboratory platform. Thanks, Google! Also, thanks to minimaxir for not only providing the source of gpt-2-simple but also a Python notebook that is easy to run on the Colaboratory platform. So… all I did was copy the freely available work by minimaxir and, instead of pointing gpt-2-simple at some Shakespeare sonnets, I pointed it at the tweets we had collected. I let the computer run for maybe half an hour. And we got our generated tweets!
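
Stripped down, the notebook boils down to a few calls to gpt-2-simple. This is a sketch rather than the exact notebook I ran; tweets.txt is a placeholder name for the file of cleaned tweets, and the step count is just the value I remember the examples using:

import gpt_2_simple as gpt2

gpt2.download_gpt2(model_name="124M")  # fetch the small pretrained GPT-2 model

sess = gpt2.start_tf_sess()
gpt2.finetune(sess,
              "tweets.txt",            # placeholder: one cleaned tweet per line
              model_name="124M",
              steps=1000)              # fine-tuning steps; more steps, more time

gpt2.generate(sess)                    # print freshly generated pseudo-tweets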

What I’m working on right now

This work was deemed sufficient. However, I have now been asked to create a server that provides a human-friendly interface, allowing guests of the Athens Digital Arts Festival to ask for their own predictions. So, we’ll see how that goes…