Thursday, December 3, 2015

Introducing DerpRNN

Just released my Deep RNN code on Github. Yes, you read that right: it's called DerpRNN. So why the name DerpRNN? Because DeepRNN was already taken, that's why... dammit!

If you want to actually get your hands dirty, the best place to start is this demo Jupyter notebook.

Anyway, the module's base class is called DeepRNN, which consists of a separate "read-in" layer, several stacked recurrent layers, and a readout layer. Accepted forms of input data are basically binary vectors (or bit arrays, if you will), e.g. one-hot or "many-hot" coded data. I've tested the model with character-level text data and polyphonic MIDI music (see below for some samples), but if your data can be expressed as a vector of zeros and ones, you're good to go! The whole thing is coded in Python with the excellent and amazing Theano.
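For concreteness, here's a tiny sketch of what one-hot coded character data looks like (illustration only, not the actual data loader):

    import numpy as np

    # Illustration only: each character becomes a binary vector of
    # length n_input, with a single 1 at the character's index.
    chars = sorted(set("hello world"))
    char_to_index = {c: i for i, c in enumerate(chars)}

    def one_hot(text):
        X = np.zeros((len(text), len(chars)), dtype='float32')
        for t, c in enumerate(text):
            X[t, char_to_index[c]] = 1.0
        return X  # shape: (time, n_input)

    X = one_hot("hello")  # five one-hot rows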

The major difference to libraries such as Keras is that I'm using Theano's scan function to do the recurrent loops. Keras, for example, unrolls the RNN in time, and therefore requires the sequence length to be defined at compilation time. In my implementation the evolution through the stacked recurrent layers is also performed with scan, so in the end I have a nested scan looping through all the layers and time... it was a bit of a pain to code (especially while taking care of the random state updates with the RBM), but now it finally works! That being said, I'm definitely not trying to implement something as big as Keras or Lasagne etc.; this is just my small pet project. The primary motivation for coding this from scratch instead of using Keras, Lasagne, Blocks etc. was that I'm actually planning to try out all kinds of research ideas with RNNs (one such being the InvariantDeepRNN, a translation invariant version of the DeepRNN... more on that in a later post), and it would just be a huge hassle to start modifying the code of those libraries.
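To give a flavor of what this looks like, here's a minimal toy sketch of a single tanh recurrence with theano.scan (not the actual DerpRNN code, which nests a second scan over the layers):

    import numpy as np
    import theano
    import theano.tensor as T

    # Toy sketch: one recurrent loop with theano.scan, so the sequence
    # length is a run-time property of the input, not fixed at compile time.
    floatX = theano.config.floatX
    n_input, n_hidden = 5, 10
    x = T.matrix('x')  # shape: (time, n_input)
    W = theano.shared(0.1 * np.random.randn(n_hidden, n_hidden).astype(floatX))
    U = theano.shared(0.1 * np.random.randn(n_input, n_hidden).astype(floatX))

    def step(x_t, h_tm1):
        # one time step: h_t = tanh(W . h_{t-1} + U . x_t)
        return T.tanh(T.dot(h_tm1, W) + T.dot(x_t, U))

    h0 = T.zeros((n_hidden,), dtype=floatX)
    h, updates = theano.scan(step, sequences=x, outputs_info=h0)
    run = theano.function([x], h)  # works for any sequence length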

There are three types of layers in the model: the "read-in", recurrent and readout layers. The role of the tanh read-in layer is basically to map the data from dimension n_input to n_hidden_recurrent and to remove the bias. The recurrent layers are then all of dimension n_hidden_recurrent, and you can stack as many on top of each other as you like by specifying the number of layers with the depth parameter. I designed the model such that n_hidden_recurrent is an integer multiple of n_input, so instead of specifying the number of neurons in the recurrent layers directly, you specify a width parameter instead. (There's actually a good reason for this choice, which relates to a child class called InvariantDeepRNN, but I will discuss that in a later post.) Anyway, currently I've implemented the "standard recurrent unit" (SRU), which is just the basic tanh-activated RNN, and the Gated Recurrent Unit (GRU).
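As a purely hypothetical usage sketch, assuming constructor arguments along these lines (check the demo notebook for the actual API):

    # Hypothetical usage sketch; the exact import path and argument
    # names may differ. See the demo notebook for the real API.
    from derprnn import DeepRNN

    model = DeepRNN(n_input=49, depth=5, width=5,
                    recurrent_unit='GRU', readout='softmax')
    # each recurrent layer then has width * n_input = 5 * 49 = 245 neurons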

You have some more choice in the readout layer, which can be a sigmoid, a softmax or a Restricted Boltzmann Machine (RBM). The RBM layer code is quite heavily based on the excellent deeplearning.net RNN-RBM tutorial and the paper by Boulanger-Lewandowski et al. For one-hot data, such as character-level text, you should use the softmax layer (well, you could use the RBM too, but it's going to be a lot slower and probably serious overkill), and for polyphonic music or other binary vector data, the RBM... or the sigmoid, but that's a bit crappy for polyphonic music.

The model

A single SRU layer for the hidden units $h$ is defined by the equation

$$
h_t^{(l)} = \tanh \left(W^{(l)} \cdot h_{t-1}^{(l)} +  U^{(l)} \cdot h_{t}^{(l - 1)} \right) ,
$$
where the index $l = 1, \ldots, L$ labels the layer. Note that both $W$ and $U$ are square matrices for all $l$. The GRU layer is conceptually identical, so I won't write it down here. The "zeroth" layer unit $h^{(0)}$ is the "read-in" layer, defined as

$$
h_t^{(0)} = \tanh \left( W_{in} \cdot x_t + b_{in}  \right).
$$

So the role of the read-in layer is to pass the input $x_t$ to the recurrent layers and to remove the bias from the data.

The softmax and sigmoid readout layers are pretty standard, and they depend only on the highest-layer units $h^{(L)}$, so I won't write them down here. The RBM readout is a bit different, though. It is a standard RBM model, except that the hidden and visible biases depend on the last hidden recurrent layer, so in this sense it is actually a conditional RBM. The RBM probability distribution over the visible ($x$) and hidden ($h'$) units is defined as

$$
P_t(x, h') = \frac{1}{Z} \exp \left( -E_t(x, h')  \right),
$$

with

$$
E_t(x, h') = b_t \cdot x + c_t \cdot h' + h'^T \cdot M \cdot x.
$$

$M$ is a standard RBM weight matrix, but the biases $b_t$ and $c_t$ depend on the last hidden recurrent layer as

$$
b_t = b + W_{hx} \cdot h_{t-1}^{(L)}, \\
c_t = c + W_{hh'} \cdot h_{t-1}^{(L)},
$$
where $b$ and $c$ are time-independent biases. Now, given a hidden state $h_{t-1}^{(L)}$, which depends on $x_{t-1}, x_{t-2}, \ldots$, we can draw a sample $x_t$ from the marginal distribution $P_t(x) \doteq \sum_{h'} P_t(x, h')$.
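In practice the marginal can't be sampled in closed form, so one resorts to block Gibbs sampling. Here's a rough numpy sketch of that step, keeping the sign convention of the energy above (illustration only; the real implementation runs inside the Theano scan):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def sample_x_t(b_t, c_t, M, x_init, k=25, rng=np.random):
        # Block Gibbs sampling sketch for the conditional RBM. With
        # P_t ~ exp(-E_t) and E_t = b_t.x + c_t.h' + h'^T M x, the
        # conditionals pick up a minus sign:
        #   P(h'_j = 1 | x)  = sigmoid(-(c_t + M x)_j)
        #   P(x_i  = 1 | h') = sigmoid(-(b_t + M^T h')_i)
        x = x_init
        for _ in range(k):
            h = rng.binomial(1, sigmoid(-(c_t + M.dot(x))))
            x = rng.binomial(1, sigmoid(-(b_t + M.T.dot(h))))
        return x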


So, a brief recap on the role of the RBM in the RNN: suppose your data consists of sequences of vectors like $x_t = [1, 0, 0, 1, 1, \ldots]$, with the "time" $t$ labeling the position in the sequence; then the RBM layer is a probability distribution over such vectors. So whereas a (rounded) sigmoid output would only be able to produce one such vector deterministically, the RBM draws a vector from a probability distribution over binary vectors... so the RBM-layered RNN is able to sample from a whole selection of vectors.



Examples


So enough with the math, let's see some examples! Please see the notebook for more explicit examples and for trying it out yourself. Note that I saved some model parameters in the folder saved_model_parameters, so you don't even need to train a model :) Also, it's probably a good idea to use those parameters as initializations if you want to do some training yourself, as the training gets stuck in a local minimum very easily for bad choices of initialization (I still need to work on that a bit, I think). Unfortunately, the saved parameters naturally only work in models with the same depth and width parameters...

Text data


I first tried some character-level text generation, much like in the nice Keras example here. In fact, I used the Keras data loader and tried the Nietzsche dataset. I trained a depth=5, width=5 model with input dimension 49 (i.e. 49 characters; I only took lower-case characters, numbers etc.) and a softmax readout. That makes 245 neurons per layer (plus the reset and update gates). After about 12-24 hours of training (I don't have a more exact time; I had to restart the training a few times) on a GTX 980, the model seemed to produce text quite well.

So what should we ask Nietzsche? About the meaning of life? Been there, done that... let's instead hear him tell a few "yo mama" jokes. However, since neither of those words belongs in Nietzsche's vocabulary (I think), I used the seed "your mother is so fat, that". Here are a few examples (I cut each "joke" at the first period, since there are no start or end signals):

your mother is so fat, that consequently mistake know that the
    genuine sciences spoke of it!

your mother is so fat, that account
god is acquisities in history of german intercourse, indeed with
degrees of the spirit and social stronger.

your mother is so fat, that it was looked upon as a modern
spirit among philosophers valid which live me in favour to possess
the most surrender that is a question, perhaps even a religion of god.

your mother is so fat, that which he is
no existence but this is two far as the jews?

Uhh... ookay... so maybe "yo mama" jokes are not up Nietzsche's alley (and what's that with the jews???). Let's try something else, now with the seed "how many philosophers does it take to change a light bulb?". Maybe this would be more familiar territory for the guy!

how many philosophers does it take to change a light bulb?
     taking my eye, and not disguise.

how many philosophers does it take to change a light bulb? our
veness and does not find thought.

how many philosophers does it take to change a light bulb? act
before the most general many-straying philosophical empire hose philosophers.

how many philosophers does it take to change a light bulb? life 
is not one of the compulsory revenge.


Huh?!? I'd just hate being at a cocktail party with Nietzsche cracking jokes... all the other bigshot philosophers would be laughing their asses off and I'd be like "yeah, heh, that's funny... I think". But more seriously, the model seems to be working OK.

Polyphonic music data


OK, so next up is some polyphonic music. I used the MIDI tools from Hexahedria's RNN project, which is, by the way, really really great and you should definitely check it out here. The MIDI data is from the Classical Piano Midi Page. I processed the MIDI data a bit differently, however: I clipped notes at the end whenever a new note was pressed, so if, for example, there are two consecutive strikes of key C of duration 4, the sequence would look like this: CCCXCCCX, where X denotes no key pressed. Otherwise you'd just have CCCCCCCC, and it would clearly be impossible to separate that from a single key held for 8 time steps. All of this is then encoded into binary vectors, where a 1 denotes a struck or held key and the position within the vector is the actual piano key.
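Here's a toy sketch of that clipping idea (not the actual MIDI pipeline):

    import numpy as np

    # Toy sketch of the clipping idea; notes is a list of
    # (key_index, start_step, duration) events.
    def to_piano_roll(notes, n_keys=88, n_steps=16):
        roll = np.zeros((n_steps, n_keys), dtype='float32')
        for key, start, duration in notes:
            end = start + duration
            roll[start:end, key] = 1.0
            # if the same key is struck again right at this note's end,
            # zero out the last step so the restrike stays visible
            # (CCCX instead of CCCC running into the next note)
            if any(s == end for k, s, d in notes if k == key):
                roll[end - 1, key] = 0.0
        return roll

    # two consecutive strikes of key 60, duration 4 each
    roll = to_piano_roll([(60, 0, 4), (60, 4, 4)])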

I trained a depth = 5, width = 5 GRU-RBM model (totaling 440 neurons per layer) and left it running for about 24 hours (or something of that order). Here are some samples:

[Audio samples embedded in the original post.]
It's maybe not quite as good as Hexahedria's model (which was specifically hand-crafted for MIDI music), but still pretty good. It is clearly still making some mistakes, which could be remedied by further training or by training a deeper or wider model. But while it would certainly be possible to get better results by just training a bigger model, I don't want to go there, since the implementation is missing a key ingredient which Hexahedria keenly noted in his post: the model is not translation invariant, which in the musical context means that it is unable to relate tunes played in different keys. Implementing translation invariance in RNNs was actually one of the reasons I started this pet project, and in fact I've already coded an InvariantDeepRNN class which is fully translation invariant. I still need to iron out some wrinkles and train a proper model before I can post any interesting results, but I'll definitely write about it as soon as I've got something, so stay tuned! :)