Thursday, December 3, 2015

Introducing DerpRNN

Just released my Deep RNN code on Github. Yes, you read that right: it's called DerpRNN. So why the name DerpRNN? Because DeepRNN was already taken, that's why... dammit!

If you want to actually get your hands dirty, the best place to start is this demo Jupyter notebook.

Anyway, the module base class is called DeepRNN, which consists of a separate "read-in" layer, several stacked recurrent layers and a readout layer. Accepted forms of input data are basically binary vectors (or bit arrays, if you will), e.g. one-hot or "many-hot" coded data. I've tested the model with character-level text data and polyphonic MIDI music (see below for some samples), but if the data can be expressed as a vector of zeros and ones, you're good to go! The whole thing is coded in Python, leaning heavily on the excellent Theano library.

The major difference from libraries such as Keras is that I'm using Theano's scan function to do the recurrent loops. Keras, for example, unrolls the RNN in time and therefore requires the sequence length to be defined at compilation time. In my implementation the evolution through the stacked recurrent layers is also performed with scan, so in the end I have a nested scan looping through all the layers and time... it was a bit of a pain to code (especially while taking care of the random state updates with the RBM), but now it finally works! That being said, I'm definitely not trying to implement something as big as Keras or Lasagne etc... this is just my small pet project. The primary motivation for coding this from scratch instead of using Keras, Lasagne, Blocks etc. was that I'm actually planning to try out all kinds of research ideas with RNNs (one such being InvariantDeepRNN, a translation invariant version of DeepRNN... more on that in a later post), and it would just be a huge hassle to start modifying the code of those libraries.
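
To give a rough idea of the nested scan structure, here's a stripped-down sketch with made-up shapes (no biases, gates or RBM updates; this is not the actual DerpRNN code):

    import numpy as np
    import theano
    import theano.tensor as T

    n_hidden, depth = 8, 3
    floatX = theano.config.floatX
    # one recurrent (W) and one "from the layer below" (U) matrix per layer
    W = theano.shared(np.random.randn(depth, n_hidden, n_hidden).astype(floatX))
    U = theano.shared(np.random.randn(depth, n_hidden, n_hidden).astype(floatX))
    x_seq = T.matrix('x_seq')  # (time, n_hidden): data after the read-in layer

    def layer_step(W_l, U_l, h_prev_l, h_below):
        # one SRU layer: previous state of this layer + current output of the layer below
        return T.tanh(T.dot(W_l, h_prev_l) + T.dot(U_l, h_below))

    def time_step(x_t, h_prev):
        # inner scan: walk up through the stacked layers at one time step,
        # carrying the "output of the layer below", starting from x_t
        h_new, _ = theano.scan(layer_step, sequences=[W, U, h_prev], outputs_info=[x_t])
        return h_new  # (depth, n_hidden): new states of all layers

    # outer scan: walk through time, carrying the whole stack of hidden states
    h_seq, _ = theano.scan(time_step, sequences=[x_seq],
                           outputs_info=[T.zeros((depth, n_hidden))])
    get_states = theano.function([x_seq], h_seq)  # output shape: (time, depth, n_hidden)

In the actual code the biases, the GRU gates and the RBM's random state updates come along for the ride, which is where most of the pain was, but the loop structure is this.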

There are three types of layers in the model: the "read-in", recurrent and readout layers. The role of the tanh read-in layer is basically to map the data from dimension n_input to n_hidden_recurrent and to remove the bias. The recurrent layers are then all of dimension n_hidden_recurrent, and you can stack as many on top of each other as you like by specifying the number of layers with the depth parameter. I designed the model such that n_hidden_recurrent is an integer multiple of n_input, so instead of specifying the number of neurons in the recurrent layers directly, you specify a width parameter. (There's actually a good reason for this choice, which relates to a child class called InvariantDeepRNN, but I will discuss that in a later post.) Currently I've implemented the "standard recurrent unit" (SRU), which is just the basic tanh-activated RNN, and the Gated Recurrent Unit (GRU).

You have some more choice in the readout layer, which can be a sigmoid, a softmax or a Restricted Boltzmann Machine (RBM). The RBM layer code is quite heavily based on the excellent RNN-RBM tutorials and the paper by Boulanger-Lewandowski et al. For one-hot data, such as character-level text, you should use the softmax layer (you could use the RBM too, but it's going to be a lot slower and probably serious overkill), and for polyphonic music or other binary vector data, the RBM... or the sigmoid, but that's a bit crappy for polyphonic music.

The model

A single SRU layer for the hidden unit/neuron $h$ is defined by the equation

$$ h_t^{(l)} = \tanh \left(W^{(l)} \cdot h_{t-1}^{(l)} +  U^{(l)} \cdot h_{t}^{(l - 1)} \right), $$
where the index $l = 1, \ldots, L$ labels the layer. Note that both $W$ and $U$ are square matrices for all $l$. The GRU layer is conceptually identical, so I won't write it down here. The "zeroth" layer unit $h^{(0)}$ is the "read-in" layer, defined as

$$ h_t^{(0)} = \tanh \left( W_{in} \cdot x_t + b_{in} \right). $$

So the role of the read-in layer is to pass the input $x_t$ to the recurrent layers and to remove the bias from the data.

The softmax and sigmoid readout layers are pretty standard, and they depend only on the highest-layer units $h^{(L)}$, so I won't write them down here. The RBM readout is a bit different, though. It is a standard RBM model, except that the hidden and visible biases depend on the last hidden recurrent layer, so in this sense it is actually a conditional RBM. The RBM probability distribution over the hidden ($h'$) and visible ($x$) units is defined as

$$ P_t(x, h') = \frac{1}{Z} \exp \left( -E_t(x, h') \right), $$

$$ E_t(x, h') = b_t \cdot x + c_t \cdot h' + h'^T \cdot M \cdot x. $$

$M$ is a standard RBM weight matrix, but the biases $b_t$ and $c_t$ depend on the last hidden recurrent layer as

$$ b_t = b + W_{hx} \cdot h_{t-1}^{(L)}, \\
c_t = c + W_{hh'} \cdot h_{t-1}^{(L)}, $$

where $b$ and $c$ are constant-in-time biases. Now, given a hidden state $h_{t-1}^{(L)}$, which depends on $x_{t-1}, x_{t-2}, \ldots$, we can draw a sample $x_t$ from the marginal distribution $P_t(x) \doteq \sum_{h'} P_t(x, h')$.

So just a brief recap on the role of the RBM in the RNN: suppose your data consists of sequences of vectors like $x_t = [1, 0, 0, 1, 1, \ldots]$, with the "time" $t$ labeling the position in the sequence; the RBM layer is then a probability distribution over such vectors. So whereas a (rounded) sigmoid output would produce just one such vector deterministically, the RBM draws a vector from a probability distribution over binary vectors... so the RBM-layered RNN is able to sample from a whole selection of vectors.


So enough with the math, let's see some examples! The notebook has more explicit examples, and it's the place to go if you want to try things out yourself. Note that I saved some model parameters in the folder saved_model_parameters, so you don't even need to train a model :) It's also probably a good idea to use those parameters as initializations if you want to do some training yourself, since the training gets stuck in a local minimum very easily for bad choices of initialization (I still need to work on that a bit, I think). Unfortunately the parameters naturally only work in models with the same depth and width parameters...

Text data

I first tried some character-level text generation, very similarly to the nice Keras example here. In fact, I used the Keras data loader and tried the Nietzsche dataset. I trained a depth=5, width=5 model with input dimension 49 (i.e. 49 characters; I only took lower-case characters, numbers etc.) and a softmax readout. That makes 245 neurons per layer (plus the reset and update gates). After about 12-24 hours of training (I don't have a more exact time; I had to restart the training a few times) on a GTX 980, the model seemed to produce text quite well.

So what should we ask Nietzsche? About the meaning of life? Been there, done that... let's instead hear him tell a few "Yo mama" jokes. Since neither of those words belongs in Nietzsche's vocabulary (I think), I used the seed "your mother is so fat, that". Here are a few examples (I cut the "joke" at the first period, since there are no start or end signals):

your mother is so fat, that consequently mistake know that the
    genuine sciences spoke of it!

your mother is so fat, that account
god is acquisities in history of german intercourse, indeed with
degrees of the spirit and social stronger.

your mother is so fat, that it was looked upon as a modern
spirit among philosophers valid which live me in favour to possess
the most surrender that is a question, perhaps even a religion of god.

your mother is so fat, that which he is
no existence but this is two far as the jews?

Uhh... ookay... so maybe "yo mama" jokes are not up Nietzsche's alley (and what's that about the jews???). Let's try something else, now with the seed "how many philosophers does it take to change a light bulb?". Maybe this would be more familiar territory for the guy!

how many philosophers does it take to change a light bulb?
     taking my eye, and not disguise.

how many philosophers does it take to change a light bulb? our
veness and does not find thought.

how many philosophers does it take to change a light bulb? act
before the most general many-straying philosophical empire hose philosophers.

how many philosophers does it take to change a light bulb? life 
is not one of the compulsory revenge.

Huh?!? I'd just hate being at a cocktail party with Nietzsche cracking jokes... all the other bigshot philosophers would be laughing their asses off and I'd be like "yeah, heh, that's funny... I think". But more seriously, the model seems to be working OK.

Polyphonic music data

OK, so next up is some polyphonic music. I used the MIDI tools from Hexahedria's RNN project, which is, by the way, really really great and you should definitely check it out here. The MIDI data is from the Classical Piano Midi Page. I processed the MIDI data a bit differently, however: I clipped each note at its end whenever a new note was pressed, so if for example there are two consecutive strikes of the key C of duration 4, the sequence becomes CCCXCCCX, where X denotes no key pressed. Otherwise you'd just have CCCCCCCC, and it would clearly be impossible to separate that from a single key held for 8 time steps. All this is then encoded into binary vectors, where 1 denotes a struck/held key and the position is the actual piano key.
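
Here's a small NumPy sketch of one possible reading of that encoding, chosen so that it reproduces the CCCXCCCX example (it's not the actual preprocessing code):

    import numpy as np

    def notes_to_roll(notes, n_keys=128, n_steps=None):
        # notes: list of (key, start, duration) events; leave the last step of each
        # note empty so consecutive strikes don't merge into one long held note
        if n_steps is None:
            n_steps = max(start + dur for _, start, dur in notes)
        roll = np.zeros((n_steps, n_keys), dtype=np.int8)
        for key, start, dur in notes:
            roll[start:start + dur - 1, key] = 1   # hold the key...
            # ...and leave step (start + dur - 1) at 0: the 'X' separating strikes
        return roll

    # two consecutive strikes of key 60, each of duration 4 -> CCCXCCCX
    roll = notes_to_roll([(60, 0, 4), (60, 4, 4)], n_steps=8)
    print(roll[:, 60])   # [1 1 1 0 1 1 1 0]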

I trained a depth = 5, width = 5 GRU-RBM model (totaling 440 neurons per layer) and left it running for about 24 hours (or something of that order). Here are some samples:

It's maybe not quite as good as Hexahedria's model (which was specifically hand-crafted for MIDI music), but still pretty good. It is still clearly making some mistakes, which could be remedied by further training or by training a deeper or wider model. While it would certainly be possible to get better results by just training a bigger model, I don't want to go there, since the implementation is missing a key ingredient which Hexahedria keenly noted in his post: the model is not translation invariant, which in the musical context means that it is unable to relate tunes played in different keys. Implementing translation invariance in RNNs was actually one of the reasons I started this pet project, and in fact I've already coded an InvariantDeepRNN class which is fully translation invariant. I still need to iron out some wrinkles and train a proper model before I can post any interesting results, but I'll definitely write about that as soon as I've got something, so stay tuned! :)

Monday, March 2, 2015

Playing with AlgoCanvas

Some time ago I discovered an interesting data analysis (and trading) platform called AlgoCanvas. AlgoCanvas is a product of a Finnish startup called Streamr, which actually did quite well in the recent Slush startup competition (they got to the top 20).

Basically the idea is to let the user come up with trading algorithms easily, by simply connecting different modules with "cables" and eventually routing the output to (for example) a chart/plot. I've found, however, that it's also a very useful data munging tool for pre-processing and analysis. One picture is worth n words, so here's a screenshot (click for full size image):

Here I took the bid and ask prices for AAPL during one day and sent them to the Chart on the right. Then I took the cumulative dollar value of the trades (price * quantity * sign... basically the order flow), discarded the hidden orders and sent it to the Chart on a different axis. Pretty simple!
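
In pandas the same order flow computation would look roughly like this (the column names are hypothetical, just to show what the module chain is doing):

    import pandas as pd

    # hypothetical columns: price, quantity, side (+1 buy / -1 sell), hidden (bool)
    trades = pd.read_csv('aapl_trades.csv')
    lit = trades[~trades['hidden']]                  # discard hidden orders
    order_flow = (lit['price'] * lit['quantity'] * lit['side']).cumsum()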

Getting the order flow took a bit too many modules (Multiply, Not and Sum modules), but it was pretty quick to do. It might make sense to have a custom "Function" module, which would have user-defined inputs and outputs and a simple mathematical expression in between... anyway, the product is still in beta and developing fast, so maybe that will be implemented eventually. Mind you, there is a custom "JavaModule", but since I don't really know Java, it's pretty useless for me... anyway, I've heard a Python module is coming too!

Also, if you have a trading algorithm in mind, you can run a full backtest... and this is with event-time data: bid and ask prices, trades etc... basically full order book data, with all orders sent to the order book (not just trades, as in my example above). Pretty impressive compared to e.g. Quantopian's 1 min bars...

Wednesday, February 4, 2015

Using Ipython Notebook and Theano on an AWS GPU machine

Using a service such as Sage Math Cloud is an excellent way to use IPython Notebooks (or Sage Worksheets) online, but I wanted to give Theano's GPU features a serious try, so I decided to spin up a VM with a powerful GPU (using Amazon AWS's g2.2xlarge instance). This post is a sort of "note-to-self" step-by-step guide on how to get it done, since it's not completely trivial... I'm not going to repeat what others have written about this, so I just added the relevant links.

1) Check out this post to set up Theano on AWS. Make sure to run the test too... The test passed for me, but it turned out I had further problems with BLAS (see below).

2) Check out this post to set up an IPython Notebook server on AWS. Make sure to allow inbound traffic: go to Security Groups on the AWS Dashboard, click Actions and Edit Inbound Rules. There, add a custom TCP rule with the port number you wish to use.

3) I still got the error "WARNING (theano.tensor.blas): We did not found [sic] a dynamic library into the library_dir of the library we use for blas. If you use ATLAS, make sure to compile it with dynamics library." Maybe it's just something I did, but apparently Theano cannot find BLAS, although it did install with Anaconda... Anyway, I just edited the .theanorc file and added a flag for BLAS like this (after installing OpenBLAS; or maybe it was installed by default??):

ldflags = -lopenblas
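
For reference, a minimal .theanorc along these lines should do the trick; the [global] section holds the usual GPU/float32 settings from the Theano docs, and the [blas] section is the fix above (adjust to your setup):

    [global]
    device = gpu
    floatX = float32

    [blas]
    ldflags = -lopenblas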

That sorted it out... and now it's running like crazy, with only about $0.09 per hour!! ;)

Friday, October 10, 2014

Cython is awesome

After some initial frustration with learning Cython, I found the SciPy 2013 Cython tutorial by Kurt Smith of Enthought, which was a hugely positive surprise. IMO it's by far the best introduction to Cython (the official Cython tutorials can be a bit tedious...).

So I decided to put my newly learned skills to use in optimizing some stochastic differential equation simulations. To make a long story short, here's the original Python+Numpy version:

...and here's the Cython optimized version:
Previously I had to wait about 30 s for 1,000,000 iterations, and now it takes only about 0.1 seconds... definitely worth the effort ;) (BTW, sorry about the lack of documentation... anyway, the algo has an adaptive time step, since I'm simulating some pretty wildly behaving SDEs).
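
For a flavor of what's being sped up, here's a plain Python/NumPy sketch of Euler-Maruyama with a crude adaptive step; it's only an illustration of the general shape of such a loop, not the code in the gists above:

    import numpy as np

    def simulate_sde(drift, diffusion, x0, t_end, dt0=1e-3, tol=1e-3, rng=np.random):
        # Euler-Maruyama with a crude adaptive time step: shrink dt when the
        # proposed increment is too violent, relax it back when things are calm
        t, x, dt = 0.0, x0, dt0
        path = [(t, x)]
        while t < t_end:
            dW = rng.normal(0.0, np.sqrt(dt))
            dx = drift(x) * dt + diffusion(x) * dW
            if abs(dx) > tol and dt > 1e-8:
                dt *= 0.5                 # step rejected: retry with a smaller dt
                continue
            x += dx
            t += dt
            dt = min(2.0 * dt, dt0)       # grow the step back towards dt0
            path.append((t, x))
        return np.array(path)

    # e.g. geometric Brownian motion with mu = 0.1, sigma = 0.8
    path = simulate_sde(lambda x: 0.1 * x, lambda x: 0.8 * x, x0=1.0, t_end=1.0)

A pure Python loop like this is exactly the kind of thing Cython chews through orders of magnitude faster, since the per-iteration interpreter overhead disappears once the types are declared.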

Wednesday, August 6, 2014

Holographic principle

So just a while ago I received an email from World Scientific about my article "Easy Holography", which was published in their journal IJMPD. According to the email, my paper is "a great contribution in understanding the relevant field of study, which is worth to be shared." Oh well gee, thanks for the recognition, World Scientific (I'm sure only a select few receive such encouragement)! Oh well, I guess I'll just do as I'm told and share the article!

Anyway, the idea is roughly as follows (it's been a while since I worked on this stuff, so my apologies for possible inaccuracies): let's pretend we live in a two dimensional universe and that this universe is a two dimensional quantum gravity theory which is "asymptotically $AdS_2$"; $AdS_2$ is very similar to the hyperbolic plane $\mathbb H_2$ but of Minkowskian signature (google it or stop reading). Or just think of $\mathbb H_2$. The "asymptotic symmetries" of $\mathbb H_2$ are the diffeomorphisms which preserve the metric near the boundary. For $\mathbb H_2$ these symmetries are isomorphic to the circle diffeomorphisms $Diff(S^1)$, which form an infinite dimensional group. The only thing we need to assume about the quantum gravity theory is that it is integrable w.r.t. the asymptotic symmetries, which in this context just means that every physical state can be mapped to another physical state by the asymptotic symmetries extended to all of space (à la Conformal Field Theory).

The toy model I used in the "Easy Holography" paper was basically just a pixelated image (of a bunny, since it was submitted on Easter). A rough estimate of the entropy of that "model" is then the logarithm of the number of distinct images that can be obtained from an original image by the asymptotic symmetry group, which is isomorphic to $Diff(S^1)$.

So first I have a bunny. Now "coarse grain" it on an $N \times N$ grid, so that a grid cell is black if most of the (Euclidean) surface area under it is black, and white otherwise. Then apply a map from the asymptotic symmetries to the original image and coarse grain that. If the coarse-grained image is different from the original coarse-grained image, count it as a distinct state/configuration. Repeat this for all asymptotic symmetries (there's an uncountably infinite number of them) and write down the number of all distinct images, or configurations, as a function of $N$.

If the maps were just the usual diffeomorphisms, it's easy to see (see the image above or the paper) that the number of distinct configurations is just $2^{N^2}$ and thus the entropy is $S = N^2 \log(2)$. So the entropy is just proportional to the volume of the system, $N^2$. That's the old physics. But if instead we use the asymptotic symmetry maps, it turns out that the entropy thus computed is proportional to $N$, i.e. to the "area" of the system! This happens also for $AdS_3$, whose asymptotic symmetries form the conformal group, isomorphic (if I recall correctly) to $Diff(S^1) \times Diff(S^1)$. In $AdS_3$ the entropy would be proportional to $N^3$ (volume) for ordinary diffeomorphisms and to $N^2$ (area) for the asymptotic symmetries.

So the conclusion would seem to be that the asymptotic symmetries and the asymptotic $AdS$ structure of the universe are somehow related to the Holographic Principle, which states that the entropy is indeed proportional to the surface area of the system instead of the volume... but of course the universe is not three dimensional! So what happens in higher dimensions? The asymptotic symmetry group of $AdS_4$ is disappointingly not infinite dimensional, but instead the six-dimensional $SO(2,2)$. That would actually just lead to a logarithmic entropy, so no luck there...

But what actually is an $AdS$ space, and why would they be interesting? Let's look at (one) definition of an $AdS$ space. We can define $AdS_d$ as the homogeneous space $SO(2, d-1) / SO(2, d-2)$. For $d=2$ we have $AdS_2 \simeq SO(2, 1) / SO(2)$, and for $d=3$, $AdS_3 \simeq SO(2, 2) / SO(2,1)$. For low dimensional Lie groups there exist so-called "accidental isomorphisms", which we can use to write $SO(2,1) \simeq SL(2)$ and $SO(2,2) \simeq SL(2) \times SL(2)$, where $SL(2)$ is the special linear group. So we can write the $AdS$ spaces in an equivalent way as $AdS_2 \simeq SL(2)/SO(2)$ and $AdS_3 \simeq SL(2)$ (Lie groups are manifolds too). For $d>2$ such accidental isomorphisms do not exist. More specifically, the higher dimensional $AdS$ spaces are not related at all to the special linear groups $SL(n)$.

A short recap: $AdS_2$ and $AdS_3$ are intimately related to the Lie group $SL(2)$, but higher dimensional $AdS$ spaces are not. $AdS_2$ and $AdS_3$ admit infinite dimensional asymptotic symmetries, but higher dimensional $AdS$ spaces do not. Clearly the Lie group $SL(2)$ is the key here! Some natural questions are then

1) Do the higher dimensional Lie groups $SL(n)$ (or related homogeneous spaces) also admit infinite dimensional asymptotic symmetries?
2) If they do, will they also incorporate a "natural holography" as with $AdS_2$ and $AdS_3$?
3) Is it then possible to come up with a higher dimensional quantum gravity theory, which is similarly exactly solvable as $AdS_3$ quantum gravity (well, it should be exactly solvable, I guess...)?

The answer to question 1) above is "yes": I showed this for the homogeneous (symmetric) space $SL(3)/SO(3)$ in an earlier paper called "Infinite symmetry on the boundary of $SL(3)/SO(3)$". That paper is not very elegant, and I was working on an extended and more sophisticated version, but I didn't have time to finish writing it... the final equations were actually solved by Robert Bryant at MathOverflow. I couldn't get any research funding for that though, and now I'm doing something completely different! Anyway, hopefully someone will some day find these results useful.

Wednesday, December 18, 2013

Statistical arbitrage with Python

UPDATE: The current GitHub version of the backtest is a bit broken: there was a silly bug that caused the algo to "see" 1 min into the future. Also, the overnight effects/jumps in the morning kinda ruin the intraday trades, so I'm currently rewriting the code for 1 sec bar bid/ask data, to be used intraday only... the current model is also a bit too naive (especially for higher frequencies) and will need several other improvements to be useful in practice. I will probably not share it publicly in the future though, but if you want to talk about it, feel free to drop me an email etc.

Finally managed to complete an early version of my PyArb statistical arbitrage project... I published it on GitHub as a Python module here, although the best way to view it right away is to check out the IPython Notebook here. It is a model-dependent equity statistical arbitrage backtest module for Python. Roughly speaking, the input is a universe of N stock prices over a selected time period, and the output is a mean-reverting portfolio which can be used for trading. The idea is to model "interacting" (correlated, anticorrelated or cointegrated) stock prices as a system of stochastic differential equations, roughly as

$$ dX_t^i = A^i_j X_t^j dt + X_t^i dW_t^i,$$

where $X_t$ are the prices and $dW_t$ are white noises.

The stochastic part doesn't yet play any important role, but that will soon change...

This is just a backtest of a strategy, so there's no saying it will actually work in a live situation (but I'm planning to try paper trading next). Specifically, there's no slippage or impact modelling, no short-sale contract or borrow costs etc. I just assumed a flat rate of \$0.005 per share from Interactive Brokers' website as a sort of ballpark figure. The backtest gives roughly 12% annualized returns with a Sharpe ratio of about 5 and a maximum drawdown of 0.6%. Maybe that sounds a bit too good to be true? Well, maybe I made a mistake, go ahead and check the code! :) (I need to check it again myself anyway, or give it a go in e.g. Quantopian.)

Here's a plot of the cumulative returns for a period of about 300 days. The "mode=0" portfolio is the best one and corresponds to the lowest eigenvalue of the evolution matrix $A$ in the equation above.
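
To make this a bit more concrete, here's a rough NumPy sketch of the construction: estimate $A$ by least squares from the price increments and take the eigenvector belonging to the lowest eigenvalue as the portfolio weights. This is just an illustration of the idea, not the PyArb code:

    import numpy as np

    def mean_reverting_portfolio(prices):
        # prices: (T, N) array of stock prices
        X = prices[:-1]                        # X_t
        dX = np.diff(prices, axis=0)           # X_{t+dt} - X_t over one bar
        # least-squares fit of dX ~= X A^T (the noise term is ignored here)
        A = np.linalg.lstsq(X, dX, rcond=None)[0].T
        eigvals, eigvecs = np.linalg.eig(A)
        mode = np.argmin(eigvals.real)         # "mode=0": the lowest eigenvalue
        w = eigvecs[:, mode].real
        return w / np.abs(w).sum()             # normalized portfolio weights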

Saturday, September 14, 2013

Funny statistics in stock market data


Market data structure functions

I recently got some minute-level stock market data from The Bonnot Gang for some data analysis (and stat arb design) purposes, and I noticed some funny behavior in the data's structure functions. Now, the concept of a structure function may not be very widely known among quants/data analysts/economists, so here's a definition:

Suppose there's a time series $X_t$. The structure function of $X_t$ is defined as

$$ S_n(\tau) = E\left( |X_{t+\tau} - X_t|^n \right), $$

where for a given sample of data you just replace the ensemble expectation $E(\cdot)$ by the sample mean $\frac{1}{N} \sum_{t=0}^{N} (\cdot)$.
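
Computing the empirical structure functions from the data is straightforward; something like this NumPy sketch (minute_prices stands in for the actual series):

    import numpy as np

    def structure_function(x, taus, n):
        # empirical S_n(tau) = mean(|x_{t+tau} - x_t|^n) for a 1-d series x
        x = np.asarray(x, dtype=float)
        return np.array([np.mean(np.abs(x[tau:] - x[:-tau]) ** n) for tau in taus])

    taus = np.arange(1, 181)
    # S3 = structure_function(minute_prices, taus, n=3)
    # xi_3 = np.polyfit(np.log(taus), np.log(S3), 1)[0]  # log-log slope within a scaling regime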

These types of structure functions have been studied for some time now in finance, in the context of similarities between financial markets and hydrodynamic turbulence. I think it all started in 1996 with the paper Turbulent cascades in foreign exchange markets by Ghashghaie et al. They computed the structure functions for some FX market data and found a scaling relation $S_n(\tau) \propto \tau^{\xi_n}$, where $\xi_n$ is a concave function of $n$, implying multiscaling in FX markets, similarly to hydrodynamic turbulence (BTW, their conclusions about the result were a bit out there, but I guess the data analysis is still good).

So I did some of my own data analysis with the Bonnot Gang data (I hope it's not bad data!). Here are a few plots of the structure functions, first for $n=1$:


Then $n=3$:


This is close to linear, i.e. $\xi_3 \approx 1$, as in turbulence. Then $n=10$:


Clearly you can't fit a single power law to all of this, but there seem to be clear power-law regimes separated at about 6, 18, 60 and 180 minutes! I don't know the reason for this, but if I had to guess, I'd say it's because of traders/algorithms operating on different data timeframes... or maybe it's because of the finite tick size...

Anyway, I don't have time to get to the bottom of this, but maybe someone else will... so if you see this stuff in a paper someday, you saw it first here!! ;)

Written with StackEdit. Try it out, it's awesome!! You can do MathJax, sync everything to Google Drive or Dropbox, and publish directly to Blogger, Wordpress, Tumblr etc.!