Friday, 12 June 2015

Peaky Data

Following on from my previous post about charting latency, I recently needed to analyse a large amount of data: trader order flow over many months. I wanted to find any traders experiencing poor latency. Obvious metrics like averages and standard deviations are useful and show part of the picture, but I've often found that latency tends to cluster in more than one place. Here is an example:


Most latency is low and sits in the large peak on the left, but there is another, smaller peak with far greater latency. Chances are that the higher-latency peak has something to do with the type of order the trader is submitting to the matching engine: for whatever reason it has a higher latency cost. If we can find scenarios like this, we can look for commonality between the orders in this peak, then build some tests and figure out what's going on.

Analysing hundreds of traders over many months means we want to do this in an automated way, so let's do it in code.

In this case I had many days' worth of logs that had been created with BinaryToFIX, as discussed in my previous post. I wanted to analyse each one and log whether there was more than one peak where orders were being amended with OrderCancelReplace messages.

First read the CSV and pick out the execution reports (MsgType of 8) telling us the order has been replaced, i.e. with ExecType of 5.

> messages = read.csv("messages.csv")
> execs = messages[messages$MsgType=="8",]
> replaced = execs[execs$ExecType=="5",]
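The peak finder further down works on bucketed counts rather than raw rows, so the latencies need to be turned into a histogram first. The post does not show that step; the following is only a minimal sketch of how it might look. The `Latency` column name is an assumption (the real name depends on the CSV), and synthetic data stands in for `replaced` so the snippet runs on its own.

```r
# Synthetic stand-in for `replaced`; the Latency column name is hypothetical
set.seed(1)
replaced = data.frame(Latency = c(rnorm(5000, mean = 50, sd = 5),
                                  rnorm(1000, mean = 200, sd = 5)))

# plot = FALSE returns the bucket boundaries and counts without drawing anything
buckets = hist(replaced$Latency, breaks = 100, plot = FALSE)

length(buckets$counts)   # number of buckets
length(buckets$breaks)   # bucket boundaries: always one more than the buckets
```

Note that `breaks` is only a suggestion to `hist`, so the actual number of buckets may differ from 100.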

Given a histogram like the one plotted earlier, how do I find each peak?

Let's define a local maximum as a point in the dataset which is greater than its two adjacent points and also a good distance above its last local minimum. In my case I chose 10% of the highest point to be a good distance.
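To make the first half of that definition concrete, here is a toy illustration on made-up counts (not real order data). It uses the sign of the difference between neighbours to flag every point that is higher than both of its neighbours.

```r
# Made-up counts: a large peak, a tiny bump, and a second large peak
counts = c(0, 5, 20, 8, 9, 7, 2, 15, 3)

# Difference between each pair of adjacent points
d = diff(counts)

# TRUE where the series rises into a point and falls away after it
locmax = sign(c(0, d)) > 0 & sign(c(d, 0)) < 0
which(locmax)   # 3 5 8
```

Position 5 (count 9) is flagged even though it is only 1 above the dip before it. That is the kind of false peak the "good distance" rule is there to remove.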

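getLocalMax = function(bucks) {
    # bucks: a histogram object with $breaks and $counts (e.g. from hist())
    maxCount = max(bucks$counts)
    # Smooth the counts before looking for turning points.
    # breaks has one more element than counts, so the final value is NA.
    smoothed = ksmooth(bucks$breaks, bucks$counts, kernel="normal", bandwidth=2)
    dsmooth = diff(smoothed$y)
    # Rising into a point and falling out of it marks a local maximum;
    # falling in and rising out marks a local minimum
    locmax = sign(c(0, dsmooth)) > 0 & sign(c(dsmooth, 0)) < 0
    locmin = sign(c(0, dsmooth)) < 0 & sign(c(dsmooth, 0)) > 0
    # For each bucket, carry forward the count at the most recent local minimum
    lastMin = 0
    lastLocMin = mapply(function(x, y) {
        if (x) {
            lastMin <<- y
        }
        lastMin
    },
    locmin %in% TRUE,   # %in% maps any NA to FALSE
    bucks$counts)

    # Keep a local maximum only if it is more than 10% of the highest count
    # above the last local minimum; the highest bucket is always kept
    mapply(function(x, y, z) { (x & ((y - z) > (maxCount / 10))) | y == maxCount },
           locmax,
           bucks$counts,
           lastLocMin)
}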


We use diff to find the difference between each pair of adjacent points, so for the histogram plotted earlier we get the following:

> dsmooth
  [1]    0    0    0    0    0    0    0    0    0    0    0    0    0
 [14]    2   16   58  159  238  536  816 1002 1302 1258  993  684  276
 [27] -110  -72 -503 -451 -365 -735 -631 -502 -511 -476 -469 -461 -243
 [40] -362 -261 -219 -219 -144  -59 -143 -100  -72  -58  -41  -27   -5
 [53]  -21  -15  -14    5   -9  -19    4  -12    4    1   -7   -1   -5
 [66]    9    2  -10    3    3  -10    2    9    1  -12    7    3   -8
 [79]    3   -3    4    0   -3    3   -4   -3   -3    3    7   -6    0
 [92]    0   -2    5   -3   -3   -2    5   -3    1   -4    7   -3    0
[105]   -2    1    1   -1    7    1   -2   -8    2   -2    5   -5   11
[118]   -6   -3    6   -8    2    2    1   -1    2   -3    3    4   -4
[131]   -4    0    1    7   -8    0    3   -3    4   -3    3   -4    1
[144]   -2   -1    0    4    2   -4    0    7   -4   -3   -5    5    2
[157]   -2   -2   -1    3    2   -5    7   -7    3   -3    6   -5   11
[170]   -6    2   -2    0   -1    7   -3   -7    3   -1   -1   -4    5
[183]   -4    1   -3    3   -1    1   -2   -2    2    3    1   -4    3
[196]   -1   -3    0    5   -2   -1   -1   -1    1    4   -3    6   -6
[209]    6   -5   -2    0    6    1    1   -4    7   10   55  121  237
[222]  320  331  385  161  -93 -222 -244 -242 -182 -186 -133  -83  -78
[235]   -2  -53  -30  -21   -4  -13  -13   -7   -8    1   -5   -4    0
[248]   -5    4    3   -4   -2   -3    4   -1   -5    0    1    5   -2
[261]    1   -1    1    2   -2   -2    2   -2    1    4   -7    7   -2
[274]    2   -3   -2    4   -2   -3    3   -2   -2    0    4   -3   -3
[287]    1    4   -1    1    0    0    3    2   -5   -3    2   -3   -2
[300]    6    1    1    1   -5    0    1    4   -5    1    3   -6    3
[313]   -3    2   -1    4   -3    2   -2   NA
 
Our first attempt at finding local maxima, locmax, flags too many false ones, because every small wobble in the tails counts as a peak:

> locmax
  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [23] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
 [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [45] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [56] FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE
 [67] FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE
 [78]  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
 [89] FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE
[100] FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
[111]  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE FALSE  TRUE
[122] FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE
[133] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE
[144]  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE
[155] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE
[166]  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE  TRUE
[177] FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE  TRUE
[188] FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE
[199] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE
[210]  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
[221] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
[232] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[243] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
[254] FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE
[265]  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE  TRUE
[276] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE
[287] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
[298]  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE
[309] FALSE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE
[320] FALSE FALSE


The rest of the function gets rid of any local maximum that does not rise more than 10% of the maximum count above the last local minimum:

> getLocalMax(hist)
  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [23] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
 [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [45] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [56] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [67] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [89] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[100] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[111] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[122] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[133] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[144] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[155] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[166] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[177] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[188] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[199] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[210] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[221] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
[232] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[243] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[254] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[265] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[276] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[287] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[298] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[309] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[320] FALSE FALSE
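Scanning 321 printed values by eye is error-prone, and `which()` reports the positions directly. The sketch below rebuilds a vector of the same shape by hand, purely for illustration; in practice the result of `getLocalMax` would be passed in instead of the hypothetical `flags`.

```r
# Hand-built logical vector with TRUE at the two positions shown in the output above
flags = rep(FALSE, 321)
flags[c(27, 226)] = TRUE

which(flags)   # positions of the surviving peaks
sum(flags)     # number of peaks
```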

 
Now we've just got two local maxima, at positions 27 and 226:

 
We can now wrap this getLocalMax function up in a script to search for traders who exhibit multiple peaks and look into their order flow in more detail.
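As a rough idea of what such a script might look like, here is a minimal sketch. So that it runs on its own, it uses `findPeaks`, a simplified stand-in for `getLocalMax` that skips the smoothing step and works on a plain vector of counts. The trader names and counts are made up, and none of this is the script actually used for the analysis.

```r
# Simplified stand-in for getLocalMax (no smoothing), for illustration only
findPeaks = function(counts) {
    d = diff(counts)
    locmax = sign(c(0, d)) > 0 & sign(c(d, 0)) < 0
    locmin = sign(c(0, d)) < 0 & sign(c(d, 0)) > 0
    # Count at the most recent local minimum, carried forward bucket by bucket
    lastMin = c(0, counts[locmin])[cumsum(locmin) + 1]
    # Keep peaks well clear of the last minimum, plus the highest bucket
    (locmax & (counts - lastMin) > max(counts) / 10) | counts == max(counts)
}

# Hypothetical per-trader bucket counts; in practice these would come from
# hist(..., plot = FALSE)$counts for each trader's latencies
traderCounts = list(
    traderA = c(0, 5, 20, 8, 9, 7, 2, 15, 3),   # two clear peaks
    traderB = c(0, 4, 18, 9, 5, 3, 2, 1, 0)     # a single peak
)

# Number of peaks per trader
peakCounts = sapply(traderCounts, function(counts) sum(findPeaks(counts)))

# Traders worth a closer look
names(peakCounts)[peakCounts > 1]   # "traderA"
```

Swapping `findPeaks` for `getLocalMax` and feeding it real histogram objects would follow the same pattern: count the TRUE values per trader and report anyone with more than one.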