Friday, 12 June 2015

Peaky Data

Following on from my previous post about charting latency, I recently needed to analyse a large amount of data - trader order flow over many months - searching for any traders experiencing poor latency. Obvious metrics like averages and standard deviations are useful and show part of the picture, but when analysing latency I've often found it tends to cluster in more than one place. Here is an example:


Most latency is low and falls within the large peak on the left, but we have another, smaller peak with far greater latency. Chances are the higher-latency peak has something to do with the type of order the trader is submitting to the matching engine - for whatever reason it carries a higher latency cost. If we can find scenarios like this we can look for commonality between the orders in that peak, build some tests and figure out what's going on.

Analysing hundreds of traders over many months means we want to do this in an automated way, so let's do it in code.

In this case I had many days' worth of logs that had been created with BinaryToFIX, as discussed in my previous post. I wanted to analyse each one and log whether there was more than one peak in the latency of orders being amended with OrderCancelReplace messages.

First read the CSV and filter down to the execution reports telling us the order has been replaced, i.e. those with an ExecType of 5.

> messages = read.csv("messages.csv")
> execs = messages[messages$MsgType=="8",]
> replaced = execs[execs$ExecType=="5",]
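
To turn those latencies into buckets we can build a histogram object without plotting it. A minimal sketch - the 50 microsecond bucket width is just an illustration, pick whatever suits your data:

> lat = replaced$Latency[!is.na(replaced$Latency)]
> hist = hist(lat, breaks=seq(0, max(lat)+50, by=50), plot=FALSE)

The resulting object carries the breaks and counts that the peak detection below works from.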

Given a dataset like the above, how do I find each peak?

Let's define a local maximum as a point in the dataset which is greater than its two adjacent points but also a good distance above its last local minimum. In my case I chose 10% of the highest point to be a good distance.

getLocalMax = function(bucks) {
    maxCount = max(bucks$counts)

    # Smooth the bucket counts to suppress noise, then take the first
    # difference so that sign changes mark local maxima and minima.
    smoothed = ksmooth(bucks$breaks, bucks$counts, kernel="normal", bandwidth=2)
    dsmooth = diff(smoothed$y)
    locmax = sign(c(0, dsmooth)) > 0 & sign(c(dsmooth, 0)) < 0
    locmin = sign(c(0, dsmooth)) < 0 & sign(c(dsmooth, 0)) > 0

    # For each bucket, carry forward the count at the most recent local minimum.
    lastMin = 0
    lastLocMin = mapply(function(isMin, count) {
        if (isMin) {
            lastMin <<- count
        }
        lastMin
    },
    locmin %in% TRUE,
    bucks$counts)

    # Keep a local maximum only if it rises at least 10% of the highest
    # count above the preceding local minimum; the highest bucket always counts.
    mapply(function(isMax, count, minCount) {
        (isMax & ((count - minCount) > (maxCount / 10))) | count == maxCount
    },
    locmax,
    bucks$counts,
    lastLocMin)
}


We are using diff to get the difference between each pair of adjacent points in the smoothed counts, so for the data plotted above we get the following:

> dsmooth
  [1]    0    0    0    0    0    0    0    0    0    0    0    0    0
 [14]    2   16   58  159  238  536  816 1002 1302 1258  993  684  276
 [27] -110  -72 -503 -451 -365 -735 -631 -502 -511 -476 -469 -461 -243
 [40] -362 -261 -219 -219 -144  -59 -143 -100  -72  -58  -41  -27   -5
 [53]  -21  -15  -14    5   -9  -19    4  -12    4    1   -7   -1   -5
 [66]    9    2  -10    3    3  -10    2    9    1  -12    7    3   -8
 [79]    3   -3    4    0   -3    3   -4   -3   -3    3    7   -6    0
 [92]    0   -2    5   -3   -3   -2    5   -3    1   -4    7   -3    0
[105]   -2    1    1   -1    7    1   -2   -8    2   -2    5   -5   11
[118]   -6   -3    6   -8    2    2    1   -1    2   -3    3    4   -4
[131]   -4    0    1    7   -8    0    3   -3    4   -3    3   -4    1
[144]   -2   -1    0    4    2   -4    0    7   -4   -3   -5    5    2
[157]   -2   -2   -1    3    2   -5    7   -7    3   -3    6   -5   11
[170]   -6    2   -2    0   -1    7   -3   -7    3   -1   -1   -4    5
[183]   -4    1   -3    3   -1    1   -2   -2    2    3    1   -4    3
[196]   -1   -3    0    5   -2   -1   -1   -1    1    4   -3    6   -6
[209]    6   -5   -2    0    6    1    1   -4    7   10   55  121  237
[222]  320  331  385  161  -93 -222 -244 -242 -182 -186 -133  -83  -78
[235]   -2  -53  -30  -21   -4  -13  -13   -7   -8    1   -5   -4    0
[248]   -5    4    3   -4   -2   -3    4   -1   -5    0    1    5   -2
[261]    1   -1    1    2   -2   -2    2   -2    1    4   -7    7   -2
[274]    2   -3   -2    4   -2   -3    3   -2   -2    0    4   -3   -3
[287]    1    4   -1    1    0    0    3    2   -5   -3    2   -3   -2
[300]    6    1    1    1   -5    0    1    4   -5    1    3   -6    3
[313]   -3    2   -1    4   -3    2   -2   NA
 
Our first attempt at finding local maxima, locmax, picks up far too many false ones:

> locmax
  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [23] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
 [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [45] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [56] FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE
 [67] FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE
 [78]  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
 [89] FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE
[100] FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
[111]  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE FALSE  TRUE
[122] FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE
[133] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE
[144]  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE
[155] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE
[166]  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE  TRUE
[177] FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE  TRUE
[188] FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE
[199] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE
[210]  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
[221] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
[232] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[243] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
[254] FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE
[265]  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE  TRUE
[276] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE
[287] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
[298]  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE
[309] FALSE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE
[320] FALSE FALSE


The rest of the function discards any local maximum that is less than 10% of the highest count above its last local minimum:

> getLocalMax(hist)
  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [23] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
 [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [45] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [56] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [67] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [89] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[100] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[111] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[122] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[133] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[144] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[155] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[166] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[177] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[188] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[199] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[210] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[221] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
[232] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[243] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[254] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[265] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[276] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[287] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[298] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[309] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[320] FALSE FALSE

 
Now we've just got two local maxima:

 
We can now wrap this getLocalMax function up in a script to search for traders who exhibit multiple peaks and look into their order flow in more detail.
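
As a sketch of what that wrapper might look like - the per-trader file layout, column names and bucket size here are hypothetical, so adapt them to however your logs are organised:

# Hypothetical sketch: flag any capture file whose OrderCancelReplace
# latency histogram has more than one peak, using getLocalMax from above.
findPeakyTraders = function(files, bucketSize=50) {
    peaky = c()
    for (f in files) {
        messages = read.csv(f)
        execs = messages[messages$MsgType == "8",]
        lat = execs[execs$ExecType == "5",]$Latency
        lat = lat[!is.na(lat)]
        if (length(lat) == 0) next
        buckets = hist(lat, breaks=seq(0, max(lat)+bucketSize, by=bucketSize), plot=FALSE)
        if (sum(getLocalMax(buckets), na.rm=TRUE) > 1) {
            peaky = c(peaky, f)
        }
    }
    peaky
}

Running it over a directory holding one capture file per trader might then look like:

> findPeakyTraders(list.files("logs", pattern="\\.csv$", full.names=TRUE))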


Tuesday, 10 February 2015

What's my latency?

I work in finance and many projects that I've been involved with have either connected to some sort of order matching engine or the project itself has been the implementation and integration of matching engines - often third party.

TCP/IP is almost always used for sending orders, with some sort of application-level encoding and session layered over it. Irrespective of how the matching engine is implemented, its system software will provide some sort of gateway that hosts the endpoint for the client's TCP/IP connection.

At some point the latency of client connectivity becomes a concern and we look for ways of measuring it. Generally, a client submits an order which is processed, placed into the market, and then a response is returned indicating the state of that order once it's in the market. It may just go onto a resting order book as-is, or perhaps it's matched and is now partially or fully filled.

Often clients connect via the FIX Protocol and this is the example I'll use here as it's widely understood.

There are several ways in FIX that orders or quotes can be submitted and in this example we'll just focus on one of them:

  1. An order is submitted with a NewOrderSingle (NOS) message
  2. A reply is returned within an ExecutionReport (ER) message stating the order has successfully been placed (ExecType=0)
  3. The order is amended (for example the price is changed) with an OrderCancelReplaceRequest (OCR)
  4. A reply is returned within an ER stating the order has been replaced (ExecType=5).

The NOS/OCR and the ER are correlated using a unique field supplied on the NOS/OCR called ClOrdID. Rather simplistically, the states and transition messages being discussed here are:


What we are interested in is how long it takes between sending the request message and receiving the reply in the ER. This post discusses some ways to analyse this latency to try to get some meaningful results.
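
As an illustration of that correlation, if the captured messages were already in an R data frame then pairing each request with its ER on ClOrdID and differencing the capture times takes only a few lines. This is just a sketch - the msgs data frame and its CaptureMicros column are made-up names for the example, and the BinaryToFIX utility described below does this correlation for you:

# Hypothetical sketch: msgs has one row per captured FIX message, with
# MsgType, ClOrdID and a CaptureMicros timestamp taken from the packet capture.
requests = msgs[msgs$MsgType %in% c("D", "G"), c("ClOrdID", "CaptureMicros")]  # NOS and OCR
replies  = msgs[msgs$MsgType == "8", c("ClOrdID", "CaptureMicros")]            # ExecutionReports
paired   = merge(requests, replies, by="ClOrdID", suffixes=c(".req", ".rep"))
latency  = paired$CaptureMicros.rep - paired$CaptureMicros.req                 # in microseconds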


What To Measure?

Assuming we have sampled a large number of these messages (more on this later) that are expected to have the same latency cost, the obvious statistics such as the mean, standard deviation and so on have value, but they hide problems. A grid of numbers does not make communicating a problem to non-technical managers straightforward either - we need some meaningful pictures.
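
For instance, the headline numbers take a line each in R (here latencies is just a vector of request-to-reply latencies in microseconds), but none of them tell you anything about the shape of the distribution:

> mean(latencies)
> sd(latencies)
> quantile(latencies, c(0.5, 0.9, 0.99, 0.999))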

To understand the profile of the latency incurred we want to see the distribution. Let's look at an example.


 



The X axis is the latency and the Y axis the number of messages incurring that latency - it's a histogram, as we've bucketed the latencies together, i.e. each bar represents the number of messages with latency between x and y. In the above we have a large peak with low latency, with a leading edge that quickly peaks and then steeply falls away with no shoulder. If we took an average of the data it would fall somewhere in this peak - but wait, what's that other peak? This distribution is multi-modal. The smaller "hump" represents fewer messages but they are incurring far higher latency. Why does a proportion of the messages require more processing on the server than expected? Something to investigate.






In this dataset we have 200,000 messages and, as you can see, the majority are clumped together and look to have good latency. What's also just visible is a tail, and this tail shows there is a small number of messages with increasing latency. I've omitted the X axis but in this example it goes up to 300 microseconds. Charting the rest of the tail beyond 300us we see the following.






These messages are incurring latencies above 300us, with a select few even taking milliseconds. In fact, in this example 1,000 out of the 200,000 messages have latency above 300us. Again, something to investigate.



The shape of the distribution quickly highlights issues such as:
  1. Long tails show how latency degrades and should be a focus for investigation.
  2. Big fat peaks show how a service has a lot of variance in its latency.
  3. A multi-modal distribution shows different processing costs for subsets of the messages.
  4. A negative skew shows some messages incurring reasonable latency but the bulk having relatively larger latency. With code optimisation maybe we can change the profile to have more of a positive skew?
  5. Random clumps of outlying data points can point to things like Java garbage collection, operating system swapping or TCP/IP windows filling.
  6. A highly erratic shape often hints at poorly written software or a system loaded beyond its maximum capacity.


Collect The Data

FIX messages have timestamps in them, but they come from the clock of the host that sent the message, so comparing them is meaningless. One could write a test harness that sends and receives messages, timing them with the local clock; however, this approach means the performance of the test harness is part of the measurement, and we cannot measure a production system day in and day out this way.

A better approach is to use a network capture, as this removes any test client (and the machine it's running on) from the measurements and can also be used in a production environment with a packet sniffer or, as a last resort, tcpdump.

I've recently put a utility called BinaryToFIX on GitHub. It's code I had within HermesJMS and have been using for years now - it will decode a packet capture and write a CSV file with various fields from the messages, most importantly the capture time for each message and the latency between specific messages - including between a NOS or OCR and its ER.

Some organisations I've worked for actively monitor their FIX connections using network sniffers and record the traffic, so a day's production or test environment activity is available for download. Where this is not the case, one may have to arrange for a network capture device to be attached to a switch's SPAN port or have a machine's optical network connection tapped. When captured with a capable device you should get pretty much zero packet loss and very low capture latency. If you have to resort to a software capture then your success can vary, but it's still often better than nothing - it's worth explaining to your local network/systems admin what you're doing, as a software capture on the same host as a production system will have side effects...


Analyse The Data

Many people reach for Excel to start looking at this kind of data, but we are programmers so there has to be a better way - and anyway, the solution must scale to very large datasets, which is hardly Excel's strong point. It's likely any solution will need to be automated in future, so we may as well get programming from the start.

A colleague introduced me to R and it's very capable; it is, after all, a language for statistics and, importantly, it has very good charting capabilities. It won't win any beauty contests as a language but it's powerful, relatively easy to learn, and there are masses of documentation, forums and blogs on the internet to learn from. It has some "Big Data" capabilities and you can even build web sites with it to make your reports dynamic.

BinaryToFIX outputs CSV like this:

"Capture Timestamp","Capture Micros","Latency","MsgType","SenderCompID","TargetCompID","SecurityID","ClOrdID","OrderID","ExecType","CxlRejReason","Text"
"20150127-10:29:00.626","1422354540626175","","D","BANKA","BROKER","XYZ 2Y","37c9df75-b0c","","","",""
"20150127-10:29:00.637","1422354540637598","11423","8","BROKER","BANKA","XYZ 2Y","37c9df75-b0c","20150127422.0","0","",""
"20150127-10:29:00.662","1422354540662951","","D","BANKA","BROKER","XYZ 3Y","76c781cb-cfb","","","",""
"20150127-10:29:00.665","1422354540665779","2828","8","BROKER","BANKA","XYZ 3Y","76c781cb-cfb","20150127423.0","0","",""
"20150127-10:29:00.710","1422354540710473","","D","BANKA","BROKER","XYZ 4Y","1b8e2f94-132","","","",""
"20150127-10:29:00.713","1422354540713233","2760","8","BROKER","BANKA","XYZ 4Y","1b8e2f94-132","20150127424.0","0","",""
"20150127-10:29:00.741","1422354540741122","","D","BANKA","BROKER","XYZ 5Y","20b335f2-36c","","","",""
"20150127-10:29:00.742","1422354540742088","","D","BANKA","BROKER","XYZ 2Y","9b7cba00-cca","","","",""
"20150127-10:29:00.743","1422354540743266","2144","8","BROKER","BANKA","XYZ 5Y","20b335f2-36c","20150127425


The latency column in the data is populated for any message where MsgType=8, i.e. an execution report. Within that subset ExecType tells us whether the order is New (0) or Replaced (5).

We can use R to start analysing the data; let's see how many new orders versus cancel/replaces there are:


> rows = read.csv("/tmp/latencies.csv")
> execs = rows[rows$MsgType=="8",]
> length(execs[execs$ExecType == "0",]$Latency) 
[1] 51
> length(execs[execs$ExecType == "5",]$Latency) 
[1] 20348

Let's extract all the latencies for the cancel/replaces; NA occurs in the data when there was no latency set in the CSV, so we strip those out.


> crs = execs[execs$ExecType == "5",]$Latency
> crs = crs [!is.na(crs)]

> length(crs)
[1] 20345

Now for the charting. I want to create two charts, one with the bulk of the data and the other zooming in on the tail. To do this we have a simple function that splits the data set in two, buckets each part and plots it.


doChart = function(all, title, maxVal, bucketSize) {
    # Bucket and plot everything between 0 and maxVal - the body of the distribution.
    breaks = seq(0, maxVal, by=bucketSize)
    main = subset(all, all > 0 & all < maxVal)
    tail = subset(all, all > maxVal)
    print(sprintf("replaced tail is %d", length(tail)))
    hist(main, breaks, main=title)
    # Bucket and plot everything above maxVal - the tail - on its own scale.
    breaks = seq(min(tail), max(tail)+bucketSize, bucketSize)
    hist(tail, breaks, main=sprintf("Tail of %s", title))
}
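
A call along the following lines produces both charts; the 10000us cut-off and 50us bucket size are only illustrative, so pick values that suit your own data:

> doChart(crs, "OrderCancelReplace latency (us)", 10000, 50)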

Let's see what it makes of our data, first the body:


 Next the tail:

Looks like we have two questions here:

  1. What is causing the large shoulder with such a wide variance in latency between 1000 and 5000us?
  2. What causes the small number of very high latencies in the tail?


Next Steps

In this example we are using R as a quick way to chart data and do some basic analysis, but there is far more scope: for example, creating a dynamic report web page using Shiny, searching for distinct latency patterns such as multiple peaks and fat tails, or breaking down latency by instrument and time of day - this is where the choice of R begins to pay off.

In the next post we'll start analysing data to look for specific patterns...