How to beat kdb+

Tags: kdb, serialization, tickerplant

Published: September 3, 2025

I’ve recently been pondering how to improve the performance of kdb+. For what it’s worth, I think it’s going to be really hard to beat kdb+ at its “A game”, which is quant analytics over big data. Take the evaluation of the constraints in a select template as an example.

[*] well … maybe not “easy”

If we rewrote this in a language that compiles to native code we could probably go a bit faster just by avoiding the allocations and memcpys between constraints, since I think kdb+ has to materialise intermediate vectors to hand each successive operator the data it needs to compare.
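To make that concrete, here’s a minimal sketch in C of what fusing two constraints into a single pass might look like. The column names, types and predicates are invented for illustration and have nothing to do with kdb+’s internals.

```c
#include <stddef.h>
#include <stdint.h>

/* Evaluate two constraints (price > 100.0, size < 500) fused into a single
 * pass: no intermediate boolean vectors, no memcpy between steps. */
size_t fused_where(const double *price, const int64_t *size, size_t n,
                   size_t *out)                 /* out: indices of matching rows */
{
    size_t m = 0;
    for (size_t i = 0; i < n; i++)
        if (price[i] > 100.0 && size[i] < 500)  /* both constraints at once */
            out[m++] = i;
    return m;                                   /* number of matches */
}
```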

However, there are some areas where I think we can do much better, in particular where we simply avoid doing unnecessary work.

Two common instances that spring to mind are kdb+ tickerplants and gateway processes, which deserialise and then reserialise the messages, perhaps without changes. There’s also IPC compression, which I’ll take first.

IPC compression

I remember being told once by Simon Garland (lovely chap) that kdb+ would not compress data if it were destined for a localhost address, or if the IPC type was level 2 (which would also mean you can’t send timestamps, IIRC). You may get the sense by the end of this blog that I hold an unsubstantiated belief that we can push more work into the network to save on CPU cycles. Enterprise networks are now ridiculously fast and we should lean on them as much as we can. I really should measure before making statements like this, but, mea culpa, I haven’t.

Anyway, the question arises of how to send data between hosts without having kdb+ apply compression to the IPC message. We could use websockets and encode with -8! … but that lacks refinement, and it means the receiving process gets a reified byte vector (another allocation) that it’ll still need to decode, which is suboptimal. Perhaps Kx Systems could add a command-line switch to turn off compression? Somehow, I doubt that’s likely to arrive soon.

Probably the simplest alternative is to serialise and write the data ourselves from a shared library loaded into the q process. We’d own the messaging layer too, so we could do fun things with io_uring, for example.
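As a rough sketch of what that shared library might do, assuming the KX C bindings (k.h, with knk/ks/b9) and a TCP socket we’ve connected and handshaken ourselves: error handling is omitted, and the b9 mode value should be checked against the KX docs.

```c
#include <unistd.h>
#include "k.h"

/* Serialise an update ourselves and write the raw bytes to a socket we own,
 * bypassing kdb+'s IPC send path (and therefore its compression). */
void send_upd(int fd, K data)                /* data: rows to publish */
{
    /* Build (`.u.upd;`trade;data); r1 keeps the caller's reference to data
       alive, since knk takes ownership of its arguments.                   */
    K msg = knk(3, ks((S)".u.upd"), ks((S)"trade"), r1(data));
    K ser = b9(3, msg);                      /* serialise; 3 = capability level */
    /* ser holds the full IPC message, 8-byte header included, uncompressed */
    write(fd, kG(ser), (size_t)ser->n);      /* or queue an io_uring SQE instead */
    r0(ser);
    r0(msg);
}
```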

Tickerplants

These are contenders for the title of “worst offender”. Let’s just say up-front that I’m talking about a zero-latency tickerplant (i.e. one that does no batching), whose subscribers not uncommonly hold wildcard subscriptions for all symbols. There’s clearly more going on where kdb+ has to do symbol filtering. So here I’m talking about a tickerplant that receives data, writes it down, then broadcasts it, unchanged, to its subscribers.

In the normal flow, a message is received from the feed-handler by kdb+ and deserialised. It’s then serialised again and written to the log file, and that encoded form is immediately discarded, oh my! Finally it’s broadcast to subscribers, performing the very same serialisation step once more.

Writing in some other language, C or Java or perhaps even Python, we can do better by simply stripping the header off the IPC message and copying the payload to the log file, which skips the deserialisation step and the first of the two re-serialisation steps. Then we can go faster still by forwarding the IPC message unchanged to all subscribers. Again, I note that no filtering is going on … because in this case, none is needed.
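Here’s a minimal single-message sketch of that loop in C, assuming uncompressed, little-endian messages. I’m also assuming the log record format really is just the header-stripped payload, which you’d want to verify against -11! replay before trusting it. Partial writes, buffer sizing and error handling are all glossed over.

```c
#include <stdint.h>
#include <string.h>
#include <unistd.h>

static void read_exact(int fd, void *p, size_t n)    /* no real error handling */
{
    size_t off = 0;
    while (off < n) {
        ssize_t r = read(fd, (char *)p + off, n - off);
        if (r <= 0) return;                           /* EOF/error: give up */
        off += (size_t)r;
    }
}

/* Read one IPC message from the feed, append its payload to the log and fan
 * the untouched bytes out to every subscriber. */
void pump_one(int feed_fd, int log_fd, const int *subs, int nsubs)
{
    unsigned char buf[1 << 20];              /* assume one message fits        */
    uint32_t len;

    read_exact(feed_fd, buf, 8);             /* 8-byte IPC header              */
    memcpy(&len, buf + 4, 4);                /* total length, header included  */
    read_exact(feed_fd, buf + 8, len - 8);   /* the rest of the message        */

    write(log_fd, buf + 8, len - 8);         /* payload only: header stripped  */
    for (int i = 0; i < nsubs; i++)
        write(subs[i], buf, len);            /* forward the bytes unchanged    */
}
```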

By-table filtering

OK, so let’s just say you have a requirement to filter by table, and are interested in trade but not quote messages. The solution here is to scan the IPC message without deserialising it, which is fairly simple: you just walk the IPC bytes looking for the list header, the symbol atom .u.upd and then the table-name symbol that follows it. Done.
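A sketch of that scan in C, assuming the usual uncompressed async message of the form (`.u.upd;`tablename;data) with both the function and the table name as symbol atoms; bounds checks are left out.

```c
#include <stdint.h>
#include <string.h>

/* Peek at the table-name symbol in a (`.u.upd;`tablename;data) message without
 * deserialising it. Returns 1 for a match, 0 for no match, -1 for "don't know". */
int is_table(const unsigned char *msg, uint32_t len, const char *table)
{
    const unsigned char *p = msg + 8;            /* skip the 8-byte IPC header    */
    (void)len;                                   /* bounds checks omitted here    */
    if (msg[2] != 0) return -1;                  /* compressed: can't scan, punt  */
    if (p[0] != 0x00) return -1;                 /* expect a mixed list (type 0)  */
    p += 1 + 1 + 4;                              /* type, attributes, int32 count */
    if (*p++ != 0xf5) return -1;                 /* first item: symbol atom (-11) */
    p += strlen((const char *)p) + 1;            /* skip ".u.upd" and its NUL     */
    if (*p++ != 0xf5) return -1;                 /* second item: the table name   */
    return strcmp((const char *)p, table) == 0;
}
```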

Gateways

These are a bit more involved, and again, I see performance gains on the table from not deserialising and reserialising the messages. Some gateways implement an aggregation step, but here I’m just thinking about a plain-Jane load-balancer that dispatches queued queries to the next available instance that can answer them. It’s not rocket science. I wrote one of these for an old day-job, and it seemed to do the right thing. The only wrinkle was that if your client said it could only understand kdb+ IPC type 3, but your remote service wanted to reply with “big IPC” type 5, you would run into issues. The solution was simple: we logged in to the remote services using type 3 ourselves, so they never tried to reply with type 5.
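The handshake fix is tiny. A sketch in C, with an invented user:pass, of logging in to the remote service while advertising capability 3 so it never attempts a type-5 reply:

```c
#include <string.h>
#include <unistd.h>

/* Log in to a remote q service advertising capability 3, so that it never
 * tries to reply with "big IPC" (type 5/6) framing. fd is a connected socket. */
int kdb_login_capability3(int fd)
{
    const char creds[] = "gateway:password";      /* hypothetical user:pass        */
    char hello[64];
    size_t n = strlen(creds);

    memcpy(hello, creds, n);
    hello[n++] = 3;                               /* capability byte: plain type 3 */
    hello[n++] = '\0';                            /* handshake is NUL-terminated   */
    write(fd, hello, n);

    unsigned char agreed = 0;
    if (read(fd, &agreed, 1) != 1) return -1;     /* closed socket: login rejected */
    return agreed;                                /* agreed capability, <= 3       */
}
```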

XDP & eBPF?

I read recently about XDP and eBPF and how, together, they move packet processing into the kernel or perhaps even onto the NIC. It seems there are some really rough edges around out-of-order packet delivery and fragmented messages, which would complicate the payload inspection, but they shouldn’t be insurmountable, and there’s always the fallback of simply passing the message back up to the “real” gateway for handling if things get too much.

One thing that strikes me is that, apart from needing to know how long an IPC message is, we’re not bothered by packet order: we simply want to forward every packet that comprises the message (TCP_NODELAY!) to the client. Prior to sending the query, the userspace component notifies the eBPF program that any response bytes from remote service K should be forwarded to IP w.x.y.z, source/dest ports S and D. The remote will not send us any further data after responding, so that works out.
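The userspace half of that notification could be little more than a map update before the query goes out. A sketch using libbpf, where the map, key and rule layouts are my own inventions and would have to match whatever the XDP program declares:

```c
#include <arpa/inet.h>
#include <bpf/bpf.h>

struct flow_key { __u32 saddr, daddr; __u16 sport, dport; };  /* remote -> gateway flow */
struct fwd_rule { __u32 client_addr; __u16 client_port; __u16 pad; };

/* Before dispatching a query to a remote service, tell the XDP program where
 * its response bytes should be forwarded. */
int arm_forwarding(int map_fd, const struct flow_key *remote_flow,
                   const char *client_ip, unsigned short client_port)
{
    struct fwd_rule rule = {
        .client_addr = inet_addr(client_ip),      /* the w.x.y.z from the prose */
        .client_port = htons(client_port),
        .pad         = 0,
    };
    /* BPF_ANY: create or overwrite; the XDP program looks this rule up per packet */
    return bpf_map_update_elem(map_fd, remote_flow, &rule, BPF_ANY);
}
```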

The eBPF program will notify the userspace program that the reply to the client is complete, probably via some SPSC ring-buffer. If the first packet with the IPC header is reordered (between London and Tokyo, perhaps) then it gets difficult to know the IPC length and thus figure out how many bytes to expect. In that case the eBPF program would have to pass the bytes up to the userspace part for handling.
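For that completion notification, libbpf’s BPF ring buffer (strictly MPSC rather than SPSC, but close enough) is the obvious candidate. A sketch of the userspace consumer, with an invented event struct that would have to match whatever the XDP program submits:

```c
#include <bpf/libbpf.h>
#include <stdio.h>

struct reply_done { unsigned int remote_ip; unsigned short client_port; unsigned short pad; };

static int on_reply_done(void *ctx, void *data, size_t len)
{
    const struct reply_done *ev = data;           /* layout must match the XDP side */
    (void)ctx; (void)len;
    printf("reply via port %u complete; remote %u is free again\n",
           ev->client_port, ev->remote_ip);       /* e.g. mark the instance idle    */
    return 0;
}

int poll_replies(int ringbuf_map_fd)
{
    struct ring_buffer *rb = ring_buffer__new(ringbuf_map_fd, on_reply_done, NULL, NULL);
    if (!rb) return -1;
    while (ring_buffer__poll(rb, 100 /* ms */) >= 0)
        ;                                         /* spin until error or interrupt  */
    ring_buffer__free(rb);
    return 0;
}
```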

What we’d save here are system calls, memory copies and CPU cycles, and hopefully time.