« June 2007 | Main | February 2008 »

January 25, 2008

Latency vs Throughput

I was helping a Berkeley DB XML user recently who complained that his query took as long as ~55s to run. It turned out he was trying to run 40 concurrent queries, so I tried out the query on his data set using our concurrency testing framework. I was confused when it reported to me that I was getting ~2 ops/s, which is a completely different ballpark.

It took me a while, but eventually I figured it out: When the user was saying that his query took ~55s what he meant was that a thread's view of how long a query takes can be as much as 55s. When I talked about getting ~2 ops/s, I was looking at the system as a whole. The problem was that he was talking about latency, and I was talking about throughput.

Typically the way that a server gets smaller latency is to implement a thread pool and work queue. Since DB XML is an embedded library and does not spawn it's own threads, it is down to the application using DB XML to manage it's threads well.

So I created a Java test program that used a thread pool to manage the queries being run through DB XML (in Java it's really easy). I created a test function that could run the query with arguments of how many threads to create, how many queries to run, and how many queries to start per second.

I ran a number of tests - in all of the test runs I kept the number of queries run as constant, so that exactly the same amount of useful work was done in all cases. I measured "workload" time, which is the time from when the query is requested until it is completed - this is a measure of latency. I also measured the total time taken for all the queries, and the number of operation per second executed - this is a measure of throughput. Here are the actual figures I got:

Queries runQueries per secondThreadsLatency (avg query time in seconds)Throughput (queries per second)
80no delay4032.151.70
80no delay1023.341.96
80no delay422.001.89
8010418.511.88
805413.871.93
80240.901.97

The first thing I noticed is that as the number of threads spawned decreased, latency decreases and throughput increases - this peaked around 4-5 threads on my system. Throughput increased because each thread has a processing overhead associated with it, so less threads means less overhead. Average latency decreased because instead of 5 queries each taking 50s (as an example), the first query takes 10s, and the last takes 50s:

Average time for 5 queries in parallel: (50s + 50s + 50s + 50s + 50s) / 5 = 50s
Average time for 5 queries in series: (10s + 20s + 30s + 40s + 50s) / 5 = 30s

Surprise number one - adding more threads can actually increase your latency!

Next I tried to add a little bit of realism to the simulation - it is very rare that 80 queries need to be executed at exactly the same time, so I added a little bit of a delay between starting each query. As the queries per second decreased throughput was more or less constant, but latency decreased dramatically. This is simple to explain - the queries spent less time in the work queue, when they aren't actually being executed:

Average time for 5 queries executed at the same time: (10s + 20s + 30s + 40s + 50s) / 5 = 30s
Average time for 5 queries executed with a 10s delay between each: (10s + 10s + 10s + 10s + 10s) / 5 = 10s

So surprise number two - a more realistic workload can result in smaller latency!

Posted by john at 10:42 AM | Comments (2)

January 17, 2008

Why DB XML Doesn't Do Path Indexes

Every so often we have a Berkeley DB XML user ask us if we implement path indexes. Or maybe they assume that we do implement them and are confused about why their index isn't working. It turns out that a lot of these users have come to us after using eXist, which for a long time has only been able to specify path indexes.

A path index is specified by describing the path to the nodes that you want to index - often with a subset of XPath. So you might want to index all "firstname" elements, but only if they are children of an "author" element - which you might write like this:

//author/firstname

The good thing about path indexes is that they can be very small targeted indexes, which only record exactly the information that you want.

The trouble with path indexes, I've always explained, is that it's really hard to work out when you can use one in a query. You miss out on lots of potential optimisation opportunities because query analysis can't prove that an index is applicable. It's not surprising really - path indexes are a subset of materialised views which is known to be a very hard problem, in the general case.

So it's not surprising to see that recent work on eXist has added a new type of index which works in exactly the same way as DB XML's do. The author of the blog post even describes the problems that I have been telling our users about, and recommends that path indexes be used sparingly if at all.

It's good to see other people coming to the same conclusion that I came to. DB XML may implement path indexes some day, but they won't be anywhere near as useful as some users think they should be.

Posted by john at 04:55 PM | Comments (0)

January 07, 2008

New Year, New Blog?

I don't normally make New Year's resolutions, but I've been thinking for quite a while that I should be blogging more. The trouble was that snelson.org.uk has become a bit of a family blog, and my family don't want to read my highly technical work related blog entries. On the flip side, any readers interested in XQuery, XML and databases probably aren't that interested in the latest Snelson baby news - gorgeous though my children are :-).

The solution is snelson.org.uk/john which is an index page and atom feed for only my blog posts. The family focused index page now also filters out all posts with the somewhat cryptic category of "Tech", so I don't have to explain to my Dad what the latest XML terms mean (Hi, Dad).

I've tried this blogging thing before, although last time I was being paid to do it. I've never been much of a wordsmith, and I ran out of steam last time. I'm hoping that writing for the love of it about subjects that I'm interested in, and being less ambitious with the frequency of updates will help me to last a bit longer.

Besides, I've really enjoyed reading both Saxon Diaries and Profound Titles No One Gets, since they always write about the highly technical business of XQuery implementation - my day (and sometimes night) job. I've also gained a lot from some great discussions with clever people recently, which I hope to write up some time. Hopefully this blog will help me to have many more conversations like them.

Posted by john at 08:39 PM | Comments (0)