There are only two hard things in computer science: cache invalidation, naming things, and off-by-one errors.
It all started with a problem. I wanted to scan the last n rows of an HBase table using the cbass library, a thin, opinionated wrapper around the hbase-client API. I really like it, and if you’re using Clojure and HBase, I recommend checking it out.
cbass doesn’t have the capability to scan n rows, so I decided to add that functionality. In order to do that, I had to use the underlying hbase-client API. Two ways to limit the results of a scan with the hbase-client API are 1) start/stop rows and 2) time ranges. In my case I wasn’t using either, so it wasn’t immediately obvious how to limit the output to just n results.
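For reference, here’s roughly what those two approaches look like through Clojure interop with the hbase-client Scan API (a minimal sketch; the row keys and timestamps are made-up values, just for illustration):

```clojure
(import '[org.apache.hadoop.hbase.client Scan]
        '[org.apache.hadoop.hbase.util Bytes])

;; 1) bound the scan by start/stop row keys
(def scan-by-rows
  (doto (Scan.)
    (.setStartRow (Bytes/toBytes "row-00100"))    ;; inclusive
    (.setStopRow  (Bytes/toBytes "row-00200"))))  ;; exclusive

;; 2) bound the scan by a time range (epoch millis, [min, max))
(def scan-by-time
  (doto (Scan.)
    (.setTimeRange 1450000000000 1450100000000)))
```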
At first I thought I could just wrap the ResultScanner iterator-seq in a take. That seemed to work on small tables. However, on a large table (defined here as anything over a few thousand rows), the scan would blow up. It turned out that, in order to minimize the number of network calls, a scanner by default will try to get all the matching entries and shove them into a ResultScanner iterator. For this use case there isn’t a limiting criterion on the scanner, so a scan will try to “fetch” the entire table into memory. That’s one large ResultScanner object! No good. What to do?
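The naive attempt looked something like this (a sketch only; scan-first-n and the connection plumbing are hypothetical, not part of cbass):

```clojure
(import '[org.apache.hadoop.hbase.client ConnectionFactory Scan]
        '[org.apache.hadoop.hbase HBaseConfiguration TableName])

(defn scan-first-n
  "Naive attempt: take n results off the scanner's seq.
   The lazy seq is bounded, but the scanner underneath still
   tries to fetch far more than n rows from the region servers."
  [table-name n]
  (with-open [conn    (ConnectionFactory/createConnection (HBaseConfiguration/create))
              table   (.getTable conn (TableName/valueOf table-name))
              scanner (.getScanner table (Scan.))]
    (doall (take n (iterator-seq (.iterator scanner))))))
```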
What I really wanted was a bounded ResultScanner iterator of size n. Using take, I was able to create a bounding iterator of size n, but the underlying iterator ended up table scanning and being much larger than n. Since take isn’t enough, and I wanted to limit the number of rows coming back from HBase, I needed something on the scanner itself to make it stop fetching everything.
Looking at the API names for Scan, you’d assume that setMaxResultSize(long maxResultSize) was what you wanted. But when you read the javadoc more carefully, you realize that it has nothing to do with the number of results coming back from a scan. Ok, maybe it’s setBatch(int batch). Nope, batching in this case is meant for rows that have very many column qualifiers: ‘batch’ means return N of M column qualifiers on the .next() call of a ResultScanner, where M is the total number of column qualifiers in a row and N<=M. Turns out the setting I was looking for is setCaching(int caching). Here’s the description:
Set the number of rows for caching that will be passed to scanners. If not set, the Configuration setting HConstants.HBASE_CLIENT_SCANNER_CACHING will apply. Higher caching values will enable faster scanners but will use more memory.
The description and the name are completely at odds with each other. A better name, in my opinion, would be setFetchSize, setBatchSize, or really anything that doesn’t include the word “cache”. The name “cache” is so confusing that both tolitius and I had to dig into the source code of hbase-client just to convince ourselves it wasn’t actually a cache.
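Knowing that, bounding the scan comes down to setting the caching (really: the fetch size) to n, and still using take to cap the realized results (again a sketch; scan-first-n is a hypothetical helper, not the cbass API):

```clojure
(import '[org.apache.hadoop.hbase.client ConnectionFactory Scan]
        '[org.apache.hadoop.hbase HBaseConfiguration TableName])

(defn scan-first-n
  "Bounded scan: setCaching limits how many rows the scanner
   fetches per trip to the region server, and take caps the
   number of results we realize."
  [table-name n]
  (let [scan (doto (Scan.)
               (.setCaching (int n)))]  ;; despite the name, this is the fetch size
    (with-open [conn    (ConnectionFactory/createConnection (HBaseConfiguration/create))
                table   (.getTable conn (TableName/valueOf table-name))
                scanner (.getScanner table scan)]
      (doall (take n (iterator-seq (.iterator scanner)))))))
```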
I would rename setMaxResultSize to setMaxResultByteSize, setBatch to setColumnsChunkSize, and setCaching to setNumberOfRowsFetchSize. Judging by the amount of confusion on Stack Overflow, I think it’s a good idea to name these methods differently.