There are only two hard things in computer science: cache invalidation, naming things, and off-by-one errors.
It all started with a problem. I wanted to scan the last n rows of an HBase table using the cbass library, a thin, opinionated wrapper around the hbase-client API. I really like it, and if you’re using Clojure and HBase, I recommend checking it out.
cbass doesn’t have the capability to scan n rows, so I decided to add that functionality. To do that, I had to use the underlying hbase-client API. Two ways to limit the results of a scan with the hbase-client API are 1) start/stop rows and 2) time ranges. In my case I wasn’t using either, so it wasn’t immediately obvious how to limit the output to just n results.
At first I thought I could just wrap the ResultScanner’s iterator-seq in a take. That seemed to work on small tables. However, on a large table (defined here as anything over a few thousand rows), the scan would blow up. It turned out that, in order to minimize the number of network calls, a scanner by default will try to get all the matching entries and shove them into a ResultScanner iterator. For this use case, there isn’t a limiting criterion on the scanner, so a scan will try to “fetch” the entire table into memory. That’s one large ResultScanner object! No good. What to do?
What I really wanted was a bounded ResultScanner iterator of size n. Using ‘take’, I was able to create a bounded view of size n, but the underlying iterator still ended up scanning the whole table and holding far more than n rows. Since ‘take’ isn’t enough, and I wanted to limit the number of rows coming back from HBase, I needed something on the scanner itself to make it stop fetching everything.
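To make the problem concrete, here is a minimal Java sketch of why a ‘take’ on the consumer side does not bound what the producer has already pulled in. It is a toy model under stated assumptions, not hbase-client code: `EagerScannerModel`, `rowsBuffered` and the static `take` helper are hypothetical names invented for illustration, and rows are plain integers.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Toy stand-in for a scanner that buffers every matching row before
// iteration starts. It models the behaviour described in the text;
// it is NOT the hbase-client API.
class EagerScannerModel implements Iterable<Integer> {
    private final List<Integer> buffer = new ArrayList<>();

    EagerScannerModel(int tableRows) {
        // No limiting criterion, so every row is pulled into memory up front.
        for (int row = 0; row < tableRows; row++) {
            buffer.add(row);
        }
    }

    // How many rows the scanner is holding, regardless of how many are read.
    int rowsBuffered() {
        return buffer.size();
    }

    @Override
    public Iterator<Integer> iterator() {
        return buffer.iterator();
    }

    // Roughly what a take over an iterator-seq does: stop reading after n items.
    static List<Integer> take(int n, Iterable<Integer> rows) {
        List<Integer> out = new ArrayList<>();
        Iterator<Integer> it = rows.iterator();
        while (out.size() < n && it.hasNext()) {
            out.add(it.next());
        }
        return out;
    }
}
```

With a 5000-row model table, `EagerScannerModel.take(3, scanner)` returns three rows while `scanner.rowsBuffered()` still reports 5000. The limit applies to what is read, not to what was fetched.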
Looking at the method names on Scan, you’d assume that setMaxResultSize(long maxResultSize) was what you wanted. But when you read the javadoc more carefully, you realize that it has nothing to do with the _number_ of results coming back from a scan; it caps the result size in bytes. OK, maybe it’s setBatch(int batch). Nope, batching in this case is meant for rows that have very many column qualifiers, so ‘batch’ means: return N of M column qualifiers on each ‘.next()’ call of a ResultScanner, where M is the total number of column qualifiers in a row and N<=M. It turns out the setting I was looking for is setCaching(int caching). Here’s the description:
Set the number of rows for caching that will be passed to scanners. If not set, the Configuration setting HConstants.HBASE_CLIENT_SCANNER_CACHING will apply. Higher caching values will enable faster scanners but will use more memory.
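The trade-off in that description (fewer round trips versus more memory) can be sketched with another toy model in Java. Everything here is an assumption made for illustration: `ChunkedScannerModel` is a hypothetical class, `fetchSize` plays the role the post attributes to setCaching, and one buffer refill stands in for one network call. None of it is HBase code.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model of a scanner that refills its client-side buffer one "RPC" at a time.
// fetchSize is a stand-in for the rows-per-fetch setting; this is NOT HBase code.
class ChunkedScannerModel {
    private final int tableRows;
    private final int fetchSize;
    private final Deque<Integer> buffer = new ArrayDeque<>();
    private int nextRow = 0;
    private int rpcCalls = 0;
    private int rowsFetched = 0;

    ChunkedScannerModel(int tableRows, int fetchSize) {
        if (fetchSize < 1) {
            throw new IllegalArgumentException("fetchSize must be >= 1");
        }
        this.tableRows = tableRows;
        this.fetchSize = fetchSize;
    }

    // Returns the next row, or null once the table is exhausted.
    Integer next() {
        if (buffer.isEmpty() && nextRow < tableRows) {
            rpcCalls++; // one simulated round trip per refill
            // long arithmetic so a huge fetchSize cannot overflow
            int end = (int) Math.min((long) tableRows, (long) nextRow + fetchSize);
            while (nextRow < end) {
                buffer.add(nextRow++);
                rowsFetched++;
            }
        }
        return buffer.poll();
    }

    // Reads up to n rows and returns how many were actually read.
    int take(int n) {
        int read = 0;
        while (read < n && next() != null) {
            read++;
        }
        return read;
    }

    int rpcCalls() { return rpcCalls; }

    int rowsFetched() { return rowsFetched; }
}
```

Reading 10 rows from a 5000-row model table costs 1 refill and 10 fetched rows with a fetch size of 10, 10 refills and 10 fetched rows with a fetch size of 1, and 1 refill but all 5000 rows with a fetch size of `Integer.MAX_VALUE`. In this model, matching the fetch size to n is what keeps both numbers small.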
The description and the name are completely at odds with each other. A better name, in my opinion, would be ‘setFetchSize’, ‘setBatchSize’ or really anything else that doesn’t include the word “cache”.
I would rename setMaxResultSize to ‘setMaxResultByteSize’, setBatch to ‘setColumnsChunkSize’ and setCaching to ‘setNumberOfRowsFetchSize’. Judging by the amount of confusion on Stack Overflow, I think it’s a good idea to name these methods differently.