There are only two hard things in computer science: cache invalidation, naming things, and off-by-one errors.
It all started with a problem. I wanted to scan the last n rows of an HBase table using the cbass library, a thin, opinionated wrapper around the hbase-client API. I really like it, and if you’re using Clojure and HBase, I recommend checking it out.
cbass doesn’t have the capability to scan n rows, so I decided to add that functionality. In order to do that, I had to use the underlying hbase-client API. Two ways to limit the results of a scan with the hbase-client API are 1) start/stop rows and 2) time ranges. In my case I wasn’t using either, so it wasn’t immediately obvious how to limit the output to just n results.
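For reference, here’s roughly what those two approaches look like through Clojure interop with the hbase-client Scan API (a minimal sketch; the row keys and timestamps are made-up values, just for illustration):

```clojure
(import '[org.apache.hadoop.hbase.client Scan]
        '[org.apache.hadoop.hbase.util Bytes])

;; 1) bound the scan by start/stop row keys
(def scan-by-rows
  (doto (Scan.)
    (.setStartRow (Bytes/toBytes "row-00100"))    ;; inclusive
    (.setStopRow  (Bytes/toBytes "row-00200"))))  ;; exclusive

;; 2) bound the scan by a time range (epoch millis, [min, max))
(def scan-by-time
  (doto (Scan.)
    (.setTimeRange 1450000000000 1450100000000)))
```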
At first I thought I could just wrap the ResultScanner iterator-seq in a take. That seemed to work on small tables. However, on a large table (defined here as anything over a few thousand rows), the scan would blow up. It turned out that, in order to minimize the number of network calls, a scanner by default will try to get all the matching entries and shove them into a ResultScanner iterator. For this use case there isn’t a limiting criterion on the scanner, so a scan will try to “fetch” the entire table into memory. That’s one large ResultScanner object! No good. What to do?
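The naive attempt looked something like this (a sketch only; scan-first-n and the connection plumbing are hypothetical, not part of cbass):

```clojure
(import '[org.apache.hadoop.hbase.client ConnectionFactory Scan]
        '[org.apache.hadoop.hbase HBaseConfiguration TableName])

(defn scan-first-n
  "Naive attempt: take n results off the scanner's seq.
   The lazy seq is bounded, but the scanner underneath still
   tries to fetch far more than n rows from the region servers."
  [table-name n]
  (with-open [conn    (ConnectionFactory/createConnection (HBaseConfiguration/create))
              table   (.getTable conn (TableName/valueOf table-name))
              scanner (.getScanner table (Scan.))]
    (doall (take n (iterator-seq (.iterator scanner))))))
```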
What I really wanted was a bounded ResultScanner iterator of size n. Using take, I was able to create a bounding iterator of size n, but the underlying iterator ended up table scanning and being much larger than n. Since take isn’t enough, and I wanted to limit the number of rows coming back from HBase, I needed something on the scanner itself to make it stop fetching everything.
Looking at the API names for Scan, you’d assume that setMaxResultSize(long maxResultSize) was what you wanted. But when you read the javadoc more carefully, you realize that it has nothing to do with the number of results coming back from a scan. Ok, maybe it’s setBatch(int batch). Nope, batching in this case is meant for rows that have very many column qualifiers: ‘batch’ means return N of M column qualifiers on the .next() call of a ResultScanner, where M is the total number of column qualifiers in a row and N<=M. Turns out the setting I was looking for is setCaching(int caching). Here’s the description:
Set the number of rows for caching that will be passed to scanners. If not set, the Configuration setting HConstants.HBASE_CLIENT_SCANNER_CACHING will apply. Higher caching values will enable faster scanners but will use more memory.
The description and the name are completely at odds with each other. A better name, in my opinion, would be setFetchSize, setBatchSize, or really anything that doesn’t include the word “cache”. The name “cache” is so confusing that both tolitius and I had to dig into the source code of hbase-client just to convince ourselves it wasn’t actually a cache.
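Knowing that, bounding the scan comes down to setting the caching (really: the fetch size) to n, and still using take to cap the realized results (again a sketch; scan-first-n is a hypothetical helper, not the cbass API):

```clojure
(import '[org.apache.hadoop.hbase.client ConnectionFactory Scan]
        '[org.apache.hadoop.hbase HBaseConfiguration TableName])

(defn scan-first-n
  "Bounded scan: setCaching limits how many rows the scanner
   fetches per trip to the region server, and take caps the
   number of results we realize."
  [table-name n]
  (let [scan (doto (Scan.)
               (.setCaching (int n)))]  ;; despite the name, this is the fetch size
    (with-open [conn    (ConnectionFactory/createConnection (HBaseConfiguration/create))
                table   (.getTable conn (TableName/valueOf table-name))
                scanner (.getScanner table scan)]
      (doall (take n (iterator-seq (.iterator scanner)))))))
```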
I would rename setMaxResultSize to setMaxResultByteSize, setBatch to setColumnsChunkSize, and setCaching to setNumberOfRowsFetchSize. Judging by the amount of confusion on Stack Overflow, I think it’s a good idea to name these methods differently.