Dynamic Languages and Types

Many proponents of static typing, including many of the commenters who responded to this post, strongly believe that its presence in a language directly translates into correctness, even though there is very little empirical evidence to support this assertion. More importantly, the assumption is that the safety guarantees provided by static typing can’t be provided via any other mechanism.

To me, and to many others who work with statically-typed languages, the biggest benefit of types is auto-documentation.

Here’s an example. Let’s suppose the developer is encountering this code, in a generic statically-typed language, for the first time.
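The code sample itself didn't survive in this copy of the post; a minimal stand-in, in Java, might look like the following (all type and member names here are hypothetical):

```java
// Hypothetical types standing in for the post's missing example.
class Event {
    final String userId;
    Event(String userId) { this.userId = userId; }
}

class Account {
    final String owner;
    Account(String owner) { this.owner = owner; }
}

class Accounts {
    // The signature alone tells the reader: an Event goes in, an Account comes out.
    static Account generateAccount(Event e) {
        return new Account(e.userId);
    }
}
```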

It’s not clear what generateAccount actually does, but we know it takes an Event and returns an Account. In an IDE, it’s easy to click through to Event or Account and see what values they describe to get a sense of what’s happening. If Account and Event are immutable, we have a pretty good idea of what to expect. We also get a guarantee: generateAccount can’t take a Potato and return a Tomato, where Potato and Tomato have no relation to Event and Account. That’s simply not allowed. We pass in an Event and get back an Account, and that’s it. Sure, it might throw an exception, but an approximate understanding of the function exists, and we can move on to other code.

In a dynamic language, the same function might look something like this:
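The dynamic version is also missing from this copy; a hypothetical Clojure counterpart:

```clojure
;; What is e? A map? A record? What comes back? Only the body knows.
(defn generate-account [e]
  {:owner (:user-id e)})
```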

It’s not at all obvious what ‘e‘ is or what generateAccount returns, and we’ll need to dig into the internals of the function to figure it out. That’s a lot more effort than simply reading the types by sight in the previous example. And what if there are 30 of these functions? And if we do finally figure out what generateAccount does and leave a comment above it describing the types informally, we have to hope the comment stays in sync with the code. All of this function “metadata” is maintained by the compiler in the statically-typed generateAccount.

Every developer I’ve talked to who favors static typing agrees that this is the crux of the issue. It’s not monads, linear types, or even type inference that people want. What’s essential is being able to look at a function and figure out its inputs and outputs, to get an approximate understanding without having to dig into the internals.

I believe this situation causes some people to prefer statically-typed languages, because the dynamic languages they are used to are not powerful enough and lack maintainers thoughtful enough to provide this capability out of the box. My two favorite languages, Clojure and Racket, both solve this problem in a powerful and compelling way. I am going to show off a little of Clojure’s way of addressing it. Luckily for me, someone has already thought of a decent example. Here’s the challenge, recreated:

The JSON data is in the following format:
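The sample data block is missing from this copy; based on the description that follows, its shape is roughly book, then chapter-number string, then verse-number string, then verse text (the verse text here is placeholder):

```json
{
  "Genesis": {
    "2": { "1": "placeholder verse text", "2": "placeholder verse text" },
    "1": { "1": "placeholder verse text" }
  },
  "Exodus": {
    "1": { "1": "placeholder verse text" }
  }
}
```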

In JSON, keys aren’t ordered and must be strings.

The goal is to parse the JSON file, order everything correctly (which means parsing the keys into integers), and produce an array that looks like this:
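The target output is also missing from this copy; from the surrounding description, it is something like a vector of [book chapter verse text] tuples, sorted by book order, then chapter, then verse:

```clojure
[["Genesis" 1 1 "placeholder verse text"]
 ["Genesis" 1 2 "placeholder verse text"]
 ["Genesis" 2 1 "placeholder verse text"]
 ["Exodus"  1 1 "placeholder verse text"]]
```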

The data structure can be a tuple or some new type.

Here’s the Clojure code to do the challenge:
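The original solution is gone from this copy of the post; the sketch below reconstructs its shape from the prose that follows (the spec names match the description, book-order is truncated, and the helper names are my own). The :conform-keys option is what turns the numeric string keys into integers during conform:

```clojure
(ns bible.core
  (:require [clojure.spec.alpha :as s]))

(def book-order ["Genesis" "Exodus" "Leviticus"]) ;; truncated; the real list has 66 books

;; conform a numeric string such as "12" into the integer 12
(s/def ::numeric-str
  (s/conformer
    (fn [x]
      (if (and (string? x) (re-matches #"\d+" x))
        (Long/parseLong x)
        ::s/invalid))))

(s/def ::verse   (s/map-of ::numeric-str (s/and string? not-empty) :conform-keys true))
(s/def ::chapter (s/map-of ::numeric-str ::verse :conform-keys true))
(s/def ::bible   (s/map-of (set book-order) ::chapter))

(s/def ::sorted-bible
  (s/coll-of (s/tuple (set book-order) pos-int? pos-int? string?) :kind vector?))

(defn sorted-bible
  "Conform and flatten a bible map into [book chapter verse text] tuples."
  [bible]
  (let [b (s/conform ::bible bible)]
    (if (= b ::s/invalid)
      (throw (ex-info "data does not conform" (s/explain-data ::bible bible)))
      (vec (for [book book-order
                 :when (contains? b book)
                 [ch verses] (sort (b book))
                 [v text]    (sort verses)]
             [book ch v text])))))
```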

This code is pretty straightforward, but I do want to call attention to the “s/def” spec section. A ::bible is a map whose keys must exist in book-order and whose values are ::chapter maps; a chapter is a map whose keys are numeric strings and whose values are ::verse maps, whose keys are again numeric strings and whose values are nonempty strings. All of this is done declaratively. Spec not only validates data; it can also transform it. In this case, we can ‘conform’ the map to the spec, which means that once we validate the bible map, the numeric keys will be converted into integers.

The upside of spec is that we now have robust schema validation. The program will abort early if the data passed to it violates the spec, so we don’t have to do error handling inside the program’s functions, because the data is known to be in good shape; we can just concentrate on transforming it. An even bigger upside is that it’s now easy to tell what kind of data goes in: just follow the definition of ::bible. What data comes out? Follow ::sorted-bible.

Best of all, the errors are data! Here’s a corrupt piece of data tested against the spec: the name “Ge” is not in the set of valid books. Since the error is data, we can interrogate it.
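The error output didn’t survive in this copy; with a ::bible spec along the lines described above, interrogating the bad data looks like this:

```clojure
(s/explain-data ::bible {"Ge" {"1" {"1" "In the beginning"}}})
;; returns a map of :clojure.spec.alpha/problems, each problem carrying
;; the failing :val ("Ge"), the :pred that rejected it (the book-order set),
;; and :path/:in locations, all inspectable with ordinary map functions
```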

Here are some more cases of bad data.
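For instance, assuming the ::bible spec described above (numeric-string keys, nonempty verse text):

```clojure
;; chapter key is not a numeric string
(s/valid? ::bible {"Genesis" {"one" {"1" "text"}}})  ;; false

;; verse text is empty
(s/valid? ::bible {"Genesis" {"1" {"1" ""}}})        ;; false
```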

With just a few lines of code, we get all of the following:

1) The ability to verify that data is in the shape the program expects, which cuts down on error checking
2) Some free data parsing, in this case the string to int conversion
3) A robust system for error messages that can be customized, because it’s just data
4) Spec allows powerful predicates like ‘(set book-order)’ and ‘numeric-str’, which are not easy to express in a conventional typed language. Types often provide limited guarantees like “this key is a String” rather than the fact that it’s numeric or nonempty. These are exactly the kind of invariants we want but rarely get out of types.
5) Specs can be composed and reused in arbitrary ways
6) The ability to use spec to generate test data, or just to get an example of the kind of data that conforms to the spec

For example:
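A sketch of what that looks like; s/exercise returns pairs of [generated-value conformed-value]. (Specs built from plain predicates get generators for free, while conformer-based specs like ::numeric-str would need a :gen override.)

```clojure
(require '[clojure.spec.alpha :as s])

(s/def ::nonempty-str (s/and string? not-empty))

(s/exercise ::nonempty-str 3)
;; returns three [sample conformed-sample] pairs of nonempty strings
```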

7) Since specs live in a registry, we can programmatically query the registry in the REPL for the specs defined in our program. This is another aspect that’s desirable in a large code base, because it enables reuse.
8) We don’t have to know what the specs look like when beginning development. It’s possible to incrementally build up a program in the REPL and, once we’re happy with the code, to define the specs. Even this is completely optional. Personally, I like writing specs upfront to conform the input to my code, and then, once I’ve figured out the flow of my code, writing specs again afterwards.
9) This is a subtle point, but it’s important to realize that spec supports the creation of open systems that are amenable to change and end up less brittle. A spec such as (s/keys :req [::a ::b]) validates that a map contains the two keys ::a and ::b; any other keys in the map that don’t have specs defined are left alone. If there’s no spec registered for ::a, the only thing validated is that the key ::a is present. When requirements change, spec allows for a smooth progression to the new requirements. We don’t have to make all of our decisions upfront, and our code can still operate.
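A runnable sketch of this open-map behavior (assuming clojure.spec.alpha aliased as s):

```clojure
(require '[clojure.spec.alpha :as s])

(s/def ::m (s/keys :req [::a ::b]))

;; only the presence of ::a and ::b is checked;
;; ::extra has no registered spec and sails through untouched
(s/valid? ::m {::a 1 ::b 2 ::extra "ignored"})  ;; true

;; a missing required key fails
(s/valid? ::m {::a 1})  ;; false
```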

There are many more uses for specs. I recommend following the excellent tutorial here to learn more.

I don’t believe that static typing is uniquely qualified to give programmers safety guarantees or better readability. The GitHub link for the challenge provides a Haskell solution alongside the Clojure one. Since Haskell is treated as the current poster child for statically-typed languages, it would be great for somebody to provide a comparable Haskell solution that gives similar guarantees.

The Beauty of Clojure

This post is biased but I happen to like and agree with my biases so take everything written below with a grain of salt.

Clojure is one of my favorite programming languages because of its philosophy of handling state, functional programming, immutable data structures and of course macros.

However, after using component for a project at work, I noticed that my code stopped looking like idiomatic Clojure and started looking like the OO Java I used to write. While features like reify, defprotocol, deftype, and defrecord exist, they exist for the purposes of Java interop, type extension, and library APIs. In my opinion, the bulk of Clojure code should strive to utilize functions and be data oriented.

Clearly, with around 1,000 stars on GitHub, many people find component useful, but its object-oriented paradigm feels unnatural and at odds with the way Clojure is written. The rising popularity of component alarms me, because the code that I and others have produced with it leaves little room for idiomatic Clojure.

Today I ran across a great blog post by Christopher Bui that reminded me of why I avoid component, opting for mount instead. The best part is that it included code, which lets me rant by writing code, my favorite kind of ranting.

As an exercise I decided to rewrite Christopher’s component code using mount and I am quite happy with the results.

Here’s the description of the original task:

Let’s say you’re planning on building a new service for managing emails in your application. The work involved is taking messages off of a queue, doing some work, writing data to a database, and then putting things onto other queues. Everything will be asynchronous.

My Clojure code using component looks very similar to the one written by Christopher, because protocols and records end up at the forefront when they should be de-emphasized, as they are in my mount example. Functions, which stay in the background with component, are featured in the mount code below.

I have the full example shipper project on GitHub that models a warehouse system that:

  1. reads order numbers that are ready to ship off of a warehouse queue
  2. sends out email notifications to customers
  3. writes order status changes and emails to DB
  4. and then sends notifications to postal to start a shipping process

Below is all the code on one page with namespace declarations removed. A real runnable version is available in the GitHub repo above.
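The listing itself is missing from this copy; a compressed sketch of the mount style it describes follows (config and the helper functions are hypothetical stand-ins for what lives in the repo):

```clojure
(require '[mount.core :as mount :refer [defstate]])

;; hypothetical stand-ins for the real config and helpers in the repo
(def config {:db-uri "..." :queue "warehouse"})
(defn connect-db [c] {:conn (:db-uri c)})
(defn close-db [d] nil)

;; stateful pieces are defstates; they start and stop like resources
(defstate db
  :start (connect-db config)
  :stop  (close-db db))

;; everything else stays a plain function over plain data
(defn process-order [db order-number]
  {:order order-number :status :shipped})
```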

Outside of the ‘defstate’s, which work like regular ‘def’ variables, everything is a function. In my opinion, the above looks more idiomatic, and I find it easier to read than the componentized version. In my experience, mount’ed code ends up shorter as a bonus. A big takeaway from using mount is that you can require defstate variables like you would any other var in a namespace, and it just works. Take a look at the repo for examples.

Here’s how to interact with the code in a boot/lein REPL session.
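The session itself didn’t survive here; it would be something like the following, using mount’s start/stop API (the shipper namespace names are illustrative and may differ from the repo):

```clojure
(require '[mount.core :as mount])

(mount/start)   ;; start every defstate in dependency order
(mount/stop)    ;; stop them all

;; start or stop selected states only:
(mount/start #'shipper.db/db)
(mount/stop  #'shipper.db/db)
```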

In short, I encourage everyone to keep Clojure idiomatic and beautiful: just because your code has state doesn’t mean you have to abandon the way you structure your programs.

HBase client’s weird API names

There are only two hard things in computer science: cache invalidation, naming things, and off-by-one errors

It all started with a problem. I wanted to scan the last n rows of an HBase table using the cbass library, a thin, opinionated wrapper around the hbase-client API. I really like it, and if you’re using Clojure and HBase, I recommend checking it out.

cbass doesn’t have the capability to scan n rows, so I decided to add that functionality, and to do so I had to use the underlying hbase-client API. Two ways to limit the results of a scan with the hbase-client API are 1) start/stop rows or 2) time ranges. In my case I wasn’t using either, so it wasn’t immediately obvious how to limit the output to just n results.

At first I thought I could just wrap the ResultScanner’s iterator-seq in a take. That seemed to work on small tables. However, on a large table (defined here as anything over a few thousand rows), the scan would blow up. It turns out that, in order to minimize the number of network calls, a scanner by default will try to get all the matching entries and shove them into a ResultScanner iterator. For this use case there isn’t a limiting criterion on the scanner, so a scan will try to “fetch” the entire table into memory. That’s one large ResultScanner object! No good. What to do?

What I really wanted was a bounded ResultScanner iterator of size n. Using ‘take’, I was able to create a bounding iterator of size n, but the underlying iterator still ended up scanning the table and being much larger than n. Since ‘take’ isn’t enough, and I wanted to limit the number of rows coming back from HBase, I needed something on the scanner itself to make it stop fetching everything.

Looking at the API names for Scan, you’d assume that setMaxResultSize(long maxResultSize) is what you want. But when you read the javadoc more carefully, you realize that it has nothing to do with the _number_ of results coming back from a scan. Ok, maybe it’s setBatch(int batch)? Nope, batching here is meant for rows that have very many column qualifiers: ‘batch’ means return N of M column qualifiers on the ‘.next()’ call of a ResultScanner, where M is the total number of column qualifiers in a row and N<=M. It turns out the setting I was looking for is setCaching(int caching). Here's the description:

Set the number of rows for caching that will be passed to scanners. If not set, the Configuration setting HConstants.HBASE_CLIENT_SCANNER_CACHING will apply. Higher caching values will enable faster scanners but will use more memory.

The description and the name are completely at odds with each other. A better name in my opinion would be ‘setFetchSize’, ‘setBatchSize’ or really any other thing that doesn’t include the word “cache”.

The name “cache” is so confusing that both tolitius and I had to dig in the source code of hbase-client just to convince ourselves it wasn’t actually a cache.

I would rename setMaxResultSize to ‘setMaxResultByteSize’, setBatch to ‘setColumnsChunkSize’, and setCaching to ‘setNumberOfRowsFetchSize’. Judging by the amount of confusion on Stack Overflow, I think it’s a good idea to name these methods differently.

CLJ-DS is worth checking out

I like Clojure. I really like it. I think it’s the best language available on the JVM. Why would anyone want to use Clojure? One word: Concurrency. Clojure is largely designed to address the needs of dealing with concurrency without resorting to primitive constructs such as locks.

Unfortunately, using Clojure on projects isn’t always a feasible option, because many projects are locked into the existing Java paradigm technically, culturally, and economically. I often end up writing concurrent code in Java, so having good data structures that minimize locking is a must. I find that most code I inherit that uses a lot of “synchronized” can be rewritten to drastically cut down on the number of locks, thanks to java.util.concurrent and java.util.concurrent.atomic. The part I always missed was an immutable data structure that could be returned to the calling code. This could be achieved by returning defensive copies of every mutable data structure, but there is a slicker way.

CLJ-DS is that slicker way: a library of Clojure’s persistent data structures ported back to Java, with nice generic type signatures and convenient static methods.

Here’s a typical example of code I often inherit. All the business logic and business variable names have been removed.
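The original listing is gone from this copy; the sketch below reconstructs it from the bullet points that follow. The class and member names (FooManager, idToSet, getElements, addElement, results, uuid) come from those bullets; the rest is my reconstruction:

```java
import java.util.*;
import java.util.concurrent.*;

class FooManager {
    private final ConcurrentMap<String, Set<String>> idToSet =
            new ConcurrentHashMap<>();

    // lock #1: the FooManager instance itself
    public synchronized void addElement(String uuid, String element) {
        Set<String> results = idToSet.get(uuid);
        if (results == null) {
            results = Collections.synchronizedSet(new HashSet<>());
            idToSet.put(uuid, results);
        }
        results.add(element);
    }

    // hands a mutable, synchronized set straight to the caller;
    // when uuid is absent, the fresh empty set is NOT stored in idToSet
    public Set<String> getElements(String uuid) {
        Set<String> results = idToSet.get(uuid);
        return results == null
                ? Collections.synchronizedSet(new HashSet<>())
                : results; // lock #2: the returned set's own monitor
    }
}
```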

Some obvious problems with this code?

  • Since ‘getElements’ returns a mutable set, there is no guarantee that some code outside of ‘FooManager’ won’t ‘.clear()’ or otherwise mutate the returned set.
  • This code behaves subtly differently depending on whether ‘uuid’ already exists in idToSet. When ‘results’ is null, the caller might expect the returned empty set to be referenced by ‘idToSet’, just as it is in the non-null case.
  • Once the calling code gets a handle on the synchronized ‘results’ from ‘getElements’, it’s not guaranteed that everything is safe, since ‘addElement’ uses a different lock to write to the set in the non-null case.

There’s a better way using CLJ-DS and Java’s concurrent package:
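The revised listing is also missing from this copy; the sketch below shows the shape of the fix. To keep the snippet runnable with just the JDK, it substitutes unmodifiable copies where the original used CLJ-DS's persistent sets; with CLJ-DS, the copy-on-write step becomes a cheap persistent add instead of a full copy:

```java
import java.util.*;
import java.util.concurrent.*;

class FooManager {
    private final ConcurrentMap<String, Set<String>> idToSet =
            new ConcurrentHashMap<>();

    public void addElement(String uuid, String element) {
        // atomic read-modify-write; no explicit locks in our code
        idToSet.compute(uuid, (k, current) -> {
            Set<String> next = (current == null)
                    ? new HashSet<>()
                    : new HashSet<>(current); // CLJ-DS would share structure here
            next.add(element);
            return Collections.unmodifiableSet(next);
        });
    }

    // callers get an immutable view; no defensive copies needed
    public Set<String> getElements(String uuid) {
        return idToSet.getOrDefault(uuid, Collections.emptySet());
    }
}
```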

No subtle mutations of state, a lot fewer locks, and by definition less lock contention. I consider the revised version a lot easier to reason about, in no small part because of the CLJ-DS library. Its PersistentMap and PersistentSet implement java.util.Map and java.util.Set respectively.