Dynamic Languages and Types

Many proponents of static typing, including many of the commenters who responded to this post, strongly believe that its presence in a language directly translates into correctness, even though there is very little empirical evidence to support this assertion. More importantly, the assumption is that the safety guarantees provided by static typing can’t be provided via any other mechanism.

To me, and many others who work with statically-typed languages, the biggest benefit of types is auto documentation.

Here’s an example. Let’s suppose the developer is encountering this code, in a generic statically-typed language, for the first time.

It’s not clear what generateAccount actually does, but we know it takes an Event and returns an Account. In an IDE, it’s easy to click through to Event or Account and see what values they describe to get a sense of what’s happening. If Account and Event are immutable, we have a pretty good idea of what to expect. The second guarantee is that generateAccount can’t take Potato and return Tomato where Potato and Tomato have no relation to Event and Account. That’s simply not allowed. We pass in Event and get back an Account, and that’s it. Sure, it might throw an exception, but we have an approximate understanding of the function, and we can move on to other code.
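As a minimal sketch of the kind of signature being described, here is a Java version. Only the names generateAccount, Event and Account come from the text; the fields and the function body are hypothetical and exist only so the example compiles.

```java
import java.util.Objects;

// Hypothetical types: the fields below are illustrative assumptions only.
final class AccountExample {

    /** An immutable input value; its shape can be read without running anything. */
    static final class Event {
        final String userId;
        final long amountInCents;

        Event(String userId, long amountInCents) {
            this.userId = Objects.requireNonNull(userId, "userId");
            this.amountInCents = amountInCents;
        }
    }

    /** An immutable result value. */
    static final class Account {
        final String ownerId;
        final long balanceInCents;

        Account(String ownerId, long balanceInCents) {
            this.ownerId = Objects.requireNonNull(ownerId, "ownerId");
            this.balanceInCents = balanceInCents;
        }
    }

    // The signature alone says: Event in, Account out.
    // Passing an unrelated Potato type would be rejected by the compiler.
    static Account generateAccount(Event e) {
        return new Account(e.userId, e.amountInCents);
    }

    public static void main(String[] args) {
        Account account = generateAccount(new Event("user-1", 2_500));
        System.out.println(account.ownerId + " has " + account.balanceInCents + " cents");
    }
}
```

A reader who has never seen the body of generateAccount still learns its inputs and outputs from the first line of the method.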

In a dynamic language, the same function might look something like this:

It’s not at all obvious what ‘e‘ is and what generateAccount returns, and we’ll need to get into the internals of the function to figure it out. That’s a lot more effort than simply reading the types by sight in the previous example. And what if there are 30 of these functions? And if we do finally figure out what generateAccount does and write a comment above it describing the types informally, we have to hope that the comment stays in sync with the code. All of this function “metadata” is maintained by the compiler in the statically defined generateAccount.

Every developer that I’ve talked to who favors static typing agrees this is the crux of the issue. It’s not monads, or linear types or even type inference that people want. What’s essential is being able to look at a function and figure out the inputs and outputs to get an approximate understanding without having to dig into the internals.

I believe this situation causes some people to prefer statically-typed languages, because the dynamic languages they are used to are not powerful enough, and don’t have maintainers who were sufficiently thoughtful to provide this feature out of the box. My two favorite languages, Clojure and Racket, both solve this problem in a powerful and compelling way. I am going to show off a little bit of Clojure’s way of addressing this problem. Luckily for me, someone already thought of a decent example. Here’s the challenge, recreated:

The JSON data is in the following format

In JSON, keys aren’t ordered and must be strings.

The goal is to parse the JSON file, order everything correctly (which means parsing the keys into integers), and produce an array that looks like this:

The data structure can be a tuple or some new type.

Here’s the Clojure code to do the challenge:

This code is pretty straightforward, but I do want to call attention to the “s/def” spec section. A ::bible is a map whose keys must exist in book-order. The value of bible is chapter, which is a map whose keys are numeric strings. The values of chapter are maps named verse whose keys are again numeric strings and whose values are nonempty strings. All of this is done declaratively. Spec not only validates data, it can also transform it. In this case, we can ‘conform’ the keys to the spec, which means that once we validate the bible map, the numeric keys will be converted into integers.

The upside of spec is that now we have robust schema validation. The program will abort early if the data passed to it violates the spec, so we don’t have to do error handling in the program’s functions, because the data is in good shape. We can just concentrate on transforming it. An even bigger upside is that now it’s easy to tell what kind of data goes in: just follow the definition of ::bible. What data comes out? Follow ::sorted-bible.

Best of all, the errors are data! Here’s a corrupt piece of data being tested against the spec, which it violates. The name “Ge” is not in the set of valid books. Since the error is data, we can interrogate it.

Here are some more cases of bad data.

With just a few lines of code, we get all of the following:

1) The ability to verify that data is in the shape and form the program expects cuts down on error checking
2) Some free data parsing, in this case the string to int conversion
3) A robust system for error messages that can be customized, because it’s just data
4) Spec allows powerful predicates like ‘(set book-order)’ and ‘numeric-str’ which are not easy to do in a conventional typed language. Often types provide limited guarantees like “this key is a String” and not the fact that it’s numeric or nonempty. These are exactly the type of invariants we want but rarely get out of types.
5) Specs can be composed and reused in arbitrary ways
6) The ability to use spec to generate test data, or just to get an example of what kind of data conforms to the spec.

For example:

7) Since specs live in a registry, we can query the registry for specs that are defined in our program programmatically in the REPL. This is another aspect that’s desirable in a large code base, because it enables reuse.
8) We don’t have to know what the specs look like at the beginning of development. It’s possible to incrementally build up a program in the REPL and, once we’re happy with the code, to define the specs. Even this is completely optional. Personally, I find that I like writing specs upfront to conform the input to my code and, once I figure out the flow of my code, writing specs again afterwards.
9) This is a subtle point, but it’s important to realize that spec supports the creation of open systems that are amenable to change and end up less brittle. The following spec

validates that a map contains two keys, ::a and ::b. If there are other keys in the map that don’t have specs defined, they are left alone. If there’s no spec for ::a, the only thing validated is that a key ::a is present. When requirements change, spec allows for a smooth progression to new requirements. We don’t have to make all of our decisions upfront and our code can still operate.

There are many more uses for specs. I recommend following the excellent tutorial here to learn more.

I don’t believe that static typing is uniquely qualified to give programmers safety guarantees or better readability. The GitHub link for the challenge provides a Haskell solution alongside a Clojure one. Since Haskell is treated as the current poster child for statically typed languages, it would be great for somebody to provide a comparable Haskell solution that gives similar guarantees.

The Beauty of Clojure

This post is biased but I happen to like and agree with my biases so take everything written below with a grain of salt.

Clojure is one of my favorite programming languages because of its philosophy of handling state, functional programming, immutable data structures and of course macros.

However, after using component for a project at work, I noticed that my code stopped looking like idiomatic Clojure code and started looking more like the OO Java I used to write. While features like reify, defprotocol, deftype, and defrecord exist, they exist for the purposes of interop with Java, type extensions and library APIs. In my opinion the bulk of Clojure code should strive to utilize functions and be data oriented.

Clearly, with around 1,000 stars on GitHub, many people find component useful, but its object-oriented paradigm feels unnatural and at odds with the way Clojure is written. The rising popularity of component alarms me because looking at some of the code I and others have produced leaves little room for idiomatic Clojure.

Today I ran across a great blog post by Christopher Bui that reminded me of why I avoid component, opting for mount instead. The best part of it is that it included code, which enables me to rant by writing code, my favorite kind of ranting.

As an exercise I decided to rewrite Christopher’s component code using mount and I am quite happy with the results.

Here’s the description of the original task:

Let’s say you’re planning on building a new service for managing emails in your application. The work involved is taking messages off of a queue, doing some work, writing data to a database, and then putting things onto other queues. Everything will be asynchronous.

My Clojure code using component looks very similar to the code written by Christopher because protocols and records end up being at the forefront when they should be de-emphasized, as they are in my mount example. Functions, which are in the background when using component, are featured in the mount code below.

I have the full example shipper project on GitHub. It models a warehouse system that:

  1. reads order numbers that are ready to ship off of a warehouse queue
  2. sends out email notifications to customers
  3. writes order status changes and emails to DB
  4. and then sends notifications to postal to start a shipping process

Below is all the code on one page with namespace declarations removed. A real runnable version is available in the GitHub repo above.

Outside of ‘defstate’s which work like regular ‘def’ variables everything is a function. In my opinion, the above looks more idiomatic and I find it easier to read than the componentized version. In my experience mount’ed code ends up being shorter as a bonus. A big takeaway from using mount is that you can require defstate variables like you would any other var in a namespace and it just works. Take a look at the repo for examples.

Here’s how to interact with the code in boot/lein REPL session.

In short, I encourage everyone to keep Clojure idiomatic and beautiful. Just because your code has state doesn’t mean you have to abandon the way you structure your programs.

TDD extremists

The other day I came across a particularly abusive post about TDD. Here’s a quote:

If you don’t use TDD in your project you are either lazy or you simply don’t know how TDD works. Excuses about lack of time don’t apply here.

I’ve been seeing this type of attitude for a while now, but this one had some code attached, so I decided to rant about it. First of all, tests have costs. Maybe the trade-off is worth it, but it’s important to actually acknowledge that there is a trade-off in the first place.

1) The limited time you spend structuring your workflow around tests could be better spent on the architecture and design of the overall code base.
2) In a poorly architected code base, tests give a false sense of security.
3) Most interesting use cases that you actually want to test are not amenable to testing.
4) Tests mean you have extra code that you need to maintain.
5) Writing tests manually is not particularly efficient.

The purpose of good design is to decrease the surface area of possible mistakes by construction of the architecture. What does this mean in practice?

Here’s the original class from the post:

1 & 2 & 3) Architecture/Design

There are quite a few things here right away that I think could be improved. First, this class is mutable. Someone coming from a functional language or having read “Effective Java” or “Java Concurrency in Practice” will pick up on this right away. Immutability is often a design choice. Do you want to write tests that try to verify the behavior of this class in a concurrent environment? Do you really want to accept a double precision float for money? An arguably better approach is to structure this class so that these problems don’t occur in the first place. Here’s a rewritten version:

Is immutability harder in a language like Java than in Clojure or Scala or Kotlin? Yes, but it sure saves a lot of work in the amount of tests you have to write. I don’t have to worry about this class not doing the right thing with money because of weird rounding issues with floats. If you want to learn about floating point arithmetic, go for it. I prefer to use Joda Money or, in this case, for the sake of an example, BigRational. You don’t have to worry about someone else extending this class and breaking encapsulation. There is no need to test whether synchronization is done correctly. All of these things are extremely hard to unit test, to the point where almost nobody does it. However, if you think about the architecture of your code before writing tests, perhaps you can avoid writing these tests in the first place. I’d argue the easiest things to test are quite simple and don’t end up saving much time. The really hard bugs are in the latter category, and it’s best to get rid of them via good architecture that doesn’t allow them to exist, rather than via dozens of unit tests that try to achieve the same thing.
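The design points above (a final class, final fields, no setters, no double for money) can be sketched roughly as follows. This is not the class from the original post: the class name and fields are hypothetical, and java.math.BigDecimal from the standard library stands in for BigRational or Joda Money so that the example needs no extra dependency.

```java
import java.math.BigDecimal;
import java.util.Objects;

// Hypothetical class; BigDecimal is a stand-in for BigRational / Joda Money.
final class ImmutableAccount {          // final: nobody can subclass it and break encapsulation
    private final String owner;         // final fields: safe to share between threads
    private final BigDecimal balance;   // exact decimal arithmetic, no binary-float rounding

    ImmutableAccount(String owner, BigDecimal balance) {
        this.owner = Objects.requireNonNull(owner, "owner");
        this.balance = Objects.requireNonNull(balance, "balance");
    }

    String owner() { return owner; }

    BigDecimal balance() { return balance; }

    // "Mutation" returns a new value and leaves this one untouched,
    // so there is no shared state to synchronize.
    ImmutableAccount deposit(BigDecimal amount) {
        return new ImmutableAccount(owner, balance.add(amount));
    }

    public static void main(String[] args) {
        ImmutableAccount before = new ImmutableAccount("alice", new BigDecimal("0.10"));
        ImmutableAccount after = before.deposit(new BigDecimal("0.20"));
        System.out.println(before.balance()); // prints 0.10 (unchanged)
        System.out.println(after.balance());  // prints 0.30
        System.out.println(0.1 + 0.2);        // prints 0.30000000000000004 with doubles
    }
}
```

With this shape there is nothing to test about concurrent mutation or subclass misbehaviour, because neither can happen.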

4) Unit tests are things you have to maintain. I think most people with large code bases have experienced this. I have a 2 line change in the code that causes 20 tests to be changed. It becomes very difficult to change the architecture in a large codebase because of all the tests but the tests don’t necessarily improve or imply good architecture and design. It could be argued that tests failing when code is changed is a good thing but it’s worth acknowledging the costs associated with that kind of a workflow. In the above example I hope it’s clear that even in a simple “POJO” class where there isn’t much going on design is important.

5) It’s difficult to test interactions via unit tests, even in simple cases. Wouldn’t it be better to have your code come up with hundreds or even thousands of tests for you? With the invention of QuickCheck and the porting of it to most mainstream languages, it’s now possible. This can drastically decrease the number of unit tests you have to write in the first place, and I find it more convenient to write those types of tests after the code is designed and written rather than writing tests first, before I have a design.

The worst part for me is that the author is probably aware of all of this since he writes Scala. The rewritten class is nothing more than a verbose version of Scala’s case class or Kotlin’s data class.
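To make point 5 a bit more concrete, here is a hand-rolled sketch of the generated-tests idea using only java.util.Random. It is not a QuickCheck port and uses no testing library; the deposit function, the generator and the property are all hypothetical and chosen only for illustration.

```java
import java.math.BigDecimal;
import java.util.Random;

// A hand-rolled property check, not a real QuickCheck port; all names are hypothetical.
final class DepositProperty {

    // Code under test: adding money with exact decimal arithmetic.
    static BigDecimal deposit(BigDecimal balance, BigDecimal amount) {
        return balance.add(amount);
    }

    // Generator: a random amount between 0.00 and 9999.99.
    static BigDecimal randomAmount(Random random) {
        return BigDecimal.valueOf(random.nextInt(1_000_000), 2);
    }

    /** Checks the property on `runs` generated inputs and returns how many were checked. */
    static int check(long seed, int runs) {
        Random random = new Random(seed); // fixed seed so a failure can be reproduced
        for (int i = 0; i < runs; i++) {
            BigDecimal balance = randomAmount(random);
            BigDecimal a = randomAmount(random);
            BigDecimal b = randomAmount(random);
            // Property: the order of two deposits does not change the final balance.
            BigDecimal ab = deposit(deposit(balance, a), b);
            BigDecimal ba = deposit(deposit(balance, b), a);
            if (ab.compareTo(ba) != 0) {
                throw new AssertionError("failed for " + balance + ", " + a + ", " + b);
            }
        }
        return runs;
    }

    public static void main(String[] args) {
        System.out.println("checked " + check(42L, 1_000) + " generated cases");
    }
}
```

One stated property replaces a thousand hand-written cases. A real library adds shrinking of failing inputs and richer generators, which this sketch leaves out.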

HBase client’s weird API names

There are only two hard things in computer science: cache invalidation, naming things and off by one errors

It all started with a problem.  I wanted to scan the last n rows of an HBase table using the cbass library, a thin, opinionated wrapper around the hbase-client API. I really like it, and if you’re using Clojure and HBase, I recommend checking it out.

cbass doesn’t have the capability to scan n rows, so I decided to add that functionality. In order to do that I had to use the underlying hbase-client API. Two ways to limit the results of a scan using the hbase-client API are 1) start/stop rows or 2) time ranges. In my case I wasn’t using either, so it wasn’t immediately obvious how to limit the output to just n results.

At first I thought I could just wrap the ResultScanner iterator-seq in a take. That seemed to work on small tables. However, on a large table (defined here as anything over a few thousand rows), the scan would blow up. It turned out that in order to minimize the number of network calls, a scanner by default will try to get all the matching entries and shove them into a ResultScanner iterator. For this use case, there isn’t a limiting criterion on the scanner, so a scan will try to “fetch” the entire table into memory. That’s one large ResultScanner object! No good. What to do?

What I really wanted is a bounded ResultScanner iterator of size n. Using ‘take’, I was able to create a bounding iterator of size n, but the underlying iterator ended up table scanning and being much larger than n. Since ‘take’ isn’t enough, and I wanted to limit the number of rows coming back from HBase, I needed something on the scanner itself to make it stop fetching everything.

Looking at the API names for Scan, you’d assume that setMaxResultSize(long maxResultSize) was what you wanted. But when you read the javadoc more carefully, you realize that it has nothing to do with the _number_ of results coming back from a scan. Ok, maybe it’s setBatch(int batch). Nope, batching in this case is meant for rows that have very many column qualifiers, so in this case ‘batch’ means return N of M column qualifiers on the ‘.next()’ call of a ResultScanner, where M is the total number of column qualifiers of a row and N<=M. It turns out the setting I was looking for is setCaching(int caching). Here's the description:

Set the number of rows for caching that will be passed to scanners. If not set, the Configuration setting HConstants.HBASE_CLIENT_SCANNER_CACHING will apply. Higher caching values will enable faster scanners but will use more memory.

The description and the name are completely at odds with each other. A better name in my opinion would be ‘setFetchSize’, ‘setBatchSize’ or really any other thing that doesn’t include the word “cache”.

The name “cache” is so confusing that both tolitius and I had to dig in the source code of hbase-client just to convince ourselves it wasn’t actually a cache.

I would rename: setMaxResultSize to ‘setMaxResultByteSize’, setBatch to ‘setColumnsChunkSize’ and setCaching to ‘setNumberOfRowsFetchSize’. Judging by the amount of confusion on Stackoverflow I think it’s a good idea to name these methods differently.

Dangers of unit testing undefined behavior

Recently I participated in an intracompany discussion about a concurrency defect. A piece of code looked something like this:



The problem? ‘keepRunning’ is a plain boolean without any synchronization. According to the JMM, a whole slew of things could go wrong with the above code, one of which is for Thread1 to run forever because it never sees Thread2’s update to ‘keepRunning’.
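One conventional way to repair this kind of flag is to declare it volatile, which gives the write and the later read a happens-before relationship under the JMM. The sketch below is an assumption about the shape of such code, not the code from the discussion; only the name keepRunning comes from the text.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical class; only the name keepRunning comes from the discussion above.
final class StoppableWorker implements Runnable {
    // volatile: a write by one thread is guaranteed to become visible
    // to subsequent reads of the flag by another thread.
    private volatile boolean keepRunning = true;
    private long iterations = 0; // touched only by the worker thread

    @Override
    public void run() {
        while (keepRunning) {
            iterations++; // stands in for one unit of real work
        }
    }

    void stop() {
        keepRunning = false;
    }

    public static void main(String[] args) throws InterruptedException {
        StoppableWorker worker = new StoppableWorker();
        Thread thread1 = new Thread(worker, "Thread1");
        thread1.start();
        TimeUnit.MILLISECONDS.sleep(50);
        worker.stop();       // the main thread plays the role of Thread2
        thread1.join(5_000); // wait up to five seconds for the loop to exit
        System.out.println("worker stopped: " + !thread1.isAlive());
    }
}
```

Without volatile the JIT is allowed to hoist the read of the flag out of the loop, which is one way the loop can end up spinning forever. A run that happens to terminate proves nothing either way.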

What’s interesting about this is that there was an integration test that inadvertently tested this scenario and had always passed, so the problem was never caught. Once the code started running on a production box with different hardware characteristics (a lot more cores and memory), it blew up.

This is one of those examples where unit testing doesn’t produce good results. It’s very dangerous to get an intuition about incorrect concurrent code by running simple unit tests. These one-off unit tests run for a short period of time on a box that potentially has few cores, the JIT doesn’t kick in, inlining doesn’t happen, the system isn’t under heavy load, and the hardware configuration could be favorable to not surfacing the error. As shown in this post (C and Java examples), just because incorrect concurrent code works the way the author expects doesn’t mean it will continue to work. I’m not sure if that was the author’s intention, but that’s what I got out of it.

This is why I am pessimistic about types and unit tests when it comes to catching interesting errors found in production. Unit tests/types are good for catching obvious things like “this method doesn’t accept arguments of this category” or “what happens when this method gets passed an empty string instead of a string that I expect?”. I have yet to see a language/test framework that can help with concurrency problems.

I know of two partial solutions to the concurrency problem.

1) Try to avoid errors by construction, i.e. have good design that makes doing the wrong thing harder. Immutable data structures by default is a big first step in that direction.

2) Feynman method. Think really hard and write code that doesn’t contain concurrency bugs, if that’s not possible try to convince a friend or co-worker to think very hard with you.

The first method is really just a special case of the Feynman method.

Everyone uses open source, but no one talks about it

About a month ago, I donated some money to OpenBSD because they were in trouble financially. After reading about the response to their call for help, I started wondering why it had to get so bad for a popular open source project before people rallied around them.

I’m a Java developer. Over the course of my career so far, I’ve worked both for Fortune 500 companies and companies that had less than 20 people. I’ve worked for companies that are only 100 years younger than America, and some that were founded only a few years after the tech bubble. The one thing that unites these companies is their use of open source software.

For small companies, it’s usually because they’re too small and cash-poor to afford commercial software like an Oracle licence, so they use MySQL or PostgreSQL. Older companies are usually trying to modernize, and the technical leadership decides that they no longer want to use Websphere/Weblogic, and would rather move to something with less cognitive overhead like Apache Tomcat.
I’ve seen dozens of cases where the only reason a project succeeds in a given time frame is because of easily-available mature open source solutions.

For example, every Java contract I’ve ever worked on has used Apache Commons. Tomcat, Jetty, JBoss, and increasingly Netty are prevalent for application servers and web service construction. Outside of Java, RabbitMQ, ZeroMQ, and ActiveMQ are standard for queueing. PostgreSQL, MySQL, HBase, and Cassandra are common for persistence. In JavaScript, I don’t think there are any proprietary frameworks left worth mentioning.

What I find much rarer is for somebody in a technical leadership position to acknowledge the huge role that open source plays in the success of these projects, and not just to acknowledge the platform, but to do the right thing and donate some money to these projects. If even 1% of the companies that use these projects were to donate 1% of the costs these projects save them, I’m confident there would be no funding issues for any of these projects.

These projects don’t exist for the money, and the developers who work on them aren’t doing it to become rich. But a bit more goodwill from decision makers in businesses that rely on these projects strikes me as the right thing to do.

Karatsuba Multiplication

Everyone knows how to multiply large numbers. But there is a faster way. What’s interesting is that it was only discovered as recently as the 1960s, by Karatsuba. It makes me wonder how many other things are right under our noses that millions (billions?) of people know about over the course of centuries but nobody has thought of a better way… yet.

I implemented the algorithm below in C. The key idea is that it’s possible to save a multiplication, at the cost of a few extra additions and subtractions, by rearranging the arithmetic.
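As an illustration of that key idea, here is a sketch in Java rather than C. It splits each number into a high and a low half and uses three recursive multiplications where the schoolbook method would use four. The class name and the cutoff are arbitrary choices, and below the cutoff it simply falls back to BigInteger.multiply.

```java
import java.math.BigInteger;

final class Karatsuba {
    // Below this size we fall back to the built-in multiply (arbitrary cutoff).
    private static final int CUTOFF_BITS = 64;

    static BigInteger multiply(BigInteger x, BigInteger y) {
        // Work on magnitudes and restore the sign afterwards.
        if (x.signum() < 0 || y.signum() < 0) {
            BigInteger product = multiply(x.abs(), y.abs());
            return x.signum() * y.signum() < 0 ? product.negate() : product;
        }

        int n = Math.max(x.bitLength(), y.bitLength());
        if (n <= CUTOFF_BITS) {
            return x.multiply(y);
        }

        // Split: x = x1 * 2^half + x0 and y = y1 * 2^half + y0.
        int half = n / 2;
        BigInteger mask = BigInteger.ONE.shiftLeft(half).subtract(BigInteger.ONE);
        BigInteger x1 = x.shiftRight(half);
        BigInteger x0 = x.and(mask);
        BigInteger y1 = y.shiftRight(half);
        BigInteger y0 = y.and(mask);

        // Three multiplications instead of four.
        BigInteger z2 = multiply(x1, y1);
        BigInteger z0 = multiply(x0, y0);
        // (x1 + x0)(y1 + y0) - z2 - z0 equals x1*y0 + x0*y1, the middle term.
        BigInteger z1 = multiply(x1.add(x0), y1.add(y0)).subtract(z2).subtract(z0);

        return z2.shiftLeft(2 * half).add(z1.shiftLeft(half)).add(z0);
    }

    public static void main(String[] args) {
        BigInteger a = new BigInteger("123456789012345678901234567890123456789");
        BigInteger b = new BigInteger("987654321098765432109876543210987654321");
        System.out.println(multiply(a, b).equals(a.multiply(b)));
    }
}
```

The same trick in base 10 on small numbers: for 12 × 34, compute 1 × 3 = 3, 2 × 4 = 8, and (1 + 2) × (3 + 4) − 3 − 8 = 10, giving 3 × 100 + 10 × 10 + 8 = 408.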

In defense of PHP

This past weekend, I took a quick overnight trip to NYC to attend Code Montage’s Coder Day of Service in Manhattan. It’s a great event where people donate their technical skills and time to help others who don’t have technical skills but have needs in the nonprofit realm. I worked with a woman who wanted to build a website to bring awareness to a cause she was passionate about.

She wanted the website to include posts with links to resources around the issue and testimonials from others involved with the same issue to build a community. It was great working with her and helping her address her needs.

What I found the most interesting from a technical standpoint was some of the previous advice she received about how to go about building such a website. Someone else at the hackathon had previously recommended that she look into Victory-Kit, a static website generation project which hasn’t seen any pull requests or other activity for over four months. It’s generally a bad idea to build any kind of website for end-users based on a dormant project: abandoned code means numerous unpatched bugs and undeveloped features. These sites are also really hard to support once the hackathon is over and the end user has to go to someone else for code maintenance.

Another person at the table recommended we do it using Jekyll with Chef. Chef’s a really great option for programmers because it allows maximum flexibility and control, but it’s only great for people who really know what they’re doing and have the time to devote to development. This was not a great idea for a non-technical user working on non-profit projects with tons of other things on her plate.

Interestingly enough, although it was the obvious solution nobody thought to use WordPress. It just works, it’s a mature technology, it has a vibrant community around it, tons of plugins, many free themes and many more premium themes that end-users can customize fairly easily without any programming knowledge.

There is a distaste for any PHP-based technology in the web development community and I suspect I know why. A lot of blogs and comment boards disparage PHP, recommending Ruby, Python, NodeJS or anything else, as long as it’s not PHP. I wonder if most of these haters formed their opinion of PHP a decade ago and haven’t looked at it since.

I bet most critics have no idea what modern PHP or PHP frameworks look like. I am glad the PHP community largely ignores the hate and continues to produce excellent code that just works.

Radiation in 5 minutes

Every time I talk to family or friends, or read a news article about Fukushima, Chernobyl, or nuclear energy, I run into a lot of ignorance about what radiation actually means. What surprised me when I did some research on the topic a few years back is that you don’t need a lot of sophisticated mathematics to understand the risks. You need at most elementary-level mathematics and some basic facts.

Two questions to ask about any radiation news story:

What type of radiation is it? (alpha, beta, gamma)

Alpha is only deadly if ingested or inhaled. Beta doesn’t travel far and won’t penetrate thick clothing. Gamma penetrates deep tissue and can cause cancer.

How much radiation? (rems, sieverts, or grays)

The risk of dying from cancer is about 20% before any of the radiation exposures listed below. That means 1 out of 5 people in the world will die from cancer. Usually cancer happens later in life.

LINEAR HYPOTHESIS: model used in radiation protection to quantify radiation exposure and set regulatory limits. It assumes that the long term biological damage caused by ionizing radiation (essentially the cancer risk) is directly proportional to the dose.

The linear hypothesis appears to be biased toward a pessimistic view of the effects of low radiation exposure on humans.

Measuring Radiation: milliSievert

0.1 uSv = Eating a banana
7 uSv = Dental X-Ray
3.1 mSv/year = US from all sources including medical
Fukushima 1.42 uSv/hr for 2 weeks (336 hours) = Additional 0.48 mSv

2.0 Sieverts = 10% increase in chance of cancer
1.0 Sieverts = 5% increase in chance of cancer
0.2 Sieverts = 1% increase in chance of cancer
All in addition to the 20% you already have

Additional risk of exposure after two weeks at Fukushima? About a 0.0024% increase in the chance of cancer (0.00048 Sv x 5% per Sv).
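The arithmetic behind the linear hypothesis is small enough to write down. The sketch below assumes the slope implied by the table above (5 percentage points of additional cancer risk per sievert); the class and method names are made up, and the last line of main uses a hypothetical dose rate purely to show the unit conversion.

```java
final class LinearHypothesis {
    // Slope implied by the table above: 1.0 Sv corresponds to +5 percentage points.
    static final double PERCENT_PER_SIEVERT = 5.0;

    /** Converts a steady dose rate in microsieverts per hour into a total dose in sieverts. */
    static double accumulatedDoseSv(double microSvPerHour, double hours) {
        return microSvPerHour * hours / 1_000_000.0;
    }

    /** Additional cancer risk in percentage points, on top of the baseline. */
    static double excessRiskPercent(double doseSv) {
        return PERCENT_PER_SIEVERT * doseSv;
    }

    public static void main(String[] args) {
        // Reproduce the three rows of the table.
        for (double doseSv : new double[] {2.0, 1.0, 0.2}) {
            System.out.println(doseSv + " Sv -> +" + excessRiskPercent(doseSv) + "%");
        }
        // Hypothetical example: 1 uSv/hr sustained for 1000 hours.
        double doseSv = accumulatedDoseSv(1.0, 1000.0);
        System.out.println(doseSv + " Sv -> +" + excessRiskPercent(doseSv) + "%");
    }
}
```

Because the model is a straight line through zero, every dose, however small, maps to some nonzero extra risk. That is exactly the property that makes it look pessimistic at low exposures.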

CLJ-DS is worth checking out

I like Clojure. I really like it. I think it’s the best language available on the JVM. Why would anyone want to use Clojure? One word: Concurrency. Clojure is largely designed to address the needs of dealing with concurrency without resorting to primitive constructs such as locks.

Unfortunately using Clojure on projects isn’t always a feasible option because many projects are locked into the existing Java paradigm technically, culturally and economically. I often end up writing concurrent code in Java so having good data structures that minimize locking is a must. I find that most code I inherit that uses a lot of “synchronized” can often be rewritten to drastically cut down on the number of locks thanks to java.util.concurrent and java.util.concurrent.atomic. The part I always missed was an immutable data structure that could be returned to the calling code. This could be achieved by returning defensive copies of every mutable data structure but there is a slicker way.

CLJ-DS is another solution to the problem. It’s a library of Clojure’s persistent data structures ported back to Java with nice Generic type signatures and convenience static methods.

Here’s a typical example of code I often inherit. All the business logic and business variable names have been removed.

Some obvious problems with this code?

  • Since ‘getElements’ returns a mutable set, there is no guarantee that some code outside of ‘FooManager’ won’t ‘.clear()’ or mutate the returned set any further.
  • This code has subtle differences depending on ‘uuid’ existing in idToSet. When ‘results’ is null there might be an expectation of the empty set to be referenced by ‘idToSet’ just as it is in the non-null case.
  • Once the calling code gets a handle on the synchronized ‘results’ from ‘getElements’, it’s not guaranteed that everything is safe, since ‘addElement’ uses a different lock to write to the set in the non-null case.

There’s a better way using CLJ-DS and Java’s concurrent package:

There are no subtle mutations of state, a lot fewer locks, and by definition less lock contention. I consider the revised version a lot easier to reason about, in no small part because of the CLJ-DS library. PersistentMap and PersistentSet implement java.util.Map and java.util.Set respectively.
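For readers without CLJ-DS on the classpath, the same shape can be approximated with the JDK alone. The sketch below is that approximation, not the CLJ-DS version: it reuses the names FooManager, getElements, addElement and idToSet from the discussion, assumes String elements, and uses the JDK's immutable Set.of and Set.copyOf (Java 10+) in place of PersistentSet.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// JDK-only approximation; CLJ-DS is not used here and the String element type is an assumption.
final class FooManager {
    private final ConcurrentMap<UUID, Set<String>> idToSet = new ConcurrentHashMap<>();

    /** Returns an immutable snapshot; an unknown uuid yields the empty set, never null. */
    Set<String> getElements(UUID uuid) {
        return idToSet.getOrDefault(uuid, Set.of());
    }

    /** Atomically swaps the stored set for a new immutable set that also contains element. */
    void addElement(UUID uuid, String element) {
        idToSet.merge(uuid, Set.of(element), (existing, added) -> {
            Set<String> copy = new HashSet<>(existing);
            copy.addAll(added);
            return Set.copyOf(copy); // callers can never mutate what they are handed
        });
    }

    public static void main(String[] args) {
        FooManager manager = new FooManager();
        UUID id = UUID.randomUUID();
        manager.addElement(id, "a");
        manager.addElement(id, "b");
        System.out.println(manager.getElements(id).size());
    }
}
```

The trade-off is that every add copies the whole set, whereas a persistent set shares structure between versions. That sharing is the main thing CLJ-DS brings beyond this sketch.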