I love conference talks. I believe that YouTube has made me a better programmer over the last 17 years. I’ll often turn one on while I’m doing chores. I’ll learn something and sometimes even be inspired to write about it. Like this one.
The talk I was watching was You’re Doing Exceptions Wrong by Matt Burke. In it, Burke presents some rather compelling advice on how to design your exception throwing and handling to make your code more robust and easier to work with. I quite agree with a lot of what he said, though I think there’s plenty of room for subtleties and context-dependent decisions. You can’t cover everything in a one-hour talk.
What struck me, though, was how easy exceptions are to get wrong. He goes over many examples where poor use of exceptions led to problems. Sometimes people caught exceptions and ignored them. Sometimes they threw the wrong exception. Etc., etc. There was a lot of advice to digest and apply. And you could still get it wrong.
It got me thinking about how many degrees of freedom there are when dealing with exceptions. What class of exception do you choose? Should you throw at all? How much code do you wrap in your try/catch? Do you catch at all? How specific do you make your catch statement’s selector? How do you rethrow? Do you need a finally? All of these are subtle and non-mechanical decisions. That’s a lot of cognitive work going on.
It reminds me a lot of the advice in Java Concurrency in Practice. It’s a great book. The advice is solid. I learned a ton about Java when reading that book. But it’s impossible to apply correctly all the time. It is 432 pages of dos and don’ts. Java’s design defaults to a sequential model of computation—one thread. To switch to a concurrent model, you have to start using a whole set of new conventions. You need to properly use the volatile and synchronized keywords, choose “thread safe” classes, and master a whole set of concurrency primitives. There are too many degrees of freedom.
Clojure solved this by reversing the situation: Make concurrency the default. What does that look like? Immutable data structures and concurrency primitives with simple contracts. You can still do sequential programming for those inner loops where you need it. But it’s more cumbersome. In short, Clojure chose the more general case (multiple threads) and constrained the degrees of freedom by choosing the right defaults.
I wonder if there isn’t something like that for exceptions. What would that look like? It would be about choosing the right, general-case default, reducing the number of decisions that have to be made, and making the few decisions left easy to get right. (Aside: This is something Clojure does particularly well.)
If we assume the default is the general case (every function is broken :), then we
The first thing I would do is to come up with a short list of scenarios that the exception can capture. Here’s a first stab:
IllegalArgumentException: somebody passed me an invalid argument
UnexpectedReturnValueException: I called a function and got something back I wasn’t ready for
NoPossibleAnswerException: I was called with valid arguments but I can’t fulfill the contract
The main idea behind the first two is to distinguish the source of bad values. The ubiquitous NullPointerException tells you that you got a null, but not whether it came from an argument or a return value. I’m not as concerned about whether it was a null or some other invalid value (such as 0 for division) as I am about where the bad value came from. These exceptions would require you to provide the value.
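Here is a minimal sketch of how those three scenarios might look in Clojure. It assumes the errors are plain ex-info values tagged with a :kind key rather than new Java classes. The helper names (illegal-argument, unexpected-return-value, no-possible-answer, error-kind) and the keys in the data maps are hypothetical and do not come from any existing library.

```clojure
;; Hypothetical helpers: each error is an ex-info whose data map
;; records which scenario it is and the value that caused it.

(defn illegal-argument
  "Somebody passed me an invalid argument."
  [arg-name value]
  (ex-info (str "Invalid argument " arg-name ": " (pr-str value))
           {:kind :illegal-argument :argument arg-name :value value}))

(defn unexpected-return-value
  "I called a function and got something back I wasn't ready for."
  [fn-name value]
  (ex-info (str "Unexpected return value from " fn-name ": " (pr-str value))
           {:kind :unexpected-return-value :function fn-name :value value}))

(defn no-possible-answer
  "I was called with valid arguments but I can't fulfill the contract."
  [message context]
  (ex-info message (assoc context :kind :no-possible-answer)))

(defn error-kind
  "Returns the :kind of an error built above, or nil for any other throwable."
  [e]
  (:kind (ex-data e)))
```

A handler can then branch on (error-kind e) instead of guessing from the exception class, and the offending value is always available under :value for the first two kinds.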
NoPossibleAnswerException is for situations where you want to throw an exception because you can’t answer. For example, if I ask for the configuration key “HTTP_PORT” but it’s not set, I may want to throw an exception instead of returning null. This is the exception for that case.
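The configuration example could be sketched like this, assuming the configuration is an ordinary map and the error is an ex-info tagged with a hypothetical :kind of :no-possible-answer. The function name config-value is made up for illustration.

```clojure
(defn config-value
  "Looks up key k in the config map. Throws instead of returning nil
  when the key is not set. (Hypothetical helper, not a library function.)"
  [config k]
  (if (contains? config k)
    (get config k)
    (throw (ex-info (str "No configuration value for " (pr-str k))
                    {:kind :no-possible-answer :key k}))))

;; Usage:
;; (config-value {"HTTP_PORT" "8080"} "HTTP_PORT") ;=> "8080"
;; (config-value {} "HTTP_PORT")                   ; throws
```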
There are still other exceptions that might come up. I’m thinking about FileNotFound or TimeoutExceptions. I wonder where those go.
There’s another dimension by which we should slice the exceptions: Whether or not they should be retried. For example, if there’s a timeout on an http get request, go ahead and retry. But if you divide by zero, retrying is useless. If there’s one piece of information I’d love to have available programmatically, it’s whether the thrower thinks retrying is a good idea.
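One way to make that information available programmatically is to have the thrower put a flag in the exception data. The sketch below assumes a hypothetical :retryable? key on ex-info data. The names retryable? and with-retries are illustrative, and a real version would probably also wait between attempts.

```clojure
(defn retryable?
  "True when the thrower marked the error as worth retrying.
  Exceptions without ex-data (such as ArithmeticException) are not retryable."
  [e]
  (true? (:retryable? (ex-data e))))

(defn with-retries
  "Calls f up to max-attempts times. Retries only when the thrown error
  is marked retryable; anything else is rethrown immediately."
  [max-attempts f]
  (loop [attempt 1]
    ;; recur is not allowed inside try/catch, so the catch clause
    ;; returns a sentinel and the loop decides what to do next.
    (let [result (try
                   {:value (f)}
                   (catch Exception e
                     (if (and (retryable? e) (< attempt max-attempts))
                       ::retry
                       (throw e))))]
      (if (= ::retry result)
        (recur (inc attempt))
        (:value result)))))
```

A timeout would be thrown as (ex-info "timeout" {:retryable? true}), while a divide-by-zero carries no such flag and fails on the first attempt.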
With these few pieces, you can build some primitives that help localize the errors:
(defn divide [num denom]
  (assert-argument denom (not (zero? denom)))
  (/ num denom))
assert-argument is a macro that throws an IllegalArgumentException if the condition is not true. It can use the form to report good error messages, including the name of the argument, its value, and the condition that failed. It helps document your assumptions about the arguments you will receive.
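The post does not show a definition for the macro, so here is one possible sketch. The message format is an assumption. The point is only that a macro can see the argument's name and the condition's source form, not just their values.

```clojure
(defmacro assert-argument
  "Throws IllegalArgumentException unless condition is truthy.
  The message names the argument, its value, and the failed condition."
  [arg condition]
  `(when-not ~condition
     (throw (IllegalArgumentException.
             (str "Invalid argument " '~arg " = " (pr-str ~arg)
                  "; failed condition: " (pr-str '~condition))))))

(defn divide [num denom]
  (assert-argument denom (not (zero? denom)))
  (/ num denom))

;; (divide 4 2) ;=> 2
;; (divide 1 0) throws IllegalArgumentException with the message
;;   "Invalid argument denom = 0; failed condition: (not (zero? denom))"
```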
Similarly, you can document the assumptions about the return values you will get with assert-return.
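The shape of assert-return is not spelled out, so the following is a guess at one workable design: it takes a call form and a predicate, returns the value when the predicate holds, and otherwise throws an ex-info tagged with a hypothetical :kind of :unexpected-return-value. The port function is only there to show usage.

```clojure
(defmacro assert-return
  "Evaluates call once. Returns its value if (pred value) is truthy,
  otherwise throws an ex-info carrying the call form and the bad value."
  [call pred]
  `(let [value# ~call]
     (if (~pred value#)
       value#
       (throw (ex-info (str "Unexpected return value from " (pr-str '~call)
                            ": " (pr-str value#))
                       {:kind :unexpected-return-value
                        :form '~call
                        :value value#})))))

;; Hypothetical usage: we expect the lookup to give back an integer.
(defn port [config]
  (assert-return (get config :port) integer?))

;; (port {:port 8080}) ;=> 8080
;; (port {})           ; throws, with :value nil in the ex-data
```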
Unfortunately, most existing libraries (certainly those in Java) do not follow these conventions. What has worked best for me is to build something like an Anti-Corruption Layer (as in Domain Driven Design). The anti-corruption layer wraps the library and converts the library’s conventions into the conventions of my codebase. It’s a place to say “this is how we use this library” and “this is what we expect this library to do”. It seems redundant at first. Why wrap each library function you call in another function that just calls the library function? Well, it’s to constrain and centralize the assumptions you’re making about the library. Instead of having workarounds for the library spread throughout the codebase, you centralize them and standardize them to the anti-corruption layer.
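As a small illustration of such a layer, here is a sketch that wraps Java's Integer/parseInt. The wrapper name parse-int and the :kind / :argument / :value keys are assumptions for this example. The idea is that the rest of the codebase never sees NumberFormatException, only the codebase's own convention.

```clojure
(defn parse-int
  "Anti-corruption wrapper around Integer/parseInt. Translates the
  library's failure modes into an ex-info tagged :illegal-argument."
  [s]
  (when-not (string? s)
    (throw (ex-info (str "Expected a string, got: " (pr-str s))
                    {:kind :illegal-argument :argument 's :value s})))
  (try
    (Integer/parseInt ^String s)
    (catch NumberFormatException e
      ;; Keep the original exception as the cause for debugging.
      (throw (ex-info (str "Not an integer: " (pr-str s))
                      {:kind :illegal-argument :argument 's :value s}
                      e)))))

;; (parse-int "42")  ;=> 42
;; (parse-int "4x2") ; throws ex-info with :kind :illegal-argument
```

If the library's behavior ever changes, or a workaround is needed, this function is the only place that has to know.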
Before I conclude, I want to mention a different yet similar approach. Erlang is famous for its “let it crash and retry” strategy for handling errors. It’s similar in that Erlang assumes anything can fail (the default is something will go wrong). And the default strategy is to reset the state and try again. This often works, especially with stateful systems. I think it’s a great default.
The problem I have encountered with it is that you do need a certain level of correctness for retries to be effective. If your code doesn’t handle the null you got, retrying it is not going to fix that. You would like to surface bugs early (hopefully during development) so you can fix them. In those cases, you don’t want to hide the problem with a retry. However, we all know that bugs are inevitable. So being protective and having a decent default strategy is prudent. Retry seems to be pretty good to me as a default. So it seems we may want different behavior in production than in development.
In development, crash early so we can surface the errors. In production, do your best to continue without errors. For instance, if the product recommendations don’t load on your product page for a new tea kettle, you should still load the page, just without recommendations. That way, the user can still buy their tea kettle. But don’t do that during development because if you break the recommendation engine, you want to know.
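A minimal sketch of that split, assuming the environment is exposed as a dynamic var. The names *environment* and with-fallback are hypothetical, and a production version would presumably log the error before falling back.

```clojure
;; Hypothetical switch; a real app might read this from configuration.
(def ^:dynamic *environment* :development)

(defn with-fallback
  "Calls f. In :production, returns fallback if f throws.
  In any other environment, lets the exception propagate."
  [fallback f]
  (try
    (f)
    (catch Exception e
      (if (= :production *environment*)
        fallback
        (throw e)))))

;; Hypothetical usage on the product page:
;; (with-fallback [] #(load-recommendations product-id))
```

With this, the tea kettle page renders with an empty recommendation list in production, while the same failure crashes loudly on a developer's machine.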
The tension between these two behaviors shows that there are still significant decision points a programmer will need to make. The key is to make the decision easy to get right. That means reducing the number of options and making each option easier to do right. We want to standardize and reduce boilerplate.
Let’s conclude. Exception throwing and handling has a lot of degrees of freedom—too many to get right all the time. It’s ironic because the error reporting and handling will have errors! We can apply Clojure’s design principles of using the general case as the default. In this case, it’s that the function you’re calling can’t be trusted fully, and the caller of the current function can’t be trusted. You have to check your arguments and the return values. But checking them is a lot of work and easy to get wrong. We need better primitives to make it easy to get right. I proposed a systematic evaluation of the kinds of errors you want to throw, and making those easy to throw. I talked about an anti-corruption layer to enforce conventions when using third-party libraries. And I talked about wanting different behavior in development and production. I don’t think I got it all right. So please let me know what your policies are in your software. And I’m also interested in knowing in what other areas (besides exceptions) you apply the policy of “general case is the default”.
What are your thoughts on https://github.com/cognitect-labs/anomalies ?
Looks like part of this sentence is missing: "If we assume the default is the general case (every function is broken :), then we"?