Residuality Theory is a framework for systems architecture developed by Barry O’Reilly. It attempts to use ideas from complexity science to build systems that can better survive the complexities of the real world. I think there are a lot of good ideas in it, and I like spreading good ideas.
The theory begins with a critique of how software architecture is currently practiced. Architects build software bottom-up, trying to derive the functionality of the system from the functionality of its components. We need to store data, so we choose a database. We need a web API, so we add a web server. What this misses is a strong notion of the uncontrollable dependencies on the world outside the software, including social, political, and environmental stressors. How does our system handle a DoS attack? What if the power goes out? What if the laws change? If the software cannot handle these stressors, it will fail to function.
The industry has come up with a set of best practices that can mitigate many stressors. For example, backup policies, rate limiting, autoscaling, and service redundancy all exist to handle common emergencies. But while these mitigations often work, they fail to constitute a process for architecture. Architecture is not “use as many best practices as possible,” just as software design is not “use as many design patterns as possible.”
Residuality Theory suggests that we should mentally simulate the stressors ahead of time to understand how we can engineer the system to exist in the complex real world. O’Reilly claims that this is more like how experienced architects tacitly do their work, and I tend to agree. The theory gives us an explicit process and uses analyses from complexity sciences to show how and why it works.
Here’s the process O’Reilly suggests. We come up with many big, outlandish stressors that could affect the system. I’ll use a recent service I’ve been working on. It’s a way to add a banner to your site that gives coupons based on the average purchasing power in your customer’s country. Here are some of the considerations I’ve been thinking about while designing it:
What if nobody uses it?
What if it becomes super popular?
What if e-commerce sites using the service get a big spike in traffic?
What if customers want more features?
What if software versions change?
What if the hosting provider goes down temporarily?
What if the hosting provider goes out of business?
I don’t know how I came up with those, but those are the ones that came to mind. Each of those considerations has influenced my decisions in designing the software, its deployment, and how I sell it.
For example, “What if nobody uses it?” is fine, except I don’t want to pay a lot for a service nobody is using. I could choose a low-cost service that scales down to nearly zero. Likewise, if it becomes super popular, I might get a lot of traffic, costing lots of money in bandwidth. I’d like to design it to require very little traffic from free users and somehow tie the hosting costs to the income from customers. As I mitigate those situations with design decisions, the system becomes more robust.
Each of those mitigations O’Reilly calls a residue. It’s a terrible name. Our list of best practices is actually a list of residues that the industry has discovered over time.
Residuality Theory shows us why our system can get more robust as we imagine more disasters. To show why, I’ll take you on a trip through complexity science and a tiny bit of statistics.
Imagine 10,000 lightbulbs, each either on or off. That’s 2^10,000 possible states, which is far too many to manage. Now imagine the lightbulbs are interconnected, so that their states can influence each other. For example, if lightbulb #4,221 is on, #5,643 is 50% likely to be on. This is a good enough model of a software system, where each component in the system is either working or not. Dynamical systems theory has studied systems like this and has characterized their behavior as the number of connections (K) increases from 0 to n^2.
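To make that concrete, here’s a tiny simulation of that kind of network. It’s essentially a random boolean network in the style of Kauffman’s NK models, which I believe is the body of work O’Reilly draws on; the parameters and the settling-vs-wandering measure are just my own illustration. Each bulb reads K random neighbors and applies a random on/off rule each tick:

```python
import random

def make_network(n, k, seed=0):
    """Build a random boolean network: each bulb reads k random
    neighbors and updates via a random boolean rule."""
    rng = random.Random(seed)
    inputs = [rng.sample(range(n), k) for _ in range(n)]
    # Each bulb gets a random truth table over its k inputs (2^k entries).
    rules = [[rng.random() < 0.5 for _ in range(2 ** k)] for _ in range(n)]
    return inputs, rules

def step(state, inputs, rules):
    """Advance every bulb one tick based on its inputs' current states."""
    new = []
    for i in range(len(state)):
        idx = 0
        for bit in (state[j] for j in inputs[i]):
            idx = (idx << 1) | bit
        new.append(int(rules[i][idx]))
    return new

def run(n=100, k=2, ticks=200):
    inputs, rules = make_network(n, k)
    state = [random.randint(0, 1) for _ in range(n)]
    seen = set()
    for _ in range(ticks):
        seen.add(tuple(state))
        state = step(state, inputs, rules)
    return len(seen)  # few distinct states = the system settled down

# Low k tends to settle quickly; high k wanders chaotically.
print("k=1:", run(k=1), "distinct states visited")
print("k=5:", run(k=5), "distinct states visited")
```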
Low values of K result in what are known as stable systems. Stability sounds good, but it isn’t always: being dead is very stable, and so is being bankrupt. A stable system might also be functional but bad at adapting as things change. Very high values of K result in chaotic behavior, which is bad because we can’t derive value from such systems reliably; they toggle unpredictably through wildly different states. The sweet spot is what is known as the critical point, the boundary between stable and chaotic. This is where adaptable systems exist, and it’s our goal for software engineering.
In most initial architectures, K is too high. We are in the chaotic zone. Too many things are coupled together, either explicitly or implicitly. So our goal is to reduce K (also known as reducing coupling). But we don’t want to reduce coupling too far below the critical point, or our system cannot adapt.
O’Reilly offers some tools for analyzing the coupling in our architecture. One method is what he calls the incidence matrix. You make a table with your disasters down the left side and your “components” along the top. Components are kind of like your desired features (not software components or services). My components would be:
charge customer
display banner
embed javascript
etc.
These are all components of this bigger system that I’m designing. My disasters are:
change to javascript embed
HTML on page is hostile (lots of weird, aggressive CSS)
credit card expires
credit card fails
cannot geocode IP address
browser uses proxy through other country
huge spike in traffic
I make these into a table, with one cell for each disaster/component pair.
Then we go through each cell and put a 1 if that component would be affected by that disaster. I’ve also summed the rows and columns, with the total of the whole table (11) in the corner. The table tells us several things:
“display banner” is very fragile. It is affected by all 7 disasters.
We can also look at the disasters with high totals. These can wipe out the whole system.
Credit card expiration links “charge customer” to “display banner”. That is, those two components are coupled. In fact, if any two components have a 1 in the same row, they are coupled. (These are coupled because in my design, free users get generic coupon codes. If their card expires and they are downgraded to free automatically, the banner will show non-working coupon codes.)
If we find two components with the same response to disasters (the exact same pattern of 1s in their columns), then they can be combined into one component. They live and die together anyway.
If we find a column with no 1s, that component is either invulnerable to the world, or we haven’t imagined enough disasters.
In general, we want to reduce the table total divided by the number of components, which is roughly the average number of connections (K) in the connection network. So if we can lower the 11 in the corner, or raise the number of components (currently 3), we are reducing coupling. Remember, as we lower K, we get closer to criticality and an adaptable system. K is currently 11/3, or about 3.67.
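To make the arithmetic concrete, here’s a small sketch of the incidence-matrix analysis in Python. The individual cells are my guesses, since the point is the totals, which match the numbers above: “display banner” hit by all 7 disasters, a table total of 11, and K ≈ 3.67:

```python
# A sketch of the incidence-matrix arithmetic. The exact cells are
# guesses, chosen to be consistent with the totals described in the text.
COMPONENTS = ["charge customer", "display banner", "embed javascript"]
NAIVE = {
    "change to javascript embed":  [0, 1, 1],
    "HTML on page is hostile":     [0, 1, 0],
    "credit card expires":         [1, 1, 0],
    "credit card fails":           [1, 1, 0],
    "cannot geocode IP address":   [0, 1, 0],
    "proxy through other country": [0, 1, 0],
    "huge spike in traffic":       [0, 1, 1],
}

def analyze(components, matrix):
    """Print column sums, table total, and K = total / components."""
    col_sums = [sum(row[i] for row in matrix.values())
                for i in range(len(components))]
    total = sum(col_sums)
    for name, s in zip(components, col_sums):
        print(f"{name}: affected by {s} disasters")
    print(f"total = {total}, K = {total / len(components):.2f}")

analyze(COMPONENTS, NAIVE)
# display banner: affected by 7 disasters -> fragile, a candidate to split
# total = 11, K = 3.67
```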
We take this first pass at an architecture, which we call naive, and we intervene, producing “residual architectures.” Here are some possible interventions:
We should split up “display banner” because it is affected by so many disasters (7). If we split it, it would still have the same number of 1s (7), but spread across more components, lowering our K and moving us closer to criticality.
We could also give customers a grace period of one month to update an expired credit card. With that policy in place, an expiring card is no longer a failure we have to design around.
And we could add a policy that if we need to change the JS embed, we continue to support the old one.
I like this theory because it considers the socio-technical system, not just the code. It also suggests how we should subdivide into components: we can combine components with identical columns, and we can split components whose columns have high totals. I should mention that many of these changes I was already considering; the analysis just gives me more of a process and jogs my thinking about what to work on next.
After those interventions, I redid the table for the residual architecture.
The first thing to note is that our new K is 7/4, or 1.75. That’s much better! Our highest column sum is now 2, as opposed to 7. And we’ve completely eliminated any negative effects from changing the JS.
I broke down “display banner” into two components:
Can we display the HTML correctly?
Are we showing the correct discount?
Their columns look very different now. And conceptually, I would handle their failures differently.
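Running the same analyze function from the earlier sketch on the residual architecture shows the improvement. Again, the exact cells are my guesses, arranged to match the results above: a total of 7, K = 1.75, a highest column sum of 2, and no effect at all from changing the JS embed:

```python
# Reuses analyze() from the earlier sketch. Cells are illustrative guesses.
RESIDUAL_COMPONENTS = ["charge customer", "embed javascript",
                       "display HTML correctly", "show correct discount"]
RESIDUAL = {
    "change to javascript embed":  [0, 0, 0, 0],  # old embed still supported
    "HTML on page is hostile":     [0, 0, 1, 0],
    "credit card expires":         [0, 0, 0, 0],  # one-month grace period
    "credit card fails":           [1, 0, 0, 0],
    "cannot geocode IP address":   [0, 0, 0, 1],
    "proxy through other country": [0, 0, 0, 1],
    "huge spike in traffic":       [1, 1, 1, 0],
}

analyze(RESIDUAL_COMPONENTS, RESIDUAL)
# total = 7, K = 7/4 = 1.75, highest column sum = 2
```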
As we mitigate more and more disasters, we should find that our system can handle disasters that we haven’t considered. O’Reilly suggests an empirical test for this. We can save half of the disasters for testing. We use the remaining half to “train” our naive architecture, generating “residual architectures.” Then, we use the testing disasters that we saved to see how our final architecture would perform compared to the naive architecture. It should do better.
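Here’s a rough sketch of how that test could look with the toy matrices from the sketches above. The scoring is my own and crude (it just counts 1s on the held-out disasters), and in the real procedure the residual design would only be informed by the training half:

```python
import random

def impact(matrix, disasters):
    """Total 1s for a set of disasters: lower = more resilient.
    A crude score; one could normalize by the number of components."""
    return sum(sum(matrix[d]) for d in disasters)

disasters = list(NAIVE)
random.seed(42)
random.shuffle(disasters)
train, test = disasters[:4], disasters[4:]

print("design mitigations against:", train)
# Then compare both architectures on the disasters held out for testing.
print("naive impact on unseen disasters:   ", impact(NAIVE, test))
print("residual impact on unseen disasters:", impact(RESIDUAL, test))
```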
The residue process relies on two ideas: attractors and sampling. Any connected network (which is what we have when we start) has what are known as attractors. An attractor is a state that the system tends toward. Even though the lightbulb system could be in any one of 2^10,000 states, there will be a much smaller number of attractors, which you can think of as categories of states, that it falls into.
We don’t know what those attractors are, but we can imagine that in my service they would include the service running smoothly, the service falling over, or the service being dead. They would also include running profitably, a growing or shrinking customer base, etc. Think of them as states of the system as a whole, ones that ignore minor details like the exact amount of traffic at any given moment.
While the number of possible states is 2^n (where n is the number of components), the number of attractors is closer to n^½ (the square root of n). For our 10,000 lightbulbs, that’s on the order of 100 attractors instead of 2^10,000 states. So we have much less to navigate, though the attractors are still unknown points in our state space. And because they’re unknown, we need a powerful idea from sampling theory.
In sampling theory, we can quite easily approximate the area of a shape that fits inside a square. We sample randomly: each sample is a point inside the square, and we are told whether the point falls inside or outside the shape. We can then approximate the shape’s area as (area of square) × (samples inside shape) / (total samples). As the number of samples increases, the approximation improves. At no point do we need to characterize the shape; we rely on statistics to estimate the area.
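Here’s that estimator in a few lines of Python, using a unit circle as the mystery shape (the function names and parameters are just mine):

```python
import random

def estimate_area(inside, square_side=2.0, samples=100_000):
    """Monte Carlo: area ~= (area of square) * (hits / samples)."""
    hits = 0
    for _ in range(samples):
        x = random.uniform(-square_side / 2, square_side / 2)
        y = random.uniform(-square_side / 2, square_side / 2)
        if inside(x, y):
            hits += 1
    return square_side ** 2 * hits / samples

# Example: a unit circle inside a 2x2 square. True area is pi ~= 3.14159.
# We never characterize the shape; we only ask "inside or outside?"
print(estimate_area(lambda x, y: x * x + y * y <= 1.0))
```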
In our architecture, the disasters push our system into a small number of attractors. We sample the attractors with a large set of disasters. As we imagine more and more disasters, we learn more and more about the attractors the system can be in. So the recommendation is that coming up with a large number of outlandish disasters (even if they are unlikely) will give us more information about how the system will behave. We can use that information to correct its behavior.
A large number of disasters gives us a better approximation of the system’s behavior, and the residues (mitigations) we develop will steer the system away from the undesirable attractors. And because there are few attractors, the mitigations will work for a large number of disasters, even ones we haven’t thought of. That’s how we can find an architecture that resists a wide variety of stressors.
I don’t think O’Reilly mentioned it, but I think this can also help accept windfalls (the opposite of disasters). What if our app gets on the front page of the New York Times? What if a big e-commerce company signs up for the service? What if Shopify puts us by default onto every one of its shops? These could be disasters (too much traffic), but they’re also possible sources of revenue. Is my business model able to take advantage?
I’m looking forward to incorporating Residuality Theory into my designs. As O’Reilly says, experienced architects already think about these disasters, though they often can’t explain what they’re doing. I think I do come up with scenarios I want to avoid and design accordingly, but this process gives a better explanation of why that works and tools for doing it more systematically. I share this article because I would like this idea to spread. If you have any questions, join me at my office hours and we can discuss it.