The Silent Killer That’s Crashing Your Coroutines | by Sam Cooper | Feb, 2023


Photo by joannapoe. CC BY-SA 2.0.

You stare blankly at the unremarkable log messages on the screen, willing them to reveal their secrets. Today the production servers stopped responding to traffic again. Last week, you just restarted them and wrote it off as a glitch. You couldn’t shake the feeling there was something weird about it, though. The servers were passing their health checks, and resource usage was fine. There wasn’t a trace of an error in the logs — not even a warning. But the synthetic monitors were going wild, and no wonder: every single request was timing out. Sure enough, the pattern this time is exactly the same. One by one, the servers have gone into zombie mode. The lights are on, but nobody’s home.

Sounds like a nightmare scenario, right? But it’s exactly the kind of thing that can happen to your Kotlin coroutines when cancellation exceptions go rogue. And when cancellation exceptions are everywhere, the chances of that happening are higher than you’d think. Luckily, there’s a solution, but it’s almost exactly the opposite of what you might have been told. In short, the only way to keep your coroutines completely safe is to catch every cancellation exception and not re-throw it. Instead, check for cancellation manually using ensureActive. No, I promise I haven’t completely lost the plot. Let me explain.

Cancellation exceptions in Kotlin have a dangerous superpower. When a coroutine ends with a CancellationException, it’s not treated as an error. Instead, the coroutine ends silently, similar to what it would do if there was no exception at all.

That’s very different from what happens with other exceptions. End a coroutine with an uncaught error of any other type, and the exception will propagate up through the job hierarchy, eventually causing the coroutine scope or application to crash if the error isn’t caught and handled.
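The contrast can be seen in a minimal sketch. It assumes a scope built from a SupervisorJob plus a CoroutineExceptionHandler that records whatever reaches it; the name uncaughtExceptionsSeen and the messages are made up for illustration.

```kotlin
import kotlinx.coroutines.*
import java.util.concurrent.CopyOnWriteArrayList

// Runs two root coroutines and returns what the handler observed.
fun uncaughtExceptionsSeen(): List<String> = runBlocking {
    val seen = CopyOnWriteArrayList<String>()
    val handler = CoroutineExceptionHandler { _, error -> seen += "handler saw ${error.message}" }
    // SupervisorJob keeps one coroutine's failure from cancelling the other.
    val scope = CoroutineScope(SupervisorJob() + handler)

    // Ends with a CancellationException: the handler is never called.
    scope.launch { throw CancellationException("silent") }.join()

    // Ends with any other exception: reported to the handler.
    scope.launch { throw IllegalStateException("loud") }.join()

    seen.toList()
}

fun main() {
    println(uncaughtExceptionsSeen()) // prints [handler saw loud]
}
```

Only the second coroutine leaves any trace, even though both ended with an uncaught exception.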

The reason cancellation exceptions are handled differently is, of course, that they’re used as part of the coroutine cancellation process. I guess the clue’s in the name! Cancelling a coroutine is co-operative, which means that the cancelled code doesn’t just get stopped immediately. Instead, the coroutine’s Job is marked as inactive and the coroutine is responsible for terminating itself. Cancellation exceptions provide one way for cancelled coroutines to exit cleanly and quickly. They’re baked into the coroutine machinery in two ways:

  1. After a coroutine has been cancelled, any call it makes to a built-in suspending function like delay or await will immediately throw a CancellationException. This helps ensure the coroutine is notified that it’s been cancelled and doesn’t try to keep doing work.
  2. Any time a coroutine ends with a CancellationException, the exception does not propagate upwards through the job hierarchy. The parent job won’t be notified of the exception. This allows cancellation exceptions to be used for a quick, clean exit from a cancelled coroutine without triggering error-handling behaviour.
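Both behaviours can be observed in a small sketch, assuming a child coroutine that is cancelled while suspended in delay. The name cancellationTrace and the recorded strings are illustrative only.

```kotlin
import kotlinx.coroutines.*

fun cancellationTrace(): List<String> = runBlocking {
    val trace = mutableListOf<String>()
    val child = launch {
        try {
            delay(60_000) // suspended here when the cancellation arrives
        } catch (e: CancellationException) {
            // isActive refers to the child coroutine here.
            trace += "delay threw, isActive=$isActive"
            throw e // let the exception end the coroutine
        }
    }
    yield() // give the child a chance to start and reach delay
    child.cancelAndJoin()
    // isActive refers to the parent (runBlocking) coroutine here.
    trace += "child cancelled=${child.isCancelled}, parent active=$isActive"
    trace
}

fun main() {
    cancellationTrace().forEach(::println)
    // delay threw, isActive=false
    // child cancelled=true, parent active=true
}
```

The child exits through the exception, and the parent carries on without seeing any failure.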

Obviously, the cancellation exception can only do its job and terminate the coroutine if it makes it to the top of the coroutine’s call stack without being caught first. If you repeatedly catch and ignore the cancellation exceptions that a cancelled coroutine will generate, the coroutine might end up in a zombie state where it keeps running, holding locks or resources, without being able to perform any useful work. For that reason, you may often hear the advice that you shouldn’t catch cancellation exceptions, or that you should always re-throw them. In a coroutine that needs to do catch-all error handling, you’ll often see a special case that looks for cancellation exceptions and re-throws them.
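The conventional special case usually looks something like the sketch below. The helper runWithCatchAll and its error list are hypothetical stand-ins for whatever catch-all handling an application has. Note that on the JVM the CancellationException branch has to come first, because CancellationException is itself an IllegalStateException.

```kotlin
import kotlinx.coroutines.*

// Hypothetical catch-all wrapper that follows the conventional advice.
suspend fun runWithCatchAll(errors: MutableList<String>, work: suspend () -> Unit) {
    try {
        work()
    } catch (e: CancellationException) {
        throw e // special case: never swallow a cancellation exception
    } catch (e: Exception) {
        errors += "handled: ${e.message}"
    }
}

fun catchAllDemo(): List<String> = runBlocking {
    val errors = mutableListOf<String>()

    // An ordinary failure is handled, so the coroutine completes normally.
    val failing = launch { runWithCatchAll(errors) { error("disk full") } }
    failing.join()

    // A cancellation passes straight through the catch-all.
    val cancelled = launch { runWithCatchAll(errors) { delay(60_000) } }
    yield() // let the coroutine reach delay before cancelling it
    cancelled.cancelAndJoin()

    errors + "failing.isCancelled=${failing.isCancelled}" +
        "cancelled.isCancelled=${cancelled.isCancelled}"
}

fun main() {
    catchAllDemo().forEach(::println)
    // handled: disk full
    // failing.isCancelled=false
    // cancelled.isCancelled=true
}
```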

My claim is that there’s a fundamental problem with blindly re-throwing every cancellation exception. Remember that coroutines have special handling for cancellation exceptions, treating them as normal terminations instead of errors. Well, the special handling doesn’t just kick in when the coroutine is actually cancelled. It’s active all the time. The result is that a cancellation exception thrown from code in a normal active coroutine can still cause the coroutine to terminate silently.

I call a cancellation exception that was thrown in an active non-cancelled coroutine a “rogue cancellation exception”. A rogue cancellation exception is dangerous, because it can kill a coroutine silently and undetected, without triggering any error-handling behaviour or logging that you might have set up. In fact, before I figured out what was crashing my coroutines, much of my code was playing right into the killer’s hands by explicitly excluding cancellation exceptions from being logged.

There are two very serious problems that result from rogue cancellation. First, the coroutine vanishes abruptly and silently mid-way through its execution, even though nobody asked for it to be stopped. Second, if the exception wasn’t caused by a normal cancellation of the current coroutine, it must have been caused by some other problem, which will now go completely undetected. Taken together, these problems can easily create the “zombie application” scenario from the introduction, where the application appears to be running, but coroutines that were performing critical background work or message handling have just vanished without any trace of an error.

How do you tell the difference between a real cancellation and a rogue cancellation exception? The CancellationException itself doesn’t have any public properties or state you can inspect to see where it came from, and the JobCancellationException subclass isn’t made public either.

My solution is a pattern that I call double-checked cancellation. I haven’t seen this pattern anywhere else, so I’m claiming I came up with it. Let me know if you’ve seen it in the wild before. And please, feel free to use it and adapt it for your own code!

The idea is simple: whenever you see a cancellation exception, check whether the coroutine has actually been cancelled or not.

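try {
    doStuff()
} catch (error: Throwable) {
    // Second check: throws a CancellationException only if this job really is cancelled.
    coroutineContext.ensureActive()
    // Still active, so treat whatever was caught as a genuine error.
    handleError(error)
}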

If a cancellation exception winds up being thrown from the try section, the explicit call to ensureActive constitutes the second check for cancellation. If the current job really is cancelled, the second check will immediately throw another cancellation exception, breaking out of the catch block.

To get a feel for how it works, try changing the runnable example so it calls doStuff2 instead of doStuff1. One of the functions cancels the job properly, while the other throws a rogue cancellation exception. The double-checked cancellation pattern accurately resolves the ambiguity between the two, letting you handle the rogue exception while the real cancellation propagates correctly.
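A minimal sketch of an example along these lines follows. Which function does what is an assumption made here: doStuff1 really cancels the current job, while doStuff2 throws a rogue cancellation exception. The names runGuarded and doubleCheckedDemo are illustrative.

```kotlin
import kotlinx.coroutines.*
import kotlin.coroutines.coroutineContext

// Assumption: stands in for work during which the job is really cancelled.
suspend fun doStuff1() {
    coroutineContext.cancel() // marks the current job as cancelled
    delay(1_000)              // throws because the job is no longer active
}

// Assumption: stands in for work that throws a rogue cancellation exception.
suspend fun doStuff2() {
    throw CancellationException("rogue")
}

suspend fun runGuarded(doStuff: suspend () -> Unit): String =
    try {
        doStuff()
        "completed"
    } catch (error: Throwable) {
        coroutineContext.ensureActive() // second check: rethrows only for a real cancellation
        "handled as error: ${error.message}"
    }

fun doubleCheckedDemo(): List<String> = runBlocking {
    val results = mutableListOf<String>()
    val real = launch { results += runGuarded { doStuff1() } }
    val rogue = launch { results += runGuarded { doStuff2() } }
    joinAll(real, rogue)
    results + "real.isCancelled=${real.isCancelled}" + "rogue.isCancelled=${rogue.isCancelled}"
}

fun main() {
    doubleCheckedDemo().forEach(::println)
    // handled as error: rogue
    // real.isCancelled=true
    // rogue.isCancelled=false
}
```

The coroutine running doStuff1 never reaches the error handler, while the one running doStuff2 handles the exception and finishes normally.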

The double-checked cancellation pattern is ideal for catch-all error handling where you need to handle a broad category of errors without interfering with proper coroutine cancellation. Using double-checked cancellation, you catch whatever errors you like, and simply include a call to ensureActive at the top of the catch block. If the coroutine was cancelled, the cancellation exception will escape. It’s exactly like re-throwing the exception, only it’s easier, and it doesn’t risk propagating rogue cancellation exceptions that could silently mask real problems.

Do we really need to go to so much trouble to avoid these rogue cancellation exceptions, though? Sure, they sound bad, but how common are they really? Why would we ever expect a CancellationException to be thrown or caught in a coroutine that hasn’t actually been cancelled?

As it turns out, rogue cancellation exceptions are lurking pretty much everywhere. For a common example, consider a value-producing async coroutine that returns a Deferred result. We know from the docs that calling await will either return the successful value or, if the job isn’t successful, throw the corresponding exception. Based on that, you might be able to guess that if the async job is cancelled, await will throw a CancellationException.

On the face of it, that doesn’t sound too dissimilar to the cancellation exceptions we get in cancelled jobs. There’s an important difference, though. The coroutine that was cancelled is not the same one that made the suspending call to await and received the exception. In fact, you can have a cancelled Deferred that throws a CancellationException without a coroutine backing it at all.

If this rogue cancellation exception wasn’t caught, it would have the potential to silently terminate the calling coroutine, even though no coroutines in this application were ever cancelled.
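A minimal sketch of this situation, assuming a CompletableDeferred that is cancelled before anything awaits it, so no coroutine backs it at all. The names cancelledDeferredDemo and consumer are illustrative.

```kotlin
import kotlinx.coroutines.*

fun cancelledDeferredDemo(): List<String> = runBlocking {
    val trace = mutableListOf<String>()

    // A Deferred with no coroutine behind it, cancelled up front.
    val result = CompletableDeferred<String>()
    result.cancel()

    val consumer = launch {
        try {
            result.await()
        } catch (e: CancellationException) {
            // Nobody cancelled the consumer: its job is still active here.
            trace += "await threw, consumer isActive=$isActive"
            throw e // the conventional re-throw
        }
    }
    consumer.join()
    trace += "consumer.isCancelled=${consumer.isCancelled}"
    trace
}

fun main() {
    cancelledDeferredDemo().forEach(::println)
    // await threw, consumer isActive=true
    // consumer.isCancelled=true
}
```

The consumer ends up in the cancelled state, with no error reported anywhere, purely because the exception was allowed to escape.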

What this amounts to is a whole new category of cancellation exceptions that don’t necessarily correspond to coroutine cancellation at all. And the exception thrown by a cancelled Deferred is far from being a one-off. Channels do it too, when you call send or receive after the channel is cancelled. But the story actually starts in Java, with the addition of java.util.concurrent.CancellationException all the way back in Java 1.5. The documentation describes the new exception as “indicating that the result of a value-producing task […] cannot be retrieved because the task was cancelled.”
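The channel case can be sketched in a few lines, assuming a rendezvous channel that is cancelled before use. The function name cancelledChannelDemo is illustrative.

```kotlin
import kotlinx.coroutines.*
import kotlinx.coroutines.channels.Channel

fun cancelledChannelDemo(): List<String> = runBlocking {
    val trace = mutableListOf<String>()
    val channel = Channel<Int>()
    channel.cancel() // cancels the channel, not the current coroutine

    try {
        channel.send(1)
    } catch (e: CancellationException) {
        trace += "send threw, isActive=$isActive"
    }
    try {
        channel.receive()
    } catch (e: CancellationException) {
        trace += "receive threw, isActive=$isActive"
    }
    trace
}

fun main() {
    cancelledChannelDemo().forEach(::println)
    // send threw, isActive=true
    // receive threw, isActive=true
}
```

Both calls throw a CancellationException while the calling coroutine is still active.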

Notice the difference between that and Kotlin’s job cancellation exceptions, which are “thrown by cancellable suspending functions if the coroutine is cancelled while it is suspended.” Inside a coroutine, Kotlin-style cancellation exceptions are thrown to interrupt ongoing work as part of the normal termination of a cancelled job. Java doesn’t use CancellationException for that purpose — it has InterruptedException instead. A Java-style cancellation exception isn’t for interrupting work, but for signalling an illegal state when you try to access the result of a task that never produced a value.

When we look at the definition of Kotlin’s CancellationException on the JVM, we can see that it’s actually a type alias for the original java.util.concurrent.CancellationException. Right away this introduces a problem for Kotlin applications running on the JVM and interacting with Java code. It should go without saying that a CancellationException thrown from a Java class is unlikely to be related to coroutine cancellation, and should be treated as an indication of some kind of error. But if the failed code happens to be running inside a Kotlin coroutine, there’s a high chance that the exception will simply cause the coroutine to terminate silently instead of triggering appropriate handling of the error.
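A minimal sketch of the Java interop problem, assuming a CompletableFuture that was cancelled before its result is read inside a coroutine. The names javaCancellationDemo and worker are illustrative.

```kotlin
import kotlinx.coroutines.*
import java.util.concurrent.CompletableFuture

fun javaCancellationDemo(): List<String> = runBlocking {
    val trace = mutableListOf<String>()

    // Plain Java API, no coroutines involved.
    val future = CompletableFuture<String>()
    future.cancel(false)

    val worker = launch {
        try {
            future.get() // throws java.util.concurrent.CancellationException
        } catch (e: CancellationException) {
            // This is kotlinx.coroutines.CancellationException, which on the JVM
            // is an alias for the Java class, so the Java exception lands here.
            trace += "future.get() threw, worker isActive=$isActive"
            throw e
        }
    }
    worker.join()
    trace += "worker.isCancelled=${worker.isCancelled}"
    trace
}

fun main() {
    javaCancellationDemo().forEach(::println)
    // future.get() threw, worker isActive=true
    // worker.isCancelled=true
}
```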

Kotlin’s CancellationException seems to be leading a double life. On the one hand, it inherits some semantics from Java, where it’s an IllegalStateException that’s thrown when trying to interact with something that’s been cancelled. That’s much closer to what we’re seeing when we get a cancellation exception when calling await on a cancelled Deferred. On the other hand, a CancellationException has a special meaning inside a coroutine, where it’s not treated as an error at all, but is used to break out of the control flow and end the job early.

By their nature, functions that wait for a result from another coroutine are asynchronous. And remember, built-in suspending functions typically contain an automatic cancellation check for the current coroutine. What this means is that functions like await, send and receive can each throw a cancellation exception for two entirely different reasons. The first case is when the caller of the function has itself been cancelled. The second case is when the function wants to signal that its normal return value or behaviour isn’t available because the receiver is in a cancelled state.

It’s possible that future releases of Kotlin coroutines will help to resolve this ambiguity. For example, there’s already a proposed change that would alter the exception thrown by the standard withTimeout function so that it no longer inherits from CancellationException.

Photo by 乐融 高 on Unsplash

It’s probably true that there are some situations where you do want a call to await to result in a silent termination of the current coroutine. If a consumer is processing values from a second producer coroutine, and the producer goes away, it might make sense for the consumer to go away cleanly too, rather than treating it as an error. That certainly seems to be the built-in assumption of many of these standard functions. But it’s a faulty assumption on two grounds. First, there are plenty of legitimate reasons that you might want to choose to treat the missing value as an error, depending on the circumstances. Second, and perhaps more importantly, throwing a cancellation exception does not cancel a coroutine.

The active state of a coroutine is tracked by its Job, and a coroutine is only cancelled if the job says it is. If you just throw a cancellation exception and then catch it again, the coroutine remains active and can keep running as normal. This is different from what happens when you call cancel, which marks the job as inactive and prevents the coroutine from suspending again.

A cancellation exception that escapes into a coroutine that hasn’t yet been cancelled is always an anomaly. It may, eventually, mark the coroutine as cancelled — if and only if it propagates to the top of the coroutine and causes it to terminate. But unlike a real cancellation, a rogue cancellation exception is a recoverable error. If you catch and ignore a rogue cancellation exception, the job remains active and can continue as if nothing had happened. If you try to catch and ignore a cancellation exception caused by an actual coroutine cancellation, the coroutine will still remember it has been cancelled, and if it tries to continue running it will almost certainly encounter another cancellation exception pretty quickly. You can’t un-cancel a coroutine.
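The difference between the two cases can be sketched in a single coroutine, assuming a rogue exception that is thrown and swallowed first, followed by a real call to cancel. The name rogueVersusRealDemo is illustrative.

```kotlin
import kotlinx.coroutines.*

fun rogueVersusRealDemo(): List<String> = runBlocking {
    val trace = mutableListOf<String>()
    launch {
        // Rogue: thrown in an active coroutine, then caught and ignored.
        try {
            throw CancellationException("rogue")
        } catch (e: CancellationException) {
            trace += "rogue caught, isActive=$isActive"
        }
        delay(10) // still suspends and resumes normally
        trace += "delay completed"

        // Real: the job itself is cancelled, then the exception is caught and ignored.
        cancel()
        try {
            delay(10)
        } catch (e: CancellationException) {
            trace += "real caught, isActive=$isActive"
        }
        try {
            delay(10) // the job is still cancelled, so this throws again
        } catch (e: CancellationException) {
            trace += "next suspension threw again"
        }
    }.join()
    trace
}

fun main() {
    rogueVersusRealDemo().forEach(::println)
    // rogue caught, isActive=true
    // delay completed
    // real caught, isActive=false
    // next suspension threw again
}
```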

The way I see it, a coroutine that has encountered a rogue cancellation exception exists in something like a quantum superposition of cancellation states. If you just re-throw every cancellation exception, you’re choosing not to look inside the box. The implications are significant and hard to predict, and will vary depending on where the exception travels next.

Don’t let your coroutines go the way of Schrödinger’s cat. If you catch and identify a rogue CancellationException using the double-checked cancellation pattern, decide there and then whether your coroutine is dead or alive. If you think the current coroutine really should be cancelled, cancel it properly by calling cancel, so that the job is marked as cancelled and all future code can detect and handle the cancellation properly. If you think the coroutine should fail, wrap or replace the CancellationException with some other type of exception so that it will never be misinterpreted as a normal cancellation. And if you want the coroutine to continue running, handle the exception and don’t re-throw it. All three of these are valid responses to a rogue cancellation exception, depending on your specific application.
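The three responses could look something like the following sketch. Everything named here is hypothetical: the OnRogue policy enum, the SourceUnavailableException wrapper, and the awaitOrDecide helper are illustrations, not part of any library. A dedicated exception class is used for the failure case because, on the JVM, catching IllegalStateException would also catch cancellation exceptions.

```kotlin
import kotlinx.coroutines.*
import kotlin.coroutines.coroutineContext

// Hypothetical names for the three responses described above.
enum class OnRogue { CANCEL, FAIL, CONTINUE }

class SourceUnavailableException(cause: Throwable) : Exception("source was cancelled", cause)

suspend fun awaitOrDecide(source: Deferred<String>, policy: OnRogue): String =
    try {
        source.await()
    } catch (e: CancellationException) {
        coroutineContext.ensureActive() // a real cancellation leaves here
        when (policy) {
            OnRogue.CANCEL -> {
                coroutineContext.cancel() // mark the job as cancelled first
                throw e                   // now the exception matches the job state
            }
            OnRogue.FAIL -> throw SourceUnavailableException(e)
            OnRogue.CONTINUE -> "fallback"
        }
    }

fun threeResponsesDemo(): List<String> = runBlocking {
    val source = CompletableDeferred<String>().apply { cancel() }
    OnRogue.values().map { policy ->
        var outcome = "no result"
        val job = launch {
            outcome = try {
                awaitOrDecide(source, policy)
            } catch (e: SourceUnavailableException) {
                "failed: ${e.message}"
            }
        }
        job.join()
        "$policy: $outcome, job cancelled=${job.isCancelled}"
    }
}

fun main() {
    threeResponsesDemo().forEach(::println)
    // CANCEL: no result, job cancelled=true
    // FAIL: failed: source was cancelled, job cancelled=false
    // CONTINUE: fallback, job cancelled=false
}
```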

No discussion of coroutine cancellation would be complete without mentioning structured concurrency. Cancellation is a key part of the structured job hierarchy, because it allows a terminated job to promptly and safely discard its unneeded child coroutines. Because of structured concurrency, a cancellation or failure in one coroutine will often lead to the automatic cancellation of other jobs.

Many of the rogue cancellation exceptions that crop up in Kotlin coroutines today actually predate structured concurrency. I guess they’re a relic of an early attempt to provide a sort of automatic cancellation for jobs that depend on one another. Awaiting the result of a cancelled job? You might as well get a cancellation exception too. It’s an informal relationship which has now been superseded by a more formal system of linked jobs. Nowadays, two coroutines that are working together are likely to be related by a common parent. In that case, anything that causes a cancellation in one of them will very likely cause a cancellation in both.

Does the call to greeting.await() in the example throw a cancellation exception because the async greeting producer task was cancelled, or because the consumer job that called await was cancelled? Thanks to structured concurrency, it doesn’t matter: they both get cancelled at the same time.
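A minimal sketch of the kind of setup the question refers to, assuming a producer called greeting and a consumer launched as siblings under one parent job. The name sharedParentDemo and the timings are illustrative.

```kotlin
import kotlinx.coroutines.*

fun sharedParentDemo(): List<String> = runBlocking {
    val trace = mutableListOf<String>()
    val parent = launch {
        // Producer and consumer are siblings under the same parent job.
        val greeting = async { delay(60_000); "Hello" }
        launch {
            try {
                trace += greeting.await()
            } catch (e: CancellationException) {
                trace += "await threw: producer cancelled=${greeting.isCancelled}, " +
                    "consumer active=$isActive"
                throw e
            }
        }
    }
    delay(100) // give both children time to start and suspend
    parent.cancelAndJoin()
    trace += "parent cancelled=${parent.isCancelled}"
    trace
}

fun main() {
    sharedParentDemo().forEach(::println)
    // await threw: producer cancelled=true, consumer active=false
    // parent cancelled=true
}
```

By the time the consumer sees the exception, its own job is cancelled as well, so the exception is a legitimate one whichever way you look at it.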

There’s less reason to worry about rogue cancellation in a coroutine scope that only interacts with its own children. Rogue cancellations crop up when you interact with jobs from outside your own coroutine scope, or when you start manually cancelling jobs. And of course, there’s always the possibility of unrelated cancellation exceptions from outside the world of coroutines entirely.

So far, I haven’t found any real issues with the double-checked cancellation pattern. Let me know if you try using it and find any scenarios where it falls down!

One thing that might initially seem like a limitation is that you can only use it in a suspending function. The pattern relies on calling ensureActive on the current coroutine context, which can only be accessed when you know for sure you’re in a coroutine. And as I think the article shows, there’s no shortage of ways you could encounter a CancellationException outside of a suspending function.

Fortunately the problem is only superficial. The only way you can encounter a real job cancellation exception is via a function that checks if the current job is active, and only suspending functions have access to the current job. That means that if you encounter a cancellation exception in a non-suspending context, even in a coroutine, you know it’s a rogue cancellation exception, and should always treat it as an error. There’s no ambiguity, so double-checked cancellation isn’t needed.

The pattern has another minor limitation when it comes to flows. Double-checked cancellation relies on the assumption that a legitimate cancellation exception must correspond to a cancellation of the current job. But the AbortFlowException that’s used internally to exit early from a flow is also a subclass of CancellationException. A flow is just a control structure, not a coroutine, so it doesn’t have a Job to cancel. Exiting early from a flow is analogous to returning early from a suspend function: it doesn’t cancel any coroutines.

On the face of it, that means that double-checked cancellation would incorrectly identify flow cancellation as an error. However, this shouldn’t be a problem in real code. The only place a flow will throw an AbortFlowException is from its emit function. It’s always incorrect to catch exceptions from emit, because they actually belong to the downstream consumer of the flow. They should be handled using the catch operator instead. So yes, double-checked cancellation can’t deal with errors caught from emit, but you shouldn’t be catching errors there anyway.
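The misidentification can be reproduced with a sketch that deliberately does the wrong thing and wraps emit in a try/catch. It assumes the first operator is used to stop the flow early; flowAbortDemo is an illustrative name, and the internal exception type is only checked against CancellationException rather than named.

```kotlin
import kotlinx.coroutines.*
import kotlinx.coroutines.flow.first
import kotlinx.coroutines.flow.flow

fun flowAbortDemo(): List<String> = runBlocking {
    val trace = mutableListOf<String>()
    val numbers = flow<Int> {
        try {
            emit(1) // the downstream collector aborts the flow here
            emit(2)
        } catch (e: Throwable) {
            // Anti-pattern: exceptions from emit belong to the downstream consumer.
            currentCoroutineContext().ensureActive() // passes: no job was cancelled
            trace += "flagged as error: isCancellation=${e is CancellationException}"
            throw e // re-throw unchanged so the flow can still finish correctly
        }
    }
    trace += "first=${numbers.first()}"
    trace
}

fun main() {
    flowAbortDemo().forEach(::println)
}
```

With the library behaviour described above, the catch block should see a cancellation exception while the job is still active, which is exactly the combination the pattern treats as an error.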

Catching and suppressing a cancellation exception isn’t an option, because that could cause a cancelled coroutine to enter a zombie state where it can’t terminate and can’t do any work. But re-throwing every cancellation exception is flawed advice too, because that could cause important coroutines to vanish silently instead of properly handling errors.

The problem is that a cancellation exception can be caused by more than one thing, and we need to be able to tell the difference. The double-checked cancellation pattern uses ensureActive after catching an exception, allowing you to handle rogue cancellations as errors while letting real cancellation exceptions propagate correctly.