How it broke - Timeouts in microservices

How an innocent-looking migration to a different HTTP library broke our microservice system.

Setting the scene

Users outside my team send HTTP requests to Service A, which - depending on the request - sometimes needs data from the External Service. The logic for sending requests to the External Service and post-processing the response lives in Service B. Service A's dependencies (including the External Service) do quite a bit of heavy lifting, so requests to Service A can take a while. The timeout for requests to Service A was therefore set to 25 seconds at the time.

sequenceDiagram
    User->>Service A: HTTP GET-by-POST
    Service A->>Service B: HTTP GET-by-POST
    Service B->>External Service: HTTP GET-by-POST
    External Service->>Service B: Response
    Service B->>Service A: Response
    Service A->>User: Response

("GET-by-POST" here means that the user or service wants to retrieve a resource, but uses the POST method, specifying the resource in the request body. This approach allows for more elaborate requests than would be possible with GET due to URL length restrictions.)

What went wrong?

During some unrelated work on Service B (written in Python), I noticed that it used httpx for some HTTP requests and requests for others. Following the boy scout rule, I decided to remove this inconsistency by getting rid of the dependency on requests and using httpx everywhere instead. The change looked like a drop-in replacement:

- response = requests.post(url, data=payload)
+ response = httpx.post(url, data=payload)

However, soon after deploying these changes, the failure rate of requests to Service A increased because requests Service B made to the External Service were timing out.

Some debugging later, I found out that httpx has a default timeout of 5 seconds, whereas requests has none: by default, requests waits indefinitely.

The documentation of httpx is quite clear about this: "The default behavior is to raise a TimeoutException after 5 seconds of network inactivity." I just hadn't considered this difference in default behaviors. Once this became clear, the quick fix we implemented was to disable timeouts for calls from Service B to the External Service, which restored the old behavior.
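
In code, the quick fix boils down to passing an explicit timeout of None, which tells httpx not to enforce any timeout at all (URL and payload are placeholders here):

import httpx

# Quick fix: disable the client-side timeout entirely, restoring the old
# requests behavior of waiting indefinitely for the External Service.
response = httpx.post(url, data=payload, timeout=None)

A more targeted option would have been an explicit httpx.Timeout with a generous read timeout, but disabling it altogether matched the previous behavior most closely.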

How retries made the situation worse

To deal with errors when making HTTP requests, it is common to perform a couple of retries. This can be effective if the error is temporary, like a network hiccup. In this case, however, the retries made the situation worse.

Retries were implemented in Service A, which would retry the request to Service B if it received an error response. As a result, multiple requests were sent to the External Service before Service A finally timed out, increasing the load on the External Service while the system as a whole failed to serve the user a successful response.

sequenceDiagram
    User->>Service A: HTTP GET-by-POST
    Service A->>Service B: HTTP GET-by-POST
    Service B->>External Service: HTTP GET-by-POST
    Service B->>Service A: Timeout -><br> Error response
    Service A->>Service B: Retry
    Service B->>External Service: HTTP GET-by-POST
    Service B->>Service A: Timeout -><br> Error response
    Service A->>User: Timeout -><br> Error response
    External Service->>Service B: Response
    External Service->>Service B: Response
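
To make the amplification concrete, here is a rough sketch of the kind of retry loop Service A might have been running (function name and retry count are assumptions, not the actual implementation):

import httpx

# Hypothetical retry loop in Service A: every failed attempt fires another
# request to Service B, which in turn hits the External Service again,
# even though the previous call may still be in flight there.
def call_service_b(url: str, payload: dict, attempts: int = 3) -> httpx.Response:
    for attempt in range(attempts):
        try:
            response = httpx.post(url, json=payload)
            response.raise_for_status()  # treat error responses as failures
            return response
        except httpx.HTTPError:
            if attempt == attempts - 1:
                raise  # out of retries; the error propagates to the user

Each retry reaches the External Service as a brand-new request, so a single user request can fan out into several expensive downstream calls.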

Beyond the quick fix

Explicitly setting relevant parameters instead of relying on library defaults could have helped here: it prevents accidental changes in behavior when a library (and with it, its defaults) is swapped out.
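
For the migration at hand, that would have meant spelling out the timeout in both the old and the new call, so that swapping the library could not silently change it (the 25 seconds mirror the upstream limit and are an assumption):

- response = requests.post(url, data=payload, timeout=25)
+ response = httpx.post(url, data=payload, timeout=25)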

Another (untested) approach suggested by a colleague is to make the timeout configurable in the request and let the caller set an appropriate value. If Service B's timeout were included in the request to the External Service, the External Service could at least stop working on the request once that timeout has elapsed. More generally, a service could take the timeout specified in the incoming request, subtract the time it needs for its own processing, and pass the remaining time as the timeout to the next service.
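
A sketch of that idea (untested; the header name, the reserved processing budget, and the helper are all made up for illustration):

import httpx

# Time Service B reserves for its own post-processing (assumed value).
OWN_PROCESSING_BUDGET = 2.0

def forward_with_budget(deadline_seconds: float, url: str, payload: dict) -> httpx.Response:
    # Whatever budget the caller gave us, minus the time reserved for our own
    # work, is what we can pass on to the External Service.
    remaining = deadline_seconds - OWN_PROCESSING_BUDGET
    if remaining <= 0:
        raise TimeoutError("no time budget left to call the External Service")
    return httpx.post(
        url,
        json=payload,
        timeout=remaining,
        # Hypothetical header telling the downstream service how long it has.
        headers={"X-Timeout-Seconds": str(remaining)},
    )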

Finally, in a situation like this where a service needs to do so much heavy lifting, maybe synchronous (HTTP) requests aren't the right choice. A system based on a message queue (e.g. Kafka or RabbitMQ) or a job queue (e.g. Celery) might be a better fit.
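
As a very rough sketch of the job-queue variant (broker URL, endpoint, and task shape are all assumptions), Service B could expose the External Service call as a Celery task that Service A enqueues and later collects, instead of holding an HTTP connection open for the whole wait:

import httpx
from celery import Celery

app = Celery("service_b", broker="redis://localhost:6379/0", backend="redis://localhost:6379/1")

@app.task
def fetch_from_external_service(payload: dict) -> dict:
    # The long-running call now happens in a worker process; no user-facing
    # HTTP connection has to stay open while the External Service works.
    response = httpx.post("https://external-service.example.com/query", json=payload, timeout=None)
    response.raise_for_status()
    return response.json()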