Let’s face it: Despite the best intentions and hard work of engineers, software developers and quality assurance folks, software fails. Although it’s an accepted best practice to test software thoroughly in dedicated test and integration environments, failures that are a result of complex circumstances are often discovered in production systems. If your software is not able to recover from such failures at runtime, you and your customers are going to have a displeasing experience.
So resilience – or the ability to recover from failure conditions at runtime back to an operational state – is an important topic. Fortunately, Enterprise MQTT Brokers like HiveMQ are built for resiliency, but what about your MQTT clients? The decoupling of Publishers and Subscribers in MQTT allows for very high scalability but can potentially introduce complex error conditions, especially if unreliable mobile networks are added to the mix. Luckily, the MQTT protocol itself has a lot to offer to mitigate these complex errors. Before digging deeper into how you can make your MQTT applications more resilient, here is how the Oxford Dictionary defines resilience:
The capacity to recover quickly from difficulties; toughness.
Oxford Dictionary, “resilience”
This blog post will tell you everything you need to know to get your MQTT applications recovering from difficulties/failures and what your options are.
Why should MQTT clients be resilient?
Applications that receive and/or send messages via MQTT are called MQTT clients. These MQTT clients are part of a distributed system that consists of a broker (cluster), and multiple MQTT clients. These MQTT clients are decoupled from each other. There are a lot of possible failure scenarios, e.g.:
- The MQTT client loses connection to the MQTT broker, for example because of (mobile) network loss
- The MQTT client gets overloaded because the message rate on its subscribed topics is too high
- The MQTT client hits a half-open socket problem
- The MQTT client experiences bandwidth so low or latency so high that normal operation is not possible
- The MQTT client receives messages with a different payload than expected (e.g. sent by malicious clients or caused by bugs on the sender side)
Of course, this is only a small excerpt of the possible failure scenarios. It’s important to plan for failure and to design your application so that it recovers from these scenarios. Just imagine your MQTT client is part of a hardware product and cannot recover automatically from a failure, so that it needs human interaction! This would certainly hurt your product. MQTT applications need to be anti-fragile to deliver the best user experience.
MQTT and Resilience
The good news is that the MQTT protocol has you covered by providing countermeasures for many failure scenarios. The following table lists which features the MQTT protocol already provides out of the box to make your applications resilient and which you need to implement on your own:
|Application Resilience Feature||MQTT Protocol Feature||Application handling needed|
|Quality of Service (Message Delivery Guarantees)||Yes||No|
|Keep-Alive / Heartbeat||Yes||No|
|Throttling / Protection against overload||No||Yes|
|Handling error conditions / exceptions||No||Yes|
Note that some MQTT libraries already provide some of the client-side countermeasures, so check whether your library supports them.
MQTT Protocol features that improve the resilience of your application
MQTT is designed for reliable communication over unreliable networks, which may be why one of its most popular features is the Quality of Service (QoS) level with which a message can be sent. MQTT has three QoS levels:
|QoS Level||Delivery guarantee|
|QoS 0||At most once delivery / fire and forget|
|QoS 1||At least once delivery|
|QoS 2||Exactly once delivery|
With a QoS level higher than 0, the broker might re-deliver a message if the subscriber application did not receive it or did not acknowledge it. This is tremendously useful if the application ran into problems after receiving a message and could not acknowledge it. The same applies in the other direction: if the broker does not respond with an acknowledgement when an MQTT client publishes a message, the client might re-send that message.
You can learn more about the MQTT Quality of Service levels in the MQTT Essentials.
Half-open TCP connections happen, especially on unreliable networks. In a half-open connection, one side has failed without the other side being notified, so the still-functioning end keeps trying to send messages and waits for acknowledgements.
When this happens, your application may run into problems, since the connection may seem OK from the client’s perspective although it has already dropped. The only way to recover from this state is to close the connection and open a new one. The tricky part is detecting this state. Fortunately, MQTT has this kind of detection built in via a heartbeat mechanism. Each client can define its own heartbeat interval with the keep-alive feature when sending a CONNECT message. You can learn more about that feature in this in-depth blog post: Keep Alive and Client Take-Over.
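The client-side bookkeeping behind such a heartbeat can be sketched roughly as follows. This is an illustration and not the code of any particular library: the class name `KeepAliveMonitor`, its methods, and the `ping_timeout_s` parameter are assumptions made for the example. The only protocol facts used are that an idle client sends a PINGREQ and the broker answers with a PINGRESP.

```python
import time


class KeepAliveMonitor:
    """Hypothetical helper that tracks the MQTT keep-alive heartbeat on the client."""

    def __init__(self, keep_alive_s, ping_timeout_s, clock=time.monotonic):
        self.keep_alive_s = keep_alive_s      # interval announced in CONNECT
        self.ping_timeout_s = ping_timeout_s  # assumed application-chosen timeout
        self._clock = clock
        self._last_sent = clock()
        self._ping_sent_at = None             # set while a PINGREQ is unanswered

    def packet_sent(self):
        """Call whenever any control packet is written to the socket."""
        self._last_sent = self._clock()

    def ping_sent(self):
        """Call after writing a PINGREQ."""
        now = self._clock()
        self._last_sent = now
        self._ping_sent_at = now

    def ping_response_received(self):
        """Call when a PINGRESP arrives."""
        self._ping_sent_at = None

    def should_ping(self):
        """True if nothing was sent for a full keep-alive interval."""
        if self._ping_sent_at is not None:
            return False
        return self._clock() - self._last_sent >= self.keep_alive_s

    def connection_is_dead(self):
        """True if a PINGREQ has gone unanswered for longer than the timeout."""
        if self._ping_sent_at is None:
            return False
        return self._clock() - self._ping_sent_at > self.ping_timeout_s
```

When `connection_is_dead()` returns true, the client would close the socket and open a new connection. Injecting the clock keeps the logic testable without real waiting.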
MQTT client identifiers must be unique. If half-open connections are present, there would be a potential problem when a client wants to reconnect with its client identifier, since the broker could assume there’s already a connection present with the same client identifier. For this reason, MQTT has a concept of client takeover, which means the broker will close the previous TCP connection with the same client identifier. This makes sure that your MQTT applications can reconnect without any problem.
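To make the take-over idea concrete, here is a toy model of the broker-side rule. It is not how HiveMQ or any real broker is implemented; `SessionRegistry` and `FakeConnection` are hypothetical names used only to show that a new connection with a known client identifier causes the previous one to be closed.

```python
class FakeConnection:
    """Stand-in for a TCP connection; it only tracks whether it was closed."""

    def __init__(self, name):
        self.name = name
        self.closed = False

    def close(self):
        self.closed = True


class SessionRegistry:
    """Toy broker registry: at most one active connection per client identifier."""

    def __init__(self):
        self._active = {}

    def connect(self, client_id, connection):
        """Register a connection and return the one it replaced, if any."""
        previous = self._active.get(client_id)
        if previous is not None and previous is not connection:
            previous.close()  # take-over: drop the old, possibly half-open link
        self._active[client_id] = connection
        return previous
```

From the client's point of view, the consequence is that reconnecting with the same client identifier is safe. It also means two different devices must never share an identifier, because they would keep disconnecting each other.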
Does your application rely on guaranteed message delivery? We saw that MQTT Quality of Service levels can help with unreliable networks with an application level re-delivery. But what to do if your client is offline and you still can’t afford to miss the messages that were sent while your application was offline? MQTT has you covered with message queuing. If your application uses persistent sessions and subscribes to messages with QoS 1 or 2, these messages will be queued for your MQTT client when it is offline. As soon as your client reconnects, these messages will be delivered to your client. Learn everything about queued messages in this blog post.
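The queuing rule can be illustrated with a deliberately simplified broker model. Everything here is an assumption for the sake of the example: `ToyBroker` is a made-up class, topics are matched by exact string only, and "delivery" just appends to a list. The one protocol detail it relies on is that a message is delivered at the lower of the publish QoS and the subscription QoS.

```python
from collections import defaultdict, deque


class ToyBroker:
    """Simplified model of broker-side queuing for persistent sessions."""

    def __init__(self):
        self._online = set()
        self._persistent = set()
        self._subscriptions = defaultdict(dict)  # client_id -> {topic: qos}
        self._queues = defaultdict(deque)        # client_id -> queued payloads
        self.delivered = defaultdict(list)       # client_id -> received payloads

    def connect(self, client_id, persistent):
        self._online.add(client_id)
        if persistent:
            self._persistent.add(client_id)
        else:
            # A non-persistent session starts from scratch.
            self._persistent.discard(client_id)
            self._subscriptions.pop(client_id, None)
            self._queues.pop(client_id, None)
        queue = self._queues[client_id]
        while queue:  # flush everything that was queued while offline
            self.delivered[client_id].append(queue.popleft())

    def disconnect(self, client_id):
        self._online.discard(client_id)
        if client_id not in self._persistent:
            self._subscriptions.pop(client_id, None)

    def subscribe(self, client_id, topic, qos):
        self._subscriptions[client_id][topic] = qos

    def publish(self, topic, payload, qos):
        for client_id, subs in list(self._subscriptions.items()):
            if topic not in subs:
                continue
            effective_qos = min(qos, subs[topic])
            if client_id in self._online:
                self.delivered[client_id].append(payload)
            elif client_id in self._persistent and effective_qos > 0:
                self._queues[client_id].append(payload)
```

In this model, a persistent client that subscribed with QoS 1 and then went offline receives the QoS 1 and QoS 2 messages on reconnect, while QoS 0 messages published in the meantime are lost.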
Client side resilience considerations
All feature-complete MQTT libraries should implement the features discussed above. The features discussed next may also be implemented by the MQTT library you’re using. If they are not, you can typically add these resilience behaviours yourself on top of most MQTT libraries. An overview of all the popular MQTT libraries and their feature set is available in the MQTT Client Library Encyclopedia.
If your MQTT client loses the connection to the MQTT broker, you should make sure that the client automatically reconnects. When implementing reconnect behaviour on your own, consider an incremental or exponential backoff algorithm for the reconnection attempts. This avoids timeouts caused by “reconnection storms”, which occur when many thousands of MQTT clients try to reconnect to an MQTT broker at the same time.
If the broker rejects the MQTT client because of invalid authentication credentials, the client should not keep trying to connect indefinitely, since retrying with the same credentials will not succeed.
Automatic reconnects are one of the key concepts you should implement, since you must not assume that a network is reliable. If you design your application to handle network failures from the beginning, you will save yourself lots of time and pain in the long run.
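A reconnect loop with exponential backoff could look roughly like the sketch below. It is library-agnostic by design: `connect` stands for whatever function your MQTT library offers to open a connection, and `AuthenticationError` is a hypothetical exception that you would map your library's "bad credentials" result onto. The random jitter is an addition to plain exponential backoff that keeps many clients from retrying in lockstep.

```python
import random
import time


class AuthenticationError(Exception):
    """Hypothetical error raised by `connect` when the broker rejects the credentials."""


def backoff_delays(base_s=1.0, factor=2.0, max_s=60.0, jitter=0.5, rng=random.random):
    """Yield an endless sequence of reconnect delays in seconds.

    Each delay is the capped exponential value, reduced by a random share of
    up to `jitter`, so that clients spread their attempts over time.
    """
    delay = base_s
    while True:
        yield delay * (1.0 - jitter * rng())
        delay = min(delay * factor, max_s)


def reconnect(connect, max_attempts=None, sleep=time.sleep, delays=None):
    """Call `connect` until it succeeds; give up immediately on bad credentials."""
    delays = delays if delays is not None else backoff_delays()
    attempt = 0
    for delay in delays:
        attempt += 1
        try:
            return connect()
        except AuthenticationError:
            raise  # retrying with the same credentials cannot succeed
        except OSError:
            # Network-level failure (ConnectionError is a subclass of OSError).
            if max_attempts is not None and attempt >= max_attempts:
                raise
            sleep(delay)
```

With the default parameters and jitter disabled, the delays grow as 1, 2, 4, 8, 16, 32 and then stay at 60 seconds. Before writing this yourself, check whether your library already ships a configurable reconnect strategy.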
We have seen that the MQTT broker queues QoS 1 and 2 messages for a client with a persistent session while that client is offline. This happens solely on the broker side. While your application has no MQTT connection, it may be desirable to buffer outgoing messages on the client side as well, so that the application can send out all messages that were queued during the outage once the connection is back. This is especially important if your client’s application architecture abstracts the MQTT transport layer away from the business logic layer. Offline buffering of MQTT messages makes your application more resilient, since a connection loss does not intrude on your application layer.
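A minimal sketch of such a client-side buffer is shown below. The class `BufferedPublisher` and its callbacks are hypothetical, and `send` stands for the publish function of your MQTT library. The buffer is bounded, which is a design choice of this example: when it is full, the oldest messages are dropped so a long outage cannot exhaust memory.

```python
from collections import deque


class BufferedPublisher:
    """Hypothetical wrapper that buffers outgoing messages while offline."""

    def __init__(self, send, max_buffered=1000):
        self._send = send  # assumed callable: send(topic, payload)
        self._buffer = deque(maxlen=max_buffered)  # oldest entries drop when full
        self.connected = False

    def publish(self, topic, payload):
        """Send right away if connected, otherwise keep the message for later."""
        if self.connected:
            try:
                self._send(topic, payload)
                return
            except OSError:
                self.connected = False
        self._buffer.append((topic, payload))

    def on_connected(self):
        """Flush buffered messages in their original order."""
        self.connected = True
        while self._buffer:
            topic, payload = self._buffer[0]
            try:
                self._send(topic, payload)
            except OSError:
                self.connected = False
                return  # keep the rest for the next reconnect
            self._buffer.popleft()

    def on_disconnected(self):
        self.connected = False

    def pending(self):
        return len(self._buffer)
```

The business logic only ever calls `publish`, so it does not need to know whether the connection is currently up. If losing old messages is not acceptable for your use case, a persistent store instead of an in-memory deque would be the next step.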
Most MQTT client applications are not designed to handle huge numbers of MQTT messages per second. It’s important to know your client’s load limits and to throttle the message ingestion rate to a level that doesn’t overwhelm it. Such an implementation typically stops reading from the socket as soon as a configured bytes-per-second or messages-per-second threshold is exceeded. So even if the broker tries to send lots of messages, TCP backpressure takes effect and your client won’t be overwhelmed. If this is not possible with your client library, you should think about implementing load shedding, which means you throw away messages that you can’t handle. This is not optimal, but it is certainly better than constantly crashing your client.
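The load-shedding variant can be sketched with a token bucket, which is one common way to enforce a messages-per-second limit and not the only one. `TokenBucket` and `SheddingHandler` are hypothetical names, and `handler` stands for your own message callback. Note that this sketch drops excess messages after they were read from the socket; it does not implement the socket-level throttling described above.

```python
import time


class TokenBucket:
    """Allows `rate_per_s` messages per second with bursts of up to `burst`."""

    def __init__(self, rate_per_s, burst, clock=time.monotonic):
        self.rate_per_s = rate_per_s
        self.burst = burst
        self._clock = clock
        self._tokens = float(burst)
        self._last = clock()

    def allow(self):
        """Consume one token if available and report whether it was granted."""
        now = self._clock()
        refill = (now - self._last) * self.rate_per_s
        self._tokens = min(self.burst, self._tokens + refill)
        self._last = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False


class SheddingHandler:
    """Passes messages to `handler` while under the limit and drops the rest."""

    def __init__(self, handler, bucket):
        self._handler = handler
        self._bucket = bucket
        self.dropped = 0  # worth exporting as a metric

    def on_message(self, topic, payload):
        if self._bucket.allow():
            self._handler(topic, payload)
        else:
            self.dropped += 1
```

Counting dropped messages matters as much as dropping them: a steadily rising counter is the signal that the client is undersized or subscribed too broadly.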
It’s a good pattern to always validate inputs, no matter how trusted the environment is. Your MQTT clients should at least validate the following:
- MQTT topics: Did the client receive messages on topics the client did not subscribe to? In such a case your application should ignore the message. If you are using wildcard subscriptions, only process messages on topics you know the client can interpret.
- MQTT message payload: The payload of MQTT messages is always binary. The structure of the data in the payload is typically defined on the application level. Always make sure you can parse the actual message. If you are expecting a JSON payload but the actual payload is XML, your parser will fail to interpret the message. Always validate that your application can handle the input; otherwise, malicious MQTT clients could craft MQTT packets that your subscribers can’t handle.
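Both checks can be sketched in a few lines. The topic layout `sensors/+/temperature` and the `temperature` field are made-up examples, and the function names are hypothetical. `topic_matches` follows the MQTT wildcard rules for `+` and `#`, including the rule that topics starting with `$` are not matched by a filter that starts with a wildcard.

```python
import json


def topic_matches(topic_filter, topic):
    """Return True if `topic` matches the MQTT `topic_filter` (supports + and #)."""
    filter_levels = topic_filter.split("/")
    topic_levels = topic.split("/")
    # Topics starting with $ are never matched by a leading wildcard.
    if topic.startswith("$") and filter_levels[0] in ("+", "#"):
        return False
    for i, level in enumerate(filter_levels):
        if level == "#":
            return i == len(filter_levels) - 1  # # is only valid as the last level
        if i >= len(topic_levels):
            return False
        if level != "+" and level != topic_levels[i]:
            return False
    return len(filter_levels) == len(topic_levels)


def parse_reading(payload):
    """Return a validated reading, or None if the payload is unusable."""
    try:
        data = json.loads(payload.decode("utf-8"))
    except (UnicodeDecodeError, ValueError):  # not UTF-8 or not JSON
        return None
    if not isinstance(data, dict):
        return None
    value = data.get("temperature")
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return None
    return {"temperature": float(value)}


def handle(topic, payload, allowed_filters=("sensors/+/temperature",)):
    """Ignore unexpected topics and reject payloads that fail validation."""
    if not any(topic_matches(f, topic) for f in allowed_filters):
        return None
    return parse_reading(payload)
```

For example, `handle("sensors/kitchen/temperature", b'{"temperature": 21.5}')` returns `{"temperature": 21.5}`, while an XML payload on the same topic or a valid payload on an unexpected topic returns `None` instead of raising inside your message callback.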
We saw in this post that a resilient MQTT broker like HiveMQ is not enough to get a stable MQTT system. Your MQTT clients also need to be resilient so that they get back to an operational state after errors. We all know that the question is not if but when errors occur. MQTT has lots of features built into the protocol to make communication resilient. This does not mean that application developers can stop caring about resilience, since some patterns are use-case specific.
What is your favorite way to harden your MQTT clients and make them more resilient? Let us know in the comments!