Monday 12 September 2016

NServiceBus, "Failed to enlist in a transaction" and MSMQ overload.

On several occasions, after a serious fault in our infrastructure, NServiceBus on MSMQ has failed to start up gracefully. It would throw errors for anything up to several hours before eventually unblocking itself and processing as normal.

The context is a very popular ecommerce site, where NServiceBus has to work through anything up to 40,000 queued messages after a period of downtime.

For background: two of the incidents were total power failures in a commercial data centre, and the other was a failing SAN disk that dragged down overall throughput.

Our NServiceBus is on-premises and uses MSMQ as the underlying transport. Some endpoints use MSDTC and enlisted transactions. (As an aside, this is something you should be able to design out. Distributed transactions are not particularly friendly to highly distributed messaging systems. With a combination of idempotency, built-in MSMQ transactions and retries you can avoid the need for them.)

One of the errors we saw was "Cannot enlist the transaction". We believe contention for MSMQ at startup causes it. To investigate, you can turn on MSDTC logging; the logs then have to be formatted with the TraceFmt.exe tool before they are human-readable.

You can also turn on System.Transaction tracing.
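As a minimal sketch, System.Transactions tracing can be switched on through the standard .NET `system.diagnostics` section of the endpoint's config file. The listener name and the log path below are placeholders chosen for illustration, not values from our setup, and the `Information` level is an assumption you may want to raise or lower.

```xml
<configuration>
  <system.diagnostics>
    <sources>
      <!-- "System.Transactions" is the trace source name used by the framework -->
      <source name="System.Transactions" switchValue="Information">
        <listeners>
          <!-- Hypothetical listener name and output path; change to suit -->
          <add name="txTrace"
               type="System.Diagnostics.XmlWriterTraceListener"
               initializeData="C:\logs\System.Transactions.svclog" />
        </listeners>
      </source>
    </sources>
    <!-- Flush after every write so entries survive a crash or forced restart -->
    <trace autoflush="true" />
  </system.diagnostics>
</configuration>
```

The resulting file records each transaction being created, enlisted in, committed or aborted, which makes it possible to line up the start and abort times.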

Once we turned this on, we could see repeated errors where a transaction was started and then aborted 2 minutes later. This happened continuously and prevented NServiceBus from starting up. After a long period of time (an hour or more) one transaction managed to complete successfully, and this allowed other messages to start flowing through.

Our team originally tried to solve this by increasing the MSDTC timeout duration. However, this is not enough on its own: the System.Transactions timeout sits underneath it and also needs to be changed in the endpoint's configuration file.

<system.transactions>
  <defaultSettings timeout="00:05:00"/>
</system.transactions>
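One caveat that is not from our incident notes: System.Transactions caps any configured timeout at a machine-wide `maxTimeout`, which defaults to 10 minutes and can only be set in machine.config. The 5 minute value above fits under that default, but a larger value would be quietly reduced to the cap unless the cap is raised as well. A sketch of the machine.config fragment, with the 30 minute figure purely illustrative:

```xml
<!-- machine.config only; this element is not honoured in app.config or web.config -->
<system.transactions>
  <!-- Illustrative value: upper bound for every transaction timeout on this machine -->
  <machineSettings maxTimeout="00:30:00" />
</system.transactions>
```

Because this affects every .NET application on the server, it is probably best treated as a temporary measure while the backlog drains.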

The solution is to perform one or more of the following:

  • wait
  • increase the transaction timeout
  • reduce the queue length if it is too big on startup by using a temporary queue and copying messages manually
  • remove the need for MSDTC
  • fix the underlying performance problem that is hampering MSMQ.
