Monday, 17 October 2016

Distributed Systems: de-duplication and idempotency

When working on distributed systems there are two important concepts to keep in mind - deduplication and idempotency. The two are sometimes confused but there are subtle differences between them.

Background Context
In our e-commerce system, we receive orders submitted from Checkout and manage them through a lifecycle of Payment Authorisation, Fraud Check and Billing through to Dispatch. We employ an asynchronous message-driven architecture using a combination of technologies such as NServiceBus, MSMQ and Azure Service Bus, with a mixture of workflow choreography and orchestration. In this model, lightweight handlers receive a command or subscribe to an event, do some work, and publish a resulting message. These handlers are chained together with other handlers to complete the overall workflow.

At-Least Once Messaging
Messaging systems typically offer at-least-once delivery. That means you are guaranteed to receive a message at least once, but under certain circumstances - such as failure modes - you may receive the same message twice (perhaps on different, independent threads of execution).

Avoiding certain work twice
However, some work items should not be repeated: you should only charge a customer once, or perform a fraud check once. When our handler receives a PaymentAccepted event it performs a Fraud Check with a third-party service. This costs money, and calling it repeatedly will affect a customer's fraud score - it should only be done once per order.

Some people employ deduplication to solve this. For example, Azure Service Bus gives you the option to deduplicate messages within a time window (say 10 minutes). Within that window, Azure Service Bus checks every message ID and, if it has already been seen, prevents another handler from processing the message. This is possible because ultimately the broker architecture revolves around a single SQL database (limited to a region).
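To make the idea concrete, a time-windowed duplicate filter can be sketched in a few lines. This is only an illustration of the concept, not how Azure Service Bus actually implements it; the DedupWindow class and its parameters are hypothetical.

```python
import time

class DedupWindow:
    """Sketch of time-windowed message deduplication: suppress any
    message whose ID has already been seen within the window."""

    def __init__(self, window_seconds=600):  # e.g. a 10-minute window
        self.window = window_seconds
        self.seen = {}  # message_id -> time first seen

    def accept(self, message_id, now=None):
        now = time.time() if now is None else now
        # Forget IDs that have aged out of the window.
        self.seen = {m: t for m, t in self.seen.items() if now - t < self.window}
        if message_id in self.seen:
            return False              # duplicate: suppress delivery
        self.seen[message_id] = now
        return True                   # first sighting: deliver
```

Note that once an ID ages past the window, a replay of the same message would be delivered again - the window only protects against near-in-time duplicates.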

However, deduplication does not solve all of the problems. Idempotency is very important too.

If we are truly idempotent then a handler should “always produce the same output given the same input”.

Ideally, a handler would be fully idempotent. If the work is to subscribe to a NewCustomerOrder event, set a new customer flag to true and publish a NewCustomerUpdated event, we should be able to publish the input event ten times and the outcome will always be the same: the flag set to true and a NewCustomerUpdated event published.

In the Fraud Check scenario, however, we should subscribe to the PaymentAccepted event, check our history to determine whether to perform the Fraud Check, and always publish the result - normally FraudCheckPassed.
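Such a handler might look like the sketch below. The collaborators are hypothetical stand-ins: `history` is a per-order store of past outcomes, `fraud_service` is the paid third-party check, and `bus.publish` emits an event.

```python
class FraudCheckHandler:
    """Sketch of an idempotent handler for the PaymentAccepted event:
    the expensive check runs at most once per order, but the outcome
    is published on every delivery, including duplicates and replays."""

    def __init__(self, history, fraud_service, bus):
        self.history = history          # order_id -> recorded outcome
        self.fraud_service = fraud_service
        self.bus = bus

    def handle_payment_accepted(self, order_id):
        outcome = self.history.get(order_id)
        if outcome is None:
            # First delivery: do the expensive, once-only work.
            passed = self.fraud_service(order_id)
            outcome = "FraudCheckPassed" if passed else "FraudCheckFailed"
            self.history[order_id] = outcome
        # Always publish, so downstream handlers are re-exercised
        # on a replay rather than the workflow silently stalling.
        self.bus.publish(outcome, order_id)
```

Delivering the same PaymentAccepted event twice calls the third-party service once but publishes the outcome twice - which is exactly the behaviour a replay needs.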

Idempotency supports Replays
This is important for replays. We experienced a series of infrastructure failures that caused messages to be lost at various stages of the workflow.

This is shown below:

It would have been very useful to go to the leftmost part of the workflow and replay the message: this would have resulted in the downstream handlers responding and repeating their work as required. However, many handlers employed deduplication: they simply swallowed the input event, did no work, and published no outcome. That meant later stages of the workflow were not exercised and the system had to be recovered stage by stage. It required a lot of manual effort, was time-consuming and was prone to error.

If each handler had been idempotent, it would have received the input event, chosen whether or not to do the work, and still published the output event.

What are the side effects?
So what impact could this have?

In the event of duplicated messages at the transport level (e.g. at-least-once messaging) we could end up with more load on the system. If one of the leftmost handlers received a duplicate, it would propagate down to the right. However, I believe this is a small price to pay for the enhanced supportability. Furthermore, Azure Service Bus, being a broker with message-level locking, will prevent this except in handler failure scenarios.

If we are using Event Sourcing, then if we duplicate or replay an event our event streams will record the fact that FraudCheckPassed was published twice. Arguably, though, this is a good thing. It is a true reflection of history, which is what the event stream should be. It is much better to have two publishes recorded than to have someone manually hacking the event stream to fix problems.

In our current system, we enforce deduplication at the transport layer using Azure Service Bus configuration. However, I believe this should change anyway: it prevents the use of partitioned queues, it prevents us from changing the transport layer, and in any case deduplication and idempotency should be enforced at the handler level.

Note that idempotency and deduplication are separate concerns (although often conflated!). In the case of a Fraud Check or a call to a Payment Processor for billing, it is important that these calls are not repeated. If we receive a duplicate or replayed message, we don't call these external systems again. Rather, we just republish the last outcome.
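The difference in behaviour can be boiled down to two dispatch strategies. This is a deliberately minimal sketch with hypothetical names: transport-level deduplication swallows a repeat outright, while handler-level idempotency runs the expensive work once but republishes the recorded outcome on every delivery.

```python
def dedup_dispatch(seen_ids, handler, message_id, payload):
    # Transport-level deduplication: a repeat is swallowed outright,
    # so nothing is republished and downstream handlers stay idle.
    if message_id in seen_ids:
        return None
    seen_ids.add(message_id)
    return handler(payload)

def idempotent_dispatch(outcomes, handler, message_id, payload):
    # Handler-level idempotency: the expensive work runs once, but the
    # recorded outcome is returned (republished) on every delivery.
    if message_id not in outcomes:
        outcomes[message_id] = handler(payload)
    return outcomes[message_id]
```

Either way the external call happens once per message; the difference is that only the idempotent dispatch keeps the downstream workflow moving on a replay.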

Closing argument
Our APIs are idempotent, so why aren't our handlers? When an API receives the same input request, it returns the same output. This is important if, for example, the client enters a tunnel while issuing an HTTP request and has to retry.
Why aren't our handlers behaving in this manner?

Tuesday, 11 October 2016

Netscaler Gateway Plugin

The Netscaler Gateway plugin is incompatible with VirtualBox's "VirtualBox Host-Only Ethernet Adapter": a ping test times out. When the adapter is enabled, the VPN connection intermittently fails to route traffic correctly. Disable the adapter to work with the Netscaler Gateway.

Saturday, 8 October 2016

gpedit is the new control panel: Windows Update and Windows Ink.

Today I got rather annoyed by Windows 10. I know it is supposed to cater for a wide range of users, including those who are not IT-savvy, but it has taken "knowing best" to a whole new level.

When I turned on my laptop today:

- it spent ages installing a new update
- it overwrote my Windows + W shortcut to launch a new program, Windows Ink Workspace, which I don't want
- it then tried to reboot as I was working

I then spent half an hour trying to find the option to uninstall Windows Ink. You can't. I then tried to find the option to release my Windows + W shortcut. You can't.

This is too aggressive. I liked my computer the way I had it before the update. I don't expect someone to come into my office, replace my pens and chair, and rearrange my desk because they think they are doing me a favour. I want it the way I had it.

So can you reconfigure Windows 10 to turn off all this rubbish? To ask before installing updates and to disable Windows Ink Workspace? Well, no. Not using the front door.

Fortunately there is gpedit. It appears this is becoming the new control panel. The old control panel is for the non-IT savvy users who just want to do the basics.

So to stop Windows Ink Workspace: open gpedit.msc and navigate to Computer Configuration > Administrative Templates > Windows Components > Windows Ink Workspace, then set "Allow Windows Ink Workspace" to Disabled.

To force Windows 10 to be polite and ask before downloading updates over your Internet connection: navigate to Computer Configuration > Administrative Templates > Windows Components > Windows Update, then set "Configure Automatic Updates" to Enabled with the option "2 - Notify for download and notify for install".

Friday, 30 September 2016

Options for exchanging information between domains

There are various approaches for exchanging information between domains in a system.

1. Fat Command, Subordinate System.
The system that holds the information commands a subordinate to do some work. In the command it provides all the information that it needs.
For example an Order Processing System may send a SendCustomerEmail command to a Notification system and that command holds all of the data deemed necessary to populate an email.
The recipient acts immediately upon the receipt of the command.

2. Fat Command, Thin Events
A system sends a command to a second system. That command contains all of the public information that is necessary for the execution of an action, for example SavePaymentCardDetails.
At some point in the future that system, or an unrelated system, issues an event that triggers the execution of the action.
For example, a front-end Web application may preload a payment system with SavePaymentCardDetails. The payments domain saves the card details to its domain repository. Later on, the Fraud Check system issues a FraudCheckPassed event and the Payment domain then bills the credit card.

3. Thin Event, We'll Call You
In this scenario an event is published and the receiving system calls the Web API of the sender (or other systems) to get the information required to perform the action. For example, an Order Despatch system publishes the OrderDispatched event and the subscribing Customer Notification system calls the Order API to get the details of the order so it can send an email to the customer.

4. Fat events
In this scenario an event is published which contains the public information shareable with other domains. For example the OrderCreated event is published which contains the Line Items, Delivery Address, Billing Address etc.
There are cons to this approach:

  • You may be encouraged to publish information that you want to hide (e.g. addresses)
  • You may not be able to restrict access to that information
  • An implicit coupling may develop. You may get a whole host of subscribers to your events, some of whom you have no knowledge of, who become heavily dependent upon your events. You then struggle to upgrade or deprecate older versions of the message.
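As a rough sketch, the difference between the thin-event and fat-event approaches is visible in the payload shapes alone. The field names below are hypothetical, chosen to match the examples above.

```python
def thin_event(order_id):
    # Pattern 3: just enough for subscribers to call back to the
    # Order API for the details they need.
    return {"type": "OrderDispatched", "order_id": order_id}

def fat_event(order):
    # Pattern 4: carries all the public information up front, but
    # exposes it to every subscriber, known or unknown.
    return {
        "type": "OrderCreated",
        "order_id": order["id"],
        "line_items": order["line_items"],
        "delivery_address": order["delivery_address"],
        "billing_address": order["billing_address"],
    }
```

The thin event keeps the data behind an API you control (and can secure); the fat event saves the callback at the price of the coupling and exposure described above.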

Tuesday, 20 September 2016

Changing a password on nested Remote Desktop Sessions

To change a password on a nested remote desktop session:

Start > Run > osk

Hold down the CTRL and ALT keys on the physical keyboard.

Click on the DEL button on the virtual keyboard.

Monday, 12 September 2016

MSMQ Utilities

List all of the private queues:
Get-MsmqQueue -QueueType Private | select QueueName

NServiceBus, "Failed to enlist in a transaction" and MSMQ overload.

We've had several problems where, after a serious fault in our infrastructure, NServiceBus on MSMQ has failed to start up gracefully. On many occasions it would error for anything up to several hours before eventually unblocking itself and processing as normal.

The context of this is a very popular ecommerce site and NServiceBus is trying to process anything up to 40,000 queued messages after a period of downtime.

For background context: on two occasions this was a total power failure in a commercial data centre, and the other was due to a SAN disk failing, bringing down overall throughput.

Our NServiceBus is on-premises and uses MSMQ as the underlying transport. Some endpoints use MSDTC and enlisted transactions. (As an aside, this is something you should be able to design out. Transactions are not particularly friendly to highly distributed messaging systems. With a combination of idempotency, built-in MSMQ transactions and retries you can avoid the need for them.)

One of the errors seen was "Cannot enlist the transaction". It is believed that the startup contention for MSMQ causes this. You can turn on MSDTC logging, and this requires the use of the TraceFmt.exe tool to format the logs for human consumption.

You can also turn on System.Transaction tracing.

Once we turned this on we could see repeated errors whereby a transaction was started and two minutes later it was aborted. This was happening continuously and causing NServiceBus to fail to start up. After a long period of time (an hour or more) one transaction managed to complete successfully and this allowed other messages to start flowing through.

Our team originally tried to solve this by increasing the MSDTC timeout duration. However, this is not enough: underlying it is the System.Transactions timeout, which also needs to be changed in machine.config:

  <system.transactions>
    <defaultSettings timeout="00:05:00"/>
  </system.transactions>

(Note that the effective timeout is also capped by the maxTimeout attribute of the <machineSettings> element, which defaults to 10 minutes and can only be changed in machine.config.)

The solution is to perform one or more of the following:

  • wait
  • increase the transaction timeout
  • reduce the queue length if it is too big on startup by using a temporary queue and copying messages manually
  • remove the need for MSDTC
  • or fix the underlying performance problem that is hampering MSMQ.