Distributed Systems — How to communicate services

Carlos Hernández
Jul 1, 2020 · 6 min read

This article continues my previous post, How to build a SOLID Distributed Systems. You should check it out if you haven’t read it yet!

Some of you asked me to go into more detail and explain how we can improve decoupling in our distributed systems. For this reason, this article is solely about event-driven communication and how it can improve our systems.

In my opinion, event-driven design is an important topic to consider when we design distributed systems. It is not, however, a silver bullet that will solve all of our problems. Even so, in some situations our systems will be simpler and more decoupled if we use it correctly.

Pros and cons of using Event-driven

A good way to understand the event-driven paradigm is to think about a type of asynchronous communication we all know: email. In this example, the email message is our event and the exchange between two people is the event-driven model.

Drawing on this analogy, we can highlight several pros:

  • We can send emails even when nobody is at the other end.
  • We do not have to answer them immediately. We can continue working and respond to them later.
  • We can save all of the important emails together and look at them at a later date.

But as we know, no system is perfect and in fact there are a few cons to address:

  • It is not the best form of communication when we need an urgent response.
  • The more emails we want to save, the more space we will need.
  • It is another system that we will have to support.

We can think of event-driven communication as asynchronous communication with pros and cons similar to our email example. Let’s start with the pros, in the same order:

  • Services can continue sending events even when receivers are down.
  • Services can continue processing data while events are queuing.
  • Some queue systems, Kafka among them, can retain events after they have been processed so that they can be reprocessed later. This retained log of events is the basis for patterns such as Event Sourcing.
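
The first and third points can be illustrated with a minimal in-memory sketch. This is not how Kafka or any real broker is implemented; `EventLog` and `Consumer` are hypothetical names used only for illustration. The log keeps every event, and each consumer tracks its own read position.

```python
from dataclasses import dataclass, field


@dataclass
class EventLog:
    """Hypothetical in-memory stand-in for a topic that retains its events."""
    events: list = field(default_factory=list)

    def publish(self, event: str) -> None:
        # The producer only appends; it never waits for a consumer.
        self.events.append(event)

    def read_from(self, offset: int) -> list:
        # Reading does not delete anything, so any offset can be read again.
        return self.events[offset:]


class Consumer:
    def __init__(self, log: EventLog) -> None:
        self.log = log
        self.offset = 0  # position of the next unread event
        self.seen = []

    def poll(self) -> None:
        for event in self.log.read_from(self.offset):
            self.seen.append(event)
            self.offset += 1


log = EventLog()
consumer = Consumer(log)

# The consumer is "down" (nothing polls), yet publishing still succeeds.
log.publish("order-created")
log.publish("order-paid")

# The consumer comes back and catches up on everything it missed.
consumer.poll()

# Because the log retains events, a new consumer can reprocess from the start.
replay = Consumer(log)
replay.poll()
```

Keeping the read position on the consumer side, rather than deleting events once they are read, is the design choice that makes later reprocessing possible in this sketch.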

Some cons are:

  • The communication is not instant: we do not know when the event will be processed.
  • The more events we want to save, the more space we will need.
  • It is another system that we will have to support.
  • The order of events is not guaranteed.
  • Events can be processed more than once (creating duplicates).

Ways to Use ‘Event-Driven’

At this point, we can start to think about when it is better to use event-driven communication in our systems. For example, imagine that we need to hire a new developer for our company. A large number of people are going to send us their resumes, so asynchronous communication, such as email, is the better choice. Applicants can send emails, and when the workload in the office is low we can go through all of them at once. Another benefit is that we can keep the resumes for future positions that open up in the company.

In the same way, we can use asynchronous communication in our systems to absorb bursts of high traffic. One example is a shopping website. We can use asynchronous communication to accept a purchase as fast as possible, because checkout time is critical to the business: clients are likely to buy more products if buying is faster. From the user’s perspective the purchase appears instantaneous, and they can continue shopping while their purchases are processed in the background.

However, this approach is not perfect: we have to decide what to do if one of the products in the order turns out to be out of stock. In that case, we can contact the client to return the money or change the delivery date. On our shopping website this is not a problem, because we prefer to refund one product rather than lose possible sales of other products. But this is not a programmer’s decision; it has to come from the business side of the company. The trade-off also depends on what is being sold: if a website sells something such as concert tickets, we do not want to accept a purchase and then tell the client 30 minutes later that the tickets had already sold out.
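
The shopping flow described above can be sketched with Python's standard `queue` and `threading` modules. The inventory, the product names, and the choice to refund out-of-stock items are assumptions made for illustration; a real site would use a message broker and whatever compensation the business decides on.

```python
import queue
import threading

stock = {"book": 1, "lamp": 0}  # hypothetical inventory
orders = queue.Queue()
shipped, refunded = [], []


def checkout(product: str) -> str:
    # Enqueue the order and answer the user immediately.
    orders.put(product)
    return "accepted"


def process_orders() -> None:
    # Background worker: stock is only checked here, after the user
    # has already received the "accepted" response.
    while True:
        product = orders.get()
        if product is None:  # sentinel value: stop the worker
            break
        if stock.get(product, 0) > 0:
            stock[product] -= 1
            shipped.append(product)
        else:
            # Assumed business rule: refund items that are out of stock.
            refunded.append(product)


responses = [checkout("book"), checkout("lamp")]

worker = threading.Thread(target=process_orders)
worker.start()
orders.put(None)
worker.join()
```

Both calls to `checkout` return before any stock check happens, which is exactly why the out-of-stock case needs a follow-up step such as a refund.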

Event order and duplicates

As we saw previously, queue systems do not guarantee the order of events by default, although ordering can be obtained if needed. This is because we normally have more than one partition and more than one consumer pulling events.

Within each partition, events are sorted from oldest to newest. Across partitions there is no such guarantee: the second event in partition 3 may be older than the first event in partition 0. The reason to split events into different partitions is performance. In fact, we can store each partition on a different disk, which increases read/write speed. If we need to guarantee the order of all events, the topic is restricted to a single partition.
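
When order only matters among related events (for example, all events of one shopping cart), a common alternative to a single partition is to route every event with the same key to the same partition. The sketch below assumes three partitions and uses `zlib.crc32` as a stand-in hash; real brokers use their own partitioners, and the cart events are invented for illustration.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 3  # assumed partition count


def partition_for(key: str) -> int:
    # A stable hash, so the same key always maps to the same partition.
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


partitions = defaultdict(list)

events = [
    ("cart-1", "item-added"),
    ("cart-2", "item-added"),
    ("cart-1", "item-removed"),
    ("cart-2", "checked-out"),
    ("cart-1", "checked-out"),
]

for key, action in events:
    partitions[partition_for(key)].append((key, action))


def history(key: str) -> list:
    # Events for one key live in one partition, in publication order.
    return [action for k, action in partitions[partition_for(key)] if k == key]
```

Here `history("cart-1")` returns the cart's events in the order they were published, whichever partition the key lands in. Events of different carts may still interleave differently across partitions.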

Finally, one of the most common delivery strategies in these systems is at-least-once. The reasoning is that we prefer duplicating events to losing them. With this strategy, the queue system does not mark an event as consumed until it receives an ACK from the service. Sometimes, after the service receives and processes the event, something goes wrong, for example a connection error, and the ACK never reaches the queue system. If no ACK arrives within a timeout, the event is sent again, creating a duplicate. We can solve this problem by checking whether the event was already processed and, if so, simply resending the ACK without processing it again.
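
The duplicate check described above can be sketched as follows. `Broker` and `IdempotentConsumer` are hypothetical classes, the lost ACK is simulated, and the sketch assumes every event carries a unique ID; in a real service the set of processed IDs would live in durable storage.

```python
class Broker:
    """Toy broker: keeps an event pending until it has been ACKed."""

    def __init__(self, events):
        self.pending = list(events)  # (event_id, payload) pairs

    def deliver(self, handler) -> None:
        still_pending = []
        for event in self.pending:
            acked = handler(event)
            if not acked:
                # No ACK received: keep the event for redelivery.
                still_pending.append(event)
        self.pending = still_pending


class IdempotentConsumer:
    def __init__(self, lose_first_ack_for):
        self.processed_ids = set()  # IDs of events already applied
        self.applied = []           # side effects, applied once per event
        self.lose_ack_for = set(lose_first_ack_for)

    def handle(self, event) -> bool:
        event_id, payload = event
        if event_id not in self.processed_ids:
            self.applied.append(payload)
            self.processed_ids.add(event_id)
        # Simulate a connection error that drops the first ACK.
        if event_id in self.lose_ack_for:
            self.lose_ack_for.discard(event_id)
            return False
        return True


broker = Broker([("e1", "charge 10"), ("e2", "charge 20")])
consumer = IdempotentConsumer(lose_first_ack_for={"e2"})

broker.deliver(consumer.handle)   # e2 is processed, but its ACK is lost
redelivered = list(broker.pending)
broker.deliver(consumer.handle)   # e2 arrives again; only the ACK is resent
```

After the second delivery, `consumer.applied` contains each charge exactly once, even though the broker delivered `e2` twice.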

Rapids and rivers

In this section I would like to introduce the concept of rapids and rivers, first introduced by Fred George in 2013. All events are published to the rapids, and contexts or services subscribe to them and filter the events they care about through rivers. This means that our systems have to allow events to be published as fast as possible (rapids) and to be read and processed in a controlled way (rivers).

Designing distributed systems around this concept has the benefit that none of our services has to wait for another. As a consequence, publishers are never slowed down by slow consumers, and each service can process events at its own pace.
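
A minimal in-memory sketch of the idea, under stated assumptions: `Rapids` and `River` are hypothetical classes written for this article, not an existing library, and events are plain dictionaries with a `type` field. Publishing only appends to the buffers of matching rivers, and each service drains its river in batches of its own choosing.

```python
from collections import deque


class River:
    """A filtered view of the rapids with its own buffer."""

    def __init__(self, predicate):
        self.predicate = predicate
        self.buffer = deque()

    def offer(self, event: dict) -> None:
        # Keep only the events this river is interested in.
        if self.predicate(event):
            self.buffer.append(event)

    def take(self, limit: int) -> list:
        # The service decides how many events it processes at a time.
        batch = []
        while self.buffer and len(batch) < limit:
            batch.append(self.buffer.popleft())
        return batch


class Rapids:
    """Every event is published here; publishing never waits for readers."""

    def __init__(self):
        self.rivers = []

    def river(self, predicate) -> River:
        new_river = River(predicate)
        self.rivers.append(new_river)
        return new_river

    def publish(self, event: dict) -> None:
        for river in self.rivers:
            river.offer(event)


rapids = Rapids()
payments = rapids.river(lambda e: e["type"] == "payment")
shipping = rapids.river(lambda e: e["type"] == "shipment")

for i in range(3):
    rapids.publish({"type": "payment", "id": i})
rapids.publish({"type": "shipment", "id": 99})

first_batch = payments.take(limit=2)  # controlled read: two events at a time
```

The payment service reads two events and leaves the third in its river for later, while the shipping service is unaffected by how fast or slow the payment service is.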
