Service discovery – part 1

Introduction

This post is part of a series of posts on how to implement a continuous deployment pipeline for a micro services based application. The overview and table of contents can be found here.

In a micro service based architecture one micro service might depend on one or more other micro services. If the application requires zero downtime, then each micro service needs to be redundant, that is, at least two instances of a specific micro service need to be running at all times, ideally on different nodes of a cluster.

Patterns of service discovery

Let’s assume micro service A requires micro service B. Technically we could hard-wire A to B; in this case A knows exactly where to find B, and this information would probably be stored somewhere in a configuration file for A.

If the location of B changes, we would have to update this configuration file so that A can find the new instance of micro service B. This is not a very scalable solution and becomes a maintenance nightmare in an application that consists of many micro services.

A much better solution is to establish a registry in which we collect information about all services running in our system. This registry is managed by a specialized service that offers the ability to register and de-register a service. Once such a registry is established, service A can ask the registry (service) about the availability and whereabouts of micro service B and, armed with this information, call B. Thus we have introduced a level of indirection into the communication between our micro services.

How does this work? Whenever a micro service instance comes alive and initializes itself (= bootstrapping), it makes a call to the registry service to register itself. The service instance knows where it runs – it can get the IP address of its host – and on which port it listens (if it is e.g. a micro service exposing a RESTful API). If the service is well behaved it also makes a call to the registry service to de-register itself when it stops.
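
To make this a bit more concrete, here is a minimal sketch of what such a registration call could look like against the HTTP API of Consul, the registry we will use further down. The service name service-b, the ID service-b-1, the address and the port are made-up example values, not part of any real setup:

$ curl -X PUT -d '{"ID":"service-b-1","Name":"service-b","Address":"192.168.99.101","Port":5000}' http://localhost:8500/v1/agent/service/register

$ curl -X PUT http://localhost:8500/v1/agent/service/deregister/service-b-1

The first call is what the instance would execute while bootstrapping, the second one when it shuts down gracefully.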

But what happens if the micro service instance crashes? Obviously it will then not be able to de-register itself, and we end up with an orphaned entry in the registry, which is bad. To deal with such a situation a registry service should perform periodic health checks on the registered services. One possibility is to ping each service on a pre-defined endpoint every few seconds. If a micro service instance does not reply, its entry in the registry is marked as critical. If the same instance cannot be reached several times in a row, the corresponding entry can be removed from the registry. Now when micro service A asks the registry service for the list of known instances of micro service B, the registry will not return the instances marked as critical. Thus A will never try to connect to an unhealthy instance of B.
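
With Consul, which we introduce in the next section, such a health check can be declared right in the registration payload, and the query for healthy instances is a single HTTP call. The /health endpoint, the 10 second interval and the service name below are again only illustrative assumptions:

$ curl -X PUT -d '{"ID":"service-b-1","Name":"service-b","Port":5000,"Check":{"HTTP":"http://localhost:5000/health","Interval":"10s"}}' http://localhost:8500/v1/agent/service/register

$ curl 'http://localhost:8500/v1/health/service/service-b?passing'

The passing filter in the second call is what keeps micro service A from ever seeing instances whose check is currently failing.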

Consul

Consul by HashiCorp is an open source implementation of a distributed key-value store which on top of that provides functionality that makes it an ideal candidate for service discovery. Consul can run either as master or as agent. The master orchestrates the whole network and maintains the data. A Consul agent is only a proxy to the master and forwards all requests to it. In a cluster we want to install an agent on each node so that all services running on the respective node have a local proxy. That makes it very easy for any micro service to talk to its Consul agent on localhost.
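
Once such an agent is listening on a node, every micro service on that node can simply talk to port 8500 on localhost. A quick sanity check against the agent's own info endpoint could look like this (assuming the default port 8500):

$ curl http://localhost:8500/v1/agent/self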

In a production environment we want to run at least two instances of the Consul master on different nodes. Consul runs on Linux and Windows, although for production it is recommended to run the Consul master only on Linux.

On Linux it is advisable to run Consul (master or agent) as a Docker container. On Windows this option is not yet available, so we have to run the Consul agent as a Windows service. First we have to start up an instance of the Consul master. We can do this on the Linux VM that was created by the Docker Toolbox, using the gliderlabs/consul-server image from Docker Hub:

docker run -d --name consul --net=host gliderlabs/consul-server -bootstrap -advertise=192.168.99.100
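
Before we query the API it is worth making sure the container actually came up; assuming the container name consul from the command above, something like this will do:

$ docker ps --filter name=consul
$ docker logs consul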

Once the master is up and running we can test it and access its API. To see, for example, which nodes are currently in the Consul cluster we can use this command

$ curl 192.168.99.100:8500/v1/catalog/nodes

and we should see an answer similar to this

[{"Node":"default","Address":"192.168.99.100","CreateIndex":3,"ModifyIndex":4}]

What’s next?

In my next post I will show how we can run a Consul agent on Linux and on Windows and join them to the Consul network. Furthermore, we will learn how we can use Registrator to auto-register services running in Docker containers with Consul, and lastly how we can use Consul.Net to register Windows services with Consul. Stay tuned.

About Gabriel Schenker

Gabriel N. Schenker started his career as a physicist. Following his passion and interest in stars and the universe, he chose to write his Ph.D. thesis in astrophysics. Soon afterwards he dedicated all his time to his second passion, writing and architecting software. Gabriel has since been working for over 25 years as a consultant, software architect, trainer, and mentor, mainly on the .NET platform. He is currently working as senior software architect at Alien Vault in Austin, Texas. Gabriel is passionate about software development and tries to make the life of developers easier by providing guidelines and frameworks to reduce friction in the software development process. Gabriel is married and the father of four children; in his spare time he likes hiking in the mountains, cooking, and reading.
  • Does it mean you need to consult the registry for every call from A to B?

    • gabrielschenker

      Yes it does. But of course you can abstract that into a library function or so if you care about DRY.

      • What about the performance overhead in this case, since we add an extra network call every time?

        • gabrielschenker

          There is always a fine line between the latency of a single call and the scalability of the overall system. If you want minimal response times per call then you need to either cache aggressively (which has its own problems) or use another architecture with its own limitations… There is no such thing as a free lunch, I'm afraid.

      • Alexander Furer

        This is not true if you work with Eureka.
        The Eureka client caches the topology and is able to work even if the registry server is down (as long as the cached topology remains valid).

        • gabrielschenker

          What you’re talking about is a further optimization. It doesn’t invalidate what I said.

  • Michael Manning

    This is a great and needed series. Thank you for putting it together.
    These may be premature questions, but where does the distribution of calls across a group of microservice B instances get implemented? Is it the responsibility of microservice A, or does Consul behave like a load balancer?

    • gabrielschenker

      Usually you would want to add another layer of indirection (for other reasons as well, like API versioning or blue-green deployment, etc.). This is a reverse proxy like Nginx which will take care of the load balancing. I will describe this pattern in a subsequent post.

    • Jeroen Gordijn

      You have two choices: client-side or server-side service discovery. Server-side is the way Gabriel describes below, where you put a load balancer of some form in between. The other option is client-side service discovery, where the client gets a list of possible service instances of the desired type and can then pick one randomly, or use some other form of load balancing.

  • Zhang Wen

    Thanks for the nice article. I have a question and hope you can help me. As you mentioned in the post, we can install an agent on each node. What will happen if the local agent is down? Does it mean that the node cannot communicate with the masters?