Autoscaling etcd using SRV discovery on AWS

Acknowledgements

Thanks to John 'retr0h' Dewey for releasing the Ansible role I borrowed from extensively to make this happen. Also thanks to Kelsey 'kelseyhightower' Hightower for the setup-network-environment tool and for generally being a heroic individual in the Kubernetes community.

Motivation

Just over a year ago, I took my first steps with Docker orchestration and microservices. While I wasn't involved in choosing the platform at the time, I was heavily involved in deploying it. We went with Rancher and opted for the Cattle framework (Rancher supports several, including Kubernetes), mostly due to perceived operational simplicity. It worked out just fine, though we encountered the odd issue with our fairly early Rancher release.

Fast forward to today - I'm at a new job and my first real project is a Kubernetes deployment on AWS. Kubernetes, being opinionated, requires an etcd backend. I have dealt with Consul in the past and mostly found it a pleasure to work with - but until I had to deploy etcd, I had no idea how much I'd taken for granted what I now consider a killer feature: bootstrapping via EC2 tags!

Essentially, when you deploy Consul on EC2, you have the option of providing a tag key and value which Consul will search for. The nodes discover one another and cluster automagically, then elect a leader once the specified number of nodes have found one another. This makes things like autoscaling the cluster trivial - you just scale up N nodes behind an ELB and end up with a working HA Consul deployment.
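
To make that concrete, the tag lookup Consul performs is roughly equivalent to this boto3 query (a rough sketch - the tag key, value and region are placeholders, not anything from a real deployment):

import boto3

def discover_peers(tag_key="consul-cluster", tag_value="prod", region="us-west-2"):
    """Return the private IPs of running instances carrying the given tag."""
    ec2 = boto3.client("ec2", region_name=region)
    resp = ec2.describe_instances(
        Filters=[
            {"Name": "tag:{}".format(tag_key), "Values": [tag_value]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    return [
        instance["PrivateIpAddress"]
        for reservation in resp["Reservations"]
        for instance in reservation["Instances"]
    ]

if __name__ == "__main__":
    print(discover_peers())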

I expected etcd to implement something similar, and as anyone familiar with etcd will tell you, I was disappointed. Its bootstrapping mechanism is much more primitive, though of course I acknowledge that perhaps I was spoiled by Consul's EC2-tag-based system. Regardless, I set out to replicate that ease of use with an autoscaled etcd deployment.

Etcd Bootstrapping

Per the etcd docs, three bootstrapping options are available:

  • Static (provide IPs/DNS to command line)
  • Etcd Discovery (requires an existing cluster, wtf?)
  • DNS SRV Records

The disadvantages of the first two are evident. Method 1 is not conducive to autoscaling or self-healing, making it problematic for a real production deployment. Method 2 requires a second, existing etcd cluster - which some environments may already have, but I'm guessing I'm not alone in finding that requirement onerous. This article, as the title foretold, focuses on the SRV record method.

SRV Records

From Wikipedia:

A Service record (SRV record) is a specification of data in the Domain Name System defining the location, i.e. the hostname and port number, of servers for specified services. It is defined in RFC 2782, and its type code is 33.

In this case, we need it to list information about three etcd servers, which for the purposes of my current project are also Kubernetes master nodes. This article won't go into any detail about Kubernetes itself, just the etcd deployment.

From the etcd docs:

DNS SRV records can be used as a discovery mechanism. The -discovery-srv flag can be used to set the DNS domain name where the discovery SRV records can be found. The following DNS SRV records are looked up in the listed order:

_etcd-server-ssl._tcp.example.com  
_etcd-server._tcp.example.com  

If _etcd-server-ssl._tcp.example.com is found then etcd will attempt the bootstrapping process over TLS.

To help clients discover the etcd cluster, the following DNS SRV records are looked up in the listed order:

_etcd-client._tcp.example.com  
_etcd-client-ssl._tcp.example.com  

If _etcd-client-ssl._tcp.example.com is found, clients will attempt to communicate with the etcd cluster over SSL/TLS.

To that end, I've created a subdomain to host the Kubernetes project. Let's call it kubernetes.tastycidr.net. I've also created an autoscaling group which spans three availability zones in the Oregon region. The goal here is to have one etcd/master node in each AZ, each with an A record pointing to it. The SRV record then contains one entry per AZ, pointing at that A record and the port our etcd process listens on.

Route53 Records

Three A Records:

  • etcd-us-west-2a.kubernetes.tastycidr.net (resolves to 172.16.0.x)
  • etcd-us-west-2b.kubernetes.tastycidr.net (resolves to 172.16.2.x)
  • etcd-us-west-2c.kubernetes.tastycidr.net (resolves to 172.16.4.x)

One SRV Record - _etcd-server._tcp.kubernetes.tastycidr.net:

0 0 2380 etcd-us-west-2a.kubernetes.tastycidr.net  
0 0 2380 etcd-us-west-2b.kubernetes.tastycidr.net  
0 0 2380 etcd-us-west-2c.kubernetes.tastycidr.net  
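
If you'd rather manage that SRV record in code than in the console, here's a rough boto3 sketch that upserts it - the hosted zone ID below is a placeholder, and the A records themselves are handled by the boot-time script further down:

import boto3

HOSTED_ZONE_ID = "Z0000000000000"  # placeholder - your kubernetes.tastycidr.net zone
TARGETS = [
    "etcd-us-west-2a.kubernetes.tastycidr.net",
    "etcd-us-west-2b.kubernetes.tastycidr.net",
    "etcd-us-west-2c.kubernetes.tastycidr.net",
]

client = boto3.client('route53')
client.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={
        'Changes': [{
            'Action': 'UPSERT',
            'ResourceRecordSet': {
                'Name': '_etcd-server._tcp.kubernetes.tastycidr.net',
                'Type': 'SRV',
                'TTL': 300,
                # Each value is "priority weight port target"
                'ResourceRecords': [
                    {'Value': '0 0 2380 {}'.format(t)} for t in TARGETS
                ],
            },
        }]
    },
)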

Autoscaling

As some of you may have already realized, this method has a couple of challenges and one big limitation.
The limitation is that dynamically updating the A records this way assumes exactly one etcd instance per Availability Zone, so the cluster can only grow to the number of AZs in the region - typically three - which in practice limits your fault tolerance to a single node failure.

The main challenge is that when an instance is launched in, for example, us-west-2a, we need that instance to automatically assume the A record etcd-us-west-2a.kubernetes.tastycidr.net. This is doable fairly easily via boto3, but requires gathering some data first. I've written a small Python CLI tool that does the following:

  • Gather information like the current AZ and private IP
  • Set the hostname to match the A record for the AZ
  • Upsert the Route53 A record for the AZ with our private IP

I just lay this down via Ansible as /usr/local/bin/r53updater (the path the systemd unit below calls) with execute permissions. It depends on python3, with boto3, docopt and requests installed:

#!/usr/bin/env python3
"""Simplified interface for updating R53 record during instance launch

Usage:  
    r53updater.py <namespace> <zoneid> <domain> [options]

Arguments:  
    <namespace>                 Namespace of your record
    <zoneid>                    Route53 Zone ID
    <domain>                    Route53 Domain Name

Options:  
    --type=<type>               Type of DNS record [default: A]
    --ttl=<ttl>                 Record TTL [default: 300]
    --help                      Show this help string
"""
import sys  
import subprocess  
import logging  
import boto3  
import requests  
from docopt import docopt

METADATA_API = 'http://169.254.169.254/latest'  
_LOGGER = logging.getLogger("r53-record-updater")

def setup_logging():  
    _LOGGER.setLevel(logging.INFO)
    handler = logging.StreamHandler(sys.stdout)
    formatter = logging.Formatter('%(asctime)s : %(levelname)s - %(message)s')
    handler.setFormatter(formatter)
    _LOGGER.addHandler(handler)

def get_az():  
    return requests.get(
        "{}/meta-data/placement/availability-zone/".format(METADATA_API)
    ).text

def get_private_ip():  
    return requests.get(
        "{}/meta-data/local-ipv4/".format(
            METADATA_API
        )
    ).text

def change_hostname(namespace, az, domain):
    """Set the system hostname to match the A record for this AZ."""
    newhostname = "{}-{}.{}".format(namespace, az, domain)

    # Set the running hostname and persist it across reboots
    subprocess.call(['hostname', newhostname])
    with open('/etc/hostname', 'w') as f:
        f.write(newhostname)


def update_dns_record(zoneid, namespace, recordtype, domain, az, ttl, value):  
    client = boto3.client('route53')

    resp = client.change_resource_record_sets(
        HostedZoneId=zoneid,
        ChangeBatch={
            'Comment': 'Updating record',
            'Changes': [
                {
                    'Action': 'UPSERT',
                    'ResourceRecordSet': {
                        'Name': "{}-{}.{}".format(
                            namespace, az, domain
                        ),
                        'Type': recordtype,
                        'TTL': int(ttl),
                        'ResourceRecords': [
                            {
                                'Value': value
                            }
                        ]
                    }
                }
            ]
        }
    )
    return resp['ResponseMetadata']['HTTPStatusCode']

if __name__ == '__main__':  
    args = docopt(__doc__)
    setup_logging()

    az = get_az()
    private_ip = get_private_ip()
    recordname = "{}-{}.{}".format(
        args['<namespace>'], az, args['<domain>']
    )
    _LOGGER.info("Upserting record {} of type {} with TTL of {} and value {} into zone {}".format(
        recordname, args['--type'], args['--ttl'], private_ip, args['<zoneid>']
    ))

    try:
        change_hostname(
            args['<namespace>'],
            az,
            args['<domain>']
        )

        resp = update_dns_record(
            args['<zoneid>'],
            args['<namespace>'],
            args['--type'],
            args['<domain>'],
            az,
            args['--ttl'],
            private_ip
        )
        _LOGGER.info("Status Code: {}".format(resp))
    except Exception as e:
        _LOGGER.error(e)

Then use a systemd unit to run it. Note that the three Jinja variables are interpolated from vars set in Ansible. This is laid down as /etc/systemd/system/r53update.service and set as enabled via the Ansible systemd module:

[Unit]
Description=Hostname Updater  
Requires=network-online.target  
After=network-online.target

[Service]

ExecStart=/usr/local/bin/r53updater {{ r53_namespace }} {{ r53_zoneid }} {{ r53_domain }}  
RemainAfterExit=yes  
Type=oneshot

[Install]
WantedBy=multi-user.target  

There's one further piece of magic here. I really wanted to avoid running Ansible at boot time to gather current local IPs and interpolate them into our etcd systemd unit. However, I needed the local IP for the -initial-advertise-peer-urls and -advertise-client-urls CLI flags in the etcd unit. To get around this, I made use of two handy things: Kelsey Hightower's setup-network-environment tool and systemd's EnvironmentFile directive.

Essentially, I have Ansible download and install the setup-network-environment tool and use another systemd unit to run it on boot:

[Unit]
Description=Setup Network Environment  
Documentation=https://github.com/kelseyhightower/setup-network-environment  
Requires=network-online.target  
After=network-online.target

[Service]

ExecStart=/usr/local/bin/setup-network-environment  
RemainAfterExit=yes  
Type=oneshot

[Install]
WantedBy=multi-user.target  

This outputs some very basic network environment info into a handy file at /etc/network-environment:

$ cat /etc/network-environment 
LO_IPV4=127.0.0.1  
ETH0_IPV4=172.16.0.84  
DEFAULT_IPV4=172.16.0.84  
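
Incidentally, if you'd rather not ship another Go binary, a few lines of Python can produce an equivalent file. A minimal sketch that only emits DEFAULT_IPV4 - the one value the etcd unit below actually consumes:

#!/usr/bin/env python3
import requests

METADATA_API = 'http://169.254.169.254/latest'

# Fetch this instance's private IP from the EC2 metadata service
private_ip = requests.get("{}/meta-data/local-ipv4".format(METADATA_API)).text

# Write it out in the same KEY=value format setup-network-environment uses
with open('/etc/network-environment', 'w') as f:
    f.write("DEFAULT_IPV4={}\n".format(private_ip))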

The etcd systemd unit is by far the largest of the three, containing many values interpolated from Ansible vars. Note the Requires and After directives in the unit definition which ensure that, before we run etcd:

  • hostname matches our desired A record
  • /etc/network-environment is populated and contains our private IP
  • A record has been upserted with our private IP

[Unit]
Description=etcd Daemon  
After=network.target sne.service r53update.service
Requires=sne.service r53update.service

[Service]
Type=notify  
User={{ etcd_user }}  
EnvironmentFile=/etc/network-environment  
ExecStart={{ etcd_cmd }} \  
    -discovery-srv {{ etcd_discovery_srv }} \
    -initial-advertise-peer-urls {{ etcd_initial_advertise_peer_urls }} \
    -advertise-client-urls {{ etcd_advertise_client_urls }} \
    -listen-client-urls {{ etcd_listen_client_urls }} \
    -listen-peer-urls {{ etcd_listen_peer_urls }} \
    {% if etcd_client_url_scheme == "https" -%}
    --cert-file {{ etcd_client_cert_file }} \
    --key-file {{ etcd_client_key_file }} \
    {% if etcd_client_cert_auth -%}
    --client-cert-auth \
    --trusted-ca-file {{ etcd_client_trusted_ca_file }} \
    {% endif -%}
    {% endif -%}
    {% if etcd_peer_url_scheme == "https" -%}
    --peer-cert-file {{ etcd_peer_cert_file }} \
    --peer-key-file {{ etcd_peer_key_file }} \
    {% if etcd_peer_client_cert_auth -%}
    --peer-client-cert-auth \
    --peer-trusted-ca-file {{ etcd_peer_trusted_ca_file }} \
    {% endif -%}
    {% endif -%}
    -data-dir {{ etcd_data_dir }} \
    -name %H

Restart=always  
RestartSec=10s  
LimitNOFILE=40000  
TimeoutStartSec=0


[Install]
WantedBy=multi-user.target  

This is rendered on disk as:

[Unit]
Description=etcd Daemon  
After=network.target sne.service r53update.service
Requires=sne.service r53update.service

[Service]
Type=notify  
User=root  
EnvironmentFile=/etc/network-environment  
ExecStart=/usr/local/sbin/etcd \  
    -discovery-srv kubernetes.tastycidr.net \
    -initial-advertise-peer-urls http://${DEFAULT_IPV4}:2380 \
    -advertise-client-urls http://${DEFAULT_IPV4}:2379 \
    -listen-client-urls http://0.0.0.0:2379 \
    -listen-peer-urls http://0.0.0.0:2380 \
    -data-dir /var/cache/etcd/state \
    -name %H

Restart=always  
RestartSec=10s  
LimitNOFILE=40000  
TimeoutStartSec=0


[Install]
WantedBy=multi-user.target  

There are two bits of magic at play here. The ${DEFAULT_IPV4} reference is resolved from the /etc/network-environment file loaded via the EnvironmentFile directive. At the bottom of the ExecStart directive, I also use -name %H to set the member's friendly name within the etcd cluster to match our hostname, using a systemd specifier - %H for hostname!

This allows us to use interpolated values without laying down a new template at boot, though you could absolutely do this by running Ansible locally on boot.

Usage

All that's left is to scale up! I've configured the ASG and related resources through CloudFormation + Sceptre; this article assumes you can handle that part. Once you're scaled up, you should be able to SSH to one of the etcd nodes and inspect a few things:

Hostname is set to our A Record:

$ hostname
etcd-us-west-2a.kubernetes.tastycidr.net  

A Record has correct value (using dig):

;; ANSWER SECTION:
etcd-us-west-2a.kubernetes.tastycidr.net. 300    IN A 172.16.0.84  
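
You can confirm the SRV record the same way, or script it - a quick sketch using dnspython 2.x (not part of the deployment above, so consider it an extra dependency):

import dns.resolver

answers = dns.resolver.resolve("_etcd-server._tcp.kubernetes.tastycidr.net", "SRV")
for record in answers:
    # priority, weight, port, target - should show 2380 and the three A records
    print(record.priority, record.weight, record.port, record.target)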

etcd member list:

$ etcdctl member list
3313b19f27ec00cd, started, kfs-master-us-west-2b.kubernetes.tastycidr.net, http://etcd-us-west-2b.kubernetes.tastycidr.net:2380, http://172.16.2.207:2379  
9d8eac0cfb2011b6, started, kfs-master-us-west-2c.kubernetes.tastycidr.net, http://etcd-us-west-2c.kubernetes.tastycidr.net:2380, http://172.16.4.147:2379  
ba2f3fd31e7ec3aa, started, kfs-master-us-west-2a.kubernetes.tastycidr.net, http://etcd-us-west-2a.kubernetes.tastycidr.net:2380, http://172.16.0.84:2379  

Get some basic stats:

$ curl http://172.16.2.207:2379/v2/stats/leader
{
  "leader": "3313b19f27ec00cd",
  "followers": {
    "9d8eac0cfb2011b6": {
      "latency": {
        "current": 0.001664,
        "average": 0.003212,
        "standardDeviation": 0.0033903302993162,
        "minimum": 0.000925,
        "maximum": 0.013612
      },
      "counts": {
        "fail": 0,
        "success": 13
      }
    },
    "ba2f3fd31e7ec3aa": {
      "latency": {
        "current": 0.000792,
        "average": 0.034388,
        "standardDeviation": 0.10391140255445,
        "minimum": 0.000792,
        "maximum": 0.37898
      },
      "counts": {
        "fail": 1,
        "success": 12
      }
    }
  }
}
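
If you want to script those checks rather than eyeball curl output, here's a rough sketch with requests against the same client port (etcd's v2 HTTP API):

import requests

ENDPOINT = "http://172.16.2.207:2379"

# /health returns {"health": "true"} on a healthy member
print(requests.get("{}/health".format(ENDPOINT)).json())

# /v2/members lists the peers that joined via SRV discovery
for member in requests.get("{}/v2/members".format(ENDPOINT)).json()["members"]:
    print(member["name"], member["peerURLs"], member["clientURLs"])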

Congratulations, your etcd deployment is now autoscaled.