Rancher - Creating a highly available container orchestration cluster on AWS

Note: In these examples I'm naming the project/org tastycidr, but this was a project done at work...aka not tastycidr.net

Approximately a year ago, my organisation began development of a new platform based on a containerised microservices architecture. For us, it was uncharted territory - while some of our developers had experience using Docker, we had no institutional experience with deploying an architecture like this in a production environment. Early on in the development phase, we realised we were going to need an orchestration platform. We considered a few (Kubernetes, EC2 Container Service) before ultimately settling on Rancher.

Our deployment went through a couple of iterations before we settled on a method for automating and scaling it - that final iteration is what this article describes. This article isn't a comprehensive how-to and assumes knowledge of AWS, Terraform, Ansible and networking. My goal in writing it is to document and share information that should help a prospective Rancher user deal with some of the gaps in the Rancher docs and the gotchas that we encountered.


TL;DR:

AWS Resources

  • Application Load Balancer and corresponding DNS entry
  • ACM Certificate so we can terminate SSL at our ALB
  • ALB Target Group
  • 2x ALB Listeners, one for HTTP and one for HTTPS
  • 3x or 5x EC2 instances, all attached to the ALB Target Group
  • Multiple Autoscaling Groups and Launch Configurations for container hosts
  • An RDS database running MySQL

Configuration

  • A generic Packer template and resultant AMI, used by all container host launch configurations
  • Several Ansible roles, group variables, and a pair of playbooks
  • Several Ansible Tower job templates
  • A couple of scripts (Python, bash) to automate initialising the control cluster, joining container hosts to that cluster, and removing them when it's time to scale down

TF Repo File Structure

There are many valid ways of storing Terraform code - my experience has been that every org does it a little differently. We've gone through a couple of iterations, but ultimately we settled on segregating our environments into manageable chunks, as opposed to one monolithic environment and state, or one based on sharing data via the TF module system. This arguably makes our code less reusable (reusability being the main advantage of modules), but it effectively limits the blast radius, so you can automate terraform apply without being terrified. We adopted this methodology after reading and discussing this excellent article by Charity Majors of honeycomb.io.

In this case, the parent VPC is its own TF environment, which contains only the definition of the VPC itself, the internet gateway, subnets, access control lists, and a super basic route table.

We also separate out Security Groups into their own directory and tfstate. Depending on how you organise your code, this may or may not be wise or necessary - for example, if you fully follow the module-based TF model, with dependency definitions etc., you can mix your SGs into other code. Because we've chosen to use many directories with many tfstates, it's simplest to keep all our security group definitions in one state to avoid chicken-and-egg dependency problems ("I can't create this SG because one of its inbound rules references an SG that doesn't exist yet", and so on).

The finished repo file structure looks like this (output from tree -aF)

.
├── base/
│   ├── base-vars.tf
│   └── data-sources.tf
├── vpc/
│   ├── acl.tf
│   ├── base-vars.tf -> ../base/base-vars.tf
│   ├── iam.tf
│   ├── routing.tf
│   ├── subnets.tf
│   ├── .terraform/
│   │   └── terraform.tfstate
│   └── vpc.tf
├── dev/ #truncated, it's identical to prod except resource names
├── infrastructure/ #truncated, rancher doesn't care
├── prod/
│   ├── elasticache/ #truncated, rancher doesn't care
│   ├── elasticsearch/ #truncated, rancher doesn't care
│   ├── application-alb/
│   │   ├── base-vars.tf -> ../../base/base-vars.tf
│   │   ├── prod-internal-alb.tf
│   │   ├── data-sources.tf -> ../../base/data-sources.tf
│   │   ├── prod-application-alb.tf
│   │   └── .terraform/
│   │       └── terraform.tfstate
│   ├── rabbitmq/ #truncated, rancher doesn't care
│   ├── rancher/
│   │   ├── base-vars.tf -> ../../base/base-vars.tf
│   │   ├── data-sources.tf -> ../../base/data-sources.tf
│   │   ├── prod-asg-crawling.tf
│   │   ├── prod-asg-global_web.tf
│   │   ├── prod-rancherctl-alb.tf
│   │   ├── prod-rancherctl.tf
│   │   ├── .terraform/
│   │   │   └── terraform.tfstate
│   │   └── userdata/
│   │       ├── prod-rancher-crawling.sh
│   │       └── prod-rancher-global_web.sh
│   └── rds/
│       ├── application-rds/
│       │   ├── base-vars.tf -> ../../../base/base-vars.tf
│       │   ├── data-sources.tf -> ../../../base/data-sources.tf
│       │   ├── application-rds.tf
│       │   └── .terraform/
│       │       └── terraform.tfstate
│       └── rancher-rds/
│           ├── base-vars.tf -> ../../../base/base-vars.tf
│           ├── data-sources.tf -> ../../../base/data-sources.tf
│           ├── rancher-rds.tf
│           └── .terraform/
│               └── terraform.tfstate
└── security-groups/
    ├── base.tf
    ├── base-vars.tf -> ../base/base-vars.tf
    ├── data-sources.tf
    ├── dev.tf
    ├── infrastructure.tf
    ├── production.tf
    ├── .terraform/
    │   └── terraform.tfstate
    └── vault.tf

File Structure Notes:

Note that outside the base/ dir, anywhere you see base-vars.tf or data-sources.tf, it's a symlink back into base/. This makes it easy to maintain a single source of truth for certain project-wide variables and ensures that any other Terraform environment can get access to output variables if required.

We are using Terraform's Remote State functionality - state files are stored in S3. This allows us to only commit the skeleton of our state files in git, containing a reference to the full state information in S3:

{
    "version": 3,
    "serial": 1,
    "remote": {
        "type": "s3",
        "config": {
            "bucket": "tastycidr-tfstate",
            "key": "prod/rancher.tfstate",
            "region": "us-east-1"
        }
    },
    "modules": []
}

The astute will no doubt notice that I've emptied out the list of modules and removed both the lineage and minimum TF version lines. These are both in your tfstate files by default to prevent accidental damage to your infrastructure via the terraform apply command - we've done this knowing the risks in exchange for convenience and haven't had any issues as a result (yet), but your mileage may vary.

When you plan, refresh or apply, Terraform looks in this tfstate file, sees that the actual state information is stored in S3, fetches it seamlessly, and populates the local copy of the file. The downside to this approach is that each time you run a refresh, plan or apply, you'll need to ensure that you don't accidentally commit the now fully populated local state file to git. This can result in both the accidental disclosure of secrets in your source code and unnecessarily verbose pull request diffs.
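
If you want a guard rail here, a pre-commit hook can refuse the commit whenever a staged tfstate file contains more than the skeleton. A rough sketch in Python - this isn't something we actually ship, just an illustration of the idea (saved as .git/hooks/pre-commit and made executable):

#!/usr/bin/env python
"""Refuse to commit any tfstate file that contains more than the remote-state skeleton."""
import json
import os
import subprocess
import sys

# List the files staged for this commit
staged = subprocess.check_output(
    ["git", "diff", "--cached", "--name-only"]
).decode().splitlines()

for path in staged:
    if not path.endswith(".tfstate") or not os.path.exists(path):
        continue
    with open(path) as f:
        state = json.load(f)
    # A freshly refreshed state file will have resources inside its modules list
    if any(module.get("resources") for module in state.get("modules", [])):
        sys.stderr.write("Refusing to commit populated state file: {}\n".format(path))
        sys.exit(1)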


base-vars.tf and data-sources.tf

base-vars.tf holds variables that basically never change and which I want accessible across the whole project

variable "region" {  
  default = "us-east-1"
}

variable "default_avail_zone" {  
  default = "us-east-1b"
}

variable "pv_amis_1404" {  
  default = {
    us-east-1 = "ami-7388cd19"
  }
}

variable "hvm_amis_1404" {  
  default = {
    us-east-1 = "ami-0f8bce65"
  }
}

variable "r53_zoneid" {  
  default = "XXXXXXXXXXXXXXXX"
}

variable "corporate-cidr" {  
  default = "XX.XXX.XX.XXX/32"
}

variable "corp-guest-cidr" {  
  default = "XXX.XXX.XX.XX/32"
}

variable "vpn-cidr" {  
  default = "XX.XXX.XXX.XX/32"
}

variable "github-cidr" {  
  default = "192.30.252.0/22"
}

data-sources.tf could hold any type of Terraform Data Source, but in this case I am only using it for remote state references:

data "terraform_remote_state" "vpc" {  
    backend    = "s3"
    config {
        bucket = "tastycidr-tfstate"
        key    = "vpc.tfstate"
    }
}

data "terraform_remote_state" "securitygroups" {  
    backend    = "s3"
    config {
        bucket = "tastycidr-tfstate"
        key    = "securitygroups.tfstate"
    }
}

data "terraform_remote_state" "prod-application-alb" {  
    backend    = "s3"
    config {
        bucket = "tastycidr-tfstate"
        key    = "prod/application-alb.tfstate"
    }
}

data "terraform_remote_state" "prod-rancher" {  
    backend    = "s3"
    config {
        bucket = "tastycidr-tfstate"
        key    = "prod/rancher.tfstate"
    }
}

These, plus the outputs I mentioned above, are what allow us to reference data from another environment's tfstate. For example, later on, I'll use these data sources to reference security group IDs for launch configs and EC2 instance configurations - as well as the VPC's ID, subnet IDs, etc.


The VPC

There is plenty of documentation available on standing up a VPC, so I'll leave that outside the scope of this article. Suffice it to say that you'll need:

  • a VPC
  • at least 2 subnets (so you can make a subnet group)
  • an ACL for each subnet
  • an internet gateway
  • A route table and associations for each subnet
  • Terraform outputs for any resource you may want to reference in another part of your TF environment. I assign outputs for subnets, subnet groups and the VPC ID itself.

MySQL

Rancher stores basically everything of value in a MySQL database. You can roll your own if you like, but we've opted to use RDS! Nothing too fancy here - note that all DNS names and sensitive credentials have been replaced with fabulous yet fake data.

First, your basic security group, allowing inbound traffic on 3306 (the MySQL default) from both our Rancher security group and our office/backup/VPN endpoints:

resource "aws_security_group" "prod-rds" {  
    name                = "prod-rds"
    description         = "prod-rds"
    vpc_id              = "${data.terraform_remote_state.vpc.vpc-id}"

    ingress {
        from_port       = 3306
        to_port         = 3306
        protocol        = "tcp"
        cidr_blocks     = ["${var.corporate-cidr}", "${var.corp-guest-cidr}", "${var.vpn-cidr}"]
        security_groups = ["${aws_security_group.prod-rancher.id}"]
        self            = false
    }

    tags {
        Name            = "prod-rds"
        Environment     = "prod"
        Terraform       = "True"
    }
}

output "prod-rds-sg-id" {  
    value               = "${aws_security_group.prod-rds.id}"
}

Then of course, the RDS instance itself:

resource "aws_db_instance" "prod-rancher-rds" {  
    identifier              = "prod-rancher-rds"
    allocated_storage       = 100
    storage_type            = "gp2"
    engine                  = "mysql"
    engine_version          = "5.7.11"
    instance_class          = "db.m3.medium"
    name                    = "prodrancher"
    username                = "hasselhoff"
    password                = "KnightRider#1"
    port                    = 3306
    publicly_accessible     = true
    # variable default_avail_zone is defined in our base-vars.tf
    availability_zone       = "${var.default_avail_zone}"
    # Here I get the SG ID for our RDS security group from a TF remote state data source
    vpc_security_group_ids  = [
            "${data.terraform_remote_state.securitygroups.prod-rds-sg-id}"
            ]
    db_subnet_group_name    = "${data.terraform_remote_state.vpc.prod-rds-snetgroup}"
    parameter_group_name    = "default.mysql5.7"
    multi_az                = true
    backup_retention_period = 14
    backup_window           = "00:00-01:00"
    maintenance_window      = "mon:01:30-mon:02:30"
    tags {
        Environment         = "prod"
        Type                = "prod-rancher"
        Terraform           = "true"
    }
}

resource "aws_route53_record" "prod-rancher-rds" {  
    zone_id = "${var.r53_zoneid}"
    name    = "prod-rancher-rds.tastycidr.net"
    type    = "CNAME"
    ttl     = "300"
    records = ["${aws_db_instance.prod-rancher-rds.address}"]
}

Here, we've told RDS to take a nightly snapshot between 00:00 and 01:00 UTC and to hold onto each snapshot for 14 days. Obviously in a production setup that needs to be highly available, you'll want multi-az enabled to protect against meteor strikes etc.


Application Load Balancer

We use an ALB to front several hosts running the rancher/server service. When we stand up our control instances (three of them), they will be members of an ALB Target Group. That target group has a port assigned and a health check. By default, the rancher/server container listens on 8080 and will respond 200 OK at /ping, so we configure the target group and health check accordingly. We will also need a listener, which is configured with an external-facing port (443 for HTTPS), a free SSL cert from ACM, and one or more rules that tell it which target group receives what traffic.

First, an SG:

resource "aws_security_group" "prod-rancher-alb" {  
    name                = "prod-rancher-alb"
    description         = "prod-rancher-alb"
    vpc_id              = "${data.terraform_remote_state.vpc.vpc-id}"

    ingress {
        from_port       = 443
        to_port         = 443
        protocol        = "tcp"
        cidr_blocks     = ["${var.corporate-cidr}", "${var.corp-guest-cidr}", "${var.vpn-cidr}"]
        self            = false
    }

    egress {
        from_port       = 0
        to_port         = 0
        protocol        = "-1"
        cidr_blocks     = ["0.0.0.0/0"]
    }

    tags {
        Name            = "prod-rancher-alb"
        Environment     = "prod"
        Terraform       = "True"
    }
}

output "prod-rancher-alb-sg-id" {  
    value               = "${aws_security_group.prod-rancher-alb.id}"
}

This next section contains several pieces - starting with a Terraform data source that searches, filters, and ultimately finds a pre-existing certificate that I've requested from AWS Certificate Manager. These certs are only usable with AWS services like ELBs/ALBs, but they're free and zero-maintenance once issued. The only unfortunate thing about them is that, due to the verification step (AWS emails a preset list of @yourdomain.com addresses to ensure you own the domain), you cannot directly create an ACM cert via Terraform.

data "aws_acm_certificate" "prod-rancherctl" {  
  domain = "rancher.tastycidr.net"
  statuses = ["ISSUED"]
}

resource "aws_alb" "prod-rancherctl" {  
    name                       = "prod-rancherctl"
    enable_deletion_protection = true
    # Long-ish idle timeout here. Default is 60s, kind of irritating when connecting to control instances via SSH.
    idle_timeout               = 900
    security_groups            = ["${data.terraform_remote_state.securitygroups.prod-rancher-alb-sg-id}"]
    subnets                    = [
            "${data.terraform_remote_state.vpc.prod-subnet-id}",
        ]
    tags {
        Usage                  = "prod-rancherctl"
        Environment            = "prod"
        Terraform              = "True"
    }
}

Note the "default_action" on the listener - we want all traffic that hits the ALB to be forwarded to the control-node target group behind it, so this default action is sufficient. An ALB can have many such rules if so desired.

resource "aws_alb_listener" "prod-rancherctl-HTTPS" {  
  load_balancer_arn = "${aws_alb.prod-rancherctl.arn}"
  port              = "443"
  protocol          = "HTTPS"
  ssl_policy        = "ELBSecurityPolicy-2015-05"
  certificate_arn   = "${data.aws_acm_certificate.prod-rancherctl.arn}"

  default_action {
    target_group_arn = "${aws_alb_target_group.prod-rancherctl.arn}"
    type             = "forward"
  }
}

resource "aws_alb_target_group" "prod-rancherctl" {  
    name             = "prod-rancherctl"
    port             = 8080
    protocol         = "HTTP"
    vpc_id           = "${data.terraform_remote_state.vpc.vpc-id}"
    health_check {
        healthy_threshold   = 2
        unhealthy_threshold = 2
        timeout             = 5
        path                = "/ping"
        interval            = 10
    }
}

This next section is a bit clumsy on my part. You'll see that I've created three separate target group attachment resources (1 for each rancher control node). Those initiated in the ways of AWS might rightly point out that it would be simpler to create an autoscaling group that places all its instances in the above target group. It would not be difficult to implement this, but I haven't yet - so here it is:

resource "aws_alb_target_group_attachment" "prod-rancherctl1" {  
    target_group_arn = "${aws_alb_target_group.prod-rancherctl.arn}"
    target_id        = "${aws_instance.prod-rancherctl1.id}"
    port             = 8080
}

resource "aws_alb_target_group_attachment" "prod-rancherctl2" {  
    target_group_arn = "${aws_alb_target_group.prod-rancherctl.arn}"
    target_id        = "${aws_instance.prod-rancherctl2.id}"
    port             = 8080
}

resource "aws_alb_target_group_attachment" "prod-rancherctl3" {  
    target_group_arn = "${aws_alb_target_group.prod-rancherctl.arn}"
    target_id        = "${aws_instance.prod-rancherctl3.id}"
    port             = 8080
}

resource "aws_route53_record" "prod-rancherctl-elb" {
    zone_id = "${var.r53_zoneid}"
    name    = "rancher.tastycidr.net"
    type    = "CNAME"
    ttl     = "300"
    records = ["${aws_alb.prod-rancherctl.dns_name}"]
}

output "prod-rancher-fqdn" {  
    value               = "${aws_route53_record.prod-rancherctl-elb.fqdn}"
}

Control Nodes

That does it for the ALB itself, but as hinted just above, we also need three EC2 instances to function as control nodes. These won't run any containers themselves (aside from those required by the Rancher HA system itself). Three control nodes give us a fault tolerance of one failed instance - if we wanted to tolerate two failures, we'd need to increase the host count to five. This is another reason why, as I mentioned above, it would be better to have our control nodes be members of an autoscaling group. However, I digress... for the sake of brevity, I'm only listing one node + DNS here.

resource "aws_instance" "prod-rancherctl1" {  
    ami                         = "${lookup(var.hvm_amis_1404, var.region)}"
    iam_instance_profile        = "${aws_iam_instance_profile.prod-rancher.name}"
    availability_zone           = "${var.default_avail_zone}"
    ebs_optimized               = false
    instance_type               = "m3.large"
    monitoring                  = false
    key_name                    = "somekeyname"
    subnet_id                   = "${data.terraform_remote_state.vpc.prod-subnet-id}"
    associate_public_ip_address = true
    vpc_security_group_ids      = [
        "${data.terraform_remote_state.securitygroups.ssh-sg-id}", 
        "${data.terraform_remote_state.securitygroups.prod-rancher-sg-id}"
    ]

    root_block_device {
        volume_type             = "gp2"
        volume_size             = 10
        delete_on_termination   = true
    }

    ephemeral_block_device {
        "device_name"           = "/dev/xvdb"
        "virtual_name"          = "ephemeral0"
    }

    tags {
        Usage                 = "Rancher Control Instance"
        Environment           = "prod"
        Name                  = "prod-rancherctl1.tastycidr.net"
        Terraform             = "true"
        Ubuntu-Release        = "14.04"
    }
}

resource "aws_route53_record" "prod-rancherctl1-CNAME" {  
    zone_id = "${var.r53_zoneid}"
    name    = "prod-rancherctl1.tastycidr.net"
    type    = "CNAME"
    ttl     = "300"
    records = ["${aws_instance.prod-rancherctl1.public_dns}"]
}

You, the eagle-eyed reader, may have noticed that I referenced an IAM Instance Profile in that block above. Rancher doesn't NEED any IAM permissions in order to function; however, you may need them to take advantage of services such as S3, EBS, the EC2 Container Registry, etc. All EC2 instances should have an IAM role and IAM Instance Profile, even if you don't attach any policies to the role (you might later on). In this case, we're using ECR to store all our custom containers, so we'll need a role policy to enable access to ECR.

resource "aws_iam_instance_profile" "prod-rancher" {  
    name  = "prod-rancher"
    roles = ["${aws_iam_role.prod-rancher.name}"]
}

resource "aws_iam_role" "prod-rancher" {  
    name = "prod-rancher"
    path = "/"
    assume_role_policy = <<EOF
{
    "Version": "2012-10-17",
    "Statement": [
        {
           "Sid": "",
           "Effect": "Allow",
           "Principal": {
                "Service": "ec2.amazonaws.com"
           },
           "Action": "sts:AssumeRole"
        }
    ]
}
EOF  
}

resource "aws_iam_role_policy" "prod-rancher" {  
    name = "prod-rancher"
    role = "${aws_iam_role.prod-rancher.id}"
    policy = <<EOF
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ecr:List*",
                "ecr:Describe*",
                "ecr:Get*",
                "ecr:BatchGetImage"
            ],
            "Resource": [
                "*"
            ]
        }
    ]
}
EOF  
}

Enter Ansible

Now that all our AWS pieces are in place, it's time to provision our control hosts. This is fairly straightforward, and there are several ways to accomplish it even just within the scope of Ansible. Because I've opted in this case to define static hosts to use as Rancher control nodes, I have the option of using an inventory. If I were to convert to an autoscaling group, the hosts would have to be provisioned by something like Ansible Tower as they were scaled up, since the Rancher HA init script I'll describe further down needs to know each instance's internal IP address. In this case, I'll be using an inventory with some group vars for defining key data.

Inventory:

[prod:children]
rancher-control-hosts

[rancher-control-hosts:vars]
# We use sumologic for log aggregation. This could also be a group var
sumo_role=Prod/RancherControl

[rancher-control-hosts]
prod-rancherctl1.tastycidr.net  
prod-rancherctl2.tastycidr.net  
prod-rancherctl3.tastycidr.net  

And a group vars file that applies to all members of my production rancher-control-hosts group. This contains information that rancher needs to connect to our RDS instance.

rancher_rds_url: prod-rancher-rds.tastycidr.net  
rancher_rds_user: hasselhoff  
rancher_rds_pass: KnightRider#1  
rancher_rds_dbname: prodrancher  
rancher_agent_port: 8080  

Next, you'll need Docker installed and configured on each of those instances. There's a ton of information out there on how to accomplish this, so I'll leave it outside the scope of this article as well. We do it via a playbook and multiple Ansible roles:

- hosts: rancher-control-hosts
  become: true
  gather_facts: true
  roles:
      - { role: base-server, tags: [ 'base-server' ] }
      - { role: docker-host, tags: [ 'docker-host' ] }
      - { role: rancher-control-host, tags: [ 'rancher-control-host' ] }
      - { role: sumologic-logging-base, tags: ['sumologic'] }
      - { role: datadog, when: datadog_enabled | default(false), tags: ['datadog'] }

The base-server role handles things like laying down users/SSH keys for our Devs and Ops teams, as well as disk handling etc. docker-host is fairly self-explanatory: it installs and configures Docker. Sumologic and Datadog are log/metrics aggregation tools. Of course, that leaves our rancher-control-host role - which is extremely simple!

tasks/main.yml:

- name: Gather EC2 Facts
  action: ec2_facts

- name: lay down Rancher HA init script
  template: src=rancher-ha-init.sh.j2 dest=/usr/local/lib/rancher-ha-init.sh owner=root mode=0755

- name: init the Rancher HA cluster
  command: sh /usr/local/lib/rancher-ha-init.sh
  when: rancher_ha_init in ['true', true]

- wait_for:
    port: "{{ rancher_agent_port }}"
    delay: 30

And the template script laid down by the task above - thanks to improvements made by the Rancher developers in v1.2, this is now much simpler!

templates/rancher-ha-init.sh.j2:

#!/bin/bash
docker run -d --restart=unless-stopped -p {{ rancher_agent_port }}:{{ rancher_agent_port }} -p 9345:9345 rancher/server:stable \
--db-host {{ rancher_rds_url }} --db-port 3306 --db-user {{ rancher_rds_user }} --db-pass {{ rancher_rds_pass }} --db-name {{ rancher_rds_dbname }} \
--advertise-address {{ ansible_ec2_local_ipv4 }}

That template gets laid down with the interpolated vars from our group vars file; ansible_ec2_local_ipv4 is the internal IP of each instance. Ansible then runs the script. The result should be that each control host pulls the rancher/server:stable image from the Docker registry and starts it, feeding in the interpolated values. If all goes well, the containers come up and start responding on port 8080, the target group health check starts passing, and the ALB begins routing traffic to the control nodes. Once all three instances are in service, it's time to log in via the web portal and do a little bit of additional setup.
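
As an aside, if a node ever refuses to go into service, the target group's health check is trivial to reproduce by hand. A rough sketch in Python (run it from somewhere the prod-rancher security group allows in on 8080):

import requests

# The same check the ALB target group performs: expect 200 OK from /ping on 8080
control_hosts = [
    "prod-rancherctl1.tastycidr.net",
    "prod-rancherctl2.tastycidr.net",
    "prod-rancherctl3.tastycidr.net",
]

for host in control_hosts:
    try:
        resp = requests.get("http://{}:8080/ping".format(host), timeout=5)
        status = "healthy" if resp.status_code == 200 else "unhealthy ({})".format(resp.status_code)
    except requests.exceptions.RequestException as exc:
        status = "unreachable ({})".format(exc)
    print("{}: {}".format(host, status))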


The Rancher GUI

I'm not going to cover this in great detail as it is already fairly well documented and may be different for you depending on your needs. For us, this step involves:

  • Setting up Access Control - we like to use GitHub's OAuth, which is built into Rancher. I log in with my own GitHub account, manually add anybody who gets unrestricted Admin access (basically the Operations team), and then add our GitHub organisation under the "restricted" role for our main environment.
  • Set up an "environment" - this will contain one or more "stacks", which are collections of service containers
  • Configure a "Registry" - as mentioned earlier, we use EC2 Container Registry
  • Add the ECR Login Helper "stack" - annoying ECR requirement, but there's already a stack available to handle keeping you authenticated to ECR. No big deal.
  • Configure a set of API credentials for the environment. We'll use these later on...

I make a note of the Environment ID, as we'll need it to configure our container hosts. You can get this from the URL while using your environment, or from the Rancher API.
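
If you'd rather script that lookup, the environment API keypair from the previous step can fetch it. A rough sketch against the v1 API - the environment name "Production" is just an example, and the key values are the same fake ones used later on:

import requests
from requests.auth import HTTPBasicAuth

RANCHER_URL = "https://rancher.tastycidr.net"
ACCESS_KEY = "FNWF392014490DNQQLDK"
SECRET_KEY = "4489few1248932nioenworn2384yRWOIFOKNFKLW23"

def get_environment_id(name):
    """Return the ID of the Rancher environment (project) with the given name."""
    resp = requests.get(
        "{}/v1/projects".format(RANCHER_URL),
        auth=HTTPBasicAuth(ACCESS_KEY, SECRET_KEY),
    )
    resp.raise_for_status()
    for project in resp.json()["data"]:
        if project["name"] == name:
            return project["id"]  # e.g. "1a5"
    raise ValueError("No environment named {}".format(name))

print(get_environment_id("Production"))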


Container Hosts

We're now at the stage where we will need docker-ready hosts for running containers. Rancher is quite flexible in this area - you can:

  • just add a couple hosts manually
  • give Rancher some Amazon API keys and let it handle provisioning new hosts for you
  • or, use EC2 autoscaling + cloudwatch alarms for more advanced scaling logic

We've opted for option 3. We want to have multiple groups of hosts that scale independently of one another, run different groups of service containers, and have scaling policies based on different metrics. We may need to add many Celery workers, and thus many hosts for them, while our front-end (i.e. web portal) traffic remains fairly constant. Lucky for us, Rancher provides tools that make this possible - the io.rancher.scheduler.affinity:host_label attribute in Rancher Compose, for instance.

Building an AMI

We use Packer + Ansible to build our AMI. We pre-install all packages and lay down as much configuration as we can beforehand. We store some variables in a base-vars.json that are common across all our AMI builds. The AMI IDs here are from a line of base images maintained by Canonical. We use them as a foundation and then configure them to meet our needs via ansible.

{
    "pv_amis_1204": "ami-c07985a8",
    "pv_amis_1404": "ami-0a8acf60",
    "hvm_amis_1204": "ami-f478849c",
    "hvm_amis_1404": "ami-7b89cc11",
    "ssh-security-group": "sg-a12345678",
    "pip-client-security-group": "sg-b23456789"
}

That pip-client-security-group is used to allow access to our Pypi server, which hosts our custom Pip packages.

{
  "builders": [
    {
      "type": "amazon-ebs",
      "region": "us-east-1",
      "ami_regions": ["us-east-1"],
      "source_ami": "{{user `hvm_amis_1404`}}",
      "ami_virtualization_type": "hvm",
      "force_deregister": true,
      "instance_type": "m3.xlarge",
      "ssh_username": "ubuntu",
      "security_group_ids": [
        "{{user `ssh-security-group`}}",
        "{{user `pip-client-security-group`}}"
        ],
      "ami_name": "packer-prod-rancher-containerhost_{{timestamp}}"
    }
  ],
  "provisioners": [
    {
      "type": "ansible",
      "playbook_file": "../ansible/build-rancher-containerhost.yml",
      "groups": ["prod","rancher-container-hosts"],
      "extra_arguments": [ 
        "--extra-vars",
        "amibuilder=true sumo_enabled=true"
      ],
      "user": "ubuntu"
    }
  ]
}

The amibuilder=true flag is used in places inside our Ansible code to deliberately do (or not do) certain actions when we are building an AMI (vs simply running a deploy from Tower). Once we have these in place, we generate an AMI with packer build -var-file=base-vars.json packertemplatefile.json. Packer creates an EC2 instance, runs our Ansible playbook against it (getting all group vars from prod and rancher-container-hosts), and then generates an AMI once everything in the provisioners block has completed. Below, we use a Terraform data source to find the latest version of this AMI and use it in the creation of our launch configuration. The Ansible playbook - build-rancher-containerhost.yml - does more or less the same thing as the control host playbook: it preps the machine for running Docker containers.

The Autoscaling Group

We have multiple autoscaling groups for multiple functions - here, I'll only be illustrating one; the principle is the same no matter how many ASGs you need. In Terraform, we'll need to fetch the latest AMI ID, then define another IAM Instance Profile, a launch configuration, and an autoscaling group. In this case, the ASG/container service we want to scale is a fleet of Celery workers:

data "aws_ami" "prod-rancher-celery" {  
  most_recent = true
  filter {
    name   = "name"
    values = ["packer-prod-rancher-containerhost_*"]
  }
  filter {
    name   = "virtualization-type"
    values = ["hvm"]
  }
  # Your Amazon account ID, assuming you own the AMI in question
  owners = ["XXXXXXXXXXXX"]
}

# In this case we're using a generic rancher container host IAM role
# But having separate instance profiles allows us to segregate them later without stopping any instances

resource "aws_iam_instance_profile" "prod-rancher-celery" {  
  name  = "prod-rancher-celery"
  roles = ["rancher-container-host"]
}

resource "aws_launch_configuration" "prod-rancher-celery" {  
  name_prefix   = "prod-rancher-celery_"
  image_id      = "${data.aws_ami.prod-rancher-celery.id}"
  instance_type = "m3.xlarge"
  user_data     = "${file("./userdata/prod-rancher-celery.sh")}"

  security_groups = [
    "${data.terraform_remote_state.securitygroups.ssh-sg-id}",
    "${data.terraform_remote_state.securitygroups.prod-rancher-sg-id}"
  ]

  key_name             = "somekeyname"
  iam_instance_profile = "${aws_iam_instance_profile.prod-rancher-celery.name}"

  lifecycle {
    create_before_destroy = true
  }

  ephemeral_block_device {
    "device_name"  = "/dev/xvdb"
    "virtual_name" = "ephemeral0"
  }

  ephemeral_block_device {
    "device_name"  = "/dev/xvdc"
    "virtual_name" = "ephemeral1"
  }
}

resource "aws_autoscaling_group" "prod-rancher-celery" {  
  availability_zones   = ["${var.default_avail_zone}"]
  name                 = "prod-rancher-celery"
  # These min and max sizes are of course arbitrary and must be tailored to your workload
  max_size             = 25
  min_size             = 5
  force_delete         = true
  vpc_zone_identifier  = ["${data.terraform_remote_state.vpc.prod-subnet-id}"]
  launch_configuration = "${aws_launch_configuration.prod-rancher-celery.name}"

  tag {
    key                 = "Name"
    value               = "prod-rancher-celery"
    propagate_at_launch = true
  }

  tag {
    key                 = "Environment"
    value               = "prod"
    propagate_at_launch = true
  }

  tag {
    key                 = "Usage"
    value               = "celery"
    propagate_at_launch = true
  }

  tag {
    key                 = "Terraform"
    value               = "true"
    propagate_at_launch = true
  }

  tag {
    key                 = "Ubuntu-Release"
    value               = "14.04"
    propagate_at_launch = true
  }
}

The launch config defined above references a "userdata" shell script - just a shell script run by cloud-init to execute some arbitrary code when an instance first starts.

#!/bin/bash

PYTHONPATH=. python /usr/local/lib/provision.py http://ansible.tastycidr.net 1092 72c22198ej298u149dde22 941 3 someusername somepassword  

Best explained by the script's own docopt string:

"""
ansible/roles/base-server/files/provision.py

This script calls home to Ansible Tower and causes it to provision the server.  
If it cannot successfully call home to Ansible Tower for 10 minutes, it  
triggers an alert.

Usage:  
    provision.py <base_url> <job_template> <host_key> <group_id> <inventory_id>
        <username> <password>
"""

provision.py itself is a >300-line chunk of Python and outside the scope of this article. The tl;dr version is that it joins the host to a group and inventory and then runs a deploy job. The Ansible Tower job itself is fairly simple - almost everything is pre-configured and already present via the AMI when the new instance comes up, so all that's left to do is lay down a couple of scripts: one for joining the container host to the correct Rancher environment, and one for removing it from the cluster when we scale down (more on that later!).

The role we've created for joining a new container host to our rancher cluster is very simple:

- name: Rancher | Lay Down scripts for use with Autoscaling
  template: src={{ item.0 }} dest={{ item.1 }} mode=0755 owner=root
  with_together:
    - "{{ rancher_asg_sources }}"
    - "{{ rancher_asg_dests }}"
  tags:
    - rancher_asg

- meta: flush_handlers

- name: run rancher startup scripts
  command: sh /usr/local/lib/rancher-join-cluster.sh
  when: amibuilder not in ['true', true]

That lays down the above mentioned scripts and then runs the one which joins our host to the cluster (unless you're currently building an AMI!). It uses these group vars:

rancher_registration_url: https://rancher.tastycidr.net  
rancher_agent_version: 1.1.2  
rancher_project_id: 1a5  
rancher_access_key: FNWF392014490DNQQLDK  
rancher_secret_key: 4489few1248932nioenworn2384yRWOIFOKNFKLW23  
rancher_agent_port: 8080  
rancher_chost_role: crawling  

Which populate this template:

#!/bin/bash
## some host labels as needed
CATTLE_HOST_LABELS="role={{ rancher_chost_role }}&env={{ env }}"  
VOLUMES="-v /var/run/docker.sock:/var/run/docker.sock -v /var/lib/rancher:/var/lib/rancher"

RANCHER_REG_URL=$(curl -su {{ rancher_access_key }}:{{ rancher_secret_key }} {{ rancher_registration_url }}/v1/registrationtokens?projectId={{ rancher_project_id }} | head -1 | grep -nhoe 'registrationUrl[^},]*}' | egrep -hoe 'https?:.*[^"}]' | sed 's/\\//g')

docker run \
-e CATTLE_HOST_LABELS="$CATTLE_HOST_LABELS" \
-e CATTLE_AGENT_IP="{{ ansible_ec2_local_ipv4 }}" \
$VOLUMES \
rancher/agent:v{{ rancher_agent_version }} \
$RANCHER_REG_URL

Basically, this script gets a registration URL from the control cluster, then uses it (along with a volume mount string and the labels we define) to run a Docker container, which joins the machine to the control cluster. At some point I intend to rewrite this script in Python, since proper error handling is much easier to implement there than in that ugly bash one-liner.
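
For what it's worth, the Python version would boil down to something like this - a sketch only, it hasn't actually replaced the bash template yet:

import requests
from requests.auth import HTTPBasicAuth

def get_registration_url(rancher_url, access_key, secret_key, project_id):
    """Ask the control cluster for this environment's agent registration URL."""
    resp = requests.get(
        "{}/v1/registrationtokens?projectId={}".format(rancher_url, project_id),
        auth=HTTPBasicAuth(access_key, secret_key),
    )
    resp.raise_for_status()
    tokens = resp.json()["data"]
    if not tokens:
        raise RuntimeError("No registration tokens found for project {}".format(project_id))
    # The same value the grep/sed pipeline extracts, minus the escaped slashes
    return tokens[0]["registrationUrl"]

if __name__ == "__main__":
    print(get_registration_url(
        "https://rancher.tastycidr.net",
        "FNWF392014490DNQQLDK",
        "4489few1248932nioenworn2384yRWOIFOKNFKLW23",
        "1a5",
    ))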

You'll note that I'm dropping in CATTLE_HOST_LABELS in the template above. This is how we tell Rancher Compose which container hosts will run which containers. We label the host during the Ansible Tower deploy process via that template, and, as mentioned earlier, the label will match the io.rancher.scheduler.affinity:host_label value that we define for a particular service in our Rancher Compose file. Each autoscaling group has different value(s) for rancher_chost_role, allowing us to scale each service's group of container hosts independently of the others.

Once our Tower deploy is complete and the rancher-join-cluster.sh script has run, you ought to see the new host in the Hosts panel of the Rancher GUI:

[Screenshot: two newly joined container hosts in the Rancher hosts panel, each showing its env and role host labels]

Note that env is set to prod for both hosts, but the role differs. That label tells Rancher where to deploy containers based on our Rancher Compose instructions! As each host is added to the cluster, the appropriate stacks and containers are propagated to it.

Scaling Down

One flaw we have encountered with Rancher (at least up to 1.2) is that when we scale down a container host via EC2, the Rancher web panel doesn't seem to notice in an acceptable time frame. I don't know why this eventuality isn't handled by Rancher by default, but our experience was that after we terminated a container host, the Rancher UI didn't even tell us that the stacks were unhealthy - never mind that the host no longer existed. As a result, I wrote some code to pull a container host out of the cluster on shutdown:

"""
rancher-deprovision.py 

This script calls home to the Rancher control cluster and causes it to deprovision this container host.

Usage:  
    rancher-deprovision.py (--rancherurl=<u>) (--accesskey=<a>) (--secretkey=<s>) (--project=<p>)

Options:  
    -u --rancherurl=<u> URL of the rancher control server/ELB
    -a --accesskey=<a>  Rancher Access Key
    -s --secretkey=<s>  Rancher Secret Key
    -p --project=<p>    Project ID of your rancher environment
    -h --help           display this help text
"""

import requests, json, sys, socket, os, traceback, time  
from requests.auth import HTTPBasicAuth  
from docopt import docopt  
from create_pagerduty_incident import trigger_incident

PD_OPSLOW_INTKEY = "1901u0a902304890509180948109480918a"

# Path for a file laid down by the amibuilder ansible job. This file should not
# exist on a live box, as it is laid down on ephemeral volume by amibuilder

amibuilderfile = "/mnt/tmp/amibuilder.txt"  
instance_type_path = "instance-type"

class TooManyMatches(Exception):  
    pass

class NoMatches(Exception):  
    pass

class FailureInPostAction(Exception):  
    pass

def get_hostname():  
    return socket.gethostname()

def post_request_to_rancher(rancher_url, project_id, host_id, action, desiredstate, auth):  
    resp = requests.post(
        "{}/v1/projects/{}/hosts/{}/?action={}".format(
            rancher_url, project_id, host_id, action
    ), auth=auth)
    if resp.status_code == 202:
        i = 0
        while i <= 6:
            state = get_rancher_host_state(
                rancher_url, project_id, host_id, desiredstate, auth
            )
            if state == desiredstate:
                return True
            else:
                time.sleep(5)
                i += 1
        return False
    else:
        raise FailureInPostAction("Action: {} - Non-202 Response Code: {}".format(
            action, resp.status_code
            )
        )

def get_rancher_host_state(rancher_url, project_id, host_id, desiredstate, auth):  
    return requests.get(
    "{}/v1/projects/{}/hosts/{}".format(rancher_url, project_id, host_id), 
    auth=auth).json()['state']

def get_rancher_host_id(rancher_url, project_id, local_hostname, auth):  
    resp = requests.get(
            "{}/v1/projects/{}/hosts/".format(
                rancher_url, project_id
            ), auth=auth).json()
    if resp['data']:
        matches = []
        for h in resp['data']:
            if h['hostname'] == local_hostname:
                matches.append(h['id'])
        if len(matches) > 1:
            raise TooManyMatches("More than one match for hostname: {}".format(
                local_hostname
                )
            )
        elif len(matches) == 0:
            raise NoMatches("No hosts found matching {}".format(local_hostname))
        else:
            return str(matches[0])

if __name__ == '__main__':  
    if os.path.isfile(amibuilderfile):
        sys.exit(0)
    arguments = docopt(__doc__)
    accesskey = arguments["--accesskey"]
    secretkey = arguments["--secretkey"]
    rancherurl = arguments["--rancherurl"]
    projectid = arguments["--project"]
    hostname = get_hostname()
    auth = HTTPBasicAuth(accesskey, secretkey)

    try:
        host_id = get_rancher_host_id(rancherurl, projectid, hostname, auth)

        deactivate = post_request_to_rancher(
            rancherurl, projectid, host_id, "deactivate", "inactive", auth
        )
        if deactivate:
            remove = post_request_to_rancher(
                rancherurl, projectid, host_id, "remove", "removed", auth
            )
            if remove:
                purge = post_request_to_rancher(
                    rancherurl, projectid, host_id, "purge", "purged", auth
                )

    except (FailureInPostAction, NoMatches, TooManyMatches) as e:
        trigger_incident(
            PD_OPSLOW_INTKEY, repr(e)
            )

    except Exception as e:
        trigger_incident(
            PD_OPSLOW_INTKEY,
            "Encountered Exception while removing {} from {}: {}".format(
                hostname, rancherurl, repr(e)
            )
        )

This code goes through the three stages required to fully remove a container host from Rancher: deactivate, remove, purge. Since the host in question will be in the process of shutting down, there's little point logging error output locally - if something unexpected happens, I send it to PagerDuty (though this could just as easily be Slack or some other notification method) to let Ops know that we may have to a) deal with a phantom host and b) handle the exception better in this code!

To ensure that this gets run any time the host instance receives a shutdown signal, we drop the following shell script:

cd /usr/local/lib  
python rancher-deprovision.py --rancherurl={{ rancher_registration_url }} \
--accesskey={{ rancher_access_key }} \
--secretkey={{ rancher_secret_key }} \
--project={{ rancher_project_id }}

Into /etc/rc0.d/K04rancher-asg-shutdown.sh - for a fuller explanation of what's going on here, see this Stack Exchange question. Note that this particular methodology will break on systemd-based OSes (i.e. Ubuntu newer than 14.04, and basically every other major distro I'm familiar with) - however, systemd also lets you achieve the same behaviour. I leave it to the reader to figure that part out if they need it (as I'll have to do when we migrate to Xenial...)


Conclusion

This was a very long article, but also an informative one, I hope. If you have questions for me - or suggestions that would help me improve this article or any of the methods described therein - you can find me on reddit as /u/tastycidr. Any and all constructive feedback is welcome!