Dread Pirate Diego

Diego is the official container management system inside of Cloud Foundry.  Cloud Foundry has been running applications inside containers since before it was cool.  In May 2017, Diego was made the official container runtime inside of CF, with the older DEAs (Droplet Execution Agents) deprecated.  Diego has arrived!

This blog post quotes liberally from Eric Malm’s Cloud Foundry Diego Overview from the Cloud Foundry Summit in June 2017.
https://www.youtube.com/watch?v=gB-nrdYTTKU

If you’ve ever used Cloud Foundry, then you’ve done a

cf push

This command interacts with the Cloud Controller to package your application and run it inside the platform.  The application instances are actually containers running in CF’s home-grown container engine, Garden.  Garden knows how to execute containers, but doesn’t know anything about the rest of CF, such as its distributed nature.
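
For example, a push that asks for a few instances and a memory limit might look like this (a minimal sketch; the app name and values are purely illustrative):

$ cf push my-app -i 3 -m 256M    # hypothetical app; -i sets instance count, -m the memory limit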

Diego’s responsibility is to orchestrate the placement of containers across the hundreds or thousands of execution sites that constitute the platform.  It keeps those application instances up and running, restarting them if they crash or if their host VMs crash.  Diego needs a persistent, consistent data store provided by a SQL DB and leverages Consul for component discovery and coordination.

The Cloud Controller submits work to the Diego core to run, and the Gorouter receives route information from Diego to maintain its routing tables.

Diego Core Components

The Diego VMs on which container instances run are called Cells.  Each Cell has its own installation of Garden appropriate for its operating system (Linux, Windows, etc.).  Garden creates containers and runs processes inside them.

The Cell rep is the first of Diego’s three core components.  The rep resides on the Cell VM and controls the local Garden to create containers as needed by CF.  It also broadcasts the presence of the Cell to the rest of the system.

The Cloud Controller doesn’t talk directly to the Cells via the rep; it talks to the second of Diego’s three core components, the Bulletin Board System (BBS).  The BBS typically runs on a separate VM and provides the public API that the rest of the system uses to communicate with Diego.  The BBS is in charge of governing the lifecycles of the various app instances running on the cluster.
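
One way to see this API in action is the cfdot CLI covered later in this post.  A quick sketch, assuming cfdot and jq are available on a Diego VM, that lists the process guids of the LRPs the BBS is tracking:

$ cfdot desired-lrps | jq -r .process_guid    # one desired LRP per line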

When the BBS gets new work, it delegates to the third of Diego’s core components, the Auctioneer.  The Auctioneer also runs in isolation on a separate VM in a deployment.  The Auctioneer is responsible for understanding the current state of the Cells and making optimal placement decisions.

Since this is a complex distributed system, many things can go wrong.  Keeping the actual running state eventually consistent with the desired state is the responsibility of the BBS, which applies corrective action as needed.

The component that broadcasts route information to the routing tier (Gorouter) is called the route-emitter.  This used to be a global function, but the coordination across large numbers of Cells created system instability.  The route-emitter has been moved onto each Cell, and each now reports the status of only its local routes to the Gorouter, sharding the functionality and eliminating the global lock bottleneck.

Workload Types

There are two different flavors of lifecycle policies associated with the work that Diego can run.  The first is the Long Running Process (LRP).  The characteristics of an LRP:

  • continual work that needs to be restarted if it stops
  • work that can be scaled seamlessly by running more instances
  • availability assumed to be maintained through concurrent running instances

These characteristics are abstracted out of the needs of the App Processes common to the 12 factor app model we are familiar with.
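
Scaling an app, for example, is just a request for more concurrent LRP instances, which Diego then places and keeps running (a sketch with a hypothetical app name):

$ cf scale my-app -i 5    # ask for five concurrent instances of the app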

The second flavor of work Diego can run is Tasks:

  • Tasks are assumed to terminate with a success / failure exit status that can be reported up
  • These are single, isolated units of work.  The client schedules other individual tasks as needed
  • Diego handles Tasks with stricter consistency and does not rely on eventual consistency for them

Tasks were originally envisioned as staging work for building packages and managing internal system routines.  It turned out that applications also have plenty of one-off work, such as database migrations, that is well suited to Tasks.
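
From the platform user’s side, a one-off Task can be kicked off against an existing app.  A sketch using the cf CLI (the app name and command are illustrative):

$ cf run-task my-app "bundle exec rake db:migrate" --name migrate    # run a one-off migration Task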

Scalability

Replacing the etcd persistence layer inside the BBS with a SQL backend, along with other code improvements, has increased the measured scale of Diego to supporting up to 250K app instances on more than 1,000 Cells under management.  The two SQL databases supported are cf-mysql and Postgres.  Continuous testing with a benchmark test suite ensures that changes maintain the scalability of the system.

cfdot

cfdot is a CLI for Diego that provides functionality to query and manipulate the BBS API.  A BOSH job now deploys the cfdot and jq binaries, suitable for slicing and dicing the output into a human-readable format.
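
As a quick sketch (assuming you have SSH’d onto a VM where the cfdot job is deployed), listing the Cells whose reps have registered their presence:

$ cfdot cells | jq -r .cell_id    # one Cell presence per line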

Example:  Count of app instance states

$ cfdot actual-lrp-groups | jq .instance \
    | jq -s -r 'group_by(.state)[] | "\(.[0].state): \(length)"'

CRASHED:  38
RUNNING:  288
UNCLAIMED:  3

Example:  Find app guid and index at IP and port

$ cfdot actual-lrp-groups | jq -r '.instance
    | select(.address=="10.10.50.34" and (.ports[].host_port | contains(61077)))
    | "\(.process_guid[0:36]) \(.index)"'

ff6d424f-9edc-478c-b591-cdd3a1498460 0