The metal-stack is a compound of microservices predominantly written in Golang.
This page gives you an overview over which microservices exist, how they communicate with each other and where they are deployed.
The foundation of the metal-stack is what we call the metal control plane. The metal control plane is typically deployed in a Kubernetes cluster and is not strictly required to run inside your data center. It even makes sense not to place the metal control plane in the same failure domain with your servers that you are going to manage with the metal-stack. The control plane does not depend on Kubernetes functionality, such that deployments to other target platforms are theoretically possible.
The control plane contains of a couple of essential microservices for the metal-stack including:
- metal-api The API to manage and control plane resources like machines, switches, operating system images, machine sizes, networks, IP addresses and more. The exposed API is an old-fashioned REST API with different authentication methods. The metal-api stores the state of these entities in a RethinkDB database. The metal-api also has its own IP address management (go-ipam), which writes IP address and network allocations into a PostgreSQL backend.
- masterdata-api Manages tenant and project entities, which can be described as entities used for company-specific resource separation and grouping. Having these "higher level entities" managed by a separate microservice was a design choice that allows to re-use the information by other microservices without having them to know the metal-api at all. The masterdata gets persisted in a dedicated PostgreSQL database.
- metal-console Provides access for users to a machine's serial console via SSH. It can be seen as an optional component.
- nsq A message queuing system (not developed by the metal-stack) used for decoupling microservices and distributing tasks.
The following figure shows the relationships between these microservices:
Figure 1: The metal control plane deployed in a Kubernetes environment with an ingress-controller exposing additional services via service exposal.
Some notes on this picture:
- Users can access the metal-api with the CLI client called metalctl.
- You can programmatically access the metal-api with client libraries (e.g. metal-go).
- Our databases are wrapped in a specially built backup-restore-sidecar, which is consistently backing up the databases in external blob storage.
- The metal-api can be scaled out using replicas when being deployed in Kubernetes.
A partition is our term for describing hardware in the data center controlled by the metal-stack with all the hardware participating in the same network topology. Being in the same network topology causes the hardware inside a partition to build a failure domain. Even though the network topology for running the metal-stack is required to be redundant by design, you should consider setting up multiple partitions. With multiple partitions it is possible for users to maintain availability of their applications by spreading them across the partitions. Installing partitions in multiple data centers would be even better in regards of fail-safe application performance, which would even tolerate the meltdown of a data center.
In our setups, we encode the name of a region and a zone name into our partition names. However, we do not have dedicated entities for regions and zones in our APIs.
A region is a geographic area in which data centers are located.
Zones are geographic locations in a region usually in different fire compartments. Regions can consist of several zones.
A zone can consist of several partitions. Usually, a partition spans a rack or a group of racks.
We strongly advise to group your hardware into racks that are specifically assembled for running metal-stack. When using modular rack design, the amount of compute resources of a partition can easily be extended by adding more racks to your partition.
The hardware that we currently support to be placed inside a partition is described in the hardware document.
How large you can grow your partitions and how the network topology inside a partition looks like is described in the networking document.
The metal-stack has microservices running on the leaf switches in a partition. For this reason, your leaf switches are required to run a Linux distribution that you have full access to. Additionally, there are a servers not added to the pool of user-allocatable machines, which are instead required for running metal-stack and we call them management servers. TODO: Explain management network, management firewall and switch plane.
The microservices running inside a partition are:
- metal-hammer (runs on a server when not allocated by user, often referred to as discovery image) An initrd, which is booted up in PXE mode, preparing and registering a machine. When a user allocates a machine, the metal-hammer will install the target operating system on this machine and kexec into the new operating system kernel.
- metal-core (runs on leaf switches) Dynamically configures the leaf switch from information provided by the metal-api. It also proxies requests from the metal-hammer to the metal-api including publishment of machine lifecycle events and machine registration requests.
- pixiecore (runs on leaf switches, not developed by metal-stack) Provides the capability of PXE booting servers in the PXE boot network.
- bmc-proxy (runs on management servers) Belongs to the metal-console, allowing user access to the machine's serial console. It can be seen as an optional component.
- ipmi-catcher (runs on management servers) Reports the ip addresses that are leased to ipmi devices together with their machine uuids to the metal-api. This provides machine discovery in the partition machines and keeps all IPMI interface access data up-to-date.
Figure 2: Simplified illustration of services running inside a partition.
Some notes on this picture:
- This figure is slightly simplified. The switch plane consists of spine switches, exit routers, management firewalls and a bastion router with more software components deployed on these entities. Please refer to the networking document to see the full overview over the switch plane.
- The image-cache is an optional component consisting of multiple services to allow caching images from the public image store inside a partition. This brings increased download performance on machine allocation and increases independence of a partition on the internet connection.
The following figure shows several partitions connected to a single metal control plane. Of course, it is also possible to have multiple metal control planes, which can be useful for staging.
Figure 3: Reduced view on the communication between the metal control plane and multiple partitions.
Some notes on this picture:
- By design, a partition only has very few ports open for incoming-connections from the internet. This contributes to a smaller attack surface and higher security of your infrastructure.
- With the help of NSQ, it is not required to have connections from the metal control plane to the metal-core. The metal-core instances register at the message bus and can then consume partition-specfic topics, e.g. when a machine deletion gets issued by a user.
The following sequence diagram illustrates some of the main principles of the machine provisioning lifecycle.
Figure 4: Sequence diagram of the machine provisioning sequence.
Here is a video showing a screen capture of a machine's serial console while running the metal-hammer in "wait mode". Then, a user allocates the machine and the metal-hammer installs the target operating system and the machine boots into the new operating system kernel via the kexec system call.