typhoon/docs/topics/performance.md

Performance

Provision Time

Provisioning times vary based on the platform. Sampling the time to create (apply) and destroy clusters with 1 controller and 2 workers shows (roughly) what to expect.

| Platform      | Apply        | Destroy      |
|---------------|--------------|--------------|
| AWS           | 20 min       | 8 min 10 sec |
| Bare-Metal    | 10-14 min    | NA           |
| Digital Ocean | 3 min 30 sec | 20 sec       |
| Google Cloud  | 4 min        | 4 min 30 sec |

Notes:

  • AWS is alpha
  • DNS propagation times have a large impact on provision time
  • Platforms with auto-scaling take more time to provision (AWS, Google)
  • Bare-Metal provision times depend on how long machines take to POST and on the network bandwidth available for downloading images
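Timings like those in the table can be collected by wrapping the apply and destroy commands in a small timer. The following is a minimal sketch, not the method used for the numbers above: `time_command`, `format_duration`, and `NOOP_COMMAND` are hypothetical names, and the `terraform` invocation in the comment assumes the cluster is managed with Terraform from the current directory.

```python
import subprocess
import sys
import time

# Harmless stand-in command for trying the helper without touching a cluster.
NOOP_COMMAND = [sys.executable, "-c", "pass"]


def time_command(argv):
    """Run argv to completion and return the elapsed wall-clock seconds."""
    start = time.monotonic()
    subprocess.run(argv, check=True)  # raises CalledProcessError on failure
    return time.monotonic() - start


def format_duration(seconds):
    """Format seconds in the style used by the table, e.g. '3 min 30 sec'."""
    minutes, secs = divmod(round(seconds), 60)
    parts = []
    if minutes:
        parts.append(f"{minutes} min")
    if secs or not parts:
        parts.append(f"{secs} sec")
    return " ".join(parts)


# Assumed usage (not run here; it would create real infrastructure):
#   elapsed = time_command(["terraform", "apply", "-auto-approve"])
#   print("Apply:", format_duration(elapsed))
```

A single run is only a sample. Since DNS propagation and machine boot times fluctuate, repeating the measurement a few times gives a more honest range.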

Network Performance

Network performance varies based on the platform and CNI plugin. iperf was used to measure the bandwidth between different hosts and between different pods. Host-to-host indicates the typical bandwidth offered by the provider. Pod-to-pod shows the bandwidth between two iperf containers. The difference between the two gives a rough idea of the overhead added by pod networking.

| Platform / Plugin          | Theory | Host to Host | Pod to Pod   |
|----------------------------|--------|--------------|--------------|
| AWS (flannel)              | ?      | 976 MB/s     | 900-999 MB/s |
| AWS (calico, MTU 1480)     | ?      | 976 MB/s     | 100-350 MB/s |
| AWS (calico, MTU 8991)     | ?      | 976 MB/s     | 900-999 MB/s |
| Bare-Metal (flannel)       | 1 GB/s | 934 MB/s     | 903 MB/s     |
| Bare-Metal (calico)        | 1 GB/s | 941 MB/s     | 931 MB/s     |
| Bare-Metal (flannel, bond) | 3 GB/s | 2.3 GB/s     | 1.17 GB/s    |
| Bare-Metal (calico, bond)  | 3 GB/s | 2.3 GB/s     | 1.17 GB/s    |
| Digital Ocean              | ?      | 938 MB/s     | 820-880 MB/s |
| Google Cloud (flannel)     | ?      | 1.94 GB/s    | 1.76 GB/s    |
| Google Cloud (calico)      | ?      | 1.94 GB/s    | 1.81 GB/s    |

Notes:

  • AWS is alpha
  • Network bandwidth fluctuates on AWS and Digital Ocean.
  • Only certain AWS EC2 instance types allow jumbo frames. This is why the default MTU on AWS must be 1480.
  • Between Flannel and Calico, performance differences are usually minimal. Platform and configuration differences dominate.
  • Pods do not seem to be able to leverage the hosts' bonded NIC setup. Possibly a testing artifact.
  • Observing the same bonded NIC pod-to-pod limit suggests the bottleneck lies below flannel and calico.
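The overhead implied by the table above can be estimated as the fraction of host-to-host bandwidth that is lost pod-to-pod. The sketch below is illustrative only: `overhead_percent` is a hypothetical helper, the figures are copied from the rows that report single values, and GB/s values are converted to MB/s assuming 1 GB = 1000 MB.

```python
def overhead_percent(host_to_host, pod_to_pod):
    """Return the share of host-to-host bandwidth lost pod-to-pod, in percent."""
    if host_to_host <= 0:
        raise ValueError("host_to_host must be positive")
    return (host_to_host - pod_to_pod) / host_to_host * 100.0


# (host-to-host, pod-to-pod) in MB/s, taken from the table above.
# Assumption: 1 GB/s is treated as 1000 MB/s.
samples = {
    "Bare-Metal (flannel)": (934, 903),
    "Bare-Metal (calico)": (941, 931),
    "Bare-Metal (flannel, bond)": (2300, 1170),
    "Bare-Metal (calico, bond)": (2300, 1170),
    "Google Cloud (flannel)": (1940, 1760),
    "Google Cloud (calico)": (1940, 1810),
}

for name, (host, pod) in samples.items():
    print(f"{name}: {overhead_percent(host, pod):.1f}% overhead")
```

With these figures, the unbonded bare-metal rows lose roughly 1-3% and the Google Cloud rows roughly 7-9%, while both bonded rows lose about 49%. That is consistent with the note that pods do not appear to benefit from the bonded NICs regardless of CNI plugin.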