To infinity and beyond: The definitive guide to scaling 10k VMs on Azure




Every platform has limits: workstations and physical servers have resource boundaries, APIs may be rate-limited, and even the seemingly endless public cloud enforces limits that protect the platform from overuse or misuse. You can learn more about these limits in our documentation, “Azure subscription and service limits, quotas, and constraints.” When you work on scenarios that push a platform to its extreme, those limits become real, and overcoming them takes deliberate planning.

The following post includes essential notes taken from my work with Mike Kiernan, Mayur Dhondekar, and Idan Shahar. It also covers several iterations in which we tried to reach a scale of 10K virtual machines running on Microsoft Azure, and it explores the pros and cons of the different implementations.

Load tests at cloud scale

Load and stress tests before moving a new version to production are critical, but they pose a real challenge for IT: they require a considerable amount of resources that are needed for only a short time in every release cycle. Purchased infrastructure does not justify its cost over extended periods, which makes this a perfect use case for a public cloud platform, where you pay only for what you use.

This post is based on our work with a customer and discusses the challenges we met along the way. However, the solution is general enough to apply to other use cases that involve large clusters of VMs in Azure, such as:

  • Clusters that must scale beyond a single virtual machine scale set (VMSS) and that stay static in size once provisioned, such as HPC clusters.
  • DDoS simulation. Note that in this case you must act ethically and own the targeted endpoint; otherwise you risk liability for damages.

The process

    At a high level, provisioning and initializing a cluster of x VMs that “do something” takes the following steps:

    • Start from a base image.
    • Provision x VMs from the base image.
    • Download and install required software and data to each VM.
    • Start the “do-something” process on each VM.
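The four steps above can be sketched as a small pipeline that runs once per VM. This is only an illustration, not the solution's actual code: `provision`, `install`, `start`, and the image name are hypothetical placeholders that change a dictionary instead of calling any cloud API, and a thread pool stands in for whatever parallelism the real provisioning mechanism offers.

```python
from concurrent.futures import ThreadPoolExecutor

BASE_IMAGE = "base-image"  # hypothetical image name, for illustration only


def provision(vm_id: int, image: str) -> dict:
    # Placeholder: a real implementation would ask the cloud platform for a VM.
    return {"id": vm_id, "image": image, "state": "provisioned"}


def install(vm: dict) -> dict:
    # Placeholder for downloading and installing software and data on the VM.
    vm["state"] = "installed"
    return vm


def start(vm: dict) -> dict:
    # Placeholder for starting the "do-something" process on the VM.
    vm["state"] = "running"
    return vm


def bring_up(vm_id: int) -> dict:
    # Steps 1-4 for a single VM, in order.
    return start(install(provision(vm_id, BASE_IMAGE)))


def bring_up_cluster(x: int, workers: int = 8) -> list:
    # Run the per-VM pipeline for x VMs, several at a time.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(bring_up, range(x)))


cluster = bring_up_cluster(20)
running = sum(1 for vm in cluster if vm["state"] == "running")
print(f"{running}/{len(cluster)} VMs running")
```

Even in this toy form, the final count hints at why tracking completion becomes a concern of its own once x grows into the thousands.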

    However, given the targeted hyper-scale, a number of critical elements must be taken into account. It quickly becomes clear that the concerns of implementing such scenarios are as much about management, cost optimization, and avoiding platform limits as they are about infrastructure and the provisioning process.

    • How do you manage 10K VMs? How do you even count them?
    • Where does the data originate, and can that source handle the load of 10K concurrent downloads?
    • How do you know when the process has completed?
    • Can the cloud provide 10K VMs in one region, and if so, in which region?
    • How long does it take to provision the cluster and reach full scale?
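One way to make the counting question concrete is to plan how a target VM count splits into groups that each stay under a per-group cap. The sketch below is a minimal illustration: the function name is hypothetical, and the cap of 1,000 instances per group is an assumption used for the arithmetic, so check the current figure in the Azure limits documentation before relying on it.

```python
def plan_groups(total_vms: int, max_per_group: int) -> list:
    """Split total_vms into group sizes, none larger than max_per_group."""
    if total_vms < 0 or max_per_group <= 0:
        raise ValueError("total_vms must be >= 0 and max_per_group must be > 0")
    full, rest = divmod(total_vms, max_per_group)
    # 'full' groups at the cap, plus one smaller group for any remainder.
    return [max_per_group] * full + ([rest] if rest else [])


ASSUMED_CAP = 1_000  # assumption for illustration, not a documented guarantee

plan = plan_groups(10_000, ASSUMED_CAP)
print(f"{len(plan)} groups, {sum(plan)} VMs in total")
```

Keeping the plan as plain data makes it easy to verify the total before any resources are requested.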

    The next section describes a load-test scenario that is implemented using different services and tackles the questions raised above, with the following goals:

    • Generate stress on a backend service located in some other datacenter using client machines (VMs) in Azure.
    • Trigger the process using HTTP POST.
    • Avoid manual steps, prerequisites, and custom images, which may become outdated over time.
    • Reach a full-scale cluster in minimal time.
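The goal of triggering the process with an HTTP POST can be illustrated with a request built from the Python standard library. The endpoint URL and the payload fields (`vmCount`, `target`) are hypothetical, since the post does not describe the real trigger's contract, and the request is only constructed here, not sent.

```python
import json
import urllib.request


def build_trigger(url: str, vm_count: int, target: str) -> urllib.request.Request:
    # Hypothetical payload shape; the real trigger may expect different fields.
    body = json.dumps({"vmCount": vm_count, "target": target}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_trigger(
    "https://example.invalid/api/start-load-test",  # placeholder endpoint
    10_000,
    "https://backend.example.invalid",  # placeholder backend under test
)
print(req.get_method(), req.full_url)
# Sending would be: urllib.request.urlopen(req). It is omitted because the
# endpoint above does not exist.
```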

    The solution outline

      [Figure: flow chart of the load-test scenario implemented using various services]
      Read more about all the details of the solution in the blog post, “To Infinity and Beyond (or: The Definitive Guide to Scaling 10k VMs on Azure).” You can also see the solution code and deployment scripts on GitHub.