Large-scale cluster management at Google with Borg

Cluster management system

2015

15

Trusted by Many

Year of Publish

What it is about?

Borg is a cluster management system that runs hundreds of thousands of jobs, from many thousands of different applications, across a number of clusters.
Borg offers abstract resource management, high reliability and enables efficient large-scale operations.

Learn More

User Perspective

Users submit their work to Borg in the form of jobs, each of which consists of one or more tasks. Each job runs in one Borg cell. Borg cells handles two parts in workload: long-running services that should “never” go down and batch jobs. Jobs are collections of similar tasks where each runs a binary resource like CPU, memory and disk. It uses BCL and packaging system to deploy binary and config files. This is then submitted via CLI with some constraints, needs and priorities. Then tasks a re assigned DNS names to discover each other. Borg then Ofers dashboards to track the performance and debugging and has a rolling updates to allow staged deployment and rollback.

Learn More

Borg’s Architecture

BorgMaster

The main idea is that the columnar format must reconstruct to record-based format to support MR. It should also be quick. Dremel supports fields that are repeated, missing and optional. Each field becomes a column and it forms a hierarchy.

Availability

Since failures are common and expected, Borg mitigates this by: automatic rescheduling of evicted tasks, rate limits on task disruption and so on. Borg uses replication for availability

Scheduling

A job when admitted, enters the pending queue. The scheduler processes the queue and uses round-robin fairness. It has 2 phases: feasibility checking and scoring (based on preferences, locality of packages)

Experiments

Dremel (column storage) proved to have better performance when reading from disk. The execution time was significantly less when MR used column-type of storage compared to MR that uses record-type storage. Aggregates that returns multiple groups benefit from multi-level serving trees. Leaf nodes process 99% of the tablet. If there are less replicas, the likelihood of straggler hindering execution is high. So high replication will solve this by rerunning tasks on other replicas.

Borglet

Runs on every machine in a cell. Its responsibility is to start and stop tasks, restart failed ones, manage resource, logs. Master polls Borglets every few seconds.

Observations

Dremel compliments MR and gives good results when used together. Achieves linear scalability for deeply nested data. All lanes should use column-based data in order to improve speed.

Utilization

Evaluation Methodology:

Cell compaction - given a workload, we found out how small a cell it could be fitted into by removing machines until the workload no longer fitted, repeatedly re-packing the workload from scratch to ensure that we didn’t get hung up on an unlucky configuration.

Fine-grained resource requests

Since the request size is varying, there is not correct bucket size that cover most tasks. Bucketing in the powers of 2 will increase overhead.

Cell sharing

Machines run both prod and non-prod tasks concurrently. If they were separated, ~20% to 30% more machines would be needed. More tasks increased CPI but overall effect was small. The efficiency gain from resource pooling outweighs performance loss from interference.

Resource reclamation

it reclaims unused allocated resources for tasks. Each task is set a limit and a reservation. Prod tasks are scheduled agains a limit and non prod tasks use reservation. When it exceeds reservation, non-prod tasks are throttles or killed. About ~20% workload runs on reclaimed resources. Without reclamation, more machines would be needed.

Large cells

Google has large cells to allow large computations to be run and to decrease resource fragmentation. Partitioning a cell increased the resource cost.

Observations

Dremel compliments MR and gives good results when used together. Achieves linear scalability for deeply nested data. All lanes should use column-based data in order to improve speed.

Borg’s primary goals is to make efficient use of Google’s fleet of machines. So they examined Borg’s maximization of resource usage.

Isolation

Isolation is essential to ensure security and performance in a shared environment.

Security isolation: Linux chroot jails are used to isolate tasks on same machine. Borgssh launches shell inside same chroot and cgroup.
Performance Isolation: All tasks run in cgroup based Linux containers. User-facing or critical infra tasks, receive top priority and can starve others.

Learn More

Future work

Needs better performance isolation, improved debugging tools, policy tuning based on feedback

Conclusion

Borg is Google’s widely used cluster management system that powers applications across thousands of machines offering high utilization efficiently and reliably.