Episode 34: Slack and the Safety Dance of Chaos Engineering from Screaming in the Cloud

Screaming in the Cloud

Corey Quinn

32 minutes. Posted Oct 30, 2018 at 11:00 pm.

Show notes

In the early days, angry nerd corners of the Internet dismissed Slack and some of its predecessors with, “Oh, it’s just IRC. Now, you pay someone for it.” Many fell into the trap of wondering what value such systems offered. The big differentiator? Slack is built as a collaborative business tool.

Today, we’re talking to Holly Allen, who helped make government software better while serving as the director of engineering at 18F. Now, she’s a senior engineering manager at Slack, a collaborative chat program where you can do most of your work through a rich platform of integrations. Holly enjoys taking a weird set of skills that make a computer do things and convincing people who know how to make computers do things to do things.

Some of the highlights of the show include:

Safety engineering brings chaos and resilience engineering, incident management, and post-mortem processes together for resiliency and reliability

Slack strives to move really fast while being in complete control

Slack runs primarily on AWS, but is working on a multi-cloud strategy because if AWS is down, Slack still needs to work

Slack has a close relationship with AWS and is a collaborative company; it has immediate access to AWS staff anytime there’s a problem

Slack uses Terraform and Chef, and is working to determine whether running its production workflows in Kubernetes would be worthwhile

Disasterpiece Theater: take a real scenario that might happen and surmise what will happen; the goal is to teach Slack employees, not to cause production issues

Slack hires collaborative, empathetic people to create a collaborative environment where everyone works together toward a goal

Slack was firmly in a centralized operations model, but is shifting responsibility and service ownership toward development teams

Slack doesn’t encourage remote work because it’s not in a position to put in that investment; day-to-day work happens in hallways and between desks

Slack sees itself as an enterprise software company; an enterprise software company must have enterprise software reliability, stability, and processes

Slack has thousands of servers, so events and disruptions happen more often; the system needs to respond, react, and repair itself without human intervention

Links:

Holly Allen on Twitter

Slack

Freenode IRC

HipChat

Kubernetes

Terraform

Chef

QCon

Datadog
