Kolton Andrus on Lessons Learnt From Failure Testing at Amazon and Netflix and New Venture Gremlin
Published December 2, 2016
28 min
In this week's podcast, QCon chair Wesley Reisz talks to Kolton Andrus. Andrus is the founder of Gremlin Inc. He was a Chaos Engineer at Netflix, focused on the resilience of the Edge services, where he designed and built FIT, Netflix’s failure injection service. Before that, he improved the performance and reliability of the Amazon Retail website.

Why listen to this podcast:
- Gremlin, Kolton Andrus' new start-up, provides failure testing as a service. Version 1, currently in closed beta, is focused on infrastructure failures.
- Lineage-driven Fault Injection (LDFI) allowed Netflix to dramatically reduce the number of tests they needed to run in order to explore a problem space.
- You generally want to run failure tests in production, but you can't start there. Start in development and build up.
- Failure testing at the application level, as Netflix does it, enables request-level fault injection for a specific user or a specific device.
- Being able to trace infrastructure with something like Dapper or Zipkin offers tremendous value. At Netflix, the failure injection system is integrated into the tracing system, which meant that when they caused a failure they could see all the points in the system that it touched.

More on this: Quickly scan our curated show notes on InfoQ: http://bit.ly/2fT9YiM

You can also subscribe to the InfoQ newsletter to receive weekly updates on the hottest topics from professional software development: bit.ly/24x3IVq
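To make the idea of request-level fault injection from the notes above more concrete, here is a minimal Python sketch. It is not FIT's or Gremlin's actual API; the names `FaultRule`, `RequestContext`, `call_dependency`, and the `user_id` / `device` attributes are all hypothetical, chosen only to illustrate how a failure can be scoped to one user or device and recorded alongside trace data.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


class InjectedFailure(Exception):
    """Raised instead of making the real call when a fault rule matches."""


@dataclass(frozen=True)
class FaultRule:
    # Hypothetical rule: fail calls to `service` only for matching requests.
    service: str
    user_id: Optional[str] = None  # None matches any user
    device: Optional[str] = None   # None matches any device

    def matches(self, service: str, attrs: Dict[str, str]) -> bool:
        return (
            service == self.service
            and self.user_id in (None, attrs.get("user_id"))
            and self.device in (None, attrs.get("device"))
        )


@dataclass
class RequestContext:
    # Request attributes plus a simple in-memory stand-in for a trace.
    attrs: Dict[str, str]
    trace: List[Tuple[str, str]] = field(default_factory=list)


def call_dependency(
    service: str,
    ctx: RequestContext,
    rules: List[FaultRule],
    real_call: Callable[[], str],
) -> str:
    """Call a dependency, failing it first if any rule matches this request."""
    for rule in rules:
        if rule.matches(service, ctx.attrs):
            # Record the injected failure so it shows up with the trace data.
            ctx.trace.append((service, "injected_failure"))
            raise InjectedFailure(service)
    ctx.trace.append((service, "ok"))
    return real_call()


# Only requests from user "u1" see the "ratings" dependency fail.
rules = [FaultRule(service="ratings", user_id="u1")]
```

Because the rule is evaluated per request, every other user keeps hitting the real dependency, which is what keeps the blast radius small when a test like this is eventually run in production.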