Streaming: Danny Yuan on Real-Time, Time Series Forecasting @Uber
Published March 31, 2018
26 min
    Add to queue
    Copy URL
    Show notes
    On this week’s podcast, Danny Yuan, Uber’s Real-time Streaming/Forecasting Lead, lays out a thorough recipe book for building a real-time streaming platform with a major focus on forecasting. In this podcast, Danny discusses everything from the scale Uber operates at to what the major steps for training/deploy models in an iterative (almost Darwinistic) fashion and wraps with his advice for software engineers who want to begin applying machine learning into their day-to-day job. Why listen to this podcast: * Uber processes 850,000 - 1.3 million messages per second in their streaming platform with about 12 TB of growth per day. The system’s queries scan 100 million to 4 billion documents per second. * Uber’s frontend is mobile. The frontend talks to an API layer. All services generate events that are shuffled into Kafka. The real-time forecasting pipeline taps into Kafka to processes events and stores the data into Elasticsearch. * There is a federated query layer in front of Elasticsearch to provide OLAP query capabilities. * Apache Flink’s advanced windowing features, programming model, and checkpointing convinced Uber to move away from the simplicity of Apache Samza. * The forecasting system allows Uber to remove the notion of delay by using recent signals plus historical data to project what is happening now and what will happen into the future. * Uber’s pipeline for deploying ML models: HDFS, feature engineering, organizing into data structures (similar to data frames), deploy mostly offline training models, train models, & store into a container-based model manager. * A model serving layer is used to pick which model to use, forecasting results are stored in an OLAP data store, a validation layer compares real results against forecast results to verify the model is working as desired, and a rollback feature enables poor performing models to be automatically replaced by previous one. * “Without output, you don’t have input.” If you want to start leveraging machine learning, developers just need to start doing. Start with intuition and practice. Over time ask questions and learn what you need, then apply a laser focus to gain that knowledge. You can also subscribe to the InfoQ newsletter to receive weekly updates on the hottest topics from professional software development. Subscribe: Like InfoQ on Facebook: Follow on Twitter: Follow on LinkedIn: Check the landing page on InfoQ:
        0:00:00 / 0:00:00