Introducing Rainbird, Twitter's high-volume distributed counting service for real-time analytics, built on Cassandra. This presentation looks at the motivation, design, and uses of Rainbird across Twitter.
2. Agenda
‣ Why Real-time Analytics?
‣ Rainbird and Cassandra
‣ Production Uses at Twitter
‣ Open Source
3. My Background
‣ Mathematics and Physics at Harvard, Physics at Stanford
‣ Tropos Networks (city-wide wireless): mesh routing algorithms, GBs of data
‣ Cooliris (web media): Hadoop and Pig for analytics, TBs of data
‣ Twitter: Hadoop, Pig, HBase, Cassandra, data viz, social graph analysis, soon to be PBs of data
4. My Background
‣ Mathematics and Physics at Harvard, Physics at Stanford
‣ Tropos Networks (city-wide wireless): mesh routing algorithms, GBs of data
‣ Cooliris (web media): Hadoop and Pig for analytics, TBs of data
‣ Twitter: Hadoop, Pig, HBase, Cassandra, data viz, social graph analysis, soon to be PBs of data
Now revenue products!
5. Agenda
‣ Why Real-time Analytics?
‣ Rainbird and Cassandra
‣ Production Uses at Twitter
‣ Open Source
10. Real-time Reporting
‣ Discussion around ad-based revenue model
‣ Help shape the conversation in real-time with Promoted Tweets
11. Real-time Reporting
‣ Discussion around ad-based revenue model
‣ Help shape the conversation in real-time with Promoted Tweets
‣ Real-time reporting ties it all together
12. Agenda
‣ Why Real-time Analytics?
‣ Rainbird and Cassandra
‣ Production Uses at Twitter
‣ Open Source
13. Requirements
‣ Extremely high write volume
‣ Needs to scale to 100,000s of WPS
14. Requirements
‣ Extremely high write volume
‣ Needs to scale to 100,000s of WPS
‣ High read volume
‣ Needs to scale to 10,000s of RPS
15. Requirements
‣ Extremely high write volume
‣ Needs to scale to 100,000s of WPS
‣ High read volume
‣ Needs to scale to 10,000s of RPS
‣ Horizontally scalable (reads, storage, etc)
‣ Needs to scale to 100+ TB
16. Requirements
‣ Extremely high write volume
‣ Needs to scale to 100,000s of WPS
‣ High read volume
‣ Needs to scale to 10,000s of RPS
‣ Horizontally scalable (reads, storage, etc)
‣ Needs to scale to 100+ TB
‣ Low latency
‣ Most reads <100 ms (esp. recent data)
17. Cassandra
‣ Pro: In-house expertise
‣ Pro: Open source Apache project
‣ Pro: Writes are extremely fast
‣ Pro: Horizontally scalable, low latency
‣ Pro: Other startup adoption (Digg, SimpleGeo)
18. Cassandra
‣ Pro: In-house expertise
‣ Pro: Open source Apache project
‣ Pro: Writes are extremely fast
‣ Pro: Horizontally scalable, low latency
‣ Pro: Other startup adoption (Digg, SimpleGeo)
‣ Con: It was really young (0.3a)
19. Cassandra
‣ Pro: Some dudes at Digg had already started working on distributed atomic counters in Cassandra
20. Cassandra
‣ Pro: Some dudes at Digg had already started working on distributed atomic counters in Cassandra
‣ Say hi to @kelvin
21. Cassandra
‣ Pro: Some dudes at Digg had already started working on distributed atomic counters in Cassandra
‣ Say hi to @kelvin
‣ And @lenn0x
22. Cassandra
‣ Pro: Some dudes at Digg had already started working on distributed atomic counters in Cassandra
‣ Say hi to @kelvin
‣ And @lenn0x
‣ A dude from Sweden began helping: @skr
23. Cassandra
‣ Pro: Some dudes at Digg had already started working on distributed atomic counters in Cassandra
‣ Say hi to @kelvin
‣ And @lenn0x
‣ A dude from Sweden began helping: @skr
‣ Now all at Twitter :)
24. Rainbird
‣ It counts things. Really quickly.
‣ Layers on top of the distributed counters patch, CASSANDRA-1072
25. Rainbird
‣ It counts things. Really quickly.
‣ Layers on top of the distributed counters patch, CASSANDRA-1072
‣ Relies on Zookeeper, Cassandra, Scribe, Thrift
‣ Written in Scala
26. Rainbird Design
‣ Aggregators buffer for 1m
‣ Intelligent flush to Cassandra
‣ Query servers read once written
‣ 1m is configurable
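To make the buffering idea above concrete, here is a minimal sketch (not Rainbird's actual code), assuming a caller-supplied flush function that would batch the accumulated deltas into Cassandra counter writes; the class and parameter names are made up for illustration:

```scala
import scala.collection.mutable

// Illustrative sketch only: buffer increments in memory for ~1 minute,
// then hand the aggregated deltas to a flush function in one pass.
class BufferingAggregator(
    flush: Map[(String, List[String], Long), Long] => Unit,
    bufferMillis: Long = 60 * 1000L) {

  private val buffer = mutable.Map.empty[(String, List[String], Long), Long]
  private var lastFlush = System.currentTimeMillis()

  def increment(category: String, keys: List[String], count: Long = 1L): Unit = synchronized {
    val minuteBucket = System.currentTimeMillis() / 60000L
    val key = (category, keys, minuteBucket)
    buffer(key) = buffer.getOrElse(key, 0L) + count
    if (System.currentTimeMillis() - lastFlush >= bufferMillis) {
      flush(buffer.toMap) // one batched write per interval instead of one write per event
      buffer.clear()
      lastFlush = System.currentTimeMillis()
    }
  }
}
```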
33. Hierarchical Aggregation
‣ Say we’re counting Promoted Tweet impressions
‣ category = pti
‣ keys = [advertiser_id, campaign_id, tweet_id]
‣ count = 1
‣ Rainbird automatically increments the count for
‣ [advertiser_id, campaign_id, tweet_id]
‣ [advertiser_id, campaign_id]
‣ [advertiser_id]
‣ Means fast queries over each level of hierarchy
‣ Configurable in rainbird.conf, or dynamically via ZK
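The fan-out described above can be pictured as expanding one write into an increment for every non-empty key prefix; a small sketch (not Rainbird's API), using the Promoted Tweet keys:

```scala
// Every non-empty prefix of the key list receives the same increment.
def expand(keys: List[String]): List[List[String]] =
  keys.inits.filter(_.nonEmpty).toList

expand(List("advertiser_id", "campaign_id", "tweet_id"))
// => List(List(advertiser_id, campaign_id, tweet_id),
//         List(advertiser_id, campaign_id),
//         List(advertiser_id))
```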
34. Hierarchical Aggregation
‣ Another example: tracking URL shortener tweets/clicks
‣ full URL = http://music.amazon.com/some_really_long_path
‣ keys = [com, amazon, music, full URL]
‣ count = 1
‣ Rainbird automatically increments the count for
‣ [com, amazon, music, full URL]
‣ [com, amazon, music]
‣ [com, amazon]
‣ [com]
‣ Means we can count clicks on full URLs
‣ And automatically aggregate over domains and subdomains!
35. Hierarchical Aggregation
‣ Another example: tracking URL shortener tweets/clicks
‣ full URL = http://music.amazon.com/some_really_long_path
‣ keys = [com, amazon, music, full URL]
‣ count = 1
‣ Rainbird automatically increments the count for
‣ [com, amazon, music, full URL]   ← How many people tweeted the full URL?
‣ [com, amazon, music]
‣ [com, amazon]
‣ [com]
‣ Means we can count clicks on full URLs
‣ And automatically aggregate over domains and subdomains!
36. Hierarchical Aggregation
‣ Another example: tracking URL shortener tweets/clicks
‣ full URL = http://music.amazon.com/some_really_long_path
‣ keys = [com, amazon, music, full URL]
‣ count = 1
‣ Rainbird automatically increments the count for
‣ [com, amazon, music, full URL]
‣ [com, amazon, music]   ← How many people tweeted any music.amazon.com URL?
‣ [com, amazon]
‣ [com]
‣ Means we can count clicks on full URLs
‣ And automatically aggregate over domains and subdomains!
37. Hierarchical Aggregation
‣ Another example: tracking URL shortener tweets/clicks
‣ full URL = http://music.amazon.com/some_really_long_path
‣ keys = [com, amazon, music, full URL]
‣ count = 1
‣ Rainbird automatically increments the count for
‣ [com, amazon, music, full URL]
‣ [com, amazon, music]
‣ [com, amazon]   ← How many people tweeted any amazon.com URL?
‣ [com]
‣ Means we can count clicks on full URLs
‣ And automatically aggregate over domains and subdomains!
38. Hierarchical Aggregation
‣ Another example: tracking URL shortener tweets/clicks
‣ full URL = http://music.amazon.com/some_really_long_path
‣ keys = [com, amazon, music, full URL]
‣ count = 1
‣ Rainbird automatically increments the count for
‣ [com, amazon, music, full URL]
‣ [com, amazon, music]
‣ [com, amazon]
‣ [com]   ← How many people tweeted any .com URL?
‣ Means we can count clicks on full URLs
‣ And automatically aggregate over domains and subdomains!
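One way the URL example could derive its key hierarchy, sketched with a hypothetical helper (not the production code) that reverses the hostname labels and keeps the full URL as the most specific level:

```scala
import java.net.URI

// Illustrative only: reverse the domain labels so aggregation rolls up
// from TLD to subdomain, then append the full URL as the deepest level.
def urlKeys(url: String): List[String] = {
  val host = new URI(url).getHost          // e.g. "music.amazon.com"
  host.split('.').reverse.toList :+ url
}

urlKeys("http://music.amazon.com/some_really_long_path")
// => List(com, amazon, music, http://music.amazon.com/some_really_long_path)
```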
39. Temporal Aggregation
‣ Rainbird also does (configurable) temporal aggregation
‣ Each count is kept minutely, but also denormalized hourly, daily, and all time
‣ Gives us quick counts at varying granularities with no large scans at read time
‣ Trading storage for latency
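A sketch of the denormalized time buckets a single event could be counted into, assuming the minutely/hourly/daily/all-time granularities above; the bucket names here are illustrative:

```scala
// For one event at `epochMillis`, write the same delta into one bucket per
// granularity so reads never have to scan and sum raw minutes.
def timeBuckets(epochMillis: Long): Map[String, Long] = Map(
  "minutely" -> epochMillis / (60L * 1000),
  "hourly"   -> epochMillis / (60L * 60 * 1000),
  "daily"    -> epochMillis / (24L * 60 * 60 * 1000),
  "alltime"  -> 0L // a single bucket covering everything
)
```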
40. Multiple Formulas
‣ So far we have talked about sums
‣ Could also store counts (1 for each event)
‣ ... which gives us a mean
‣ And sums of squares (count * count for each event)
‣ ... which gives us a standard deviation
‣ And min/max as well
‣ Configure this per-category in rainbird.conf
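Storing the sum, the count, and the sum of squares per key is enough to recover the mean and standard deviation at read time; a small sketch of that arithmetic:

```scala
import scala.math.{sqrt, max}

// Derived statistics from the three aggregates described above.
def stats(sum: Double, count: Double, sumSquares: Double): (Double, Double) = {
  val mean     = sum / count
  val variance = sumSquares / count - mean * mean // E[x^2] - (E[x])^2
  (mean, sqrt(max(variance, 0.0)))                // clamp tiny negative rounding error
}
```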
41. Rainbird
‣ Write 100,000s of events per second, each with hierarchical structure
‣ Query with minutely granularity over any level of the hierarchy, get back a time series
‣ Or query all time values
‣ Or query all time means, standard deviations
‣ Latency < 100ms
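Putting the pieces together, the read side can be imagined as one lookup per time bucket against pre-aggregated counters; this is a hypothetical sketch, not Rainbird's actual interface:

```scala
// Hypothetical read-side shape: counts are pre-aggregated under
// (category, key prefix, granularity, time bucket), so a query over any
// prefix and bucket range is a simple lookup per bucket, no event scans.
def query(store: Map[(String, List[String], String, Long), Long],
          category: String,
          keys: List[String],
          granularity: String,
          startBucket: Long,
          endBucket: Long): Seq[(Long, Long)] =
  (startBucket to endBucket).map { bucket =>
    bucket -> store.getOrElse((category, keys, granularity, bucket), 0L)
  }
```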
42. Agenda
‣ Why Real-time Analytics?
‣ Rainbird and Cassandra
‣ Production Uses at Twitter
‣ Open Source
43. Production Uses
‣ It turns out we need to count things all the time
‣ As soon as we had this service, we started
finding all sorts of use cases for it
‣ Promoted Products
‣ Tweeted URLs, by domain/subdomain
‣ Per-user Tweet interactions (fav, RT, follow)
‣ Arbitrary terms in Tweets
‣ Clicks on t.co URLs
45. Production Uses
‣ Promoted Tweet Analytics
‣ Each different metric is part of the key hierarchy
46. Production Uses
‣ Promoted Tweet Analytics
‣ Uses the temporal aggregation to quickly show different levels of granularity
47. Production Uses
‣ Promoted Tweet Analytics
‣ Data can be historical, or from 60 seconds ago
48. Production Uses
‣ Internal Monitoring and Alerting
‣ We require operational reporting on all internal services
‣ Needs to be real-time, but also want longer-term aggregates
‣ Hierarchical, too: [stat, datacenter, service, machine]
49. Production Uses
‣ Tweet Button Counts
‣ Tweet Button counts are requested many, many times each day from across the web
‣ Uses the all-time field
50. Agenda
‣ Why Real-time Analytics?
‣ Rainbird and Cassandra
‣ Production Uses at Twitter
‣ Open Source
53. Open Source?
‣ Yes! ... but not yet
‣ Relies on unreleased version of Cassandra
54. Open Source?
‣ Yes! ... but not yet
‣ Relies on unreleased version of Cassandra
‣ ... but the counters patch is committed in trunk (0.8)
55. Open Source?
‣ Yes! ... but not yet
‣ Relies on unreleased version of Cassandra
‣ ... but the counters patch is committed in trunk (0.8)
‣ ... also relies on some internal frameworks we need to open source
56. Open Source?
‣ Yes! ... but not yet
‣ Relies on unreleased version of Cassandra
‣ ... but the counters patch is committed in trunk (0.8)
‣ ... also relies on some internal frameworks we need to open source
‣ It will happen
57. Open Source?
‣ Yes! ... but not yet
‣ Relies on unreleased version of Cassandra
‣ ... but the counters patch is committed in trunk (0.8)
‣ ... also relies on some internal frameworks we need to open source
‣ It will happen
‣ See http://github.com/twitter for proof of how much Twitter open sources
58. Team
‣ John Corwin (@johnxorz)
‣ Adam Samet (@damnitsamet)
‣ Johan Oskarsson (@skr)
‣ Kelvin Kakugawa (@kelvin)
‣ Chris Goffinet (@lenn0x)
‣ Steve Jiang (@sjiang)
‣ Kevin Weil (@kevinweil)
59. If You Only Remember One Slide...
‣ Rainbird is a distributed, high-volume counting service built on top of Cassandra
‣ Write 100,000s of events per second, query it with hierarchy and multiple time granularities, and get results back in <100 ms
‣ Used by Twitter for multiple products internally, including our Promoted Products, operational monitoring, and the Tweet Button
‣ Will be open sourced so the community can use and improve it!