Rainbird:
Real-time Analytics @Twitter
Kevin Weil -- @kevinweil
Product Lead for Revenue, Twitter
Agenda
‣   Why Real-time Analytics?
‣   Rainbird and Cassandra
‣   Production Uses at Twitter
‣   Open Source
My Background
‣   Mathematics and Physics at Harvard, Physics at
    Stanford
‣   Tropos Networks (city-wide wireless): mesh
    routing algorithms, GBs of data
‣   Cooliris (web media): Hadoop and Pig for
    analytics, TBs of data
‣   Twitter: Hadoop, Pig, HBase, Cassandra, data
    viz, social graph analysis, soon to be PBs of data
    Now revenue products!
Agenda
‣   Why Real-time Analytics?
‣   Rainbird and Cassandra
‣   Production Uses at Twitter
‣   Open Source
Why Real-time Analytics
‣   Twitter is real-time
‣   ... even in space
And My Personal Favorite
Real-time Reporting
‣   Discussion around ad-based revenue model
‣   Help shape the conversation in real-time with
    Promoted Tweets
‣   Real-time reporting ties it all together
Agenda
‣   Why Real-time Analytics?
‣   Rainbird and Cassandra
‣   Production Uses at Twitter
‣   Open Source
Requirements
‣   Extremely high write volume
‣      Needs to scale to 100,000s of WPS

‣   High read volume
‣      Needs to scale to 10,000s of RPS

‣   Horizontally scalable (reads, storage, etc)
‣      Needs to scale to 100+ TB

‣   Low latency
‣      Most reads <100 ms (esp. recent data)
Cassandra
‣   Pro: In-house expertise
‣   Pro: Open source Apache project
‣   Pro: Writes are extremely fast
‣   Pro: Horizontally scalable, low latency
‣   Pro: Other startup adoption (Digg, SimpleGeo)




‣   Con: It was really young (0.3a)
Cassandra
‣   Pro: Some dudes at Digg had already started
    working on distributed atomic counters in
    Cassandra
‣   Say hi to @kelvin
‣   And @lenn0x
‣   A dude from
    Sweden began helping: @skr


‣   Now all at Twitter :)
Rainbird
‣   It counts things. Really quickly.
‣   Layers on top of the distributed
    counters patch, CASSANDRA-1072


‣   Relies on Zookeeper, Cassandra, Scribe, Thrift
‣   Written in Scala
Rainbird Design
‣   Aggregators buffer for 1m
‣   Intelligent flush to Cassandra
‣   Query servers read once written
‣   1m is configurable
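The buffering described above can be sketched in a few lines. This is a toy illustration, not Rainbird's actual code: the names `Aggregator`, `DictStore`, and `flush` are invented here, and a plain dict stands in for Cassandra's distributed counters.

```python
from collections import defaultdict

class DictStore:
    """Stand-in for Cassandra distributed counters."""
    def __init__(self):
        self.counts = defaultdict(int)
    def add(self, category, key, count):
        self.counts[(category, key)] += count

class Aggregator:
    """Buffer increments in memory, then flush one batched
    increment per (category, key) to the store."""
    def __init__(self, store):
        self.store = store                    # a Cassandra client in the real system
        self.buffer = defaultdict(int)        # (category, key) -> pending count

    def increment(self, category, key, value=1):
        # Many events for the same counter collapse into one buffered entry,
        # keeping the flush volume far below the raw event rate.
        self.buffer[(category, tuple(key))] += value

    def flush(self):
        # In Rainbird this runs roughly once a minute (configurable).
        for (category, key), count in self.buffer.items():
            self.store.add(category, key, count)
        self.buffer.clear()

store = DictStore()
agg = Aggregator(store)
for _ in range(3):
    agg.increment("pti", ["advertiser_id", "campaign_id"])
agg.flush()
print(store.counts[("pti", ("advertiser_id", "campaign_id"))])  # 3
```

The point of the one-minute buffer is that it trades a small delay in visibility for a large reduction in write volume against the store.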
Rainbird Data Structures
struct Event
{
    1: i32 timestamp,                                   // Unix timestamp of event
    2: string category,                                 // Stat category name
    3: list<string> key,                                // Stat keys (hierarchical)
    4: i64 value,                                       // Actual count (diff)
    5: optional set<Property> properties,               // More later
    6: optional map<Property, i64> propertiesWithCounts // More later
}
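For readers more at home in Python than Thrift IDL, here is an illustrative mirror of the struct. The field names come from the slide; the `Property` type is simplified to a string since its definition is not shown.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set
import time

@dataclass
class Event:
    # Python mirror of the Thrift struct above, for illustration only.
    timestamp: int                                # Unix timestamp of the event
    category: str                                 # stat category name, e.g. "pti"
    key: List[str]                                # hierarchical stat keys
    value: int                                    # actual count (diff)
    properties: Optional[Set[str]] = None         # Property simplified to str here
    propertiesWithCounts: Optional[Dict[str, int]] = None

e = Event(timestamp=int(time.time()), category="pti",
          key=["advertiser_id", "campaign_id", "tweet_id"], value=1)
```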
Hierarchical Aggregation
‣   Say we’re counting Promoted Tweet impressions
‣   category = pti
‣   keys = [advertiser_id, campaign_id, tweet_id]
‣   count = 1
‣   Rainbird automatically increments the count for
‣      [advertiser_id, campaign_id, tweet_id]
‣      [advertiser_id, campaign_id]
‣      [advertiser_id]
‣   Means fast queries over each level of hierarchy
‣   Configurable in rainbird.conf, or dynamically via ZK
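The automatic increment of every level can be expressed as prefix expansion over the key list. A minimal sketch (the helper name `key_prefixes` is invented here):

```python
def key_prefixes(keys):
    # Every non-empty prefix of the key list gets its own counter,
    # which is what makes each level of the hierarchy queryable directly.
    return [keys[:i] for i in range(len(keys), 0, -1)]

for prefix in key_prefixes(["advertiser_id", "campaign_id", "tweet_id"]):
    print(prefix)
```

One incoming event thus becomes len(keys) counter increments, one per hierarchy level.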
Hierarchical Aggregation
‣   Another example: tracking URL shortener tweets/clicks
‣   full URL = http://music.amazon.com/some_really_long_path
‣   keys = [com, amazon, music, full URL]
‣   count = 1
‣   Rainbird automatically increments the count for
‣      [com, amazon, music, full URL]   How many people tweeted full URL?
‣      [com, amazon, music]             How many people tweeted any music.amazon.com URL?
‣      [com, amazon]                    How many people tweeted any amazon.com URL?
‣      [com]                            How many people tweeted any .com URL?
‣   Means we can count clicks on full URLs
‣   And automatically aggregate over domains and subdomains!
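Deriving that key list from a URL is a small transform: split the hostname, reverse it so the TLD comes first, and append the full URL as the deepest level. A sketch (the helper name `url_keys` is an assumption, not Rainbird's API):

```python
from urllib.parse import urlparse

def url_keys(url):
    # "music.amazon.com" -> ["com", "amazon", "music"], then the full URL
    # as the leaf so per-URL counts sit under their domain hierarchy.
    host = urlparse(url).hostname
    return list(reversed(host.split("."))) + [url]

print(url_keys("http://music.amazon.com/some_really_long_path"))
# ['com', 'amazon', 'music', 'http://music.amazon.com/some_really_long_path']
```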
Temporal Aggregation
‣   Rainbird also does (configurable) temporal
    aggregation
‣   Each count is kept minutely, but also
    denormalized hourly, daily, and all time
‣   Gives us quick counts at varying granularities
    with no large scans at read time
‣      Trading storage for latency
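The denormalization above means each increment touches one counter per granularity. One way to picture the bucket keys (the exact key encoding here is an assumption; only the minute/hour/day/all-time granularities come from the slide):

```python
import datetime as dt

def time_buckets(ts):
    # One denormalized counter per granularity: minutely, hourly,
    # daily, and all time. Reads then hit the right granularity
    # directly instead of scanning finer-grained counters.
    t = dt.datetime.fromtimestamp(ts, dt.timezone.utc)
    return {
        "minute":   t.strftime("%Y%m%d%H%M"),
        "hour":     t.strftime("%Y%m%d%H"),
        "day":      t.strftime("%Y%m%d"),
        "all_time": "all",
    }

print(time_buckets(0))  # minute '197001010000', hour '1970010100', day '19700101'
```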
Multiple Formulas
‣   So far we have talked about sums
‣   Could also store counts (1 for each event)
‣   ... which gives us a mean
‣   And sums of squares (count * count for each event)
‣   ... which gives us a standard deviation
‣   And min/max as well


‣   Configure this per-category in rainbird.conf
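The mean and standard deviation fall out of the three running sums with no scan over raw events. A sketch of the arithmetic (the class and method names are invented for illustration):

```python
import math

class StatCounter:
    """Keep count, sum, and sum of squares per the slide;
    mean and standard deviation are then O(1) to derive."""
    def __init__(self):
        self.count = 0
        self.total = 0
        self.sum_sq = 0
        self.min = None
        self.max = None

    def record(self, value):
        self.count += 1
        self.total += value
        self.sum_sq += value * value
        self.min = value if self.min is None else min(self.min, value)
        self.max = value if self.max is None else max(self.max, value)

    def mean(self):
        return self.total / self.count

    def stddev(self):
        # population std dev: sqrt(E[x^2] - E[x]^2), from the running sums
        return math.sqrt(self.sum_sq / self.count - self.mean() ** 2)

s = StatCounter()
for v in [2, 4, 4, 4, 5, 5, 7, 9]:
    s.record(v)
print(s.mean(), s.stddev())  # 5.0 2.0
```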
Rainbird
‣   Write 100,000s of events per second, each with
    hierarchical structure
‣   Query with minutely granularity over any level of
    the hierarchy, get back a time series
‣   Or query all time values
‣   Or query all time means, standard deviations
‣   Latency < 100ms
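The read path is cheap because a time-series query is just one point lookup per requested bucket. A toy sketch of that shape, assuming counters keyed by (category, key prefix, time bucket); the `query` helper is hypothetical:

```python
def query(counts, category, key, buckets):
    # One point lookup per requested time bucket; missing buckets read as 0.
    # No scan over finer-grained data is needed at any granularity.
    return [counts.get((category, tuple(key), b), 0) for b in buckets]

counts = {("pti", ("advertiser_id",), "202104010101"): 7,
          ("pti", ("advertiser_id",), "202104010102"): 3}
print(query(counts, "pti", ["advertiser_id"],
            ["202104010101", "202104010102", "202104010103"]))  # [7, 3, 0]
```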
Agenda
‣   Why Real-time Analytics?
‣   Rainbird and Cassandra
‣   Production Uses at Twitter
‣   Open Source
Production Uses
‣   It turns out we need to count things all the time
‣   As soon as we had this service, we started
    finding all sorts of use cases for it
‣      Promoted Products
‣      Tweeted URLs, by domain/subdomain
‣      Per-user Tweet interactions (fav, RT, follow)
‣      Arbitrary terms in Tweets
‣      Clicks on t.co URLs
Production Uses
‣   Promoted Tweet Analytics
‣      Each different metric is part of the key hierarchy
‣      Uses the temporal aggregation to quickly show different levels of granularity
‣      Data can be historical, or from 60 seconds ago
Production Uses
‣   Internal Monitoring and Alerting




‣   We require operational reporting on all internal services
‣   Needs to be real-time, but also want longer-term
    aggregates
‣   Hierarchical, too: [stat, datacenter, service, machine]
Production Uses
‣   Tweet Button Counts




‣   Tweet Button counts are requested many many
    times each day from across the web
‣   Uses the all time field
Agenda
‣   Why Real-time Analytics?
‣   Rainbird and Cassandra
‣   Production Uses at Twitter
‣   Open Source
Open Source?
‣   Yes! ... but not yet
‣   Relies on unreleased version of Cassandra
‣      ... but the counters patch is committed in trunk (0.8)
‣      ... also relies on some internal frameworks we need to
    open source
‣   It will happen
‣   See http://github.com/twitter for proof of how much
    Twitter loves open source
Team
‣   John Corwin (@johnxorz)
‣   Adam Samet (@damnitsamet)
‣   Johan Oskarsson (@skr)
‣   Kelvin Kakugawa (@kelvin)
‣   Chris Goffinet (@lenn0x)
‣   Steve Jiang (@sjiang)
‣   Kevin Weil (@kevinweil)
If You Only Remember One Slide...
‣   Rainbird is a distributed, high-volume counting
    service built on top of Cassandra
‣   Write 100,000s of events per second, query it with
    hierarchy and multiple time granularities, returns
    results in <100 ms
‣   Used by Twitter for multiple products internally,
    including our Promoted Products, operational
    monitoring and Tweet Button
‣   Will be open sourced so the community can use and
    improve it!
Questions?
        Follow me: @kevinweil
Valentin Kuznetsov
33 slides9.5K views
Hadoop summit 2010 frameworks panel elephant bird by Kevin Weil, has 11 slides with 2078 views.Elephant Bird is a framework for working with structured data within Hadoop ecosystems. It allows users to specify a flexible, forward-backward compatible, self-documenting data schema and then generates code for input/output formats, Hadoop Writables, and Pig load/store functions. This reduces the amount of code needed and allows users to focus on their data. Elephant Bird underlies 20,000 Hadoop jobs per day at Twitter.
Hadoop summit 2010 frameworks panel elephant birdHadoop summit 2010 frameworks panel elephant bird
Hadoop summit 2010 frameworks panel elephant bird
Kevin Weil
11 slides2.1K views
Hadoop at Twitter (Hadoop Summit 2010) by Kevin Weil, has 46 slides with 7148 views.Kevin Weil presented on Hadoop at Twitter. He discussed Twitter's data lifecycle including data input via Scribe and Crane, storage in HDFS and HBase, analysis using Pig and Oink, and data products like Birdbrain. He described how tools like Scribe, Crane, Elephant Bird, Pig, and HBase were developed and used at Twitter to handle large volumes of log and tabular data at petabyte scale.
Hadoop at Twitter (Hadoop Summit 2010)Hadoop at Twitter (Hadoop Summit 2010)
Hadoop at Twitter (Hadoop Summit 2010)
Kevin Weil
46 slides7.1K views
Big Data at Twitter, Chirp 2010 by Kevin Weil, has 71 slides with 14051 views.Slides from my talk on collecting, storing, and analyzing big data at Twitter for Chirp Hack Day at Twitter's Chirp conference.
Big Data at Twitter, Chirp 2010Big Data at Twitter, Chirp 2010
Big Data at Twitter, Chirp 2010
Kevin Weil
71 slides14.1K views
How to Start Using Analytics Without Feeling Overwhelmed by Kissmetrics on SlideShare, has 54 slides with 3405 views.This document provides guidance on using analytics without feeling overwhelmed. It outlines 5 principles for finding valuable analytics insights: 1) Avoid vanity metrics that don't impact business, 2) Focus on customers, 3) Track whether customers return, 4) Track your customer funnel, and 5) Establish your own benchmarks for comparison over time. It also provides 4 tactics for using Google Analytics, including setting up goals and funnels to track customer actions and conversions. Customer analytics is recommended for a more complete view of customers rather than anonymous traffic data alone.
How to Start Using Analytics Without Feeling OverwhelmedHow to Start Using Analytics Without Feeling Overwhelmed
How to Start Using Analytics Without Feeling Overwhelmed
Kissmetrics on SlideShare
54 slides3.4K views
Email Optimization: A discussion about how A/B testing generated $500 million... by MarketingSherpa, has 65 slides with 5009 views.In this webinar you’ll hear from Amelia Showalter, who headed the email and digital analytics teams for President Barack Obama’s 2012 presidential campaign. They’ll discuss how they maintained a breakneck testing schedule for its massive email program with Daniel Burstein, Director of Editorial Content, MECLABS. Take an inside look at the campaign headquarters, detailing the thrilling successes and informative failures Obama for America encountered at the cutting edge of digital politics. In this session, you'll learn: • Why subject lines mattered so much, and the back stories behind some of the most memorable ones • Why the prettiest email isn’t always the best email • How a free bumper sticker offer can pay for itself many times over • The most important takeaway from all those tests and trials
Email Optimization: A discussion about how A/B testing generated $500 million...Email Optimization: A discussion about how A/B testing generated $500 million...
Email Optimization: A discussion about how A/B testing generated $500 million...
MarketingSherpa
65 slides5K views
Human resonance for leaders pap 2015 by Bernhard K.F. Pelzer, has 28 slides with 2451 views.Leading in the new world of work – Human Resonance With so many models and approaches – from large firms to business schools to boutiques – it is hard for companies to architect the tailored yet integrated experiences they need. In our “Human Resonance” approach we offer what is needed. Next level practice instead of best practice! In this new world of work, the barriers between work and life are eliminated. The “new world of work” is one that requires a dramatic change in strategies for leadership, talent, and human resources. A new playbook for new times Growth, volatility, change, and disruptive technology drive companies to shift their underlying business model. It is time to address this disruption, transforming the leaders from a transaction-execution function into a dominant partner who pushes innovative solutions to managers at all levels. Unless c-level managers embraces this transformation, they will struggle to solve problems at the pace the business demands. Today’s challenges require a new playbook – one that makes leaders more agile, forward thinking, bolder and more pushy in their solutions. Our goal in this presentation is to give business leaders fresh ideas and perspectives to shape thinking about priorities for 2015. In a growing, changing economy, business challenges abound. Yet few can be addressed successfully without new approaches to solving the people challenges that accompany them— challenges that have grown in importance and complexity.
Human resonance for leaders pap 2015Human resonance for leaders pap 2015
Human resonance for leaders pap 2015
Bernhard K.F. Pelzer
28 slides2.5K views
Why So Many Businesses are Hopeless at Blogging - New Statistics for 2013 by Passle, has 28 slides with 5262 views.Passle has spent the last few months researching business blogging, and how to make it much easier. Here are our latest findings.
Why So Many Businesses are Hopeless at Blogging - New Statistics for 2013Why So Many Businesses are Hopeless at Blogging - New Statistics for 2013
Why So Many Businesses are Hopeless at Blogging - New Statistics for 2013
Passle
28 slides5.3K views
Using Single Keyword Ad Groups To Drive PPC Performance by Sam Owen, has 31 slides with 4815 views.A look at how a simplified campaign structure can improve PPC results through top performer single keyword ad group campaigns. From Sam Owen at Hero Conf 2014.
Using Single Keyword Ad Groups To Drive PPC PerformanceUsing Single Keyword Ad Groups To Drive PPC Performance
Using Single Keyword Ad Groups To Drive PPC Performance
Sam Owen
31 slides4.8K views
The Social Consumer Study by Don Bulmer, has 13 slides with 7666 views.The Social Consumer, study explores the factors that inform, impact and shape trust, loyalty and preferences of the digitally connected consumer. In this study, we tested the belief that brands which can tap into emotions about and awareness of their values (human/social) are most likely to inspire positive action and loyalty from consumers. Our view is that the super-connectedness of global communications has challenged how companies interact, engage and maintain relevance and trust with their key audiences and the public-at-large. As such, the reputation of a company is no longer defined by what they “report” or what they “say” they stand for. Instead, they are increasingly defined by the shared opinions and experiences of socially-connected consumers. The findings reflect a number of surprising and validating insights, informed by surveys completed by 927 respondents mostly from the U.S. with about 10 percent from rest-of-world with great distribution and balance across age and gender.
The Social Consumer StudyThe Social Consumer Study
The Social Consumer Study
Don Bulmer
13 slides7.7K views
Brand Equity Measures by sat2coolguy, has 7 slides with 2147 views.Brand equity is an intangible asset that has psychological and financial value for an organization. It is measured using three metrics: knowledge, preference, and financial consideration. Knowledge metrics measure brand awareness and association through recall and brand-product connections. Preference metrics evaluate customer loyalty and purchasing behavior. Financial metrics assess a brand's monetary value in terms of market share, price premiums, and revenue generation. High scores across these metrics optimize a brand's financial performance.
Brand Equity MeasuresBrand Equity Measures
Brand Equity Measures
sat2coolguy
7 slides2.1K views
Novartis: Prevacid 24HR Online Consumer Trial Panel and Brand Ambassador Prog... by Affinitive, has 16 slides with 3829 views.The Prevacid®24HR Panel was a consumer trial panel and brand ambassador program launched by Novartis Consumer Health to cultivate influential frequent heartburn sufferers to try their new over-the-counter heartburn treatment and provide feedback. Over 10,000 consumers applied and 800 were selected to trial the product and document their experiences online. Panelists engaged on a discussion forum and provided testimonials that reached over 250,000 consumers and helped make Prevacid®24HR the second largest branded OTC heartburn treatment. The successful program increased positive perception of the brand.
Novartis: Prevacid 24HR Online Consumer Trial Panel and Brand Ambassador Prog...Novartis: Prevacid 24HR Online Consumer Trial Panel and Brand Ambassador Prog...
Novartis: Prevacid 24HR Online Consumer Trial Panel and Brand Ambassador Prog...
Affinitive
16 slides3.8K views
The Response Rates of Personalized Cross-Media Marketing Campaigns by Aaron Corson, has 18 slides with 9334 views.The document discusses response rates from personalized cross-media marketing campaigns. MindFireInc, a leader in marketing intelligence software, was named the 20th fastest growing software company in the United States for the second consecutive year in 2008 and 2009. The document examines response rate data from MindFireInc's database, other industry reports, and case studies of campaigns run by Proven Direct and for Anchor Bank and Cardinal Stritch University.
The Response Rates of Personalized Cross-Media Marketing CampaignsThe Response Rates of Personalized Cross-Media Marketing Campaigns
The Response Rates of Personalized Cross-Media Marketing Campaigns
Aaron Corson
18 slides9.3K views
Human resonance for leaders pap 2015 by Bernhard K.F. Pelzer, has 28 slides with 2451 views.Leading in the new world of work – Human Resonance With so many models and approaches – from large firms to business schools to boutiques – it is hard for companies to architect the tailored yet integrated experiences they need. In our “Human Resonance” approach we offer what is needed. Next level practice instead of best practice! In this new world of work, the barriers between work and life are eliminated. The “new world of work” is one that requires a dramatic change in strategies for leadership, talent, and human resources. A new playbook for new times Growth, volatility, change, and disruptive technology drive companies to shift their underlying business model. It is time to address this disruption, transforming the leaders from a transaction-execution function into a dominant partner who pushes innovative solutions to managers at all levels. Unless c-level managers embraces this transformation, they will struggle to solve problems at the pace the business demands. Today’s challenges require a new playbook – one that makes leaders more agile, forward thinking, bolder and more pushy in their solutions. Our goal in this presentation is to give business leaders fresh ideas and perspectives to shape thinking about priorities for 2015. In a growing, changing economy, business challenges abound. Yet few can be addressed successfully without new approaches to solving the people challenges that accompany them— challenges that have grown in importance and complexity.
Human resonance for leaders pap 2015Human resonance for leaders pap 2015
Human resonance for leaders pap 2015
Bernhard K.F. Pelzer
28 slides2.5K views

Similar to Rainbird: Realtime Analytics at Twitter (Strata 2011) (20)

Elastic Data Analytics Platform @Datadog by C4Media, has 95 slides with 1142 views.Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/2l2Rr6L. Doug Daniels discusses the cloud-based platform they have built at DataDog and how it differs from a traditional datacenter-based analytics stack. He walks through the decisions they have made at each layer, covers the pros and cons of these decisions and discusses the tooling they have built. Filmed at qconsf.com. Doug Daniels is a Director of Engineering at Datadog, where he works on high-scale data systems for monitoring, data science, and analytics. Prior to joining Datadog, he was CTO at Mortar Data and an architect and developer at Wireless Generation, where he designed data systems to serve more than 4 million students in 49 states.
Elastic Data Analytics Platform @DatadogElastic Data Analytics Platform @Datadog
Elastic Data Analytics Platform @Datadog
C4Media
95 slides1.1K views
[CB16] 80時間でWebを一周:クロムミウムオートメーションによるスケーラブルなフィンガープリント by Isaac Dawson by CODE BLUE, has 60 slides with 1077 views.分散型のスキャナーの構築は挑戦のし甲斐があり、実在のブラウザを使って作る場合はなおさらである。 今回紹介するスキャナーでは、ChromiumにJSのライブラリやそのバージョンを得るためのJavaScriptを注入することで、スキャンしたサイトのすべてのHTMLとJavaScript、独自アーキテクチャを必要とするセキュリティヘッダを保存できる。 このスキャナーでトップの100万サイトに対してスキャンを行い、現在のWeb上の状況を調べることが可能となるスケーラブルなシステムを設計する際に克服した課題についてカバーした。 本講演では、データ分析で得られた興味深い点にも触れるつもりである。 --- アイザック・ドーソンIsaac Dawson アイザック・ドーソンは、Veracode社の主要なセキュリティ研究者の一人で、彼の率いる同社の研究開発チームは、Veracode社の動的解析の提供に努めている。 Veracode社の前は@stake社とSymantec社でコンサルタントをしていた。 2004年にアプリケーションセキュリティのコンサルティングチーム発足させるため、日本へやってきた。 Veracode社での勤務が始まった後、彼の中で日本があまりにも快適であることがはっきりしたので、それ以降、滞在し続けることを決めたのだった。 Go言語の熱心なプログラマーであり、分散システムに関心があり、特にWebのスキャニングに強い関心をもっている。
[CB16] 80時間でWebを一周:クロムミウムオートメーションによるスケーラブルなフィンガープリント by Isaac Dawson[CB16] 80時間でWebを一周:クロムミウムオートメーションによるスケーラブルなフィンガープリント by Isaac Dawson
[CB16] 80時間でWebを一周:クロムミウムオートメーションによるスケーラブルなフィンガープリント by Isaac Dawson
CODE BLUE
60 slides1.1K views
Streams by Marielle Lange, has 21 slides with 802 views.This document provides an introduction to streams and their uses for data processing and analysis. Streams allow processing large datasets in a manageable way by handling data as a sequential flow rather than loading entire files into memory at once. The document discusses readable and writable streams that sources and sink data, as well as transform streams that manipulate data. It provides examples of using streams for tasks like scraping websites, normalizing data, and performing map-reduce operations. The programming benefits of streams like separation of concerns and a functional programming style are also outlined.
StreamsStreams
Streams
Marielle Lange
21 slides802 views
Creating PostgreSQL-as-a-Service at Scale by Sean Chittenden, has 45 slides with 1241 views.Description of some of the elements that go in to creating a PostgreSQL-as-a-Service for organizations with many teams and a diverse ecosystem of applications and teams.
Creating PostgreSQL-as-a-Service at ScaleCreating PostgreSQL-as-a-Service at Scale
Creating PostgreSQL-as-a-Service at Scale
Sean Chittenden
45 slides1.2K views
Data Platform at Twitter: Enabling Real-time & Batch Analytics at Scale by Sriram Krishnan, has 42 slides with 4150 views.The Data Platform at Twitter supports engineers and data scientists running batch jobs on Hadoop clusters that are several 1000s of nodes, and real-time jobs on top of systems such as Storm. In this presentation, I discuss the overall Data Platform stack at Twitter. In particular, I talk about enabling real-time and batch analytics at scale with the help of Scalding, which is a Scala DSL for batch jobs using MapReduce, Summingbird, which is a framework for combined real-time and batch processing, and Tsar, which is a framework for real-time time-series aggregations.
Data Platform at Twitter: Enabling Real-time & Batch Analytics at ScaleData Platform at Twitter: Enabling Real-time & Batch Analytics at Scale
Data Platform at Twitter: Enabling Real-time & Batch Analytics at Scale
Sriram Krishnan
42 slides4.2K views
Realtime Analytics on AWS by Sungmin Kim, has 79 slides with 538 views.This document discusses real-time analytics on streaming data. It describes why real-time data streaming and analytics are important due to the perishable nature of data value over time. It then covers key components of real-time analytics systems including data sources, stream storage, stream ingestion, stream processing, and stream delivery. Finally, it discusses streaming data processing techniques like filtering, enriching, and converting streaming data.
Realtime Analytics on AWSRealtime Analytics on AWS
Realtime Analytics on AWS
Sungmin Kim
79 slides538 views
Real-time Fraud Detection for Southeast Asia’s Leading Mobile Platform by ScyllaDB, has 45 slides with 2267 views.Grab is one of the most frequently used mobile platforms in Southeast Asia, providing the everyday services that matter most to consumers. Its users commute, eat, arrange shopping deliveries, and pay with one e-wallet. Grab relies on the combination of Apache Kafka and Scylla for a very critical use case -- instantaneously detecting fraudulent transactions that might occur across approximately more than six million on-demand rides per day taking place in eight countries across Southeast Asia. Doing this successfully requires many things to happen in near-real time. Join our webinar for this fascinating real-time big data use case, and learn the steps Grab took to optimize their fraud detection systems using the Scylla NoSQL database along with Apache Kafka.
Real-time Fraud Detection for Southeast Asia’s Leading Mobile PlatformReal-time Fraud Detection for Southeast Asia’s Leading Mobile Platform
Real-time Fraud Detection for Southeast Asia’s Leading Mobile Platform
ScyllaDB
45 slides2.3K views
Using Event Streams in Serverless Applications by Jonathan Dee, has 65 slides with 112 views.This document summarizes a presentation about building serverless applications using event streams. It discusses what serverless computing means for developers, common use cases like APIs and stream processing using functions as a service (FaaS). It also covers using event streams and message buses to build event-driven architectures and decouple services. Key aspects covered include event-driven design principles, using queues to control parallelism, and designing independent, scalable queue workers.
Using Event Streams in Serverless ApplicationsUsing Event Streams in Serverless Applications
Using Event Streams in Serverless Applications
Jonathan Dee
65 slides112 views
Spark + AI Summit 2019: Apache Spark Listeners: A Crash Course in Fast, Easy ... by Landon Robinson, has 42 slides with 163 views.The Spark Listener interface provides a fast, simple and efficient route to monitoring and observing your Spark application - and you can start using it in minutes. In this talk, we'll introduce the Spark Listener interfaces available in core and streaming applications, and show a few ways in which they've changed our world for the better at SpotX. If you're looking for a "Eureka!" moment in monitoring or tracking of your Spark apps, look no further than Spark Listeners and this talk!
Spark + AI Summit 2019: Apache Spark Listeners: A Crash Course in Fast, Easy ...Spark + AI Summit 2019: Apache Spark Listeners: A Crash Course in Fast, Easy ...
Spark + AI Summit 2019: Apache Spark Listeners: A Crash Course in Fast, Easy ...
Landon Robinson
42 slides163 views
2014 Import.io Data Summit - Including Hadoop/Impala Getting Started Demo by Ian Massingham, has 40 slides with 513 views.Slides from the 2014 Import.io Data Summit. Includes some slides that have a getting started walk through for setting up Impala on Amazon EMR.
2014 Import.io Data Summit - Including Hadoop/Impala Getting Started Demo2014 Import.io Data Summit - Including Hadoop/Impala Getting Started Demo
2014 Import.io Data Summit - Including Hadoop/Impala Getting Started Demo
Ian Massingham
40 slides513 views
Tsar tech talk by Anirudh Todi, has 98 slides with 277 views.TSAR is a framework for building time-series aggregation jobs at scale. It allows users to specify how to extract events from data sources, the dimensions to aggregate on, metrics to compute, and datastores to write outputs to. TSAR then handles all aspects of deploying and operating the aggregation pipeline, including processing events in real-time and batch, coordinating data schemas, and providing operational tools. It is based on Summingbird which provides abstractions for distributed computation across different platforms.
Tsar tech talkTsar tech talk
Tsar tech talk
Anirudh Todi
98 slides277 views
TSAR (TimeSeries AggregatoR) Tech Talk by Anirudh Todi, has 98 slides with 1279 views.Twitter's 250 million users generate tens of billions of tweet views per day. Aggregating these events in real time - in a robust enough way to incorporate into our products - presents a massive scaling challenge. In this presentation I introduce TSAR (the TimeSeries AggregatoR), a robust, flexible, and scalable service for real-time event aggregation designed to solve this problem and a range of similar ones. I discuss how we built TSAR using Python and Scala from the ground up, almost entirely on open-source technologies (Storm, Summingbird, Kafka, Aurora, and others), and describe some of the challenges we faced in scaling it to process tens of billions of events per day.
TSAR (TimeSeries AggregatoR) Tech TalkTSAR (TimeSeries AggregatoR) Tech Talk
TSAR (TimeSeries AggregatoR) Tech Talk
Anirudh Todi
98 slides1.3K views
Apache Spark Listeners: A Crash Course in Fast, Easy Monitoring by Databricks, has 42 slides with 3643 views.The Spark Listener interface provides a fast, simple and efficient route to monitoring and observing your Spark application - and you can start using it in minutes. In this talk, we'll introduce the Spark Listener interfaces available in core and streaming applications, and show a few ways in which they've changed our world for the better at SpotX. If you're looking for a "Eureka!" moment in monitoring or tracking of your Spark apps, look no further than Spark Listeners and this talk!
Apache Spark Listeners: A Crash Course in Fast, Easy MonitoringApache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Apache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Databricks
42 slides3.6K views
Functional architectural patterns by Lars Albertsson, has 35 slides with 1698 views.The functional paradigm is not only applicable to programming. There is even more reason for using functional patterns at an architectural level. MapReduce is the most famous example of such a pattern. In this talk, we will go through a few other architectural patterns, and their corresponding stateful anti-patterns.
Functional architectural patternsFunctional architectural patterns
Functional architectural patterns
Lars Albertsson
35 slides1.7K views
Simpler, faster, cheaper Enterprise Apps using only Spring Boot on GCP by Daniel Zivkovic, has 47 slides with 200 views.The document is about an upcoming meetup hosted by ServerlessToronto.org on "Serverless Cloud Native Java with Spring Cloud GCP" presented by Ray Tsang. It includes an agenda for the event with topics on Spring Cloud GCP features and integrations with Google Cloud Platform services. There is also information about upcoming meetups from the organization and a thank you from Ray Tsang for attending the presentation.
Simpler, faster, cheaper Enterprise Apps using only Spring Boot on GCPSimpler, faster, cheaper Enterprise Apps using only Spring Boot on GCP
Simpler, faster, cheaper Enterprise Apps using only Spring Boot on GCP
Daniel Zivkovic
47 slides200 views
Cloud Native Data Pipelines (in Eng & Japanese) - QCon Tokyo by Sid Anand, has 176 slides with 1048 views.Slides from "Cloud Native Data Pipelines" talk given @ QCon Tokyo 2016. The slides are in both English and Japanese. Thanks to Kiro Harada (https://jp.linkedin.com/in/haradakiro) for the translation.
Cloud Native Data Pipelines (in Eng & Japanese)  - QCon TokyoCloud Native Data Pipelines (in Eng & Japanese)  - QCon Tokyo
Cloud Native Data Pipelines (in Eng & Japanese) - QCon Tokyo
Sid Anand
176 slides1K views
Big Data at Riot Games – Using Hadoop to Understand Player Experience - Stamp... by StampedeCon, has 69 slides with 10652 views.At the StampedeCon 2013 Big Data conference in St. Louis, Riot Games discussed Using Hadoop to Understand and Improve Player Experience. Riot Games aims to be the most player-focused game company in the world. To fulfill that mission, it’s vital we develop a deep, detailed understanding of players’ experiences. This is particularly challenging since our debut title, League of Legends, is one of the most played video games in the world, with more than 32 million active monthly players across the globe. In this presentation, we’ll discuss several use cases where we sought to understand and improve the player experience, the challenges we faced to solve those use cases, and the big data infrastructure that supports our capability to provide continued insight.
Big Data at Riot Games – Using Hadoop to Understand Player Experience - Stamp...Big Data at Riot Games – Using Hadoop to Understand Player Experience - Stamp...
Big Data at Riot Games – Using Hadoop to Understand Player Experience - Stamp...
StampedeCon
69 slides10.7K views
20141021 AWS Cloud Taekwon - Big Data on AWS by Amazon Web Services Korea, has 76 slides with 3956 views.This document discusses big data solutions on AWS. It covers data generation, collection and storage using services like S3, DynamoDB and Kinesis. It also discusses analytics and computation using EMR, Redshift and real-time analytics with Kinesis. Finally it discusses collaboration and sharing insights using visualization tools and BI tools. Real-world examples of using these AWS services for big data are provided.
20141021 AWS Cloud Taekwon - Big Data on AWS20141021 AWS Cloud Taekwon - Big Data on AWS
20141021 AWS Cloud Taekwon - Big Data on AWS
Amazon Web Services Korea
76 slides4K views
Open Source Lambda Architecture with Hadoop, Kafka, Samza and Druid by DataWorks Summit, has 63 slides with 6890 views.This document discusses using an open source Lambda architecture with Kafka, Hadoop, Samza, and Druid to handle event data streams. It describes the problem of interactively exploring large volumes of time series data. It outlines how Druid was developed as a fast query layer for Hadoop to enable low-latency queries over aggregated data. The architecture ingests raw data streams in real-time via Kafka and Samza, aggregates the data in Druid, and enables reprocessing via Hadoop for reliability.
Open Source Lambda Architecture with Hadoop, Kafka, Samza and DruidOpen Source Lambda Architecture with Hadoop, Kafka, Samza and Druid
Open Source Lambda Architecture with Hadoop, Kafka, Samza and Druid
DataWorks Summit
63 slides6.9K views
Embrace NoSQL and Eventual Consistency with Ripple by Sean Cribbs, has 119 slides with 2210 views.So, there's this "NoSQL" thing you may have heard of, and this related thing called "eventual consistency". Supposedly, they help you scale, but no one has ever explained why! Well, wonder no more! This talk will demystify NoSQL, eventual consistency, how they might help you scale, and -- most importantly -- why you should care. We'll look closely at how Riak, a linearly-scalable, distributed and fault-tolerant NoSQL datastore, implements eventual consistency, and how you can harness it from Ruby via the slick Ripple client/ORM. When the talk is finished, you'll have the tools both to understand eventual consistency and to handle it like a pro inside your next Ruby application.
Embrace NoSQL and Eventual Consistency with RippleEmbrace NoSQL and Eventual Consistency with Ripple
Embrace NoSQL and Eventual Consistency with Ripple
Sean Cribbs
119 slides2.2K views
[CB16] 80時間でWebを一周:クロムミウムオートメーションによるスケーラブルなフィンガープリント by Isaac Dawson by CODE BLUE, has 60 slides with 1077 views.分散型のスキャナーの構築は挑戦のし甲斐があり、実在のブラウザを使って作る場合はなおさらである。 今回紹介するスキャナーでは、ChromiumにJSのライブラリやそのバージョンを得るためのJavaScriptを注入することで、スキャンしたサイトのすべてのHTMLとJavaScript、独自アーキテクチャを必要とするセキュリティヘッダを保存できる。 このスキャナーでトップの100万サイトに対してスキャンを行い、現在のWeb上の状況を調べることが可能となるスケーラブルなシステムを設計する際に克服した課題についてカバーした。 本講演では、データ分析で得られた興味深い点にも触れるつもりである。 --- アイザック・ドーソンIsaac Dawson アイザック・ドーソンは、Veracode社の主要なセキュリティ研究者の一人で、彼の率いる同社の研究開発チームは、Veracode社の動的解析の提供に努めている。 Veracode社の前は@stake社とSymantec社でコンサルタントをしていた。 2004年にアプリケーションセキュリティのコンサルティングチーム発足させるため、日本へやってきた。 Veracode社での勤務が始まった後、彼の中で日本があまりにも快適であることがはっきりしたので、それ以降、滞在し続けることを決めたのだった。 Go言語の熱心なプログラマーであり、分散システムに関心があり、特にWebのスキャニングに強い関心をもっている。
[CB16] 80時間でWebを一周:クロムミウムオートメーションによるスケーラブルなフィンガープリント by Isaac Dawson[CB16] 80時間でWebを一周:クロムミウムオートメーションによるスケーラブルなフィンガープリント by Isaac Dawson
[CB16] 80時間でWebを一周:クロムミウムオートメーションによるスケーラブルなフィンガープリント by Isaac Dawson
CODE BLUE
60 slides1.1K views

Rainbird: Realtime Analytics at Twitter (Strata 2011)

  • 1. Rainbird: Real-time Analytics @Twitter Kevin Weil -- @kevinweil Product Lead for Revenue, Twitter TM
  • 2. Agenda ‣ Why Real-time Analytics? ‣ Rainbird and Cassandra ‣ Production Uses at Twitter ‣ Open Source
  • 3. My Background ‣ Mathematics and Physics at Harvard, Physics at Stanford ‣ Tropos Networks (city-wide wireless): mesh routing algorithms, GBs of data ‣ Cooliris (web media): Hadoop and Pig for analytics, TBs of data ‣ Twitter: Hadoop, Pig, HBase, Cassandra, data viz, social graph analysis, soon to be PBs of data
  • 4. My Background ‣ Mathematics and Physics at Harvard, Physics at Stanford ‣ Tropos Networks (city-wide wireless): mesh routing algorithms, GBs of data ‣ Cooliris (web media): Hadoop and Pig for analytics, TBs of data ‣ Twitter: Hadoop, Pig, HBase, Cassandra, data viz, social graph analysis, soon to be PBs of data Now revenue products!
  • 5. Agenda ‣ Why Real-time Analytics? ‣ Rainbird and Cassandra ‣ Production Uses at Twitter ‣ Open Source
  • 6. Why Real-time Analytics ‣ Twitter is real-time
  • 7. Why Real-time Analytics ‣ Twitter is real-time ‣ ... even in space
  • 8. And My Personal Favorite
  • 9. And My Personal Favorite
  • 10. Real-time Reporting ‣ Discussion around ad-based revenue model ‣ Help shape the conversation in real-time with Promoted Tweets
  • 11. Real-time Reporting ‣ Discussion around ad-based revenue model ‣ Help shape the conversation in real-time with Promoted Tweets ‣ Realtime reporting ties it all together
  • 12. Agenda ‣ Why Real-time Analytics? ‣ Rainbird and Cassandra ‣ Production Uses at Twitter ‣ Open Source
  • 13. Requirements ‣ Extremely high write volume ‣ Needs to scale to 100,000s of WPS
  • 14. Requirements ‣ Extremely high write volume ‣ Needs to scale to 100,000s of WPS ‣ High read volume ‣ Needs to scale to 10,000s of RPS
  • 15. Requirements ‣ Extremely high write volume ‣ Needs to scale to 100,000s of WPS ‣ High read volume ‣ Needs to scale to 10,000s of RPS ‣ Horizontally scalable (reads, storage, etc) ‣ Needs to scale to 100+ TB
  • 16. Requirements ‣ Extremely high write volume ‣ Needs to scale to 100,000s of WPS ‣ High read volume ‣ Needs to scale to 10,000s of RPS ‣ Horizontally scalable (reads, storage, etc) ‣ Needs to scale to 100+ TB ‣ Low latency ‣ Most reads <100 ms (esp. recent data)
  • 17. Cassandra ‣ Pro: In-house expertise ‣ Pro: Open source Apache project ‣ Pro: Writes are extremely fast ‣ Pro: Horizontally scalable, low latency ‣ Pro: Other startup adoption (Digg, SimpleGeo)
  • 18. Cassandra ‣ Pro: In-house expertise ‣ Pro: Open source Apache project ‣ Pro: Writes are extremely fast ‣ Pro: Horizontally scalable, low latency ‣ Pro: Other startup adoption (Digg, SimpleGeo) ‣ Con: It was really young (0.3a)
  • 19. Cassandra ‣ Pro: Some dudes at Digg had already started working on distributed atomic counters in Cassandra
  • 20. Cassandra ‣ Pro: Some dudes at Digg had already started working on distributed atomic counters in Cassandra ‣ Say hi to @kelvin
  • 21. Cassandra ‣ Pro: Some dudes at Digg had already started working on distributed atomic counters in Cassandra ‣ Say hi to @kelvin ‣ And @lenn0x
  • 22. Cassandra ‣ Pro: Some dudes at Digg had already started working on distributed atomic counters in Cassandra ‣ Say hi to @kelvin ‣ And @lenn0x ‣ A dude from Sweden began helping: @skr
  • 23. Cassandra ‣ Pro: Some dudes at Digg had already started working on distributed atomic counters in Cassandra ‣ Say hi to @kelvin ‣ And @lenn0x ‣ A dude from Sweden began helping: @skr ‣ Now all at Twitter :)
  • 24. Rainbird ‣ It counts things. Really quickly. ‣ Layers on top of the distributed counters patch, CASSANDRA-1072
  • 25. Rainbird ‣ It counts things. Really quickly. ‣ Layers on top of the distributed counters patch, CASSANDRA-1072 ‣ Relies on Zookeeper, Cassandra, Scribe, Thrift ‣ Written in Scala
  • 26. Rainbird Design ‣ Aggregators buffer for 1m ‣ Intelligent flush to Cassandra ‣ Query servers read once written ‣ 1m is configurable
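The buffering design on this slide can be sketched in a few lines. This is an illustrative Python sketch only, not Rainbird's actual code (Rainbird is written in Scala), and the `storage` object here stands in for the Cassandra counter API:

```python
from collections import defaultdict

class BufferingAggregator:
    """Buffer increments in memory, then flush aggregated deltas in one batch.

    Sketch of the design on this slide: many increments to a hot key
    collapse into a single buffered delta, so the flush does one write
    per distinct key instead of one write per event."""

    def __init__(self, flush_interval_secs=60):  # "1m is configurable"
        self.flush_interval_secs = flush_interval_secs
        self.buffer = defaultdict(int)

    def increment(self, key, delta=1):
        # Repeated increments to the same key accumulate in memory.
        self.buffer[key] += delta

    def flush(self, storage):
        # Intelligent flush: one batched write per distinct key.
        for key, delta in self.buffer.items():
            storage.add(key, delta)
        self.buffer.clear()
```

The point of the minute-long buffer is write amplification: a key incremented thousands of times per minute still costs only one storage write per flush.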
  • 27. Rainbird Data Structures struct Event { 1: i32 timestamp, 2: string category, 3: list<string> key, 4: i64 value, 5: optional set<Property> properties, 6: optional map<Property, i64> propertiesWithCounts }
  • 28. Rainbird Data Structures (timestamp: Unix timestamp of event) struct Event { 1: i32 timestamp, 2: string category, 3: list<string> key, 4: i64 value, 5: optional set<Property> properties, 6: optional map<Property, i64> propertiesWithCounts }
  • 29. Rainbird Data Structures (category: stat category name) struct Event { 1: i32 timestamp, 2: string category, 3: list<string> key, 4: i64 value, 5: optional set<Property> properties, 6: optional map<Property, i64> propertiesWithCounts }
  • 30. Rainbird Data Structures (key: stat keys, hierarchical) struct Event { 1: i32 timestamp, 2: string category, 3: list<string> key, 4: i64 value, 5: optional set<Property> properties, 6: optional map<Property, i64> propertiesWithCounts }
  • 31. Rainbird Data Structures (value: actual count, i.e. the diff) struct Event { 1: i32 timestamp, 2: string category, 3: list<string> key, 4: i64 value, 5: optional set<Property> properties, 6: optional map<Property, i64> propertiesWithCounts }
  • 32. Rainbird Data Structures (properties, propertiesWithCounts: more later) struct Event { 1: i32 timestamp, 2: string category, 3: list<string> key, 4: i64 value, 5: optional set<Property> properties, 6: optional map<Property, i64> propertiesWithCounts }
  • 33. Hierarchical Aggregation ‣ Say we’re counting Promoted Tweet impressions ‣ category = pti ‣ keys = [advertiser_id, campaign_id, tweet_id] ‣ count = 1 ‣ Rainbird automatically increments the count for ‣ [advertiser_id, campaign_id, tweet_id] ‣ [advertiser_id, campaign_id] ‣ [advertiser_id] ‣ Means fast queries over each level of hierarchy ‣ Configurable in rainbird.conf, or dynamically via ZK
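The automatic prefix counting this slide describes can be sketched as follows. This is a hedged Python illustration: the real Rainbird increments Cassandra counter columns, not a local dict.

```python
def hierarchical_increment(counts, category, keys, delta=1):
    """Increment every prefix of the key hierarchy, as the slide
    describes: an event keyed [advertiser_id, campaign_id, tweet_id]
    also bumps [advertiser_id, campaign_id] and [advertiser_id]."""
    for depth in range(1, len(keys) + 1):
        prefix = (category,) + tuple(keys[:depth])
        counts[prefix] = counts.get(prefix, 0) + delta
```

Because every prefix is maintained at write time, a query at any level of the hierarchy is a single counter lookup rather than a scan over children.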
  • 34. Hierarchical Aggregation ‣ Another example: tracking URL shortener tweets/clicks ‣ full URL = http://music.amazon.com/some_really_long_path ‣ keys = [com, amazon, music, full URL] ‣ count = 1 ‣ Rainbird automatically increments the count for ‣ [com, amazon, music, full URL] ‣ [com, amazon, music] ‣ [com, amazon] ‣ [com] ‣ Means we can count clicks on full URLs ‣ And automatically aggregate over domains and subdomains!
  • 35. Hierarchical Aggregation ‣ Another example: tracking URL shortener tweets/clicks ‣ full URL = http://music.amazon.com/some_really_long_path ‣ keys = [com, amazon, music, full URL] ‣ count = 1 ‣ Rainbird automatically increments the count for ‣ [com, amazon, music, full URL] (how many people tweeted the full URL?) ‣ [com, amazon, music] ‣ [com, amazon] ‣ [com] ‣ Means we can count clicks on full URLs ‣ And automatically aggregate over domains and subdomains!
  • 36. Hierarchical Aggregation ‣ Another example: tracking URL shortener tweets/clicks ‣ full URL = http://music.amazon.com/some_really_long_path ‣ keys = [com, amazon, music, full URL] ‣ count = 1 ‣ Rainbird automatically increments the count for ‣ [com, amazon, music, full URL] ‣ [com, amazon, music] (how many people tweeted any music.amazon.com URL?) ‣ [com, amazon] ‣ [com] ‣ Means we can count clicks on full URLs ‣ And automatically aggregate over domains and subdomains!
  • 37. Hierarchical Aggregation ‣ Another example: tracking URL shortener tweets/clicks ‣ full URL = http://music.amazon.com/some_really_long_path ‣ keys = [com, amazon, music, full URL] ‣ count = 1 ‣ Rainbird automatically increments the count for ‣ [com, amazon, music, full URL] ‣ [com, amazon, music] ‣ [com, amazon] (how many people tweeted any amazon.com URL?) ‣ [com] ‣ Means we can count clicks on full URLs ‣ And automatically aggregate over domains and subdomains!
  • 38. Hierarchical Aggregation ‣ Another example: tracking URL shortener tweets/clicks ‣ full URL = http://music.amazon.com/some_really_long_path ‣ keys = [com, amazon, music, full URL] ‣ count = 1 ‣ Rainbird automatically increments the count for ‣ [com, amazon, music, full URL] ‣ [com, amazon, music] ‣ [com, amazon] ‣ [com] (how many people tweeted any .com URL?) ‣ Means we can count clicks on full URLs ‣ And automatically aggregate over domains and subdomains!
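The reversed-domain key list on these slides can be derived mechanically from a URL. A minimal sketch, using Python's standard `urllib.parse` (an implementation choice of this example; the slides do not specify how the keys are built):

```python
from urllib.parse import urlparse

def url_to_keys(url):
    """Turn a URL into the reversed-domain keys from the slides:
    http://music.amazon.com/some_really_long_path
    -> [com, amazon, music, full URL].
    Prefix aggregation then rolls counts up by subdomain,
    domain, and TLD automatically."""
    host = urlparse(url).hostname
    return list(reversed(host.split("."))) + [url]
```

Feeding these keys into the hierarchical increment means one click event maintains per-URL, per-subdomain, per-domain, and per-TLD counters in a single write path.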
  • 39. Temporal Aggregation ‣ Rainbird also does (configurable) temporal aggregation ‣ Each count is kept minutely, but also denormalized hourly, daily, and all time ‣ Gives us quick counts at varying granularities with no large scans at read time ‣ Trading storage for latency
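The temporal denormalization above amounts to fanning one timestamp out into one bucket per granularity. A hedged sketch (bucket names and layout are illustrative, not Rainbird's actual Cassandra schema):

```python
def time_buckets(ts):
    """Fan a unix timestamp out to the four granularities the slide
    lists: minutely, hourly, daily, and all time. Each write then
    increments one counter per bucket, so a read at any granularity
    is a point lookup with no large scan: storage traded for latency."""
    return [
        ("minute", ts - ts % 60),
        ("hour", ts - ts % 3600),
        ("day", ts - ts % 86400),
        ("alltime", 0),  # single all-time bucket
    ]
```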
  • 40. Multiple Formulas ‣ So far we have talked about sums ‣ Could also store counts (1 for each event) ‣ ... which gives us a mean ‣ And sums of squares (count * count for each event) ‣ ... which gives us a standard deviation ‣ And min/max as well ‣ Configure this per-category in rainbird.conf
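Keeping those three running aggregates (count, sum, sum of squares) is enough to answer mean and standard-deviation queries at read time. A minimal sketch of the arithmetic:

```python
import math

def stats_from_sums(n, total, total_sq):
    """Recover mean and (population) standard deviation from the
    running aggregates the slide names: count, sum, and sum of
    squares, via Var(X) = E[X^2] - E[X]^2."""
    mean = total / n
    variance = total_sq / n - mean * mean
    # Clamp tiny negative values caused by floating-point rounding.
    return mean, math.sqrt(max(variance, 0.0))
```

This is why the per-event increments are cheap: each formula is just another counter column to bump, and the expensive-looking statistics fall out of simple arithmetic at query time.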
  • 41. Rainbird ‣ Write 100,000s of events per second, each with hierarchical structure ‣ Query with minutely granularity over any level of the hierarchy, get back a time series ‣ Or query all time values ‣ Or query all time means, standard deviations ‣ Latency < 100ms
  • 42. Agenda ‣ Why Real-time Analytics? ‣ Rainbird and Cassandra ‣ Production Uses at Twitter ‣ Open Source
  • 43. Production Uses ‣ It turns out we need to count things all the time ‣ As soon as we had this service, we started finding all sorts of use cases for it ‣ Promoted Products ‣ Tweeted URLs, by domain/subdomain ‣ Per-user Tweet interactions (fav, RT, follow) ‣ Arbitrary terms in Tweets ‣ Clicks on t.co URLs
  • 44. Use Cases ‣ Promoted Tweet Analytics
  • 45. Production Uses (each different metric is part of the key hierarchy) ‣ Promoted Tweet Analytics
  • 46. Production Uses (uses the temporal aggregation to quickly show different levels of granularity) ‣ Promoted Tweet Analytics
  • 47. Production Uses (data can be historical, or from 60 seconds ago) ‣ Promoted Tweet Analytics
  • 48. Production Uses ‣ Internal Monitoring and Alerting ‣ We require operational reporting on all internal services ‣ Needs to be real-time, but also want longer-term aggregates ‣ Hierarchical, too: [stat, datacenter, service, machine]
  • 49. Production Uses ‣ Tweet Button Counts ‣ Tweet Button counts are requested many many times each day from across the web ‣ Uses the all time field
  • 50. Agenda ‣ Why Real-time Analytics? ‣ Rainbird and Cassandra ‣ Production Uses at Twitter ‣ Open Source
  • 51. Open Source? ‣ Yes!
  • 52. Open Source? ‣ Yes! ... but not yet
  • 53. Open Source? ‣ Yes! ... but not yet ‣ Relies on unreleased version of Cassandra
  • 54. Open Source? ‣ Yes! ... but not yet ‣ Relies on unreleased version of Cassandra ‣ ... but the counters patch is committed in trunk (0.8)
  • 55. Open Source? ‣ Yes! ... but not yet ‣ Relies on unreleased version of Cassandra ‣ ... but the counters patch is committed in trunk (0.8) ‣ ... also relies on some internal frameworks we need to open source
  • 56. Open Source? ‣ Yes! ... but not yet ‣ Relies on unreleased version of Cassandra ‣ ... but the counters patch is committed in trunk (0.8) ‣ ... also relies on some internal frameworks we need to open source ‣ It will happen
  • 57. Open Source? ‣ Yes! ... but not yet ‣ Relies on unreleased version of Cassandra ‣ ... but the counters patch is committed in trunk (0.8) ‣ ... also relies on some internal frameworks we need to open source ‣ It will happen ‣ See http://github.com/twitter for proof of how much Twitter open sources
  • 58. Team ‣ John Corwin (@johnxorz) ‣ Adam Samet (@damnitsamet) ‣ Johan Oskarsson (@skr) ‣ Kelvin Kakugawa (@kelvin) ‣ Chris Goffinet (@lenn0x) ‣ Steve Jiang (@sjiang) ‣ Kevin Weil (@kevinweil)
  • 59. If You Only Remember One Slide... ‣ Rainbird is a distributed, high-volume counting service built on top of Cassandra ‣ Write 100,000s of events per second, query it with hierarchy and multiple time granularities, get results back in <100 ms ‣ Used by Twitter for multiple products internally, including our Promoted Products, operational monitoring and Tweet Button ‣ Will be open sourced so the community can use and improve it!
  • 60. Questions? Follow me: @kevinweil TM
