Slides for talk presented at LA Redis meetup, April 16, 2016 at Scopely.
This is a draft of a session to be presented at Redis Conference 2016.
Description:
Scopely's portfolio of social and mid-core games generates billions of events each day, covering everything from in-game actions to advertising to game engine performance. As this portfolio grew in the past two years, Scopely moved all event analysis from third-party hosted solutions to a new event analytics pipeline on top of Redis and Kinesis, dramatically reducing operating costs and enabling new real-time analysis and more efficient warehousing. Our solution receives events over HTTP and SQS and provides real-time aggregation using a custom Redis-backed application, as well as prompt loads into HDFS for batch analyses.
Recently, we migrated our real-time layer from a pure Redis datastore to a hybrid datastore, with recent data in Redis and older data in DynamoDB, retaining performance while further reducing costs. In this session we will describe our experience building, tuning, and monitoring this pipeline, and the role of Redis in supporting Kinesis worker failover, deployment, and idempotence, in addition to its more visible role in data aggregation. This session is intended to be helpful for those building streaming data systems and looking for solutions for aggregation and idempotence.
3. What kind of data?
• App opened
• Killed a walker
• Bought something
• Heartbeat
• Memory usage report
• App error
• Declined a review prompt
• Finished the tutorial
• Clicked on that button
• Lost a battle
• Found a treasure chest
• Received a push message
• Finished a turn
• Sent an invite
• Scored a Yahtzee
• Spent 100 silver coins
• Anything else any game designer or developer wants to learn about
9. Kinesis
• Distributed, sharded streams. Akin to Kafka.
• Get an iterator over the stream, and checkpoint with the current stream pointer occasionally.
• Workers coordinate shard leases and checkpoints in DynamoDB (via the KCL)
[Diagram: a stream split into Shard 0, Shard 1, and Shard 2]
13–15. Checkpointing
[Diagram, three builds: Shard 0 holds records 1–22, and the worker checkpoints every 5 records, so the checkpoint for Shard 0 sits at 10. Worker A dies 🔥 past the checkpoint; Worker B takes over the shard lease and resumes from the checkpoint, reprocessing records 11 onward.]
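The three builds above can be condensed into a few lines of simulation (illustrative code, not the KCL API) to show why a failover replays everything after the last checkpoint:

```python
CHECKPOINT_EVERY = 5   # "Given: worker checkpoints every 5"

def run_worker(records, start, fail_after=None):
    """Process sequence numbers from index `start`, checkpointing every
    CHECKPOINT_EVERY records. Returns (last checkpoint, processed)."""
    checkpoint, processed = start, []
    for seq in records[start:]:
        if fail_after is not None and seq > fail_after:
            break                          # Worker A dies here
        processed.append(seq)
        if seq % CHECKPOINT_EVERY == 0:
            checkpoint = seq               # persisted to DynamoDB via the KCL

    return checkpoint, processed

shard_0 = list(range(1, 23))               # records 1..22, as on the slides

# Worker A processes past its checkpoint, then dies after record 13:
ckpt, seen_a = run_worker(shard_0, start=0, fail_after=13)

# Worker B picks up the lease and resumes from the checkpoint (record 10),
# so records 11-13 are delivered a second time:
_, seen_b = run_worker(shard_0, start=ckpt)
```

This at-least-once delivery is exactly why the next slide's idempotence machinery exists.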
16. Auxiliary Idempotence
• Idempotence keys at each stage
• Redis sets of idempotence keys by time window
• Gives resilience against various types of failures
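A sketch of the idea, using an in-memory dict as a stand-in for Redis (the class name, key shapes, and window width are all illustrative, not from the slides; the real pipeline uses SADD against a per-window Redis set, with EXPIRE so old windows age out):

```python
import time
from collections import defaultdict

WINDOW_SECONDS = 3600   # assumed bucket width

class IdempotenceFilter:
    """Tracks already-seen idempotence keys, bucketed by time window.
    In Redis this is a set per (stage, window): SADD returns 0 for a
    duplicate, and EXPIRE drops whole windows once they are stale."""

    def __init__(self):
        self.windows = defaultdict(set)   # stand-in for Redis sets

    def seen_before(self, stage, key, now=None):
        ts = now if now is not None else time.time()
        bucket = self.windows[(stage, int(ts // WINDOW_SECONDS))]
        if key in bucket:
            return True                   # duplicate: drop the record
        bucket.add(key)
        return False                      # first sighting: process it

idem = IdempotenceFilter()
first = idem.seen_before("enrichment", "tokenA/apikey/42/7", now=1000)
dup = idem.seen_before("enrichment", "tokenA/apikey/42/7", now=1000)
```

Bucketing by window bounds memory: a replayed record is only suppressed while its window is still live, which is enough to cover Kinesis failover replays.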
21. 1. Deserialize
2. Reverse deduplication
3. Apply changes to application properties
4. Get current device and application properties
5. Generate Event ID
6. Emit.
[Diagram: Collection → Kinesis → Enrichment]
22. (Same enrichment steps, with the stage's idempotence key called out.)
Idempotence Key: Device Token + API Key + Event Batch Sequence + Event Batch Session
23. Now we have a stream of well-described, denormalized event facts.
24. Preparing for Warehousing (SDW Forwarder)
[Diagram: enriched event data arrives from Kinesis (K) and is buffered per game/event stream (dice app open, bees level complete, slots payment), with each buffer's current age shown (0:10, 0:01, 0:05). Buffers become Slices, emitted by time or emitted by size, and are sent on to SQS.]
Each Slice contains:
• Game
• Event Name
• Superset of Properties in batch
• Data
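The emitted-by-time / emitted-by-size behavior can be sketched like this (the class name and thresholds are invented for illustration; the forwarder's real values are not on the slide):

```python
import time

class SliceBuffer:
    """Per-(game, event name) buffer that emits a Slice when it reaches
    a size threshold or an age threshold, whichever comes first."""

    MAX_EVENTS = 500   # assumed "emit by size" threshold
    MAX_AGE_S = 60     # assumed "emit by time" threshold

    def __init__(self, game, event_name, emit):
        self.game, self.event_name, self.emit = game, event_name, emit
        self.events, self.opened_at = [], None

    def add(self, event, now=None):
        now = now if now is not None else time.time()
        if not self.events:
            self.opened_at = now           # buffer starts aging now
        self.events.append(event)
        if len(self.events) >= self.MAX_EVENTS:
            self.flush()                   # emitted by size

    def tick(self, now=None):
        now = now if now is not None else time.time()
        if self.events and now - self.opened_at >= self.MAX_AGE_S:
            self.flush()                   # emitted by time

    def flush(self):
        self.emit({"game": self.game, "event_name": self.event_name,
                   "data": self.events})
        self.events = []

slices = []
buf = SliceBuffer("dice", "app open", slices.append)
for i in range(SliceBuffer.MAX_EVENTS):
    buf.add({"n": i}, now=0.0)             # fills the buffer: size flush
buf.add({"n": 500}, now=10.0)
buf.tick(now=80.0)                         # 70s old: time flush
```

The time path matters for low-volume streams: a rarely played game still gets its events warehoused promptly instead of waiting for a full slice.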
25. (Same forwarder, with its idempotence key called out.)
Idempotence Key: Event ID
26–28. But everything can die!
[Diagram, three builds: the forwarder's slices (dice app open, bees level complete, slots payment) are buffered in memory, so a terminating instance would lose them. Shudder subscribes, via SNS and SQS, to Auto Scaling Group (ASG) lifecycle notifications; when an instance is scheduled for termination, it calls the instance over HTTP ("Prepare to Die!"), and the instance emits all of its buffered slices before shutting down.]
29. Pipeline to HDFS
• Partitioned by event name and game, buffered in-memory and written to S3
• Picked up every hour by a Spark job
• Converted to Parquet, loaded into HDFS
37. HyperLogLog
• High-level algorithm (four-bullet version stolen from my colleague, Cristian)
• b bits of the hashed value are used as an index pointer (Redis uses b = 14, i.e. m = 16384 registers)
• The rest of the hash is inspected for the longest run of zeroes we can encounter (N)
• The register pointed to by the index is replaced with max(currentValue, N + 1)
• An estimator function is used to calculate the approximated cardinality
http://content.research.neustar.biz/blog/hll.html
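The four steps above can be written out directly. This is a toy version (SHA-1 truncated to 64 bits as a stand-in hash, raw estimator plus only the small-range linear-counting correction); Redis' hyperloglog.c adds a sparse encoding and bias corrections on top:

```python
import hashlib
import math

class SimpleHLL:
    """Minimal HyperLogLog, following the slide's four bullets."""

    def __init__(self, b=14):
        self.b = b
        self.m = 1 << b                       # 16384 registers for b = 14
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        x = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        index = x & (self.m - 1)              # low b bits pick a register
        rest = x >> self.b                    # remaining 64 - b bits
        n = (64 - self.b) - rest.bit_length() + 1   # leading zeroes + 1
        self.registers[index] = max(self.registers[index], n)

    def count(self):
        z = sum(2.0 ** -r for r in self.registers)
        estimate = self.alpha * self.m * self.m / z
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            # small-range correction: fall back to linear counting
            estimate = self.m * math.log(self.m / zeros)
        return int(estimate)
```

Because each register only keeps a max, re-adding an element never changes the sketch, which is what makes PFADD naturally idempotent.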
38. Live Metrics (Ariel)

Enriched Event Data:
  name: game_end
  time: 2015-07-15 10:00:00.000 UTC
  _devices_per_turn: 1.0
  event_id: 12345
  device_token: AAAA
  user_id: 100

  name: game_end
  time: 2015-07-15 10:01:00.000 UTC
  _devices_per_turn: 14.1
  event_id: 12346
  device_token: BBBB
  user_id: 100

  name: game_end
  time: 2015-07-15 10:01:00.000 UTC
  _devices_per_turn: 14.1
  event_id: 12347
  device_token: BBBB
  user_id: 100

Configured Metrics:
  name: Cheating Games
  predicate: _devices_per_turn > 1.5
  target: event_id
  type: DISTINCT
  id: 1

  name: Cheating Players
  predicate: _devices_per_turn > 1.5
  target: user_id
  type: DISTINCT
  id: 2

Resulting commands:
  PFADD /m/1/2015-07-15-10-00 12346
  PFADD /m/1/2015-07-15-10-00 12347
  PFADD /m/2/2015-07-15-10-00 BBBB
  PFADD /m/2/2015-07-15-10-00 BBBB
  PFCOUNT /m/1/2015-07-15-10-00   (returns 2)
  PFCOUNT /m/2/2015-07-15-10-00   (returns 1)

We can count different things.
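A sketch of how a configured metric might be turned into those PFADD calls, following each metric's configured target (the dict shapes and the `commands_for` helper are invented; in production the tuples would become PFADD calls against Redis):

```python
def commands_for(event, metrics):
    """Yield (command, key, member) for each metric the event matches.
    Keys follow the slide's /m/<metric id>/<time bucket> pattern."""
    for metric in metrics:
        if metric["predicate"](event):
            key = f"/m/{metric['id']}/{event['bucket']}"
            yield ("PFADD", key, str(event[metric["target"]]))

metrics = [
    {"id": 1, "target": "event_id",           # "Cheating Games"
     "predicate": lambda e: e["_devices_per_turn"] > 1.5},
    {"id": 2, "target": "user_id",            # "Cheating Players"
     "predicate": lambda e: e["_devices_per_turn"] > 1.5},
]

events = [
    {"bucket": "2015-07-15-10-00", "_devices_per_turn": 1.0,
     "event_id": 12345, "user_id": 100},
    {"bucket": "2015-07-15-10-00", "_devices_per_turn": 14.1,
     "event_id": 12346, "user_id": 100},
    {"bucket": "2015-07-15-10-00", "_devices_per_turn": 14.1,
     "event_id": 12347, "user_id": 100},
]

# The quiet event (12345) matches nothing; each cheating event feeds one
# PFADD per matching metric. PFCOUNT then reports distinct members.
commands = [c for e in events for c in commands_for(e, metrics)]
```

Because the member is the metric's target field and HLLs count distinct members, repeated PFADDs of the same player cost nothing, which is also what makes replays harmless here.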
41. Alarm Clocks
• Push timestamp of current events to a per-game pub/sub channel
• Take the 99th-percentile age as the delay
• Use that time for alarm calculations
• Overlay delays on dashboards
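The delay calculation itself is small; a minimal sketch (the function name and sampling are assumptions, not from the slides):

```python
def p99_age(event_timestamps, now):
    """99th-percentile age, in seconds, of recently observed event
    timestamps (epoch seconds): the pipeline's effective delay."""
    ages = sorted(now - t for t in event_timestamps)
    index = min(len(ages) - 1, int(0.99 * len(ages)))
    return ages[index]

# 100 events spread over the last 100 seconds, plus one 5-minute straggler:
stamps = [1000.0 - i for i in range(100)] + [1000.0 - 300.0]
delay = p99_age(stamps, now=1000.0)
```

Using a high percentile rather than the max means one straggling event does not fire an alarm, while a genuinely lagging game shifts the whole distribution and does.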
48. Hybrid Datastore: Plan
• Move older HLL sets to DynamoDB
• They’re just strings!
• Cache reports aggressively
• Fetch backing HLL data from DynamoDB as needed on the web layer, merge using on-instance Redis
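The merge step works because the union of HyperLogLog sketches is just the element-wise max of their registers, which is what PFMERGE computes on the scratchpad Redis. A minimal illustration (toy register arrays stand in for the HLL strings fetched from DynamoDB):

```python
def merge_registers(hlls):
    """Union of HyperLogLog sketches: element-wise max of registers.
    PFMERGE does this (plus encoding details) inside Redis."""
    return [max(column) for column in zip(*hlls)]

week1 = [3, 0, 1, 5]   # toy 4-register sketches; real ones have 16384
week2 = [2, 4, 1, 2]
merged = merge_registers([week1, week2])
```

This is why archived HLLs stay useful: any date range can be re-aggregated later by merging its stored sketches, without touching raw events.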
51. Redis Roles
• Idempotence
• Configuration Caching
• Aggregation
• Clock
• Scratchpad for merges
• Cache of reports
52. Other Considerations
• Multitenancy. We run parallel stacks and give
games an assigned affinity, to insulate from
pipeline delays
• Backfill. System is forward-looking only; can
replay Kinesis backups to backfill, or backfill from
warehouse