With tens of thousands of concurrent online gamers sending data to Gearbox in realtime, their data infrastructure was beginning to become stressed. They called Full 360 to evaluate, redesign, build and tune a new solution.
Gearbox data warehouse running on Amazon Redshift was part of a complex stack:
- Archive data is flattened into JSON rows via Elastic Map Reduce.
- ETL job checks data files for new attributes, then adds attributes as a new column in target table.
- JSON data is parsed (by Redshift) into target table. JSON source, with a table for each type of event generated by the game.
- The original state has no optimizations for columnar storage.
The existing Redshift ingestion process and database design at Gearbox could not scale to the 20TB per day of raw source data. Load performance was inadequate. The disk footprint in Redshift was too large. Handling of new attributes challenged the original design. Historical queries were unacceptably slow. This raised general questions about cost, performance and staffing.
Full 360 worked with Gearbox to streamline the data model, change the source data format, create a new loader process using ECS, and tune the tables to perform properly. First we identified pros and cons of the initial state. Then we evaluated two new design proposals, Event Based CSV and KVP from CSV.
Our recommendation was KVP from CSV which employed a hybrid storage model giving the following characteristics.
Common data model between cheap and expensive storage
- KVP solution involves creating a DW tier in S3 that is in identical KVP format as Redshift.
- Redshift merely reflects a rolling window subset of the KVP data in S3.
- Hive can be used against S3 KVP data using SQL queries that are very similar to Redshift queries.
- Any question users may have can be answered through either Redshift or Hive.
Not all data is retained in expensive storage
- Redshift is expensive storage compared to S3.
- With event based tables, all data is stored regardless of whether or not it is required by end users. In a production environment this will not scale for longer timeframes.
- In KVP model, only fields that are actually required for analysis are persisted in non-KVP fact tables. While the actual KVP data takes more data storage per day in Redshift, the actual fact tables will be significantly smaller than the event tables in width and volume.
Optimization and performance
- Data loads will perform well because of narrow source data files and CSV format.
- All fact tables will be optimized for end user requirements.
Proposed Solution & Architecture
- Streamline the data model
- Create a new loader process
- KVP tables loaded from CSV source
- Tune Redshift tables
- New Sorting
- New distribution keys
- New field encoding
Architecture & Technology
- Docker Containers
- ECS Fleet
- HFMS / Amazon Elastic Map Reduce
Results, Outcomes & Success Metrics
The creation of a new ingestion process enabled Gearbox to handle the volume of transactions as needed. Full 360 was able to analyze the problem and understand the basic constrains of their design and implementation and are able to expand their capabilities on a going forward basis if needed. Significant price reduction in Redshift spending, increased performance, and satisfied customer.
Gearbox is an American video game development company based in Frisco, Texas.