Data Warehouse
How to design a cloud-based data warehousing solution using Amazon Redshift, the petabyte-scale data warehouse in AWS.
AWS Marketplace for Amazon RedShift
RedShift Architecture
- Redshift SQL Commands https://docs.aws.amazon.com/redshift/latest/dg/c_SQL_commands.html
- Clusters and Nodes in Amazon Redshift https://docs.aws.amazon.com/redshift/latest/mgmt/working-with-clusters.html#rs-about-clusters-and-nodes
RedShift Schema and Table Design
- Choosing Sort Keys https://docs.aws.amazon.com/redshift/latest/dg/t_Sorting_data.html Sorting enables efficient handling of range-restricted predicates.
- Distribution Styles https://docs.aws.amazon.com/redshift/latest/dg/c_choosing_dist_sort.html
- Tutorial: Tuning Table Design https://docs.aws.amazon.com/redshift/latest/dg/tutorial-tuning-tables.html in this tutorial, you will learn how to optimize the design of your tables. You will start by creating tables based on the Star Schema Benchmark (SSB) schema without sort keys, distribution styles, and compression encodings.
- 10 Best Practices for Amazon Redshift Spectrum https://aws.amazon.com/blogs/big-data/10-best-practices-for-amazon-redshift-spectrum/
- Amazon Redshift Best Practices for Designing Tables https://docs.aws.amazon.com/redshift/latest/dg/c_designing-tables-best-practices.html These design choices also have a significant effect on storage requirements, which in turn affects query performance by reducing the number of I/O operations and minimizing the memory required to process queries.
- TICKIT Sample Database Schema https://docs.aws.amazon.com/redshift/latest/dg/c_sampledb.html
Data Loading – COPY commands
- Data Format Parameters for the COPY command https://docs.aws.amazon.com/redshift/latest/dg/copy-parameters-data-format.html
- Zero Administration Amazon Redshift Database Loader https://aws.amazon.com/blogs/big-data/a-zero-administration-amazon-redshift-database-loader/ With this AWS Lambda function, you simply push files into a variety of locations on Amazon S3 and have them automatically loaded into your Amazon Redshift clusters.
RedShift Tuning
- Amazon Redshift Deep Dive: Tuning and Best Practices https://www.youtube.com/watch?v=fmy3jCxUliM&index=25&list=PLhr1KZpdzukdsblOEVXrCYtvUsDakzYJI AWS re:Invent 2015 | (BDT401)
- Amazon Redshift Engineering’s Advanced Table Design Playbook: https://aws.amazon.com/blogs/big-data/amazon-redshift-engineerings-advanced-table-design-playbook-preamble-prerequisites-and-prioritization/ AWS Data Blog series from 2016 covers Distribution Styles and Distribution Keys, Compound and Interleaved Sort Keys, Compression Encodings and Table Data Durability.
- Querying the Catalog Tables – Amazon Redshift https://docs.aws.amazon.com/redshift/latest/dg/c_join_PG.html
- Top 8 Best Practices for High-Performance ETL Processing Using Amazon Redshift https://aws.amazon.com/blogs/big-data/top-8-best-practices-for-high-performance-etl-processing-using-amazon-redshift/
- Amazon Redshift Query Plan https://docs.aws.amazon.com/redshift/latest/dg/c-the-query-plan.html
- Concurrency Scaling for Amazon Redshift https://aws.amazon.com/blogs/aws/new-concurrency-scaling-for-amazon-redshift-peak-performance-at-all-times/ Peak Performance at All Times
- Resizing Clusters in Amazon Redshift ttps://docs.aws.amazon.com/redshift/latest/mgmt/rs-resize-tutorial.html
- Amazon Redshift Advisor Recommendations https://docs.aws.amazon.com/redshift/latest/dg/advisor-recommendations.html
Monitoring and Maintaining Redshift Clusters
- This GitHub provides a collection of scripts and utilities that will assist you in getting the best performance possible from Amazon Redshift. https://github.com/awslabs/amazon-redshift-utils.
- Amazon RedShift Parameter Groups can be used to configure database settings such as query timeout and datestyle. https://docs.aws.amazon.com/redshift/latest/mgmt/working-with-parameter-groups.html.
- Analyzing Tables https://docs.aws.amazon.com/redshift/latest/dg/t_Analyzing_tables.html – Automatic analyze is enabled by default.
- VACUUM https://docs.aws.amazon.com/redshift/latest/dg/r_VACUUM_command.html Resorts rows and reclaims space in either a specified table or all tables in the current database. Amazon Redshift automatically sorts data and runs VACUUM DELETE in the background.
- Performing a Deep Copy https://docs.aws.amazon.com/redshift/latest/dg/performing-a-deep-copy.html
- Run Mixed Workloads with Amazon Redshift Workload Management https://aws.amazon.com/blogs/big-data/run-mixed-workloads-with-amazon-redshift-workload-management/
- Working with Concurrency Scaling https://docs.aws.amazon.com/redshift/latest/dg/concurrency-scaling.html– automatically adds additional cluster capacity when you need it to process an increase in concurrent read queries.
- Implementing Automatic WLM https://docs.aws.amazon.com/redshift/latest/dg/automatic-wlm.html
Data Lake Concepts
- What is a data lake? https://aws.amazon.com/big-data/datalakes-and-analytics/what-is-a-data-lake/
- Migrating Your Traditional Data Warehouse to a Modern Data Lake https://www.youtube.com/watch?v=3Xg3yu5xnMY AWS re:Invent 2017: (ABD327)
- Building Data Lakes on AWS https://d1.awsstatic.com/whitepapers/Storage/data-lake-on-aws.pdf AWS white paper.
- AWS Lake Formation https://aws.amazon.com/lake-formation/ makes it easy to set up a secure data lake in days.