Improving Redshift Data Warehouse performance

Amazon Redshift conceptual overview - Amazon Redshift

(ETL engine in the above could be AWS Glue)

There are various ways to define performance and what it means. A simple way to stay aligned with management is to improve how fast dashboards load. That gives you a concrete use case: improving SELECT query performance for aggregations with filtering by date/time and a few other frequently used fields. Proper performance testing and baselining are important when you are evaluating cloud data warehouse compute and speed on large datasets.

What are the ways to improve it?

Long story short, you can either: 1) increase the resources, or 2) limit how many blocks your data warehouse needs to scan/read to find the data you’re looking for. Of course, there are other ways and techniques, like caching the most frequent results, using more performant SQL functions, etc., but the two mentioned above are the core principles that will make a significant impact.

Redshift

In the AWS Documentation there is a High-level overview of Redshift architecture.

  • Redshift is a column-based (columnar) database, unlike many relational databases that store their data by rows.
  • In contrast to row-based databases, a columnar database stores its data in columns, with the data for each column stored together.
  • Using columnar storage, each data block stores the values of a single column for multiple rows.
  • By storing all the records for one field close to each other, columnar databases can perform aggregations and analytical queries faster than row-based databases.

The block size depends on the database, but Amazon Redshift uses a block size of 1 MB. That’s a lot bigger than typical databases, which use block sizes of up to 32 KB. According to Amazon, this is more efficient and further reduces the number of I/O requests needed to perform any database loading or other operations that are part of query execution. In addition to its data, each block also stores some metadata, for instance the minimum and maximum values of the column within that block. This helps Redshift understand which blocks it should read and which ones it can skip.
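If you’re curious, you can inspect this block-level metadata through Redshift system tables. A minimal sketch, assuming a hypothetical table named orders exists in your cluster:

-- Look at per-block metadata (zone maps) for a hypothetical "orders" table.
-- stv_blocklist has one row per 1 MB block, including min/max values that
-- let Redshift skip blocks at query time (the values are raw 8-byte
-- representations, so they read most cleanly for numeric and date/time columns).
SELECT b.tbl,
       b.col,
       b.blocknum,
       b.num_values,
       b.minvalue,
       b.maxvalue
FROM stv_blocklist b
JOIN svv_table_info t ON t.table_id = b.tbl
WHERE t."table" = 'orders'
ORDER BY b.col, b.blocknum
LIMIT 20;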

AWS documentation explaining Columnar storage is a good read.

What techniques do we use to improve performance?

The following techniques can significantly improve performance:

  • Applying Sort and Distribution keys
  • Using Column Encoding
  • Setting WLMs and configuring concurrency scaling
  • Using RA3 nodes
  • Creating Data marts
  • Vacuuming and analyzing databases frequently

It may not be the same for all cases, but those are pretty general recommendations you should follow. Let’s review them one by one.

Apply Sort and Distribution keys

Choosing the appropriate SORT key can lead to faster query times because large sections of data can be skipped while processing the query. This is especially relevant for queries with range-restricted predicates, since it reduces disk I/O by improving zone map effectiveness.

While SORT keys define how data is sorted, the DIST key defines how rows are distributed across compute nodes. Optimizing the distribution can also reduce query time by minimizing the impact of the query optimizer’s redistribution step.
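As an illustration, here is a minimal DDL sketch with a distribution key and a sort key. The table and column names (orders, customer_id, created_at) are hypothetical; choose keys that match your own join and filter patterns:

-- Hypothetical fact table: distribute on the main join column, sort on the
-- column most often used in range filters (e.g. dashboards filtering by date).
CREATE TABLE orders (
    order_id     BIGINT,
    customer_id  BIGINT,
    order_status VARCHAR(16),
    amount       DECIMAL(12,2),
    created_at   TIMESTAMP
)
DISTSTYLE KEY
DISTKEY (customer_id)          -- co-locates rows that are joined on customer_id
COMPOUND SORTKEY (created_at); -- lets range-restricted queries skip whole blocks

With this layout, a query filtering on created_at between two dates can skip every block whose zone map falls outside that range.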

Use Column Encoding

Using column encoding can reduce your storage footprint and improve query performance. If you don’t use compression, then data eventually consumes additional storage space, and when querying it, Redshift needs to read more blocks. Applying compression to large columns can not only reduce the storage but also significantly improve the performance of your cluster.

Amazon Redshift also has an automated Advisor that reviews storage metadata associated with large uncompressed columns that aren’t used as sort keys. It can give you useful suggestions on which compressions you should use for which columns, in addition to some other recommendations to improve performance and decrease operating costs. Just take everything there with a grain of salt.
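Besides Advisor, you can ask Redshift directly which encodings it would pick, or set them explicitly in the DDL. A hedged sketch, reusing the hypothetical orders table:

-- Sample the table and get per-column encoding suggestions.
-- Note: this takes an exclusive table lock while it runs.
ANALYZE COMPRESSION orders;

-- Or set encodings explicitly when creating a table.
CREATE TABLE orders_encoded (
    order_id     BIGINT        ENCODE az64,
    order_status VARCHAR(16)   ENCODE lzo,
    amount       DECIMAL(12,2) ENCODE az64,
    created_at   TIMESTAMP     ENCODE az64
);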

WLM rules

Although Redshift has an Automatic WLM and Query Priorities queue mechanism, it’s possible to control it manually too. This may add some additional overhead, but it allows you to define fine-grained workload groups and control concurrency scaling manually.

Concurrency scaling allows you to dynamically expand cluster capacity in response to the workload. The concurrency scaling feature in Amazon Redshift can extend the throughput of the cluster by up to 10 times. The maximum number of concurrency scaling clusters you can currently set is 10, but it’s a soft limit that you can ask AWS to increase.
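If you go the manual WLM route, you route queries to a queue by setting a query group, and you can check how much time queries actually spend waiting via system tables. A minimal sketch; the queue label is made up, and service class ranges differ between manual and automatic WLM:

-- Route this session's queries to a manual WLM queue labeled 'dashboards'
-- (the label must match a query group defined in your WLM configuration).
SET query_group TO 'dashboards';
SELECT COUNT(*) FROM orders;   -- example dashboard query
RESET query_group;

-- Rough check of queue wait vs. execution time per service class.
SELECT service_class,
       COUNT(*)                          AS queries,
       AVG(total_queue_time) / 1000000.0 AS avg_queue_seconds,
       AVG(total_exec_time)  / 1000000.0 AS avg_exec_seconds
FROM stl_wlm_query
WHERE service_class > 5   -- user-defined queues under manual WLM
GROUP BY service_class
ORDER BY service_class;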

Materialized views and/or Data marts

Another common and effective way to increase query performance is to pre-aggregate your data in data marts or create materialized views.

If you use dbt for managing your models, keep in mind that it doesn’t support Redshift materialized views, so create data marts instead. As an example:

  • Store order data in your data warehouse that comes from a transaction database.
  • Orders may be created every minute, but your clients don’t necessarily do an analysis on every order, minute by minute.
  • If your use case allows, grouping those orders and aggregating them by hour or by day will significantly reduce the overall amount of data you need to store.
  • Consequently, this will improve query performance because, again, Redshift will need to read fewer blocks (see the sketch below).
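A minimal sketch of such a data mart, again using the hypothetical orders table; in a dbt project the SELECT would live in a model, but plain SQL shows the idea:

-- Pre-aggregate raw orders into a daily mart so dashboards scan far fewer
-- blocks than they would against the raw table.
CREATE TABLE mart_orders_daily
DISTSTYLE ALL          -- small aggregate table, fine to copy to every node
SORTKEY (order_date)
AS
SELECT DATE_TRUNC('day', created_at)::date AS order_date,
       order_status,
       COUNT(*)    AS orders_count,
       SUM(amount) AS total_amount
FROM orders
GROUP BY 1, 2;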

Use RA3 Nodes for Independent Compute and Storage Scaling

In 2019, Amazon introduced RA3 nodes for Redshift. They use very large and performant SSDs for local caching, automatic fine-grained data eviction, and some smart data pre-fetching. The bottom line is that with RA3 nodes, you can separate compute and storage. What is nice about that separation is that you can scale your clusters according to your computing needs, not by how much data you need to store. For instance, you could even have an automated script that increases node count during peak hours and reduces node count over the night, or something similar.

Additionally, with RA3 nodes you can leverage AQUA (Advanced Query Accelerator), which can speed up certain queries that use the LIKE or SIMILAR TO expressions AQUA supports. You can find more information in the Working with AQUA (Advanced Query Accelerator) documentation section.

Database maintenance and house-keeping

Vacuuming and analyzing tables might help improve performance too, or rather, not make it worse after some time. In Redshift, two commands will help you to do that — VACUUM and ANALYZE:

  • VACUUM — Re-sorts rows and reclaims space in either a specified table or all tables in the current database.
  • ANALYZE — Updates table statistics for use by the query planner.

Vacuum as often as you need to maintain consistent query performance. If you delay vacuuming, the vacuum will take longer because more data has to be reorganized. So it makes sense to run it more often. For instance, we chose to run it every day.

You don’t necessarily need to run ANALYZE right after you first create a table. But if you make any modifications, don’t forget to run the ANALYZE command to update table statistics. For simplicity, we chose to run ANALYZE every day, together with VACUUM.
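As a minimal sketch, a daily maintenance run could look like the following; the table name is the hypothetical orders table from earlier, and the sort threshold is the default, which you may want to tune:

-- Re-sort rows and reclaim space left behind by deletes and updates.
-- Without a table name, VACUUM processes all tables in the current database.
VACUUM FULL orders TO 95 PERCENT;

-- Refresh planner statistics so the optimizer has up-to-date row counts
-- and value distributions.
ANALYZE orders;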

=END