A straightforward way to automate data ingestion from S3 buckets (the data lake) into a Redshift cluster (the data warehouse) using AWS Glue.
Create a Redshift cluster inside a VPC, with at least one security group assigned to it.
Set up the security group within the VPC.
Go to VPC > Security > Security Groups, find the group attached to the Redshift cluster, and click Inbound rules > Edit inbound rules. Add a new rule of type ALL TCP with a custom source, and select the same security group you are editing (a self-referencing rule).
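If you prefer to script this step, a minimal boto3 sketch of the same self-referencing rule could look like this (the security group ID below is a placeholder):

```python
import boto3

# Placeholder ID of the security group attached to the Redshift cluster.
SECURITY_GROUP_ID = "sg-0123456789abcdef0"

ec2 = boto3.client("ec2")

# Self-referencing inbound rule: allow all TCP traffic whose source is the
# security group itself, so Glue resources launched with this group can
# reach the Redshift cluster inside the VPC.
ec2.authorize_security_group_ingress(
    GroupId=SECURITY_GROUP_ID,
    IpPermissions=[
        {
            "IpProtocol": "tcp",
            "FromPort": 0,
            "ToPort": 65535,
            "UserIdGroupPairs": [{"GroupId": SECURITY_GROUP_ID}],
        }
    ],
)
```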
Create a new role, or edit an existing one, to grant your Glue jobs permission to run and to read from S3. Go to IAM > Roles > Create Role, select Glue, and attach the following two managed policies:
- AWSGlueServiceRole
- AmazonRedshiftFullAccess
Then you will need to create and add two more policies:
1. IAMPassRole
2. S3ReadAccess:
{ "Version": "2012-10-17", "Statement": [ { "Sid": "VisualEditor0", "Effect": "Allow", "Action": "s3:GetObject", "Resource": "arn:aws:s3:::*/*" }, { "Sid": "VisualEditor1", "Effect": "Allow", "Action": [ "s3:ListBucket", "s3:GetBucketLocation" ], "Resource": "arn:aws:s3:::*" } ] }
S3 Wheel File
Download the redshift_connector wheel file (link), which lets the job manage the connection to Redshift and write to it, then upload it to S3. If you restricted the S3 policy to a specific bucket, make sure to upload the file to that same bucket.
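The upload can be done from the console or with a short boto3 call; the bucket and key below are placeholders and should match the object URI used later in the job parameters.

```python
import boto3

# Placeholder bucket and key; keep them consistent with the S3 policy and
# with the --additional-python-modules value configured later.
BUCKET = "your_bucket"
KEY = "path_1/redshift_connector/redshift_connector-1.2-py3-none-any.whl"

s3 = boto3.client("s3")
s3.upload_file("redshift_connector-1.2-py3-none-any.whl", BUCKET, KEY)
```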
Glue Connections
Create two connections (a scripted sketch follows the two field lists). Go to Glue > Connections > Add Connection:
- 1. This connection allows your Glue job to reach services within your VPC.
Connection name: glue-connection-tovpc
Connection type: Network
VPC: Same as your Redshift cluster
Subnet: Subnet within your VPC
Security groups: Choose the one you previously created
- 2. This connection securely stores and retrieves your Redshift login credentials.
Connection name: glue-connection-toredshift
Connection type: JDBC
JDBC URL: Go to Redshift > MyRedshiftCluster > General Information > JDBC URL
Username: A Redshift user with write permissions on the target tables.
Password: That user's password.
VPC: Same as your Redshift cluster
Subnet: Subnet within your VPC
Security groups: Choose the one you previously created
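If you would rather create the two connections programmatically, a boto3 sketch could look like the one below; the subnet, security group, availability zone, endpoint and credentials are all placeholders to replace with your own values.

```python
import boto3

glue = boto3.client("glue")

# Placeholder network settings; use the subnet and security group from your VPC.
physical_requirements = {
    "SubnetId": "subnet-0123456789abcdef0",
    "SecurityGroupIdList": ["sg-0123456789abcdef0"],
    "AvailabilityZone": "us-east-1a",
}

# 1. Network connection so the job can reach services inside the VPC.
glue.create_connection(
    ConnectionInput={
        "Name": "glue-connection-tovpc",
        "ConnectionType": "NETWORK",
        "ConnectionProperties": {},
        "PhysicalConnectionRequirements": physical_requirements,
    }
)

# 2. JDBC connection that stores the Redshift endpoint and credentials.
glue.create_connection(
    ConnectionInput={
        "Name": "glue-connection-toredshift",
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": "jdbc:redshift://my-cluster.abc123.us-east-1.redshift.amazonaws.com:5439/dev",
            "USERNAME": "redshift_user",
            "PASSWORD": "redshift_password",
        },
        "PhysicalConnectionRequirements": physical_requirements,
    }
)
```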
Glue Job Configuration
Create a new Glue Job:
Name: automate-s3-to-redshift
IAM role: Role that you previously created.
Type: Spark / Python Shell
Glue version: Spark 2.4, Python 3 (Glue Version 2.0) for a Spark job, or Python 3 (Glue Version 1.0) for a Python Shell job
This job runs: A new script to be authored by you
Connections: Add the two that were created in previous steps.
Go to Actions > Edit Job > Security configuration > Job parameters:
Key: --additional-python-modules
Value: s3://your_bucket/path_1/redshift_connector/redshift_connector-1.2-py3-none-any.whl (the S3 object URI of the wheel file you uploaded)
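The same job, including the --additional-python-modules parameter, can also be defined with boto3; the script location, role name and paths below are assumptions.

```python
import boto3

glue = boto3.client("glue")

glue.create_job(
    Name="automate-s3-to-redshift",
    Role="glue-s3-to-redshift-role",  # role created earlier (assumed name)
    GlueVersion="2.0",
    Command={
        "Name": "glueetl",  # Spark job; use "pythonshell" for a Python Shell job
        "ScriptLocation": "s3://your_bucket/scripts/automate_s3_to_redshift.py",
        "PythonVersion": "3",
    },
    Connections={
        "Connections": ["glue-connection-tovpc", "glue-connection-toredshift"]
    },
    DefaultArguments={
        # Points Glue at the wheel file uploaded earlier.
        "--additional-python-modules": "s3://your_bucket/path_1/redshift_connector/redshift_connector-1.2-py3-none-any.whl",
    },
)
```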
Glue Job Script
The job script, using the redshift_connector library from the wheel file, will do the following (a sketch follows the list):
- Get your credentials from the 2nd connection you created.
- Use these values to establish a connection with Redshift.
- Copy your S3 files to Redshift.
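A minimal sketch of such a script is shown below. It reuses the connection name from above and assumes a hypothetical S3 path, target table, Parquet files, and a COPY role that is associated with the Redshift cluster; adapt all of these to your setup.

```python
import boto3
import redshift_connector  # installed via --additional-python-modules

# Assumed names and paths; replace with your own.
CONNECTION_NAME = "glue-connection-toredshift"
S3_PATH = "s3://your_bucket/data/my_table/"
TARGET_TABLE = "public.my_table"
COPY_ROLE_ARN = "arn:aws:iam::123456789012:role/redshift-copy-role"  # attached to the cluster

# 1. Read the endpoint and credentials from the Glue connection.
glue = boto3.client("glue")
props = glue.get_connection(Name=CONNECTION_NAME)["Connection"]["ConnectionProperties"]
jdbc_url = props["JDBC_CONNECTION_URL"]  # jdbc:redshift://host:port/database
host_port, database = jdbc_url.replace("jdbc:redshift://", "").split("/", 1)
host, port = host_port.split(":")

# 2. Open a connection to Redshift with redshift_connector.
conn = redshift_connector.connect(
    host=host,
    port=int(port),
    database=database,
    user=props["USERNAME"],
    password=props["PASSWORD"],
)

# 3. COPY the S3 files into the target table and commit.
cursor = conn.cursor()
cursor.execute(
    f"COPY {TARGET_TABLE} FROM '{S3_PATH}' "
    f"IAM_ROLE '{COPY_ROLE_ARN}' FORMAT AS PARQUET;"
)
conn.commit()
conn.close()
```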
The pipeline from S3 to Redshift is now automated.