Redshift
This page guides you through the process of setting up the Redshift destination connector.
Prerequisites
The Airbyte Redshift destination allows you to sync data to Redshift.
This Redshift destination connector has two replication strategies:
- INSERT: Replicates data via SQL INSERT queries. This strategy is built on top of the destination-jdbc code base and relies on the JDBC 4.2 standard drivers provided by Amazon via Maven Central, as described in the Redshift documentation. Not recommended for production workloads, as it does not scale well.
For INSERT strategy:
- Host
- Port
- Username
- Password
- Schema
- Database
- This database needs to exist within the cluster provided.
- JDBC URL Params (optional)
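Additional JDBC parameters, if needed, are supplied as key=value pairs joined by &. A minimal sketch (the parameter names below are illustrative of the Amazon Redshift JDBC driver's options; check the driver documentation for what it actually supports):

```text
ssl=true&tcpKeepAlive=true
```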
- COPY: Replicates data by first uploading it to an S3 bucket and then issuing a COPY command. This is the loading approach recommended by Redshift best practices. It requires an S3 bucket and credentials. Data is copied into S3 as multiple files with a manifest file; an illustrative COPY statement appears at the end of this section.
Airbyte automatically picks an approach based on the given configuration: if an S3 configuration is present, Airbyte uses the COPY strategy; otherwise it uses INSERT.
For COPY strategy:
- S3 Bucket Name
- See the AWS documentation on creating an S3 bucket.
- S3 Bucket Region
- Place the S3 bucket and the Redshift cluster in the same region to save on networking costs.
- Access Key Id
- See the AWS documentation on how to generate an access key.
- We recommend creating an Airbyte-specific user. This user will require read and write permissions to objects in the staging bucket.
- Secret Access Key
- Corresponding key to the above key id.
- Part Size
- Optional. Affects the size limit of an individual Redshift table. Files are streamed to S3 in parts, and this setting determines the size of each part, in MB. Because S3 allows at most 10,000 parts per file, the part size caps the table size: the default of 10 MB yields a default table limit of roughly 100 GB (10 MB × 10,000 parts). Increase this if syncing tables larger than 100 GB. Note that a larger part size also increases memory requirements; a rule of thumb is to multiply the part size by 10 to estimate the memory needed. Modify this with care.
- S3 Filename pattern
- The pattern allows you to set the file-name format for the S3 staging file(s). The following placeholder combinations are currently supported: {date}, {date:yyyy_MM}, {timestamp}, {timestamp:millis}, {timestamp:micros}, {part_number}, {sync_id}, {format_extension}. Please don't use empty spaces or unsupported placeholders, as they won't be recognized.
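For example, a hypothetical pattern combining several of the supported placeholders:

```text
{date}_{part_number}{format_extension}
```

With this pattern, each staged file is named with the sync date, its part number, and the extension of the staging format; the exact rendering of each placeholder depends on the connector version.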
Optional parameters:
- Bucket Path
- The directory within the S3 bucket in which to place the staging data. For example, if you set this to yourFavoriteSubdirectory, the staging data will be placed inside s3://yourBucket/yourFavoriteSubdirectory. If not provided, defaults to the root directory.
- Purge Staging Data
- Whether to delete the staging files from S3 after completing the sync. Specifically, the connector will create CSV files named bucketPath/namespace/streamName/syncDate_epochMillis_randomUuid.csv containing three columns (ab_id, data, emitted_at). Normally these files are deleted after the COPY command completes; if you want to keep them for other purposes, set purge_staging_data to false.
NOTE: S3 staging does not use the SSH Tunnel option for copying data, if one is configured; the SSH Tunnel supports the SQL connection only. S3 access is secured through public HTTPS only. Subsequent typing and deduping queries on the final table are executed using the provided SSH Tunnel configuration.
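For illustration, the COPY strategy ultimately relies on a Redshift COPY statement over the staged files. The following is a minimal sketch rather than the connector's exact statement; the schema, table, bucket, path, credentials, and region are all placeholders:

```sql
-- Load the staged CSV files listed in a manifest into a Redshift table
COPY my_schema.my_table
FROM 's3://yourBucket/yourFavoriteSubdirectory/manifest'
ACCESS_KEY_ID '<your-access-key-id>'
SECRET_ACCESS_KEY '<your-secret-access-key>'
MANIFEST
CSV
REGION 'us-east-1';
```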
Step 1: Set up Redshift
- Log in to the AWS Management Console. If you don't have an AWS account already, you'll need to create one in order to use the API.
- Go to the AWS Redshift service.
- Create and activate an AWS Redshift cluster if you don't have one ready.
- (Optional) Allow connections from Airbyte to your Redshift cluster (if they exist in separate VPCs).
- (Optional) Create a staging S3 bucket (for the COPY strategy).
Permissions in Redshift
Airbyte writes data into two schemas: whichever schema you want your data to land in (e.g. my_schema), and a "Raw Data" schema that Airbyte uses to improve ELT reliability. By default, this raw data schema is airbyte_internal, but this can be overridden in the Redshift Destination's advanced settings.
Airbyte also needs to query Redshift's SVV_TABLE_INFO table for metadata about the tables Airbyte manages.
To ensure the airbyte_user has the correct permissions to:
- create schemas in your database
- grant usage to any existing schemas you want Airbyte to use
- grant select on the svv_table_info table
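If the airbyte_user does not exist yet, you can create it first. This is a minimal sketch; the password is a placeholder (Redshift requires at least one upper-case letter, one lower-case letter, and one digit):

```sql
-- Create a dedicated Redshift user for Airbyte (placeholder password)
CREATE USER airbyte_user PASSWORD 'Sup3rSecretPassword';
```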
You can execute the following SQL statements:
GRANT CREATE ON DATABASE database_name TO airbyte_user; -- add create schema permission
GRANT usage, create on schema my_schema TO airbyte_user; -- add create table permission
GRANT SELECT ON TABLE SVV_TABLE_INFO TO airbyte_user; -- add select permission for svv_table_info
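As a quick sanity check that the grants took effect, you can run a query such as the following while connected as airbyte_user (illustrative; any selection from SVV_TABLE_INFO will do):

```sql
-- This fails with a permission error if the SELECT grant is missing
SELECT "schema", "table", tbl_rows FROM SVV_TABLE_INFO LIMIT 5;
```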
Optional Use of SSH Bastion Host
This connector supports the use of a Bastion host as a gateway to a private Redshift cluster via SSH tunneling. Setting up the host is beyond the scope of this document, but several tutorials are available online to facilitate this task. Enter the bastion host, port, and credentials in the destination configuration.
Step 2: Set up the destination connector in Airbyte
For Airbyte Cloud:
- Log into your Airbyte Cloud account.
- In the left navigation bar, click Destinations. In the top-right corner, click + new destination.
- On the destination setup page, select Redshift from the Destination type dropdown and enter a name for this connector.
- Fill in all the required fields to use the INSERT or COPY strategy.
- Click Set up destination.
For Airbyte Open Source:
- Go to your local Airbyte page.
- In the left navigation bar, click Destinations. In the top-right corner, click + new destination.
- On the destination setup page, select Redshift from the Destination type dropdown and enter a name for this connector.
- Fill in all the required fields to use the INSERT or COPY strategy.
- Click Set up destination.
Supported sync modes
The Redshift destination connector supports the following sync modes:
- Full Refresh
- Incremental - Append Sync
- Incremental - Append + Deduped
Performance considerations
Synchronization performance depends on the amount of data to be transferred. Cluster scaling issues can be resolved directly using the cluster settings in the AWS Redshift console.