This page provides you with instructions on how to extract data from a Google Cloud SQL (GCS) database and load it into Amazon Redshift on an ongoing basis. (If this manual process is a bit more involved than you’d prefer, check out Stitch, which can do all the heavy lifting for you in just a few clicks.)
What is Google Cloud SQL?
Google Cloud SQL is an object-relational database management system (ORDBMS). The service itself is part of the larger Google Cloud Platform. Google Cloud Platform handily supports PostgreSQL and MySQL databases. The primary function of this feature is to store data securely, supporting best practices, and to allow for retrieval at the request of other software applications. It can handle workloads ranging from small single-machine applications to large Internet-facing applications with many concurrent users.
Pulling Data Out of Google Cloud SQL
There are several methods for extracting data from Google Cloud SQL, detailed below:
Most people retrieve data from databases by simply writing queries. In PostgreSQL and MySQL SELECT queries allow you to pull exactly the data you want by specifying filters, ordering, and limiting results. If you have a specific subset of data in mind or are looking to continuously monitor a subset of a specific table, SELECT queries may be a good fit.
If you’re just looking to export data in bulk, however, there is a much easier and faster way to do it.
For GCS PostgreSQL, you can use the command-line tool called pg_dump that allows you to export data from a Postgres database as a flat file (i.e. CSV) or a script that can be run to restore the database on any Postgres server.
Preparing Google Cloud SQL Data for Redshift
Here’s the tricky part: for every table in GCS, you need a matching table in Redshift to receive the data. Thankfully, Redshift is the perfect match for this since its syntax is actually based on the syntax of Postgres.
If you’ve used pg_dump to export a restore script, most if not all of the code it generates can be simply executed against a Redshift instance to restore the data (or just the structure if that’s what you exported– this may be preferable if you are going to load data using the COPY command as detailed below).
If you haven’t used pg_dump, you can look to the information_schema database, which contains all of the metadata information you’ll need to recreate your tables in another environment.
For Google Cloud SQL MySQL you can use a handy command-line tool called mysqldump that allows you to export entire tables and databases in a format you specify (i.e. delimited text, CSV, or SQL queries that would restore the database if run).
Inserting Google Cloud SQL Data into Redshift
Once you have a flat file, it may seem like the easiest way to add your exported data (especially if there isn’t much of it), is to build INSERT statements to add data to your Redshift table row-by-row. If you have any experience with SQL, this will be your gut reaction. But beware! Redshift isn’t optimized for inserting data one row at a time, and if you have any kind of high-volume data being inserted, you would be much better off loading the data into Amazon S3 and then using the COPY command to load it into Redshift.
Keeping Data Up-To-Date
So, now what? You’ve built a script that pulls data from a GCS database and loads it into Redshift, but what happens tomorrow when you have new and updated records in your GCS database?
Depending on how you’ve built your script, you may be forced to load your entire database into Redshift again. This might be slow and painful depending on your setup, and it can introduce costly delays if latency is of value to you.
The key is to build your script in such a way that it can also identify incremental updates to your data. If your GCS database tables have fields like modified_at or auto-incrementing primary keys, you can build a script that can quickly identify records that are new or changed since your last update (or since the newest record you’ve copied into Redshift). You can set your script up as a cron job or continuous loop to keep pulling down new data as it appears.
Easier and Faster Alternatives
If you made it this far without your brain getting fried, chances are building and maintaining a script like this isn’t a very high-leverage use of your time.
Thankfully, products like Stitch were built to solve this exact problem automatically. With just a few clicks, Stitch starts extracting your GCS data, structuring it in a way that is optimized for analysis, and inserting that data into your Amazon Redshift data warehouse.