Install Pivotal Greenplum MPP Data Warehouse on Amazon AWS Cloud
In this AWS guide for data warehouse developers and administrators, I show the steps to install the Pivotal Greenplum Data Warehouse platform on the AWS cloud using AWS Marketplace. Pivotal Greenplum is a Massively Parallel Processing (MPP) data warehouse platform for analytics that scales to petabytes of data. It is one of the major data warehouse platforms on the market for big data analytics.
If you are an enterprise with a hybrid cloud platform, where on-premises data sources are connected to resources on the AWS cloud, you may want to extend your data warehouse platform into the cloud to take advantage of the data lake features and tools of AWS. Amazon Redshift is the MPP data warehouse platform that AWS offers to its customers. Despite its Spectrum feature, which lets you define external tables and query data stored in the Amazon S3 object store using SQL, Redshift has a few drawbacks when compared with the Pivotal Greenplum Data Warehouse solution. For example, Greenplum has a standby master node for high availability: if the active master node fails, the standby node takes over serving data clients and managing the data nodes.
In this Pivotal Greenplum guide I will not cover the features of Greenplum; I only want to show the steps to set up Pivotal Greenplum on AWS using the AWS Marketplace service.
First, log in to your AWS account.
Switch to AWS Marketplace, which is available in the N. Virginia region.
On the "Discover Products" page within the AWS Marketplace service, search for "Greenplum".
Greenplum software has a few different offerings for customers. If you choose the Pivotal Greenplum (BYOL) offering, you can test the data warehouse software for a limited period of time while paying only for the AWS resources.
For more on the Greenplum offer on AWS Marketplace, please refer to the Product Overview page.
Subscribe to Pivotal Greenplum by pressing the "Continue to Subscribe" button.
You will be directed to a page where you are asked to accept the terms before continuing with the Pivotal Greenplum setup.
You will have to wait a few minutes before you can continue installing the Pivotal Greenplum Data Warehouse platform in your AWS account. While your subscription request is being processed by the AWS Marketplace service, you will receive an email with information about your subscription.
When your subscription has been processed successfully, you can continue to the configuration of your Pivotal Greenplum Data Warehouse platform setup.
The first step in configuring the Greenplum setup is choosing the target VPC for your new data warehouse.
For example, if you want to install Pivotal Greenplum in a new VPC in your AWS account, choose the fulfillment option as "New VPC Cluster".
Otherwise, if you want to set up the Greenplum data warehouse in an existing VPC, as in my case, you can choose the "Existing VPC Cluster" option.
Since I had a VPC with a VPN configured to on-premises resources, I preferred to launch the Pivotal Greenplum data warehouse in that VPC, so I chose the existing VPC option. Because the setup uses a CloudFormation template to create the AWS resources, you can later remove your Pivotal Greenplum data warehouse platform, for example after testing it, by deleting the CloudFormation stack, which removes the resources created for Greenplum.
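Removing a test cluster by deleting its stack can also be scripted. The following is only a minimal sketch, assuming the boto3 library is installed and AWS credentials are configured; the stack name "greenplum-test" and the region are hypothetical. The function takes the CloudFormation client as an argument, so it does not import boto3 itself.

```python
def delete_stack_and_wait(cfn, stack_name):
    """Delete a CloudFormation stack and block until the deletion finishes.

    `cfn` is expected to be a boto3 CloudFormation client.
    """
    cfn.delete_stack(StackName=stack_name)
    # The waiter polls the stack until it is gone, or raises if deletion fails.
    waiter = cfn.get_waiter("stack_delete_complete")
    waiter.wait(StackName=stack_name)
    return stack_name


# Example usage (hypothetical stack name and region):
#   import boto3
#   client = boto3.client("cloudformation", region_name="us-east-1")
#   delete_stack_and_wait(client, "greenplum-test")
```

Deleting the stack is irreversible for the resources it created, so it is worth double-checking the stack name before running anything like this.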
Then select the Pivotal Greenplum software version to install.
Another configuration option is the target AWS region where you will create your Greenplum cluster.
Click the "Continue to Launch" button to proceed to the next steps.
Review the configuration details before you continue.
Then choose "Launch CloudFormation" as the action to launch the Pivotal Greenplum configuration through the AWS CloudFormation console.
I believe using a CloudFormation template makes the installation quite easy, so I prefer to continue with this option.
By default, the Pivotal Greenplum subscription provides a CloudFormation template for your configuration in a shared Amazon S3 bucket.
Keep the default option "Amazon S3 template URL" and continue to the next step in the setup wizard.
On the next screen, you can define a CloudFormation stack name. A descriptive name will later help you identify the stack when you look up the setup output parameters or delete the stack resources.
The more important parameters set here are the network parameters.
If you want your Pivotal Greenplum Data Warehouse platform to be reachable from the internet, set the InternetAccess option to True; otherwise set it to False.
For a more secure configuration, AWS cloud architects or data architects will prefer to prevent internet access to their data warehouse platform.
From the listed VPCs, choose the target VPC in which you decided to create the Greenplum cluster.
If you decide to create the Greenplum cluster in a private subnet to keep it secure, choose the target private subnet from the listed ones.
For a successful setup, select the same private subnet in the PublicSubnet entry instead of leaving it blank.
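The network choices above can be captured in a small helper. This is only a sketch: InternetAccess and PublicSubnet are the parameter names mentioned in this guide, while the "VPC" and "PrivateSubnet" keys are hypothetical and should be checked against the actual template. The output uses the ParameterKey/ParameterValue list shape that CloudFormation accepts for stack parameters.

```python
def build_network_parameters(vpc_id, private_subnet_id,
                             internet_access=False, public_subnet_id=None):
    """Return CloudFormation-style parameters for the network section.

    The "VPC" and "PrivateSubnet" keys are assumed names, not taken from
    the real Greenplum template.
    """
    if internet_access and not public_subnet_id:
        raise ValueError("a public subnet is required when InternetAccess is True")
    # For a private-only cluster, repeat the private subnet in PublicSubnet
    # rather than leaving the entry blank.
    effective_public = public_subnet_id if internet_access else private_subnet_id
    values = {
        "InternetAccess": "True" if internet_access else "False",
        "VPC": vpc_id,
        "PrivateSubnet": private_subnet_id,
        "PublicSubnet": effective_public,
    }
    return [{"ParameterKey": key, "ParameterValue": value}
            for key, value in values.items()]
```

For example, build_network_parameters("vpc-123", "subnet-abc") yields a parameter list in which PublicSubnet is also "subnet-abc" and InternetAccess is "False".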
Other CloudFormation template parameters are options like KeyPair for the cluster nodes, tenancy, placement group and disk encryption selection. These options are listed under AWS Configuration options.
Another option lets you choose the Greenplum database version for your MPP cluster. At the time of writing, the most recent database version offered is GP6.
Other options configure the master instance details, including instance type and disk size. The minimum disk size is 500 GB.
Segment instances are configured on the same configuration page. If you want a single-node cluster, you can keep the segment instance count parameter at 0.
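The two sizing rules mentioned above (a minimum disk size of 500 GB and a segment instance count of 0 for a single-node cluster) can be checked before launching the stack. The function and constant names below are hypothetical and only illustrate the rules; they are not part of the Greenplum template.

```python
MIN_DISK_GB = 500  # minimum disk size stated in this guide


def check_cluster_sizing(master_disk_gb, segment_count):
    """Validate sizing inputs and describe the resulting cluster layout."""
    if master_disk_gb < MIN_DISK_GB:
        raise ValueError(f"disk size must be at least {MIN_DISK_GB} GB")
    if segment_count < 0:
        raise ValueError("segment instance count cannot be negative")
    # A segment instance count of 0 means a single-node cluster.
    return "single-node" if segment_count == 0 else "multi-node"
```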
The Optional Installs section lets you choose the Greenplum components to install together with the cluster.
For example, you can choose to install Pivotal Greenplum Command Center, MADlib, Python or R for Data Science, PL/R or PostGIS components by selecting Install in the combo box.
You should definitely set a number of tags to make the management of your AWS resources easier. Please provide a set of tags according to your tag strategy and naming conventions.
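If you keep your tag strategy in code, a small helper can turn it into the Key/Value list shape that CloudFormation uses for stack tags. This is a sketch only; the tag names in the usage note are hypothetical examples, not a recommended convention.

```python
def build_tags(tags):
    """Convert a dict of tags into a CloudFormation-style Key/Value list."""
    result = []
    for key in sorted(tags):
        value = str(tags[key])
        if not key or not value:
            raise ValueError("tag keys and values must not be empty")
        result.append({"Key": key, "Value": value})
    return result
```

Calling build_tags({"Project": "greenplum-poc", "Owner": "data-team"}) returns the two tags sorted by key, which keeps the output stable between runs.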
You can keep "Permissions", "Rollback Triggers" and the remaining options at their defaults.
The last but most important option you should check for a successful installation is the checkbox "I acknowledge that AWS CloudFormation might create IAM resources."
Mark the checkbox and then press the Create button.
You will be redirected to the CloudFormation service in your AWS account. The CloudFormation stack will be created and will start creating the resources for your Pivotal Greenplum cluster configuration. This step takes some time, so you can have a coffee during the installation.
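Instead of refreshing the console, the stack status can be polled from a script. The sketch below assumes a boto3 CloudFormation client is passed in; describe_stacks and the StackStatus field are part of the CloudFormation API, while the function name and the 30-second default interval are my own choices.

```python
import time


def wait_for_stack(cfn, stack_name, poll_seconds=30, sleep=time.sleep):
    """Poll a stack until it leaves every *_IN_PROGRESS state.

    Returns the final status, for example CREATE_COMPLETE or
    ROLLBACK_COMPLETE, so the caller can decide what to do next.
    """
    while True:
        stack = cfn.describe_stacks(StackName=stack_name)["Stacks"][0]
        status = stack["StackStatus"]
        if not status.endswith("_IN_PROGRESS"):
            return status
        sleep(poll_seconds)
```

boto3 also offers a built-in "stack_create_complete" waiter; the explicit loop is used here only because it makes the returned status visible.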
After the stack finishes creating all required resources in your VPC, you can find the details of your MPP data warehouse cluster in the Outputs tab.
For example, the AdminUserName used for SSH and the database is gpadmin.
You can find the phpPgAdmin URL, which uses port 28090 by default.
The database listens on port 5432 by default, as in most PostgreSQL-based databases.
The MonitorUserName used for Command Center is gpmon.
You can also find the Command Center URL, which uses port 28080.
Most important of all are the passwords for the Admin and Monitor users listed above.
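The output values can be turned into a ready-to-use psql command. This sketch assumes the outputs arrive as a list of OutputKey/OutputValue pairs, which is how CloudFormation reports stack outputs. AdminUserName is the output name shown above; the host address and the "postgres" database name are assumptions for illustration. The password is deliberately left out so that psql prompts for it.

```python
def outputs_to_dict(outputs):
    """Flatten a CloudFormation outputs list into a plain dict."""
    return {item["OutputKey"]: item["OutputValue"] for item in outputs}


def psql_command(host, outputs, database="postgres", port=5432):
    """Build a psql command line for the Greenplum master node.

    `host` is the master node address (not taken from the outputs here);
    5432 is the default database port mentioned in this guide.
    """
    user = outputs["AdminUserName"]
    return f"psql -h {host} -p {port} -U {user} -d {database}"


# Example with the user names listed above and a hypothetical master address.
example_outputs = outputs_to_dict([
    {"OutputKey": "AdminUserName", "OutputValue": "gpadmin"},
    {"OutputKey": "MonitorUserName", "OutputValue": "gpmon"},
])
command = psql_command("10.0.1.25", example_outputs)
```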