Connect to Exasol Database from Greenplum using PXF Extension Framework
PXF Extension Framework (PXF) enables Pivotal Greenplum Data Warehouse professionals to connect external data sources on demand to query data stored on those remote databases using federated queries without replicating data physically. Exasol or ExaSolution Data Warehouse is the TPC-H benchmark leader among other databases on the market. This Greenplum tutorial shows SQL developers and Pivotal Greenplum administrators how to configure Greenplum database server for connecting Exasol databases via PXF Extension Framework. Both Greenplum Data Warehouse and Exasol Data Warehouse platforms are installed on an AWS account.
To summarize the process, Greenplum cluster administrators can follow below steps for enabling SQL developers to access Exasol database tables using PXF framework and create external tables.
First, download Exasol JDBC driver which will enable connection between Greenplum database to Exasol data warehouse.
As second step, connection properties configuration file will be created. The jdbc-site.xml file contains target Exasol database server host information, database user and password, etc.
Last step on Greenplum server configuration is restarting the PXF service.
After this point, SQL programmers can create external tables on Greenplum databases.
It is possible to use these external tables for data migration between Exasol database and Greenplum database, too.
As seen in below image PXF Platform Extension Framework enables Pivotal Greenplum to connect various data sources easily. For example please read referred tutorial to connect SQL Server database from Pivotal Greenplum using PXF.
Let's start our Greenplum tutorial.
Download Exasol JDBC Driver
As mentioned before, for connecting Greenplum database to Exasol cluster, JDBC drivers are used by Greenplum PXF Extension Framework. SQL developers can download Exasol JDBC driver from official Exasol Download portal. For not to experience any problems, it is better to know the version of the target Exasol Data Warehouse cluster.
Exasol Data Warehouse version can be controlled on EXASolution web-based tool.
On Software Configuration page EXASolution database version can be easily detected.
In this Greenplum tutorial, I will show how data professionals can connect Exasol 6.2.1 from Greenplum database.
Various downloads including JDBC drivers for Exasol 6.2.1 can be displayed at Exasol Version 6.2.1 Downloads
Since Pivotal Greenplum PXF is running on a Linux server, the required download file is EXASOL_JDBC-6.2.1.tar.gz
Administrators can choose the correct download file from the downloads section easily.
After downloading EXASOL_JDBC-6.2.1.tar.gz on a Windows computer, I extract the compressed file to a folder.
Then among extracted file contents, upload exajdbc.jar Exasol JDBC driver file to an Amazon S3 bucket which will be a bridge for JDBC file copy.
The AWS S3 bucket folder and full path which I used for this Greenplum tutorial is "database-jdbc-drivers/exajdbc.jar"
Let's now continue with Greenplum cluster PXF configuration for Exasol connection.
Greenplum PXF Configuration for Exasol Database
For configuring PXF virtualization framework to connect Greenplum and Exasol databases, there are three steps:
First upload JDBC driver to Greenplum server,
Then configure database JDBC connection file for connecting to Exasol,
And finally, restarting the PXF service
But before all these step, let's connect to Greenplum cluster, let's assume the server IP address is 10.159.82.30
Using PuTTY over an SSH connection, you will need the private key file (.ppk file) for the gpadmin Pivotal Greenplum cluster administrator user.
Copy Exasol JDBC Driver to Pivotal Greenplum Server
After we have connected to the Greenplum cluster, it is time to copy the JDBC driver from its location on S3 bucket to the appropriate file folder.
Within "lib" folder, Greenplum server administrators can store JDBC drivers' .jar files which will enable PXF framework to connect external data sources including Exasol database.
List command "ls" displays the JDBC driver files in "lib" folder for Greenplum PFX framework.
In my case, I had ngdbc-2.4.70.jar for SAP HANA database driver, mssql-jdbc-7.4.1.jre8.jar for SQL Server JDBC driver and RedshiftJDBC42-1.2.37.1061.jar for connecting to Amazon Redshift via its native JDBC driver.
Now copy the exajdbc.jar Exasol JDBC driver file into Greenplum server "lib" server using below AWS CLI command.
Now the JDBC driver required for Greenplum PXF data virtualization framework is copied into the correct folder on Greenplum server.
Connection Configuration
Greenplum administrators can now define JDBC connections to Exasol databases using the uploaded driver in previous step.
To define connection properties, admins have to create connection configuration files on the server.
While we are still on the PuTTY screen connected to the Greenplum server, let's go up 1 level in directory structure and then into "servers" subfolder as seen below.
Each sub-folder within "servers" directory represents a separate database connection to a remote data source.
For example, in "sqlserver" subfolder I have provided JDBC connection details to connect to a specific SQL Server database instance.
Similarly, the "redshift" subfolder includes a file storing connection string properties for accessing the Amazon Redshift cluster in my current account.
Now create a new separate folder for our Exasol database connection. You can rename it "exasol" if you have only one target Exasol database connection.
Additionally, we are copying the template for JDBC connection properties file jdbc-site.xml into recently created "exasol" folder.
Now, Greenplum server administrator can edit the jdbc-site.xml connection properties file using "vi" editor by command "vi jdbc-site.xml"
Below screenshot displays the 4 important connection properties required for Greenplum data warehouse administrator to enable a successfull connection to Exasol database.
These 4 important JDBC connection properties are:
jdbc.driver: Class name of the JDBC driver
jdbc.url: The URL that the JDBC driver can use to connect to the database
jdbc.user: User name for connecting to the database
jdbc.password: Password for connecting to the database
The main class that will be used for Exasol JDBC driver is com.exasol.jdbc.EXADriver
The JDBC connection string is in the format of jdbc:exa:10.159.82.11:8563
If you have more than one nodes in your cluster, the correct connection URL for the JDBC will be in following format:
jdbc:exa:10.159.82.11,10.159.82.13:8563
For authentication on the Exasol database, a valid user and password is required.
As JDBC user, you can use "sys" user with its password in jdbc.password section.
If you have any other database user, you can define it in the connection properties file instead of sys as well.
Restart PXF Service
After JDBC file is copied into lib folder and under servers folder the configuration settings is created successfully for the target Exasol database, we can restart the Greenplum PXF service to take the changes into account.
Change your current directory to $GPHOME/pxf/bin and restart PXF service using "pxf cluster stop/start" commands as follows:
After this point, we are completed with Pivotal Greenplum cluster node. We can now connect to Greenplum database using a database management tool like DBeaver in order to create database objects that will enable SQL developers to query remote data sources' data.
Create External Table on Pivotal Greenplum Server
In our Greenplum database tutorial, now SQL programmers can create database objects which are used as bridges between Greenplum and Exasol database tables. The Pivotal Greenplum PXF Extension Framework enables database developers create external tables just like "virtual tables" in some other data platforms.
Here is how Greenplum SQL developer can create an external table using "exasol" JDBC connection which is defined before in previous steps.
The syntax of "CREATE EXTERNAL TABLE" can be simplified as:
public.pxf_exasol_cities is the schema name and external table name on current Greenplum database.
Location is pointint to the remote data source and target table or view.
Profile is JDBC since we are using JDBC connection for target Exasol database.
Server value is "exasol" which is the name of the folder we have created in previous steps on Greenplum server including the jdbc-site.xml connection configuration file for Exasol database.
After "//" you can see "kodyaz.cities". This is the schema name and table/view name that is being queried on target Exasol database via PXF connection.
So by creating above external table on Greenplum database, SQL developers can access to kodyaz.cities table on remote Exasol Data Warehouse easily.
External table now enables Pivotal Greenplum Data Warehouse SQL developers to query data stored on remote data source, our target Exasol database table "kodyaz.cities".
Import Exasol Database Table into Greenplum Database using PXF
To migrate data physically into Greenplum database from Exasol DWH, a database table can be created with same data structure. Data migration from remote Exasol data source can be done via "INSERT INTO ... SELECT FROM ..." SQL command shown in this section below.
Greenplum database administrator can create a table for the data to be imported from Exasol database table. We have already table definition and DDL script in previous step. Instead of creating an external table, this time Greenplum SQL developers can create a physical table within the database.
After the table is created on Greenplum database, following SQL INSERT statement can be used to insert data into physical Greenplum database table from external table linked to Exasol Data Warehouse.
Now the same data in remote Exasol database table is replicated into current Greenplum database table physically. This is an easy method of data replication or data migration from Exasol database into Greenplum data warehouse platform.
I hope this Greenplum tutorial is useful for SQL developers to understand Pivotal Greenplum Platform Extension Framework (PXF) and how to use it for connecting Exasol to Greenplum in order to create external tables and query data virtually, or for data migration between these two data warehouse platforms.