Collect Statistics About Greenplum Database Tables using Analyze and AnalyzeDB
Pivotal Greenplum collect statistics about a database and about the table and table columns in that Greenplum database using analyze command and analyzedb utility. The collected statistics data is stored in pg_statistic system table. All database platforms use collected statistics to build a performant SQL query execution plan by deciding on the least resource using execution plan and fastest result. For the correctness and efficiency of the execution plans created the statistics should be up to date and updated continuously
In this Greenplum tutorial, I want to show how analyze command and analyzedb utility can be executed by SQL developers.
Both Analyze and AnalyzeDB can be executed both on psql utility by connecting to a database on Greenplum Data Warehouse or on a SQL Editor like DBeaver, DbVizualizer or pgAdmin etc.
Let's execute the commands using psql.
First connect to the master node of the Greenplum cluster and connect to the target database as seen below where I connected to dev database.
[gpadmin@mdw ~]$ psql
psql (9.4.24)
Type "help" for help.
dev=# \c dev
You are now connected to database "dev" as user "gpadmin".
dev=# analyze verbose poc.vbkd;
INFO: analyzing "poc.vbkd"
INFO: Executing SQL: select pg_catalog.gp_acquire_sample_rows(25709, 2500, 'f');
ANALYZE
dev=#
As seen above SQL developers can execute "analyze verbose" command followed by the database table name or directly "analyze" with the table name.
Here is how analyze is executed to collect statistics about an other database table on Greenplum.
dev=# analyze poc.vbak;
ANALYZE
dev=#
The same commands can be executed on DBeaver SQL editor screen too.
analyze storm.vbap;
Instead of collecting and updating statistics one by one about all database tables, Greenplum SQL administrators can execute Analyze command for the whole database at once.
analyze;
For more on analyze command please review the Greenplum documentation.
By connecting to master node, all database table statistics collection can be triggered by executing the analyzedb utility
After you logon to the Greenplum cluster master node, call analyzedb utility with -d and pass the database name as I did for dev database below
analyzedb -d dev
For more on analyzedb please refer to official documentation.
To check the version information about the analyzedb utility, execute:
[gpadmin@mdw ~]$ analyzedb --version
analyzedb version 1.0
[gpadmin@mdw ~]$
Especially after high number of data inserts or updates and bulk imports, it is critical to collect statistics and keeping statistics up-to-date for performance and for better execution plan creation for database users queries. Database adminsitrators can manually execute Analyze command and AnalyzeDb utility or execute them automatically by a schedule for more healthy statistics about database table and table columns.