Below is the working material for my presentation about pg_reorg at the San Francisco Postgres Meetup on August 9, 2011 and PgWest 2011.
Here are the actual slides with fewer words, but more pictures.
What is pg_reorg?
pg_reorg can re-organize tables online on a production database without long-lasting locks, so that you can retrieve or update rows in reorganized tables.
The project’s home page:
http://pgfoundry.org/projects/reorg/
The latest version is 1.1.7, August 7, 2011
Project Admins:
Masahiko Sakamoto
Toru SHIMOGAKI
Takahiro Itagaki
Available packages:
pg_reorg-1.1.5-1.pg84.rhel5.i386.rpm
pg_reorg-1.1.5-1.pg84.rhel5.x86_64.rpm
pg_reorg-1.1.5-1.pg90.rhel5.i386.rpm
pg_reorg-1.1.5-1.pg90.rhel5.x86_64.rpm
pg_reorg-1.1.5.tar.gz
pg_reorg-1.1.7.tar.gz
You can choose one of the following methods to reorganize.
- Order by an index’ filed(s) (online CLUSTER), or any set of specified columns
- Packing tuples without ordering (online VACUUM FULL ), works faster than with ordering
How pg_reorg Works
In the beginning:
- tries to acquire an ACCESS EXCLUSIVE lock nowait in a loop till it succeeds.
- creates a DML tracking trigger on the reorganized table,
- creates a new table using CTAS (the SELECT may include ORDER BY clause if sorting option is chosen),
After the new table is created:
- replays all the accumulated DML on the new table,
- acquires an exclusive lock on the table,
- applies the latest DML,
- swaps the old and new tables
- drops the old table.
- analyzes the newly built table.
What it is good for:
Remove Database Bloat
The table has 62M records, size 53 GB
Identify bloat
#/usr/lib/nagios/plugins/contrib/check_postgres_bloat -H localhost -u postgres –db=feed
POSTGRES_BLOAT CRITICAL: table public.product_feed_data rows:60614560 pages:6622583 shouldbe:5356924 (1.2X) wasted size:10368278528 (9888 MB) — 9.8GB+ (20%) of disk space is wasted
Eliminate the bloat
$ pg_reorg -t product_feed_data -n -T 86400 -d feed
The results:
#/usr/lib/nagios/plugins/contrib/check_postgres_bloat -H localhost -u postgres –db=feed
POSTGRES_BLOAT WARNING: table public.product_feed_data rows:62703028 pages:5891010 should be:5535050 (1.1X) wasted size:2916024320 (2781 MB) – reduced from 9GB+ to 2.7GB
A copy of the full table before and after reorg took 14min 48 sec and 12 min 17 sec – 17% improvement.
Reduce IO and Improve Performance
For a random single row access, the rows’ order in a table is unimportant. However, for queries on an indexed value that has multiple matching rows, pre-ordered physical data allocation would drastically reduce required disk IO and hence improve performance. Once the index identifies the table page for the first row that matches, all other rows that match are probably already on the same cached page.
Consider a query returning 3.2M records using indexed access:
select * from product_feed_data where merchant_id=’XYZ’;
How many pages will need to be read from a disk to fetch all the records?
Before reorg: 275482, execution time 6,677.249 ms
After reorg sorted on merchant_id: 153280, execution time 2251.190 ms
If we run a lot of queries based on merchant_id, the total required disk IO may drastically reduce, and We may serve more traffic on the same hardware.
Manage database shards and massive deletes
Deleting a large part of a huge table may create a lot of problems. At the very least, it would create a lot of bloat and make b-tree indexes unbalanced. pg_reorg, with a little tweaking helps solve this problem. The CTAS definitions used to create the new tables are saved in a reorg.tables view.
Tweak the definition as you need, run a reorg, and the newly reorged table will have only a subset of the original data.
select create_table from reorg.tables where relname=’product_feed_data’::regclass;
create_table —————————————————————————–
CREATE TABLE reorg.table_21845 WITH (oids=false) TABLESPACE pg_default AS SELECT * FROM ONLY product_feed_data
Change it to:
CREATE TABLE reorg.table_21845 WITH (oids=false) TABLESPACE pg_default AS SELECT * FROM ONLY product_feed_data where merchant_id not in ( ‘ABC’,’PQR’,’XYZ’)
pg_reorg may be a life saver if partitions on a specific shard become too busy.
It allows manual table re-sharding with very little downtime. To split a partition A into sub-partitions A1 and A2 do:
- Replicate the table A to a different host/shard
- Rename the original partition into A1 and the new one A2, switch the application to use the two new partitions. This is the only required downtime. Should be very short, depending on how good you are managing partitions in the application.
- Tweak the reorg.tables view on the original and new shards
- Run pg_reorg on both partitions and remove stale A2 data in A1 and A1 data in A2.
Compare with VACUUM FULL and CLUSTER
- pg_reorg’s functionality overlaps with both VACUUM FULL and CLUSTER and works in a similar fashion. However, the latter commands require an ACCESS EXCLUSIVE lock on a table for the duration of a command, which makes then unfit for many production databases.
- pg_reorg only requires short-living locks in the beginning and the end of the processing. It acquires the locks opportunistically mostly without serious disruption of the service.
- pg_reorg offers additional functionality on top of VACUUM FULL and CLUSTER overlap, especially with some tweaking.
According to the tool’s creators, pg_reorg may be faster than clusterdb.
Execution speed performance comparison with clusterdb on 16.5M records table
http://reorg.projects.postgresql.org/index.html
Pitfalls and things to watch for
- Like VACUUM FULL and CLUSTER, pg_reorg requires extra disk space, since it writes a new copy of the table and indexes and doesn’t release the old copy until the operation is complete.
- Only superusers can use the utility.
- Target table must have PRIMARY KEY.
- pg_reorg acquires two short-living ACCESS EXCLUSIVE locks in the beginning and the end of the reorg. It may create some problems on databases with very long running transactions, though I still could execute reorg on a database concurrently with 12+ hours long transactions.
- In the end, if pg_reorg cannot acquire a lock in a specified time, it will cancel the long running transactions preventing it from getting a lock. Choose the waiting time carefully.
- Sometimes, it’s better to kill a pg_reorg session which waited for a lock for too long, and start again.
- As anything, test pg_reorg for yourself before using it in production.
- Works nice with trigger based replication. Did not test with WAL shipping or streaming replication, but should work, too.
The changes in the latest version 1.1.7:
- Bugfix: VIEWs and FUNCTIONs could be corrupted that used a reorganized table which has a dropped column.
- Supports PostgreSQL 9.1 and 9.2dev. (but EXTENSION is not yet)