
A report from PGCon 2015

June 24, 2015

This article was contributed by Josh Berkus

PGCon 2015, the PostgreSQL international developer conference, took place in Ottawa, Canada from June 16 to 20. This PGCon involved a change in format from prior editions, with a "developer unconference" in the two days before the main conference program. Both the conference and the unconference covered a wide range of topics, many of them related to horizontal or vertical scaling, or to new PostgreSQL features.

In previous years, the PostgreSQL developer meeting was a full-day, closed meeting, involving about 20 of the top contributors to PostgreSQL. This year, the developer meeting was shortened to a half-day covering only non-technical, project-coordination topics; the technical topics were moved to a three-track, one-and-a-half day developer-focused unconference. The format change was intended by the unconference organizer (committer Stephen Frost) to allow more people to be involved in pivotal development discussions.

As usual, there were far too many talks to cover or even to attend. Among the more interesting topics in the unconference were autonomous transactions, bi-directional replication, sharding, and the impact of shingled magnetic recording (SMR) drives. The main conference program included schema binding using LLVM, the future of JSONB, and tracing in PostgreSQL. First, however, there were several community decisions made at the developer meeting.

Developer meeting

[Developers meeting]

The first topic addressed in the developer meeting was the reliability issues that have led to three PostgreSQL update releases in the last two months. Core team member Bruce Momjian wanted to know whether this was a special case, or if it shows a failure in the PostgreSQL development process, saying: "We're super-reliable, but we're so used to it that we haven't tried to focus on reliability." Suggestions for improvement included new kinds of testing, such as automated crash-recovery testing.

The developers also discussed whether it was time to get a bug tracker. The PostgreSQL project has not had a central bug tracker; instead, it tries to fix or reject bugs immediately upon receiving a report. With the recent bugs, however, it was not possible to fix them quickly, and several of the individual issues were lost in long discussions on the pgsql-hackers mailing list. Argument ensued about whether having a bug tracker would be a good idea and whether it would have helped keep track of the recent data-integrity issues at all, though no conclusion was reached.

Robert Haas asked what the project needs to worry about in the 9.5 release. The risky changes are the ones that change persistent state, which can lead to data loss. The only major change of that sort is to the Write Ahead Log (WAL) format, but that change has been well tested.

Those present at the meeting decided to create a separate Release team that would be in charge of determining when an update release of PostgreSQL was needed. Historically, the Core team has taken care of this, but as the PostgreSQL code base has grown, the six-member Core team no longer has the breadth of expertise to know the severity of fixed bugs in all areas. This new Release team will include the Core team plus the most active committers.

The group discussed having a policy on adding and removing committers, and established a new mailing list, consisting only of committers and Core team members, just to handle this. PostgreSQL committers will now have their commit rights revoked after 24 months with no commits.

The developers also decided on the release schedules for 9.5 and 9.6. First, the project will release PostgreSQL 9.5 Alpha as soon as possible after PGCon. Then the project will release beta versions of 9.5 every month until it's stable enough for a release, which the developers hope will be in mid-October. The group decided to shift the target month for releases from September to October in order to make it easier to hit the target date. They scheduled the "CommitFests" for the 9.6 development cycle on July 1, September 1, November 1, January 2, and March 1. The final feature freeze is scheduled for April 15, a beta release for mid-June, and the final 9.6 release is targeted for mid-October 2016.

There was also some inconclusive discussion about making the CommitFest process more efficient and less frustrating for developers, as well as about moving the 2016 developer meeting to a venue other than Ottawa.

Autonomous transactions

One of the features that Oracle has and PostgreSQL does not is called "autonomous transactions". These are transactions that can be executed inside another transaction or stored procedure, and that commit or roll back independently of the outer transaction. Currently, users approximate them with various workarounds, such as reconnecting to the same database.

According to Simon Riggs of 2ndQuadrant and Kumar Rajeev Rastogi of Huawei, who presented a prototype of this feature, autonomous transactions primarily serve two use cases: audit triggers, which need to write even if the transaction fails; and large batch jobs that users want to commit incrementally, such as loading a million rows in batches of a thousand.
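The audit-trigger case is typically approximated today with the reconnection workaround mentioned above, often via the dblink extension. Here is a minimal, hedged sketch; the audit table and the audit_log() function are hypothetical illustrations, not part of any proposed feature:

    CREATE EXTENSION IF NOT EXISTS dblink;

    -- dblink_exec() runs the INSERT over a second connection to the same
    -- database, so the audit row stays committed even if the transaction
    -- that called audit_log() later rolls back.
    CREATE OR REPLACE FUNCTION audit_log(msg text) RETURNS void AS $$
    BEGIN
        PERFORM dblink_exec('dbname=' || current_database(),
                            format('INSERT INTO audit(message) VALUES (%L)', msg));
    END;
    $$ LANGUAGE plpgsql;

A true autonomous transaction would provide the same guarantee without the overhead of opening an extra connection for every call.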

Participants disagreed as to whether these two use cases actually have the same solution, or might be better supported by two different features. Various contributors also discussed locking issues, transaction context, and other technical details. Riggs plans to solicit more feedback and use cases online. He hopes to be able to implement this feature for PostgreSQL 9.6.

Bi-directional replication

Simon Riggs also led the discussion on the current state of development of Bi-directional replication (BDR), which is a new replication system for PostgreSQL aimed at efficiently supporting the "multi-master" use case. It allows all nodes to accept writes and replicate individual rows between servers. BDR requires resolving conflicts in the replication manager. This system relies on Logical Decoding (or Data Change Streaming), a feature introduced in PostgreSQL 9.4.
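The Logical Decoding interface that BDR builds on can be exercised directly from SQL using the test_decoding example plugin that ships with PostgreSQL 9.4 and later. A brief sketch (the slot name is arbitrary; wal_level must be set to "logical" and at least one replication slot must be configured):

    -- Create a logical replication slot using the example output plugin.
    SELECT * FROM pg_create_logical_replication_slot('demo_slot', 'test_decoding');

    -- ... perform some INSERTs or UPDATEs in another session ...

    -- Read the decoded row changes that have accumulated in the slot.
    SELECT * FROM pg_logical_slot_get_changes('demo_slot', NULL, NULL);

    -- Clean up.
    SELECT pg_drop_replication_slot('demo_slot');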

Currently, BDR exists in the form of a set of patches against PostgreSQL and a loadable extension. There is a subset of BDR, called UDR for "Uni-directional replication", that currently works for PostgreSQL 9.4 and can be used to do no-downtime upgrades. The project is in active development, having recently released version 0.9.1.

The development team has been pushing the features in those patches into mainstream PostgreSQL, starting with version 9.4, and that work is expected to continue through version 9.6. Attendees discussed some features that didn't quite make it into 9.5, including an access method for distributed sequences that assigns unique integer IDs across different replication nodes. Additionally, the ability to pass messages from one node to another using the transaction log was held back.

This discussion then went on to the design of BDR, including questions of where and how metadata should be stored, and the API design. Participants also went over some existing bugs and limitations of BDR.

Sharding and foreign data wrappers

Developers from Citus Data presented pg_shard, a PostgreSQL tool for "sharding", which is partitioning tables among several servers. Sharding is commonly used to improve availability and to handle databases that have grown too large to perform acceptably on a single server. pg_shard uses programming hooks in PostgreSQL's query planner and executor to send requests aimed at a "master" table alias to multiple nodes.
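From the user's point of view, distributing a table with pg_shard looks roughly like the sketch below. The table is illustrative, and the helper-function names may differ between pg_shard versions, so they should be checked against the extension's documentation:

    CREATE EXTENSION pg_shard;

    CREATE TABLE customer_reviews (
        customer_id bigint NOT NULL,
        review      text
    );

    -- Declare the table as hash-distributed on customer_id, then create
    -- the shard placements on the configured worker nodes.
    SELECT master_create_distributed_table('customer_reviews', 'customer_id');
    SELECT master_create_worker_shards('customer_reviews', 16, 2);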

Currently, pg_shard requires having a single "head" node that keeps these table aliases and stores the metadata about the locations and health status of shards and nodes. In pg_shard 2.0, the developers want to have metadata on all servers, so that a separate head node is not required and the system will scale well for writes. However, making the metadata consistent between all nodes is a problem; if one server thinks a particular shard is offline but another server does not, it could result in write conflicts and data loss.

The programmers in the session discussed various potential solutions to this. One was an "eventual consistency" plan in which all metadata updates would be written to be commutative and would be shared asynchronously between nodes. However, this proposal would require all data writes to the database to be appends (INSERTs) so that all data backends could be made consistent without conflicts. While this works for the log storage use case, it does not work for other use cases. A second plan involves grouping nodes in small failover groups that are relatively autonomous from each other. This is similar to the "multi-Paxos" [PDF] design for distributed consistency. A design based on the RAFT algorithm was also discussed.

Meanwhile, developers from NTT Open Source and EnterpriseDB presented their own, different plan for sharding. Their planned feature is based on extending foreign data wrappers (FDWs) to support all of the features needed for sharding, particularly the ability to push down various query operations, such as sorts, aggregations, and JOINs, onto remote nodes. A few pieces, such as allowing FDW tables to be part of a partitioning relationship on the local server, have already been committed to PostgreSQL 9.5. For PostgreSQL 9.6, they think they can make distributed transactions involving several shards work correctly.
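Some of the building blocks for that plan already exist in the form of the postgres_fdw extension and 9.5's foreign-table inheritance. A hedged sketch, with illustrative server, table, and credential names, of what a single remote shard might look like today:

    CREATE EXTENSION IF NOT EXISTS postgres_fdw;

    CREATE SERVER shard1 FOREIGN DATA WRAPPER postgres_fdw
        OPTIONS (host 'shard1.example.com', dbname 'app');
    CREATE USER MAPPING FOR CURRENT_USER SERVER shard1
        OPTIONS (user 'app', password 'secret');

    -- A local parent table with a foreign child table on the remote shard;
    -- queries against "measurements" also scan the remote table, and the
    -- planned push-down work would ship sorts, aggregates, and joins there.
    CREATE TABLE measurements (ts timestamptz, device int, reading numeric);
    CREATE FOREIGN TABLE measurements_shard1 ()
        INHERITS (measurements)
        SERVER shard1 OPTIONS (table_name 'measurements');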

The team would then write a special PostgreSQL-to-PostgreSQL FDW that would be sharding-aware. Overall, the work would take two or three more years. The primary question asked by Ahsan Hadi of EnterpriseDB was "are we going in the right direction with this plan?" Several people in the session seemed to feel that they were not, particularly members of the BDR and pg_shard teams. Ozgun Erdogan pointed out that their team had already tried and given up on using FDWs before developing the current design of pg_shard. Others noted the lack of any plans to implement high availability. Momjian countered that all of these FDW features are worthwhile on their own; that allows incremental work, which the PostgreSQL project handles better than giant, all-at-once features.

Impact of SMR drives

[Jeff Davis]

Jeff Davis of Teradata used an unconference session to raise the need for PostgreSQL design changes to accommodate the characteristics of SMR drives, a new variety of magnetic hard drive designed for super-high densities that are not achievable with older designs. Current SMR drives on the market start at eight or ten terabytes per drive. Data-warehousing service and product vendors, like Teradata, are seeing their customers buy many of these drives.

In order to achieve such high densities, SMR drives have a read head that is one-quarter the size of the write head. When data is written to the drive, the write head overlaps, or "shingles", the write stripes so that it ends up with narrow tracks the size of the read head. If data needs to be rewritten, though, the drive has to rewrite large blocks in order to redo the multiple layers of shingling. This makes random writes very expensive on these drives, as much as twenty times more expensive than on conventional hard drives.

The reason this is a problem for PostgreSQL is that it currently does a lot of random writing, even for database workloads that are append-only at the application layer. The project will need to change these behaviors or be abandoned by users who want the high density of SMR drives. Linux developers have also discussed SMR support in various forums, including at the 2014 and 2015 editions of the Linux Storage, Filesystem, and Memory Management Summit.

One of the places where PostgreSQL does a lot of random writes is the "hint bits": a small set of bits on each data page that lets the query executor quickly know certain things about the data on the page. For example, there is a flag called PD_ALL_VISIBLE that indicates that all rows on the page are old enough to be visible to all sessions. Since these hint bits are usually set when the page is being read, they are a large source of random writes. Davis proposed that PostgreSQL modify or eliminate most of these hint bits.
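For the curious, the per-tuple hint bits on a given heap page can be inspected with the pageinspect extension that ships with PostgreSQL; the table name below is hypothetical, and t_infomask is the field that carries hints such as HEAP_XMIN_COMMITTED:

    CREATE EXTENSION pageinspect;

    -- Dump the line pointers and tuple headers of the first page of a table.
    SELECT lp, t_xmin, t_xmax, t_infomask
      FROM heap_page_items(get_raw_page('mytable', 0));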

The other large source of random writes is the need to "freeze" old data pages so that PostgreSQL can recycle transaction IDs. This eventually leads to rewriting all of the pages in any table that grows continuously. There have been proposals on how to do freezing without overwriting, but no major contributors are currently leading that work. From there, the database hackers speculated on various ways to address the need for data page format changes, including adopting log-structured merge-tree (LSM tree) based storage and other models.
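The scale of the freezing problem can be seen from SQL; this small illustration uses the standard age() function and the pg_class catalog, with a hypothetical table name:

    -- How many transactions old is each table's oldest unfrozen XID?
    SELECT relname, age(relfrozenxid)
      FROM pg_class
     WHERE relkind = 'r'
     ORDER BY 2 DESC
     LIMIT 5;

    -- Freezing rewrites pages, which is exactly the random-write pattern
    -- that SMR drives handle poorly.
    VACUUM (FREEZE) mytable;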

LLVM and schema-binding

Huawei's Rastogi presented his current work on using LLVM for one form of query compilation. He is aiming to address performance on databases that are CPU-bound. Increasingly, thanks to large main memory and faster storage, database workloads are limited by CPU throughput instead of by I/O, as query-optimizer programmers have traditionally assumed. The only way to make CPU usage more efficient is to send fewer instructions to the CPU for the same database operation.

One way to solve this is with "code specialization", also called "native compilation", which is supported by the LLVM project. This works by taking generalized functions in the application, swapping out variables and macros, and compiling them as specialized functions that get called in place of the generalized function. There are several places where native compilation can be used with databases, including tables and indexes, stored procedures, expressions, and query plans. The last is being done by vendor Vitesse as a proprietary extension to PostgreSQL.

Rastogi pointed out that the problem with native compilation is that, while it speeds up each execution of the specialized function, compiling that function is quite expensive. In a database, this means only using it for things where the developer knows the system will be executing the function many times. This is why he decided to start with compiling tables and indexes, otherwise known as "schema-binding".

When scanning a table, PostgreSQL executes many macros and expressions for each row. For example, the data type and its length are checked many times even though these don't change on a per-row basis. All of this checking can be compiled into a specialized table-access function for that particular table. This can decrease the number of CPU instructions for a table scan by 50% to 80%. Rastogi covered some alternate methods of implementing this with PostgreSQL.

He chose the least-invasive method to test because other methods required substantial changes to the PostgreSQL data page format. He compiled a function using fixed offsets for the fixed-length attributes. Variable-length attributes, like strings, must still be looped over one at a time. He tested this improvement using the TPC-H benchmark and found that it improved execution for seven of the nineteen queries in the benchmark by 20% to 35% without slowing down any query.

Next, he plans to test rearranging columns to optimize for the compilation pattern by putting the fixed-length columns first in the table definition. He is unsure when a finished feature might be contributed to PostgreSQL.

The future of JSONB

[Andrew Dunstan]

Several developers presented on current plans and ideas for PostgreSQL's JSON and JSONB features. JSONB is a binary tree representation of JSON that is used internally by PostgreSQL. Andrew Dunstan talked about the new features in PostgreSQL 9.5, and some of the development he and Dmitry Dolgov are doing to make JSONB more easily updatable. Later in the conference, Oleg Bartunov and Alexander Korotkov proposed an ambitious plan for new syntax and features.

In 9.4, users can get an element of a JSONB value from deep inside the nested data structure, but to change that value they need to rewrite the entire serialized value on the client side and send it to the server, instead of directly changing the one element. PostgreSQL 9.5 will have the jsonb_set() function, which allows modifying or appending arbitrary elements within a JSONB value, even deeply nested ones. This makes PostgreSQL much more useful as a JSON document database.

PostgreSQL 9.5 will also have element deletion and concatenation functions and operators, as well as a host of aggregation and composition functions for creating JSON data. Many of these features are already available as extensions for PostgreSQL 9.4, such as the jsonbx extension.
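A few of those 9.5 additions in action, with illustrative values:

    -- Modify a nested element in place on the server.
    SELECT jsonb_set('{"name": "pgcon", "loc": {"city": "Toronto"}}'::jsonb,
                     '{loc,city}', '"Ottawa"');

    -- Concatenate two JSONB values, and delete a top-level key.
    SELECT '{"a": 1}'::jsonb || '{"b": 2}'::jsonb;
    SELECT '{"a": 1, "b": 2}'::jsonb - 'a';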

For the future, Dunstan would like to implement "deep merge", allowing for concatenation of JSONB values at any level of nesting. He also wants to create a "diff" function for two documents. Other plans include implementing the JSON Pointer and JSON Patch specifications. The latter allows making multiple modifications to a JSON document in one operation. Dunstan would also like to improve the syntax of manipulating JSONB by implementing what PostgreSQL calls "lvalue syntax", which allows a more compact and intuitive representation like:

    myjsonb['key1'][1]['key2']

[Alexander Korotkov and Oleg Bartunov]

Korotkov and Bartunov have even more ambitious plans for JSONB. They announced that they have completed work on JsQuery, the PostgreSQL extension for doing advanced querying of JSONB. This project implements a specialized search syntax for document columns, supported by Generalized Inverted Indexes (GIN), similar to PostgreSQL's full text search feature. However, they aren't satisfied with it.
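A brief, hedged example of what JsQuery queries look like; the table, column, and index are illustrative, and the operator-class name should be checked against the extension's documentation:

    CREATE EXTENSION jsquery;

    -- The @@ operator matches a JSONB value against a jsquery expression.
    SELECT '{"a": {"b": 1}}'::jsonb @@ 'a.b = 1'::jsquery;

    -- A GIN index built with one of JsQuery's operator classes accelerates
    -- such searches over a document column.
    CREATE INDEX ON docs USING gin (body jsonb_path_value_ops);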

The two are unhappy with the limitations imposed by the PostgreSQL operator API, which is restricted to expressions in the form of "column operator value". This API prohibits operations that GIN indexes could otherwise support. To address this, they want to implement a grammar of indexable expressions and extend PostgreSQL's dialect of SQL to support it.

While the primary use of this new grammar would be with JSONB, they intend it to support any data that is GIN-indexable, including arrays, hstore, trigrams, and full text search. They gave some examples of what such a grammar would look like, and they will try to implement it for PostgreSQL 9.6.

In more modest plans, Bartunov discussed ways to make JSONB smaller and more efficient; currently, other ways of storing complex types are more compact and compress better. One proposal is a dictionary of keys, so that keys could be stored in values as IDs rather than as many repeated strings; another is a catalog of JSON schemas. Either proposal would reduce space requirements for JSONB while making it more costly to add new keys or new types of documents. Attendees pointed out that the two approaches serve different use cases, since some document populations have more repeated keys and some have a more fixed document schema.

Tracing PostgreSQL

Ilya Kosmodemiansky presented on the need for better tracing support in PostgreSQL. He started by explaining the difference between tracing and profiling: profiling gives overall usage statistics without revealing what's going on with a specific query, while tracing follows the individual events that make up a particular query or session. He shared this example dialog:

Developer: My query is slow.

Database administrator: The load average is OK, so it must be blocking on something.

Without good tools, PostgreSQL administrators do a lot of guessing about what might be the root cause of performance problems. Oracle, thanks to its Wait Events interface, is well ahead of PostgreSQL in this area. PostgreSQL's tools have been improving, but most tools are based on profiling, so they are designed to show us general problems instead of specific ones.

Next, he explored using operating-system tracing tools with PostgreSQL, such as DTrace, SystemTap, and perf. One problem with these tools is that using them with userspace probes on PostgreSQL can be unsafe in production; SystemTap has been known to crash PostgreSQL. Perf is safer, but it primarily shows kernel events. And while knowing the number and type of system calls made by a query can give some information for troubleshooting, perf is really designed for Linux kernel hackers and isn't very comprehensible to database administrators. Further, it's very easy to take down the system by enabling too many probes.

Instrumenting tracing for everything in PostgreSQL would increase code complexity too much, and developers are concerned about the overhead of doing many gettimeofday() calls to record timing information. So Kosmodemiansky plans to start small by instrumenting only PostgreSQL's lightweight locks [SlideShare presentation] (LWLocks), since they already have probes and are generally mission-critical.

He described some of the issues around this and the workarounds that he found. By leaving out the LWLocks used for shared_buffers overwrite protection, he can cut the amount of data collected substantially. PostgreSQL can also approximate wait times, either by calculating against the polling interval or by using a local clock in each backend, instead of calling gettimeofday(). Finally, if the developers add an array in the memory of each backend and poll it, the database can avoid a lot of write contention on a general catalog.

Kosmodemiansky plans to attempt to implement LWLock wait events for PostgreSQL 9.6.

Conclusion

There were, of course, many other interesting talks presented at the conference. Staff from Salesforce.com shared the technical difficulties they've had with porting Oracle PL/SQL stored procedures to PostgreSQL. A delegation from the PostgreSQL Enterprise Consortium of Japan gave the results of a series of surveys the organization has been doing about PostgreSQL adoption there that included over 5000 companies. Heikki Linnakangas of Pivotal introduced pg_rewind, which is a tool that makes database failover and failback more reliable and easier to automate. Several other speakers went over other new 9.5 features.

As a special guest, Joe Celko of the original ANSI SQL committee spoke to attendees about encoding schemes, weights, and measures, which are things that many developers get wrong. Josh McDougall ran the annual PGCon Schemaverse tournament, this time with one-on-one brackets instead of a free-for-all. For the first time in four years, it was not won by Chris Browne.

With all of the news being generated by newer databases, it's easy to assume that little is happening with relational databases anymore. The number of new ideas and new projects introduced at PGCon is a reminder that there's still a lot going on with PostgreSQL. If even half the features proposed at this year's conference are completed, version 9.6 will be a release to anticipate.


Index entries for this article
GuestArticles: Berkus, Josh
Conference: PGCon/2015



A report from PGCon 2015

Posted Jun 24, 2015 21:05 UTC (Wed) by mm7323 (subscriber, #87386) [Link]

developers are concerned about the overhead of doing many gettimeofday() calls

I thought on Linux gettimeofday() was a vsyscall so very fast already. I guess this must be a cross-platform concern?

A report from PGCon 2015

Posted Jun 24, 2015 22:48 UTC (Wed) by dlang (guest, #313) [Link]

fast != free

Also remember that Postgres runs on things other than Linux

Back around 2006 when I started working with rsyslog, I discovered that it was doing a half dozen or more gettimeofday() calls for each message as it was being processed through the internal layers. Just eliminating these redundant calls increased the throughput significantly (I don't remember if it was a 50% or 100% increase, but something eye-opening)

I don't remember when the vsyscall was created, it was near the same timeframe, but it's possible that this was just before it was available.

A report from PGCon 2015

Posted Jun 25, 2015 17:35 UTC (Thu) by jberkus (guest, #55561) [Link]

Yeah, one thing I dropped from the article was the figures about the overhead of gettimeofday() on different systems; it can be surprisingly expensive some places. And even on Linux, if you're calling it 1000's of times per second the overhead is significant.

A report from PGCon 2015

Posted Jun 25, 2015 23:57 UTC (Thu) by LawnGnome (subscriber, #84178) [Link]

There are still plenty of paravirtualised Linux servers in use out there where gettimeofday() is still startlingly slow. I'm sure that kernel and distribution updates will eventually fix that, but it's still definitely a problem today (and, as others have said, there's always some overhead).

A report from PGCon 2015

Posted Jun 29, 2015 16:15 UTC (Mon) by SEJeff (guest, #51588) [Link]

The answer is that it depends on whether you're using hpet or tsc as your clocksource. tsc is very fast but not as accurate. hpet, aka the "high precision event timer", is more accurate, and measurably much slower.

$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
tsc

$ cat /sys/devices/system/clocksource/clocksource0/available_clocksource
tsc hpet acpi_pm

You can echo any of those 3 into the current_clocksource file to change it on the fly. Then use a canned benchmark like libMicro or something to measure gettimeofday(). It absolutely matters and slows down some applications, which are then forced to make rdtsc calls natively (gross).

Reference: I've worked on low latency financial environments the last 7.5ish years of my career.

A report from PGCon 2015

Posted Jun 25, 2015 17:10 UTC (Thu) by ncm (guest, #165) [Link]

Collecting hint bits and other metadata into common blocks, with the goal to avoid actually reading or writing the blocks they refer to, is often a big win, even though sometimes it turns one block operation into two. Given such a separation, they can be put on a different spindle.

But the modern way to do this is to put all changes in the log, put the log somewhere fast to write, let the log get big, and batch up writes later. It means you have to keep (or be ready to reconstruct, from the log) in-memory copies of the blocks not written. But generally we make big storage systems with fast-to-write persistent cache storage in front anyway. So this is really about the places where there's no proper storage system, just one of these big disks. Again, the solution is a big log, because even where regular writes are slow you should be able to append quickly.

Database use of SMR drives

Posted Jun 26, 2015 18:25 UTC (Fri) by butlerm (subscriber, #13312) [Link]

To start with, it is difficult to think of any random access storage device less suitable for database use than an SMR drive. The use of such drives in a busy production database sounds like a good way to reduce write transaction throughput by a factor of one hundred or more.

That said, it is puzzling to me that PostgreSQL would go out of its way to rewrite a block on a mass storage device simply to update a hint bit of the sort that ought to be easily derived by examining the contents of the block itself. Of course, if it is not rewriting blocks merely to update a hint bit, then there shouldn't be a problem.

Database use of SMR drives

Posted Jun 27, 2015 13:54 UTC (Sat) by kleptog (subscriber, #1183) [Link]

The hint bits are there because while it is possible to determine the state of the tuple by examining the tuple itself, that doesn't mean this check is free. There's a whole file dedicated to visibility checks and they aren't simple functions. These need to be called on every single row in every single scan. Hint bits help short circuit that and help performance. Most useful is the bit that says "this tuple is visible to everyone" because it helps everyone and is true most of the time, just not when the tuple was first inserted..

The other related situation is when the transaction counter wraps, which it does once every 2^32 transactions. The XIDs in the tuples need to be updated with a special marker to indicate it's not valid any more.

There are solutions to these problems but they generally cost in either disk space or performance elsewhere. You could for example switch to a 64-bit transaction IDs and never remove any old transaction logs (the clog) and you would never need to rewrite old tuples. Of course, you'd never be able to recover the disk space used by the transaction logs either.

SMR for databases would be useful for the WAL logging, since they are write once, but for the tables it seems unlikely.

Database use of SMR drives

Posted Jun 27, 2015 17:08 UTC (Sat) by andresfreund (subscriber, #69562) [Link]

> The hint bits are there because while it is possible to determine the state of the tuple by examining the tuple itself, that doesn't mean this check is free. There's a whole file dedicated to visibility checks and they aren't simple functions. These need to be called on every single row in every single scan. Hint bits help short circuit that and help performance. Most useful is the bit that says "this tuple is visible to everyone" because it helps everyone and is true most of the time, just not when the tuple was first inserted..

Correct.

Minor nitpick: It's not primarily the cost of these functions, they're called in many situations regardless. It's that another on-disk file (the 'clog', an integer-indexed file listing whether a transaction committed or not) has to be consulted. That's often where much of the time is spent, especially if you have significant throughput and access older, uncached regions of the clog.

> SMR for databases would be useful for the WAL logging, since they are write once, but for the tables it seems unlikely.

I think SMR for the WAL would not be a good idea - due to the need for somewhat frequent fsyncs you'll likely end up with horrible performance.

But for some workloads where you have append-only data that is only infrequently read it's not unrealistic to use SMR. It's quite common that only the last few months worth of data will be modified, but that you have to archive years worth for regulatory and reporting reasons. Moving older partitions to storage with different characteristics is not unreasonable.

Database use of SMR drives

Posted Jun 30, 2015 21:27 UTC (Tue) by snuxoll (guest, #61198) [Link]

> You could for example switch to a 64-bit transaction IDs and never remove any old transaction logs (the clog) and you would never need to rewrite old tuples. Of course, you'd never be able to recover the disk space used by the transaction logs either.

I'm confused, how would not rolling over the XID prevent you from removing old WAL segments?

Database use of SMR drives

Posted Jun 30, 2015 21:51 UTC (Tue) by kleptog (subscriber, #1183) [Link]

> I'm confused, how would not rolling over the XID prevent you from removing old WAL segments?

You can always eventually remove old WAL segments, but not the transaction log, the one that tracks for each transaction whether it committed or not. It's only 2 bits per transaction, but after 2^32 transactions that's still 1GB of disk space. To get rid of those logs you have to at some point go back to mark the tuple either permanently committed or rolled back. This isn't done with the hint bits though, but by using a special Frozen-XID marker.

The rolling over is only slightly related. If you're going to truncate the transaction log anyway to less than 2^32 transactions, then you don't need to remember more than 2^32 transactions and so you can save the space by only using 32-bit XIDs. If you use larger XIDs then you have more flexibility for people who don't mind the few GB of disk space to remember the last 2^36 transactions for example.


Copyright © 2015, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds