Wednesday, April 02, 2014

Subtly Bad Things Linux May Be Doing To PostgreSQL

In addition to talking about PostgreSQL at LSF/MM and Collab, I also learned a few things about the Linux kernel that I had not known before, some of which could have implications for PostgreSQL performance.  These are issues which I haven't heard discussed before in the PostgreSQL community, and they are somewhat subtle, so I thought it would be worth writing about them.

1. Regardless of how you set vm.zone_reclaim_mode, the kernel page cache always prefers to use pages from the local NUMA node.  Many PostgreSQL community members have already determined that vm.zone_reclaim_mode = 1 is bad for PostgreSQL workloads, because the system will essentially always allocate from the local node even if large amounts of memory are available on other nodes.  What I learned last week is that even with vm.zone_reclaim_mode = 0, we're not out of the woods: the kernel will overflow to another node when no pages are available, but it will also wake up kswapd to reclaim more pages on the local node.  This can result in the reclaim rate being much higher on the local node than on other nodes, so that relatively hot pages on the local node are evicted in preference to relatively cold pages on other nodes.  This probably won't cause big problems for pgbench-type workloads, where we run lots of short queries concurrently, but it could cause problems when a single process, or several processes on the same node, touch a large number of pages and some of them more than once.  Also, it might not be obvious that the problem traced back to suboptimal page eviction decisions on the part of the kernel.  I fear that there could be a significant number of users suffering from this issue who aren't able to diagnose this as the cause.

2. If you perform a buffered (that is, no O_DIRECT) write of a complete block, and that block is not in the page cache, there is some suggestion that Linux will read the block back from disk before overwriting it.   Actually, two different kernel developers gave us opposite answers about whether or not Linux is smart enough to optimize away the reads, but testing by Andres Freund seemed to suggest that, whether for this reason or some other, writing data not in cache can lead to a large volume of read I/O, equal to the write I/O.  Some reads might be expected, since metadata might need to be brought into cache, but if only metadata is being read, this should be small compared to the volume of writes.  I'm still on the fence about whether this is a real problem, but if it is, it will hurt people who set large values of shared_buffers in the hopes of making their entire workload stay within PostgreSQL's cache: unless there's still enough memory remaining at the OS-level to double-buffer everything that lives in our cache, such users will incur additional (unnecessary) read I/O.

3. Not that it's particularly bad for PostgreSQL, but as a point of possible future interest, I found out that the Linux kernel does not (and most likely will not in the future) recommend mixing O_DIRECT and non-O_DIRECT I/O on a single file.  Doing this is expected to result in very poor performance, because the use of O_DIRECT causes page cache invalidations.  I haven't been able to think of a scenario in which this will actually hurt PostgreSQL users today, because we only use O_DIRECT for WAL, and then only if wal_sync_method = open_sync or open_datasync, and then only neither archiving nor streaming replication is in use.  However, it's certainly worth keeping in mind if we ever consider expanding the use of O_DIRECT.

Again, thanks to all of the kernel folks who spent time discussing these and other issues with us last week.

16 comments:

  1. So with regards to number 1, what are the kernel devs doing about it, and/or what are the workarounds?

    ReplyDelete
    Replies
    1. Mel Gorman told me that there is a way to detect the problem using information from /proc, but he couldn't rattle off the details offhand, so I don't have them. There was some discussion of possibly rejiggering things so that the kernel notices when the reclaim rate for one zone is much higher than any other zone and behaves differently in that case, but it didn't appear there was broad consensus on what the behavior should be. I think a useful first step would be to try to find out whether and to what degree this problem is actually affecting PostgreSQL workloads, but I don't (yet!) know how to do that.

      Delete
  2. I'd love to see a FreeBSD kernel hacker chime in here.

    ZFS enables some interesting things for pgsql:

    * http://open-zfs.org/wiki/Performance_tuning#PostgreSQL - it seems like the primarycache setting prevents the double buffering problem that Linux' page cache has
    * http://citusdata.com/blog/64-zfs-compression

    ReplyDelete
  3. Replies
    1. sorry, let me more verbose:

      hell no.

      Delete
    2. Why "hell no"? What's the problem with SmartOS in this context?

      Delete
  4. Robert, thanks for the writeup!

    What is the exact scenario that triggers #2?

    You've got a page in shared_buffers, you want to flush it to disk, Kernel notices that page is _NOT_ in page cache _BUT_ in shared buffers and then triggers a superfluous read before writing the page? Or is this unrelated to shared_buffers?

    ReplyDelete
    Replies
    1. More or less what you said, except that the kernel has no idea whether or not the page is in shared_buffers, and it doesn't matter either way. As it turns out, most cases where we're overwriting a page in an existing relation are going to be coming from shared_buffers, but there are other similar cases, e.g. it could be in temp_buffers, or it could be a page from an SLRU, or we could repeatedly overwrite the current WAL page (though it'd be pretty unlucky if the kernel evicted something that hot). At any rate the kernel doesn't care: the issue is simply whether an 8kB-aligned write from swap-backed memory into a file on disk will fault the underlying filesystem blocks in from disk if they're not already in the kernel's page cache, or whether it will determine that the read can be optimized away since the entire page is slated for overwrite.

      Delete
    2. Gotcha - just did a quick test [1] with dd which seemed fine for 8kb writes, would be good to get some confirmation if/on which systems this is an issue.

      [1] http://nopaste.narf.at/show/3075/

      Delete
    3. This comment has been removed by the author.

      Delete
    4. I would guess dd opens the output file with O_WRONLY and this could potentially not trigger the issue.
      I assume PostgreSQL opens it with O_RDWR. Is this correct, Robert?

      Delete
    5. dd also opens with O_RDWR in my example: http://nopaste.narf.at/show/3077/, so that shouldn't be a differentiator.

      Delete
  5. THP compaction should be mentioned. I have strong circumstantial evidence (but not proof) that it is causing issues with high memory systems.

    ReplyDelete
  6. The fsync() issue sounds like a much bigger performance problem for a database app than NUMA scheduling or cache misses.

    I wonder if one could put the WAL files on one disk and the actual db files on another. That way queries would not block when committing.

    Though I believe this is only an issue for read AND write intensive databases. Most db setups would be either or, again imho.

    ReplyDelete
  7. So I've noticed my kswapd waking up and chewing up tons of CPU and swapping like mad before on a machine with 512GB RAM. The only solution on this class machine is to turn off swap, which fixed it immediately. Could this be related to #1 in your list?

    ReplyDelete