
While not identical, the common BSD heritage of Solaris and FreeBSD means their respective filesystems have much in common, and both implement the same basic ZFS code as their most advanced current filesystem. Choosing between the older UFS options and ZFS involves the usual performance and reliability trade-offs found in so many other disk-related decisions. In general, ZFS is particularly good at handling very large databases, while UFS can perform better on smaller ones. The feature sets are different enough that this may not be the deciding factor for your installation, though.
The original Unix File System (UFS) implementation, also called the Berkeley Fast File System or UFS1, originated in BSD UNIX. It later appeared in several commercial UNIX variations, including Solaris. The current Solaris UFS adds two major features not found in the original UFS: support for larger files and filesystems (up to 16 TB) and logging.
The logging here is again similar to the database write-ahead logging and the journaling used in Linux, and as with Linux's ext3 journal mode, there are known situations where turning logging on is a performance improvement for UFS, because it turns what would otherwise be random writes into a series of sequential writes to the log. Logging is turned on by default in current Solaris releases; in releases before Solaris 9 U6 (04/04) it had to be specifically enabled by adjusting the filesystem options in /etc/vfstab to include it:
/dev/dsk/c0t0d0s1 /dev/rdsk/c0t0d0s1 / ufs 1 yes logging
In current versions where it's on by default, logging can be disabled (which is not recommended) by using the nologging mount option.
UFS is not tuned very well out of the box for PostgreSQL use. One major issue is that it only tries to cache small files, which means it won't do what the database expects with the typically much larger database files. The maximum amount of RAM used for caching is also tiny: 12% of RAM on SPARC systems and only 64 MB on Intel/AMD x64 systems. Reads and writes are not done in particularly large blocks either, which can be a problem for some workloads.
The last of those is the most straightforward to fix. When executing physical I/O, requests are broken up into blocks no larger than the maxphys parameter. This also serves as a limiter on read-ahead, and is therefore quite important to adjust upward. Normal practice is to adjust it to match the maximum allowed by the typical device drivers, 1 MB. The klustsize parameter works similarly for reads and writes to swap.
set maxphys=1048576
set klustsize=1048576
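These set lines are kernel tunables rather than shell commands; the set syntax matches what goes into /etc/system (the same file used for the zfs_nocacheflush example later in this section), where the values take effect after a reboot. On a live system you can confirm what the running kernel is actually using with mdb, for example:
$ echo maxphys/D | mdb -k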
The code that stops UFS from caching larger files in memory is named freebehind. For PostgreSQL, you want this turned off altogether by setting it to 0, rather than trying to tune its threshold. The total size of the UFS filesystem cache, called the segment map, is set differently based on whether you're using a SPARC or x64 Solaris version.
On SPARC Solaris systems:
set freebehind=0
set segmap_percent=60
This size (60% of total RAM) presumes you're setting shared_buffers
to 25% of total RAM and have a dedicated database server, so 85% of RAM is available for effective database cache. You might want a smaller value on some systems.
On Intel/AMD x64 Solaris systems:
set ufs:freebehind=0
set segmapsize=1073741824
This example sets the cache size to 1 GB. You'll need to adjust this fixed value on each server based on the amount of RAM in your server, rather than being able to use a percentage.
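For example, if you decided a 4 GB segment map was appropriate for your server, the value is specified in bytes, so you would use 4 × 1073741824 (this figure is purely illustrative; size it against your own RAM):
set segmapsize=4294967296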
Solaris systems allow turning off access time tracking using noatime as a mount option, similarly to Linux. This is a useful small optimization to enable.
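Assuming the same hypothetical device shown in the earlier /etc/vfstab example, the mount options column accepts a comma-separated list, so logging and noatime can be combined:
/dev/dsk/c0t0d0s1 /dev/rdsk/c0t0d0s1 / ufs 1 yes logging,noatime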
A final Solaris-specific option to consider with UFS relates to the WAL writes. Since the WAL isn't read from except after a crash, any RAM used to cache it is wasted. The normal way to bypass that is to use UNIX direct I/O, which doesn't go through the filesystem cache, but that isn't fully implemented in PostgreSQL on Solaris. If you separate out the pg_xlog directory onto its own filesystem, you can mount that filesystem using the forcedirectio option to get optimal behavior here. This will quietly convert all the WAL writes to direct I/O that bypasses the OS cache.
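A sketch of such an /etc/vfstab entry, using a hypothetical second disk and a hypothetical /pg_xlog mount point, might look like the following:
/dev/dsk/c0t1d0s0 /dev/rdsk/c0t1d0s0 /pg_xlog ufs 2 yes forcedirectio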
The FreeBSD community improved basic UFS in its own way, into what it calls UFS2. This also expanded filesystem capacity to far beyond 2 TB, although there are still some user tools that may not support this yet; be sure to investigate that carefully before presuming you can host a large database on FreeBSD.
Instead of implementing a journal (Linux) or logging (Solaris), FreeBSD's solution to filesystem integrity issues during a crash, and to the resulting long filesystem recovery times, is a technique called soft updates. This orders writes such that the only type of integrity issue after a crash is blocks marked as used but not actually "claimed" by any file. After an unclean shutdown, the filesystem can be brought up almost immediately. What's called a background fsck then runs against a static snapshot of the filesystem, searching for unclaimed blocks and cleaning them up. This removes the overhead of journaling while avoiding the worst of the traditional non-journaled filesystem issues: long integrity check times that hold up booting after a crash. While not as common as journaling, this technique has been in use on FreeBSD for ten years already without major issues.
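Soft updates are normally selected when the filesystem is created (the -U flag to newfs). If you need to check or change the setting on an existing filesystem, tunefs can toggle it while the filesystem is unmounted; a sketch against a hypothetical device would be:
$ tunefs -n enable /dev/ad0s1d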
Given that PostgreSQL layers its own integrity checks on top of what the filesystem implements, UFS2 certainly meets the requirements of the database. You just need to be aware that the fsck
activity related to the crash recovery will be a background activity competing with database reads and writes, and that might happen during the worst time—just after the server has started, when all of its caches are empty.
Just like on Linux, proper read-ahead significantly improves sequential read speed on FreeBSD. The relevant parameter, vfs.read_max, is sized in units of the filesystem block size, which defaults to 8 KB. The common useful range for this parameter is 32 to 256. Adjust the value by adding a line to /etc/sysctl.conf
like the following:
vfs.read_max = 32
To make this active, execute:
$ /etc/rc.d/sysctl start
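You can also apply the new value immediately, without re-reading the configuration file, by setting it directly with sysctl:
$ sysctl vfs.read_max=32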
The size of the write cache is set by the vfs.hirunningspace
parameter. The guidelines in the FreeBSD handbook suggest only increasing this to at most a small number of megabytes. You may want to increase this value to something larger if you are using a system with a battery-backed write controller, where the kernel can likely dump much larger amounts of writes onto disk in a batch without significant latency issues.
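As an illustrative sketch only, since the right value depends on the size of your controller's cache, raising it to 16 MB would use an /etc/sysctl.conf line like this:
vfs.hirunningspace=16777216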
Few filesystems have ever inspired the sort of zealous advocacy fans of ZFS regularly display. While it has its weak points, in many respects ZFS is a reinvention of the filesystem concept with significant advantages. One thing that's different about ZFS is that it combines filesystem and RAID capabilities into an integrated pair. The RAID-Z implementation in ZFS is a worthwhile alternative to standard RAID5 and RAID6 installations.
ZFS defaults to working in records of 128 KB in size. This is much larger than a PostgreSQL block, which can cause a variety of inefficiencies if your system is regularly reading or writing only small portions of the database at a time (like many OLTP systems do). It's only really appropriate if you prefer to optimize your system for large operations. The default might be fine if you're running a data warehouse that is constantly scanning large chunks of tables. But standard practice for ZFS database installations that do more scattered random I/O is to reduce the ZFS record size to match the database one, which means 8 K for PostgreSQL:
$ zfs set recordsize=8K zp1/data
You need to do this before creating any of the database files on the drive, because the record size is actually set per file. Note that this size will not be optimal for WAL writes, which may benefit from a larger recordsize like the default.
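You can confirm the property took effect on the example dataset with:
$ zfs get recordsize zp1/data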
One important thing to know about ZFS is that unlike Solaris's UFS, which caches almost nothing by default, ZFS is known to consume just about all the memory available for its Adaptive Replacement Cache (ARC). You'll need to reduce those amounts for use with PostgreSQL, where large blocks of RAM are expected to be allocated for the database buffer cache and things like working memory. The actual tuning details vary based on the Solaris release, and are documented in the "Limiting the ARC Cache" section of the ZFS Evil Tuning Guide at http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide.
For FreeBSD, refer to http://wiki.freebsd.org/ZFSTuningGuide for similar information. One of the scripts suggested there, arc_summary.pl, is useful in both its FreeBSD and Solaris incarnations for determining just what's in the ARC cache and whether it's using its RAM effectively. This is potentially valuable tuning feedback for PostgreSQL, where the OS cache is used quite heavily, but such use is not tracked for effectiveness by the database.
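As one illustration of what such a limit looks like, capping the ARC at 2 GB on Solaris would use a line like this in /etc/system (the exact tunable names and accepted units have varied between releases, so confirm against the guides above first):
set zfs:zfs_arc_max = 2147483648
The FreeBSD counterpart is the vfs.zfs.arc_max loader tunable, set in /boot/loader.conf:
vfs.zfs.arc_max="2147483648"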
ZFS handles its journaling using a structure called the intent log. High performance systems with many disks commonly allocate dedicated storage just to hold the ZFS intent log for the database disk, in the same way that the database WAL is commonly put on another drive. There's no need to have a dedicated intent log for the WAL disk too, though. There is more information on this and related topics in the ZFS for Databases documentation at http://www.solarisinternals.com/wiki/index.php/ZFS_for_Databases.
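Dedicating storage to the intent log is done by adding a log device to an existing pool; a sketch of the command, using the example pool from earlier and a hypothetical spare disk, would be:
$ zpool add zp1 log c0t2d0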
Similarly to XFS, if you have a system with a non-volatile write cache such as a battery-backed write controller, the cache flushing done by ZFS will defeat some of the benefit of that cache. You can disable that behavior by adjusting the zfs_nocacheflush
parameter; the following line in /etc/system
will do that:
set zfs:zfs_nocacheflush = 1
And you can toggle the value to 1 (no cache flushing) and back to 0 (default, flushing enabled) with the following on a live filesystem:
echo zfs_nocacheflush/W0t1 | mdb -kw
echo zfs_nocacheflush/W0t0 | mdb -kw
ZFS has a few features that make it well suited to database use. All reads and writes include block checksums, which allow ZFS to detect the sadly common situation where data is quietly corrupted by RAM or disk errors. Some administrators consider such checksums vital for running a large database safely. Another useful feature is ZFS's robust snapshot support. This makes it far easier to make a copy of a database that you can replicate to another location or back up, or even to create a temporary copy that you can later roll back to an earlier version. This can be particularly valuable when doing risky migrations or changes you might want to back out.
Because of the robustness of its intent log and block checksum features, ZFS is one filesystem where disabling PostgreSQL's full_page_writes
parameter is a candidate for optimization with little risk. It's quite resistant to the torn pages issue that makes that parameter important for other filesystems. There is also transparent compression available on ZFS. While expensive in terms of CPU, applications that do lots of sequential scans of data too small to be compressed by the PostgreSQL TOAST method might benefit from reading more logical data per physical read, which is what should happen if compression is enabled.
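Compression, like the record size, is set per dataset. Assuming the same example dataset used earlier, turning it on with the release's default algorithm looks like the following; check what algorithms your ZFS version offers before relying on a particular one:
$ zfs set compression=on zp1/data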