Page Checksum Protection in PostgreSQL

PostgreSQL supports a feature called page checksums that, if enabled cluster-wide, protects the database from data corruption on the disk. The protection does not involve automatic data recovery; rather, it is a way to discard a piece of data that is no longer considered reliable. In this short article, readers will see the effect of data corruption and how PostgreSQL reacts to such an event.

You will learn

  • How to enable page checksums
  • What page checksum protects you from
  • How to simulate a page corruption
  • How to erase the damaged page

You need to know

  • How to interact with a PostgreSQL (9.6) database
  • How PostgreSQL lays out data on disk
  • How to write and run a small Perl program

Introduction

PostgreSQL has supported the page checksum feature since version 9.3; this feature allows the cluster to verify every data page it reads back in, in order to determine whether the page is reliable. A reliable page is a page that has not been corrupted on its path from memory to disk (when writing data) or in the opposite direction (when reading the data back).

As readers probably know, data corruption can happen because of a bug or a failure in the disk controller, in the memory, and so on. What is the risk of data corruption from a PostgreSQL point of view? A corrupted data page contains wrong tuples, meaning the data within those tuples is possibly wrong. Using such data could make a single SELECT statement report wrong results or, worse, the data could be used to build up other tuples and therefore “import” corrupted data into the database.

It is important to note that the page checksum feature does not enforce database consistency: the latter is provided by the Write-Ahead Logs (WALs), which have always been strongly protected from corruption with several techniques, including a checksum on each WAL record. But while WALs ensure that data is made persistent, they do not protect you from a silent corruption that hits a tuple (or alike), and this “silent” corruption would otherwise be read back into the database at some point in the future.

What can the database do when a corrupted data page lies around? There are two possible scenarios:

  1. The data page is read in and used as if it were reliable (i.e., the corruption is not detected at all);
  2. The data page is discarded, and therefore the data contained in it is not considered at all.

Without page checksums, PostgreSQL defaults to scenario 1: the corruption is simply not detected. Hence, possibly corrupted data (e.g., a tuple, a field, a whole range of an index or table) is used in live operations and can corrupt other data.

With page checksums enabled, PostgreSQL will discard the page and all the data within it, for instance all the tuples of a table stored in such a page. It is the administrator’s duty to decide what to do with such a data page; there is nothing PostgreSQL can do automatically, since it cannot know what the real corruption is and what caused it.

Enabling page checksums

This feature can be enabled only at cluster initialization via initdb: the --data-checksums option instructs the command to enable page checksums from the very beginning of the database’s life. It is worth noting that page checksums imply a little more resource consumption to compute, store, and verify the checksums on each page, and more resource consumption means lower throughput. Therefore, the database administrator must decide what is more important: performance or data reliability. Usually, the latter is the right choice for pretty much any setup; it is important to understand that without page checksums there is no protection at all against external data corruption.

Consequently, to enable page checksums, initdb has to be run with the --data-checksums option. For instance, a new cluster can be initialized as shown in Table 1:

$ initdb --data-checksums -D /mnt/data1/pgdata/

Table 1. Initializing a new cluster with page checksums enabled

Once the database has been initialized as shown above, the user can interact with it in the same way as if page checksums were disabled. The whole feature is totally transparent to the database user or administrator.
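If you need to check whether an existing cluster was initialized with page checksums, the read-only data_checksums parameter reports it; a minimal check from psql (assuming a server of version 9.4 or later, where this parameter is exposed):

# SHOW data_checksums;
 data_checksums
----------------
 on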

Forcing a corruption

Readers must not execute the following steps in production!

The main aim of this section is to provide insight into what happens when data is corrupted, but readers must understand that the following steps will deliberately destroy their data!

First of all, find a table to corrupt. The following query will show all the user tables ordered by descending page count; hence, the first table that shows up is a “large” table (see Table 2).

# SELECT relname, relpages, reltuples, relfilenode FROM pg_class
  WHERE relkind = 'r'
  AND relname NOT LIKE 'pg%'
  ORDER BY relpages DESC;

-[ RECORD 1 ]------------------------
relname     | event
relpages    | 13439
reltuples   | 2.11034e+06
relfilenode | 19584

...

Table 2. User tables ordered by descending page count

As readers can see, the event table has 13439 data pages and around two million tuples, so it is a large enough table to play with.

In order to find the real file on disk, it is important to get the OID of the database, which can be obtained with the following query (see Table 3).

# SELECT datname, oid FROM pg_database;

  datname  |  oid
-----------+-------
 postgres  | 12758
 template1 |     1
 template0 | 12757
 luca      | 16389
 foodb     | 18936
 testdb    | 19554


Table 3. The databases and their OIDs

Since the event table is within the testdb database, the file on disk will be $PGDATA/base/19554/19584. The oid2name(1) utility can be used to extract the very same information for databases and tables.
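Alternatively, the built-in pg_relation_filepath() function returns the same path, relative to $PGDATA, directly from a session connected to testdb; a quick cross-check of the example above:

# SELECT pg_relation_filepath('event');
 pg_relation_filepath
----------------------
 base/19554/19584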

 

Corrupting a data page

The following simple Perl script will corrupt a data page (see Table 4).

#!/usr/bin/env perl

# usage: corrupt.pl <data file> <page number>
open my $db_file, '+<', $ARGV[0]
    or die "Cannot open the file!\n\n";

# move to the beginning of the specified 8kB data page
seek $db_file, ( 8 * 1024 ) * $ARGV[1], 0;

# overwrite part of the page with garbage
print { $db_file } 'Hello Corrupted Database!';
close $db_file;

 

Table 4. A simple Perl script that corrupts a data page

The idea is simple:

      • Open the specified data file (the one named after the relfilenode value in the previous SQL query);
      • Move to the specified data page (please note that data pages are usually 8kB in size for a default PostgreSQL installation);
      • Print out a string to corrupt the data;
      • Close the file and flush to disk.

To perform the corruption, you have to launch the program with something like the command shown in Table 5.

% sudo perl corrupt.pl /mnt/data1/pgdata/base/19554/19584 20

Table 5. Launching the corruption program

 

The above will corrupt data page (block) 20 of the event table. This can be done while the database is running or stopped; note, however, that on a running cluster a copy of the page cached in shared_buffers may mask the corruption until the page is read from disk again.
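To be sure the corrupted page is actually read from disk afterwards, a simple approach is to perform the corruption while the cluster is shut down; a minimal sketch, assuming the cluster data directory is /mnt/data1/pgdata as in the previous examples:

% pg_ctl -D /mnt/data1/pgdata stop
% sudo perl corrupt.pl /mnt/data1/pgdata/base/19554/19584 20
% pg_ctl -D /mnt/data1/pgdata start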

 

Seeing the corruption

When you try to access the relation, PostgreSQL will clearly state that there is a corruption in the data page (see Table 6).

> SELECT * FROM event;

...

ERROR: invalid page in block 20 of relation base/19554/19584

Table 6. PostgreSQL reports the corrupted data page

 

So far, the database has no way to recover the data, but at least it is not reading in wrong data!


Cleaning the damaged page

Since PostgreSQL can do nothing about data recovery, the only choice it has is to zero the damaged page. In other words, unless you really need the page to inspect the corruption, you can instruct PostgreSQL to clean the page and make it reusable (as a fresh new page). The data will still be lost, but at least you will not waste space on the disk.

PostgreSQL provides the zero_damaged_pages option, which can be set either in the configuration file postgresql.conf or in the running session. For instance, if a session performs the same extraction from the table with zero_damaged_pages enabled, PostgreSQL will not report any error to the client (see Table 7).

# SET zero_damaged_pages TO 'on';

# SELECT * FROM event;

...

-- the query runs to the end

Table 7. With zero_damaged_pages enabled, the query completes without errors
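Since zero_damaged_pages is a dangerous setting, it is prudent to turn it back off in the session as soon as the cleanup is done:

# RESET zero_damaged_pages;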

 

But in the cluster logs, there will be a notice about the cleanup of the page (see Table 8).

WARNING: page verification failed, calculated checksum 61489 but expected 61452

WARNING: invalid page in block 20 of relation base/19554/19584; zeroing out page

Table 8. The cluster logs report the zeroing of the page

 

Moreover, the relation will have one page fewer than before (see Table 9).

# SELECT relname, relpages, reltuples, relfilenode FROM pg_class
  WHERE relkind = 'r'
  AND relname NOT LIKE 'pg%'
  ORDER BY relpages DESC;

-[ RECORD 1 ]------------------------
relname     | event
relpages    | 13438
reltuples   | 2.11015e+06
relfilenode | 19841

...

Table 9. The relation has one page fewer

 

The number of pages is now 13438, one fewer than the original 13439: PostgreSQL found out that a page was not reliable and discarded it.

 

Vacuum and autovacuum

The same effect takes place when a vacuum is run against the table (see Table 10).

# SET zero_damaged_pages TO 'on';
# VACUUM FULL VERBOSE event;

INFO: vacuuming "public.event"

WARNING: page verification failed, calculated checksum 22447 but expected 19660

WARNING: invalid page in block 1 of relation base/19554/19857; zeroing out page

INFO: "event": found 0 removable, 2109837 nonremovable row versions in 13437 pages

Table 10. Running a vacuum against the damaged table

 

However, do not expect autovacuum to behave the same way: it is a design choice not to allow autovacuum to clean up damaged pages, as you can read in the source code of the autovacuum process (see Table 11).

/*
 * Force zero_damaged_pages OFF in the autovac process, even if it is set
 * in postgresql.conf. We don't really want such a dangerous option being
 * applied non-interactively.
 */

SetConfigOption("zero_damaged_pages", "false", PGC_SUSET, PGC_S_OVERRIDE);


Table 11. The autovacuum process forces zero_damaged_pages off

 

As you can see, the option zero_damaged_pages is always set to false, so an autovacuum process will never zero (or clean) a page. The idea is that such an operation is so important that an administrator should be notified and should decide manually whether to perform the cleanup. In fact, a page corruption often means there is a problem with the hardware (or the filesystem or other software) that requires further investigation, and possibly a recovery from a reliable backup.
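As a starting point for such an investigation, the filenode reported in the logs can be mapped back to a relation name. A minimal sketch, assuming the relation lives in the default tablespace (hence the 0 passed as first argument) and a server of version 9.4 or later, where pg_filenode_relation() is available:

# SELECT pg_filenode_relation(0, 19584);
 pg_filenode_relation
----------------------
 event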

 

Conclusions

The page checksum feature allows PostgreSQL to detect silent data corruption that happened outside the WALs, i.e., on real data pages. The database cannot decide automatically how to recover such data; therefore, the only choice left to the administrator is whether to clean up the wrong page. However, once a corruption is detected, PostgreSQL will refuse to read in such a page, thus protecting the other data pages from being polluted.

 

About Luca Ferrari

Luca lives in Italy with his beautiful wife, his great son, and two female cats.

Passionate about computer science since the Commodore 64 age, he holds a master’s degree and a PhD in Computer Science. He is a PostgreSQL enthusiast, a Perl lover, an operating system passionate, a UNIX fan, and performs as many tasks as possible within Emacs. He considers Open Source the only truly sane way of interacting with software and services.

His website is available at http://fluca1978.github.io

 
