• LOGIN
  • No products in the cart.

Using PostgreSQL Foreign Data Wrapper to Keep Track of Files

You will learn …

  • How to using PostgreSQL Foreign Data Wrapper to keep track of files
  • Compiling and Installing the Foreign Data Wrapper
  • How to create the Extension
  • How to create the File System Table
  • Creating a Snapshot of files 

In this paper, you will see how PostgreSQL can be extended to pull data out of special /data sources that allow the database cluster to query the outside world called Foreign Data Wrapper/s. There are many implementations of FDW that allow PostgreSQL to live-query other databases, as well as other data sources like web pages, files, processes, and so on.

This paper proposes a simple setup of a File System FDW that allows a system administrator or an application to query the filesystem to get information about files, as well as storing at least one historical version of the latter. The approach presented here is not meant, to any extent, to substitute the traditional and better suited Source Control Management software (like RCS and alike). Moreover, all the examples provided aim only to present the reader with a simple background on the capabilities that FDW allow.

Introduction

Imagine you want to store some information about your system configuration file in a database. The solution is straightforward: build an application that can perform some DML (Data Manipulation Language) against a database.

Another approach is to use a File System FDW. A layer that connects your database directly to a File System Data Source so that instead of the database waiting for new data to be stored, it can (to some extent) pull the data automatically.

In this article, I will show you how to use the Multicorn FDW to achieve a poor-man database-SCM.

To execute the code snippets, you need:

  • git and gmake installed;
  • python version 2.7 or higher;
  • PostgreSQL (a recent version, for this article, I used version 9.6.5);
  • Access to privileged user capabilities (e.g., using sudo).

You will also need some basic knowledge about PostgreSQL, how to create a database, a superuser role, and so on. You can get more information reading the online documentation or my previous articles on the matter.

Compiling and Installing the Foreign Data Wrapper

There are several Foreign Data Wrappers (FDW) available for PostgreSQL. In this example, we are going to use the Multicorn FDW, a set of Python modules that provide several FDW implementations within the same installation. One of such implementation is the File System FDW.

The first step is to get the latest Multicorn implementation. In this example, you will install the development version obtained via Git:

% git clone git://github.com/Kozea/Multicorn.git

Before you can actually compile Multicorn, you need to adjust it to compile on FreeBSD:

  1. Edit the preflight-chech.sh file, and change the first line with the current available Bash, that is:
% head -n1 preflight-check.sh

#!/usr/local/bin/bash
  1. Remember to run gmake instead of make, so:
% gmake && sudo gmake install

Create the Extension

To create the extension, you need to connect to the PostgreSQL database as superuser, and then load the Multicorn extension. After that, you need to define a Data Server, an entry point for external data to come into the database.

Therefore:

# CREATE EXTENSION multicorn;



# CREATE SERVER filesystem_server

  FOREIGN DATA WRAPPER multicorn

  OPTIONS ( wrapper 'multicorn.fsfdw.FilesystemFdw' );

Create the File System Table

Suppose we want to collect information about the /usr/local/etc/ configuration files. Therefore, you need to define a table that will contain various data:

  • the filename;
  • the content (as text);

We can elaborate a little more by adding a hash column, and the date the file has been inspected.

Therefore, the table will be defined as:

# CREATE FOREIGN TABLE usr_local_etc (

    full_file_name text,

    content        text,

    service        text



) SERVER filesystem_server

  OPTIONS( root_dir '/usr/local/etc',

           pattern '{service}.conf',

           content_column 'content',

           filename_column 'full_file_name' );

Now, you can try it with a simple SELECT statement:

# SELECT service, full_file_name

  FROM usr_local_etc;



 service  | full_file_name

----------+----------------

 pkg      | pkg.conf

 tcsd     | tcsd.conf

 pcp      | pcp.conf

 pgpool   | pgpool.conf

 pool_hba | pool_hba.conf

 idn      | idn.conf

 idnalias | idnalias.conf

However, there is a hidden problem: while the user can run simple stat commands on the filesystem, he/she cannot get the content of the files. In fact, if you try to get the content of a file you’ll get an error:

# SELECT service, content FROM usr_local_etc;

ERROR:  Error in python: OSError

DETAIL:  [Errno 13] Permission denied: '/usr/local/etc/tcsd.conf'

The problem arises from the fact that /usr/local/etc/tcsd.conf has no world-readable flag. A quick solution is to allow another user to read by either changing the file mode (e.g., 644) or to invite the user running the PostgreSQL server to the group of the file owner (in this case _tss), and setting the mode to 640.

% id postgres

uid=770(postgres) gid=770(postgres) groups=770(postgres)



% sudo pw usermod -n postgres -G _tss



% id postgres

uid=770(postgres) gid=770(postgres) groups=770(postgres),601(_tss)

Once the above problem is solved and the trick applied to any problematic file, you can query the table to get living data from the underlying file system:

# SELECT service, content

  FROM usr_local_etc

  WHERE service = 'pkg';



 service |                               content

---------+---------------------------------------------------------------------

 pkg     | # System-wide configuration file for pkg(8)                        +

         | # For more information on the file format and                      +

         | # options please refer to the pkg.conf(5) man page                 +

         |                                                            +

         | # Note: you don't need to have a pkg.conf file.  Many installations+

         | # will work well with no pkg.conf at all or with an empty pkg.conf +

         | # (other than comment lines).  You can also override any of these  +

         | # settings from the environment.                                   +

         |                                                            +

         | # Configuration options -- default values.                         +

         |                                                            +

         | #PKG_DBDIR = "/var/db/pkg";                                        +

         | #PKG_CACHEDIR = "/var/cache/pkg";                                  +

...

Creating a Snapshot of files

Using the Foreign Data Wrapper, the database will query the filesystem each time you issue a query, and this means the data in the usr_local_etc table will change accordingly to changes performed outside the database. If you need to keep a snapshot of the file content, let’s say to implement a poor-man file control management, you can use a materialized view.

A materialized view is a view over data that is populated by a snapshot of data pulled out from a table. Each time you refresh the view, new data is pulled out of the table. Otherwise, the view will provide a static snapshot of the data at the time it was last updated.

To better explain it, let’s create a materialized view to get the content of the files into the file system:

# CREATE MATERIALIZED VIEW usr_local_etc_snapshot AS

    SELECT service, full_file_name, content,

       current_timestamp AS ts,

       md5( content ) AS hash

    FROM usr_local_etc

    ORDER BY service

    WITH NO DATA;

When you decide to pull updated data from the filesystem into your snapshot, do the following:

# REFRESH MATERIALIZED VIEW usr_local_etc_snapshot;

Let’s check that the data into the view is coherent with what is in the database:

# SELECT full_file_name, hash, ts

  FROM usr_local_etc_snapshot;



 full_file_name |               hash |  ts

----------------+----------------------------------+-------------------------------

 pkg.conf       | 84925257b233f69068214cdaf3f630a2 | 2017-11-09 16:54:30.668574+01

...

Also, check the MD5 outside of the database:

% sudo md5 /usr/local/etc/pkg.conf                                                                                       ~

MD5 (/usr/local/etc/pkg.conf) = 84925257b233f69068214cdaf3f630a2

As you can see, the MD5 is the same. Therefore, the data in the materialized view does  represent the current snapshot of the filesystem.

Now, imagine you modify the pkg.conf file so that it is updated outside of the database:

% sudo emacs /usr/local/etc/pkg.conf

...

% sudo md5 /usr/local/etc/pkg.conf

MD5 (/usr/local/etc/pkg.conf) = a82431a939e221dd5fc8b702542a30d4

Then, let’s see what the materialized view reports:

# SELECT full_file_name, hash, ts

  FROM usr_local_etc_snapshot

  WHERE service = 'pkg';



 full_file_name |               hash |  ts

----------------+----------------------------------+-------------------------------

 pkg.conf       | 84925257b233f69068214cdaf3f630a2 | 2017-11-09 16:54:30.668574+01

As expected, it does still report the old hash, the data within the materialized view which has not been modified. What this means is that the content column of the view also has a track of the old (i.e., before editing) content of the same file, allowing for a quick (and dirty) restore of the file content.

What has Changed?

The fact that the materialized view contains the snapshot of the filesystem allows for querying the status of the filesystem itself against the previous (last) snapshot:

# WITH current AS (

    SELECT service, md5( content ) AS hash

    FROM usr_local_etc

  )

 SELECT service, ts AS ModifiedSince

    FROM usr_local_etc_snapshot snapshot

    WHERE snapshot.hash <> (

                  SELECT hash

                  FROM current

                  WHERE service = snapshot.service )



 UNION



 SELECT service, ts AS ModifiedSince

    FROM usr_local_etc_snapshot snapshot

    WHERE NOT EXISTS (

                 SELECT service

                 FROM current

                 WHERE service = snapshot.service );

The above query is made up of three parts:

  1. current is a CTE (Common Table Expression), a sub-query that computes the hash on the current file system data (i.e., querying the FDW);
  2. the first SELECT extracts all files that have been modified since the last snapshot (i.e., since the last REFRESH MATERIALIZED VIEW);
  3. the second SELECT extracts all files deleted since the last snapshot.

Running the above query provides the following result:

 service |         modifiedsince

---------+-------------------------------

 pkg     | 2017-11-09 16:54:30.668574+01

meaning that the pkg service has been modified since the last time it was taken into the materialized view.

Conclusions

This article has demonstrated a concrete application of PostgreSQL Foreign Data Wrappers feature to allow the database to query other data sources, in particular, a file system to get and track file information. There are a lot of FDW implementations allowing even more, like web browsing and parsing, other database querying, web service interactions and so on. These can all be used as building blocks for a more complex layer of data management.

Meet the author

Luca Ferrari lives in Italy with his beautiful wife, his great son, and two female cats.

Computer science passionate since the Commodore 64 era, he holds a master degree and a PhD in Computer Science. He is a PostgreSQL enthusiast, a Perl lover, an Operating System passionate, a UNIX fan, and performs as much tasks as possible within Emacs. He considers the Open Source the only truly sane way of interacting with software and services. His website is available at http://fluca1978.github.io

 

References

 

Leave a Reply

Be the First to Comment!

avatar
  Subscribe  
Notify of
© HAKIN9 MEDIA SP. Z O.O. SP. K. 2013