PostgreSQL--the Linux under the Databases

A close look at the PostgreSQL database, including programming interfaces and using it for WWW applications.

by Rolf Herzog

One milestone in the spreading of Linux for general use is the availability of office applications. Besides text-processing, spreadsheets and graphic applications, databases are an important part of this type of application. Until now, the big players in the database market like Sybase and Oracle haven't released a Linux version of their products; nevertheless, there are several database packages available for Linux, mostly commercial applications. There is one database which deserves special attention, because it is now developed similarly to Linux. This database is PostgreSQL, formerly known as Postgres95. Last October, version 6.2.1 was published, and this is a good time to take a closer look at this project.

PostgreSQL has a long history beginning in the year 1986 when Michael Stonebraker began its development as a successor to the Ingres database system. The main goal of his project was to show that a relational database system could cope with modern demands of extensibility as well as an object-oriented system. The resulting product was called an object-relational database because it was a relational database system with some object-oriented features such as inheritance and user-defined functions, operators and access methods. In 1994, with the release of version 4.2, the work on Postgres stopped because Stonebraker's project ended. However, some staff members decided to continue with their work and in 1995 released Postgres95 with SQL as the query language rather than PostQuel. Now the development is carried on under the name PostgreSQL by a team of volunteers on the Internet, coordinated by Marc G. Fournier.

The package is not released under the GPL. It has its own license that allows the distribution of modified releases even without source code, as long as the copyright notice remains untouched.

Features of PostgreSQL

Client/Server Database for a multi-user environment
Networking with TCP/IP
Three Authentication methods: Kerberos, host/user based and username/password authentication
SQL as query language (It is not fully ANSI SQL92 compliant; options not available are nested subqueries, HAVING clause in an ORDER BY statement, OUTER JOIN, PRIMARY and FOREIGN KEY statements during table creation.)
Multiple index types, unique indexes and multi-column indices
User-defined functions (SQL, C), operators and data types
User-defined sequences and trigger functions
Language interfaces for C, C++, Objective-C, Java, Perl, Tcl/Tk and Python
Available third party ODBC driver
Ported for Linux/Intel, Linux/Solaris, Linux/Alpha, AIX, DEC Alpha AXP, FreeBSD, BSDI, DG/UX, HP-UX, Nextstep, Solaris x86, Solaris Sparc, SunOS Sparc, SGI Irix, SCO, Intel SVR4, Ultrix

The Client/Server Model

Client/Server databases are different from the monolithic systems familiar to users of MS-DOS and Windows machines. It is not sufficient to start only one program to work with the database. There are at least three different applications running simultaneously. One is the server, which in PostgreSQL is called the postmaster. It is a daemon which observes a TCP port (usually port 5432) for a connection from the second: a user client. This user client could be psql, a command-line tool for sending SQL queries that comes with PostgreSQL, or specialized applications, like a CGI program that wishes to store data gathered from an HTML form. Once a client connects, the postmaster starts the third application: the back-end server. This is the actual database engine, because only the back-end server has direct access to the stored records of the database. The postmaster connects the client with the back-end server and then waits for other connections. In the meantime, the client sends its queries to the back-end and receives its reply, usually a lot of data. When the client has no other query, the connection drops and the back-end server exits.

The consequence of this procedure is that the client, the program that the user works with, has no direct access to the database. It doesn't know how many records the back end sends in response to a query. It doesn't know if the inserted data were successfully processed, unless the back end signals success. Therefore, fancy table editors such as those in dBase or Access are difficult to implement and don't currently exist.

Installation

The installation of PostgreSQL is a little tedious at present, but it becomes easier with every release. Version 6.2.1 comes with configure-support and detailed and comprehensive installation instructions with specific notes for Linux. I will mention some possible pitfalls during the installation process. Make sure you have a recent version of flex installed on your system. In particular v2.5.3 does not work with PostgreSQL whereas v2.5.2 and v2.5.4 work fine. For the client psql, you need the developer libraries from GNU Readline, which are not installed by default on some Linux distributions.

On Linux, PostgreSQL is compiled with a shared library named libpq.so, which is used by all client applications. You need to make sure that the dynamic linker finds this library by adding the directory $POSTGRESDIR/lib to the file ld.so.config. Especially with older versions of ld-linux.so (e.g., v1.8.5), it is necessary to also add this directory to the environment variable LD_LIBRARY_PATH. I do this in my rc.local file, which starts up the postmaster, so it looks like this:

echo "starting PostgreSQL postmaster..."
export LD_LIBRARY_PATH=/opt/lib/pgsql/lib
su postgres -c "postmaster -D/opt/lib/pgsql/data \
-d 1 &> /opt/lib/pgsql/postmaster.log &"

In the contrib/linux directory of the source distribution, a more sophisticated startup script named postgres.init can be found, that integrates the postmaster startup into the RedHat runlevel system.

Before starting the postmaster, edit the file pg_hba.conf in the data directory. This file holds information, indicating which hosts and users are allowed to connect to the database back end. The minimal file necessary for local operation is:

# TYPE  DATABASE  IP_ADDRESS  MASK       USERAUTH  MAP
  host   all      127.0.0.1   255.255.255.255	trust
# The above allows any user on the local system to
# connect to any database under any user name.

As mentioned in the documentation, it is essential not to start the postmaster as root, but rather to start it in a special postgres user ID.

For clients to work, they must know where to find the databases. Therefore, you have to set the environment variables PGDATA and PGLIB for every user. I do it in /etc/profile. Additionally, every user who should have access to PostgreSQL needs to have appropriate access permissions in the system databases. This is done with the createuser program, run by the postgres account. createuser offers you the option of giving the user the right to create databases and/or the right to add other users. It is possible to say no to both questions, thus giving the user access only to the existing databases. If you are running PostgreSQL as a WWW database back end, you must set access permissions for the user ID of the web server.

Short Introduction to SQL

The best way to get started with PostgreSQL is to read the user manual. It is written for Postgres95 v1.0 and dates back to September 1995, but the information provided there is still useful. Other valuable sources of information are the tutorials, the man pages and the various files in the /doc directory. I will be using the examples from the tutorial in this article whenever possible.

To begin, create a database, i.e., a named container for several tables and accompanying data such as indices and views. To use the database named tutorial, type this command:

createdb tutorial

Now, you are automatically the owner of the database and have full access to it. Other users have access only if you grant them the appropriate rights.

Next, connect that database to a client program. We will use psql, which comes with PostgreSQL. If you prefer a graphical interface, there is also a Motif client available called mpsql (with pre-compiled binaries for Linux). mpsql has great editing capabilities but doesn't provide special local commands for listing existing tables and databases. It also lacks a help system that is available in psql. To use psql, type:

psql tutorial

This command provides you with a shell-like environment in which you can issue SQL commands. Due to the Readline support, you have command history and file name completion with the same key bindings as the bash shell. You can also enter local commands which are processed by the client first if you prefix them with a backslash. Enter \? to get a list of local commands. All other commands are sent directly to the back end. Commands can be typed on separate lines. They will be stored in a local buffer until you enter a line terminated with a ; (semicolon), then the buffer is sent to the back end. Help on SQL commands is available by typing \h.

SQL commands can be passed on the command line for shell scripting. The \i command reads a file from disk and executes its contents as SQL commands. Be sure to always use absolute pathnames with psql, because the back end knows nothing about the current working directory of the client.

Next, create the following two tables:

create table cities (
  name	text,
  population	float8,
  altitude	int--this is a comment
);

create table capitals (
  state	char2
) inherits (cities);

The text type is a string of characters of variable length. If you enter int, you get a four-byte integer value. PostgreSQL comes with 43 predefined data types including several types for time and date values, many types for geometrical objects such as point, circle and polygon and a boolean type. Arrays are also supported. All types are described in the pgbuiltin(l) manual page. If you need additional types, you can add your own. Note that identifier names have not been case sensitive since v6.1.

This example also illustrates a special feature of PostgreSQL: object inheritance. The second table inherits all the fields from the cities table and adds one more field. Later I'll show how to take advantage of that feature.

One often needed feature is missing from PostgreSQL, that is, the ability to define primary keys in the create clause. Primary keys are used to define a default sort order for the tuples and to ensure that the field with that key can't hold duplicate values. It is not supported due to the method used to store the records (or tuples). Every tuple in the database gets a unique object identifier (oid) value, which is unique not only in the table but also in the whole database. There is no way to guarantee a specific order in the table. As of version 6.0 of PostgreSQL it is possible to create a unique index, so that the same effect can be achieved with indices. To create a unique index for our example table, type:

create unique index on cities
using btree (name);

There are three methods available for index creation: btree, rtree and hash. The method can be specified after the keyword ``using''. Only the btree method allows multiple key indices with up to seven keys. Note that not all data types are supported by all index types. In particular, rtree indices are available only for geometric types. If no index type is specified, btree is used as the default. Indices increase the access speed to tables significantly and should be used whenever possible.

The maximum size of a tuple is 8192 bytes. In reality, it is somewhat smaller, because PostgreSQL needs some place for storing internal data. The amount of this space varies from platform to platform. If you need larger fields, use the large objects interface, which provides unlimited fields of transparent data, like MEMO or BLOB fields in other databases; however, you need special functions to access them.

To enter data in the tables, use the insert command:

INSERT INTO cities VALUES ('San Francisco',
	7.24E+5, 63);
INSERT INTO cities VALUES ('Las Vegas', 2.583E+5,
	2174);
INSERT INTO cities VALUES ('Mariposa', 1200,
	1953);

INSERT INTO capitals VALUES ('Sacramento', 
	3.694E+5, 30, 'CA');
INSERT INTO capitals VALUES ('Madison', 1.913E+5, 
	845, 'WI');

To get the data out of the tables, use the select command. It's a very powerful command, so I will demonstrate only some of its characteristics.

-- this will return all records in the table
select * from cities;
select * from capitals;

-- to get also the records of the 
-- inherited tables, use this syntax:
select * from cities*;

-- here are some variants to limit the returned
-- data:
select name, altitude
from cities
where altitude > 500;

To change some values in the table, use the update command, which has a similar syntax as the select command:

update cities
-- population grows by 10%
set population = population * 1.1 
where name = 'Mariposa';

When you update your data regularly, you will notice that the tables grow continuously, even if you haven't added new tuples. This is not a bug--it's another special feature called time travel. PostgreSQL keeps a history of all data changes in the table. To access this data, you have to use a special qualifier:

select name, population
from cities['epoch', 'now']
where name = 'Mariposa';

This example will list all the values of the two fields, name and population of Mariposa, from the creation of the database up to the present. If you don't wish to retain the history data, you can delete it using the vacuum command. In the future version 7.0, the time-travel feature will vanish, but at this time you need to vacuum your databases regularly. The vacuum command also has the additional purpose of updating the internal data in order to make faster querys possible. Therefore, it is a good idea to define a cron job that runs vacuum every night.

Time and Date Functions

Time and date values are handled in a very flexible manner. I will demonstrate this on a database that I use for tracking my telephone costs for connecting to my Internet provider. The structure of my database and a sample query look like this:

create table telephone (
 datum    abstime,
 online_secs int4,
 units    int2,
 costs    float8
);

select * from pppcosts
where datum >= 'yesterday'::abstime;


datum	         	|online_secs|units|cost
Tue Jul 08 00:16:22 1997 MET DST|  486|  3|0.36
Tue Jul 08 20:18:52 1997 MET DST| 1476| 10| 1.2
Wed Jul 09 01:06:33 1997 MET DST| 3317| 14|1.68

This query returns my on-line events from yesterday at 00:00:00 until now. The term 'yesterday'::abstime is an explicit type conversion that makes sure the parser gets the right type for literal values. In this case it is not necessary, but sometimes the input type might be guessed wrong and an error returned. Explicit notation avoids such errors.

It is also possible to get aggregates of data. From my telephone costs table, I can get the costs for the last 30 days by giving the following query:

select sum(online_secs) as seconds,
    sum(units) as units,
    sum(costs) as costs
 where datum > ('now'::abstime - 
	'30 days'::reltime);

The keyword as sets the column titles to the specified values; otherwise, they would all read sum.

Also supported is the SQL date type which accepts dates in the form mm/dd/yyyy. There is even a mechanism to change the date format for foreign languages (currently only European and US formats are supported). Use the set command as follows:

set datestyle to 'european';
set datestyle to 'us';

The date format is changed to dd/mm/yyyy and then back to mm/dd/yyyy. In v6.1 a comma had to follow the keywords ``european'' and ``us''. In later releases this bug has been fixed.

Sequences

Another nifty feature is the sequence command. A frequent demand to a database is that some column contains a sequence of numbers which is automatically increased when new records are added. To do this, create sequences by giving the following statements:

create sequence id start 1000 increment 10;
insert into table values (nextval(id),...);

This example will create a sequence, with the name ID, which starts at 1000 and increases by 10 every time the function nextval() returns a unique number.

User-Defined Functions

The next example shows how to implement a user-defined function. This technique can be used to overcome a limitation of PostgreSQL--the database system is not able to process subqueries. Querys that contain another query, usually in a where clause, are called subquerys. PostgreSQL supports them only indirectly, i.e., when the subquery is hidden in a function.

-- I will use the cities table again
create function big_city() returns set of cities
 as 'select * from cities* where population > 700000'
 language 'sql';

select name(big_city()) as highpop;

In the example, the function returns a tuple from the cities table. It is also possible to return only single values by writing e.g., returns int4. The language command in the example tells the parser that this is a query-language function. With programming-language functions (e.g., language 'c'), you can program your own functions in C and load them as dynamic modules directly into the back end. This mechanism is also used to create user-defined types. (Examples can be found in the tutorial directory of the source tree.) A function is permanently stored in the same way as a table. So, when it is no longer needed, it must be explicitly deleted by giving the drop function command.

I have presented only some highlights of the SQL language here, with an emphasis on special features of PostgreSQL. I also omitted examples for joining data from different tables, for creating views and for enclosing your operation in transaction blocks to ensure that either all actions are performed or none at all. I also omitted mentioning cursors, which allow you to limit the amount of tuples returned from a query. These and other features can be learned best from a book about the SQL query language. I have used the book A Visual Introduction to SQL from J.H. Trimble, D. Chappel and others, published by Wiley in 1989 for my learning. It is aimed at novices and easy to understand.

Programming Interfaces

One advantage of the long history of PostgreSQL is the availability of many programming interfaces. An Interface for C, C++, Perl5, Tcl and Java JDBC is shipped with the release. Also available are packages for Python and Objective-C (the latter from the GNUStep project). With these libraries you can develop your own client applications.

The C-library is the basic interface to PostgreSQL, because it is used by most of the other libraries. The best example for using it is the psql client. If you are writing clients in C, this program will teach you much of the internal working method of the C-API.

I will now discuss a small program that writes all tuples from the cities table to the screen. (Source code can be obtained from ftp://ftp.ssc.com/pub/lj/listings/issue46/2245.tgz.)

The C-Interface is located in the libpq library. If you have installed PostgreSQL properly, the linker should find the library. Link your program with the -lpq options set. In the source code, you have to include the header files with the line:

#include <libpq-fe.h>

Make sure the compiler finds the header files; they are usually located in the $POSTGRESDIR/include directory.

The first step in the dialog with the back end is to establish a connection to the database. This is done with the command:

char* dbname;
strcpy(dbname, "tutorial");
conn = PQsetdb(host, port, options, tty, dbname);

All parameters are of type char*. If one is NULL, a default value is used. Usually it is only necessary to specify the database name and a hostname if it's not the local host. The function returns a pointer, which must be used for further access to this connection.

To test the success of a PQsetdb operation the following code can be used:

if (PQstatus(conn) == CONNECTION_BAD) {
 printf(stderr,
"Connection to database '%s' failed.\n", dbname);
 fprintf(stderr, "%s", PQerrorMessage(conn));
 PQfinish(conn);
 exit(1);
}

PQerrorMessage returns a detailed error message. With PQfinish the connection is terminated and all internal buffers are freed.

After a successful connection, a query can be sent to the database in this way:

result = PQexec(conn, "select * from cities");
if ((!result) || (PGRES_TUPLES_OK != 
               PQresultStatus(result))) {
  fprintf(stderr, 
"Error sending query.\nDetailed report: %s\n",
PQerrorMessage(conn));
  PQfinish(conn);
  exit(1);
}

Again, the result of the operation has to be checked, now with the function PQresultStatus which returns different codes dependent on the database operation that is to be performed. In our case, we expect to get some tuples back so we test against this condition. Another possible value is PGRES_COMMAND_OK, which is returned when a query such as INSERT was sent that returns no data. The returning pointer points to a structure within the record data. The structure could be very large. Don't try to access this structure directly; use the functions of the libpq instead, because the internal working methods of the library is subject to change. As a demonstration, here is code that prints out the contents of the cities database:

printf("name\t\tpopulation\taltitude\n\n");
for (i = PQntuples(result)-1; i >= 0; i--) {
 printf("%s\t%s\t\t%s\n", PQgetvalue(result,i,0),
 PQgetvalue(result,i,1), PQgetvalue(result,i,2));
}

The function PQgetvalue returns the data as null-terminated ASCII strings, regardless of the field type. To determine the field type, the function PQftype could be used. This function is a weakness of the library, because it returns only the internal coding of the type which is difficult to handle. In order to interpret it correctly, one has to know that all data types are stored as separate tuples in the system database of PostgreSQL. The return value of PQftype is the OID, the unique identifier, of that tuple. To identify the type, you have to query the system database:

-- get the type with internal number 16 (bool)
select oid, typname from pg_type
where oid = '16'::oid;

Other Programming Interfaces

If you prefer a language other than C, there is a good chance that it is supported by PostgreSQL. The programming techniques are very similar to the C-library; in fact, some object-oriented interfaces make the programming simpler.

One such interface is the libpq++ library, written in C++ and included in the distribution. It has undergone a major revision in the v6.1 release. The class PgEnv is provided for manipulating the environment such as setting the port number for connections and the pathnames for the libraries and databases. The main class is PgDatabase, used to establish a connection and execute querys. It provides very convenient methods to make internal checks for successful operation. There are also classes for accessing large objects (PgLargeObject), for using transaction blocks (PgTransaction) and for defining cursors (PgCursor). The libpq++ is not built by default, you have to compile it with the provided Makefile.

The Perl5 interface Pgsql_perl5 also must be compiled with the Makefile. It provides two different interfaces. One is nearly identical to libpq so that porting C applications to Perl is easy. The other interface uses the object-oriented features of Perl5. Listing 1 demonstrates the use of that interface. Its purpose is to give a listing of all users who are owners of databases. To run this script, you need to have the proper access rights in the system database. As you can see, with the object-oriented interface, it is not necessary to close the connection explicitly. The module handles this automatically when the connection object is destroyed.

PostgreSQL and the WWW

Together with the Apache web server and some separately available Apache modules, PostgreSQL can be used to serve databases to the Web. One of these modules is mod_auth_pg95, which allows user authentication with PostgreSQL databases. To make it work, you have to install the web server with its own user and group ID rather than with the default ``nouser'' and ``nogroup''. This can be done with the configuration statements User and Group in the file httpd.conf. It is then necessary to tell PostgreSQL of this account, so that it accepts connections from the web server with createuser. In the Apache configuration file access.conf, you must tell the web server where to find the authentication data. The following is a sample configuration:

<Directory /DocumentRoot/MySecrets>
Auth_PGhost localhost
Auth_PGport 5432
Auth_PGdatabase www
Auth_PGpwd_table apache_user
Auth_PGuid_field user
Auth_PGpwd_field password
AuthType Basic
AuthName My Secrets
require valid-user
</Directory>

Now, create a database named ``www'' with a table ``apache_user'' that contains the fields user and password. I suggest you do not create the table under the web server account for security reasons. If you create the table under another account and grant access with the command:

grant select on apache_user to apache;

then the server can't be used to insert new users or to delete the whole table. To populate the table, you can use the Perl script pg95passwd.pl, which ships with mod_auth_pg95. It also encrypts the passwords for you.

The second Apache module I want to introduce is PHP/FI. It is also available as a stand-alone CGI-Script, but it is more secure to compile it as a module. PHP/FI allows you to embed scripts in your web pages in a similar way to Microsoft's Active Server Pages. It can be configured to allow script access to PostgreSQL. PHP/FI uses its own script language which is easy to learn. The PostgreSQL part is based on libpq. If you know libpq, you should have no problems using PHP/FI for PostgreSQL access. I can't present all the features of PHP/FI, but I will provide an example script in Listing 2, which reads the contents of the cities table and puts it into an HTML table environment.

PHP/FI provides automatic conversion for some data types. Unlike the C interface, table values are returned in their proper format for the types integer, boolean, oid, float and real. Arrays are always returned as strings.

When you compile PHP/FI for use with PostgreSQL, defining the compiler switch MAGIC_QUOTES in php.h makes life easier. This option causes all single and double quote characters in GET and POST data from HTML forms to be automatically escaped.

The Further Development of PostgreSQL

PostgreSQL is now developed by a couple of volunteers, who coordinate their efforts via the Internet. A mailing list is used for discussion of implementation details. At the FTP site (ftp://ftp.postgresql.org/), a tar archive file with the latest sources is provided every night. The programmers use the FreeBSD utility sup to synchronize their source trees. There is also a documentation project where the man pages and the user guide are

It is planned to achieve full SQL92 compliance in the future and to constantly speed up the database operation. Version 6.2.1, which was released in October 1997 brings PostgreSQL a big step nearer to this goal. Now default values and constraints can be specified at table creation, to check newly inserted data for specific conditions. It is also possible to write trigger functions, that are executed whenever a row of table data is selected, inserted or updated. There are also new string functions like trim() substring() and position() which makes manipulating of strings very convenient. A new Server Programming Interface gives the users the aility to write server stored procedures and to implement integrity checks with triggers.

Compared to the widely used mSQL, PostgreSQL is much slower as long as only simple queries are involved. But for queries that include complex joins of multiple tables, PostgreSQL should be faster. mSQL is written as a program that gives programmers or webmasters a small and fast tool for simple database operations. In contrast to this, PostgreSQL is a full featured database system. It provides more data types, better extensibility and, with the two available query tools psql and mpsql, more user friendliness.

There are commercial database systems which are available for free as personal versions, e.g., Solid and Yard-SQL. These systems are fully SQL92 compliant and, in comparison to them, PostgreSQL's functionality is limited. The features that make PostgreSQL stand out among these competitors are that it is freely availabe and anyone who is interested can participate in its further development.

Resources

Rolf Herzog was fascinated by computers since his first steps with the early TRS-80. He came to Linux with kernel version 1.2.3, when searching for an alternative to Windows. Now he is working as a computer consultant for the Steinheim Institute in Duisburg, Germany. He is pursuing a degree in Political Science as the theory of social systems stimulates him as much as computer systems. He can be reached via e-mail at rolf@culthea.gun.de.