pgsql-tablespaces_on_the_way_for_7.5_also_wal_point_in_time_recovery

This is part of The Pile, a partial archive of some open source mailing lists and newsgroups.

Date: Thu, 26 Feb 2004 21:07:49 +1100 (EST)
From: Gavin Sherry <swm@linuxworld.com.au>
To: pgsql-hackers@postgresql.org
Subject: [HACKERS] Tablespaces

Hi all,

I've been looking at implementing table spaces for 7.5. Some notes and
implementation details follow.

 ------

Type of table space:

There are many different table space implementations in relational
database management systems. In my implementation, a table space in
PostgreSQL will be the location of a directory on the file system in
which files backing database objects can be stored. Global tables and
non 'data' objects (WAL, CLOG, config files) will all remain in $PGDATA.
$PGDATA/base will be the default table space.

A given table space will be identified by a unique table space name. I
haven't decided if 'unique' should mean database-wide unique or
cross-database unique. It seems to me that we might run into problems
with CREATE DATABASE ... TEMPLATE = <database with table spaces> if the
uniqueness of table spaces is limited to the database level.

A table space parameter will be added to DDL commands which create
physical database objects (CREATE DATABASE/INDEX/TABLE/SEQUENCE) and to
CREATE SCHEMA. The associated routines, as well as the corresponding DROP
commands will need to be updated. Adding the ability to ALTER <object>
TABLESPACE <name> seems a little painful. Would people use it? Comments?

When an object is created the system will resolve the table space the
object is stored in as follows: if the table space paramater is passed to
the DDL command, then the object is stored in that table space (given
validation of the table space, etc). If it is not passed, the object
inherits its "parent's" table space where the parent/child hierarchy is as
follows: database > schema > table > [index|sequence]. So, if you issued:

	create table foo.bar (...);

We would first not that there is no TABLESPACE <name>, then cascade to
the table space for the schema 'foo' (and possibly cascade to the table
space for the database). A database which wasn't created with an explicit
table space will be created under the default table space. This ensures
backward compatibility.


Creating a table space:

A table space is a directory structure. The directory structure is as
follows:

[swm@dev /path/to/tblspc]$ ls
OID1/	OID2/

OID1 and OID2 are the OIDs of databases which have created a table space
against this file system location. In this respect, a table space
resembles $PGDATA/base. I thought it useful to keep this kind of
namespace mechanism in place so that administrators do not need to create
hierarchies of names on different partitions if they want multiple
databases to use the same partition.

The actual creation of the table space will be done with:

	CREATE TABLE SPACE <name> LOCATION </path/to/tblspc>;

Before creating the table space we must:

1) Check if the directory exists. If it does, create a sub directory as
the OID of the current database.

2) Alternatively, if the directory doesn't exist, attempt to create it,
then the sub directory.

I wonder if a file, such as PG_TBLSPC, should be added to the table space
directory so that, in the case of an existing non-empty directory, we can
attempt to test if the directory is being used for something else and
error out. Seems like:

CREATE TABLESPACE tbl1 LOCATION '/var/'

which will result in something like '/var/123443' is a bad idea. Then
again, the user should know better. Comments?

If everything goes well, we add an entry to pg_tablespace with the table
space location and name (and and OID).


Tying it all together:

The catalogs pg_database, pg_namespace, and pg_class will have a 'tblspc'
field. This will be the OID of the table space the object resides in, or 0
(default table space). Since we can then resolve relid/relname, schema and
database to a tablespace, there aren't too many cases when extra logic
needs to be added to the IO framework. In fact, most of it is taken care
of because of the abstraction of relpath().

The creation of table spaces will need to be recorded in xlog in the same
way that files are in heap_create() with the corresponding delete logic
incase of ABORT.


Postmaster startup:

Ideally, the postmaster at startup should go into each tblspc/databaseoid
directory and check for a postmaster.pid file to see if some other
instance is touching the files we're interested in. This will require a
control file listing tblspc/databaseoid paths and it will need to plug
into WAL in case we die during CREATE TABLESPACE. Comments?


Creating a database

I think that createdb() is going to have to be reworked if pg_tablespace
isn't shared (ie, tablespaces are only database unique). The reason being
that if we create a database which has a table space, pg_tablespace in the
new database will have to be updated and that cannot be done atomically
with the `cp` based mechanism we currently use.

I think I'm going to have to get my hands dirty before I can tell the
extent to which createdb() will need reworking.


pg_dump

Obviously pg_dump will need to be able to dump table spaces. pg_dump
running against <7.5 will DDL commands without a table space parameter and
as such the database's physical layout, when loaded into 7.5, will be the
same as for <7.5.

 ---

Comments? Questions? Suggestions?

===
From: Richard Huxton <dev@archonet.com>
To: Gavin Sherry <swm@linuxworld.com.au>, pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Tablespaces
Date: Thu, 26 Feb 2004 11:25:23 +0000

On Thursday 26 February 2004 10:07, Gavin Sherry wrote:
>
> CREATE TABLESPACE tbl1 LOCATION '/var/'
>
> which will result in something like '/var/123443' is a bad idea. Then
> again, the user should know better. Comments?

The LOCATION should have the same owner and permissions as $PGDATA - that 
should catch mistyping.

Unless you're running as root, of course. In which case you clearly know 
better than everyone else, so off you go!

> Comments? Questions? Suggestions?

Presumably I'm using this to deal with performance/space issues, so there 
clearly needs to be something in the pg_stat_xxx system to show figures based 
on tablespace - not sure what you'd measure though - disk I/O, number of 
nodes?

Perhaps something in contrib/ too - tablespace_disk_usage.pl or some such.

===
Date: Thu, 26 Feb 2004 12:25:24 +0100
From: =?ISO-8859-1?Q?Hans-J=FCrgen_Sch=F6nig?= <postgres@cybertec.at>
To: Gavin Sherry <swm@linuxworld.com.au>
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Tablespaces

Gavin Sherry wrote:
> Hi all,
> 
> I've been looking at implementing table spaces for 7.5. Some notes and
> implementation details follow.
> 
> ------
> 
> Type of table space:
> 
> There are many different table space implementations in relational
> database management systems. In my implementation, a table space in
> PostgreSQL will be the location of a directory on the file system in
> which files backing database objects can be stored. Global tables and
> non 'data' objects (WAL, CLOG, config files) will all remain in $PGDATA.
> $PGDATA/base will be the default table space.

Is it possible to put WALs and CLOGs into different tablespaces? (maybe 
different RAID systems). Some companies want that ...


> A given table space will be identified by a unique table space name. I
> haven't decided if 'unique' should mean database-wide unique or
> cross-database unique. It seems to me that we might run into problems
> with CREATE DATABASE ... TEMPLATE = <database with table spaces> if the
> uniqueness of table spaces is limited to the database level.


I strongly vote for database cluster wide unique names because somebody 
could have a tablespace "webusers" or something like that. To me this 
makes far more sense.


> A table space parameter will be added to DDL commands which create
> physical database objects (CREATE DATABASE/INDEX/TABLE/SEQUENCE) and to
> CREATE SCHEMA. The associated routines, as well as the corresponding DROP
> commands will need to be updated. Adding the ability to ALTER <object>
> TABLESPACE <name> seems a little painful. Would people use it? Comments?

I think people won't need it in first place because this seems to be 
really painful.
What really matters is that the number of tablespaces and file / 
tablespace is unlimited. SAP DB has limited the number of devspaces to 
32 (I think). This is real bull.... because if your database grows 
unexpectedly you are in deep trouble (expert database design by SAP, 
MySQL and 100000....0000 others).


> When an object is created the system will resolve the table space the
> object is stored in as follows: if the table space paramater is passed to
> the DDL command, then the object is stored in that table space (given
> validation of the table space, etc). If it is not passed, the object
> inherits its "parent's" table space where the parent/child hierarchy is as
> follows: database > schema > table > [index|sequence]. So, if you issued:
> 
> 	create table foo.bar (...);
> 
> We would first not that there is no TABLESPACE <name>, then cascade to
> the table space for the schema 'foo' (and possibly cascade to the table
> space for the database). A database which wasn't created with an explicit
> table space will be created under the default table space. This ensures
> backward compatibility.

Will users automatically be assigned to a certain table space? How is 
this going to work?

> Creating a table space:
> 
> A table space is a directory structure. The directory structure is as
> follows:
> 
> [swm@dev /path/to/tblspc]$ ls
> OID1/	OID2/
> 
> OID1 and OID2 are the OIDs of databases which have created a table space
> against this file system location. In this respect, a table space
> resembles $PGDATA/base. I thought it useful to keep this kind of
> namespace mechanism in place so that administrators do not need to create
> hierarchies of names on different partitions if they want multiple
> databases to use the same partition.
> 
> The actual creation of the table space will be done with:
> 
> 	CREATE TABLE SPACE <name> LOCATION </path/to/tblspc>;
> 
> Before creating the table space we must:
> 
> 1) Check if the directory exists. If it does, create a sub directory as
> the OID of the current database.
> 
> 2) Alternatively, if the directory doesn't exist, attempt to create it,
> then the sub directory.
> 
> I wonder if a file, such as PG_TBLSPC, should be added to the table space
> directory so that, in the case of an existing non-empty directory, we can
> attempt to test if the directory is being used for something else and
> error out. Seems like:
> 
> CREATE TABLESPACE tbl1 LOCATION '/var/'
> 
> which will result in something like '/var/123443' is a bad idea. Then
> again, the user should know better. Comments?
> 
> If everything goes well, we add an entry to pg_tablespace with the table
> space location and name (and and OID).
> 
> 
> Tying it all together:
> 
> The catalogs pg_database, pg_namespace, and pg_class will have a 'tblspc'
> field. This will be the OID of the table space the object resides in, or 0
> (default table space). Since we can then resolve relid/relname, schema and
> database to a tablespace, there aren't too many cases when extra logic
> needs to be added to the IO framework. In fact, most of it is taken care
> of because of the abstraction of relpath().
> 
> The creation of table spaces will need to be recorded in xlog in the same
> way that files are in heap_create() with the corresponding delete logic
> incase of ABORT.
> 
> 
> Postmaster startup:
> 
> Ideally, the postmaster at startup should go into each tblspc/databaseoid
> directory and check for a postmaster.pid file to see if some other
> instance is touching the files we're interested in. This will require a
> control file listing tblspc/databaseoid paths and it will need to plug
> into WAL in case we die during CREATE TABLESPACE. Comments?
> 
> 
> Creating a database
> 
> I think that createdb() is going to have to be reworked if pg_tablespace
> isn't shared (ie, tablespaces are only database unique). The reason being
> that if we create a database which has a table space, pg_tablespace in the
> new database will have to be updated and that cannot be done atomically
> with the `cp` based mechanism we currently use.
> 
> I think I'm going to have to get my hands dirty before I can tell the
> extent to which createdb() will need reworking.
> 
> 
> pg_dump
> 
> Obviously pg_dump will need to be able to dump table spaces. pg_dump
> running against <7.5 will DDL commands without a table space parameter and
> as such the database's physical layout, when loaded into 7.5, will be the
> same as for <7.5.
> 
> ---
> 
> Comments? Questions? Suggestions?
> 
> Thanks,
> 
> Gavin


Do you plan support for limiting the size of a tablespace? ISPs will 
vote for that because they can limit the size of a database on the 
database level rather than on the operating system level.  Of course 
this can and (should???) be done on the operation system level but 
people will definitely ask for that.
If sizing is not supported we should definitely provide minor 
documentation which tells people how to do that on the operating system 
level (at least poting to some useful information).

	Best regards,

		Hans


===
Date: Thu, 26 Feb 2004 22:58:25 +1100 (EST)
From: Gavin Sherry <swm@linuxworld.com.au>
To: =?X-UNKNOWN?Q?Hans-J=FCrgen=5FSch=F6nig?= <postgres@cybertec.at>
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Tablespaces

> Is it possible to put WALs and CLOGs into different tablespaces? (maybe
> different RAID systems). Some companies want that ...

I wasn't going to look at that just yet.

There is of course the temporary hack of symlinking WAL else where.

I'd be interested to see the performance difference between WAL and data
on the same RAID/controller and WAL and data on different RAID/controller
with Jan's improvements to the buffer management.

===
Date: Thu, 26 Feb 2004 23:22:28 +1100 (EST)
From: Gavin Sherry <swm@linuxworld.com.au>
To: Dennis Bjorklund <db@zigo.dhs.org>
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Tablespaces

On Thu, 26 Feb 2004, Dennis Bjorklund wrote:

> On Thu, 26 Feb 2004, Gavin Sherry wrote:
>
> > Comments? Questions? Suggestions?
>
> Is that plan that in the future one can split a single table into
> different table spaces? Like storing all rows with year < 1999 in one
> tablespace and the rest in another?

These are called partitions in oracle. You can approximate this with table
spaces by using a partial index and putting it in a different table space.
The problem, of course, is seq scans.


>
> With the rule system and two underlying tables one could make it work by
> hand I think.
>
> I've never used tablespaces in oracle so I don't know what it can offer. I

Certainly, table spaces are used in many ways in oracle, db2, etc. You can
mirror data across them, have different buffer sizes for example.
In some implementations, they can be raw disk partitions (no file system).
I don't intend going this far, however.

> If you don't want to discuss this now, I understand. It's not part of the
> design as it is now. I'm just curious at what direction we are moving and
> what is possible to do.

Well, partitions are something else entirely. Mirroring would be
interesting, but RAID designers are better at parallelisation of IO than
(some) database developers. Might be better to keep the problem seperate.

===
Date: Thu, 26 Feb 2004 04:25:28 -0800 (PST)
From: "Joshua D. Drake" <jd@commandprompt.com>
To: =?ISO-8859-1?Q?Hans-J=FCrgen_Sch=F6nig?= <postgres@cybertec.at>
Cc: Gavin Sherry <swm@linuxworld.com.au>, <pgsql-hackers@postgresql.org>
Subject: Re: [HACKERS] Tablespaces

> Is it possible to put WALs and CLOGs into different tablespaces? (maybe 
> different RAID systems). Some companies want that ...

You can do this now, but it would be nice to be able to have it actually 
configurable versus the hacked linked method.

===

Date: Thu, 26 Feb 2004 13:33:54 +0100
From: =?ISO-8859-1?Q?Hans-J=FCrgen_Sch=F6nig?= <postgres@cybertec.at>
To: Gavin Sherry <swm@linuxworld.com.au>
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Tablespaces

Gavin Sherry wrote:
>>Is it possible to put WALs and CLOGs into different tablespaces? (maybe
>>different RAID systems). Some companies want that ...
> 
> 
> I wasn't going to look at that just yet.
> 
> There is of course the temporary hack of symlinking WAL else where.

that's what we do now.
we symlink databases and wals ...


> I'd be interested to see the performance difference between WAL and data
> on the same RAID/controller and WAL and data on different RAID/controller
> with Jan's improvements to the buffer management.

yes, that's what i am looking for. i should do some testing.

in case of enough i/o power additional cpus scale almost
linearily (depending on the application of course; i have
done some testing on a customer's aix box ...).  it would be
interesting to see what jan's buffer strategy does (and bg
writer) ...

===
Date: Thu, 26 Feb 2004 11:14:37 -0500
From: "Alex J. Avriette" <alex@posixnap.net>
To: Gavin Sherry <swm@linuxworld.com.au>
Cc: Dennis Bjorklund <db@zigo.dhs.org>, pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Tablespaces

On Thu, Feb 26, 2004 at 11:22:28PM +1100, Gavin Sherry wrote:

> Certainly, table spaces are used in many ways in oracle, db2, etc. You can
> mirror data across them, have different buffer sizes for example.
> In some implementations, they can be raw disk partitions (no file system).
> I don't intend going this far, however.

Perhaps now would be a good time to bring up my directio on Solaris question
from a year or so back? Is there any interest in the ability to use raw
disk?

Alex (who is overjoyed to hear discussion of tablespaces again)

===
To: Gavin Sherry <swm@linuxworld.com.au>
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Tablespaces 
Date: Thu, 26 Feb 2004 11:46:13 -0500
From: Tom Lane <tgl@sss.pgh.pa.us>

Gavin Sherry <swm@linuxworld.com.au> writes:
> A table space is a directory structure. The directory structure is as
> follows:
> [swm@dev /path/to/tblspc]$ ls
> OID1/	OID2/
> OID1 and OID2 are the OIDs of databases which have created a table space
> against this file system location. In this respect, a table space
> resembles $PGDATA/base. I thought it useful to keep this kind of
> namespace mechanism in place ...

Actually, this is *necessary* AFAICT.  The case that forces it is DROP
DATABASE.  Since you have to execute that from another database, there's
no reasonable way to look into the target database's catalogs.  That
means that the OID of the database has to be sufficient information to
get rid of all its files.  You can do this fairly easily if in each
tablespace (whose locations you know from the shared pg_tablespace
table) you can look for a subdirectory matching the target database's
OID.  If we tried to put the database's files just "loose" in each
tablespace directory then we'd be in trouble.

I think this is also an implementation reason for favoring cluster-wide
tablespaces over database-local ones.  I'm not sure how you drop a
database from outside if you can't see where its tablespaces are.

I believe that it will be necessary to expand RelFileNode to three OIDs
(tablespace, database, relation).  I had once hoped that it could be
kept at two (tablespace, relation) but with a physical layout like this
you more or less have to have three.

One issue that needs to be agreed on early is how the low-level file
access code finds a tablespace.  What I would personally like is for
$PGDATA to contain symlinks to the tablespace top directories.  The
actual access path for any relation could then be built trivially from
its RelFileNode:
	$PGDATA/tablespaces/TBOID/DBOID/RELFILENODE
        -------------------------
The underlined part references a symlink that leads to the directory
containing the per-database subdirectories.

I am expecting to hear some bleating about this from people whose
preferred platforms don't support symlinks ;-).  However, if we don't
rely on symlinks for this then the tablespace-OID-to-physical-path
mapping has to be explicitly known at very low levels of the system
(md.c in particular).  We can't expect md.c to get that information by
reading pg_tablespace.  It would have to rely on some backdoor path,
such as a flat text file it could read at backend startup.  I think
this approach will leave us fighting a lot of problems with locking
and out-of-date information.

Speaking of locking, can we do anything to prevent people from shooting
themselves in the foot by changing active tablespaces?  Are we even
going to have a DROP TABLESPACE command, and if so what would it do?

===
Date: Thu, 26 Feb 2004 08:53:24 -0800
From: Barry Lind <blind@xythos.com>
To: Gavin Sherry <swm@linuxworld.com.au>
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Tablespaces

Gavin,

After creating a tablespace what (if any) changes can be done to it. 
Can you DROP a tablespace, or once created will it always exist?  Can 
you RENAME a tablespace?  Can you change the location of a tablespace 
(i.e you did a disk reorg and move the contents to a different location 
and now want to point to the new location)?  What are the permissions 
necessary to create a tablespace (can any use connected to the database 
create a tablespace, or only superuser, or ...)?

Overall this will be a great addition to postgres.  I am looking forward 
to this feature.

===

Date: Fri, 27 Feb 2004 08:22:25 +1100 (EST)
From: Gavin Sherry <swm@linuxworld.com.au>
To: "Alex J. Avriette" <alex@posixnap.net>
Cc: Dennis Bjorklund <db@zigo.dhs.org>, pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Tablespaces

On Thu, 26 Feb 2004, Alex J. Avriette wrote:

> On Thu, Feb 26, 2004 at 11:22:28PM +1100, Gavin Sherry wrote:
>
> > Certainly, table spaces are used in many ways in oracle, db2, etc. You can
> > mirror data across them, have different buffer sizes for example.
> > In some implementations, they can be raw disk partitions (no file system).
> > I don't intend going this far, however.
>
> Perhaps now would be a good time to bring up my directio on Solaris question
> from a year or so back? Is there any interest in the ability to use raw
> disk?

I do not intend to undertake raw disk tablespaces for 7.5. I'd be
interested if anyone could provide some real world benchmarking of file
system vs. raw disk. Postgres benefits a lot from kernel file system cache
at the moment. Also, I believe that database designers have traditionally
made bad file system designers. Raw database partitions often lack the
tools essential to a scalable environment. For example, the ability to
resize partitions.

===

Date: Fri, 27 Feb 2004 08:30:28 +1100 (EST)
From: Gavin Sherry <swm@linuxworld.com.au>
To: Tom Lane <tgl@sss.pgh.pa.us>
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Tablespaces 

On Thu, 26 Feb 2004, Tom Lane wrote:

> Gavin Sherry <swm@linuxworld.com.au> writes:
> > A table space is a directory structure. The directory structure is as
> > follows:
> > [swm@dev /path/to/tblspc]$ ls
> > OID1/	OID2/
> > OID1 and OID2 are the OIDs of databases which have created a table space
> > against this file system location. In this respect, a table space
> > resembles $PGDATA/base. I thought it useful to keep this kind of
> > namespace mechanism in place ...
>
> Actually, this is *necessary* AFAICT.  The case that forces it is DROP
> DATABASE.  Since you have to execute that from another database, there's
> no reasonable way to look into the target database's catalogs.  That
> means that the OID of the database has to be sufficient information to
> get rid of all its files.  You can do this fairly easily if in each
> tablespace (whose locations you know from the shared pg_tablespace
> table) you can look for a subdirectory matching the target database's
> OID.  If we tried to put the database's files just "loose" in each
> tablespace directory then we'd be in trouble.

Ahhh. Yes.

>
> I think this is also an implementation reason for favoring cluster-wide
> tablespaces over database-local ones.  I'm not sure how you drop a
> database from outside if you can't see where its tablespaces are.

Naturally.

>
> I believe that it will be necessary to expand RelFileNode to three OIDs
> (tablespace, database, relation).  I had once hoped that it could be
> kept at two (tablespace, relation) but with a physical layout like this
> you more or less have to have three.

Yes. I agree.

>
> One issue that needs to be agreed on early is how the low-level file
> access code finds a tablespace.  What I would personally like is for
> $PGDATA to contain symlinks to the tablespace top directories.  The
> actual access path for any relation could then be built trivially from
> its RelFileNode:
> 	$PGDATA/tablespaces/TBOID/DBOID/RELFILENODE
>         -------------------------
> The underlined part references a symlink that leads to the directory
> containing the per-database subdirectories.
>
> I am expecting to hear some bleating about this from people whose
> preferred platforms don't support symlinks ;-).  However, if we don't

Actually, I think that's a pretty good idea :-). I'd solves a bunch of
issues in the backend (postmaster start up can recurse through
$PGDATA/tablespaces looking for postmaster.pid files) and will also assist
admins with complex configurations (perhaps).

> Speaking of locking, can we do anything to prevent people from shooting
> themselves in the foot by changing active tablespaces?  Are we even
> going to have a DROP TABLESPACE command, and if so what would it do?

Ahh. I forgot to detail my ideas on this. It seems to me that we cannot
drop a table space until the directory is empty. We will need a shared
invalidation message so that backends do not attempt to create an object
just after we drop the table space.

===
Subject: Re: [HACKERS] Tablespaces
From: James Rogers <jrogers@neopolitan.com>
To: pgsql-hackers@postgresql.org
Date: 26 Feb 2004 13:41:34 -0800

On Thu, 2004-02-26 at 13:22, Gavin Sherry wrote:
> Postgres benefits a lot from kernel file system cache
> at the moment.


With the implementation of much smarter and more adaptive cache
replacement algorithm i.e. ARC, I would expect the benefit of using the
kernel file system cache to diminish significantly.  It appears to me,
and I could be wrong, that the reason Postgres has depended on the
kernel file system cache isn't that this is obviously better in some
absolute sense (though it might be depending on the deployment
scenario), but that the original cache replacement algorithm in Postgres
was sufficiently poor that the better cache replacement algorithms in
the kernel cache more than offset any sub-optimality that might result
from doing so.

I would expect that with ARC and the redesign of some of the buffer
management bits for more scalability, you might very well get better
performance by allocating most of the memory to the buffer cache rather
than leaving it to the kernel file cache.

I'm actually fairly curious to see what the new buffer management scheme
will mean in terms of real world performance and parameter tuning.

===
From: "Simon Riggs" <simon@2ndquadrant.com>
To: "'Gavin Sherry'" <swm@linuxworld.com.au>, <pgsql-hackers@postgresql.org>
Subject: Re: [HACKERS] Tablespaces
Date: Thu, 26 Feb 2004 21:56:49 -0000

>Gavin Sherry
> The creation of table spaces will need to be recorded in xlog in the
same
> way that files are in heap_create() with the corresponding delete
logic
> incase of ABORT.

Overall, sounds very cool.

Please could we record the OID of the tablespace in the WAL logs, not
the path to the tablespace? That way, we run no risks of having the WAL
logs not work correctly should things change slightly...

.. need to record drop tablespaces in the WAL logs also. 

I'm sure you meant both of those, just checking.

Can drop tablespace require a specific privelege too? It's too easy to
drop parts of a database without thinking...
 
===

To: Gavin Sherry <swm@linuxworld.com.au>
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Tablespaces 
Date: Thu, 26 Feb 2004 17:05:13 -0500
From: Tom Lane <tgl@sss.pgh.pa.us>

Gavin Sherry <swm@linuxworld.com.au> writes:
>> Speaking of locking, can we do anything to prevent people from shooting
>> themselves in the foot by changing active tablespaces?  Are we even
>> going to have a DROP TABLESPACE command, and if so what would it do?

> Ahh. I forgot to detail my ideas on this. It seems to me that we cannot
> drop a table space until the directory is empty.

How would it get to be empty?  Are you thinking of some sort of "connect
database to tablespace" and "disconnect database from tablespace"
commands that would respectively create and delete the per-database
subdirectory?  That seems moderately reasonable to me.  We could then
invent a locking protocol that requires backends to lock a tablespace
before they can execute either of these operations (or delete the
tablespace of course).

===

Date: Thu, 26 Feb 2004 17:28:41 -0500
From: "Alex J. Avriette" <alex@posixnap.net>
To: Gavin Sherry <swm@linuxworld.com.au>
Cc: Dennis Bjorklund <db@zigo.dhs.org>, pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Tablespaces

On Fri, Feb 27, 2004 at 08:22:25AM +1100, Gavin Sherry wrote:

> interested if anyone could provide some real world benchmarking of file
> system vs. raw disk. Postgres benefits a lot from kernel file system cache
> at the moment. Also, I believe that database designers have traditionally
> made bad file system designers. Raw database partitions often lack the
> tools essential to a scalable environment. For example, the ability to
> resize partitions.

The only reason I mentioned it to begin with was the recommendation of
directio for databases in the Sun Blueprint, _Tuning Databases on the
Solaris Platform_ (and being a Solaris geek, I asked, but apparently
nobody else is worried enough about performance or not using Solaris
enough to care).

It's not critical, of course. I think, however, that many of us would
like to see some of the features of Oracle and DB2 available to users
of postgres.  Some of these features are raw disk, tablespaces, and
replication. We're getting there, and making really terrific progress
(I fully expect replication to be ready for primetime in the 8-12 mos
timeframe), but we're not quite there yet.

As I said, I'm very glad to hear tablespaces mentioned again and see
what looks like work being done on it.

===
To: James Rogers <jrogers@neopolitan.com>
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Tablespaces 
Date: Thu, 26 Feb 2004 17:42:47 -0500
From: Tom Lane <tgl@sss.pgh.pa.us>

James Rogers <jrogers@neopolitan.com> writes:
> With the implementation of much smarter and more adaptive cache
> replacement algorithm i.e. ARC, I would expect the benefit of using the
> kernel file system cache to diminish significantly.  It appears to me,
> and I could be wrong, that the reason Postgres has depended on the
> kernel file system cache isn't that this is obviously better in some
> absolute sense (though it might be depending on the deployment
> scenario), but that the original cache replacement algorithm in Postgres
> was sufficiently poor that the better cache replacement algorithms in
> the kernel cache more than offset any sub-optimality that might result
> from doing so.

The question of optimality of replacement algorithm is only one of the
arguments for using a small buffer cache.  IMHO a considerably stronger
argument is that the kernel's memory management is more flexible: it can
use that memory for either disk cache or program workspace, and it can
change the allocation on-the-fly as load demands.  If you dedicate most
of RAM to Postgres buffers then you are likely to be wasting RAM or
swapping heavily.  Possibly both :-(

Another gotcha is that unless the OS allows you to lock shared memory
into RAM, the shared buffers themselves could get swapped out, which is
a no-win scenario by any measure.  Keeping the shared buffer arena small
helps prevent that by ensuring all the buffers are "hot".

Of course, this is all speculation until we get some real-world
experience with ARC.  But I don't expect it to be a magic bullet.

===
To: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Tablespaces
From: Greg Stark <gsstark@mit.edu>
Date: 27 Feb 2004 00:26:06 -0500


> > I am expecting to hear some bleating about this from people whose
> > preferred platforms don't support symlinks ;-).  However, if we don't

Well, one option would be to have the low level filesystem storage (md.c?)
routines implement a kind of symlink themselves. Just a file with a special
magic number followed by a path.

I'm normally against reimplementing OS services but symlinks are really a very
simple concept and simple to implement. Especially if you can make a few
simplifying assumptions: they only ever need to appear as the final path
element not as parent directories and tablespaces don't need symlinks pointing
to symlinks. Ideally postgres also doesn't need to implement relative links
either.

===
Subject: Re: [HACKERS] Tablespaces
Date: Fri, 27 Feb 2004 10:00:29 +0100
From: "Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at>
To: "Tom Lane" <tgl@sss.pgh.pa.us>, "Gavin Sherry" <swm@linuxworld.com.au>
Cc: <pgsql-hackers@postgresql.org>


> > Ahh. I forgot to detail my ideas on this. It seems to me that we cannot
> > drop a table space until the directory is empty.

Agreed.

> 
> How would it get to be empty?  Are you thinking of some sort of "connect
> database to tablespace" and "disconnect database from tablespace"
> commands that would respectively create and delete the per-database
> subdirectory?  That seems moderately reasonable to me.  We could then

I would only allow the drop if the directory only contains empty db oid 
directories.

===

Subject: Re: [HACKERS] Tablespaces
Date: Fri, 27 Feb 2004 12:39:06 +0100
From: "Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at>
To: "Gavin Sherry" <swm@linuxworld.com.au>,
Cc: "Dennis Bjorklund" <db@zigo.dhs.org>, <pgsql-hackers@postgresql.org>


> I do not intend to undertake raw disk tablespaces for 7.5. I'd be
> interested if anyone could provide some real world benchmarking of file
> system vs. raw disk. Postgres benefits a lot from kernel file system cache
> at the moment.

Yes, and don't forget that pg also relys on the OS for grouping and
sorting the physical writes and doing readahead where appropriate.

The use of raw disks is usually paired with the use of kernel aio.
The difference is said to be up to 30% on Solaris. I can assert, that
it made the difference between a bogged down system and a much better behaved
DB on Sun here.

My experience with kaio on AIX Informix is, that kaio is faster as long as IO
is not the bottleneck (disk 100% busy is the metric to watch, not Mb/s), while
for an IO bound system the Informix builtin IO threads that can be used instead
win. (Since they obviously do better at grouping, sorting and readahead
than the AIX kernel does for kaio)

Overall I think the price and komplexity is too high, especially since there are 
enough platforms where the kernel does a pretty good job at grouping, sorting and 
readahead. Additionally the kernel takes non PostgreSQL IO into account.

===

Date: Fri, 27 Feb 2004 08:27:02 -0700 (MST)
From: "scott.marlowe" <scott.marlowe@ihs.com>
To: Zeugswetter Andreas SB SD <ZeugswetterA@spardat.at>
Cc: Tom Lane <tgl@sss.pgh.pa.us>, Gavin Sherry <swm@linuxworld.com.au>,
Subject: Re: [HACKERS] Tablespaces

On Fri, 27 Feb 2004, Zeugswetter Andreas SB SD wrote:

> 
> > > Ahh. I forgot to detail my ideas on this. It seems to me that we cannot
> > > drop a table space until the directory is empty.
> 
> Agreed.
> 
> > 
> > How would it get to be empty?  Are you thinking of some sort of "connect
> > database to tablespace" and "disconnect database from tablespace"
> > commands that would respectively create and delete the per-database
> > subdirectory?  That seems moderately reasonable to me.  We could then
> 
> I would only allow the drop if the directory only contains empty db oid 
> directories.

Wouldn't this be better tracked in the dependency tracking that's already 
built into postgresql?  Checking to see if the directory is empty is open 
to race conditions, but locking the dependency tracking while dropping a 
tablespace isn't.


===
Date: Fri, 27 Feb 2004 08:28:31 -0700 (MST)
From: "scott.marlowe" <scott.marlowe@ihs.com>
To: Gavin Sherry <swm@linuxworld.com.au>
Cc: "Alex J. Avriette" <alex@posixnap.net>, Dennis Bjorklund <db@zigo.dhs.org>,
Subject: Re: [HACKERS] Tablespaces

On Fri, 27 Feb 2004, Gavin Sherry wrote:

> On Thu, 26 Feb 2004, Alex J. Avriette wrote:
> 
> > On Thu, Feb 26, 2004 at 11:22:28PM +1100, Gavin Sherry wrote:
> >
> > > Certainly, table spaces are used in many ways in oracle, db2, etc. You can
> > > mirror data across them, have different buffer sizes for example.
> > > In some implementations, they can be raw disk partitions (no file system).
> > > I don't intend going this far, however.
> >
> > Perhaps now would be a good time to bring up my directio on Solaris question
> > from a year or so back? Is there any interest in the ability to use raw
> > disk?
> 
> I do not intend to undertake raw disk tablespaces for 7.5. I'd be
> interested if anyone could provide some real world benchmarking of file
> system vs. raw disk. Postgres benefits a lot from kernel file system cache
> at the moment. Also, I believe that database designers have traditionally
> made bad file system designers. Raw database partitions often lack the
> tools essential to a scalable environment. For example, the ability to
> resize partitions.

Is possible / reasonable / smart and or dumb to look at implementing the 
tablespaces as riding atop the initlocation handled stuff.  I.e. 
postgresql can only create tablespaces in areas that are created by 
initlocation, thus keeping it in its box, so to speak?


===

To: "scott.marlowe" <scott.marlowe@ihs.com>
Cc: Zeugswetter Andreas SB SD <ZeugswetterA@spardat.at>,
Subject: Re: [HACKERS] Tablespaces 
Date: Fri, 27 Feb 2004 11:42:46 -0500
From: Tom Lane <tgl@sss.pgh.pa.us>

"scott.marlowe" <scott.marlowe@ihs.com> writes:
> On Fri, 27 Feb 2004, Zeugswetter Andreas SB SD wrote:
>>> How would it get to be empty?  Are you thinking of some sort of "connect
>>> database to tablespace" and "disconnect database from tablespace"
>>> commands that would respectively create and delete the per-database
>>> subdirectory?  That seems moderately reasonable to me.  We could then
>> 
>> I would only allow the drop if the directory only contains empty db oid 
>> directories.

That's subject to race conditions (ie, someone creating a table about
the same time you are deciding it's okay to drop the tablespace).  There
needs to be some interlock, and I think that associating that lock with
infrequently executed connect/disconnect operations would be good from a
performance standpoint.

> Wouldn't this be better tracked in the dependency tracking that's already 
> built into postgresql?

No, because dependencies are local to individual databases.

===

To: "scott.marlowe" <scott.marlowe@ihs.com>
Cc: Gavin Sherry <swm@linuxworld.com.au>,
Subject: Re: [HACKERS] Tablespaces 
Date: Fri, 27 Feb 2004 12:42:15 -0500
From: Tom Lane <tgl@sss.pgh.pa.us>

"scott.marlowe" <scott.marlowe@ihs.com> writes:
> Is possible / reasonable / smart and or dumb to look at implementing the 
> tablespaces as riding atop the initlocation handled stuff.

In my mind, one of the main benefits of this work will be that we'll be
able to get *rid* of the initlocation stuff.  It's a crock.

===

Date: Fri, 27 Feb 2004 10:56:32 -0700 (MST)
From: "scott.marlowe" <scott.marlowe@ihs.com>
To: Tom Lane <tgl@sss.pgh.pa.us>
Cc: Gavin Sherry <swm@linuxworld.com.au>,
Subject: Re: [HACKERS] Tablespaces

On Fri, 27 Feb 2004, Tom Lane wrote:

> "scott.marlowe" <scott.marlowe@ihs.com> writes:
> > Is possible / reasonable / smart and or dumb to look at implementing the 
> > tablespaces as riding atop the initlocation handled stuff.
> 
> In my mind, one of the main benefits of this work will be that we'll be
> able to get *rid* of the initlocation stuff.  It's a crock.

OK, that's fine, but I keep thinking that a superuser should have to 
create the tablespace itself, and then tables can be assigned by users 
based on the rights assigned by the dba / superuser.  Is that how we're 
looking at doing it, or will any user be able to create a tablespace 
anywhere postgresql has write permission, or will only dbas be able to 
create AND use table spaces. I'm just not sure how that's gonna be 
handled, and haven't seen it addressed.

===

Date: Fri, 27 Feb 2004 13:34:41 -0600 (CST)
Subject: Re: [HACKERS] Tablespaces
From: tswan@idigx.com
To: "Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at>
Cc: "Gavin Sherry" <swm@linuxworld.com.au>,

>
>> I do not intend to undertake raw disk tablespaces for 7.5. I'd be
>> interested if anyone could provide some real world benchmarking of file
>> system vs. raw disk. Postgres benefits a lot from kernel file system
>> cache
>> at the moment.
>
> Yes, and don't forget that pg also relys on the OS for grouping and
> sorting the physical writes and doing readahead where appropriate.
>
>

Most people I know want tablespaces in order to limit or preallocate the
disk space used by a table or database in addition to controlling the
physical location of a table or database.

I know on linux, there is the option of creating an empty file or a
specific size using dd, mounting it through loopback, formatting it,
symlinking the appropriate OID/TID (or mounting the lpb device in the
appropriate directory) and then you control how much space that
directory/mount point can contain.

Of course, with MVCC you would have to vacuum frequently, as you could
miss some updates if there weren't enough tuples marked as free.  If there
were "in-place" updates, the preallocation and limitation much easier, but
that's not how PG works.

If the tablespace disk space allocation is exceeded there would need to be
some graceful reporting condition back to the client.  "UPDATE/INSERT
failed (tablespace size exceeded)", "(tablespace full)", "(disk full)" or
some other error may need to be handled/reported.

===

Date: Sat, 28 Feb 2004 11:03:12 +1100 (EST)
From: Gavin Sherry <swm@linuxworld.com.au>
To: tswan@idigx.com
Cc: Zeugswetter Andreas SB SD <ZeugswetterA@spardat.at>,
Subject: Re: [HACKERS] Tablespaces

On Fri, 27 Feb 2004 tswan@idigx.com wrote:

> >
> >> I do not intend to undertake raw disk tablespaces for 7.5. I'd be
> >> interested if anyone could provide some real world benchmarking of file
> >> system vs. raw disk. Postgres benefits a lot from kernel file system
> >> cache
> >> at the moment.
> >
> > Yes, and don't forget that pg also relys on the OS for grouping and
> > sorting the physical writes and doing readahead where appropriate.
> >
> >
>
> Most people I know want tablespaces in order to limit or preallocate the
> disk space used by a table or database in addition to controlling the
> physical location of a table or database.
>
> I know on linux, there is the option of creating an empty file or a
> specific size using dd, mounting it through loopback, formatting it,
> symlinking the appropriate OID/TID (or mounting the lpb device in the
> appropriate directory) and then you control how much space that
> directory/mount point can contain.
>
> Of course, with MVCC you would have to vacuum frequently, as you could
> miss some updates if there weren't enough tuples marked as free.  If there
> were "in-place" updates, the preallocation and limitation much easier, but
> that's not how PG works.

I do not intend to work on such a system for the initial introduction of
table spaces. The problem is, of course, knowing when you're actually out
of space in a table space in any given transaction. Given that WAL is on a
different partition (at least for the moment) the table space will not
have transaction X's data written to it until after transaction X is
finished. And we cannot error out a transaction which is already commited.

The solution is to keep track of free space and error out at some
percentage of free space remaining. But I don't want to complicate
tablespaces too much in 7.5.

===


To: Gavin Sherry <swm@linuxworld.com.au>
Cc: tswan@idigx.com, Zeugswetter Andreas SB SD <ZeugswetterA@spardat.at>,
Subject: Re: [HACKERS] Tablespaces 
Date: Fri, 27 Feb 2004 22:49:39 -0500
From: Tom Lane <tgl@sss.pgh.pa.us>

Gavin Sherry <swm@linuxworld.com.au> writes:
> I do not intend to work on such a system for the initial introduction of
> table spaces. The problem is, of course, knowing when you're actually out
> of space in a table space in any given transaction.

It should not be that hard, at least not on local filesystems.  When PG
realizes that a new page must be added to a table, it does a write() to
append a page of zeroes to the physical table.  This happens
immediately.  It's true that actual data may not be written into that
section of the file till long after commit, but the kernel should do
space allocation checking upon the first write.

I have heard tell that this may not happen when you are dealing with NFS
(yet another reason not to run databases across NFS) but on all local
filesystems I know of, out-of-space should result in a failure before
transaction commit.

I say "should" because I suspect this isn't a very heavily tested code
path in Postgres.  But in theory it should work.  Feel free to submit
bug reports if you find it doesn't.

===

Date: Mon, 1 Mar 2004 09:31:32 -0500
From: Andrew Sullivan <ajs@crankycanuck.ca>
To: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Tablespaces

On Thu, Feb 26, 2004 at 05:28:41PM -0500, Alex J. Avriette wrote:

> The only reason I mentioned it to begin with was the recommendation of
> directio for databases in the Sun Blueprint, _Tuning Databases on the
> Solaris Platform_ (and being a Solaris geek, I asked, but apparently
> nobody else is worried enough about performance or not using Solaris
> enough to care).

That recommendation itself is a few years old.  While it may still be
true that directio is still fastest for Oracle on Solaris, I'd sure
like to see some recent evidence.  I've a funny feeling that this is
an old rule of thumb which is now true in the sense that everyone
believes it, but maybe not in the sense that a test would reveal it
to be a sensible rule.

> like to see some of the features of Oracle and DB2 available to users
> of postgres.  Some of these features are raw disk, tablespaces, and
> replication. We're getting there, and making really terrific progress

I don't think we want features for their own sake, though, and I'm
not convinced that raw filesystems are actually useful.  Course, it's
not my itch, and PostgreSQL _is_ free software.

===

From: "Simon Riggs" <simon@2ndquadrant.com>
To: "'Gavin Sherry'" <swm@linuxworld.com.au>, <tswan@idigx.com>
Cc: <pgsql-hackers@postgresql.org>
Subject: Re: [HACKERS] Tablespaces
Date: Tue, 2 Mar 2004 00:27:51 -0000

>Gavin Sherry
> On Fri, 27 Feb 2004 tswan@idigx.com wrote:
> > Most people I know want tablespaces in order to limit or preallocate
the
> > disk space used by a table or database in addition to controlling
the
> > physical location of a table or database.

> I do not intend to work on such a system for the initial introduction
of
> table spaces. The problem is, of course, knowing when you're actually
out
> of space in a table space in any given transaction. Given that WAL is
on a
> different partition (at least for the moment) the table space will not
> have transaction X's data written to it until after transaction X is
> finished. And we cannot error out a transaction which is already
commited.
> 
> The solution is to keep track of free space and error out at some
> percentage of free space remaining. But I don't want to complicate
> tablespaces too much in 7.5.

You're absolutely right about the not-knowing when you're out of space
issue. However, if the xlog has been written then it is not desirable,
but at least acceptable that the checkpoint/bgwriter cannot complete on
an already committed txn. It's not the txn which is getting the error,
that's all.

Hmmm...I'm not sure that we'll be able or should avoid the out of space
situation completely. The question is...what will we do when we hit it?
It doesn't matter whether you stop at 100% or 90% or whatever, you still
have to stop and then what? Stay up as long as possible hopefully: If
there wasn't enough space to write to the tablespace, going into
recovery won't help the situation either; youre still out of space until
you fix that. We now have the option not to crash, since it might be
perfectly viable to keep on chugging away on one Tablespace even though
all txn work on the out-of-space tablespace is frozen/barred etc. Sounds
like a refinement, but something to keep in mind at the design stage if
we can.

The problem is that tablespaces do complicate space management (that's
what people want though, so that's OK). That complicates admin and so pg
will hit many more out of space errors than we've seen previously.
Trying to work out how to spot these ahead of time, accept user defined
limits on each tablespace etc sounds like extra complexity for the
initial drop. I guess my own suggested approach is to start by handling
the error cases, then go back and try to avoid some of them.

All of this exposes for me the complication that doing PITR and
tablespaces at the same time is likely to be more complex for us both
than either had envisaged. The reduced complexity for PITR was what I
was shooting for, also! I'm happy to work together on any issues that
arise.

For PITR, I think we would need:
- a very accessible list of tablespace locations, so taking a full
physical database backup can be easily accomplished using OS utilities.
Hopefully a list maintained external to the database? We have the
equivalent now with env variables.
- decisions about what occurs when for-whatever-reason one or more
tablespaces are not recoverable from backup?
- it might be desirable to allow recovery with less than all of the
original tablespces
- it might also be desirable to allow recovery when the tablespaces txn
Ids don't match (though that is forbidden on many other dbms)

===

To: simon@2ndquadrant.com
Cc: "'Gavin Sherry'" <swm@linuxworld.com.au>, tswan@idigx.com,
Subject: Re: [HACKERS] Tablespaces 
Date: Mon, 01 Mar 2004 20:22:37 -0500
From: Tom Lane <tgl@sss.pgh.pa.us>

"Simon Riggs" <simon@2ndquadrant.com> writes:
> Gavin Sherry wrote:
>> I do not intend to work on such a system for the initial introduction of
>> table spaces. The problem is, of course, knowing when you're actually out
>> of space in a table space in any given transaction. Given that WAL is on a
>> different partition (at least for the moment) the table space will not
>> have transaction X's data written to it until after transaction X is
>> finished. And we cannot error out a transaction which is already
>> commited.

As long as the kernel doesn't lie about file extension, we will not
commit any transaction that requires a disallowed increase in the
allocated size of data files, because allocation of another table page
is checked with the kernel during the transaction.  So on most
filesystems (maybe not NFS) the problem Gavin is worried about doesn't
exist.

> You're absolutely right about the not-knowing when you're out of space
> issue. However, if the xlog has been written then it is not desirable,
> but at least acceptable that the checkpoint/bgwriter cannot complete on
> an already committed txn. It's not the txn which is getting the error,
> that's all.

Right.  This is in fact not a fatal situation, as long as you don't run
out of preallocated WAL space.  For a recent practical example of our
behavior under zero-free-space conditions, see this thread:
http://archives.postgresql.org/pgsql-hackers/2004-01/msg00530.php
particularly the post-mortem here:
http://archives.postgresql.org/pgsql-hackers/2004-01/msg00606.php
Barring one small bug, the database would likely have stayed up, and
continued to service at least the read-only transactions, until Chris
got around to freeing some disk space.

I think it is sufficient (at least in the near term) to expect people to
use partition size limits if they want to control database size --- that
is, make a partition of the desired size and put the database directory
in there.  Tablespaces as per the design we are discussing would make it
easier to apply such a policy to a sub-area of a database cluster than
it is today, but they needn't in themselves implement the restriction.

===

Date: Wed, 03 Mar 2004 11:33:50 +0800
From: Christopher Kings-Lynne <chriskl@familyhealth.com.au>
To: Gavin Sherry <swm@linuxworld.com.au>
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Tablespaces

> A table space parameter will be added to DDL commands which create
> physical database objects (CREATE DATABASE/INDEX/TABLE/SEQUENCE) and to
> CREATE SCHEMA. The associated routines, as well as the corresponding DROP
> commands will need to be updated. Adding the ability to ALTER <object>
> TABLESPACE <name> seems a little painful. Would people use it? Comments?

How about allowing the specification on schemas and databases of 
different default tablespaces for TEMP, TABLE and INDEX??  Is there any 
point to that?

===

Date: Wed, 03 Mar 2004 11:35:28 +0800
From: Christopher Kings-Lynne <chriskl@familyhealth.com.au>
To: Gavin Sherry <swm@linuxworld.com.au>
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Tablespaces

> I've been looking at implementing table spaces for 7.5. Some notes and
> implementation details follow.

Ah sorry, other things you might need to consider:

Privileges on tablespaces:

GRANT USAGE ON TABLESPACE tbsp TO ...;

Different disk settings for different tablespaces (since they will 
likely be on different disks):

ALTER TABLESPACE tbsp SET random_page_cost TO 2.5;

===

To: Christopher Kings-Lynne <chriskl@familyhealth.com.au>
Cc: Gavin Sherry <swm@linuxworld.com.au>, pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Tablespaces 
Date: Tue, 02 Mar 2004 22:52:49 -0500
From: Tom Lane <tgl@sss.pgh.pa.us>

Christopher Kings-Lynne <chriskl@familyhealth.com.au> writes:
> How about allowing the specification on schemas and databases of 
> different default tablespaces for TEMP, TABLE and INDEX??  Is there any 
> point to that?

TEMP tables are not local to any particular schema, so it wouldn't make
sense to have a schema-level default for their placement.

The other five combinations are at least theoretically sensible, but
do we need 'em all?  It seems to me that a reasonable compromise is to
offer database-level default tablespaces for TEMP, TABLE, and INDEX,
ignoring the schema level.  This is simple and understandable, and if
you don't like it, you're probably the kind of guy who will want to
override it per-table anyway ...

BTW, another dimension to think about is where TOAST tables and their
indexes will get placed.

===

From: Bruce Momjian <pgman@candle.pha.pa.us>
Subject: Re: [HACKERS] Tablespaces
To: Gavin Sherry <swm@linuxworld.com.au>
Date: Tue, 2 Mar 2004 23:23:50 -0500 (EST)
Cc: pgsql-hackers@postgresql.org

Gavin Sherry wrote:
> The actual creation of the table space will be done with:
> 
> 	CREATE TABLE SPACE <name> LOCATION </path/to/tblspc>;

Seems you should use CREATE TABLESPACE (no space) so it is more distinct
from CREATE TABLE.

===

From: Bruce Momjian <pgman@candle.pha.pa.us>
Subject: Re: [HACKERS] Tablespaces
To: Richard Huxton <dev@archonet.com>
Date: Tue, 2 Mar 2004 23:24:23 -0500 (EST)
Cc: Gavin Sherry <swm@linuxworld.com.au>, pgsql-hackers@postgresql.org

Richard Huxton wrote:
> On Thursday 26 February 2004 10:07, Gavin Sherry wrote:
> >
> > CREATE TABLESPACE tbl1 LOCATION '/var/'
> >
> > which will result in something like '/var/123443' is a bad idea. Then
> > again, the user should know better. Comments?
> 
> The LOCATION should have the same owner and permissions as $PGDATA - that 
> should catch mistyping.
> 
> Unless you're running as root, of course. In which case you clearly know 
> better than everyone else, so off you go!

FYI, you can't run the postmaster as root.

===

From: Bruce Momjian <pgman@candle.pha.pa.us>
Subject: Re: [HACKERS] Tablespaces
To: "Joshua D. Drake" <jd@commandprompt.com>
Date: Tue, 2 Mar 2004 23:25:04 -0500 (EST)
Cc: =?ISO-8859-1?Q?Hans-J=FCrgen_Sch=F6nig?= <postgres@cybertec.at>,

Joshua D. Drake wrote:
> > Is it possible to put WALs and CLOGs into different tablespaces? (maybe 
> > different RAID systems). Some companies want that ...
> 
> You can do this now, but it would be nice to be able to have it actually 
> configurable versus the hacked linked method.

Agreed, but because the system has to be down to move pg_xlog, I think
we should write a command-line utility to assist with this, perhaps.  It
could check permissions and stuff.

===

Date: Wed, 3 Mar 2004 15:26:21 +1100 (EST)
From: Gavin Sherry <swm@linuxworld.com.au>
To: Bruce Momjian <pgman@candle.pha.pa.us>
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Tablespaces

On Tue, 2 Mar 2004, Bruce Momjian wrote:

> Gavin Sherry wrote:
> > The actual creation of the table space will be done with:
> >
> > 	CREATE TABLE SPACE <name> LOCATION </path/to/tblspc>;
>
> Seems you should use CREATE TABLESPACE (no space) so it is more distinct
> from CREATE TABLE.

Oops. Typo.

===

From: Bruce Momjian <pgman@candle.pha.pa.us>
Subject: Re: [HACKERS] Tablespaces
To: Tom Lane <tgl@sss.pgh.pa.us>
Date: Tue, 2 Mar 2004 23:26:21 -0500 (EST)
Cc: Gavin Sherry <swm@linuxworld.com.au>, pgsql-hackers@postgresql.org

Tom Lane wrote:
> Gavin Sherry <swm@linuxworld.com.au> writes:
> > A table space is a directory structure. The directory structure is as
> > follows:
> > [swm@dev /path/to/tblspc]$ ls
> > OID1/	OID2/
> > OID1 and OID2 are the OIDs of databases which have created a table space
> > against this file system location. In this respect, a table space
> > resembles $PGDATA/base. I thought it useful to keep this kind of
> > namespace mechanism in place ...
> 
> Actually, this is *necessary* AFAICT.  The case that forces it is DROP
> DATABASE.  Since you have to execute that from another database, there's
> no reasonable way to look into the target database's catalogs.  That
> means that the OID of the database has to be sufficient information to
> get rid of all its files.  You can do this fairly easily if in each
> tablespace (whose locations you know from the shared pg_tablespace
> table) you can look for a subdirectory matching the target database's
> OID.  If we tried to put the database's files just "loose" in each
> tablespace directory then we'd be in trouble.
> 

Gavin, let us know if you want us to create the global pg_tablespace for
you.  Some of us have done a lot of system catalog work.

===

From: Bruce Momjian <pgman@candle.pha.pa.us>
Subject: Re: [HACKERS] Tablespaces
To: Tom Lane <tgl@sss.pgh.pa.us>
Date: Tue, 2 Mar 2004 23:27:46 -0500 (EST)
Cc: Gavin Sherry <swm@linuxworld.com.au>, pgsql-hackers@postgresql.org

Tom Lane wrote:
> Gavin Sherry <swm@linuxworld.com.au> writes:
> >> Speaking of locking, can we do anything to prevent people from shooting
> >> themselves in the foot by changing active tablespaces?  Are we even
> >> going to have a DROP TABLESPACE command, and if so what would it do?
> 
> > Ahh. I forgot to detail my ideas on this. It seems to me that we cannot
> > drop a table space until the directory is empty.
> 
> How would it get to be empty?  Are you thinking of some sort of "connect
> database to tablespace" and "disconnect database from tablespace"
> commands that would respectively create and delete the per-database
> subdirectory?  That seems moderately reasonable to me.  We could then
> invent a locking protocol that requires backends to lock a tablespace
> before they can execute either of these operations (or delete the
> tablespace of course).

One crude solution would be to remove the tablespace oid directory only
when the database is dropped, and require an empty tablespace directory
to drop the tablespace.  This allows a lock only on tablespace creation,
and not a lock on object creation in each tablespace.

===

From: Bruce Momjian <pgman@candle.pha.pa.us>
Subject: Re: [HACKERS] Tablespaces
To: Greg Stark <gsstark@mit.edu>
Date: Tue, 2 Mar 2004 23:33:58 -0500 (EST)
Cc: pgsql-hackers@postgresql.org,

Greg Stark wrote:
> 
> > > I am expecting to hear some bleating about this from people whose
> > > preferred platforms don't support symlinks ;-).  However, if we don't
> 
> Well, one option would be to have the low level filesystem storage (md.c?)
> routines implement a kind of symlink themselves. Just a file with a special
> magic number followed by a path.
> 
> I'm normally against reimplementing OS services but symlinks are really a very
> simple concept and simple to implement. Especially if you can make a few
> simplifying assumptions: they only ever need to appear as the final path
> element not as parent directories and tablespaces don't need symlinks pointing
> to symlinks. Ideally postgres also doesn't need to implement relative links
> either.

I just checked from the MinGW console and I see:
	
	# touch a
	# ln -s a b
	# echo test >a
	# cat b
	# l ?
	-rw-r--r--    1 Bruce Mo Administ        5 Mar  2 23:30 a
	-rw-r--r--    1 Bruce Mo Administ        0 Mar  2 23:30 b
	# cat a
	test
	# cat b
	#

It accepts ln -s, but does nothing with it.

For tablespaces on OS's that don't support it, I think we will have to
store the path name in the file and read it via the backend.  Somehow we
should cache those lookups.

===

From: Bruce Momjian <pgman@candle.pha.pa.us>
Subject: Re: [HACKERS] Tablespaces
To: Tom Lane <tgl@sss.pgh.pa.us>
Date: Tue, 2 Mar 2004 23:34:55 -0500 (EST)
Cc: "scott.marlowe" <scott.marlowe@ihs.com>,

Tom Lane wrote:
> "scott.marlowe" <scott.marlowe@ihs.com> writes:
> > Is possible / reasonable / smart and or dumb to look at implementing the 
> > tablespaces as riding atop the initlocation handled stuff.
> 
> In my mind, one of the main benefits of this work will be that we'll be
> able to get *rid* of the initlocation stuff.  It's a crock.

Agreed. It should be ripped out once we have tablespaces, and if we keep
it for one extra release, there will be confusion over which to use.

===

From: Bruce Momjian <pgman@candle.pha.pa.us>
Subject: Re: [HACKERS] Tablespaces
To: josh@agliodbs.com
Date: Tue, 2 Mar 2004 23:37:15 -0500 (EST)
Cc: pgsql-hackers@postgresql.org, Gavin Sherry <swm@linuxworld.com.au>

Josh Berkus wrote:
> #3: ALTER TABLE .... CHANGE TABLESPACE:
> This is strictly in the class of "would be a very nice & useful feature if 
> it's not too difficult". ? 
> 
> Given how painful it is to drop & replace a table with multiple dependencies 
> (on some databases, only possible by droping & re-loading the entire 
> database) it would be nice to have an ALTER TABLE command that moved the 
> table to another tablespace. ? ?It doesn't *seem* to me that this would be a 
> very challenging bit of programming, as the operation would be very similar 
> to REINDEX in the manipulation of files. ? (But what I know, really?)
> 
> Once tablespaces are a feature and some users start using them for quota 
> management, there will quickly develop situations where the original 
> tablespace for a db runs out of room and can't be resized. ? Being able to 
> move the table "in situ" then becomes vital, especially on very large 
> databases ... and when someday combined with partitioned tables, will become 
> essential.
> 
> Further, we will get an *immediate* flurry of requests from users who just 
> upgraded to 7.5 and want to make use of the tablespaces feature on an 
> existing production database.

If we don't implement moving tables between tablespaces, we should add a
stub for it in the grammer and mention it is not implemented yet,
because if we don't, we will get tons of questions.

===

From: Bruce Momjian <pgman@candle.pha.pa.us>
Subject: Re: [HACKERS] Tablespaces
To: Gavin Sherry <swm@linuxworld.com.au>
Date: Tue, 2 Mar 2004 23:39:53 -0500 (EST)
Cc: Tom Lane <tgl@sss.pgh.pa.us>, pgsql-hackers@postgresql.org

Gavin Sherry wrote:
> Actually, I think that's a pretty good idea :-). I'd solves a bunch of
> issues in the backend (postmaster start up can recurse through
> $PGDATA/tablespaces looking for postmaster.pid files) and will also assist
> admins with complex configurations (perhaps).

Why are you asking about postmaster.pid files.  That file goes in the
top level /data directory, no?

===

Date: Wed, 3 Mar 2004 15:46:55 +1100 (EST)
From: Gavin Sherry <swm@linuxworld.com.au>
To: Bruce Momjian <pgman@candle.pha.pa.us>
Cc: Tom Lane <tgl@sss.pgh.pa.us>, pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Tablespaces

On Tue, 2 Mar 2004, Bruce Momjian wrote:

> Gavin Sherry wrote:
> > Actually, I think that's a pretty good idea :-). I'd solves a bunch of
> > issues in the backend (postmaster start up can recurse through
> > $PGDATA/tablespaces looking for postmaster.pid files) and will also assist
> > admins with complex configurations (perhaps).
>
> Why are you asking about postmaster.pid files.  That file goes in the
> top level /data directory, no?

I was trying to be paranoid about users who have multiple postmasters on
the same machine and want to share a table space while both systems are
live. There'd be no mechanism to test for that situation if we didn't have
something like a postmaster.pid file. Is this being a little too paranoid?

===

From: Bruce Momjian <pgman@candle.pha.pa.us>
Subject: Re: [HACKERS] Tablespaces
To: Gavin Sherry <swm@linuxworld.com.au>
Date: Wed, 3 Mar 2004 00:00:40 -0500 (EST)
Cc: Tom Lane <tgl@sss.pgh.pa.us>, pgsql-hackers@postgresql.org

Gavin Sherry wrote:
> On Tue, 2 Mar 2004, Bruce Momjian wrote:
> 
> > Gavin Sherry wrote:
> > > Actually, I think that's a pretty good idea :-). I'd solves a bunch of
> > > issues in the backend (postmaster start up can recurse through
> > > $PGDATA/tablespaces looking for postmaster.pid files) and will also assist
> > > admins with complex configurations (perhaps).
> >
> > Why are you asking about postmaster.pid files.  That file goes in the
> > top level /data directory, no?
> 
> I was trying to be paranoid about users who have multiple postmasters on
> the same machine and want to share a table space while both systems are
> live. There'd be no mechanism to test for that situation if we didn't have
> something like a postmaster.pid file. Is this being a little too paranoid?

Oh, yikes, I see.  Right now we have the interlock on the /data
directory, but once you start moving stuff out from under /data using
tablespaces, we do perhaps loose the interlock.  However, I assume the
CREATE TABLESPACE is going to create the tablespace directory, so I
don't see how two postmasters could both create the directory.

For example, if you say 

	CREATE TABLESPACE tb IN '/var/tb1'

I assume you have to create:

	/var/tb1/pgsql_tablespace

and then

	/var/tb1/pgsql_tablespace/oid1
	/var/tb1/pgsql_tablespace/oid2

or something like that, and set the proper permissions on
pgsql_tablespace.  We will have write permission on the directory they
pass to us, but we might not have permissions to change the mode of the
directory they pass, so we have to create a subdirectory anyway, and
that is our interlock.

For example:
	
	# run as root
	$ chmod a+w .
	$ ls -ld .
	drwxrwxrwx  2 root  wheel  512 Mar  2 23:51 .
	
	# run as the postmaster
	$ mkdir new
	$ ls -ld new
	drwxr-xr-x  2 postgres  wheel  512 Mar  2 23:52 new
	$ chmod 700 new
	$ chmod 700 .
	chmod: .: Operation not permitted
	chmod: .: Operation not permitted

As you can see, I have permission to create the /new directory, but no
ability to set its mode, so we have to create a directory that matches
the permissions of /data:

	drwx------  6 postgres  postgres  512 Mar  2 12:48 /u/pg/data/


We could require the admin to create a directory that we own instead of
just one that we have write permission in, but why bother when we can
use the new directory as an interlock from multiple postmasters anyway.

Right now we do require the directory used as /data be one where we can
create a /data subdirectory, so this seems similar.  We don't put the
data directly in the passed directory, but in /data under that.  In
fact, we could just call it /var/tb1/data instead of
/var/tb1/pgsql_tablespace.

===

From: Claudio Natoli <claudio.natoli@memetrics.com>
To: "'Bruce Momjian'" <pgman@candle.pha.pa.us>, Greg Stark <gsstark@mit.edu>
Cc: pgsql-hackers@postgresql.org,
Subject: Re: [HACKERS] Tablespaces
Date: Wed, 3 Mar 2004 21:12:35 +1100 


Bruce Momjian writes:
> I just checked from the MinGW console and I see:
> [snip]
> It accepts ln -s, but does nothing with it.

And even if it had worked, it wouldn't really matter, since we don't
actually want to *run* the system under MinGW/msys, just build it.

I think the idea of implementing in symlinks for non-compliant platforms in
md.c has some merit. FWIW, looks like that is how cygwin implements
symlinks...

===

Subject: Re: [HACKERS] Tablespaces
From: Oliver Elphick <olly@lfix.co.uk>
To: Tom Lane <tgl@sss.pgh.pa.us>
Cc: Gavin Sherry <swm@linuxworld.com.au>,
Date: Wed, 03 Mar 2004 11:58:51 +0000

On Wed, 2004-03-03 at 04:59, Tom Lane wrote:
>    What might make sense is some sort of marker file in a
> tablespace directory that links back to the owning $PGDATA directory.
> CREATE TABLESPACE should create this, or reject if it already exists.

It will not be enough for the marker to list the path of the parent
$PGDATA, since that path might get changed by system administration
action.  The marker should contain some sort of unique string which
would match the same string somewhere in $PGDATA.  Then, if either
tablespace or $PGDATA were moved, it would be possible to tie the two
back together.  It wouldn't be an issue on most normal systems, but
might be of crucial importance for an ISP running numerous separate
clusters.

===

To: olly@lfix.co.uk
Cc: Gavin Sherry <swm@linuxworld.com.au>,
Subject: Re: [HACKERS] Tablespaces 
Date: Wed, 03 Mar 2004 09:36:53 -0500
From: Tom Lane <tgl@sss.pgh.pa.us>

Oliver Elphick <olly@lfix.co.uk> writes:
> It will not be enough for the marker to list the path of the parent
> $PGDATA, since that path might get changed by system administration
> action.  The marker should contain some sort of unique string which
> would match the same string somewhere in $PGDATA.

We have already added a notion of a "unique installation identifier"
for PITR purposes (look in pg_control).  So we could use that for this
purpose if we wanted to.

But I'm not sure how important it really is.  AFAICS the behavior of
CREATE TABLESPACE will be "create marker file, if it already exists
then abort".  It has no need to actually look in the file and so there's
no need for the contents to be unique.

===

To: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Tablespaces
From: Greg Stark <gsstark@mit.edu>
Date: 03 Mar 2004 11:35:40 -0500


Bruce Momjian <pgman@candle.pha.pa.us> writes:

> Greg Stark wrote:
> > 
> > > > I am expecting to hear some bleating about this from people whose
> > > > preferred platforms don't support symlinks ;-).  However, if we don't
> > 
> > Well, one option would be to have the low level filesystem storage (md.c?)
> > routines implement a kind of symlink themselves. Just a file with a special
> > magic number followed by a path.

On further contemplation it doesn't seem like using symlinks really ought to
be necessary. It should be possible to drive everything off the catalog tables
while avoidin having the low level filesystem code know anything about them.

Instead of having the low level code fetch the pg_* records themselves, some
piece of higher level code would do the query and call down to storage layer
to inform it of the locations for everything. It would have to do this on
database initialization and on any subsequent object creation.

Basically maintain an in-memory hash table of oid -> path, and call down to
the low level code whenever that hash changes. (Or more likely oid->ts_id and
a separate list of ts_id -> path.)

===

Date: Wed, 03 Mar 2004 09:05:41 -0800
From: Barry Lind <blind@xythos.com>
To: olly@lfix.co.uk
Cc: Tom Lane <tgl@sss.pgh.pa.us>, Gavin Sherry <swm@linuxworld.com.au>,
Subject: Re: [HACKERS] Tablespaces



Oliver Elphick wrote:
> On Wed, 2004-03-03 at 04:59, Tom Lane wrote:
> 
>>   What might make sense is some sort of marker file in a
>>tablespace directory that links back to the owning $PGDATA directory.
>>CREATE TABLESPACE should create this, or reject if it already exists.
> 
> 
> It will not be enough for the marker to list the path of the parent
> $PGDATA, since that path might get changed by system administration
> action.  The marker should contain some sort of unique string which
> would match the same string somewhere in $PGDATA.  Then, if either
> tablespace or $PGDATA were moved, it would be possible to tie the two
> back together.  It wouldn't be an issue on most normal systems, but
> might be of crucial importance for an ISP running numerous separate
> clusters.

Taking this one step further would be to do something like Oracle does. 
  Every datafile in Oracle (because the Oracle storage manager stores 
multiple objects inside datafiles, one could say there is some 
similarity between Oracle datafiles and the proposed pg tablespaces), 
has meta info that tells it which database instance it belongs to and 
the last checkpoint that occured (It might actually be more granular 
than checkpoint, such that on a clean shutdown you can tell that all 
datafiles are consistent with each other and form a consistent database 
instance).  So Oracle on every checkpoint updates all datafiles with an 
identifier.  Now you might ask why is this useful.  Well in normal day 
to day operation it isn't, but it can be usefull in disaster recovery. 
If you loose a disk and need to restore the entire database from backups 
it can be difficult to make sure you have done it all correctly (do I 
have all the necessary files/directories? did I get the right ones from 
the right tapes?)  Especially if you have directories spread across 
various different disks that might be backed up to different tapes.  So 
by having additional information stored in each datafile Oracle can 
provide additional checks that the set of files that are being used when 
the database starts up are consistent and all belong together.  Oracle 
also ensures that all the datafiles that are suposed to exist actually 
do as well.

So what might this mean for postgres and tablespaces?  It could mean 
that on startup the database checks to verify that all the tablespaces 
that are registered actually exist.  And that the data in each 
tablespace is consistent with the current WAL status.  (i.e. someone 
didn't restore a tablespace from backup while the database was down that 
  is old and needs recovery.

A lot of what I am talking about here become PITR issues.  But since 
PITR and tablespaces are both potential features for 7.5, how they 
interact probably should be thought about in the designs for each.

===

From: Bruce Momjian <pgman@candle.pha.pa.us>
Subject: Re: [HACKERS] Tablespaces
To: Greg Stark <gsstark@mit.edu>
Date: Wed, 3 Mar 2004 12:52:40 -0500 (EST)
Cc: pgsql-hackers@postgresql.org

Greg Stark wrote:
> 
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> 
> > Greg Stark wrote:
> > > 
> > > > > I am expecting to hear some bleating about this from people whose
> > > > > preferred platforms don't support symlinks ;-).  However, if we don't
> > > 
> > > Well, one option would be to have the low level filesystem storage (md.c?)
> > > routines implement a kind of symlink themselves. Just a file with a special
> > > magic number followed by a path.
> 
> On further contemplation it doesn't seem like using symlinks really ought to
> be necessary. It should be possible to drive everything off the catalog tables
> while avoidin having the low level filesystem code know anything about them.
> 
> Instead of having the low level code fetch the pg_* records themselves, some
> piece of higher level code would do the query and call down to storage layer
> to inform it of the locations for everything. It would have to do this on
> database initialization and on any subsequent object creation.
> 
> Basically maintain an in-memory hash table of oid -> path, and call down to
> the low level code whenever that hash changes. (Or more likely oid->ts_id and
> a separate list of ts_id -> path.)

The advantage of symlinks is that an administrator could see how things
are laid out from the command line.

===

Date: Wed, 03 Mar 2004 20:31:16 -0600
From: Thomas Swan <tswan@idigx.com>
To: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Tablespaces

Bruce Momjian wrote:
>Greg Stark wrote:
>>Bruce Momjian <pgman@candle.pha.pa.us> writes:
>>>Greg Stark wrote:

>>>>>>I am expecting to hear some bleating about this from people whose
>>>>>>preferred platforms don't support symlinks ;-).  However, if we don't

>>>>Well, one option would be to have the low level filesystem storage (md.c?)
>>>>routines implement a kind of symlink themselves. Just a file with a special
>>>>magic number followed by a path.

>>On further contemplation it doesn't seem like using symlinks really ought to
>>be necessary. It should be possible to drive everything off the catalog tables
>>while avoidin having the low level filesystem code know anything about them.
>>
>>Instead of having the low level code fetch the pg_* records themselves, some
>>piece of higher level code would do the query and call down to storage layer
>>to inform it of the locations for everything. It would have to do this on
>>database initialization and on any subsequent object creation.
>>
>>Basically maintain an in-memory hash table of oid -> path, and call down to
>>the low level code whenever that hash changes. (Or more likely oid->ts_id and
>>a separate list of ts_id -> path.)

>The advantage of symlinks is that an administrator could see how things
>are laid out from the command line.

That's a poor reason to require symlinks.  The administrator can just as
easily open up psql and query pg_tablespace to see that same
information.  Besides, the postgres doesn't know where to look on the
filesystem for the /path/to/oid without a system catalog lookup.  There
doesn't seem to be any sensible reason to force a filesystem requirement
when the core operations are diffferent to begin with. 

If a more global view of all databases is necessary, perhaps there ought
to be a system wide view which could display all of that information at
once: dbname, relation name, and physical location.

===

From: Bruce Momjian <pgman@candle.pha.pa.us>
Subject: Re: [HACKERS] Tablespaces
To: Thomas Swan <tswan@idigx.com>
Date: Wed, 3 Mar 2004 21:41:17 -0500 (EST)
Cc: pgsql-hackers@postgresql.org

Thomas Swan wrote:
> >The advantage of symlinks is that an administrator could see how things
> >are laid out from the command line.
> >
> >  
> >
> That's a poor reason to require symlinks.  The administrator can just as
> easily open up psql and query pg_tablespace to see that same
> information.  Besides, the postgres doesn't know where to look on the
> filesystem for the /path/to/oid without a system catalog lookup.  There
> doesn't seem to be any sensible reason to force a filesystem requirement
> when the core operations are diffferent to begin with. 
> 
> If a more global view of all databases is necessary, perhaps there ought
> to be a system wide view which could display all of that information at
> once: dbname, relation name, and physical location.

Who doesn't have symlinks these days, and is going to be using
tablespaces?  Even Win32 has them.

===

Date: Thu, 04 Mar 2004 00:39:29 -0600
From: Thomas Swan <tswan@idigx.com>
To: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Tablespaces

Tom Lane wrote:

>Thomas Swan <tswan@idigx.com> writes:

>>Bruce Momjian wrote:

>>>The advantage of symlinks is that an administrator could see how things
>>>are laid out from the command line.

>>That's a poor reason to require symlinks.  The administrator can just as
>>easily open up psql and query pg_tablespace to see that same
>>information.

>Something to keep in mind here is that one of the times you would most
>likely need that information is when the database is broken and you
>*can't* simply "open up psql" and inspect system catalogs.  I like the
>fact that a symlink implementation can be inspected without depending on
>a working database.

That's a sufficient argument, to allow for it.  Recoverability would be
one reason.

>If we were going to build a non-symlink implementation, I'd want the
>highlevel-to-lowlevel data transfer to take the form of a flat ASCII
>file that could be inspected by hand, rather than some hidden in-memory
>datastructure.  But given the previous discussion in this thread,
>I cannot see any strong reason not to rely on symlinks for the purpose.
>We are not in the business of building replacements for OS features.

I do like the flat file output at least for a record of what went
where.  Regardless of whether or not symlinks are used, the admin would
need to know what directories/files/filesystems are to be backed up.

I am concerned as to what extent different filesystems do when you back
the directories up.    Would NTFS containing symlinks be able to be
backed up with a tar/zip command, or is something more elaborate needed?
  In the past, before upgrading, I have had to tar the pgdata directory
with the postmaster shutdown to insure a quick restoration of the
database in case an upgrade didn't proceed uneventfully.  Also, in the
event of a major version upgrade the restored information may or may not
proceed uneventfully.    I just wanted to point out something I thought
might be an issue further down the road.  

Perhaps the system catalog / flat file approach would be a more solid
approach, both of which would not involve replacing or duplicating OS
features.

===

Cc: pgsql-hackers@postgresql.org
From: Marko Karppinen <marko@karppinen.fi>
Subject: Re: [HACKERS] Tablespaces
Date: Mon, 8 Mar 2004 02:07:35 +0200
To: Bruce Momjian <pgman@candle.pha.pa.us>

On 3 March 2004, at 19:52, Bruce Momjian wrote:
> The advantage of symlinks is that an administrator could see how things
> are laid out from the command line.

One thing to keep in mind is that system administrators don't see
symlinks as being informational -- they see them as the actual UI
for the redirection in question. So their expectation is that they'll
be able to move the actual directory around at will (as long as they
update the symlink to match).

If symlinks are used, the rule of least surprise would mean that
no information whatsoever about the physical location of a tablespace
should be stored in the system catalogs. Otherwise their relationship
with the information stored in the symlink is ambiguous.

===

Date: Mon, 8 Mar 2004 08:32:49 -0500
From: Andrew Sullivan <ajs@crankycanuck.ca>
To: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Tablespaces

On Mon, Mar 08, 2004 at 02:07:35AM +0200, Marko Karppinen wrote:
> One thing to keep in mind is that system administrators don't see
> symlinks as being informational -- they see them as the actual UI
> for the redirection in question. So their expectation is that they'll
> be able to move the actual directory around at will (as long as they
> update the symlink to match).

This is a good point.  It's worth keeping in mind, too, that in large
shops, the DBAs and the sysadmins often are in separate departments
with separate management, precisely because the database system has
traditionally been somewhat divorced from the OS (as an aside, I
suspect that this sort of separation is part of the reason for the
popularity of raw filesystems among DBAs.  Even if they didn't
provide better speed, it's just preferable not to have to involve
another department).  System administrators in such places have been
known to decide to "reorganise the disks", assuming that the database
just has its own home.  For such a sysadmin, a pile of symlinks would
be fair game for reorganisation.

===

Date: Mon, 08 Mar 2004 14:56:46 +0100
From: Andreas Pflug <pgadmin@pse-consulting.de>
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Tablespaces

Andrew Sullivan wrote:

>eorganise the disks", assuming that the database
>just has its own home.  For such a sysadmin, a pile of symlinks would
>be fair game for reorganisation.
>  
>

Please take into consideration that symlinks might be every day work for 
*nix admins, but for win admins it's very uncommon. Additionally, win 
admins are accustomed to gui tools, and many of them will stumble if 
forced to use a command line. For worse, junctions are not 
distinguishable in explorer; only the shell's dir command knows about 
junctions. This obfuscation makes junctions quite a bad choice for admin 
purposes.

IMHO there are only two viable options:
- no tablespaces for win32, i.e. recommend *ix for bigger installations
- a different tablespace storage approach., e.g. simple desktop links 
(or alike) redirecting to a directory.


===

Date: Sat, 13 Dec 2003 04:19:46 -0500
From: "Keith C. Perry" <netadmin@vcsn.com>
To: Bruce Momjian <pgman@candle.pha.pa.us>
Cc: Brian Maguire <bmaguire@vantage.com>, pgsql-general@postgresql.org
Subject: Re: [GENERAL] tablespaces in 7.5?

Quoting Bruce Momjian <pgman@candle.pha.pa.us>:

> Brian Maguire wrote:
> > I am curious if tablespaces are going to be seriously targeted
> > for the next version. It really opens up new levels of scalability
> > and is a killer feature from an administration perspective.
> 
> I hope so!  

Excuse my ignorance but what will namespaces give us?  I though PG schema
provided the namespace functionality- 'least the way I am understanding the term.

===

Date: Sat, 13 Dec 2003 09:38:25 -0000 (GMT)
Subject: Re: [GENERAL] tablespaces in 7.5?
From: "John Sidney-Woollett" <johnsw@wardbrook.com>
To: "Keith C. Perry" <netadmin@vcsn.com>
Cc: "Bruce Momjian" <pgman@candle.pha.pa.us>,

My (limited) understanding is that it will give you the ability to:

i) decide what data resides in what tablespace, (database, schema,
indexes, data [coarser -> finer grain]).
ii) where the tablespace data is physically located, allowing you to
distribute your database across disks, or disk arrays.

===

Subject: Re: [GENERAL] tablespaces in 7.5?
Date: Sat, 13 Dec 2003 07:50:27 -0500
From: "Brian Maguire" <bmaguire@vantage.com>
To: <johnsw@wardbrook.com>, "Keith C. Perry" <netadmin@vcsn.com>
Cc: "Bruce Momjian" <pgman@candle.pha.pa.us>, <pgsql-general@postgresql.org>

That's right it's a big one from a performance and admin
perspective.  DB2, Oracle and Informix have tablespaces.  It
appears that it has been in the postgres crosshair for a few
years now.  I'm not sure how much has been completed so far.

 

Few scenarios where they are really important:  

1.  Right now a database can be as fast as one disk.
	 Tablespaces allow you to distribute database
	 objects across multiple physical locations.  A big
	 index or table can live on separate disks
	 distributing the io activity.
	
2.  Say you are close to running out of disk space and want
	to grow some of the data onto another disk.  Table
	spaces allow you to alter the table space and more
	easily move the big table or indexes onto a
	different disk rather than just moving the entire db
	to a bigger single disk.
	
3.  Say there is a part of a database that you want to
	backup every hour, but backing up entire database is
	overkill.  You can set it up so it backs up
	different table spaces at different times.
	
4.  Couple other features of tablespaces are that they allow
	you to allocate space to a specific tablespace and
	to take only part of a database offline or online
	very easily.

 
More detail on what they are how to mange them with oracle...
http://www.engin.umich.edu/caen/wls/software/oracle/server.901/a88856/c04space.htm
 
http://www.siue.edu/~dbock/cmis565/ch8-tablespaces.htm 
http://www-rohan.sdsu.edu/doc/oracle/server803/A54641_01/ch8.htm

===

Date: Sat, 13 Dec 2003 13:18:56 -0500
From: "Keith C. Perry" <netadmin@vcsn.com>
To: Brian Maguire <bmaguire@vantage.com>
Cc: johnsw@wardbrook.com, Bruce Momjian <pgman@candle.pha.pa.us>,
Subject: Re: [GENERAL] tablespaces in 7.5?

Ok, thats for the response.  I take it a PG namespace = Oracle table space (or
namespace is simply the generic term).  I can see some definite benefits
especially with disk i/o throughput though I thought database partitioning (I
think that is what its called) would provide the same thing.

This actually sounds like system that might fit well on a Plan 9 OS.

Anyway, thanks to all for the explanations.

===

From: Mike Nolan <nolan@gw.tssi.com>
Subject: Re: [GENERAL] tablespaces in 7.5?
To: netadmin@vcsn.com (Keith C. Perry)
Date: Sat, 13 Dec 2003 12:41:17 -0600 (CST)
Cc: bmaguire@vantage.com (Brian Maguire), johnsw@wardbrook.com,

> Ok, thats for the response.  I take it a PG namespace = Oracle table space (or
> namespace is simply the generic term).  I can see some definite benefits
> especially with disk i/o throughput though I thought database partitioning (I
> think that is what its called) would provide the same thing.

I could be wrong, but I think 'namespace' is an existing concept in
PG that is a way of organizing objects into logical groups.  

As I recall, the group working on it decided to call it a 'directory' rather 
than a 'tablespace', because of concerns that the latter word might be 
proprietary to Oracle.  I've lost touch with the rest of the members in
that group, though, since the computer I was using for PG development 
purposes got zapped by lightning in August.

===

Date: Sat, 13 Dec 2003 16:24:29 -0500
From: "Keith C. Perry" <netadmin@vcsn.com>
To: Mike Nolan <nolan@gw.tssi.com>
Cc: Brian Maguire <bmaguire@vantage.com>, johnsw@wardbrook.com,
Subject: Re: [GENERAL] tablespaces in 7.5?

Quoting Mike Nolan <nolan@gw.tssi.com>:

> > Ok, thats for the response.  I take it a PG namespace =
> > Oracle table space (or namespace is simply the generic
> > term).  I can see some definite benefits especially with
> > disk i/o throughput though I thought database
> > partitioning (I think that is what its called) would
> > provide the same thing.

> I could be wrong, but I think 'namespace' is an existing concept in
> PG that is a way of organizing objects into logical groups.  
> 
> As I recall, the group working on it decided to call it a 'directory' rather
> than a 'tablespace', because of concerns that the latter word might be 
> proprietary to Oracle.  I've lost touch with the rest of the members in
> that group, though, since the computer I was using for PG development 
> purposes got zapped by lightning in August.

I think that is what I was getting confused with before- schemas...

http://www.postgresql.org/docs/7.4/static/catalog-pg-namespace.html


===

To: pgsql-general@postgresql.org
Subject: Re: [GENERAL] tablespaces in 7.5?
From: Greg Stark <gsstark@mit.edu>
Date: 13 Dec 2003 21:44:23 -0500

"Keith C. Perry" <netadmin@vcsn.com> writes:

> Ok, thats for the response.  I take it a PG namespace = Oracle table space (or
> namespace is simply the generic term).  

Actually if you check back you'll notice you're the first person to say
"namespace". The original question was about "tablespaces"

===

Date: Sun, 14 Dec 2003 00:43:20 -0500
From: "Keith C. Perry" <netadmin@vcsn.com>
To: Greg Stark <gsstark@mit.edu>
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] tablespaces in 7.5?

Quoting Greg Stark <gsstark@mit.edu>:

> "Keith C. Perry" <netadmin@vcsn.com> writes:
> 
> > Ok, thats for the response.  I take it a PG namespace = Oracle table space
> (or
> > namespace is simply the generic term).  
> 
> Actually if you check back you'll notice you're the first person to say
> "namespace". The original question was about "tablespaces"

Whoa, I sure did- my apologies.  That would also explain my confusion.  Ok, so
on PG-

namespaces   = yes, via schemas
tablesspaces = forthcoming

===

Subject: Re: [GENERAL] tablespaces a priority for 7.5?
Date: Wed, 21 Jan 2004 21:03:15 -0500
From: "Brian Maguire" <bmaguire@vantage.com>
To: <lnd@hnit.is>
Cc: <pgsql-general@postgresql.org>

lnd@hnit.is [mailto:lnd@hnit.is] wrote:

> Brian Maquire wrote:

> > Is support for tablespaces a priority feature for 7.5? I 
> > believe there has been significant development in this area 
> > and it seems that postgres' file structure opens it up nicely 
> > to support it.  What are the chances this will be completed?  
> > 
> > In my opinion, it really is a critical feature to support and 
> > administer enterprise databases.  All the major databases 
> > currently support this and it is a compelling enough reason 
> > drive big users from away from using postgres for their 
> > enterprise/large databases.  It really is a database 
> > administrator's feature.  

> In RAID era tablespaces are not such important regarding performance. 
> 
> But for backup/restore - the ability to backup/restore selected tablespaces 
> while leaving other tablespaces is a big thing. 
> The whole point here is: it is assumed that backup/restore of tablespaces can 
> hapen quite quickly and as simple as to copy tablespace files from one 
> location to another(even while database is on - WAL can be used to handle 
> this) - this is compared to dump. 
> 
> For example, index, tempoarary data tablespaces can be lost - not a big deal. 
> 
> Undo(rollback) tablespaces - in a way can be lost as well. 
> While system data tablespace (table structure, stored procedures, etc) - at 
> no cost should be lost.  
> The same way application can be devided in "critical" and "not critical" 
> tablespaces and their backups maintained accordingly. For example, it may not 
> be a big deal to lose year 1996 tables while year 2004 tables should be 
> online. 

I agree that RAID provides similar performance benifits
especially with striping io benifits, however it is powerful
and ideal to have both options.  For example you may have a
set of tables that are read-only for reporting and another
set mostly write only.  You could have they resting on
different raid configurations ideal for each situtation.
 
I also agree there are several admin benifits in the areas
of backup.  You can also more easily create different
frequency/schedules of backup for certain critical tables to
a different schedule than other less important tables.  The
flexibility of easily growing your database beyond the
current disk because of size limitations can be a life
savior.
 
===

To: "Brian Maguire" <bmaguire@vantage.com>
Cc: lnd@hnit.is, pgsql-general@postgresql.org
Subject: Re: [GENERAL] tablespaces a priority for 7.5? 
Date: Wed, 21 Jan 2004 22:12:29 -0500
From: Tom Lane <tgl@sss.pgh.pa.us>

> 	The whole point here is: it is assumed that backup/restore of tablespaces can 
> 	hapen quite quickly and as simple as to copy tablespace files from one 
> 	location to another(even while database is on - WAL can be used to handle 
> 	this) - this is compared to dump. 

This is not going to happen.  We intend to provide tablespaces in the
form of a simple management scheme for table files that are scattered
across multiple directories (typically on different filesystems).
That doesn't make it any safer to copy files behind the database's back.

===

Subject: Re: [GENERAL] tablespaces a priority for 7.5?
From: Cott Lang <cott@internetstaff.com>
To: pgsql-general@postgresql.org
Date: Thu, 22 Jan 2004 06:55:21 -0700

On Wed, 2004-01-21 at 09:05, Brian Maguire wrote:

> In my opinion, it really is a critical feature to support and administer
> enterprise databases.  All the major databases currently support this
> and it is a compelling enough reason drive big users from away from
> using postgres for their enterprise/large databases.  It really is a
> database administrator's feature.  

It seems to me that the lack of point-in-time recovery is a much bigger
roadblock against big users. :(


===

Subject: Re: [GENERAL] tablespaces a priority for 7.5?
Date: Thu, 22 Jan 2004 14:38:42 -0000
From: <lnd@hnit.is>
To: <pgsql-general@postgresql.org>

> On Behalf Of Cott Lang
> It seems to me that the lack of point-in-time recovery is a 
> much bigger roadblock against big users. :(

Meaning incremental (hot)-backups?
Or as protection against DROP/TRUNCATE/DELETE ALL TABLE/SCHEMA/DATABASE? 

With a WAL it should be doable in some 7.x version, all ingredients are
there.

Possibly someone suceeded in doing it already? Having a baseline backup and
saved WAL logs, shouldn't it be possible to recover?

===

Subject: Re: [GENERAL] tablespaces a priority for 7.5?
From: Cott Lang <cott@internetstaff.com>
To: pgsql-general@postgresql.org
Date: Thu, 22 Jan 2004 08:00:12 -0700

On Thu, 2004-01-22 at 07:38, lnd@hnit.is wrote:

> Meaning incremental (hot)-backups?
> Or as protection against DROP/TRUNCATE/DELETE ALL TABLE/SCHEMA/DATABASE? 
> 
> With a WAL it should be doable in some 7.x version, all ingredients are
> there.
> 
> Possibly someone suceeded in doing it already? Having a baseline backup and
> saved WAL logs, shouldn't it be possible to recover?

Incremental pg_dumps would be a huge step in the right direction!

However, unless I am mistaken, a baseline backup would need to be taken
cold because you cannot take a consistent online backup of the data
files without using file system snapshots or split mirrors, and even
that's questionable.

===

Subject: Re: [GENERAL] tablespaces a priority for 7.5?
Date: Thu, 22 Jan 2004 15:23:26 -0000
From: <lnd@hnit.is>
To: <pgsql-general@postgresql.org>

> Incremental pg_dumps would be a huge step in the right direction!

For big people (meaning bid databases) - not shure if pg_dump is the right
direction. Pg_dump must be quite slow also compact. Raw file copy is the way
to go: quick and simple, virtually no configuration is required, no possible
pg_dump bugs - the latest quite important(of course, no DB version and OS
changes are possible, but not really needed - this is recovery, not
migration).

> However, unless I am mistaken, a baseline backup would need 
> to be taken cold because you cannot take a consistent online 
> backup of the data files without using file system snapshots 
> or split mirrors, and even that's questionable.

That's doable: depends what you have in WAL logs. 
In short: a baseline full hot database backup is taken while database is
running, then when recovering WAL logs are put on top of this baseline
backup. WAL logs must actually account for a lot: table, index changes, etc.

===




Cc: pgsql-general@postgresql.org
From: Jeff <threshar@torgo.978.org>
Subject: Re: [GENERAL] tablespaces a priority for 7.5?
Date: Thu, 22 Jan 2004 10:42:09 -0500
To: Cott Lang <cott@internetstaff.com>


On Jan 22, 2004, at 10:00 AM, Cott Lang wrote:

> On Thu, 2004-01-22 at 07:38, lnd@hnit.is wrote:
>
>> Meaning incremental (hot)-backups?
>> Or as protection against DROP/TRUNCATE/DELETE ALL 
>> TABLE/SCHEMA/DATABASE?
>>
>> With a WAL it should be doable in some 7.x version, all ingredients 
>> are
>> there.
>>
>> Possibly someone suceeded in doing it already? Having a baseline 
>> backup and
>> saved WAL logs, shouldn't it be possible to recover?
>
> Incremental pg_dumps would be a huge step in the right direction!
>
> However, unless I am mistaken, a baseline backup would need to be taken
> cold because you cannot take a consistent online backup of the data
> files without using file system snapshots or split mirrors, and even
> that's questionable.

pg_dump always takes a consistent dump - things won't change underneath it.
So you're backup won't have anything that changed after pg_dump 
started.. thus where incremental would come in.

===

To: Cott Lang <cott@internetstaff.com>
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] tablespaces a priority for 7.5? 
Date: Thu, 22 Jan 2004 11:04:30 -0500
From: Tom Lane <tgl@sss.pgh.pa.us>

Cott Lang <cott@internetstaff.com> writes:
> However, unless I am mistaken, a baseline backup would need to be taken
> cold because you cannot take a consistent online backup of the data
> files without using file system snapshots or split mirrors, and even
> that's questionable.

No, it wouldn't.  All you need is to archive WAL beginning from the last
checkpoint record before you begin to take the baseline backup.  The
baseline backup would not be consistent in itself --- but replaying WAL
from the previous checkpoint to any time later than the end of the
backup would bring it to a consistent state.  Missed updates in the
backup are essentially damage that would get repaired by replay.

There are a few trivial things that still have to be done before this
can be a reality (one I can think of is that WAL really needs to have
entries for file creation/deletion), but it's not that far off in terms
of the base mechanisms.  Writing the management software is the main
task.

There is a lot more info available in the pghackers archives.

===

Subject: Re: [GENERAL] tablespaces a priority for 7.5?
From: Cott Lang <cott@internetstaff.com>
To: pgsql-general@postgresql.org
Date: Thu, 22 Jan 2004 09:15:42 -0700


> > However, unless I am mistaken, a baseline backup would need to be taken
> > cold because you cannot take a consistent online backup of the data
> > files without using file system snapshots or split mirrors, and even
> > that's questionable.
> >
> 
> pg_dump always takes a consistent dump - things won't change underneath 
> it.
> So you're backup won't have anything that changed after pg_dump 
> started.. thus where incremental would come in.

Sorry, I'm referring to two entirely different things there. :)

1. Being able to do incremental pg_dumps would be a big plus, because
you could take much quicker dumps and thus do it more regularly. I have
a 50GB database I dump every 3 hours that takes 35 minutes to dump. :(

2. The rest was my hypothesizing on what might be necessary for
point-in-time recovery, which pg_dump isn't going to allow even if you
can perform incremental dumps. 

===

Subject: Re: [GENERAL] tablespaces a priority for 7.5?
From: Cott Lang <cott@internetstaff.com>
To: pgsql-general@postgresql.org
Date: Thu, 22 Jan 2004 09:25:45 -0700

On Thu, 2004-01-22 at 09:04, Tom Lane wrote:

> No, it wouldn't.  All you need is to archive WAL beginning from the last
> checkpoint record before you begin to take the baseline backup.  The
> baseline backup would not be consistent in itself --- but replaying WAL
> from the previous checkpoint to any time later than the end of the
> backup would bring it to a consistent state.  Missed updates in the
> backup are essentially damage that would get repaired by replay.

I will experiment with this. I have plenty of databases to wreak havoc on. :)

> There are a few trivial things that still have to be done before this
> can be a reality (one I can think of is that WAL really needs to have
> entries for file creation/deletion), but it's not that far off in terms
> of the base mechanisms.  Writing the management software is the main
> task.

Being able to write WAL logs to two locations would be handy.  Is it
currently possible to have the logs not be recycled and occasionally
deleted?

Is there any hope for support for all of this in 7.5?   I dread knowing
that at some point, I may be forced kicking and screaming back to Oracle
because of this. :)

===

To: Cott Lang <cott@internetstaff.com>
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] tablespaces a priority for 7.5? 
Date: Thu, 22 Jan 2004 11:36:01 -0500
From: Tom Lane <tgl@sss.pgh.pa.us>

Cott Lang <cott@internetstaff.com> writes:
> 2. The rest was my hypothesizing on what might be necessary for
> point-in-time recovery, which pg_dump isn't going to allow even if you
> can perform incremental dumps. 

Right.  There seems to be some confusion about that in this thread,
so just for the record: pg_dump has nothing to do with our plans for
point-in-time recovery (or incremental backup, which is essentially the
same thing).  To do these, you would take a *physical* dump of the
database directory as a baseline, and thereafter copy WAL segments off
to tape or whatever you are using as archive media.  Recovery would
consist of restoring the physical baseline dump, and then replaying WAL
against it up to whatever point in time you wanted to recover to.  You
would, therefore, need to keep a continuous sequence of WAL files back
to the time of your most recent baseline backup.

===

Date: Thu, 22 Jan 2004 17:07:49 -0000 (GMT)
Subject: Re: [GENERAL] tablespaces a priority for 7.5?
From: "John Sidney-Woollett" <johnsw@wardbrook.com>
To: "Tom Lane" <tgl@sss.pgh.pa.us>
Cc: "Cott Lang" <cott@internetstaff.com>, pgsql-general@postgresql.org

Tom Lane said:
> To do these, you would take a *physical* dump of the
> database directory as a baseline, and thereafter copy WAL segments off
> to tape or whatever you are using as archive media.  Recovery would
> consist of restoring the physical baseline dump, and then replaying WAL
> against it up to whatever point in time you wanted to recover to.  You
> would, therefore, need to keep a continuous sequence of WAL files back
> to the time of your most recent baseline backup.

This is similar to Oracle...

With Oracle you have the option of EITHER exporting the database
(equivalent to doing pg_dump) OR taking the database offline and carrying
out a file system level copy of the database files.

Recovery can be accomplished by either restoring the backup database files
or creating a new database by importing the most recent export/dump file.

After that you apply the archive logs (equivalent to WAL segments) to
bring your database up to date.

One caveat for Oracle (at least in 8.1.x) is that DDL statements are not
recorded in the archive logs, and can screw things up. Best to take a new
export/dump after making DDL changes!

The difference between Oracle and Postgres appears to be that posgres
requires a file system level copy of the database instead of being able to
make use of a dump file for this type of recovery. Is that correct?

===


To: Cott Lang <cott@internetstaff.com>
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] tablespaces a priority for 7.5? 
Date: Thu, 22 Jan 2004 12:16:26 -0500
From: Tom Lane <tgl@sss.pgh.pa.us>

Cott Lang <cott@internetstaff.com> writes:
> Being able to write WAL logs to two locations would be handy.  Is it
> currently possible to have the logs not be recycled and occasionally
> deleted?

The current thought is to add some sort of hook that allows a
user-defined action to be taken at the point where a WAL file would
normally get recycled.  Presumably this action would involve copying the
WAL file to someplace else (eg a tape), after which it could get
recycled.  If you've got any thoughts on exactly how to manage this,
let's take up a discussion on pgsql-hackers.

> Is there any hope for support for all of this in 7.5?

People are thinking about it, but I'm not seeing a lot of work getting
done.  (Partly my fault, since this is one of the things I'm supposed to
be working on...)

===
	
To: johnsw@wardbrook.com
Cc: "Cott Lang" <cott@internetstaff.com>, pgsql-general@postgresql.org
Subject: Re: [GENERAL] tablespaces a priority for 7.5? 
Date: Thu, 22 Jan 2004 12:24:30 -0500
From: Tom Lane <tgl@sss.pgh.pa.us>

"John Sidney-Woollett" <johnsw@wardbrook.com> writes:
> With Oracle you have the option of EITHER exporting the database
> (equivalent to doing pg_dump) OR taking the database offline and carrying
> out a file system level copy of the database files.

The planned PITR feature would not require you to take anything offline.
The whole concept of an "offline" database is an Oracle-ism that I see
no value in emulating.

> One caveat for Oracle (at least in 8.1.x) is that DDL statements are not
> recorded in the archive logs, and can screw things up. Best to take a new
> export/dump after making DDL changes!

Well, we're ahead of them on that...

> The difference between Oracle and Postgres appears to be that posgres
> requires a file system level copy of the database instead of being able to
> make use of a dump file for this type of recovery. Is that correct?

I dunno what a "dump file" would equate to in Postgres terms, but yeah,
we're envisioning using ordinary filesystem tools (tar, say) as the
mechanism for handling a baseline backup.

===

Date: Thu, 22 Jan 2004 17:30:38 -0000 (GMT)
Subject: Re: [GENERAL] tablespaces a priority for 7.5?
From: "John Sidney-Woollett" <johnsw@wardbrook.com>
To: "Tom Lane" <tgl@sss.pgh.pa.us>
Cc: johnsw@wardbrook.com, "Cott Lang" <cott@internetstaff.com>,

Tom Lane said:
> The planned PITR feature would not require you to take anything offline.
> The whole concept of an "offline" database is an Oracle-ism that I see
> no value in emulating.

> I dunno what a "dump file" would equate to in Postgres terms, but yeah,
> we're envisioning using ordinary filesystem tools (tar, say) as the
> mechanism for handling a baseline backup.

(I think) that was the point I was trying to make. Oracle lets you backup
while up and running (using export) OR when offline (copying db files)...

However, it looks like postgres will require a file system level backup to
recover from. Or will you be able to build a new db from the dump file,
and then apply the WAL segments?


===


Subject: Re: [GENERAL] tablespaces a priority for 7.5?
From: Cott Lang <cott@internetstaff.com>
To: pgsql-general@postgresql.org
Date: Thu, 22 Jan 2004 10:58:47 -0700

> With Oracle you have the option of EITHER exporting the database
> (equivalent to doing pg_dump) OR taking the database offline and carrying
> out a file system level copy of the database files.

OR putting the tablespaces in hot backup mode and carrying out a file
system level copy of the database files, which is the option I certainly
prefer. :)

> Recovery can be accomplished by either restoring the backup 
> database files or creating a new database by importing the most 
> recent export/dump file.

> After that you apply the archive logs (equivalent to WAL segments)
> to bring your database up to date.

You can't do that with an export in Oracle. Archive logs are only useful
to apply to a cold or hot backup of the data files. 

Export with Postgres works about the same as with Oracle.  It's the lack
of well-supported hot backups in Postgres that are the limitation. 

> One caveat for Oracle (at least in 8.1.x) is that DDL statements
> are not recorded in the archive logs, and can screw things up. 
> Best to take a new export/dump after making DDL changes!

I've not heard that one. That would make hot standby databases a serious
maintenance problem. :)


===

Subject: Re: [GENERAL] tablespaces a priority for 7.5?
Date: Thu, 22 Jan 2004 18:08:19 -0000
From: <lnd@hnit.is>
To: <pgsql-general@postgresql.org>



John Sidney-Woollett [mailto:johnsw@wardbrook.com]  wrote:

> lnd@hnit.is said:

> >> (I think) that was the point I was trying to make. Oracle lets you 
> >> backup while up and running (using export) OR when offline 
> >> (copying 
> >> db files)...

> > Not neccesseraly offline. In oracle you take tablespace
> > in backup mode and just use OCOPY(I believe you must
> > watch out that file copy is performed at OS block sizes
> > not smaller than db page size - OCOPY does that) to copy
> > files.  REDO logs (i.e. WAL logs) will bring copied
> > files in synch.  Or one can use RMAN tool which is the
> > same as to copy files just it makes book-keeping for
> > backups and discards empty pages from backuped file.

> You obviously know Oracle better than me - I stand corrected! :)
> 
> Actually I was trying to figure out whether the WAL segments 
> could be applied to a database rebuilt using a file generated 
> by pg_dump. I suspect not, and doesn't that mean that the 
> postmaster needs to be stopped to carry out the file system 
> level copy of the pg files?

It's interesting to know what pg WAL's are:

If they are page level logs, then you theoretically can have a file system
backup - which is fast, reliable, no issues like pg_dump has, i.e. who is
first/last regarding stored procedures/views/triggers, to have users
precreated, etc, etc.

If the are SQL statements which are REPLAYD after pg_dump restore: well, then
file system backup has no chance. The issue is that you are missing index
pages and probably much more.

===

To: johnsw@wardbrook.com
Cc: "Cott Lang" <cott@internetstaff.com>, pgsql-general@postgresql.org
Subject: Re: [GENERAL] tablespaces a priority for 7.5? 
Date: Thu, 22 Jan 2004 13:10:05 -0500
From: Tom Lane <tgl@sss.pgh.pa.us>

"John Sidney-Woollett" <johnsw@wardbrook.com> writes:
> Tom Lane said:
>> I dunno what a "dump file" would equate to in Postgres terms, but yeah,
>> we're envisioning using ordinary filesystem tools (tar, say) as the
>> mechanism for handling a baseline backup.

> However, it looks like postgres will require a file system level backup to
> recover from. Or will you be able to build a new db from the dump file,
> and then apply the WAL segments?

What dump file?  I'm trying to say that we have no such concept and no
intention of inventing one.

===

From: "Rick Gigger" <rick@alpinenetworking.com>
To: "Cott Lang" <cott@internetstaff.com>, <pgsql-general@postgresql.org>
Subject: Re: [GENERAL] tablespaces a priority for 7.5?
Date: Thu, 22 Jan 2004 12:09:17 -0700


> On Wed, 2004-01-21 at 09:05, Brian Maguire wrote:
>
> > In my opinion, it really is a critical feature to support and administer
> > enterprise databases.  All the major databases currently support this
> > and it is a compelling enough reason drive big users from away from
> > using postgres for their enterprise/large databases.  It really is a
> > database administrator's feature.
>
> It seems to me that the lack of point-in-time recovery is a much bigger
> roadblock against big users. :(

 This is certainly my feeling.

===

From: Julian North <jnorth@lastminute.com>
To: "'Rick Gigger'" <rick@alpinenetworking.com>,
Subject: Re: [GENERAL] tablespaces a priority for 7.5?
Date: Thu, 22 Jan 2004 19:27:24 -0000

speaking as someone currently migrating enterprise stuff to postgres....

point-in-time is definitely the biggest issue.

this is the main thing we are having to look reproducing using some form of 
replication to an alternative server.

===

Subject: Re: [HACKERS] [GENERAL] tablespaces a priority for 7.5? 
Date: Thu, 22 Jan 2004 17:56:02 -0000
From: <lnd@hnit.is>
To: <pgsql-hackers@postgresql.org>

>  [mailto:pgsql-general-owner@postgresql.org] On Behalf Of Tom Lane
> The current thought is to add some sort of hook that allows a 
> user-defined action to be taken at the point where a WAL file 
> would normally get recycled.  Presumably this action would 
> involve copying the WAL file to someplace else (eg a tape), 
> after which it could get recycled.  If you've got any 
> thoughts on exactly how to manage this, let's take up a 
> discussion on pgsql-hackers.

In practice it is quite OK just to have parameters like those:

	-Are database WAL logs recycable or not, i.e. accumulates.

	-location where to put(location's path, possibly more than one
location) those accumulating logs
	-is the location mandatory or not
	-log file name format: date/time/log sequence number

What's quite good to have are system catalog views for WAL logs:

	log_history view: listing every log produced together with it's
sequence number, physical path - logs are assumed to have sequence numbers.
This view can recycle after N days based on some parameter.
	
	From that view or from another it would be good to find out the last
applied WAL log when recovering.

	What for are those views: they can be used to delete (yes, cleaning
after yourself is the issue) unneccesary logs (the recycling of log_history
view can/must be based on that criteria). 

	More: this view would allow to figure out what logs to backup after
baseline backup (it is assumed that one may tell which WAL log is the
current/last from this view or another one). Vice versa, this view or another
one can be used to find out what WAL logs are applied during recovery andwhat
are still needed compared to the original database.
	

	And a SQL command to complete a current WAL log
(switch to a new) is handy.

===

From: "Simon Riggs" <simon@2ndquadrant.com>
To: "'Tom Lane'" <tgl@sss.pgh.pa.us>
Cc: "'Gavin Sherry'" <swm@linuxworld.com.au>, <tswan@idigx.com>,
Subject: Re: [HACKERS] Out of space situation and WAL log pre-allocation (was Tablespaces)
Date: Tue, 2 Mar 2004 22:53:09 -0000

Tom Lane [mailto:tgl@sss.pgh.pa.us]
> "Simon Riggs" <simon@2ndquadrant.com> writes:

> > You're absolutely right about the not-knowing when
> > you're out of space issue. However, if the xlog has been
> > written then it is not desirable, but at least
> > acceptable that the checkpoint/bgwriter cannot complete
> > on an already committed txn. It's not the txn which is
> > getting the error, that's all.

> Right.  This is in fact not a fatal situation, as long as you don't run
> out of preallocated WAL space.  

...following on also from thoughts on [PERFORM] list...

Clearly running out of pre-allocated WAL space is likely to be the next
issue. Running out of space in the first place is likely to be because
of an intense workload, which is exactly the thing which also makes you
run out of pre-allocated WAL space. Does that make sense?

===

To: simon@2ndquadrant.com
Cc: "'Gavin Sherry'" <swm@linuxworld.com.au>, tswan@idigx.com,
Subject: Re: [HACKERS] Out of space situation and WAL log pre-allocation (was Tablespaces) 
Date: Tue, 02 Mar 2004 23:12:23 -0500
From: Tom Lane <tgl@sss.pgh.pa.us>

"Simon Riggs" <simon@2ndquadrant.com> writes:
> Tom Lane wrote:
>> Right.  This is in fact not a fatal situation, as long as you don't
>> run out of preallocated WAL space.  

> Clearly running out of pre-allocated WAL space is likely to be the next
> issue. Running out of space in the first place is likely to be because
> of an intense workload, which is exactly the thing which also makes you
> run out of pre-allocated WAL space. Does that make sense?

I think one of the first things people would do with tablespaces is
stick the data files onto a separate partition from the WAL and clog
files.  (Actually you can do this today with a simple symlink hack, but
tablespaces will make it easier and clearer.)  The space usage for WAL
is really pretty predictable, because of the checkpoint-at-least-
every-N-segments setting.  clog is not exactly a space hog either.
Once you have that separation established, out-of-disk-space can kill
individual transactions but never the database as a whole.

One of the things that bothers me about the present PITR design is that
it presumes that individual WAL log segments can be kept until the
external archiver process feels like writing them somewhere.  If there's
no guarantee that that happens within X amount of time, then you can't
bound the amount of space needed on the WAL drive, and so you are back
facing the possibility of an out-of-WAL-space panic.  I suspect that we
cannot really do anything about that, but it's annoying.  Any bright
ideas out there?

===

To: Joe Conway <mail@joeconway.com>
Cc: simon@2ndquadrant.com, "'Gavin Sherry'" <swm@linuxworld.com.au>,
Subject: Re: [HACKERS] Out of space situation and WAL log pre-allocation (was Tablespaces) 
Date: Tue, 02 Mar 2004 23:55:03 -0500
From: Tom Lane <tgl@sss.pgh.pa.us>

Joe Conway <mail@joeconway.com> writes:
> Tom Lane wrote:
>> facing the possibility of an out-of-WAL-space panic.  I suspect that we
>> cannot really do anything about that, but it's annoying.  Any bright
>> ideas out there?

> Maybe specify an archive location (that of course could be on a separate 
> partition) that the external archiver should check in addition to the 
> normal WAL location. At some predetermined interval, push WAL log 
> segments no longer needed to the archive location.

Does that really help?  The panic happens when you fill the "normal" and
"archive" partitions, how's that different from one partition?

===

From: "Simon Riggs" <simon@2ndquadrant.com>
To: "'Tom Lane'" <tgl@sss.pgh.pa.us>, "'Joe Conway'" <mail@joeconway.com>
Cc: "'Gavin Sherry'" <swm@linuxworld.com.au>, <tswan@idigx.com>,
Subject: Re: [HACKERS] Out of space situation and WAL log pre-allocation (was Tablespaces) 
Date: Wed, 3 Mar 2004 21:40:09 -0000

>Tom Lane [mailto:tgl@sss.pgh.pa.us]
> Joe Conway <mail@joeconway.com> writes:
> > Tom Lane wrote:
> >> Joe Conway <mail@joeconway.com> writes:

> >>> Maybe specify an archive location (that of course
> >>> could be on a separate partition) that the external
> >>> archiver should check in addition to the
> >>> normal WAL location. At some predetermined interval, push WAL log
> >>> segments no longer needed to the archive location.

> >> Does that really help?  The panic happens when you fill
> >> the "normal" and "archive" partitions, how's that
> >> different from one partition?

> > I see your point. But it would allow you to use a
> > relatively modest local partition for WAL segments,
> > while you might be using a 1TB netapp tray over NFS for
> > the archive segments.

> Fair enough, but it seems to me that that sort of setup
> really falls in the category of a user-defined archiving
> process --- that is, the hook that Postgres calls will
> push WAL segments from the local partition to the NFS
> server, and then pushing them off NFS to tape is the
> responsibility of some other user-defined subprocess.
> Database panic happens if and only if the local partition
> overflows.  I don't see that making Postgres explicitly
> aware of the secondary NFS arrangement will buy anything.

Tom's last sentence there summarises the design I was working with. I
had considered Joe's suggested approach (which was Oracle's also).

However, the PITR design will come with a usable low-function program
which can easily copy logs from pg_xlog to another archive directory.
That's needed as a test harness anyway, so it may as well be part of the
package. You'd be able to use that in production to copy xlogs to
another larger directory as a staging area to tape/failover on another
system: effectively Joe's idea is catered for in the basic package.

Anyway I'm answering questions before publishing the design as
stands...though people do keep spurring me to refine it as I'm writing
it down! That's why its good to document it I guess.
 
> > I guess if the archive partition fills up, I would err on the side
of
> > dropping archive segments on the floor.
> 
> That should be user-scriptable policy, in my worldview.

Hmmm. Very difficult that one.

My experience is in commercial systems. Dropping archive segments on the
floor is just absolutely NOT GOOD, if that is the only behaviour. The
whole purpose of having a dbms is so that you can protect your business
data, while using it. Such behaviour would most likely be a barrier to
wider commercial adoption. [Oracle and other dbms will freeze when this
situation is hit, rather than continue and drop archive logs.]
 
User-selectable behaviour? OK. That's how we deal with fsync; I can
relate to that. That hadn't been part of my thinking because of the
importance I'd attached to the log files themselves, but I can go with
that, if that's what was meant.

So, if we had a parameter called Wal_archive_policy that has 3 settings:
None = no archiving
Optimistic = archive, but if for some reason log space runs out then
make space by dropping the oldest archive logs 
Strict = if log space runs out, stop further write transactions from
committing, by whatever means, even if this takes down dbms.

That way, we've got something akin to transaction isolation level with
various levels of protection.

===

To: Joe Conway <mail@joeconway.com>
Cc: simon@2ndquadrant.com, "'Gavin Sherry'" <swm@linuxworld.com.au>,
Subject: Re: [HACKERS] Out of space situation and WAL log pre-allocation (was Tablespaces) 
Date: Wed, 03 Mar 2004 17:10:01 -0500
From: Tom Lane <tgl@sss.pgh.pa.us>

Joe Conway <mail@joeconway.com> writes:
> Simon Riggs wrote:
>> O... and other dbms will freeze when this situation is hit, rather
>> than continue and drop archive logs.]

> Been there, done that, don't see how it's any better. I hesitate to be
> real specific here, but let's just say the end result was restore from 
> backup :-(

It's hard for me to imagine a situation in which killing the database
would be considered a more attractive option than dropping old log
data.  You may or may not ever need the old log data, but you darn well
do need a functioning database.  (If you don't, you wouldn't be going to
all this work.)

I think also that Simon completely misunderstood my intent in saying
that this could be "user-scriptable policy".  By that I meant that the
*user* could write the code to behave whichever way he liked.  Not that
we were going to go into a mad rush of feature invention and try to
support every combination we could think of.  I repeat: code that pushes
logs into a secondary area is not ours to write.  We should concentrate
on providing an API that lets users write it.  We have only limited
manpower for this project and we need to spend it on getting the core
functionality done right, not on inventing frammishes.

===

From: "Simon Riggs" <simon@2ndquadrant.com>
To: "'Joe Conway'" <mail@joeconway.com>
Cc: "'Tom Lane'" <tgl@sss.pgh.pa.us>, "'Gavin Sherry'" <swm@linuxworld.com.au>,
Subject: Re: [HACKERS] Out of space situation and WAL log pre-allocation (was Tablespaces)
Date: Mon, 8 Mar 2004 23:28:25 -0000

>Joe Conway [mailto:mail@joeconway.com]
> Simon Riggs wrote:
> >> Tom Lane [mailto:tgl@sss.pgh.pa.us] That should be user-scriptable
> >> policy, in my worldview.
> 
> > O... and other dbms will freeze when this situation is hit, rather
> > than continue and drop archive logs.]
> 
> Been there, done that, don't see how it's any better. I hesitate to be
> real specific here, but let's just say the end result was restore from
> backup :-(
> 
> > So, if we had a parameter called Wal_archive_policy that has 3
> > settings: None = no archiving Optimistic = archive, but if for some
> > reason log space runs out then make space by dropping the oldest
> > archive logs Strict = if log space runs out, stop further write
> > transactions from committing, by whatever means, even if this takes
> > down dbms.
> 
> That sounds good to me. For the "Optimistic" case, we need to yell
> loudly if we do find ourselves needing to drop segments. For the
> "Strict" case, we just need to be sure it works correctly ;-)

Good.

Yell loudly really needs to happen sometime earlier, which is as Gavin
originally thought something to do with tablespaces.

Strict behaviour is fairly straightforward, you just PANIC!

I'd think we could rename these to
Fail Operational rather than Optimistic
Fail Safe rather than Strict
...the other names were a bit like "I'm right" and "but I'll do yours
too" ;}

===

From: "Simon Riggs" <simon@2ndquadrant.com>
To: "'Tom Lane'" <tgl@sss.pgh.pa.us>, "'Joe Conway'" <mail@joeconway.com>
Cc: "'Gavin Sherry'" <swm@linuxworld.com.au>, <tswan@idigx.com>,
Subject: Re: [HACKERS] Out of space situation and WAL log pre-allocation (was Tablespaces) 
Date: Mon, 8 Mar 2004 23:28:25 -0000


Please excuse the delay in replying..

>Tom Lane
> Joe Conway <mail@joeconway.com> writes:
> > Simon Riggs wrote:
> >> O... and other dbms will freeze when this situation is hit, rather
> >> than continue and drop archive logs.]
> 
> > Been there, done that, don't see how it's any better. I hesitate to
be
> > real specific here, but let's just say the end result was restore
from
> > backup :-(

Myself also. I accept your experience and insight, I apologise if my own
seemed overblown. My take on that is that if you're in a situation that
has a high probability of going bad, the last thing you would want is to
drop xlogs. Same technical experience, different viewpoint on what to
learn from it. 

> It's hard for me to imagine a situation in which killing the database
> would be considered a more attractive option than dropping old log
> data.  You may or may not ever need the old log data, but you darn
well
> do need a functioning database.  (If you don't, you wouldn't be going
to
> all this work.)

The main point here for me is that the choice of keeping archived (not
old) log files against keeping the database up isn't actually mine to
make; that choice belongs to the owner of the database, not me as
developer or administrator, consultant or whatever. 

Although I admit I did not at first comprehend that such a view was
possible, I did flex to allow yours and Joe's perspective when that was
voiced.

The point is one of risk: does the owner wish to risk the possibility
that a transaction may be lost in order to keep the database up? The
possibility of lost rows must be balanced against the probably higher
possibility of being unable to write new data. But which is worse? Who
can say?

In some environments where I have worked, (again forgive any seeming
personal arrogance or posturing), such as banks or finance generally, it
has been desirable to stop the system rather than risk losing even a
single row. In other situations, lost rows must be balanced against the
money lost through downtime. Guess it depends whether you've got a
contract for uptime or for data integrity?? ;)

> I repeat: code that pushes
> logs into a secondary area is not ours to write.  We should
concentrate
> on providing an API that lets users write it.  

Agreed.

> We have only limited
> manpower for this project and we need to spend it on getting the core
> functionality done right, not on inventing frammishes.

Love that word "frammish"...seriously, I understand and agree.

My understanding is that existing logic will cause a PANIC if the xlog
directory cannot be written to. Helping the database stay up by dropping
logs would require extra code...

This was an edge case anyhow...

===

From: "Simon Riggs" <simon@2ndquadrant.com>
To: "'Joe Conway'" <mail@joeconway.com>
Cc: "'Tom Lane'" <tgl@sss.pgh.pa.us>, "'Gavin Sherry'" <swm@linuxworld.com.au>,
Subject: Re: [HACKERS] Out of space situation and WAL log pre-allocation (was Tablespaces)
Date: Mon, 8 Mar 2004 23:28:25 -0000

>Joe Conway [mailto:mail@joeconway.com]
> Simon Riggs wrote:
> >> Tom Lane [mailto:tgl@sss.pgh.pa.us] That should be user-scriptable
> >> policy, in my worldview.
> 
> > O... and other dbms will freeze when this situation is hit, rather
> > than continue and drop archive logs.]
> 
> Been there, done that, don't see how it's any better. I hesitate to be
> real specific here, but let's just say the end result was restore from
> backup :-(
> 
> > So, if we had a parameter called Wal_archive_policy that has 3
> > settings: None = no archiving Optimistic = archive, but if for some
> > reason log space runs out then make space by dropping the oldest
> > archive logs Strict = if log space runs out, stop further write
> > transactions from committing, by whatever means, even if this takes
> > down dbms.
> 
> That sounds good to me. For the "Optimistic" case, we need to yell
> loudly if we do find ourselves needing to drop segments. For the
> "Strict" case, we just need to be sure it works correctly ;-)

Good.

Yell loudly really needs to happen sometime earlier, which is as Gavin
originally thought something to do with tablespaces.

Strict behaviour is fairly straightforward, you just PANIC!

I'd think we could rename these to
Fail Operational rather than Optimistic
Fail Safe rather than Strict
...the other names were a bit like "I'm right" and "but I'll do yours
too" ;}

===


To: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Out of space situation and WAL log pre-allocation (was Tablespaces)
From: Greg Stark <gsstark@mit.edu>
Date: 09 Mar 2004 22:28:09 -0500


"Simon Riggs" <simon@2ndquadrant.com> writes:

> Strict behaviour is fairly straightforward, you just PANIC!

There is another mode possible as well. Oracle for example neither panics nor
continues, it just freezes. It keeps retrying the transaction until it finds
it has space.  

The sysadmin or dba just has to somehow create additional space by removing
old files or however and the database will continue where it left off. That
seems a bit nicer than panicing.

When I first heard that I was shocked. It means implementing archive logs
*created* a new failure mode where there was none before. I thought that was
the dumbest idea in the world: who needed a backup process that increased the
chances of an outage? Now I can see the logic, but I'm still not sure which
mode I would pick if it was up to me. As others have said, I guess it would
depend on the situation.

===

To: Bruce Momjian <pgman@candle.pha.pa.us>
Cc: Greg Stark <gsstark@mit.edu>, pgsql-hackers@postgresql.org,
Subject: Re: [pgsql-hackers-win32] [HACKERS] Tablespaces 
Date: Wed, 03 Mar 2004 00:18:39 -0500
From: Tom Lane <tgl@sss.pgh.pa.us>

Bruce Momjian <pgman@candle.pha.pa.us> writes:
> For tablespaces on OS's that don't support it, I think we will have to
> store the path name in the file and read it via the backend.  Somehow we
> should cache those lookups.

My feeling is that we need not support tablespaces on OS's without
symlinks.

===

From: Bruce Momjian <pgman@candle.pha.pa.us>
Subject: Re: [pgsql-hackers-win32] [HACKERS] Tablespaces
To: Tom Lane <tgl@sss.pgh.pa.us>
Date: Wed, 3 Mar 2004 00:20:06 -0500 (EST)
Cc: Greg Stark <gsstark@mit.edu>, pgsql-hackers@postgresql.org,

Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > For tablespaces on OS's that don't support it, I think we will have to
> > store the path name in the file and read it via the backend.  Somehow we
> > should cache those lookups.
> 
> My feeling is that we need not support tablespaces on OS's without
> symlinks.

Agreed, but are we going to support non-tablespace installs?  I wasn't
sure that was an option.

===

Date: Wed, 3 Mar 2004 16:25:12 +1100 (EST)
From: Gavin Sherry <swm@linuxworld.com.au>
To: Tom Lane <tgl@sss.pgh.pa.us>
Cc: Bruce Momjian <pgman@candle.pha.pa.us>, Greg Stark <gsstark@mit.edu>,
Subject: Re: [pgsql-hackers-win32] [HACKERS] Tablespaces 

On Wed, 3 Mar 2004, Tom Lane wrote:

> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > For tablespaces on OS's that don't support it, I think we will have to
> > store the path name in the file and read it via the backend.  Somehow we
> > should cache those lookups.
>
> My feeling is that we need not support tablespaces on OS's without
> symlinks.

I'm going to focus on implementing this on the system(s) I'm used to
developing on (ie, those which support symlinks). Once that is done, I'll
talk with the Win32 guys about what, if anything, we can do about getting
this to work on Win32 (and possibly other non-symlink supporting OSs).

===


From: Claudio Natoli <claudio.natoli@memetrics.com>
To: "'Gavin Sherry'" <swm@linuxworld.com.au>, Tom Lane <tgl@sss.pgh.pa.us>
Cc: Bruce Momjian <pgman@candle.pha.pa.us>, Greg Stark <gsstark@mit.edu>,
Subject: Re: [pgsql-hackers-win32] [HACKERS] Tablespaces 
Date: Wed, 3 Mar 2004 21:00:07 +1100 



Gavin Sherry wrote:
> I'm going to focus on implementing this on the system(s) I'm used to
> developing on (ie, those which support symlinks). Once that is done, I'll
> talk with the Win32 guys about what, if anything, we can do about getting
> this to work on Win32 (and possibly other non-symlink supporting OSs).

If the design can allow the default tablespace to not require symlinks, then
I think we're in good shape, as we can simply choose to introduce
tablespaces under Win32 whenever it suits.

===

Subject: Re: [pgsql-hackers-win32] [HACKERS] Tablespaces
Date: Wed, 3 Mar 2004 12:32:43 +0100
From: "Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at>
To: "Bruce Momjian" <pgman@candle.pha.pa.us>, "Tom Lane" <tgl@sss.pgh.pa.us>
Cc: "Greg Stark" <gsstark@mit.edu>, <pgsql-hackers@postgresql.org>,


> > > For tablespaces on OS's that don't support it, I think we will have to
> > > store the path name in the file and read it via the backend.  Somehow we
> > > should cache those lookups.
> > 
> > My feeling is that we need not support tablespaces on OS's without
> > symlinks.

To create symlinked directories on Win2k NTFS see:
	http://www.sysinternals.com/ntw2k/source/misc.shtml#junction

I think Win2000 or XP would be a reasonable restriction for Win32 PG 
installations that want tablespaces.

===

From: Bruce Momjian <pgman@candle.pha.pa.us>
Subject: Re: [pgsql-hackers-win32] [HACKERS] Tablespaces
To: Tom Lane <tgl@sss.pgh.pa.us>
Date: Wed, 3 Mar 2004 08:31:07 -0500 (EST)
Cc: Greg Stark <gsstark@mit.edu>, pgsql-hackers@postgresql.org,

Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Tom Lane wrote:
> >> My feeling is that we need not support tablespaces on OS's without
> >> symlinks.
> 
> > Agreed, but are we going to support non-tablespace installs?  I wasn't
> > sure that was an option.
> 
> A setup containing only the default tablespace cannot use any symlinks.
> That doesn't seem hard though.

Yea, I think you are right.  We just disable CREATE TABLESPACE and the
rest should just work.  Basically, pg_tablespace will only have one
entry on those platforms.  The initdb directory structure will have a
single tablespace, but that doesn't use symlinks.

===

From: Bruce Momjian <pgman@candle.pha.pa.us>
Subject: Re: [pgsql-hackers-win32] [HACKERS] Tablespaces
To: Claudio Natoli <claudio.natoli@memetrics.com>
Date: Wed, 3 Mar 2004 08:32:05 -0500 (EST)
Cc: Greg Stark <gsstark@mit.edu>, pgsql-hackers@postgresql.org,

Claudio Natoli wrote:
> 
> Bruce Momjian writes:
> > I just checked from the MinGW console and I see:
> > [snip]
> > It accepts ln -s, but does nothing with it.
> 
> And even if it had worked, it wouldn't really matter, since we don't
> actually want to *run* the system under MinGW/msys, just build it.
> 
> I think the idea of implementing in symlinks for non-compliant platforms in
> md.c has some merit. FWIW, looks like that is how cygwin implements
> symlinks...

Why can't we use MS Win32 shortcut files to simulate symlinks?  MinGW
doesn't do it, so I suppose it isn't possible.

===

Subject: Re: [pgsql-hackers-win32] [HACKERS] Tablespaces
Date: Wed, 3 Mar 2004 14:46:44 +0100
From: "Magnus Hagander" <mha@sollentuna.net>
To: "Bruce Momjian" <pgman@candle.pha.pa.us>,
Cc: "Greg Stark" <gsstark@mit.edu>, <pgsql-hackers@postgresql.org>,


> > > I just checked from the MinGW console and I see:
> > > [snip]
> > > It accepts ln -s, but does nothing with it.
> > 
> > And even if it had worked, it wouldn't really matter, since 
> we don't 
> > actually want to *run* the system under MinGW/msys, just build it.
> > 
> > I think the idea of implementing in symlinks for non-compliant 
> > platforms in md.c has some merit. FWIW, looks like that is 
> how cygwin 
> > implements symlinks...
> 
> Why can't we use MS Win32 shortcut files to simulate 
> symlinks?  MinGW doesn't do it, so I suppose it isn't possible.

Shortcut files are *only* a GUI feature. If you do a "dir" in the
filesystem they show up as a file. If you issue open() or similar on
them (for example, by doing "type" on the command prompt), you will get
the .lnk file, not the file it points to.


===

To: "Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at>
Cc: "Bruce Momjian" <pgman@candle.pha.pa.us>, "Greg Stark" <gsstark@mit.edu>,
Subject: Re: [pgsql-hackers-win32] [HACKERS] Tablespaces 
Date: Wed, 03 Mar 2004 09:30:53 -0500
From: Tom Lane <tgl@sss.pgh.pa.us>

"Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes:
>>> My feeling is that we need not support tablespaces on OS's without
>>> symlinks.

> To create symlinked directories on Win2k NTFS see:
> 	http://www.sysinternals.com/ntw2k/source/misc.shtml#junction
> I think Win2000 or XP would be a reasonable restriction for Win32 PG 
> installations that want tablespaces.

Oh, good --- symlinks for directories are all that we need for this
design.  I think that settles it then.

===

Date: Wed, 03 Mar 2004 10:16:17 -0500
From: Andrew Dunstan <andrew@dunslane.net>
To: pgsql-hackers@postgresql.org,
Subject: Re: [pgsql-hackers-win32] [HACKERS] Tablespaces

Zeugswetter Andreas SB SD wrote:

>To create symlinked directories on Win2k NTFS see:
>	http://www.sysinternals.com/ntw2k/source/misc.shtml#junction
>
>  
>

I don't think we could use this s/w though, unless the author is 
prepared to relicense it. I'm sure implementing a clean room version of 
the relevant parts wouldn't be too hard, though.

===

Subject: Re: [pgsql-hackers-win32] [HACKERS] Tablespaces
Date: Wed, 3 Mar 2004 16:22:30 +0100
From: "Magnus Hagander" <mha@sollentuna.net>
To: "Andrew Dunstan" <andrew@dunslane.net>, <pgsql-hackers@postgresql.org>,

> >To create symlinked directories on Win2k NTFS see:
> >	http://www.sysinternals.com/ntw2k/source/misc.shtml#junction
> >
> >  
> >
> 
> I don't think we could use this s/w though, unless the author is 
> prepared to relicense it. I'm sure implementing a clean room 
> version of 
> the relevant parts wouldn't be too hard, though.

Definitly not (ooh, danger...) - it's a simple and well documented call
to DeviceIOControl().
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/fileio/
base/fsctl_set_reparse_point.asp
is the one you want, I think.

We can always reference to that utility as a good way to get information
about the junctions pgsql has created...

===

Date: Thu, 4 Mar 2004 11:50:11 -0600 (CST)
Subject: Re: [pgsql-hackers-win32] [HACKERS] Tablespaces
From: tswan@idigx.com
To: "Tom Lane" <tgl@sss.pgh.pa.us>
Cc: "Zeugswetter Andreas SB SD" <zeugswettera@spardat.at>,

> "Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes:
>>>> My feeling is that we need not support tablespaces on OS's without
>>>> symlinks.
>
>> To create symlinked directories on Win2k NTFS see:
>> 	http://www.sysinternals.com/ntw2k/source/misc.shtml#junction
>> I think Win2000 or XP would be a reasonable restriction for Win32 PG
>> installations that want tablespaces.
>
> Oh, good --- symlinks for directories are all that we need for this
> design.  I think that settles it then.
>

What archival tools are there that would restore this to this back to the
filesystem: tar? zip?  What would happen if a symlink were removed or
pointed to an invalid location while the postmaste was running?

I think the catalog approach would future proof yourself; think about the
possibility of new filesystems, table storage mechanisms, or devices. 
There may be a raw disk system in the future.  With that you could point
to a block device instead of a filesystem directory.  Symlinks seem like a
simple solution but will it be complete enough?   A meta file or catalog
entry could store more than just the location of the tablespace: max/min
space allocation, vacuum frequencies, storage type (maybe in-place
updates), locking schemes (in case someome had a wild hair to customize
NFS operations).

That meta/configuration file with a summary or one that is generated for a
summary could be useful.

I just can't help but state that "I've got a bad feeling about this."

===

From: Bruce Momjian <pgman@candle.pha.pa.us>
Subject: Re: [pgsql-hackers-win32] [HACKERS] Tablespaces
To: tswan@idigx.com
Date: Thu, 4 Mar 2004 13:18:47 -0500 (EST)
Cc: Tom Lane <tgl@sss.pgh.pa.us>,

tswan@idigx.com wrote:
> > "Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes:
> >>>> My feeling is that we need not support tablespaces on OS's without
> >>>> symlinks.
> >
> >> To create symlinked directories on Win2k NTFS see:
> >> 	http://www.sysinternals.com/ntw2k/source/misc.shtml#junction
> >> I think Win2000 or XP would be a reasonable restriction for Win32 PG
> >> installations that want tablespaces.
> >
> > Oh, good --- symlinks for directories are all that we need for this
> > design.  I think that settles it then.
> >
> 
> What archival tools are there that would restore this to this back to the
> filesystem: tar? zip?  What would happen if a symlink were removed or
> pointed to an invalid location while the postmaste was running?

Well, for backup, just run tar or find on /data with a flag to follow
symlinks, and you are done.  Can't get much easier than that.

===

Date: Thu, 4 Mar 2004 14:06:43 -0600 (CST)
Subject: Re: [pgsql-hackers-win32] [HACKERS] Tablespaces
From: tswan@idigx.com
To: "Bruce Momjian" <pgman@candle.pha.pa.us>
Cc: tswan@idigx.com, "Tom Lane " <tgl@sss.pgh.pa.us>,

> tswan@idigx.com wrote:
>> > "Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes:
>> >>>> My feeling is that we need not support tablespaces on OS's without
>> >>>> symlinks.
>> >
>> >> To create symlinked directories on Win2k NTFS see:
>> >> 	http://www.sysinternals.com/ntw2k/source/misc.shtml#junction
>> >> I think Win2000 or XP would be a reasonable restriction for Win32 PG
>> >> installations that want tablespaces.
>> >
>> > Oh, good --- symlinks for directories are all that we need for this
>> > design.  I think that settles it then.
>> >
>>
>> What archival tools are there that would restore this to this back to
>> the
>> filesystem: tar? zip?  What would happen if a symlink were removed or
>> pointed to an invalid location while the postmaste was running?
>
> Well, for backup, just run tar or find on /data with a flag to follow
> symlinks, and you are done.  Can't get much easier than that.

I'm ruferring to NTFS and the win32 platforms.  How does tar handle these
symlinks on the NTFS filesystem?  What about if someone finds that FAT32
is significantly better for the database?

It seems a little insane to introduce an OS/filesystem dependency at the
onset of a porting effort especially if you hope to be OS agnostic for
feature sets.  I think someone would be crying foul if a new feature only
worked on Linux and not on FreeBSD.

Additionally, another developer noted the advantage of a text file is that
it would be easy for someone to develop tools to help if it became
difficult to edit or parse.  Additionally, there could be a change away
from a flat file format to an XML format to configure the tablespace area.

Another argument against the symlink approach is how they may change in
the future.   A file location is universal, symlink behavoir may not be. 
The symlink behavior on different ports may change.  To rely on symlinks
introduces an unnecessary OS dependency.  All that is needed is the file
location.  This can be derived from a flat file, an XML file, a sysetm
catalog.

Also, to support some features on some platforms is a real poor prospect. 
 ( This application requires the Linux port of PostgreSQL 7.5.1 ) seems
like a poor choice for advocating PostgreSQL.   The extra effort insures
that all ports, current and future, can get the same set of features and
nhancements.   I like Unix and Linux as much as the next guy, but I have
to be real and make the presumption that there are and will be other
operating systems and it would be wise to plan a little for that.

===

To: <tswan@idigx.com>
Cc: "Bruce Momjian" <pgman@candle.pha.pa.us>, "Tom Lane " <tgl@sss.pgh.pa.us>,
Subject: Re: [pgsql-hackers-win32] [HACKERS] Tablespaces
From: jearl@bullysports.com
Date: Thu, 04 Mar 2004 17:07:20 -0700

<tswan@idigx.com> writes:

>> tswan@idigx.com wrote:
>>> > "Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes:
>>> >>>> My feeling is that we need not support tablespaces on OS's without
>>> >>>> symlinks.
>>> >
>>> >> To create symlinked directories on Win2k NTFS see:
>>> >>
the rest of The Pile (a partial mailing list archive)
doom@kzsu.stanford.edu