postgresql-schema-for-xml-documents

This is part of The Pile, a partial archive of some open source mailing lists and newsgroups.



To: pgsql-hackers@postgresql.org
From: "John Gray" <jgray@beansindustry.co.uk>
Subject: [HACKERS] Re: Storing XML in PostgreSQL
Date: Wed, 25 Jul 2001 00:15:21 +0000

In article <4.2.2.20010724150449.00a9ea90@192.168.0.67>,
jm.poure@freesurf.fr (Jean-Michel POURE) wrote:
> Hello friends,
> 
> What is the best way to parse and store an XML document in PostgreSQL? I
> would like to store fwbuilder (http://www.fwbuilder.org) objects in 
> PostgreSQL.
> 

I think the best way depends on what you're trying to achieve with the
document once you have it in the database. One approach is to have tables
for  elements, attributes and cdata and use an XML parser to insert
appropriate database records.

I have used a schema such as the following - in the cdata table, "element"
is the ID of the containing element, and itempos is just an integer used
to order the entries. I used this with a bit of Java which hooks up to the
Lark parser (using SAX) to do the parsing and fires off INSERT queries
through the JDBC driver.

CREATE SEQUENCE cdata_seq;
CREATE SEQUENCE attribute_seq;
CREATE SEQUENCE element_seq;

CREATE TABLE element (
       document integer,
       element integer not null PRIMARY KEY default nextval('element_seq'),
       name text,
       parent integer,
       itempos integer
);

CREATE TABLE attribute (
       document integer,
       attribute integer not null default nextval('attribute_seq'),
       name text,
       value text,
       element integer,
       itempos integer
);

CREATE TABLE cdata (
       document integer,
       cdata integer not null default nextval('cdata_seq'),
       value text,
       element integer,
       itempos integer
);

In my example, I was interested in selecting all the cdata content
of a <type> tag immediately contained within a <feature> tag.

The easiest solution is to create a view, which can then be queried to
find all cases where, for example, feature type = 'Ditch'.

CREATE VIEW featuretype AS
SELECT c.document, c.value
FROM cdata c, element e, element e1
WHERE c.element = e.element
AND   e.parent = e1.element
AND   e.name = 'type'
AND   e1.name = 'feature'
AND   c.document = e.document
AND   e.document = e1.document;

If you are interested, I can provide the very basic (Java) code I used
for this.

Or, depending on what these fwbuilder objects involve, you can of
course just store XML documents in fields of type text (especially if
you use 7.1, which has TOAST, so you can store long documents). It's
not difficult to hook up a parser (I'm using expat) to a PostgreSQL
function written in C and parse on the fly.
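Outside the backend, the same well-formedness check is easy to sketch.
This is not the C function itself, just an illustration of the idea
using Python's binding to the same expat library:

```python
import xml.parsers.expat

def is_well_formed(doc):
    """Return True if doc parses as well-formed XML, False otherwise
    (the same check the backend C function performs via expat)."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(doc, True)   # True = this is the final chunk
        return True
    except xml.parsers.expat.ExpatError:
        return False

print(is_well_formed("<a><b>ok</b></a>"))   # well-formed
print(is_well_formed("<a><b>broken</a>"))   # mismatched tags
```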

I haven't really finished that code, but after I've commented it, I can 
certainly post it if anyone is interested. It does work, but probably
needs some tidying. It really wasn't difficult to write the functions 
though. In fact, I've been surprised by how easy it is to write 
PostgreSQL C functions...

Please contact me if you have any questions - I've been away for a bit
so haven't worked on that code for a couple of weeks, but I'm hoping to
get back into it soon.

Regards

John



===
To: pgsql-hackers@postgresql.org
From: "John Gray" <jgray@beansindustry.co.uk>
Subject: [HACKERS] Re: Re: Storing XML in PostgreSQL
Date: Wed, 25 Jul 2001 18:12:28 +0000

In article <m2elr5b2cx.fsf@smaug.polygnosis.com>, gunnar@polygnosis.com
(Gunnar =?iso-8859-1?q?R=F8nning?=) wrote:

> Do you have any documentation on your C functions ? I'm just interested
> in knowing what functions they provide.
> 

There are only two (so far). They're very basic. I have:

pgxml_parse(text) returns bool
  - parses the provided text and returns true or false according to
whether it is well-formed.

pgxml_xpath(text doc, text xpath, int n) returns text
  - parses doc and returns the cdata of the nth occurrence of the
"XPath" listed. This handles relative and absolute paths but nothing
else at present. I have a few variants of this.

So, given a table docstore:

 Attribute |  Type   | Modifier 
-----------+---------+----------
 docid     | integer | 
 document  | text    | 

containing documents such as:

<?xml version="1.0"?>
<site provider="Foundations" sitecode="ak97" version="1">
   <name>Church Farm, Ashton Keynes</name>
   <invtype>watching brief</invtype>
   <location scheme="osgb">SU04209424</location>
</site>

I can type:
select docid,
       pgxml_xpath(document, '/site/name', 1) as sitename,
       pgxml_xpath(document, '/site/location', 1) as location
from docstore;
 
and I get:

 docid |          sitename           |  location  
-------+-----------------------------+------------
     1 | Church Farm, Ashton Keynes  | SU04209424
     2 | Glebe Farm, Long Itchington | SP41506500
(2 rows)
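For flavour only (the real pgxml_xpath is a C function over expat, and
its source isn't shown here), the absolute-path lookup can be imitated
with expat's handlers and a stack of open element names - this toy
handles only absolute paths, unlike the real function:

```python
import xml.parsers.expat

def simple_xpath(doc, path, n):
    """Return the cdata of the nth occurrence (1-based) of an absolute
    path like '/site/name', or None. A toy stand-in for pgxml_xpath."""
    target = path.strip("/").split("/")
    stack, out = [], []
    seen = 0
    capturing = False

    def start(name, attrs):
        nonlocal seen, capturing
        stack.append(name)
        if stack == target:          # the open-element path matches
            seen += 1
            capturing = (seen == n)

    def end(name):
        nonlocal capturing
        if stack == target:
            capturing = False
        stack.pop()

    def chars(data):
        if capturing:                # collect cdata of the nth match
            out.append(data)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    parser.Parse(doc, True)
    return "".join(out) if out else None

doc = """<site provider="Foundations" sitecode="ak97" version="1">
   <name>Church Farm, Ashton Keynes</name>
   <location scheme="osgb">SU04209424</location>
</site>"""
print(simple_xpath(doc, "/site/name", 1))      # Church Farm, Ashton Keynes
print(simple_xpath(doc, "/site/location", 1))  # SU04209424
```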

The next thing is to use the "function as tuple source" support which
is underway, in order to allow the return of a list. In the DTD I'm
using (and doubtless many others) certain elements might be repeated,
and I think it would be good to be able to join against all the data
from a particular element.

I hope this helps give a flavour. I'll try and tidy up the functions in the
next couple of days and then I can post what I've got so far. I'm keen to
build on this, as it's part of an (unfunded, unfortunately) project we're 
doing. Expat is MIT-licensed so I don't imagine there's a problem linking
it into PostgreSQL.

One aim is to allow people to set pg functions as the handlers "direct"
from the parser - the catch is that the expat API has lots of handlers
(OK, so most of them are less commonly used), so it's a matter of
working out a) an efficient API for setting handlers on a particular
parser and b) how persistent a parser instance should be (each expat
instance can only do one document). Of course, expat may not be the
best one to use - it would be great to be parser-agnostic and use SAX
with a Java parser, but I don't think we have Java as a language for
user functions yet :-)
 
Incidentally, I'll be changing my email address over the next couple
of days to jgray@azuli.co.uk - just so you can follow this thread after
I've done that....

Regards

John
Azuli IT



===

To: pgsql-hackers@postgresql.org
From: jgray@beansindustry.co.uk
Subject: [HACKERS] Re: From TODO, XML?
Date: Fri, 27 Jul 2001 18:30:40 +0000

In article <3B615336.D654E7E1@mohawksoft.com>, markw@mohawksoft.com (mlw)
wrote:
> I was looking over the todo list and saw that someone wanted to support
> XML. I have some quick and dirty stuff that could be used.
> 

I'm not clear from the TODO what that "XML support" might involve. The
reference to pg_dump suggests an XML dump format for databases. That only
makes sense if we build an XML frontend that can load XML-based pg_dump
files.

I can't see any very useful application, though, unless someone has a
standard for database dumps using XML - I'd have thought that our current
"list of SQL statements" dump is fine (and useful if you speak SQL).

> OK, what should the feature look like?
> 

What's the feature for? The things I've been working on are trying to make
an XML parser available in the backend, and to build some XML document
manipulation functions/operators. This is useful for what I'm doing (using
XML documents as short packets of human and machine-readable descriptive
data) and may be useful to other people. This work hasn't progressed very
far (I did only spend an afternoon or so writing it though....):
(available at http://www.cabbage.uklinux.net/pgxml.tar.gz)

One obvious (and current) topic is XQuery, and we might ask whether PG
could/should implement it. I think some thinking would be needed on that,
because a) it involves having a second, non-SQL parser on the front end,
which could be quite a large undertaking, and b) there's probably (from
my initial reading) some discrepancy between the PG (and indeed SQL)
data model and the XQuery one. If we could work round that, XQuery
*might* be an attraction to people. Certainly the ability to form one XML
document out of another via a query may be good for some projects.

Perhaps if people interested in XML "stuff" could add here, we might flesh
out a little more of what's desired.

> Should it be grafted onto pg_dump or should a new utility pg_xml be
> created?
> 
> How strict should it be? A stricter parser is easier to write, one can
> use a library, unfortunately most xml is crap and for the utility to be
> useful, it has to be real fuzzy.
> 

I don't think you really can write a non-strict XML parser. At least, not
if you want the resulting DOM to be useful - violations of well-formedness
probably result in logical difficulties with the document structure, e.g.:

<a>
<b>text
<c>more text</c>
</a>

Is <c> within <b>? Are <b> and <c> siblings? These questions are
answerable with well-formed XML - and they're very relevant ones to ask
for many XML processing tasks.
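Any conforming parser bears this out: rather than guessing an answer, it
rejects the fragment outright. A quick check, assuming Python's stdlib
binding to expat:

```python
import xml.parsers.expat

# the ill-formed fragment from above: <b> is opened but never closed
bad = """<a>
<b>text
<c>more text</c>
</a>"""

parser = xml.parsers.expat.ParserCreate()
try:
    parser.Parse(bad, True)
    verdict = "well-formed"
except xml.parsers.expat.ExpatError as err:
    verdict = f"rejected: {err}"   # expat reports a mismatched tag here

print(verdict)
```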

> Any input would be appreciated.
> 

Likewise - I'd be very interested to know what sort of things people were
interested in, as I've found an area where I have a need which others
might share. I'd like to contribute some effort to it.

Regards

John


===
To: jgray@beansindustry.co.uk
From: mlw <markw@mohawksoft.com>
Subject: [HACKERS] Re: From TODO, XML?
Date: Fri, 27 Jul 2001 20:11:35 -0400

jgray@beansindustry.co.uk wrote:
> 
> In article <3B615336.D654E7E1@mohawksoft.com>, markw@mohawksoft.com (mlw)
> wrote:
> > I was looking over the todo list and saw that someone wanted to support
> > XML. I have some quick and dirty stuff that could be used.
> >
> 
> I'm not clear from the TODO what that "XML support" might involve. The
> reference to pg_dump suggests an XML dump format for databases. That only
> makes sense if we build an XML frontend that can load XML-based pg_dump
> files.
> 
> I can't see any very useful application though, unless someone has a
> standard for database dumps using XML -I'd have thought that our current
> "list of SQL statements" dump is fine (and useful if you speak SQL)

Actually I have been thinking about a couple projects I have done. Vendors like
to think XML is a way to distribute databases.

So a parser that can scan a DTD and produce a usable CREATE TABLE (...)
statement would be very helpful. So would one which could compare a DTD
to an existing SQL table and map XML data correctly (or raise an error
if conversion from the data to SQL types fails).

During a database export, a SQL table could be used to create a DTD.

I was thinking along the line of being able to use XML as a fairly portable
import/export feature. Having this ability, as a generic solution, would have
made several tasks MUCH easier.
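The export direction at least is mechanical. As a rough sketch (the
table and column names are invented examples, the naive pluralisation
and the flat one-element-per-column mapping are assumptions, and types
and keys are ignored entirely):

```python
def table_to_dtd(table, columns):
    """Emit a DTD for a flat XML dump of one SQL table: a root element
    wrapping zero or more rows, one child element per column."""
    root = table + "s"   # naive pluralisation of the table name
    lines = [f"<!ELEMENT {root} ({table}*)>",
             f"<!ELEMENT {table} ({', '.join(columns)})>"]
    lines += [f"<!ELEMENT {col} (#PCDATA)>" for col in columns]
    return "\n".join(lines)

print(table_to_dtd("employee", ["name", "position", "salary"]))
```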

I would also like the XML parser to be fuzzy enough to take some bad XML
(because ALL XML is bad), because a lot of vendors like to distribute data in
bad XML.


===

To: markw@mohawksoft.com (mlw)
From: fche@redhat.com (Frank Ch. Eigler)
Subject: [HACKERS] Re: From TODO, XML?
Date: 27 Jul 2001 22:24:31 -0400

markw wrote:

: [...]  Actually I have been thinking about a couple projects I have
: done. Vendors like to think XML is a way to distribute databases.

I would find it very helpful to see a table of what sorts of XML
functionality each major vendor supports.


: So a parser that can scan a DTD and make a usable create table (...) 
: line would be very helpful. [...]

Hmm, but hierarchically structured documents such as XML don't map
well to a relational model.  The former tend to be recursive (e.g.,
have more levels of containment than the one or two that might be
mappable to tables and columns.)


: During a database export, a SQL table could be used to create a DTD.
: [...]

This mapping (relational model -> XML) is more straightforward.


===

To: "Frank Ch. Eigler" <fche@redhat.com>
From: mlw <markw@mohawksoft.com>
Subject: [HACKERS] Re: From TODO, XML?
Date: Sat, 28 Jul 2001 03:29:12 -0400

"Frank Ch. Eigler" wrote:
> 
> markw wrote:
> 
> : [...]  Actually I have been thinking about a couple projects I have
> : done. Vendors like to think XML is a way to distribute databases.
> 
> I would find it very helpful to see a table of what sorts of XML
> functionality each major vendor supports.

Actually I was thinking of databases of data, not database systems.

> 
> : So a parser that can scan a DTD and make a usable create table (...)
> : line would be very helpful. [...]
> 
> Hmm, but hierarchically structured documents such as XML don't map
> well to a relational model.  The former tend to be recursive (e.g.,
> have more levels of containment than the one or two that might be
> mappable to tables and columns.)

Yes!!! Exactly, being able to understand the recursive nature of XML and create
relations on the fly would be a very cool feature.

> 
> : During a database export, a SQL table could be used to create a DTD.
> : [...]
> 
> This mapping (relational model -> XML) is more straightforward.

Totally.

===

To: "mlw" <markw@mohawksoft.com>, "Frank Ch. Eigler"
<fche@redhat.com>
From: "Ken Hirsch" <kenhirsch@myself.com>
Subject: Re: [HACKERS] Re: From TODO, XML?
Date: Sun, 29 Jul 2001 11:50:05 -0400

mlw <markw@mohawksoft.com> wrote:


> "Frank Ch. Eigler" wrote:
> > : So a parser that can scan a DTD and make a usable create table (...)
> > : line would be very helpful. [...]
> >
> > Hmm, but hierarchically structured documents such as XML don't map
> > well to a relational model.  The former tend to be recursive (e.g.,
> > have more levels of containment than the one or two that might be
> > mappable to tables and columns.)
>
> > Yes!!! Exactly, being able to understand the recursive nature of XML
> > and create relations on the fly would be a very cool feature.

I think there is a pretty straightforward mapping, except for one
possible ambiguity.

If an element, say <address>, is contained within another element, say
<employee>, it could either be a column (or group of columns) in an Employee
table, or it could be a table Address which references Employee.

When you say "create relations on the fly", what exactly do you mean?  I can
see it would be handy to have CREATE TABLE statements written for you, but
it seems likely that a human would want to edit them before the tables are
actually created.  You cannot infer much type information from the DTD.  I
don't think there's a way to infer a primary key from a DTD, so you would
want to either specify one or add a serial column (or perhaps that would
always be done automatically).  An XML schema would have more information,
of course.



===

To: Ken Hirsch <kenhirsch@myself.com>
From: mlw <markw@mohawksoft.com>
Subject: Re: [HACKERS] Re: From TODO, XML?
Date: Sun, 29 Jul 2001 12:19:48 -0400

Ken Hirsch wrote:
> 
> mlw <markw@mohawksoft.com> wrote:
> 
> > "Frank Ch. Eigler" wrote:
> > > : So a parser that can scan a DTD and make a usable create table (...)
> > > : line would be very helpful. [...]
> > >
> > > Hmm, but hierarchically structured documents such as XML don't map
> > > well to a relational model.  The former tend to be recursive (e.g.,
> > > have more levels of containment than the one or two that might be
> > > mappable to tables and columns.)
> >
> > Yes!!! Exactly, being able to understand the recursive nature of XML
> > and create relations on the fly would be a very cool feature.
> 
> I think there is a pretty straightforward mapping, except for one
> possible ambiguity.
> 
> If an element, say <address>, is contained within another element, say
> <employee>, it could either be a column (or group of columns) in an Employee
> table, or it could be a table Address which references Employee.
> 
> When you say "create relations on the fly", what exactly do you mean?  I can
> see it would be handy to have CREATE TABLE statements written for you, but
> it seems likely that a human would want to edit them before the tables are
> actually created.  You cannot infer much type information from the DTD.  I
> don't think there's a way to infer a primary key from a DTD, so you would
> want to either specify one or add a serial column (or perhaps that would
> always be done automatically).  An XML schema would have more information,
> of course.

I have been thinking about this. A lot of guessing would have to be done,
of course. But, unless some extra information is specified, when you have
an XML record contained within another, the parser would have to generate
its own primary key and a sequence for each table. Obviously, the user
should be able to specify the primary key for each table, but lacking
that input, the XML parser/importer should do it automatically.


So this:

<employee>
<name>Bill</name>
<position>Programmer</position>
<address>
	<number>1290</number>
	<street>
		<name>Canton Ave</name>
	</street>
	
	<town>
		<name>Milton</name>
	</town>
</address>
</employee>

The above is almost impossible to convert to a relational format without
additional information or a good set of rules. However, we can determine
which XML tags are "containers" and which are "data": "employee" is a
container because it has sub-tags; "position" is "data" because it has
no sub-tags.

We can recursively scan this hierarchy, deciding which elements are
containers and which are data. Data gets assigned an appropriate SQL
type; containers get separated from the parent container, with an
integer key put in their place. For each container, a primary key is
either specified or created on the fly.

We insert sub containers first and pop back the primary key value, until we
have the whole record. The primary key could even be the OID.

A second strategy is to concatenate the hierarchy into the field name, as
street_name, town_name, and so on.
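The first strategy can be sketched as follows (nested dicts stand in
for the parsed XML, and a plain counter stands in for the per-table
sequences or OIDs; the table and field names come from the example
above):

```python
def shred(tag, node, tables, counter):
    """Recursively flatten a nested dict into per-container rows.
    String leaves become columns; nested dicts become rows in their own
    table, replaced in the parent row by a generated integer key."""
    counter[0] += 1
    row = {"id": counter[0]}
    for key, value in node.items():
        if isinstance(value, dict):
            # a container: shred it first, keep its key in our row
            row[key] = shred(key, value, tables, counter)
        else:
            # data: a plain column (SQL type inference is left out here)
            row[key] = value
    tables.setdefault(tag, []).append(row)
    return row["id"]

tables = {}
shred("employee", {
    "name": "Bill",
    "position": "Programmer",
    "address": {
        "number": "1290",
        "street": {"name": "Canton Ave"},
        "town": {"name": "Milton"},
    },
}, tables, [0])

for name, rows in tables.items():
    print(name, rows)
```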


What do you think?

===

To: mlw <markw@mohawksoft.com>
From: Bruce Momjian <pgman@candle.pha.pa.us>
Subject: Re: [HACKERS] Re: From TODO, XML?
Date: Mon, 30 Jul 2001 00:00:36 -0400 (EDT)

> I have been fighting, for a while now, with idiot data vendors that think XML
> is a cure-all. The problem is that XML is a hierarchical format whereas SQL is
> a relational format.
> 
> It would be good to get pg_dump to write an XML file and DTD, but getting
> external sources of XML into PostgreSQL is WAY more complicated. If an XML
> import is to be useful beyond just a different format for pg_dump, there has to
> be some intelligent database construction based on the XML information.
> 
> Go to mp3.com, and download some of their XML format data, first, it is bad
> XML, second, it is hierarchical.
> 
> I have managed to get several XML files into PostgreSQL by writing a parser,
> and it is a huge hassle, the public parsers are too picky. I am thinking that a
> fuzzy parser, combined with some intelligence and an XML DTD reader, could make
> a very cool utility, one which I have not been able to find.
> 
> Perhaps it is a two stage process? First pass creates a schema which can be
> modified/corrected, the second pass loads the data.

Can we accept only relational XML?  Does that buy us anything?  Are the
other database vendors outputting hierarchical XML?  Are they using
foreign/primary keys to do it?
===

To: Bruce Momjian <pgman@candle.pha.pa.us>
From: mlw <markw@mohawksoft.com>
Subject: Re: [HACKERS] Re: From TODO, XML?
Date: Mon, 30 Jul 2001 00:01:56 -0400

Bruce Momjian wrote:
> 
> > > I would find it very helpful to see a table of what sorts of XML
> > > functionality each major vendor supports.
> >
> > Actually I was thinking of databases of data, not database systems.
> 
> I think we can go two ways.  Allow COPY/pg_dump to read/write XML, or
> write some perl scripts to convert XML to/from our pg_dump format.  The
> latter seems quite easy and fast.

I have been fighting, for a while now, with idiot data vendors that think
XML is a cure-all. The problem is that XML is a hierarchical format
whereas SQL is a relational format.

It would be good to get pg_dump to write an XML file and DTD, but getting
external sources of XML into PostgreSQL is WAY more complicated. If an XML
import is to be useful beyond just a different format for pg_dump, there has to
be some intelligent database construction based on the XML information.

Go to mp3.com, and download some of their XML format data, first, it is bad
XML, second, it is hierarchical.

I have managed to get several XML files into PostgreSQL by writing a parser,
and it is a huge hassle, the public parsers are too picky. I am thinking that a
fuzzy parser, combined with some intelligence and an XML DTD reader, could make
a very cool utility, one which I have not been able to find.

Perhaps it is a two stage process? First pass creates a schema which can be
modified/corrected, the second pass loads the data.

===

To: mlw <markw@mohawksoft.com>
From: Gavin Sherry <swm@linuxworld.com.au>
Subject: Re: [HACKERS] Re: From TODO, XML?
Date: Mon, 30 Jul 2001 15:43:26 +1000 (EST)

On Mon, 30 Jul 2001, mlw wrote:

> Bruce Momjian wrote:
> > 
> > > > I would find it very helpful to see a table of what sorts of XML
> > > > functionality each major vendor supports.
> > >
> > > Actually I was thinking of databases of data, not database systems.
> > 
> > I think we can go two ways.  Allow COPY/pg_dump to read/write XML, or
> > write some perl scripts to convert XML to/from our pg_dump format.  The
> > latter seems quite easy and fast.
> 

> I have managed to get several XML files into PostgreSQL by writing a parser,
> and it is a huge hassle, the public parsers are too picky. I am thinking that a
> fuzzy parser, combined with some intelligence and an XML DTD reader, could make
> a very cool utility, one which I have not been able to find.

I have had the same problem. The best XML parser I could find was the
gnome-xml library at xmlsoft.org (libxml). I am currently using it from C
to replicate a client's legacy Notes system onto Postgres. In this case I
was lucky, inasmuch as I had some input on the XML namespace etc. XML was
used because they had already designed an XML-based dump utility.

However, the way XML is being used is very basic: only creation of
tables, insert and delete are handled. Libxml copes fine with this,
though, handling DTD/XML parsing, UTF-8, UTF-16 and iso-8859-1,
validation, etc.

The main problem then is that every vendor has a different XML namespace.
If people really want to pursue this, the best thing to do would be to
try to work with other open source database developers and design a
suitable XML namespace for open source databases. Naturally, there will
be much contention here about the most suitable this and that. It will be
difficult to get a real spec going and will probably be much more trouble
than it is worth. As such, if this fails, then we cannot expect Oracle,
IBM, Sybase, MS and the rest to ever do it.

Perhaps then it would be sufficient for pg_dump/restore to identify the
namespace of a given database dump and parse it according to that
namespace. Based on command-line arguments, pg_restore/dump could either
die/ignore/transmogrify instructions in the XML which PG does not support
or recognise. It would also be useful if pg_dump could dump data from
postgres in the supported XML namespaces.

So it essentially comes down to how useful it will be and who has time to
code it up =) (as always).

**Creative Solution**

For those who have too much time on their hands and have managed to
untangle some of the syntax in the W3C XSLT 1.0 specification, how about
an XSL stylesheet to transform an XML based database dump from some third
party into (postgres) SQL. Erk! There would have to be an award for such a
thing ;-).


===

To: Bruce Momjian <pgman@candle.pha.pa.us>
From: mlw <markw@mohawksoft.com>
Subject: [HACKERS] Re: From TODO, XML?
Date: Mon, 30 Jul 2001 04:47:19 -0400

Bruce Momjian wrote:
> 
> > I have been fighting, for a while now, with idiot data vendors that think XML
> > is a cure-all. The problem is that XML is a hierarchical format whereas SQL is
> > a relational format.
> >
> > It would be good to get pg_dump to write an XML file and DTD, but getting
> > external sources of XML into PostgreSQL is WAY more complicated. If an XML
> > import is to be useful beyond just a different format for pg_dump, there has to
> > be some intelligent database construction based on the XML information.
> >
> > Go to mp3.com, and download some of their XML format data, first, it is bad
> > XML, second, it is hierarchical.
> >
> > I have managed to get several XML files into PostgreSQL by writing a parser,
> > and it is a huge hassle, the public parsers are too picky. I am thinking that a
> > fuzzy parser, combined with some intelligence and an XML DTD reader, could make
> > a very cool utility, one which I have not been able to find.
> >
> > Perhaps it is a two stage process? First pass creates a schema which can be
> > modified/corrected, the second pass loads the data.
> 
> Can we accept only relational XML?  Does that buy us anything?  Are the
> other database vendors outputting hierarchical XML?  Are they using
> foreign/primary keys to do it?

Then what's the point? Almost no one creates non-hierarchical XML. For
the utility to be useful beyond just a different format for pg_dump, it
has to deal with these issues and do the right thing.

===

To: Gavin Sherry <swm@linuxworld.com.au>
From: Karel Zak <zakkr@zf.jcu.cz>
Subject: Re: [HACKERS] Re: From TODO, XML?
Date: Mon, 30 Jul 2001 10:38:54 +0200

On Mon, Jul 30, 2001 at 03:43:26PM +1000, Gavin Sherry wrote:
> On Mon, 30 Jul 2001, mlw wrote:
> 
> I have had the same problem. The best XML parser I could find was the
> gnome-xml library at xmlsoft.org (libxml). I am currently using this in C

 What happens if you use a DOM-type XML parser on a large file? A dump
from a SQL DB can be really large. IMHO a SAX-type XML parser is better
for this (dumping data from a SQL DB).
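The memory argument is easy to demonstrate: a DOM parser materialises
the whole tree before you can touch it, while a stream-oriented parser
holds only the current state. A sketch with Python's expat binding,
feeding the dump in chunks (count_rows and the <row> tag are invented
for illustration):

```python
import xml.parsers.expat

def count_rows(chunks, row_tag="row"):
    """Count row elements in an XML stream without building a tree:
    the parser is fed chunk by chunk, so the whole dump never needs
    to sit in memory at once."""
    count = 0

    def start(name, attrs):
        nonlocal count
        if name == row_tag:
            count += 1

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    for chunk in chunks:
        parser.Parse(chunk, False)
    parser.Parse("", True)   # tell expat the input is complete
    return count

# simulate a big dump arriving in pieces
stream = ["<dump>"] + ["<row>x</row>"] * 5 + ["</dump>"]
print(count_rows(stream))   # 5
```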

> an XSL stylesheet to transform an XML based database dump from some third

 Yes, that's the right way to use XML.

			Karel

===

