This is part of The Pile, a partial archive of some open source mailing lists and newsgroups.
To: pgsql-hackers@postgresql.org
From: "John Gray" <jgray@beansindustry.co.uk>
Subject: [HACKERS] Re: Storing XML in PostgreSQL
Date: Wed, 25 Jul 2001 00:15:21 +0000

In article <4.2.2.20010724150449.00a9ea90@192.168.0.67>,
jm.poure@freesurf.fr (Jean-Michel POURE) wrote:
> Hello friends,
>
> What is the best way to parse and store an XML document in PostgreSQL? I
> would like to store fwbuilder (http://www.fwbuilder.org) objects in
> PostgreSQL.

I think the best way depends on what you're trying to achieve with the
document once you have it in the database.

One approach is to have tables for elements, attributes and cdata, and use
an XML parser to insert the appropriate database records. I have used a
schema such as the following - in the cdata table, "element" is the ID of
the containing element, and itempos is just an integer used to order the
entries. I used this with a bit of Java which hooks up to the Lark parser
(using SAX) to do the parsing and fires off INSERT queries through the JDBC
driver.

CREATE SEQUENCE cdata_seq;
CREATE SEQUENCE attribute_seq;
CREATE SEQUENCE element_seq;

CREATE TABLE element (
    document integer,
    element integer not null PRIMARY KEY default nextval('element_seq'),
    name text,
    parent integer,
    itempos integer
);

CREATE TABLE attribute (
    document integer,
    attribute integer not null default nextval('attribute_seq'),
    name text,
    value text,
    element integer,
    itempos integer
);

CREATE TABLE cdata (
    document integer,
    cdata integer not null default nextval('cdata_seq'),
    value text,
    element integer,
    itempos integer
);

In my example, I was interested in selecting all the cdata content of a
<type> tag immediately contained within a <feature> tag. The easiest
solution is to create a view, which can then be queried to find all cases
where, for example, feature type = 'Ditch'.
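[Before the view definition below, a sketch of the shredding step just
described - walking the document with SAX and collecting one row per
element, attribute and cdata chunk for the element/attribute/cdata tables.
This is illustrative Python rather than the Java/JDBC code mentioned, which
fired real INSERTs; here tuples stand in for the INSERT queries and a plain
counter stands in for nextval('element_seq').]

```python
import xml.sax
from xml.sax.handler import ContentHandler

class ShredHandler(ContentHandler):
    def __init__(self, doc_id):
        super().__init__()
        self.doc_id = doc_id
        self.next_id = 0      # stands in for nextval('element_seq')
        self.stack = []       # [element_id, next child itempos]
        self.elements = []    # (document, element, name, parent, itempos)
        self.attributes = []  # (document, name, value, element, itempos)
        self.cdata = []       # (document, value, element, itempos)

    def startElement(self, name, attrs):
        self.next_id += 1
        eid = self.next_id
        if self.stack:
            parent, pos = self.stack[-1]
            self.stack[-1][1] += 1
        else:
            parent, pos = None, 0
        self.elements.append((self.doc_id, eid, name, parent, pos))
        for i, (aname, avalue) in enumerate(attrs.items()):
            self.attributes.append((self.doc_id, aname, avalue, eid, i))
        self.stack.append([eid, 0])

    def characters(self, content):
        # record non-whitespace character data against the enclosing element
        if content.strip() and self.stack:
            eid, pos = self.stack[-1]
            self.stack[-1][1] += 1
            self.cdata.append((self.doc_id, content.strip(), eid, pos))

    def endElement(self, name):
        self.stack.pop()

def shred(doc_id, xml_text):
    handler = ShredHandler(doc_id)
    xml.sax.parseString(xml_text.encode(), handler)
    return handler
```

So shred(1, "<feature><type>Ditch</type></feature>") yields an element row
for <feature>, a <type> row carrying the feature row's id as parent, and a
cdata row for "Ditch" - exactly the rows the view below joins over.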
CREATE VIEW featuretype AS
    SELECT c.document, c.value
    FROM cdata c, element e, element e1
    WHERE c.element = e.element AND e.parent = e1.element
      AND e.name = 'type' AND e1.name = 'feature'
      AND c.document = e.document AND e.document = e1.document;

If you are interested I can provide the very basic (Java) code I used for
this.

OR, depending on what these fwbuilder objects involve, you can of course
just store XML documents in fields of type text (especially if you use 7.1,
which has TOAST, so you can store long documents). It's not difficult to
hook up a parser (I'm using expat) to a PostgreSQL function written in C
and parse on the fly. I haven't really finished that code, but after I've
commented it, I can certainly post it if anyone is interested. It does
work, but probably needs some tidying. It really wasn't difficult to write
the functions, though. In fact, I've been surprised by how easy it is to
write PostgreSQL C functions...

Please contact me if you have any questions - I've been away for a bit so
haven't worked on that code for a couple of weeks - I'm hoping to get back
into it soon.

Regards

John

===

To: pgsql-hackers@postgresql.org
From: "John Gray" <jgray@beansindustry.co.uk>
Subject: [HACKERS] Re: Re: Storing XML in PostgreSQL
Date: Wed, 25 Jul 2001 18:12:28 +0000

In article <m2elr5b2cx.fsf@smaug.polygnosis.com>, gunnar@polygnosis.com
(Gunnar Rønning) wrote:
> Do you have any documentation on your C functions? I'm just interested
> in knowing what functions they provide.

There are only two (so far). They're very basic. I have:

pgxml_parse(text) returns bool
  - parses the provided text and returns true or false according to
    whether it is well-formed.

pgxml_xpath(text doc, text xpath, int n) returns text
  - parses doc and returns the cdata of the nth occurrence of the "XPath"
    given. This handles relative and absolute paths but nothing else at
    present. I have a few variants of this.
So, given a table docstore:

 Attribute |  Type   | Modifier
-----------+---------+----------
 docid     | integer |
 document  | text    |

containing documents such as:

<?xml version="1.0"?>
<site provider="Foundations" sitecode="ak97" version="1">
<name>Church Farm, Ashton Keynes</name>
<invtype>watching brief</invtype>
<location scheme="osgb">SU04209424</location>
</site>

I can type:

select docid,
       pgxml_xpath(document,'/site/name',1) as sitename,
       pgxml_xpath(document,'/site/location',1) as location
from docstore;

and I get:

 docid |          sitename           |  location
-------+-----------------------------+------------
     1 | Church Farm, Ashton Keynes  | SU04209424
     2 | Glebe Farm, Long Itchington | SP41506500
(2 rows)

The next thing is to use the "function as tuple source" support which is
underway, in order to allow the return of a list (in the DTD I'm using -
and doubtless many others - certain elements might be repeated, and I think
it would be good to be able to join against all the data from a particular
element).

I hope this helps give a flavour. I'll try and tidy up the functions in the
next couple of days and then I can post what I've got so far. I'm keen to
build on this, as it's part of an (unfunded, unfortunately) project we're
doing. Expat is MIT-licensed so I don't imagine there's a problem linking
it into PostgreSQL.

One aim is to allow people to set pg functions as the handlers "direct"
from the parser - the catch is that the expat API has lots of handlers (OK,
so most of them are less commonly used), so it's a matter of working out a)
an efficient API for setting handlers on a particular parser and b) how
persistent a parser instance should be (each expat instance can only do one
document).
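[For anyone following along without the C code, the behaviour of
pgxml_xpath shown above can be modelled roughly in Python. This is only a
sketch of the described semantics - plain /a/b/c paths, nth match - and
says nothing about the real function's internals, which aren't shown in the
thread.]

```python
import xml.etree.ElementTree as ET

def pgxml_xpath(document, path, n):
    """Return the cdata of the nth match of a simple slash-separated path."""
    root = ET.fromstring(document)
    parts = path.strip("/").split("/")
    if root.tag != parts[0]:
        return None
    # ElementTree's findall is evaluated relative to the root element
    matches = root.findall("/".join(parts[1:])) if len(parts) > 1 else [root]
    if 1 <= n <= len(matches):
        return (matches[n - 1].text or "").strip()
    return None

doc = """<site provider="Foundations" sitecode="ak97" version="1">
  <name>Church Farm, Ashton Keynes</name>
  <invtype>watching brief</invtype>
  <location scheme="osgb">SU04209424</location>
</site>"""
```

With the sample document, pgxml_xpath(doc, '/site/name', 1) gives the site
name and pgxml_xpath(doc, '/site/location', 1) the grid reference, matching
the query output above.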
Of course, expat may not be the best one to use - it would be great to be
parser-agnostic and use SAX with a Java parser, but I don't think we have
Java as a language for user functions yet :-)

Incidentally, I'll be changing my email address over the next couple of
days to jgray@azuli.co.uk - just so you can follow this thread after I've
done that...

Regards

John
Azuli IT

===

To: pgsql-hackers@postgresql.org
From: jgray@beansindustry.co.uk
Subject: [HACKERS] Re: From TODO, XML?
Date: Fri, 27 Jul 2001 18:30:40 +0000

In article <3B615336.D654E7E1@mohawksoft.com>, markw@mohawksoft.com (mlw)
wrote:
> I was looking over the todo list and saw that someone wanted to support
> XML. I have some quick and dirty stuff that could be used.

I'm not clear from the TODO what that "XML support" might involve. The
reference to pg_dump suggests an XML dump format for databases. That only
makes sense if we build an XML frontend that can load XML-based pg_dump
files. I can't see any very useful application, though, unless someone has
a standard for database dumps using XML - I'd have thought that our current
"list of SQL statements" dump is fine (and useful if you speak SQL).

> OK, what should the feature look like?

What's the feature for? The things I've been working on are trying to make
an XML parser available in the backend, and to build some XML document
manipulation functions/operators. This is useful for what I'm doing (using
XML documents as short packets of human- and machine-readable descriptive
data) and may be useful to other people. This work hasn't progressed very
far (I did only spend an afternoon or so writing it, though...):
(available at http://www.cabbage.uklinux.net/pgxml.tar.gz)

One obvious (and current) topic is XQuery, and we might ask whether PG
could/should implement it.
I think some thinking would be needed on that, because a) it involves
having a second, non-SQL parser on the front-end, and that could be quite a
large undertaking, and b) there's probably (from my initial reading) some
discrepancy between the PG (and indeed SQL) data model and the XQuery one.
If we could work round that, XQuery *might* be an attraction to people.
Certainly the ability to form one XML document out of another via a query
may be good for some projects.

Perhaps if people interested in XML "stuff" could add here, we might flesh
out a little more of what's desired.

> Should it be grafted onto pg_dump or should a new utility pg_xml be
> created?
>
> How strict should it be? A stricter parser is easier to write, one can
> use a library, unfortunately most xml is crap and for the utility to be
> useful, it has to be real fuzzy.

I don't think you really can write a non-strict XML parser. At least, not
if you want the resulting DOM to be useful - violations of well-formedness
probably result in logical difficulties with the document structure, e.g.

<a>
<b>text
<c>more text</c>
</a>

Is <c> within <b>? Are <b> and <c> siblings? These questions are answerable
with well-formed XML - and they're very relevant questions to ask for many
XML processing tasks.

> Any input would be appreciated.

Likewise - I'd be very interested to know what sort of things people were
interested in, as I've found an area where I have a need which others might
share. I'd like to contribute some effort to it.

Regards

John

===

To: jgray@beansindustry.co.uk
From: mlw <markw@mohawksoft.com>
Subject: [HACKERS] Re: From TODO, XML?
Date: Fri, 27 Jul 2001 20:11:35 -0400

jgray@beansindustry.co.uk wrote:
>
> In article <3B615336.D654E7E1@mohawksoft.com>, markw@mohawksoft.com (mlw)
> wrote:
> > I was looking over the todo list and saw that someone wanted to support
> > XML. I have some quick and dirty stuff that could be used.
>
> I'm not clear from the TODO what that "XML support" might involve.
> The reference to pg_dump suggests an XML dump format for databases. That
> only makes sense if we build an XML frontend that can load XML-based
> pg_dump files.
>
> I can't see any very useful application though, unless someone has a
> standard for database dumps using XML - I'd have thought that our current
> "list of SQL statements" dump is fine (and useful if you speak SQL)

Actually I have been thinking about a couple projects I have done. Vendors
like to think XML is a way to distribute databases. So a parser that can
scan a DTD and make a usable create table (...) line would be very helpful.
One which could compare a DTD to an existing SQL table and map XML data
correctly. (Or error if conversion from data to SQL types yields an error.)
During a database export, a SQL table could be used to create a DTD.

I was thinking along the line of being able to use XML as a fairly portable
import/export feature. Having this ability, as a generic solution, would
have made several tasks MUCH easier.

I would also like the XML parser to be fuzzy enough to take some bad XML
(because ALL XML is bad), because a lot of vendors like to distribute data
in bad XML.

===

To: markw@mohawksoft.com (mlw)
From: fche@redhat.com (Frank Ch. Eigler)
Subject: [HACKERS] Re: From TODO, XML?
Date: 27 Jul 2001 22:24:31 -0400

markw wrote:

: [...] Actually I have been thinking about a couple projects I have
: done. Vendors like to think XML is a way to distribute databases.

I would find it very helpful to see a table of what sorts of XML
functionality each major vendor supports.

: So a parser that can scan a DTD and make a usable create table (...)
: line would be very helpful. [...]

Hmm, but hierarchically structured documents such as XML don't map well to
a relational model. The former tend to be recursive (e.g., have more levels
of containment than the one or two that might be mappable to tables and
columns.)

: During a database export, a SQL table could be used to create a DTD.
: [...]
This mapping (relational model -> XML) is more straightforward.

===

To: "Frank Ch. Eigler" <fche@redhat.com>
From: mlw <markw@mohawksoft.com>
Subject: [HACKERS] Re: From TODO, XML?
Date: Sat, 28 Jul 2001 03:29:12 -0400

"Frank Ch. Eigler" wrote:
>
> markw wrote:
>
> : [...] Actually I have been thinking about a couple projects I have
> : done. Vendors like to think XML is a way to distribute databases.
>
> I would find it very helpful to see a table of what sorts of XML
> functionality each major vendor supports.

Actually I was thinking of databases of data, not database systems.

> : So a parser that can scan a DTD and make a usable create table (...)
> : line would be very helpful. [...]
>
> Hmm, but hierarchically structured documents such as XML don't map
> well to a relational model. The former tend to be recursive (e.g.,
> have more levels of containment than the one or two that might be
> mappable to tables and columns.)

Yes!!! Exactly, being able to understand the recursive nature of XML and
create relations on the fly would be a very cool feature.

> : During a database export, a SQL table could be used to create a DTD.
> : [...]
>
> This mapping (relational model -> XML) is more straightforward.

Totally.

===

To: "mlw" <markw@mohawksoft.com>, "Frank Ch. Eigler" <fche@redhat.com>
From: "Ken Hirsch" <kenhirsch@myself.com>
Subject: Re: [HACKERS] Re: From TODO, XML?
Date: Sun, 29 Jul 2001 11:50:05 -0400

mlw <markw@mohawksoft.com> wrote:
> "Frank Ch. Eigler" wrote:
> > : So a parser that can scan a DTD and make a usable create table (...)
> > : line would be very helpful. [...]
> >
> > Hmm, but hierarchically structured documents such as XML don't map
> > well to a relational model. The former tend to be recursive (e.g.,
> > have more levels of containment than the one or two that might be
> > mappable to tables and columns.)
>
> Yes!!! Exactly, being able to understand the recursive nature of XML and
> create relations on the fly would be a very cool feature.
I think there is a pretty straightforward mapping, except for one possible
ambiguity.

If an element, say <address>, is contained within another element, say
<employee>, it could either be a column (or group of columns) in an
Employee table, or it could be a table Address which references Employee.

When you say "create relations on the fly", what exactly do you mean? I can
see it would be handy to have CREATE TABLE statements written for you, but
it seems likely that a human would want to edit them before the tables are
actually created. You cannot infer much type information from the DTD. I
don't think there's a way to infer a primary key from a DTD, so you would
want to either specify one or add a serial column (or perhaps that would
always be done automatically). An XML schema would have more information,
of course.

===

To: Ken Hirsch <kenhirsch@myself.com>
From: mlw <markw@mohawksoft.com>
Subject: Re: [HACKERS] Re: From TODO, XML?
Date: Sun, 29 Jul 2001 12:19:48 -0400

Ken Hirsch wrote:
>
> mlw <markw@mohawksoft.com> wrote:
>
> > "Frank Ch. Eigler" wrote:
> > > : So a parser that can scan a DTD and make a usable create table (...)
> > > : line would be very helpful. [...]
> > >
> > > Hmm, but hierarchically structured documents such as XML don't map
> > > well to a relational model. The former tend to be recursive (e.g.,
> > > have more levels of containment than the one or two that might be
> > > mappable to tables and columns.)
> >
> > Yes!!! Exactly, being able to understand the recursive nature of XML
> > and create relations on the fly would be a very cool feature.
>
> I think there is a pretty straightforward mapping, except for one
> possible ambiguity.
>
> If an element, say <address>, is contained within another element, say
> <employee>, it could either be a column (or group of columns) in an
> Employee table, or it could be a table Address which references Employee.
>
> When you say "create relations on the fly", what exactly do you mean?
> I can see it would be handy to have CREATE TABLE statements written for
> you, but it seems likely that a human would want to edit them before the
> tables are actually created. You cannot infer much type information from
> the DTD. I don't think there's a way to infer a primary key from a DTD,
> so you would want to either specify one or add a serial column (or
> perhaps that would always be done automatically). An XML schema would
> have more information, of course.

I have been thinking about this. A lot of guessing would have to be done,
of course. But, unless some extra information is specified, when you have
an XML record contained within another, the parser would have to generate
its own primary key and a sequence for each table. Obviously, the user
should be able to specify the primary key for each table, but lacking that
input, the XML parser/importer should do it automatically. So take this:

<employee>
  <name>Bill</name>
  <position>Programmer</position>
  <address>
    <number>1290</number>
    <street>
      <name>Canton Ave</name>
    </street>
    <town>
      <name>Milton</name>
    </town>
  </address>
</employee>

The above is almost impossible to convert to a relational format without
additional information or a good set of rules. However, we can determine
which XML titles are "containers" and which are "data": "employee" is a
container because it has sub-tags; "position" is "data" because it has no
sub-tags. We can recursively scan this hierarchy and decide which are
containers and which are data. Data gets assigned an appropriate SQL type,
and containers get separated from the parent container, with an integer
index put in their place. For each container, either a primary key is
specified or created on the fly. We insert sub-containers first and pop
back the primary key value, until we have the whole record. The primary key
could even be the OID.

A second strategy is to concatenate the hierarchy into the field name, as
street_name, town_name, and so on.

What do you think?
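[The second strategy - concatenating the hierarchy into field names - is
easy to sketch. Python here, purely illustrative; it uses the same
"container vs. data" test (does the element have sub-tags?), prefixes with
the full path (so address_street_name rather than just street_name), and
assumes no element repeats.]

```python
import xml.etree.ElementTree as ET

def flatten(elem, prefix=""):
    """Walk the tree: "data" elements (no sub-tags) become columns;
    "containers" only contribute their name to the column prefix."""
    row = {}
    for child in elem:
        name = (prefix + "_" if prefix else "") + child.tag
        if len(child):          # has sub-tags: a container
            row.update(flatten(child, name))
        else:                   # no sub-tags: data
            row[name] = (child.text or "").strip()
    return row
```

Feeding it the <employee> document above yields a single row dict:
{'name': 'Bill', 'position': 'Programmer', 'address_number': '1290',
'address_street_name': 'Canton Ave', 'address_town_name': 'Milton'}.
Repeated elements are exactly where this breaks down and the
table-per-container strategy becomes necessary.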
===

To: mlw <markw@mohawksoft.com>
From: Bruce Momjian <pgman@candle.pha.pa.us>
Subject: Re: [HACKERS] Re: From TODO, XML?
Date: Mon, 30 Jul 2001 00:00:36 -0400 (EDT)

> I have been fighting, for a while now, with idiot data vendors that think
> XML is a cure-all. The problem is that XML is a hierarchical format
> whereas SQL is a relational format.
>
> It would be good to get pg_dump to write an XML file and DTD, but getting
> external sources of XML into PostgreSQL is WAY more complicated. If an
> XML import is to be useful beyond just a different format for pg_dump,
> there has to be some intelligent database construction based on the XML
> information.
>
> Go to mp3.com, and download some of their XML format data: first, it is
> bad XML; second, it is hierarchical.
>
> I have managed to get several XML files into PostgreSQL by writing a
> parser, and it is a huge hassle; the public parsers are too picky. I am
> thinking that a fuzzy parser, combined with some intelligence and an XML
> DTD reader, could make a very cool utility, one which I have not been
> able to find.
>
> Perhaps it is a two-stage process? First pass creates a schema which can
> be modified/corrected, the second pass loads the data.

Can we accept only relational XML? Does that buy us anything? Are the other
database vendors outputting hierarchical XML? Are they using
foreign/primary keys to do it?

===

To: Bruce Momjian <pgman@candle.pha.pa.us>
From: mlw <markw@mohawksoft.com>
Subject: Re: [HACKERS] Re: From TODO, XML?
Date: Mon, 30 Jul 2001 00:01:56 -0400

Bruce Momjian wrote:
>
> > > I would find it very helpful to see a table of what sorts of XML
> > > functionality each major vendor supports.
> >
> > Actually I was thinking of databases of data, not database systems.
>
> I think we can go two ways. Allow COPY/pg_dump to read/write XML, or
> write some perl scripts to convert XML to/from our pg_dump format. The
> latter seems quite easy and fast.
I have been fighting, for a while now, with idiot data vendors that think
XML is a cure-all. The problem is that XML is a hierarchical format whereas
SQL is a relational format.

It would be good to get pg_dump to write an XML file and DTD, but getting
external sources of XML into PostgreSQL is WAY more complicated. If an XML
import is to be useful beyond just a different format for pg_dump, there
has to be some intelligent database construction based on the XML
information.

Go to mp3.com, and download some of their XML format data: first, it is bad
XML; second, it is hierarchical.

I have managed to get several XML files into PostgreSQL by writing a
parser, and it is a huge hassle; the public parsers are too picky. I am
thinking that a fuzzy parser, combined with some intelligence and an XML
DTD reader, could make a very cool utility, one which I have not been able
to find.

Perhaps it is a two-stage process? First pass creates a schema which can be
modified/corrected, the second pass loads the data.

===

To: mlw <markw@mohawksoft.com>
From: Gavin Sherry <swm@linuxworld.com.au>
Subject: Re: [HACKERS] Re: From TODO, XML?
Date: Mon, 30 Jul 2001 15:43:26 +1000 (EST)

On Mon, 30 Jul 2001, mlw wrote:

> Bruce Momjian wrote:
> >
> > > > I would find it very helpful to see a table of what sorts of XML
> > > > functionality each major vendor supports.
> > >
> > > Actually I was thinking of databases of data, not database systems.
> >
> > I think we can go two ways. Allow COPY/pg_dump to read/write XML, or
> > write some perl scripts to convert XML to/from our pg_dump format. The
> > latter seems quite easy and fast.
>
> I have managed to get several XML files into PostgreSQL by writing a
> parser, and it is a huge hassle; the public parsers are too picky. I am
> thinking that a fuzzy parser, combined with some intelligence and an XML
> DTD reader, could make a very cool utility, one which I have not been
> able to find.

I have had the same problem.
The best XML parser I could find was the gnome-xml library at xmlsoft.org
(libxml). I am currently using this in C to replicate a client's legacy
Notes system on to Postgres. In this case I was lucky in as much as I had
some input on the XML namespace etc. XML was used because they had already
designed an XML-based dump utility.

However, the way XML is being used is very basic. Only creation of tables,
insert and delete are handled. Libxml works fine with this, however,
handling DTD/XML parsing, UTF-8, UTF-16 and iso-8859-1, validation etc.

The main problem then is that every vendor has a different XML namespace.
If people really want to pursue this, the best thing to do would be to try
to work with other open source database developers and design a suitable
XML namespace for open source databases. Naturally, there will be much
contention here about the most suitable this and that. It will be difficult
to get a real spec going and will probably be much more trouble than it is
worth. As such, if this fails, then we cannot expect Oracle, IBM, Sybase,
MS and the rest to ever do it.

Perhaps then it would be sufficient for pg_dump/restore to identify the
namespace of a given database dump and parse it according to that
namespace. Based on command-line arguments, pg_restore/dump could either
die/ignore/transmogrify instructions in the XML which PG does not support
or recognise. It would also be useful if pg_dump could dump data from
postgres in the supported XML namespaces.

So it essentially comes down to how useful it will be and who has time to
code it up =) (as always).

**Creative Solution**

For those who have too much time on their hands and have managed to
untangle some of the syntax in the W3C XSLT 1.0 specification, how about an
XSL stylesheet to transform an XML-based database dump from some third
party into (postgres) SQL. Erk! There would have to be an award for such a
thing ;-).
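[Short of an XSL stylesheet, the dump-to-SQL direction can be sketched
directly. The <dump>/<table>/<row> layout below is hypothetical - invented
for illustration, not any vendor's actual namespace - and the quoting is
deliberately naive.]

```python
import xml.etree.ElementTree as ET

def dump_to_sql(xml_text):
    """Turn a flat XML dump into INSERT statements, one per <row>."""
    root = ET.fromstring(xml_text)
    stmts = []
    for table in root.findall("table"):
        tname = table.get("name")
        for row in table.findall("row"):
            # each child element of <row> is one column
            cols = ", ".join(c.tag for c in row)
            vals = ", ".join(
                "'" + (c.text or "").replace("'", "''") + "'" for c in row)
            stmts.append(
                "INSERT INTO %s (%s) VALUES (%s);" % (tname, cols, vals))
    return stmts
```

A real version would need the namespace identification and
die/ignore/transmogrify behaviour described above; this only shows that the
transformation itself is a small amount of code once the namespace is
pinned down.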
===

To: Bruce Momjian <pgman@candle.pha.pa.us>
From: mlw <markw@mohawksoft.com>
Subject: [HACKERS] Re: From TODO, XML?
Date: Mon, 30 Jul 2001 04:47:19 -0400

Bruce Momjian wrote:
>
> > I have been fighting, for a while now, with idiot data vendors that
> > think XML is a cure-all. The problem is that XML is a hierarchical
> > format whereas SQL is a relational format.
> >
> > It would be good to get pg_dump to write an XML file and DTD, but
> > getting external sources of XML into PostgreSQL is WAY more
> > complicated. If an XML import is to be useful beyond just a different
> > format for pg_dump, there has to be some intelligent database
> > construction based on the XML information.
> >
> > Go to mp3.com, and download some of their XML format data: first, it
> > is bad XML; second, it is hierarchical.
> >
> > I have managed to get several XML files into PostgreSQL by writing a
> > parser, and it is a huge hassle; the public parsers are too picky. I
> > am thinking that a fuzzy parser, combined with some intelligence and
> > an XML DTD reader, could make a very cool utility, one which I have
> > not been able to find.
> >
> > Perhaps it is a two-stage process? First pass creates a schema which
> > can be modified/corrected, the second pass loads the data.
>
> Can we accept only relational XML? Does that buy us anything? Are the
> other database vendors outputting hierarchical XML? Are they using
> foreign/primary keys to do it?

Then what's the point? Almost no one creates non-hierarchical XML. For the
utility to be useful beyond just a different format for pg_dump, it has to
deal with these issues and do the right thing.

===

To: Gavin Sherry <swm@linuxworld.com.au>
From: Karel Zak <zakkr@zf.jcu.cz>
Subject: Re: [HACKERS] Re: From TODO, XML?
Date: Mon, 30 Jul 2001 10:38:54 +0200

On Mon, Jul 30, 2001 at 03:43:26PM +1000, Gavin Sherry wrote:
> On Mon, 30 Jul 2001, mlw wrote:
>
> I have had the same problem.
> The best XML parser I could find was the gnome-xml library at xmlsoft.org
> (libxml). I am currently using this in C

What happens if you use a DOM-type XML parser for a large file? A dump from
an SQL DB can be really large. IMHO, a SAX-type XML parser is better for
this (dumping data from an SQL DB).

> an XSL stylesheet to transform an XML based database dump from some third

Yes, that's the right way to use XML.

Karel

===
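[Karel's SAX-vs-DOM point in practice: a DOM parse holds the whole dump in
memory, while a streaming parse can discard each record once it has been
handled. A Python sketch - not from the thread, just an illustration -
using ElementTree's incremental parser:]

```python
import io
import xml.etree.ElementTree as ET

def count_rows(stream, tag):
    """Stream through a dump, handling (here: just counting) each record
    and clearing it so memory use stays bounded."""
    n = 0
    for event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == tag:
            n += 1
            elem.clear()  # drop the finished subtree
    return n

# e.g. a dump with many <row> records, streamed from any file-like object:
data = "<dump>" + "<row><v>x</v></row>" * 1000 + "</dump>"
```

Here count_rows(io.StringIO(data), "row") walks all 1000 records while only
ever keeping one <row> subtree alive at a time, which is the property that
matters for multi-gigabyte database dumps.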