comp.lang.perl.modules-rc_file_formats_xml_vs_human_readability

This is part of The Pile, a partial archive of some open source mailing lists and newsgroups.



From: JS Bangs <jaspax@u.washington.edu>
Subject: XML or home-grown format?
Newsgroups: comp.lang.perl.misc,comp.lang.perl.modules
Date: Thu, 24 Jul 2003 15:00:22 -0700
Organization: University of Washington

In the continuing development of Lingua::Phonology, I'm starting to
consider what the benefits would be of moving my file-parsing formats to
XML from the current custom format.

Currently, two of the sub-modules do some form of file-parsing, and the
formats they use are described at:

http://search.cpan.org/author/JASPAX/Lingua-Phonology-0.25/Phonology/Features.pm#loadfile
http://search.cpan.org/author/JASPAX/Lingua-Phonology-0.25/Phonology/Symbols.pm#loadfile

The existing formats are concise and human-readable, but completely
custom. As I'm thinking of adding file-parsing to Lingua::Phonology::Rules
(and perhaps other modules), I was looking for something more reusable,
general, and powerful (especially since the Rules submodule will require
some fairly complex parsing rules). If I use XML, I can pass parsing
duties off to XML::Whatever, but I'm concerned that the costs (in terms of
verbosity) will outweigh the benefits of portability and extensibility.

For example, I can currently write the following line in a file to be
parsed by Lingua::Phonology::Symbols:

d    +anterior -distributed voice

In XML, this might have to be as verbose as:

<symbol label="d`">
    <feature name="anterior" value="+" />
    <feature name="distributed" value="-" />
    <feature name="voice" />
</symbol>

Which is significantly heavier and less clear. I'm rather torn on this, so
I was wondering what insight the minds here have to offer. Many thanks--
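
For comparison, a line in the home-grown format shown above needs very
little code to parse. The following is only a minimal sketch, assuming
whitespace-separated tokens, an optional +/- prefix on each feature, and
a bare feature name being stored as 1. parse_symbol_line is a
hypothetical helper, not part of Lingua::Phonology, and the module's
actual loadfile rules may differ.

```perl
use strict;
use warnings;

# Hypothetical helper: split one line of the custom format into a
# symbol label and a hash of feature => value.
# Assumption: a bare feature name (no +/- prefix) is stored as 1.
sub parse_symbol_line {
    my ($line) = @_;
    my ($label, @tokens) = split ' ', $line;
    return unless defined $label;    # blank line: nothing to parse

    my %features;
    for my $token (@tokens) {
        if ($token =~ /^([+-])(\w+)$/) {
            $features{$2} = $1;      # e.g. anterior => '+'
        }
        elsif ($token =~ /^\w+$/) {
            $features{$token} = 1;   # bare name, e.g. voice
        }
        else {
            die "Unrecognized feature token '$token'\n";
        }
    }
    return ($label, \%features);
}

my ($label, $features) =
    parse_symbol_line("d    +anterior -distributed voice");
print "$label: ",
    join(', ', map { "$_=$features->{$_}" } sort keys %$features), "\n";
# prints "d: anterior=+, distributed=-, voice=1"
```

The cost of this approach is that every new construct (as the Rules
submodule would need) means more hand-written parsing code, which is the
trade-off under discussion.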

===
From: Rich <scriptyrich@yahoo.co.uk>
Subject: Re: XML or home-grown format?
Newsgroups: comp.lang.perl.misc,comp.lang.perl.modules
Followup-To: comp.lang.perl.misc
Date: Thu, 24 Jul 2003 23:41:01 +0000
Reply-To: scriptyrich@yahoo.co.uk

JS Bangs wrote:

snip

> In XML, this might have to be as verbose as:
> 
> <symbol label="d`">
>     <feature name="anterior" value="+" />
>     <feature name="distributed" value="-" />
>     <feature name="voice" />
> </symbol>
> 
> Which is significantly heavier and less clear. I'm rather torn on this, so
> I was wondering what insight the minds here have to offer. Many thanks--

I'd consider YAML whenever you need XML-like structures that poor old humans
might have to read/edit.

The slight downer is that YAML seems to be developing at a pace similar to
p6, though in both cases it'll be worth the wait.

===
From: usenet@megazone.org (MegaZone)
Subject: Re: XML or home-grown format?
Newsgroups: comp.lang.perl.misc,comp.lang.perl.modules
Date: 24 Jul 2003 23:51:03 GMT
Organization: WPI Discordian Society, Undocumented Cabal of the Accursed Saint Shiranto Joe

JS Bangs <jaspax@u.washington.edu> shaped the electrons to say:
>d    +anterior -distributed voice
>
>In XML, this might have to be as verbose as:
>
><symbol label="d`">
>    <feature name="anterior" value="+" />
>    <feature name="distributed" value="-" />
>    <feature name="voice" />
></symbol>

<symbol label="d" anterior="+" distributed="-" voice="+" />

Something like that is just as valid in XML, and XML::LibXML works
well. I've been using it for a few months now, and I'm really starting
to like it now that it has sunk into my brain and I don't have to keep
looking things up. :-)

Since XML requires attributes to have values I just used "+" for
voice, but you could do things like voice="voice", etc.

It really depends on what you're looking to use the data for - I just
created a file like:

<?xml version="1.0" encoding="UTF-8"?>
<CurrencyTable xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:noNamespaceSchemaLocation="pcCurrencyTable.xsd">
        <Version>1.0</Version>
        <CurrencyShift number="840" name="USD">2</CurrencyShift>
</CurrencyTable>

(there were a lot more CurrencyShift elements...)

Using XPath you can find the element based on one attribute and get
the value of another - as in this:
---
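use strict;
use warnings;
use XML::LibXML 1.0053;

my $xmlFile;
my $parser = XML::LibXML->new();

# Read the whole file into $xmlFile (lexical handle, three-argument open)
open (my $xmlconf, '<', './pcCurrencyTable.xml') ||
    die "Can't open table: $!";
while (<$xmlconf>) {
    $xmlFile .= $_;
}
close ($xmlconf);

my $dom = $parser->parse_string($xmlFile);
# Select the name attribute of the CurrencyShift whose number is 840;
# $xpath must be declared with "my" to compile under "use strict"
my $xpath = "//CurrencyTable/CurrencyShift[\@number='840']/\@name";
print( ($dom->findnodes($xpath))[0]->textContent() . "\n");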

----

That prints "USD".

(And I make no claim that is the most elegant way to do that, just
what came to me first.)

===
From: Bren <iambrenNOSPAM@sympatico.ca>
Subject: Re: XML or home-grown format?
Newsgroups: comp.lang.perl.misc,comp.lang.perl.modules,comp.text.xml
Date: Fri, 25 Jul 2003 18:41:55 -0400
Organization: Newsfeeds.com http://www.newsfeeds.com 100,000+ UNCENSORED Newsgroups.

On Fri, 25 Jul 2003 14:16:04 -0700, JS Bangs <jaspax@u.washington.edu>
wrote:

>> print( ($dom->findnodes($xpath))[0]->textContent() . "\n");
>>
>> ----
>>
>> That prints "USD".

Actually, that would print "USD
"

;-)


===

From: usenet@megazone.org (MegaZone)
Subject: Re: XML or home-grown format?
Newsgroups: comp.lang.perl.misc,comp.lang.perl.modules,comp.text.xml
Date: 25 Jul 2003 23:48:48 GMT
Organization: WPI Discordian Society, Undocumented Cabal of the Accursed Saint Shiranto Joe

Bren <iambrenNOSPAM@sympatico.ca> shaped the electrons to say:
>Actually, that would print "USD
>"

Yes.  Point. :-)

I actually changed my mind when writing the production code that the
test file was prep for, and used the ->getAttribute() method instead,
since I could first call ->hasAttribute() in an if clause and otherwise
set it to some default value, etc.

More than one way. ;-)
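
A rough sketch of the hasAttribute()/getAttribute() approach described
above, not the actual production code. The data is inlined so the
example is self-contained; the second CurrencyShift element, the
currency_name helper, and the 'UNKNOWN' default are all invented for
illustration.

```perl
use strict;
use warnings;
use XML::LibXML;

# Inlined sample data; the element with number="999" has no name
# attribute and is an assumption added to exercise the default branch.
my $xml = <<'XML';
<CurrencyTable>
    <Version>1.0</Version>
    <CurrencyShift number="840" name="USD">2</CurrencyShift>
    <CurrencyShift number="999">0</CurrencyShift>
</CurrencyTable>
XML

# Hypothetical helper: look up a CurrencyShift by number and return its
# name attribute, or a default when the attribute is missing.
sub currency_name {
    my ($dom, $number) = @_;
    my ($node) = $dom->findnodes("//CurrencyShift[\@number='$number']");
    return unless $node;             # no such element
    return $node->hasAttribute('name')
        ? $node->getAttribute('name')
        : 'UNKNOWN';                 # hypothetical default value
}

my $dom = XML::LibXML->new->parse_string($xml);
print "$_ => ", currency_name($dom, $_), "\n" for (840, 999);
# prints "840 => USD" and "999 => UNKNOWN"
```

Selecting the element rather than the attribute node keeps the XPath
simpler and leaves room for handling a missing attribute explicitly.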



===

From: JS Bangs <jaspax@u.washington.edu>
Subject: Re: XML or home-grown format?
Newsgroups: comp.lang.perl.misc,comp.lang.perl.modules,comp.text.xml
Date: Fri, 25 Jul 2003 14:16:04 -0700
Organization: University of Washington

I've added comp.text.xml to the cross-posting for this, since it's
probably more concerned with XML than anything else at this point. So far,
we've been discussing whether it's worth the trouble to move a custom file
format for the perl module Lingua::Phonology over to XML. I pointed out an
original example line like:

> >d    +anterior -distributed voice

Which would have to become:

> >In XML, this might have to be as verbose as:
> >
> ><symbol label="d`">
> >    <feature name="anterior" value="+" />
> >    <feature name="distributed" value="-" />
> >    <feature name="voice" />
> ></symbol>

To which MegaZone suggested the shorter version:

> <symbol label="d" anterior="+" distributed="-" voice="+" />
>
> Something like that is just as valid in XML,

My response:

The example you gave is *well-formed* XML, which is different from *valid*
XML. The problem is that your example could never be valid XML, because
the attributes needed to define a given <symbol> cannot be known ahead of
time in the module. Rather, the list of feature names is given in a
separate <featureset></featureset> section.

True, one could make the featureset declaration into a DTD, but that would
require the users of my module to write their own DTDs, which is too much
work for them. I'd rather leave the validation of features against the
featureset to the application--which I'm also writing, so it's not much of
a problem.

I could go your way, but it would require all XML files parsed by my
module to run in standalone mode, and would prevent writing any DTD that
could validate all such files.

> Using XPath you can find the element based on one attribute and get
> the value of another - as in this:
> ---
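> use strict;
> use warnings;
> use XML::LibXML 1.0053;
>
> my $xmlFile;
> my $parser = XML::LibXML->new();
>
> # Read the whole file into $xmlFile (lexical handle, three-argument open)
> open (my $xmlconf, '<', './pcCurrencyTable.xml') ||
>     die "Can't open table: $!";
> while (<$xmlconf>) {
>     $xmlFile .= $_;
> }
> close ($xmlconf);
>
> my $dom = $parser->parse_string($xmlFile);
> # Select the name attribute of the CurrencyShift whose number is 840;
> # $xpath must be declared with "my" to compile under "use strict"
> my $xpath = "//CurrencyTable/CurrencyShift[\@number='840']/\@name";
> print( ($dom->findnodes($xpath))[0]->textContent() . "\n");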
>
> ----
>
> That prints "USD".

Something like this could provide an elegant way for the Lingua::Phonology
module to check that a given file doesn't contain errors (i.e. that
all attributes or feature names given for a <symbol> match some feature
declared in the <featureset> section). Once I've decided on my format, I'll
have to consider exactly how to do this.
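
One way such a check might look, as a sketch only: the format is still
undecided, so the <phonology> root element, the layout of <featureset>,
and the undeclared_features helper are all assumptions made up for this
example. The misspelled feature "voiced" is deliberate, to show an error
being reported.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical file layout: declared features first, then symbols.
my $xml = <<'XML';
<phonology>
    <featureset>
        <feature name="anterior" />
        <feature name="distributed" />
        <feature name="voice" />
    </featureset>
    <symbol label="d">
        <feature name="anterior" value="+" />
        <feature name="distributed" value="-" />
        <feature name="voiced" />
    </symbol>
</phonology>
XML

# Hypothetical helper: return one message per feature that is used in a
# <symbol> but not declared in <featureset>.
sub undeclared_features {
    my ($dom) = @_;
    my %declared = map { ( $_->getAttribute('name') => 1 ) }
                   $dom->findnodes('//featureset/feature');
    my @errors;
    for my $symbol ($dom->findnodes('//symbol')) {
        my $label = $symbol->getAttribute('label');
        for my $feature ($symbol->findnodes('./feature')) {
            my $name = $feature->getAttribute('name');
            push @errors, "symbol '$label': undeclared feature '$name'"
                unless $declared{$name};
        }
    }
    return @errors;
}

my $dom = XML::LibXML->new->parse_string($xml);
print "$_\n" for undeclared_features($dom);
# prints "symbol 'd': undeclared feature 'voiced'"
```

This keeps validation in the application, as proposed above, so users
would not have to write a DTD of their own.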

===

From: "Julian Scarfe" <julian@avbrief.com>
Subject: Re: XML or home-grown format?
Newsgroups: comp.lang.perl.misc,comp.lang.perl.modules
Date: Sat, 26 Jul 2003 16:50:58 +0100
Organization: ntl Cablemodem News Service

"JS Bangs" <jaspax@u.washington.edu> wrote in message
news:Pine.A41.4.56.0307241439550.111292@dante03.u.washington.edu...

> For example, I can currently write the following line in a file to be
> parsed by Lingua::Phonology::Symbols:
>
> d    +anterior -distributed voice
>
> In XML, this might have to be as verbose as:
>
> <symbol label="d`">
>     <feature name="anterior" value="+" />
>     <feature name="distributed" value="-" />
>     <feature name="voice" />
> </symbol>
>
> Which is significantly heavier and less clear. I'm rather torn on this, so
> I was wondering what insight the minds here have to offer. Many thanks--

My guess is that you find this less clear because you're used to reading the
current format.  However:

<symbol label="d">
    <feature name="anterior" value="true" />
    <feature name="distributed" value="false" />
    <feature name="voice" />
</symbol>

means a great deal more to me than trying to work out what your +s and -s
mean.  The structure is immediately clear and it's not hard to edit using an
XML editor or even a simple text editor. I'd check out XML schema (rather
than playing with DTDs)  if you haven't already.

===


doom@kzsu.stanford.edu