modperl_html_entity_minutia

This is part of The Pile, a partial archive of some open source mailing lists and newsgroups.

To: modperl@apache.org
From: Marc Lehmann <pcg@goof.com>
Subject: mod_perl guide corrections: & in uris
Date: Mon, 12 Feb 2001 02:54:48 +0100

Stas told me to forward my mail to the list, since there was a large
discussion about it. Since I now see that this seems to have been a kind
of dispute and not an omission, I'll provide references to the standards
below.

----- Forwarded message from Marc Lehmann <pcg@goof.com> -----

Subject: mod_perl guide corrections
From: Marc Lehmann <pcg@goof.com>
Date: Sun, 11 Feb 2001 20:24:59 +0100
To: stas@stason.org

in http://perl.apache.org/guide/browserbugs.html I read:

   Preventing QUERY_STRING from getting corrupted because of &entity key
   names:

   http://my.site.com/foo.pl?foo=bar&reg=foobar, then some browsers will
   interpret &reg as an SGML entity

This claims this is a browser bug, which it isn't. Browsers are perfectly
entitled to interpret the &reg as an entity when you embed this in the
html source unquoted. What's wrong is feeding non-html code to the
browser in the first place. But as we all know, browsers always try to
decipher html-like syntax even if it is incorrect. In the above case the
browser might "fix" the broken html fragment by assuming "&reg=" is an
entity. Other browsers might interpret it differently. Still others might
just view the page as text since it isn't html.
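
A minimal Python sketch of the lenient decoding described above. It
assumes that html.unescape from the standard library, which follows the
HTML5 character reference rules for text, is a fair stand-in for one
forgiving browser; real browsers differ, as noted.

```python
import html

# The query string from the guide's example, placed in HTML unescaped.
raw = "foo=bar&reg=foobar"

# html.unescape accepts a few legacy entity names such as "reg" even
# without the terminating ";", so "&reg" becomes the (R) sign U+00AE.
decoded = html.unescape(raw)

# The "reg" key is gone: the parameter name has been corrupted.
print(decoded)
print("reg=" in raw, "reg=" in decoded)
```

The decoded string no longer contains a "reg" parameter at all, which
is exactly the QUERY_STRING corruption the guide describes.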

Calling this a browser bug only encourages people to generate such
broken urls, which will always be a problem with browsers that adhere
to the standard ;)

So it would be much better to educate people to actually generate correct
html (by quoting & as &amp; for example).

(This is, btw, the #1 php bug on the web despite the php manual explicitly
warning about this case ;) Interestingly, one rarely sees this bug in perl
code, although the mod_perl guide implicitly says this would be correct
code ;->)

----- End forwarded message -----

Now the rationale. Who defines HTML? What is standard HTML? First of all,
there is no HTML standard. The thing that comes closest is the W3C HTML
Recommendation. Since the W3C and nobody else defines HTML, I argue that
the W3C HTML recommendations are the most important definition of HTML.

The current HTML version is XHTML1.0 (see http://www.w3.org/TR/html/). No
XML parser will parse the above fragment, as it is clearly incorrect (see
XML definition at http://www.w3.org/TR/REC-xml).
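
A short Python sketch of that claim, using the expat-based parser in
the standard library as one example of a conforming XML parser. The
surrounding <a> element is an assumption made only to give the url
somewhere to live.

```python
import xml.etree.ElementTree as ET

# The guide's url embedded unescaped, and the same url with "&" quoted.
bad = '<a href="http://my.site.com/foo.pl?foo=bar&reg=foobar">x</a>'
good = '<a href="http://my.site.com/foo.pl?foo=bar&amp;reg=foobar">x</a>'


def parses(fragment):
    """Return True if the fragment is accepted by the XML parser."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


print(parses(bad))    # the bare "&reg=" is rejected
print(parses(good))   # "&amp;" is accepted

# After parsing, the attribute value holds the intended url again.
href = ET.fromstring(good).get("href")
print(href)
```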

Since the de-facto HTML version in use is HTML4.01, however, I will also
give reasons why it is incorrect HTML4.01 as well (HTML4.01 being an
application of SGML). First of all, in most SGML applications &reg would
indeed be a valid entity reference and, if not defined, would generate a
parse error.

In HTML, "&" is an active character like "<". Thinking that a browser
must somehow "guess" at whether it is used as entity start or not is like
requesting that a browser must also guess that "<p=neu</p" might or might
not contain a p element or that "<xx>" is not a valid html element and
should therefore be displayed as text.

In 5.3.2 Character entity references
(http://www.w3.org/TR/html4/charset.html) it is written:

   Authors wishing to put the "<" character in text should use "&lt;" (ASCII
   decimal 60) to avoid possible confusion with the beginning of a tag (start
   tag open delimiter).

And they apply the same advice to "&":
   
   Authors should use "&amp;" (ASCII decimal 38) instead of "&" to avoid
   confusion with the beginning of a character reference (entity reference
   open delimiter). Authors should also use "&amp;" in attribute values since
   character references are allowed within CDATA attribute values.
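
A minimal Python sketch of following this advice when generating a
link. The helper name "link" and the link text are hypothetical; the
url and its parameters are the ones from the guide's example.

```python
import html
from urllib.parse import urlencode


def link(base, params, text):
    """Build an <a> element whose href is safe to embed in HTML."""
    # Step 1: build the url; "&" separates the query parameters.
    url = base + "?" + urlencode(params)
    # Step 2: quote the url for use inside an attribute value, so that
    # "&" becomes "&amp;" and quote characters are escaped as well.
    return '<a href="%s">%s</a>' % (html.escape(url, quote=True),
                                    html.escape(text))


anchor = link("http://my.site.com/foo.pl",
              [("foo", "bar"), ("reg", "foobar")],
              "go")
print(anchor)
```

The two steps are kept separate on purpose: the "&" belongs to the url,
while "&amp;" is only how that url is written down inside html.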

These parts of HTML4 use an informal description of HTML (e.g. they
don't describe the SGML comment syntax fully). The HTML declaration
(http://www.w3.org/TR/html4/HTML4.decl) makes the role of & explicit, but,
as I've written, formal SGML doesn't help at all.

Finally, in appendix B of the html4 standard it is written:

   Although URIs do not contain non-ASCII values (see [URI], section 2.1)
   authors sometimes specify them in attribute values expecting URIs (i.e.,
   defined with %URI; in the DTD). For instance, the following href value is
   illegal: <A href="http://foo.org/H