This is part of The Pile, a partial archive of some open source mailing lists and newsgroups.
To: modperl@apache.org
From: Marc Lehmann <pcg@goof.com>
Subject: mod_perl guide corrections: & in uris
Date: Mon, 12 Feb 2001 02:54:48 +0100

Stas told me to forward my mail to the list, since there was a large discussion about it. Since I now see that this seems to have been a kind of dispute and not an omission, I'll provide references to the standards below.

----- Forwarded message from Marc Lehmann <pcg@goof.com> -----

Subject: mod_perl guie corrections
From: Marc Lehmann <pcg@goof.com>
Date: Sun, 11 Feb 2001 20:24:59 +0100
To: stas@stason.org

In http://perl.apache.org/guide/browserbugs.html I read:

   Preventing QUERY_STRING from getting corrupted because of &entity key names:
   http://my.site.com/foo.pl?foo=bar&reg=foobar, then some browsers will interpret &reg as an SGML entity

This claims this is a browser bug, which it isn't. Browsers are perfectly fine to interpret the &reg as an entity when you embed this in the html source unquoted. What's wrong is feeding non-html code to the browser in the first place.

But as we all know, browsers always try to decipher html-like syntax even if it is incorrect. In the above case the browser might "fix" the broken html fragment by assuming "&reg=" is an entity. Other browsers might interpret it differently. Still others might just view the page as text since it isn't html.

Saying this is a browser bug will only encourage people to generate such broken urls, which will always be a problem with browsers adhering to the standard ;)

So it would be much better to educate people to actually generate correct html (by quoting & as &amp; for example).

(This is, btw, the #1 php bug on the web despite the php manual explicitly warning about this case ;) Interestingly, one rarely sees this bug in perl code, although the mod_perl guide implicitly says this would be correct code ;->)

----- End forwarded message -----

Now the rationale. Who defines HTML? What is standard HTML? First of all, there is no HTML standard.
The closest thing to one is the W3C HTML Recommendation. Since the W3C and nobody else defines HTML, I argue that the W3C HTML recommendations are the most important definition of HTML.

The current HTML version is XHTML1.0 (see http://www.w3.org/TR/html/). No XML parser will parse the above fragment, as it is clearly incorrect (see the XML definition at http://www.w3.org/TR/REC-xml).

Since the de-facto HTML version in use is HTML4.01, however, I will also give reasons why it is also incorrect HTML4.01 (which is an application of SGML).

First of all, in most SGML applications &reg would indeed be a valid entity reference and, if not defined, would generate a parse error.

In HTML, "&" is an active character like "<". Thinking that a browser must somehow "guess" whether it is used as entity start or not is like requesting that a browser must also guess that "<p=neu</p" might or might not contain a p element or that "<xx>" is not a valid html element and should therefore be displayed as text.

In 5.3.2 Character entity references (http://www.w3.org/TR/html4/charset.html) it is written:

   Authors wishing to put the "<" character in text should use "&lt;" (ASCII decimal 60) to avoid possible confusion with the beginning of a tag (start tag open delimiter).

And they apply the same advice to "&":

   Authors should use "&amp;" (ASCII decimal 38) instead of "&" to avoid confusion with the beginning of a character reference (entity reference open delimiter). Authors should also use "&amp;" in attribute values since character references are allowed within CDATA attribute values.

These parts of HTML4 use an informal description of HTML (e.g. it doesn't describe the SGML comment syntax fully). The HTML declaration (http://www.w3.org/TR/html4/HTML4.decl) makes the role of & explicit, but as I've written, formal SGML doesn't help at all.
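
To make the advice about "&amp;" in attribute values concrete, here is a minimal sketch in Python (not taken from the guide or from this thread). The helper name href_for is hypothetical, and html.unescape is used only as a stand-in for a lenient entity decoder: it follows the later HTML5 rules, and real browsers may behave differently, as noted above.

```python
from html import escape, unescape
from urllib.parse import urlencode


def href_for(base, params):
    """Hypothetical helper: build an <a> tag with a correctly escaped href."""
    url = base + "?" + urlencode(params)  # query parts are joined with a bare "&"
    # Escape for the HTML attribute context, so "&" becomes "&amp;".
    return '<a href="' + escape(url, quote=True) + '">link</a>'


params = [("foo", "bar"), ("reg", "foobar")]
raw_url = "http://my.site.com/foo.pl?" + urlencode(params)

# The URL as it must appear inside the html source.
tag = href_for("http://my.site.com/foo.pl", params)
print(tag)
# <a href="http://my.site.com/foo.pl?foo=bar&amp;reg=foobar">link</a>

# A lenient decoder fed the unescaped URL treats "&reg" as an entity.
print(unescape(raw_url))
# http://my.site.com/foo.pl?foo=bar®=foobar

# The escaped form decodes back to the intended URL.
print(unescape(escape(raw_url)) == raw_url)
# True
```

The point of the sketch is only that the escaping belongs to the step that writes the URL into html, not to the URL itself: the query string sent to the server still contains a plain "&".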

Finally, appendix B of the html4 standard says:

   Although URIs do not contain non-ASCII values (see [URI], section 2.1) authors sometimes specify them in attribute values expecting URIs (i.e., defined with %URI; in the DTD). For instance, the following href value is illegal: <A href="http://foo.org/H