This is part of The Pile, a partial archive of some open source mailing lists and newsgroups.
Date: Wed, 27 Sep 2000 20:10:26 -0700
From: Joey Hess <joey@kitenet.net>
To: svlug@svlug.org
Subject: Re: [svlug] Re: Netscape doesn't work
Rick Moen wrote:
> My _God_. Look at all that crud! You let a Web browser near _that_
> demented pile of table-ridden excreta from a misplaced DTP wretch?
I've figured out how to handle this type of HTML.
joey@kite:~>cat test.pl
use HTML::Sanitizer;
# Allow-list: each permitted tag maps to the attributes it may keep.
$s=HTML::Sanitizer->new(
    javascript => 0,
    comment => 0,
    title => [], h1 => [], h2 => [], h3 => [], h4 => [], h5 => [],
    p => [], hr => [], li => [], ol => [], ul => [], br => [],
    b => [], i => [], em => [], strong => [],
    a => [qw{href name}],
    blockquote => [], pre => [], div => [], tt => [],
    form => [qw{action method}],
    input => [qw{type name value}],
    table => [qw{border summary}],
    tr => [], th => [], td => [], dl => [], dt => [], dd => [],
    img => [qw{alt src}],
    textarea => [qw{name rows cols wrap}],
);
# Read the whole input (file arguments or stdin) and print the cleaned HTML.
print $s->sanitize(join '', <>);
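
For readers wondering what an allow-list like the one in test.pl amounts to, here is a minimal sketch in core Perl only. It is not how HTML::Sanitizer works internally, which this post does not show. The names %allowed, rewrite_tag and sanitize_sketch are hypothetical. It also assumes well-formed tags with double-quoted attribute values, and it uses regexes rather than a real HTML parser, so treat it as an illustration only.

```perl
use strict;
use warnings;

# Hypothetical allow-list in the same shape as the one in test.pl:
# tag name => list of attributes to keep on that tag.
my %allowed = (
    b   => [],
    a   => [qw(href name)],
    img => [qw(alt src)],
);

# Rebuild one tag: drop it entirely if the tag is not allowed,
# otherwise keep only the allowed attributes.
# Assumption: attribute values are double-quoted.
sub rewrite_tag {
    my ($slash, $tag, $rest) = @_;
    $tag = lc $tag;
    my $keep = $allowed{$tag} or return '';
    return "</$tag>" if $slash;
    my %attr;
    while ($rest =~ /([A-Za-z-]+)\s*=\s*"([^"]*)"/g) {
        $attr{lc $1} = $2;
    }
    my $kept = join '', map { qq{ $_="$attr{$_}"} }
               grep { exists $attr{$_} } @$keep;
    return "<$tag$kept>";
}

# Strip comments, then pass every tag through rewrite_tag.
# Text between tags is left untouched.
sub sanitize_sketch {
    my ($html) = @_;
    $html =~ s/<!--.*?-->//gs;
    $html =~ s{<(/?)([A-Za-z][A-Za-z0-9]*)([^>]*)>}{rewrite_tag($1, $2, $3)}ge;
    return $html;
}

print sanitize_sketch(
    '<td bgcolor="#fff"><b><a href="/main.asp" onclick="x()">English</a></b></td>'
), "\n";
```

With this reduced allow-list the td tags and the onclick attribute are dropped while the link and its href survive. Anything meant for real input should use a proper parser instead of regexes.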
joey@kite:~>perl test.pl ~/torture.html
<title>Alteon WebSystems Intelligent Webworking</title>
<table>
<tr>
<td>
<table>
<tr>
<td><img alt="" src="/images/logo_main_700.gif"></td>
</tr>
<tr>
<td><br><br></td>
</tr>
<tr>
<td>
<table border="0">
<tr>
<td><b>|</b></td>
<td><b><a href="/main.asp">English</a></b></td>
<td><b>|</b></td>
<td><b><a href="/chinese.asp"><img src="/images/Chinese.gif"></a></b></td>
<td><b>|</b></td>
<td><b><a href="/german.asp">Deutsch</a></b></td>
<td><b>|</b></td>
<td><b><a href="/spanish.asp">Espa