html_sanitizing

This is part of The Pile, a partial archive of some open source mailing lists and newsgroups.



Date: Wed, 27 Sep 2000 20:10:26 -0700
From: Joey Hess <joey@kitenet.net>
To: svlug@svlug.org
Subject: Re: [svlug] Re: Netscape doesn't work

Rick Moen wrote:
> My _God_.  Look at all that crud!  You let a Web browser near _that_
> demented pile of table-ridden excreta from a misplaced DTP wretch?

I've figured out how to handle this type of HTML.

joey@kite:~>cat test.pl
use strict;
use warnings;
use HTML::Sanitizer;

# Whitelist of tags; each maps to the attributes it may keep
# (an empty list means the bare tag only).
my $s = HTML::Sanitizer->new(
    javascript => 0,
    comment    => 0,
    title => [], h1 => [], h2 => [], h3 => [], h4 => [], h5 => [],
    p => [], hr => [], li => [], ol => [], ul => [], br => [],
    b => [], i => [], em => [], strong => [],
    a => [qw{href name}],
    blockquote => [], pre => [], div => [], tt => [],
    form => [qw{action method}],
    input => [qw{type name value}],
    table => [qw{border summary}],
    tr => [], th => [], td => [], dl => [], dt => [], dd => [],
    img => [qw{alt src}],
    textarea => [qw{name rows cols wrap}],
);

# Slurp all input (files named on the command line, or stdin) and sanitize it.
print $s->sanitize(join '', <>);

joey@kite:~>perl test.pl ~/torture.html
<title>Alteon WebSystems Intelligent Webworking</title>
<table>
<tr>
        <td>
        <table>
        <tr>
                <td><img alt="" src="/images/logo_main_700.gif"></td>
        </tr>   
        <tr>
                <td><br><br></td>
        </tr>
        <tr>
                <td>
                <table border="0">
                <tr>
                        <td><b>|</b></td> 
                        <td><b><a href="/main.asp">English</a></b></td>
                        <td><b>|</b></td> 
                        <td><b><a href="/chinese.asp"><img src="/images/Chinese.gif"></a></b></td>
                        <td><b>|</b></td> 
                        <td><b><a href="/german.asp">Deutsch</a></b></td>
                        <td><b>|</b></td> 
                        <td><b><a href="/spanish.asp">Espa

doom@kzsu.stanford.edu