Converting HTML to text

This is part of The Pile, a partial archive of some open source mailing lists and newsgroups.



From: Jonathan Stowe <gellyfish@gellyfish.com>
Subject: Re: Converting HTML to text, harder than the FAQ implies
Date: Wed, 05 Jul 2000 16:25:34 GMT

On Wed, 05 Jul 2000 10:59:59 -0400, Jeff Boes wrote:
> I'll state right up front that I've read FAQ 9.7, "How do I fetch an
> HTML file?" and its description of how to fetch (and parse) HTML. The
> oft-quoted example (listed here for those looking for that level of
> answer, which is unfortunately not me) is:
> 
> 	use LWP::Simple;
> 	use HTML::Parse;
> 	use HTML::FormatText;
> 	my ($html, $ascii);
> 	$html = get("<some-URL>");
> 	defined $html
> 	  or die "Can't fetch HTML from <some-URL>";
> 	$ascii = HTML::FormatText->new->format(parse_html($html));
> 	print $ascii;
> 
> works fine for simple HTML documents. Or does it? Compare the output of
> the approach above with
> 
> 	print `lynx -dump <some-URL>`;
> 
> if the URL contains this document:
> 
> <html>
> <body>
> <form>
> Hi there! You won't see me!
> </form>
> </body>
> 
> 
> Lynx will render this, but the HTML::FormatText approach just gives you
> 
> [FORM NOT SHOWN]
> 
> A similar shortcoming is displayed if you include a table in the
> document. These are documented shortcomings of HTML::FormatText. This is
> disappointing, to put it mildly.
> 
> Therefore, I'm looking for an approach that will let me render HTML as
> plain text, in a predictable format. It needs to handle both <form> and
> <table>. It doesn't HAVE to match lynx's output, although that would be a
> sizeable bonus. I could use lynx for this task (and in fact I am), but
> the start-up costs associated with running lynx get prohibitive when you
> are talking about doing thousands of URLs.
> 

Don't worry about HTML::FormatText; use HTML::Parser directly and work out
how you are going to output things yourself - see some of the examples
at <http://www.gellyfish.com/htexample/> for starters - then you can work
the rest out for yourself.
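
A minimal sketch of the handler-driven approach Jonathan is pointing at,
assuming HTML::Parser 3.x's event API (start_h/text_h/end_h handler specs);
the block-element list and the file name page.html are illustrative, not
taken from his examples:

    use strict;
    use HTML::Parser;

    my $text = '';
    my $skip = 0;    # depth inside <script>/<style>, whose text we drop

    my $p = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub {
            my ($tag) = @_;
            $skip++ if $tag eq 'script' or $tag eq 'style';
            # crude layout: break the line before block-ish elements
            $text .= "\n" if $tag =~ /^(?:p|br|tr|li|form|table|h[1-6])$/;
        }, 'tagname' ],
        end_h => [ sub {
            my ($tag) = @_;
            $skip-- if $tag eq 'script' or $tag eq 'style';
        }, 'tagname' ],
        text_h => [ sub { $text .= $_[0] unless $skip }, 'dtext' ],
    );
    $p->parse_file('page.html');    # or: $p->parse($html); $p->eof;
    print $text;

Unlike HTML::FormatText, nothing here refuses to descend into <form> or
<table>, so their text comes through; how prettily it comes through is
entirely up to your handlers.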

/J\

From: Jeff Boes <jboes@eoexchange.com>
Subject: Re: Converting HTML to text, harder than the FAQ implies
Date: Wed, 05 Jul 2000 13:36:38 -0400

Jeff Boes wrote:
> 
> Therefore, I'm looking for an approach that will let me render HTML as
> plain text, in a predictable format. It needs to handle both <form> and
> <table>. It doesn't HAVE to match lynx's output, although that would be a
> sizeable bonus. I could use lynx for this task (and in fact I am), but
> the start-up costs associated with running lynx get prohibitive when you
> are talking about doing thousands of URLs.
> 

FYI, following up my own post here:

The eg/ directory included with HTML::Parser has a simple script that I
modified to let me get all the text from an HTML document, but it still
doesn't do anything remotely close to what lynx does in terms of
rendering.
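
One cheap step toward lynx-like output from such a script (my suggestion,
not part of Jeff's modification), assuming the extracted text arrives as
paragraphs separated by blank lines: re-fill it with Text::Wrap, which
ships with Perl. A sketch:

    use strict;
    use Text::Wrap qw(wrap $columns);

    $columns = 72;    # target line width for the filled text
    local $/;         # slurp the whole extraction from stdin
    my $raw = <STDIN>;

    # normalise whitespace within each paragraph, then re-fill it
    for my $para (split /\n{2,}/, $raw) {
        $para =~ s/\s+/ /g;
        next unless $para =~ /\S/;
        print wrap('', '', $para), "\n\n";
    }

This gets you filled paragraphs but, as Jeff says, nothing like lynx's
table or form layout.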

-- 
Jeff Boes        |Computer science is no more about    |jboes@eoexchange.com
Sr. S/W Engineer |computers than astronomy is about    |616-381-9889 ext 18
Change Technology|telescopes. --E. W. Dijkstra         |616-381-4823 fax
EoExchange, Inc. |                                     |www.eoexchange.com

From: yf110@vtn1.victoria.tc.ca (Malcolm Dew-Jones)
Subject: Re: Converting HTML to text, harder than the FAQ implies
Date: 5 Jul 2000 10:33:53 -0800

Jeff Boes (jboes@eoexchange.com) wrote:
: I'll state right up front that I've read FAQ 9.7, "How do I fetch an
: HTML file?" and its description of how to fetch (and parse) HTML. The

: Lynx will render this, but the HTML::FormatText approach just gives you

: sizeable bonus. I could use lynx for this task (and in fact I am), but
: the start-up costs associated with running lynx get prohibitive when you
: are talking about doing thousands of URLs.

I believe there are ways to make lynx do a bunch of files all at once.  It
will even walk a document tree dumping each document.  This would avoid
the startup overhead.  I suspect that --traversal would help.  Place all
the links in a single file and then do a lynx --traversal to get them all
in one go (each html page is stored in a separate file).
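
A sketch of that batching idea in Perl, assuming lynx's -traversal and
-crawl options (which dump each visited page to a lnk*.dat file); the
file name batch.html is illustrative, and since traversal normally
confines itself to links under the starting document's realm, whether a
local start page can seed arbitrary remote links varies by lynx version
and configuration, so treat this as a starting point:

    use strict;

    my @urls = @ARGV;    # the pages to render

    # write a local page linking to everything we want dumped
    open my $fh, '>', 'batch.html' or die "batch.html: $!";
    print $fh "<html><body>\n";
    print $fh qq{<a href="$_">$_</a><br>\n} for @urls;
    print $fh "</body></html>\n";
    close $fh;

    # one lynx start-up for the whole batch
    system('lynx', '-traversal', '-crawl', 'batch.html') == 0
        or die "lynx exited with status $?";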

If you do go the Perl route, then add your handlers onto the
HTML::FormatText methods, and add them as some kind of contribution to
CPAN, because they would be useful.

(Or build a complete fancy formatter, that would be great too.)
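
A sketch of that subclassing route, assuming HTML::Formatter's convention
of dispatching each element to a ${tag}_start/${tag}_end method whose
return value says whether to descend into the element (the "[FORM NOT
SHOWN]" output quoted earlier suggests the stock form_start declines to);
the class name here is hypothetical:

    package HTML::FormatText::WithForms;
    use strict;
    use HTML::FormatText;
    use vars qw(@ISA);
    @ISA = qw(HTML::FormatText);

    # assumption: returning true tells the formatter to descend into
    # <form> rather than emitting "[FORM NOT SHOWN]", so the form's
    # text flows through the normal formatting path
    sub form_start { 1 }
    sub form_end   { 1 }

    package main;
    use HTML::Parse;

    my $tree = parse_html('<html><body><form>Hi!</form></body></html>');
    print HTML::FormatText::WithForms->new->format($tree);

Handling <table> the same way takes real layout work, which is presumably
why the stock formatter punts on it.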


From: Jonathan Stowe <gellyfish@gellyfish.com>
Subject: Re: Converting HTML to text, harder than the FAQ implies
Date: 6 Jul 2000 00:00:01 +0100

On Wed, 05 Jul 2000 16:25:34 GMT Jonathan Stowe wrote:
>
>    <http://www.gellyfish.com/htexample/>
>

     <http://www.gellyfish.com/htexamples/>

D'oh!

/J\
-- 
yapc::Europe in association with the Institute Of Contemporary Arts
   <http://www.yapc.org/Europe/>   <http://www.ica.org.uk>

From: "Joe_Broz@transarc.com" <jbroz@transarc.com>
Subject: Re: Converting HTML to text, harder than the FAQ implies
Date: Thu, 06 Jul 2000 17:15:12 +0100

Jeff Boes wrote:
> 
> Jeff Boes wrote:
> >
> > Therefore, I'm looking for an approach that will let me render HTML as
> > plain text, in a predictable format. It needs to handle both <form> and
> > <table>. It doesn't HAVE to match lynx's output, although that would be a
> > sizeable bonus. I could use lynx for this task (and in fact I am), but
> > the start-up costs associated with running lynx get prohibitive when you
> > are talking about doing thousands of URLs.
> >
> 
> FYI, following up my own post here:
> 
> The eg/ directory included with HTML::Parser has a simple script that I
> modified to let me get all the text from an HTML document, but it still
> doesn't do anything remotely close to what lynx does in terms of
> rendering.

That's because, AFAIK, HTML::Parser isn't designed to 'render' anything.
It's a parser. What you do with the resulting data is, as you know, up to
you.
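
The division of labour is easy to see if you print the raw event stream;
a minimal illustration, assuming HTML::Parser 3.x:

    use strict;
    use HTML::Parser;

    # the parser only reports events; any "rendering" is whatever
    # the handlers choose to do with them
    my $p = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub { print "START $_[0]\n" }, 'tagname' ],
        text_h  => [ sub { print "TEXT  $_[0]\n" }, 'dtext'   ],
        end_h   => [ sub { print "END   $_[0]\n" }, 'tagname' ],
    );
    $p->parse("<form>Hi there! You won't see me!</form>");
    $p->eof;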


the rest of The Pile (a partial mailing list archive)

doom@kzsu.stanford.edu