From: Jonathan Stowe <gellyfish@gellyfish.com>
Subject: Re: Converting HTML to text, harder than the FAQ implies
Date: Wed, 05 Jul 2000 16:25:34 GMT

On Wed, 05 Jul 2000 10:59:59 -0400, Jeff Boes wrote:

> I'll state right up front that I've read FAQ 9.7, "How do I fetch an
> HTML file?" and its description of how to fetch (and parse) HTML. The
> oft-quoted example (listed here for those looking for that level of
> answer, which is unfortunately not me) is:
>
>     use LWP::Simple;
>     use HTML::Parse;
>     use HTML::FormatText;
>     my ($html, $ascii);
>     $html = get("<some-URL>");
>     defined $html
>         or die "Can't fetch HTML from <some-URL>";
>     $ascii = HTML::FormatText->new->format(parse_html($html));
>     print $ascii;
>
> This works fine for simple HTML documents. Or does it? Compare the
> output of the approach above with
>
>     print `lynx -dump <some-URL>`;
>
> if the URL contains this document:
>
>     <html>
>     <body>
>     <form>
>     Hi there! You won't see me!
>     </form>
>     </body>
>
> Lynx will render this, but the HTML::FormatText approach just gives you
>
>     [FORM NOT SHOWN]
>
> A similar shortcoming is displayed if you include a table in the
> document. These are documented shortcomings of HTML::FormatText. This
> is disappointing, to put it mildly.
>
> Therefore, I'm looking for an approach that will let me render HTML as
> plain text, in a predictable format. It needs to handle both <form> and
> <table>. It doesn't HAVE to match lynx's output, although that would be
> a sizeable bonus. I could use lynx for this task (and in fact I am),
> but the start-up costs associated with running lynx get prohibitive
> when you are talking about doing thousands of URLs.

Don't worry about HTML::FormatText. Use HTML::Parser directly and work
out how you are going to output things yourself - see some of the
examples at <http://www.gellyfish.com/htexample/> for starters - then
you can work the rest out for yourself.

/J\

From: Jeff Boes <jboes@eoexchange.com>
Newsgroups: comp.lang.perl.misc
Subject: Re: Converting HTML to text, harder than the FAQ implies
Date: Wed, 05 Jul 2000 13:36:38 -0400
Organization: EoExchange

Jeff Boes wrote:
>
> Therefore, I'm looking for an approach that will let me render HTML as
> plain text, in a predictable format. It needs to handle both <form> and
> <table>. It doesn't HAVE to match lynx's output, although that would be
> a sizeable bonus. I could use lynx for this task (and in fact I am),
> but the start-up costs associated with running lynx get prohibitive
> when you are talking about doing thousands of URLs.

FYI, following up my own post here:

The eg/ directory included with HTML::Parser has a simple script that I
modified to let me get all the text from an HTML document, but it still
doesn't do anything remotely close to what lynx does in terms of
rendering.

--
Jeff Boes        |Computer science is no more about |jboes@eoexchange.com
Sr. S/W Engineer |computers than astronomy is about |616-381-9889 ext 18
Change Technology|telescopes.       --E. W. Dijkstra|616-381-4823 fax
EoExchange, Inc. |                                  |www.eoexchange.com
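A minimal sketch of the direct HTML::Parser approach Jonathan suggests,
assuming the HTML::Parser 3.x event API; the choice of block-level tags
and the whitespace handling below are illustrative guesses, not anything
taken from his examples. Unlike HTML::FormatText, it keeps the text
inside <form> and <table>:

    use strict;
    use LWP::Simple qw(get);
    use HTML::Parser;

    my $url  = shift or die "usage: $0 <URL>\n";
    my $html = get($url);
    defined $html or die "Can't fetch HTML from $url\n";

    my $text = '';
    my $parser = HTML::Parser->new(
        api_version => 3,
        # Accumulate decoded text wherever it occurs - forms and
        # tables included
        text_h  => [ sub { $text .= shift }, 'dtext' ],
        # Crude line breaks at block-ish tags
        start_h => [ sub { $text .= "\n" if shift =~ /^(?:p|br|div|li|tr|form|table)$/ },
                     'tagname' ],
    );
    $parser->ignore_elements(qw(script style));  # skip their contents
    $parser->parse($html);
    $parser->eof;

    $text =~ s/\n{3,}/\n\n/g;   # collapse runs of blank lines
    print $text;

Fed the <form> document above, this prints the "Hi there!" line that
HTML::FormatText suppresses; proper table layout would still need
column-tracking logic in the handlers.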
From: yf110@vtn1.victoria.tc.ca (Malcolm Dew-Jones)
Newsgroups: comp.lang.perl.misc
Subject: Re: Converting HTML to text, harder than the FAQ implies
Date: 5 Jul 2000 10:33:53 -0800
Organization: Victoria Telecommunity Network

Jeff Boes (jboes@eoexchange.com) wrote:

: I'll state right up front that I've read FAQ 9.7, "How do I fetch an
: HTML file?" and its description of how to fetch (and parse) HTML.

: Lynx will render this, but the HTML::FormatText approach just gives you

: sizeable bonus. I could use lynx for this task (and in fact I am), but
: the start-up costs associated with running lynx get prohibitive when you
: are talking about doing thousands of URLs.

I believe there are ways to make lynx do a bunch of files all at once.
It will even walk a document tree, dumping each document. This would
avoid the startup overhead.

I suspect that --traversal would help. Place all the links in a single
file and then do a lynx --traversal to get them all in one go (each HTML
page is stored in a separate file).

If you do go the Perl route, then add your handlers onto the
HTML::FormatText methods and contribute them to CPAN, because they would
be useful. (Or build a complete fancy formatter; that would be great
too.)
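A rough sketch of that batch idea, assuming lynx's -crawl and -traversal
options; the local index-page trick is untested, and lynx may restrict
traversal to the realm of the starting document, so check its
documentation before relying on this:

    use strict;

    # Read the URLs to dump, one per line, and build a local page
    # linking to all of them
    my @urls = map { chomp; $_ } <STDIN>;
    open my $fh, '>', 'index.html' or die "index.html: $!\n";
    print {$fh} "<html><body>\n";
    print {$fh} qq{<a href="$_">$_</a>\n} for @urls;
    print {$fh} "</body></html>\n";
    close $fh;

    # With -crawl, -traversal writes each page lynx visits as plain
    # text to a lnk########.dat file in the current directory, so
    # lynx starts up only once for the whole batch
    system('lynx', '-crawl', '-traversal', 'index.html') == 0
        or die "lynx exited with status $?\n";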
From: Jonathan Stowe <gellyfish@gellyfish.com>
Newsgroups: comp.lang.perl.misc
Subject: Re: Converting HTML to text, harder than the FAQ implies
Date: 6 Jul 2000 00:00:01 +0100
Organization: Twenty First Century Gellyfish

On Wed, 05 Jul 2000 16:25:34 GMT Jonathan Stowe wrote:
>
> <http://www.gellyfish.com/htexample/>

<http://www.gellyfish.com/htexamples/>

D'oh!

/J\
--
yapc::Europe in association with the Institute Of Contemporary Arts
<http://www.yapc.org/Europe/> <http://www.ica.org.uk>

From: "Joe_Broz@transarc.com" <jbroz@transarc.com>
Newsgroups: comp.lang.perl.misc
Subject: Re: Converting HTML to text, harder than the FAQ implies
Date: Thu, 06 Jul 2000 17:15:12 +0100
Organization: Transarc Corporation
Reply-To: jbroz@transarc.com

Jeff Boes wrote:
>
> Jeff Boes wrote:
> >
> > Therefore, I'm looking for an approach that will let me render HTML
> > as plain text, in a predictable format. It needs to handle both
> > <form> and <table>. It doesn't HAVE to match lynx's output, although
> > that would be a sizeable bonus. I could use lynx for this task (and
> > in fact I am), but the start-up costs associated with running lynx
> > get prohibitive when you are talking about doing thousands of URLs.
>
> FYI, following up my own post here:
>
> The eg/ directory included with HTML::Parser has a simple script that
> I modified to let me get all the text from an HTML document, but it
> still doesn't do anything remotely close to what lynx does in terms
> of rendering.

That's because, AFAIK, HTML::Parser isn't designed to 'render' anything.
It's a parser. What you do with the resulting data is, as you know, up
to you.
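To make that concrete, here is everything HTML::Parser hands you for the
troublesome <form> document - a stream of events and nothing more (the
handler setup assumes the 3.x API; the output format is arbitrary):

    use strict;
    use HTML::Parser;

    my $p = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub { print "START $_[0]\n" }, 'tagname' ],
        end_h   => [ sub { print "END   $_[0]\n" }, 'tagname' ],
        text_h  => [ sub { print "TEXT  $_[0]\n" if $_[0] =~ /\S/ }, 'dtext' ],
    );
    $p->parse(q{<form>Hi there! You won't see me!</form>});
    $p->eof;

    # Prints:
    #   START form
    #   TEXT  Hi there! You won't see me!
    #   END   form

Turning those events into a blank line for a <form>, or aligned columns
for a <table>, is rendering, and it lives entirely in the handlers you
supply.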