
Current group: borland.public.delphi.xml

Processing HTML as XML

From:Victor Hadianto
Subject:Processing HTML as XML
Date:Wed, 8 Dec 2004 22:07:42 +1100
Hello,

Does anyone know how to convert HTML strings to XML in Delphi? Any suggestions
for a good component to do this?

Thanks,

Victor
From:eshipman
Subject:Re: Processing HTML as XML
Date:Wed, 8 Dec 2004 08:39:23 -0600
In article <41b6e02a@newsgroups.borland.com>, victor@synop.com says...
> Hello,
>
> Anyone knows how to convert HTML strings in XML in Delphi? Any suggestions
> for a good component to do this?
>

You can't really do it programmatically because HTML is so,
let me say, unstructured.
I'd suggest doing it manually, starting with the W3C's
XHTML validator.
From:Jeff Rafter
Subject:Re: Processing HTML as XML
Date:Wed, 08 Dec 2004 08:27:16 -0800
> Anyone knows how to convert HTML strings in XML in Delphi? Any suggestions
> for a good component to do this?

The only good way I know of is to use the Tidy COM component. There is
also a parser called TagSoup which is written in C, I am not sure if
there is a way to get it into Delphi easily though.

Cheers,
Jeff Rafter
From:Victor Hadianto
Subject:Re: Processing HTML as XML
Date:Thu, 9 Dec 2004 07:25:50 +1100
>> Anyone knows how to convert HTML strings in XML in Delphi? Any
>> suggestions for a good component to do this?
>
> The only good way I know of is to use the Tidy COM component. There is
> also a parser called TagSoup which is written in C, I am not sure if there
> is a way to get it into Delphi easily though.

I'm surprised that there are no good components to do this. There are a
couple of good ones that we can use in Java and .Net. There's gotta be one
for Delphi :) I'll continue to dig around.

Since IE does this by creating the DOM, I wonder if we can
leverage IE to do this?

Regards,

Victor
From:Victor Hadianto
Subject:Re: Processing HTML as XML
Date:Mon, 13 Dec 2004 21:24:50 +1100
>>> Anyone knows how to convert HTML strings in XML in Delphi? Any
>>> suggestions for a good component to do this?
>>
>> The only good way I know of is to use the Tidy COM component. There is
>> also a parser called TagSoup which is written in C, I am not sure if
>> there is a way to get it into Delphi easily though.

I found this: http://www.elsdoerfer.net/delphi/?page=libtidy. It makes it
really easy to use Tidy from Delphi, and Tidy has an option to output XML.
Works perfectly :)

Victor
From:eshipman
Subject:Re: Processing HTML as XML
Date:Wed, 8 Dec 2004 15:58:28 -0600
In article <41b762f8@newsgroups.borland.com>, victor@synop.com says...
> >> Anyone knows how to convert HTML strings in XML in Delphi? Any
> >> suggestions for a good component to do this?
> >
> > The only good way I know of is to use the Tidy COM component. There is
> > also a parser called TagSoup which is written in C, I am not sure if there
> > is a way to get it into Delphi easily though.
>
> I'm surprised that there are no good components to do this, there are a
> couple of good ones that we can use in Java and .Net. There's gotta be one
> for Delphi :) I'll continue to dig around.
>
> I'm thinking since IE does this by creating the DOM I wonder if we can
> leverage IE to do this?

Victor, with the plethora of crappy HTML on the web, it would
be close to impossible to do this correctly just by using the
IE DOM. Even Tidy has problems converting faulty HTML to XHTML.

Heck, even most HTML designers don't code XHTML well.

I think you could use something like the extIEParser from the
delphi-webbrowser Yahoo group to parse the HTML. It has events for each
type of tag, so you could "convert" each tag to XHTML in an in-memory
copy of the file.
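
A rough illustration of the event-per-tag idea described above. This is a minimal sketch in Python using the standard library's html.parser, not Delphi and not extIEParser itself; the names XhtmlWriter, VOID and html_to_xhtml are made up for the example. It assumes the input has a single root element and that attribute names are already valid XML names. Comments and doctypes are simply dropped.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr

# Elements that never take a closing tag in HTML (partial list, an assumption
# made for this sketch).
VOID = {"area", "base", "br", "col", "hr", "img", "input", "link", "meta", "param"}


class XhtmlWriter(HTMLParser):
    """Hypothetical helper: re-emits each parser event as well-formed markup."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []        # output fragments
        self.open_tags = []  # stack of elements still waiting for a close tag

    def handle_starttag(self, tag, attrs):
        # Keep the first occurrence of each attribute; a bare attribute such
        # as <input checked> arrives with value None and becomes checked="checked".
        seen = {}
        for name, value in attrs:
            seen.setdefault(name, name if value is None else value)
        attr_text = "".join(" %s=%s" % (n, quoteattr(v)) for n, v in seen.items())
        if tag in VOID:
            self.out.append("<%s%s />" % (tag, attr_text))
        else:
            self.out.append("<%s%s>" % (tag, attr_text))
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Ignore close tags for void elements and stray close tags.
        if tag in VOID or tag not in self.open_tags:
            return
        # Close everything opened since the matching start tag.
        while self.open_tags:
            top = self.open_tags.pop()
            self.out.append("</%s>" % top)
            if top == tag:
                break

    def handle_data(self, data):
        # Character references were already decoded; re-escape &, < and >.
        self.out.append(escape(data))

    def result(self):
        self.close()  # flush any buffered text
        while self.open_tags:
            self.out.append("</%s>" % self.open_tags.pop())
        return "".join(self.out)


def html_to_xhtml(html):
    writer = XhtmlWriter()
    writer.feed(html)
    return writer.result()


if __name__ == "__main__":
    print(html_to_xhtml("<p>One<br>Two <b>bold<i>both</b> &amp; more<img src=a.png alt>"))
```

This only repairs nesting and quoting. It does nothing about the structural problems mentioned earlier in the thread, such as block elements inside inline ones, so the output is well-formed but not necessarily valid against any XHTML DTD.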
From:John McTaggart
Subject:Re: Processing HTML as XML
Date:Wed, 15 Dec 2004 08:18:12 -0500
> I think you could use something like the extIEParser from the
> delphi-webbrowser Yahoo group to parse the HTML, it has events for each
> type of tag, and "convert" the tag to XHTML in an in-memory copy of the
> file.

Exactly how I'd do it with my parser..

http://www.compnet101.com/atagparser

It would make it a trivial operation.

John McTaggart
From:Andrea Raimondi
Subject:Re: Processing HTML as XML
Date:Thu, 09 Dec 2004 20:58:17 +0100
eshipman wrote:
> Heck, even most HTML designers don't code XHTML well.

I'm not an HTML designer by profession, but I do some
HTML and XHTML. My pages mostly validate. When they don't, it's not
because of the structure but because of the content (which may be
custom).

Do you mean "validates with W3C" by "code XHTML well"?

Cheers,

Andrew
--
Online thoughts blog
http://araimondi.blogspot.com
From:eshipman
Subject:Re: Processing HTML as XML
Date:Thu, 9 Dec 2004 14:24:34 -0600
In article <41b8ad0f$1@newsgroups.borland.com>, rainaple@tin.it says...
> eshipman wrote:
> > Heck, even most HTML designers don't code XHTML well.
>
> I'm not an HTML designer by profession, but I do some
> HTML and XHTML. My pages mostly validate. When they don't, it's not
> because of the structure but for the content( which may be
> custom ).
>
> Do you mean "validates with W3C" by "code XHTML well"?
>

Well, I would say yes, but you have to remember that there are several
DTDs to use with XHTML, and you must validate against the DTD declared
in the document.
From:Andrea Raimondi
Subject:Re: Processing HTML as XML
Date:Thu, 09 Dec 2004 23:16:58 +0100
eshipman wrote:
> Well, I would say, yes, but you have to remember that there are several
> DTD's to use with XHTML and you must validate against the DTD included
> in the document.

That goes without saying. Validation *must* take place against the DTD
used in the document. I'm not aware of any other kind of validation.

Cheers,

Andrew
--
Online thoughts blog
http://araimondi.blogspot.com
From:SD
Subject:Re: Processing HTML as XML
Date:Tue, 14 Dec 2004 08:31:59 +0100
"Victor Hadianto" wrote:
> Anyone knows how to convert HTML strings in XML in Delphi? Any suggestions
> for a good component to do this?

I recently found a good HTML DOM implementation, which I use to
extract the text from HTML files and build an index database
of words.

Try it at:

http://sourceforge.net/projects/htmlp/

Enjoy...

PS:
There are also other good tools, but their sources are not
available without paying:

http://www.compnet101.com/atagparser/
http://www.yunqa.de/delphi/htmlparser/index.htm
http://www.jazarsoft.com/index.php

Bye Bye

Silverio Diquigiovanni
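
The extract-text-then-index workflow mentioned in this post can be sketched roughly as follows. This is an illustration in Python using only the standard library; it is not the htmlp project's API, and the names TextExtractor, extract_text and build_index are invented for the example. It assumes an in-memory index mapping each lowercased word to the set of documents that contain it.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects visible text, skipping script and style."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a <script> or <style> element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return " ".join(parser.parts)


def build_index(documents):
    """Map each lowercased word to the set of document names containing it."""
    index = defaultdict(set)
    for name, html in documents.items():
        for word in re.findall(r"\w+", extract_text(html).lower()):
            index[word].add(name)
    return index


if __name__ == "__main__":
    docs = {
        "a.html": "<html><head><style>p{color:red}</style></head>"
                  "<body><p>Hello <b>XML</b> world</p></body></html>",
        "b.html": "<p>Hello Delphi</p><script>var xml = 1;</script>",
    }
    for word, names in sorted(build_index(docs).items()):
        print(word, sorted(names))
```

A real index would normally live in a database table rather than a dictionary, and joining text fragments with spaces can split a word that is broken up by inline tags. Both are simplifications made for the sketch.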
From:Eric Zurcher
Subject:Re: Processing HTML as XML
Date:Fri, 17 Dec 2004 13:19:11 +1100
Hi Victor,

You might consider using libxml2, which implements a non-validating
HTML 4.0 parser whose API is compatible with that of its XML parser.
You can obtain libxml2 as a Windows binary from

http://www.zlatkovic.com/libxml.en.html

Delphi bindings for the DLL can be found at

http://sourceforge.net/projects/libxml2-pas/

(Note: I hope to provide an updated version of the Delphi bindings shortly.)

IMHO, the libxml2 parser is absolutely outstanding. It is very fast
(with my data, I find it to be about twice as fast as Microsoft's
parser), and is very convenient for cross-platform or cross-language
support.


Victor Hadianto wrote:
> Hello,
>
> Anyone knows how to convert HTML strings in XML in Delphi? Any suggestions
> for a good component to do this?
>
> Thanks,
>
> Victor
>

Eric Zurcher
CSIRO Livestock Industries
Canberra, Australia
   
