 | | From: | Victor Hadianto | | Subject: | Processing HTML as XML | | Date: | Wed, 8 Dec 2004 22:07:42 +1100 |
|
|
 | Hello,
Does anyone know how to convert HTML strings to XML in Delphi? Any suggestions for a good component to do this?
Thanks,
Victor
|
|
 | | From: | eshipman | | Subject: | Re: Processing HTML as XML | | Date: | Wed, 8 Dec 2004 08:39:23 -0600 |
|
|
 | In article <41b6e02a@newsgroups.borland.com>, victor@synop.com says... > Hello, > > Anyone knows how to convert HTML strings in XML in Delphi? Any suggestions > for a good component to do this? >
You can't really do it programmatically because HTML is so, let me say, unstructured. I'd suggest doing it manually, starting with the XHTML validator on W3C.org.
|
|
 | | From: | Jeff Rafter | | Subject: | Re: Processing HTML as XML | | Date: | Wed, 08 Dec 2004 08:27:16 -0800 |
|
|
 | > Anyone knows how to convert HTML strings in XML in Delphi? Any suggestions > for a good component to do this?
The only good way I know of is to use the Tidy COM component. There is also a parser called TagSoup which is written in C, I am not sure if there is a way to get it into Delphi easily though.
Cheers, Jeff Rafter
|
|
 | | From: | Victor Hadianto | | Subject: | Re: Processing HTML as XML | | Date: | Thu, 9 Dec 2004 07:25:50 +1100 |
|
|
 | >> Anyone knows how to convert HTML strings in XML in Delphi? Any >> suggestions for a good component to do this? > > The only good way I know of is to use the Tidy COM component. There is > also a parser called TagSoup which is written in C, I am not sure if there > is a way to get it into Delphi easily though.
I'm surprised that there are no good components to do this; there are a couple of good ones that we can use in Java and .NET. There's gotta be one for Delphi :) I'll continue to dig around.
Since IE does this when it builds the DOM, I wonder if we can leverage IE to do this?
Regards,
Victor
|
|
 | | From: | Victor Hadianto | | Subject: | Re: Processing HTML as XML | | Date: | Mon, 13 Dec 2004 21:24:50 +1100 |
|
|
 | >>> Anyone knows how to convert HTML strings in XML in Delphi? Any >>> suggestions for a good component to do this? >> >> The only good way I know of is to use the Tidy COM component. There is >> also a parser called TagSoup which is written in C, I am not sure if >> there is a way to get it into Delphi easily though.
I found this: http://www.elsdoerfer.net/delphi/?page=libtidy (it makes it really easy to use Tidy from Delphi). Tidy has an option to output XML. Works perfectly :)
Victor
|
|
 | | From: | eshipman | | Subject: | Re: Processing HTML as XML | | Date: | Wed, 8 Dec 2004 15:58:28 -0600 |
|
|
 | In article <41b762f8@newsgroups.borland.com>, victor@synop.com says... > >> Anyone knows how to convert HTML strings in XML in Delphi? Any > >> suggestions for a good component to do this? > > > > The only good way I know of is to use the Tidy COM component. There is > > also a parser called TagSoup which is written in C, I am not sure if there > > is a way to get it into Delphi easily though. > > I'm surprised that there are no good components to do this, there are a > couple of good one that we can use in Java and .Net. There's gotta be one > for Delphi :) I'll continue to dig around. > > I'm thinking since IE does this by creating the DOM I wonder if we can > leverage IE to do this?
Victor, with the plethora of crappy HTML on the web, it would be close to impossible to do this correctly using just the IE DOM. Even Tidy has problems converting faulty HTML to XHTML.
Heck, even most HTML designers don't code XHTML well.
I think you could use something like the extIEParser from the delphi-webbrowser Yahoo group to parse the HTML. It has events for each type of tag, so you could "convert" each tag to XHTML in an in-memory copy of the file.
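To make the event-per-tag idea concrete, here is a minimal sketch of it. It is written in Python with the standard library's `html.parser` rather than in Delphi with extIEParser, purely because that keeps it self-contained; the names `XhtmlWriter` and `html_to_xml` are made up for this illustration. The handler for each tag event writes a well-formed equivalent into an in-memory buffer: tag names come out lowercased, attributes are quoted, void elements such as `<br>` are self-closed, and any tags left open are closed at the end.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr

# HTML elements that never have a closing tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "param", "source", "track", "wbr"}


class XhtmlWriter(HTMLParser):
    """Hypothetical helper: rewrites tag events as well-formed markup."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []    # in-memory copy of the converted document
        self.stack = []  # currently open (non-void) elements

    def handle_starttag(self, tag, attrs):
        parts = [tag]  # html.parser already lowercases tag names
        for name, value in attrs:
            # A bare attribute such as <input checked> has value None.
            parts.append(f"{name}={quoteattr(name if value is None else value)}")
        if tag in VOID:
            self.out.append("<" + " ".join(parts) + " />")
        else:
            self.out.append("<" + " ".join(parts) + ">")
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID or tag not in self.stack:
            return  # stray close tag: drop it
        # Close everything opened since the matching start tag.
        while True:
            open_tag = self.stack.pop()
            self.out.append(f"</{open_tag}>")
            if open_tag == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data))  # re-escape &, < and >

    def result(self):
        self.close()  # flush any buffered text
        while self.stack:
            self.out.append(f"</{self.stack.pop()}>")
        return "".join(self.out)


def html_to_xml(html):
    writer = XhtmlWriter()
    writer.feed(html)
    return writer.result()


print(html_to_xml("<P>Hello<BR>world <B>bold"))
# prints: <p>Hello<br />world <b>bold</b></p>
```

This only guarantees balanced tags and quoted attributes. It drops comments and the DOCTYPE, does not check attribute names, and does not know HTML's implied-end-tag rules, so `<li>one<li>two` comes out nested rather than as siblings. Real-world tag soup is where a tool like Tidy earns its keep.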
|
|
 | | From: | John McTaggart | | Subject: | Re: Processing HTML as XML | | Date: | Wed, 15 Dec 2004 08:18:12 -0500 |
|
|
 | > I think you could use something like the extIEParser from the > delphi-webbrowser Yahoo group to parse the HTML, it has events for each > type of tag, and "convert" the tag to XHTML in an in-memory copy of the > file.
Exactly how I'd do it with my parser..
http://www.compnet101.com/atagparser
It would make it a trivial operation.
John McTaggart
|
|
 | | From: | Andrea Raimondi | | Subject: | Re: Processing HTML as XML | | Date: | Thu, 09 Dec 2004 20:58:17 +0100 |
|
|
 | eshipman wrote: > Heck, even most HTML designers don't code XHTML well.
I'm not an HTML designer by profession, but I do some HTML and XHTML. My pages mostly validate. When they don't, it's not because of the structure but because of the content (which may be custom).
Do you mean "validates with W3C" by "code XHTML well"?
Cheers,
Andrew -- Online thoughts blog http://araimondi.blogspot.com
|
|
 | | From: | eshipman | | Subject: | Re: Processing HTML as XML | | Date: | Thu, 9 Dec 2004 14:24:34 -0600 |
|
|
 | In article <41b8ad0f$1@newsgroups.borland.com>, rainaple@tin.it says... > eshipman wrote: > > Heck, even most HTML designers don't code XHTML well. > > I'm not an HTML designer by profession, but I do some > HTML and XHTML. My pages mostly validate. When they don't, it's not > because of the structure but for the content( which may be > custom ). > > Do you mean "validates with W3C" by "code XTML well"? >
Well, I would say yes, but you have to remember that there are several DTDs to use with XHTML, and you must validate against the DTD included in the document.
|
|
 | | From: | Andrea Raimondi | | Subject: | Re: Processing HTML as XML | | Date: | Thu, 09 Dec 2004 23:16:58 +0100 |
|
|
 | eshipman wrote: > Well, I would say, yes, but you have to remember that there are several > DTD's to use with XHTML and you must validate against the DTD included > in the document.
That goes without saying. Validation *must* take place against the DTD used in the document. I'm not aware of any other kind of validation.
Cheers,
Andrew -- Online thoughts blog http://araimondi.blogspot.com
|
|
 | | From: | SD | | Subject: | Re: Processing HTML as XML | | Date: | Tue, 14 Dec 2004 08:31:59 +0100 |
|
|
 | "Victor Hadianto" wrote: > Anyone knows how to convert HTML strings in XML in Delphi? Any suggestions > for a good component to do this?
I recently found and used a good HTML DOM implementation, which I use to extract the text from HTML files and build an index database of words.
Try it at:
http://sourceforge.net/projects/htmlp/
Enjoy...
PS: There are other good tools too, but their sources are not available without paying:
http://www.compnet101.com/atagparser/
http://www.yunqa.de/delphi/htmlparser/index.htm
http://www.jazarsoft.com/index.php
Bye Bye
Silverio Diquigiovanni
|
|
 | | From: | Eric Zurcher | | Subject: | Re: Processing HTML as XML | | Date: | Fri, 17 Dec 2004 13:19:11 +1100 |
|
|
 | Hi Victor,
You might consider using libxml2, which implements an HTML 4.0 non-verifying parser with an API compatible with that of its XML parser. You can obtain libxml2 as a Windows binary from
http://www.zlatkovic.com/libxml.en.html
Delphi bindings for the DLL can be found at
http://sourceforge.net/projects/libxml2-pas/
(Note: I hope to provide an updated version of the Delphi bindings shortly.)
IMHO, the libxml2 parser is absolutely outstanding. It is very fast (with my data, I find it to be about twice as fast as Microsoft's parser), and is very convenient for cross-platform or cross-language support.
Victor Hadianto wrote: > Hello, > > Anyone knows how to convert HTML strings in XML in Delphi? Any suggestions > for a good component to do this? > > Thanks, > > Victor >
Eric Zurcher CSIRO Livestock Industries Canberra, Australia
|
|