Wednesday 4 January 2012

libtidy: convert HTML to XML

LIBTIDY: Tidy up your HTML code


Of the various uses of libtidy, this post covers one simple one.
It parses your HTML page, tidies it, and finally produces an XML document (XHTML, to be precise) ready to be fed to an XML parser such as libxml.

To do that, we can either pass the options to tidy on the command line, or write all of them down in a single file so that we only need to pass one command-line argument. You guessed it right: it's the config file path. :D



Config file used:



fix-bad-comments: yes
tidy-mark: no
write-back: yes
fix-uri: yes
hide-comments: yes
bare: yes
markup: yes
clean: yes
wrap-attributes: no
wrap-script-literals: yes
wrap-sections: no
input-encoding: ascii
output-encoding: utf8
error-file: errors.txt
indent: auto
indent-spaces: 2
indent-cdata: no

show-warnings: yes
show-body-only: no

break-before-br: no
uppercase-tags: no
uppercase-attributes: no
char-encoding: utf8
vertical-space: no

output-xml: no
input-xml: no
output-html: no
output-xhtml: yes
add-xml-decl: yes
add-xml-space: yes

new-inline-tags: cfif, cfelse, math, mroot,
  mrow, mi, mn, mo, msqrt, mfrac, msubsup, munderover,
  munder, mover, mmultiscripts, msup, msub, mtext,
  mprescripts, mtable, mtr, mtd, mth
#new-blocklevel-tags: cfoutput, cfquery
new-empty-tags: cfelse

Now that your config file is ready, it's time to run tidy and convert your HTML into a nearly-XML (XHTML) document.

Fire up your command prompt and type:
tidy -config tidyConfig.txt MyWebPage.html
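
If you would rather drive the same command from a script, here is a minimal sketch in Python. It only wraps the command line shown above. The function names build_tidy_command and run_tidy are made up for this example, and it assumes the tidy executable is on your PATH.

```python
import subprocess


def build_tidy_command(config_path, html_path, tidy_bin="tidy"):
    """Return the argument list for: tidy -config <config> <page>."""
    return [tidy_bin, "-config", config_path, html_path]


def run_tidy(config_path, html_path, tidy_bin="tidy"):
    """Run tidy on one page and return its exit status.

    tidy exits with 0 when the page was clean, 1 when it printed
    warnings and 2 when it hit errors, so a non-zero status is
    returned to the caller instead of being raised as an exception.
    """
    result = subprocess.run(build_tidy_command(config_path, html_path, tidy_bin))
    return result.returncode
```

Calling run_tidy("tidyConfig.txt", "MyWebPage.html") should then behave like typing the command by hand.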


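Now your HTML, oops, YOUR XML is ready. :)
Note that because the config sets write-back: yes, tidy writes the result back into MyWebPage.html itself, and its messages go to errors.txt.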
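
To check that the result really can be fed to an XML parser, here is a small sketch using Python's built-in xml.etree.ElementTree as a stand-in for libxml. The SAMPLE string is a hand-written approximation of what tidy's XHTML output looks like, not real tidy output, and page_title is a name invented for this example.

```python
import xml.etree.ElementTree as ET

# Namespace that XHTML documents declare on the <html> element.
XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hand-written stand-in for a tidied page (an assumption, not real tidy output).
SAMPLE = """<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>My Web Page</title>
  </head>
  <body>
    <p>Hello<br /> world</p>
  </body>
</html>
"""


def page_title(xhtml_bytes):
    """Parse XHTML as plain XML and return the <title> text, or None."""
    root = ET.fromstring(xhtml_bytes)
    title = root.find("x:head/x:title", {"x": XHTML_NS})
    return title.text if title is not None else None


print(page_title(SAMPLE.encode("utf-8")))
```

For a real page you would read the tidied file as bytes and pass that to page_title instead of SAMPLE. One caveat: a plain XML parser may reject named entities such as &nbsp; if they survive in the output, in which case tidy's numeric-entities option is worth a look.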


