java - How to unescape HTML entities but leave XML entities untouched?
Given this input:
<div>the price &lt; 5 &euro;</div>
It is valid HTML but not valid XML (because &euro; is not declared in the DTD). Valid XML would look like:
<div>the price &lt; 5 €</div>
Can anyone recommend a Java library that can help me unescape HTML entities and convert them to XML entities?
The list of HTML named character references is available at http://www.whatwg.org/specs/web-apps/current-work/multipage/entities.json
If you can tolerate the occasional mistake, you could go over the file and replace each named character reference that is not allowed in stand-alone XML with the corresponding numeric character reference.
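A minimal sketch of that replacement pass, assuming Java 9 or later. The class name EntityRewriter and the four-entry table are hypothetical and only for illustration; a real table would be loaded from entities.json, where some names map to more than one code point. The five references predefined by XML (lt, gt, amp, quot, apos) and any unknown names are left untouched.

```java
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public final class EntityRewriter {
    // The five entities every XML parser understands without a DTD.
    private static final Set<String> XML_PREDEFINED =
            Set.of("lt", "gt", "amp", "quot", "apos");

    // Illustrative subset only (an assumption for this sketch);
    // the full list lives in entities.json.
    private static final Map<String, Integer> HTML_NAMED = Map.of(
            "euro", 0x20AC,
            "nbsp", 0x00A0,
            "copy", 0x00A9,
            "eacute", 0x00E9);

    // Matches references such as &euro; (terminating semicolon required).
    private static final Pattern NAMED_REF =
            Pattern.compile("&([A-Za-z][A-Za-z0-9]*);");

    // Rewrites known HTML-only named references as decimal numeric references.
    public static String toXmlSafe(String html) {
        Matcher m = NAMED_REF.matcher(html);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String name = m.group(1);
            Integer codePoint = HTML_NAMED.get(name);
            String replacement =
                    (XML_PREDEFINED.contains(name) || codePoint == null)
                            ? m.group()                 // keep as-is
                            : "&#" + codePoint + ";";   // e.g. &#8364;
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Calling EntityRewriter.toXmlSafe on the input from the question turns &euro; into &#8364; and leaves &lt; alone. It is purely textual, so it knows nothing about which element the reference appears in.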
That simple approach can run into problems, though, if the input is HTML rather than XHTML:
<script>var y=1, lt = 3, x = y&lt; alert(x);</script>
This contains a script element whose content is not encoded using entities, so naively replacing &lt; here would break the script. Other elements such as <xmp> and <style> can have similar problems, as can CDATA sections in foreign XML elements.
If you need a really faithful conversion, or if your HTML is messy, your best bet might be to parse the HTML to a DOM using nu.validator and then use "How to pretty print XML from Java?" to convert the DOM to valid XML.
Even if your input is XHTML, you might need to worry about character sequences that look like entities in CDATA sections. Again, parse and re-render might be your best option.
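The re-render half can be sketched with the JDK alone. In the sketch below the DOM is built by hand as a stand-in for the parser's output, and the names DomToXml, serialize and sampleDocument are hypothetical. In practice the Document would come from the HTML parser; I believe nu.validator.htmlparser.dom.HtmlDocumentBuilder is the relevant class, but check the library's documentation.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public final class DomToXml {
    // Serializes a DOM as XML; the serializer does the escaping.
    public static String serialize(Document doc) {
        try {
            // Identity transform: copies the DOM to the output unchanged.
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.METHOD, "xml");
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("XML serialization failed", e);
        }
    }

    // Hand-built stand-in for what an HTML parser would return
    // (an assumption of this sketch, not the parser's real output).
    public static Document sampleDocument() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element div = doc.createElement("div");
            // Text nodes hold decoded characters, not entity references.
            div.setTextContent("the price < 5 \u20AC");
            doc.appendChild(div);
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException("Could not build document", e);
        }
    }
}
```

Because the DOM stores decoded characters, the serializer decides how to write them: the < in the text node comes out as &lt;, and no HTML-only named reference can survive the round trip.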