java - How to unescape HTML entities but leave XML entities untouched?


Given this input:

<div>the price &lt; 5 &euro;</div> 

It is valid HTML but not valid XML (because &euro; is not declared in the DTD). Valid XML would look like:

<div>the price &lt; 5 &#8364;</div> 

Can you recommend a Java library that can help me unescape HTML entities and convert them to XML entities?

The list of HTML named character references is available at http://www.whatwg.org/specs/web-apps/current-work/multipage/entities.json

If you can tolerate an occasional mistake, you can go through the file and replace each named character reference that is not allowed in stand-alone XML with the corresponding numeric character reference.
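As a rough sketch of that replace-in-place approach, the class below rewrites named references with a regular expression and leaves the five XML predefined entities alone. The class name NamedEntityRewriter and the four-entry table are made up for illustration; a real table would presumably be loaded from the entities.json file mentioned above, and some names in that file map to more than one code point, which this sketch does not handle.

```java
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class NamedEntityRewriter {
    // The five entities predefined by XML; these are left untouched.
    private static final Set<String> XML_PREDEFINED =
            Set.of("lt", "gt", "amp", "quot", "apos");

    // Tiny illustrative subset (assumption); load the full list from entities.json in practice.
    private static final Map<String, Integer> HTML_NAMED = Map.of(
            "euro", 0x20AC,
            "nbsp", 0x00A0,
            "copy", 0x00A9,
            "eacute", 0x00E9);

    // Matches references of the form &name; (the trailing semicolon is required here).
    private static final Pattern NAMED_REF = Pattern.compile("&([A-Za-z][A-Za-z0-9]*);");

    static String rewrite(String input) {
        Matcher m = NAMED_REF.matcher(input);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String name = m.group(1);
            Integer codePoint = HTML_NAMED.get(name);
            // Keep XML predefined entities and unknown names exactly as written.
            String replacement = (XML_PREDEFINED.contains(name) || codePoint == null)
                    ? m.group()
                    : "&#" + codePoint + ";";
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Calling NamedEntityRewriter.rewrite on the input from the question turns &euro; into &#8364; and leaves &lt; as it is. Because it works on raw text, it has no idea which element it is inside.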

That simple approach can run into problems, though, if the input is HTML rather than XHTML:

<script>var y=1, lt = 3, x = y&lt; alert(x);</script> 

This contains a script element whose content is not encoded using entities, so naively replacing &lt; here would break the script. There are other elements, like <xmp> and <style>, that can have similar problems, as can CDATA sections in foreign XML elements.

If you need a faithful conversion, or if the HTML is messy, your best bet might be to parse the HTML to a DOM using nu.validator and then use the technique from "How to pretty print XML from Java?" to convert the DOM to valid XML.
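The re-rendering half of that idea can be sketched with the JDK's own javax.xml.transform API. The parsing step is left out here because the nu.validator API is not reproduced in this answer, so the sketch assumes you already hold an org.w3c.dom.Document; DomToXml and sampleDocument are made-up names, and the sample document is built by hand just to have something to serialize. Choosing US-ASCII as the output encoding is an assumption of this sketch: it makes the serializer write characters outside ASCII as numeric character references.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper, not part of any library.
final class DomToXml {
    // Serializes an already-parsed DOM as XML text.
    static String serialize(Document doc) {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.METHOD, "xml");
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            // Assumption: ASCII output, so non-ASCII characters become numeric references.
            t.setOutputProperty(OutputKeys.ENCODING, "US-ASCII");
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return new String(out.toByteArray(), StandardCharsets.US_ASCII);
        } catch (TransformerException e) {
            throw new IllegalStateException("serialization failed", e);
        }
    }

    // Stand-in for the DOM an HTML parser would produce from the question's input.
    static Document sampleDocument() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element div = doc.createElement("div");
            div.setTextContent("the price < 5 \u20AC");
            doc.appendChild(div);
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("no DOM implementation available", e);
        }
    }
}
```

Since the serializer works on the parsed tree rather than on raw text, the less-than sign in the text node is escaped on output, and the content of elements such as script is never subjected to a textual search and replace.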

Even if the input is XHTML, you might need to worry about character sequences that look like entities in CDATA sections. Again, parsing and re-rendering might be the best option.

