ruby - How do I parse this HTML code with Nokogiri?
I have the following HTML:

<h3><strong>adresse:</strong></h3>
<p>
hochschule darmstadt<br>
technologietransfercentrum<br>
d19, raum 221, 222<br>
schöfferstraße 10<br>
<b>64295 darmstadt</b><p>
<h3>kommunikationsdaten: </h3>
<p>

But the <p> and <br> tags are not closed.

How can I extract the address information:
hochschule darmstadt technologietransfercentrum d19, raum 221, 222 schöfferstraße 10 64295 darmstadt
This is a starting basis:

# encoding: utf-8
require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<h3><strong>adresse:</strong></h3>
<p>
hochschule darmstadt<br>
technologietransfercentrum<br>
d19, raum 221, 222<br>
schöfferstraße 10<br>
<b>64295 darmstadt</b><p>
<h3>kommunikationsdaten: </h3>
<p>
EOT

puts doc.errors
puts doc.to_html

When I run that code, I get:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<h3><strong>adresse:</strong></h3>
<p>
hochschule darmstadt<br>
technologietransfercentrum<br>
d19, raum 221, 222<br>
schöfferstraße 10<br><b>64295 darmstadt</b></p>
<p>
</p>
<h3>kommunikationsdaten: </h3>
<p></p>
</body></html>

Notice that Nokogiri has added the <html> and <body> tags. It has also closed the <p> tags by adding </p>. You can tell it to parse the HTML as a fragment, and not add the headers, by using this instead:
Nokogiri::HTML::DocumentFragment.parse

which generates:

<h3><strong>adresse:</strong></h3>
<p>
hochschule darmstadt<br>
technologietransfercentrum<br>
d19, raum 221, 222<br>
schöfferstraße 10<br><b>64295 darmstadt</b></p><p>
</p><h3>kommunikationsdaten: </h3>
<p></p>

There's still some fixup of the HTML happening, but it's basically the HTML that was passed in. Either way, the resulting HTML is technically correct.
As for finding the text in question: if there is only one <p> tag, or it's the first one:

doc.at('p').text
=> "\nhochschule darmstadt\ntechnologietransfercentrum\nd19, raum 221, 222\nschöfferstraße 1064295 darmstadt"

or:

doc.at('h3').next_sibling.next_sibling.text
=> "\nhochschule darmstadt\ntechnologietransfercentrum\nd19, raum 221, 222\nschöfferstraße 1064295 darmstadt"

Two next_sibling calls are needed. The first one finds the text node following the end of the <h3> node:

doc.at('h3').next_sibling
=> #<Nokogiri::XML::Text:0x3fef59dedfb8 "\n ">