ruby - How do I parse this HTML code with Nokogiri?


I have the following HTML:

<h3><strong>adresse:</strong></h3>     <p> hochschule darmstadt<br> technologietransfercentrum<br> d19, raum 221, 222<br> schöfferstraße 10<br> <b>64295 darmstadt</b><p> <h3>kommunikationsdaten: </h3>  <p> 

But the <p> and <br> tags are not closed.

How can I extract the address information as:

hochschule darmstadt technologietransfercentrum d19, raum 221, 222 schöfferstraße 10 64295 darmstadt

Starting with this as a basis:

# encoding: utf-8
require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<h3><strong>adresse:</strong></h3>
    <p>
hochschule darmstadt<br>
technologietransfercentrum<br>
d19, raum 221, 222<br>
schöfferstraße 10<br>
<b>64295 darmstadt</b><p>
<h3>kommunikationsdaten: </h3>
 <p>
EOT

puts doc.errors
puts doc.to_html

I get this when I run the code:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<h3><strong>adresse:</strong></h3>
    <p>
hochschule darmstadt<br>
technologietransfercentrum<br>
d19, raum 221, 222<br>
schöfferstraße 10<br><b>64295 darmstadt</b></p>
<p>
</p>
<h3>kommunikationsdaten: </h3>
 <p></p>
</body></html>

Notice that Nokogiri has added the <html> and <body> tags. It has also closed the <p> tags by adding </p>. You can tell it to parse the HTML as a fragment, and not add the headers, by using this instead:

Nokogiri::HTML::DocumentFragment.parse

Which generates:

<h3><strong>adresse:</strong></h3>
    <p>
hochschule darmstadt<br>
technologietransfercentrum<br>
d19, raum 221, 222<br>
schöfferstraße 10<br><b>64295 darmstadt</b></p><p>
</p><h3>kommunikationsdaten: </h3>
 <p></p>

There's still some fixup of the HTML happening, but it's basically the HTML that was passed in. Either way, the resulting HTML is technically correct.

On to finding the text in question: if there is only one <p> tag, or it's the first one:

doc.at('p').text
=> "\nhochschule darmstadt\ntechnologietransfercentrum\nd19, raum 221, 222\nschöfferstraße 1064295 darmstadt"

Or:

doc.at('h3').next_sibling.next_sibling.text
=> "\nhochschule darmstadt\ntechnologietransfercentrum\nd19, raum 221, 222\nschöfferstraße 1064295 darmstadt"

Two next_sibling calls are needed. The first one finds the text node that follows the end of the <h3> node:

doc.at('h3').next_sibling
=> #<Nokogiri::XML::Text:0x3fef59dedfb8 "\n    ">
