How can I extract XML text using Python BeautifulSoup?


I'm trying to extract dialog from the Folger Library Shakespeare TEI XML editions. A typical chunk of dialog looks like this:

<sp xml:id="sp-0024" who="#horatio">
  <speaker xml:id="spk-0024">
    <w xml:id="w0003030">horatio</w>
  </speaker>
  <ab xml:id="ab-0024">
    <join type="line" xml:id="ftln-0024" n="1.1.24" ana="#short" target="#w0003040 #c0003050 #w0003060 #c0003070 #w0003080 #c0003090 #w0003100 #p0003110"/>
    <w xml:id="w0003040" n="1.1.24">a</w>
    <c xml:id="c0003050" n="1.1.24"> </c>
    <w xml:id="w0003060" n="1.1.24">piece</w>
    <c xml:id="c0003070" n="1.1.24"> </c>
    <w xml:id="w0003080" n="1.1.24">of</w>
    <c xml:id="c0003090" n="1.1.24"> </c>
    <w xml:id="w0003100" n="1.1.24">him</w>
    <pc xml:id="p0003110" n="1.1.24">.</pc>
  </ab>
</sp>

I want output that looks like this: ['horatio','a piece of him.'] for the dialog of a particular character. In other words, I want to be able to input a Folger Shakespeare TEI XML file and output files such as gertrude.txt and horatio.txt, each containing the collected dialog of a particular character.

I can get all the dialog, stage directions, etc. of a particular speaker with soup.find_all(who=u'#gertrude'), but I can't seem to do anything else with the results (drill down further, get the text between tags, etc.) without re-parsing the data all over again. Here's what happens:

>>> gertrude = soup.find_all(who=u'#gertrude')
>>> gertrude.w
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'ResultSet' object has no attribute 'w'
>>> gertrude.get_text()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'ResultSet' object has no attribute 'get_text'

BeautifulSoup's .find_all() method returns a ResultSet object, which is a specialized kind of list. It can hold zero or more matches, so you need to either loop over the result set or use indexing to get at the individual elements it contains:

for speaker in soup.find_all(who=u'#gertrude'):
    print(speaker.get_text())  # each element is a Tag, so .get_text() works here
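To go from that loop to the per-character output asked for in the question, something along these lines might work. It is only a sketch under a few assumptions: the helper names (speech_text, collect_dialog, write_dialog_files) are made up for illustration, each who attribute is assumed to name a single speaker, and the built-in "html.parser" is used so the example has no extra dependency (for real TEI files the "xml" parser, which needs lxml, is probably the better choice). The text is rebuilt by joining the w, c and pc elements inside each ab, so whitespace between tags does not leak into the result.

```python
from collections import defaultdict
from pathlib import Path

from bs4 import BeautifulSoup

# The sample speech from the question, used here as stand-in input.
SAMPLE = """
<sp xml:id="sp-0024" who="#horatio">
  <speaker xml:id="spk-0024"> <w xml:id="w0003030">horatio</w> </speaker>
  <ab xml:id="ab-0024">
    <join type="line" xml:id="ftln-0024" n="1.1.24" ana="#short"
          target="#w0003040 #c0003050 #w0003060 #c0003070 #w0003080 #c0003090 #w0003100 #p0003110"/>
    <w xml:id="w0003040" n="1.1.24">a</w>
    <c xml:id="c0003050" n="1.1.24"> </c>
    <w xml:id="w0003060" n="1.1.24">piece</w>
    <c xml:id="c0003070" n="1.1.24"> </c>
    <w xml:id="w0003080" n="1.1.24">of</w>
    <c xml:id="c0003090" n="1.1.24"> </c>
    <w xml:id="w0003100" n="1.1.24">him</w>
    <pc xml:id="p0003110" n="1.1.24">.</pc>
  </ab>
</sp>
"""


def speech_text(sp):
    """Join the word (w), space (c) and punctuation (pc) elements of one <sp>."""
    ab = sp.find("ab")
    if ab is None:
        return ""
    # find_all() with a list matches any of the names, in document order.
    return "".join(el.get_text() for el in ab.find_all(["w", "c", "pc"]))


def collect_dialog(soup):
    """Map each speaker id (without the leading '#') to a list of speeches."""
    dialog = defaultdict(list)
    for sp in soup.find_all("sp"):
        # Assumption: 'who' holds exactly one reference such as '#horatio'.
        who = sp.get("who", "").lstrip("#")
        text = speech_text(sp)
        if who and text:
            dialog[who].append(text)
    return dict(dialog)


def write_dialog_files(dialog, out_dir="."):
    """Write one <speaker>.txt file per character, one speech per line."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for who, lines in dialog.items():
        (out / f"{who}.txt").write_text("\n".join(lines) + "\n", encoding="utf-8")


soup = BeautifulSoup(SAMPLE, "html.parser")
dialog = collect_dialog(soup)
print(dialog)  # {'horatio': ['a piece of him.']}
```

Calling write_dialog_files(dialog, "out") would then produce out/horatio.txt for this sample. The document is parsed only once; every later step works on the Tag objects that find_all() hands back.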
