How can I extract XML text using Python BeautifulSoup?
I'm trying to extract dialog from the Folger Library Shakespeare TEI XML editions. A typical chunk of dialog looks like this:
<sp xml:id="sp-0024" who="#horatio">
  <speaker xml:id="spk-0024">
    <w xml:id="w0003030">horatio</w>
  </speaker>
  <ab xml:id="ab-0024">
    <join type="line" xml:id="ftln-0024" n="1.1.24" ana="#short" target="#w0003040 #c0003050 #w0003060 #c0003070 #w0003080 #c0003090 #w0003100 #p0003110"/>
    <w xml:id="w0003040" n="1.1.24">a</w>
    <c xml:id="c0003050" n="1.1.24"> </c>
    <w xml:id="w0003060" n="1.1.24">piece</w>
    <c xml:id="c0003070" n="1.1.24"> </c>
    <w xml:id="w0003080" n="1.1.24">of</w>
    <c xml:id="c0003090" n="1.1.24"> </c>
    <w xml:id="w0003100" n="1.1.24">him</w>
    <pc xml:id="p0003110" n="1.1.24">.</pc>
  </ab>
</sp>
I want output that looks like this: ['horatio','a piece of him.'] for the dialog of a particular character. In other words, I want to be able to input a Folger Shakespeare TEI XML file and output files such as gertrude.txt and horatio.txt, each containing the collected dialog of a particular character.
I can get the dialog, stage directions, etc. of a particular speaker with soup.find_all(who=u'#gertrude').
But I can't seem to do anything else with the results, such as drilling down further or getting the text between tags, without re-parsing the data all over again. Here's what happens:
>>> gertrude=soup.find_all(who=u'#gertrude')
>>> gertrude.w
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'ResultSet' object has no attribute 'w'
>>> gertrude.get_text()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'ResultSet' object has no attribute 'get_text'
BeautifulSoup's .find_all() method returns a ResultSet object, which is a specialized kind of list. It holds 0 or more matches, so you need to either loop over the result set or use indexing to get at the individual elements it contains:
for speaker in soup.find_all(who=u'#gertrude'):
    print(speaker.get_text())  # each item is a Tag, so Tag methods work here
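To connect this back to the goal in the question, here is a minimal sketch of collecting each character's lines by looping over the result set. It is not a definitive implementation: the helper names collect_dialog and write_speaker_files are made up for illustration, the sample is the snippet from the question, and it assumes the built-in "html.parser" backend is good enough for these files (an XML parser such as lxml may be a better fit for full TEI documents). It also assumes spoken text lives in <w>, <c> and <pc> elements inside <ab>, as in the sample.

```python
from collections import defaultdict
from pathlib import Path

from bs4 import BeautifulSoup

# Sample speech copied from the question (not a full Folger file).
SAMPLE = """
<sp xml:id="sp-0024" who="#horatio">
  <speaker xml:id="spk-0024"><w xml:id="w0003030">horatio</w></speaker>
  <ab xml:id="ab-0024">
    <join type="line" xml:id="ftln-0024" n="1.1.24" ana="#short"/>
    <w xml:id="w0003040" n="1.1.24">a</w>
    <c xml:id="c0003050" n="1.1.24"> </c>
    <w xml:id="w0003060" n="1.1.24">piece</w>
    <c xml:id="c0003070" n="1.1.24"> </c>
    <w xml:id="w0003080" n="1.1.24">of</w>
    <c xml:id="c0003090" n="1.1.24"> </c>
    <w xml:id="w0003100" n="1.1.24">him</w>
    <pc xml:id="p0003110" n="1.1.24">.</pc>
  </ab>
</sp>
"""


def collect_dialog(xml_text):
    """Hypothetical helper: map each speaker id to a list of their speeches."""
    soup = BeautifulSoup(xml_text, "html.parser")
    dialog = defaultdict(list)
    # find_all() returns a ResultSet, so iterate to reach each <sp> Tag.
    for sp in soup.find_all("sp"):
        # Assumption: only word, space and punctuation elements inside <ab>
        # carry spoken text; whitespace between tags is layout only.
        parts = []
        for ab in sp.find_all("ab"):
            parts.extend(tag.get_text() for tag in ab.find_all(["w", "c", "pc"]))
        speech = "".join(parts).strip()
        if not speech:
            continue
        # "who" may list several speakers separated by spaces.
        for who in sp.get("who", "").split():
            dialog[who.lstrip("#")].append(speech)
    return dict(dialog)


def write_speaker_files(dialog, out_dir):
    """Hypothetical helper: write one <speaker>.txt file per character."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for speaker, speeches in dialog.items():
        (out / f"{speaker}.txt").write_text("\n".join(speeches) + "\n", encoding="utf-8")


dialog = collect_dialog(SAMPLE)
print(dialog)
```

Calling write_speaker_files(dialog, "out") would then create out/horatio.txt for this sample. Because every item in the ResultSet is an ordinary Tag, you can call find_all() or get_text() on it directly, so there is no need to re-parse the document to drill down further.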