python - Find all links within a div using lxml -
I'm writing a tool that needs to collect the URLs within a div on a web page, and no URLs outside that div. The simplified page looks like this:
<div id="bar">
  <a link dont want>
  <div id="foo">
    <lots of html>
    <h1 class="baz">
      <a href="link want">
    </h1>
    <h1 class="caz">
      <a href="link want">
    </h1>
  </div>
</div>
When I select the div in Firebug and copy its XPath I get: //*[@id="foo"]. So far so good, but I'm stuck at finding the URLs inside div foo. Please help me find a way to extract the URLs defined in the href attributes of the a elements.
Example code, similar to what I'm working on, using w3schools:
import mechanize
import lxml.html
import cookielib

br = mechanize.Browser()
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
br.set_handle_equiv(True)
br.set_handle_gzip(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
br.addheaders = [('User-agent', 'WatcherBot')]

r = br.open('http://w3schools.com/')
html = br.response().read()
root = lxml.html.fromstring(html)
hrefs = root.xpath('//*[@id="leftcolumn"]')  # found no solution yet, stuck
Thank you for your time!
You want this:

hrefs = root.xpath('//div[@id="foo"]//a/@href')

This will give you a list of the href values of all the a tags inside <div id="foo">, at any level of nesting.
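Here is a minimal, self-contained sketch of that XPath against a page modeled on the question's simplified HTML (the URLs are placeholders I made up for illustration): the `//a/@href` step returns the attribute values directly, and only descendants of the div matched by `//div[@id="foo"]` are considered, so the link outside the div is excluded.

```python
import lxml.html

# Sample page modeled on the question: one link outside div "foo",
# two links inside it (at different nesting levels).
html = """
<div id="bar">
  <a href="http://example.com/unwanted">don't want</a>
  <div id="foo">
    <h1 class="baz"><a href="http://example.com/want1">want</a></h1>
    <h1 class="caz"><a href="http://example.com/want2">want</a></h1>
  </div>
</div>
"""

root = lxml.html.fromstring(html)

# //div[@id="foo"] finds the target div anywhere in the document;
# //a/@href then selects the href of every descendant <a>, at any depth.
hrefs = root.xpath('//div[@id="foo"]//a/@href')
print(hrefs)
```

Because `@href` appears in the expression itself, xpath() returns plain strings rather than element objects, so no further attribute lookup is needed.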