python - Find all links within a div using lxml


I'm writing a tool that needs to collect all URLs within a div on a web page, but no URLs outside that div. Simplified, the page looks like this:

<div id="bar">
   <a link I don't want>
   <div id="foo">
      <lots of html>
      <h1 class="baz">
         <a href="link I want">
      </h1>
      <h1 class="caz">
         <a href="link I want">
      </h1>
   </div>
</div>

When I select the div in Firebug and copy its XPath, I get: //*[@id="foo"]. So far so good. However, I'm stuck trying to find all the URLs inside the div foo. Please help me find a way to extract the URL defined by href in the a elements.

Example code, similar to what I'm working on, using w3schools:

import mechanize
import lxml.html
import cookielib  # Python 2 module; it is named http.cookiejar on Python 3

# Browser with a cookie jar attached
br = mechanize.Browser()
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)

# Browser options
br.set_handle_equiv(True)
br.set_handle_gzip(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)

# Follow refresh headers, waiting at most one second
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
br.addheaders = [('User-agent', 'watcherbot')]

# Fetch the page and parse it with lxml
r = br.open('http://w3schools.com/')
html = br.response().read()
root = lxml.html.fromstring(html)

# This selects the div itself, not the links inside it. No solution found yet; stuck here.
hrefs = root.xpath('//*[@id="leftcolumn"]')

Thank you for your time!

You want this:

hrefs = root.xpath('//div[@id="foo"]//a/@href') 

This will give you a list of all href values from a tags inside <div id="foo">, at any level of nesting.
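To see the expression in action without fetching a live page, here is a minimal sketch run against a small inline document. The SAMPLE_HTML string and its example.com URLs are made up for illustration, modelled on the simplified markup in the question. The second query shows an alternative that should behave the same way: select the div first, then search relative to it.

```python
import lxml.html

# Hypothetical sample page, modelled on the question's simplified markup.
SAMPLE_HTML = """
<div id="bar">
  <a href="http://example.com/unwanted">outside foo</a>
  <div id="foo">
    <p>lots of html</p>
    <h1 class="baz"><a href="http://example.com/one">one</a></h1>
    <h1 class="caz"><a href="http://example.com/two">two</a></h1>
  </div>
</div>
"""

root = lxml.html.fromstring(SAMPLE_HTML)

# One-step query: href attributes of every <a> nested anywhere under div#foo.
hrefs = root.xpath('//div[@id="foo"]//a/@href')

# Two-step variant: locate the div first, then query relative to it.
# The leading "." restricts the search to descendants of the selected element.
foo = root.xpath('//*[@id="foo"]')[0]
relative_hrefs = foo.xpath('.//a/@href')

print(hrefs)
print(relative_hrefs)
```

Note that in the two-step variant, dropping the leading dot (writing '//a/@href') would search the whole document again, so the link outside foo would be included.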

