python 2.7 - How to write regex and XPath for the below link? -
Here is the link: https://www.google.com/about/careers/search#!t=jo&jid=34154& . I have to extract the content under "Job details".
Job details:
Team or role: Software Engineering                       // how to write the XPath?
Job type: Full-time                                      // how to write the XPath?
Last updated: Oct 17, 2014                               // how to write the XPath?
Job location(s): Seattle, WA, USA; Kirkland, WA, USA     // how to write a regex?

I want to extract the city, state, and country separately for each job, and I need to filter USA, Canada, and UK jobs separately.
Here is the HTML from which I want to extract the above content:
<div class="detail-content"> <div> <div class="greytext info" style="display: inline-block;">team or role:</div> <div class="info-text" style="display: inline-block;">software engineering</div> // how write xpath 1 </div> <div> <div class="greytext info" style="display: inline-block;">job type:</div> <div class="info-text" style="display: inline-block;" itemprop="employmenttype">full-time</div>// how write xpath job type 1 </div> <div style="display: none;" aria-hidden="true"> <div class="greytext info" style="display: inline-block;">job level:</div> <div class="info-text" style="display: inline-block;"></div> </div> <div style="display: none;" aria-hidden="true"> <div class="greytext info" style="display: inline-block;">salary:</div> <div class="info-text" style="display: inline-block;"></div> </div> <div> <div class="greytext info" style="display: inline-block;">last updated:</div> <div class="info-text" style="display: inline-block;" itemprop="dateposted"> oct 17, 2014</div> // how write xpath posted date 1 </div> <div> <div class="greytext info" style="display: inline-block;">job location(s):</div> <div class="info-text" style="display: inline-block;">seattle, wa, usa; kirkland, wa, usa</div> // how write rejax extract city, state , country seprately </div> </div> </div>
Here is my spider code:
def parse_listing_page(self, response):
    selector = Selector(response)
    item = GoogleSpiderItem()
    item['companyname'] = "google"
    item['jobdetailurl'] = response.url
    item['title'] = selector.xpath("//a[@class='heading detail-title']/span[@itemprop='name title']/text()").extract()
    item['city'] = selector.xpath("//a[@class='source sr-filter']/span[@itemprop='name']/text()").extract().re('(.*)\,.*')
    item['state'] = selector.xpath("//a[@class='source sr-filter']/span[@itemprop='name']/text()").extract().re('.*\,(.*)')
    item['jobtype'] = selector.xpath(".//*[@id='75015001']/div[2]/div[7]/div[2]/div[5]/div[2]/text()").extract()
    description = selector.xpath("string(//div[@itemprop='description'])").extract()
    item['description'] = [d.encode('utf-8') for d in description]
    print "done!"
    yield item
The output is:
Traceback (most recent call last):
  File "/usr/lib64/python2.7/site-packages/twisted/internet/base.py", line 824, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "/usr/lib64/python2.7/site-packages/twisted/internet/task.py", line 638, in _tick
    taskObj._oneWorkUnit()
  File "/usr/lib64/python2.7/site-packages/twisted/internet/task.py", line 484, in _oneWorkUnit
    result = next(self._iterator)
  File "/usr/lib64/python2.7/site-packages/scrapy/utils/defer.py", line 57, in <genexpr>
    work = (callable(elem, *args, **named) for elem in iterable)
--- <exception caught here> ---
  File "/usr/lib64/python2.7/site-packages/scrapy/utils/defer.py", line 96, in iter_errback
    yield next(it)
  File "/usr/lib64/python2.7/site-packages/scrapy/contrib/spidermiddleware/offsite.py", line 26, in process_spider_output
    for x in result:
  File "/usr/lib64/python2.7/site-packages/scrapy/contrib/spidermiddleware/referer.py", line 22, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/usr/lib64/python2.7/site-packages/scrapy/contrib/spidermiddleware/urllength.py", line 33, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/usr/lib64/python2.7/site-packages/scrapy/contrib/spidermiddleware/depth.py", line 50, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/home/sureshp/downloads/wwwgooglecom/wwwgooglecom/spiders/googlepage.py", line 49, in parse_listing_page
    item['city'] = selector.xpath("//a[@class='source sr-filter']/span[@itemprop='name']/text()").extract().re('(.*)\,.*')
exceptions.AttributeError: 'list' object has no attribute 're'
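The AttributeError is raised because .extract() returns a plain Python list, and a list has no .re() method. In Scrapy, .re() is a method of the selector itself and already returns a list of strings, so it replaces .extract() rather than following it. A minimal sketch of the corrected calls (the regexes are assumptions based on the "Seattle, WA, USA" format):

# .re() runs the pattern on each matched text node and returns the captures
item['city']  = selector.xpath("//a[@class='source sr-filter']"
                               "/span[@itemprop='name']/text()").re(r'^(.*?),')
item['state'] = selector.xpath("//a[@class='source sr-filter']"
                               "/span[@itemprop='name']/text()").re(r'^.*?,\s*(.*?),')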
I noticed I had typo errors in the parse code and fixed them. The output is:
{'city': [u'Seattle, WA, USA', u'Kirkland, WA, USA'],
 'companyname': 'google',
 'description': [u"Google's software engineers develop the next-generation technologies that change how millions of users connect, explore, and interact with information and one another. Our ambitions reach far beyond search. Our products need to handle information at the scale of the web. We're looking for ideas from every area of computer science, including information retrieval, artificial intelligence, natural language processing, distributed computing, large-scale system design, networking, security, data compression, and user interface design; the list goes on and is growing every day. As a software engineer, you will work on a small team and can switch teams and projects as our fast-paced business grows and evolves. We need our engineers to be versatile and passionate to tackle new problems as we continue to push technology forward.\nWith your technical expertise you will manage individual projects' priorities, deadlines, and deliverables. You will design, develop, test, deploy, maintain, and enhance software solutions.\n\nSeattle/Kirkland engineering teams are involved in the development of several of Google's most popular products: Cloud Platform, Hangouts/Google+, Maps/Geo, Advertising, Chrome OS/Browser, Android, and Machine Intelligence. Our engineers need to be versatile and willing to tackle new problems as we continue to push technology forward."],
 'jobdetailurl': 'https://www.google.com/about/careers/search?_escaped_fragment_=t%3djo%26jid%3d34154%26',
 'jobtype': [],
 'state': [u'Seattle, WA, USA', u'Kirkland, WA, USA'],
 'title': [u'Software Engineer']}
Here is the modified code:
from scrapy.spider import Spider
from scrapy.selector import Selector
from google.items import GoogleItem
import re


class DmozSpider(Spider):
    name = "google"
    allowed_domains = ["google.com"]
    start_urls = [
        "https://www.google.com/about/careers/search#!t=jo&jid=34154&",
    ]

    def parse(self, response):
        selector = Selector(response)
        item = GoogleItem()
        item['description'] = selector.xpath("string(//div[@itemprop='description'])").extract()
        item['companyname'] = "google"
        item['jobdetailurl'] = response.url
        item['title'] = selector.xpath("//a[@class='heading detail-title']/span[@itemprop='name title']/text()").extract()
        item['city'] = selector.xpath("//a[@class='source sr-filter']/span[@itemprop='name']/text()").extract()
        item['state'] = selector.xpath("//a[@class='source sr-filter']/span[@itemprop='name']/text()").extract()
        item['jobtype'] = selector.xpath(".//*[@id='75015001']/div[2]/div[7]/div[2]/div[5]/div[2]/text()").extract()
        yield item
To get the city, state, and nation separately, you can loop over the selector results:
for p in selector.xpath("//a[@class='source sr-filter']/span[@itemprop='name']/text()").extract():
    # "Seattle, WA, USA" -> "Seattle", "WA", "USA"; strip() removes the spaces after the commas
    city, state, nation = [part.strip() for part in p.split(',')]
    item['city'] = city
    item['state'] = state
    item['nation'] = nation
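The question also asks to filter USA, Canada, and UK jobs separately. Once the nation is split out, a simple membership test does it; a minimal sketch (the whitelist strings are assumptions, match them to whatever the site actually emits):

WANTED_NATIONS = {'USA', 'Canada', 'UK'}  # assumed spellings; adjust to the site's data

for p in selector.xpath("//a[@class='source sr-filter']/span[@itemprop='name']/text()").extract():
    city, state, nation = [part.strip() for part in p.split(',')]
    if nation in WANTED_NATIONS:
        # note: a single item only keeps the last location; yield one item
        # (or append to lists) per matching location if you need all of them
        item['city'] = city
        item['state'] = state
        item['nation'] = nation
        yield item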