python 2.7 - How to write regex and XPath for the below link? -
Here is the link: https://www.google.com/about/careers/search#!t=jo&jid=34154& . I have to extract the content under "Job details".
Job details:
Team or role: Software Engineering                       // how to write the XPath?
Job type: Full-time                                      // how to write the XPath?
Last updated: Oct 17, 2014                               // how to write the XPath?
Job location(s): Seattle, WA, USA; Kirkland, WA, USA     // how to write a regex?

I want to extract the city, state, and country separately for each job, and I need to filter USA, Canada, and UK jobs separately.
Here is the HTML from which I want to extract the above content:
<div class="detail-content"> <div> <div class="greytext info" style="display: inline-block;">team or role:</div> <div class="info-text" style="display: inline-block;">software engineering</div> // how write xpath 1 </div> <div> <div class="greytext info" style="display: inline-block;">job type:</div> <div class="info-text" style="display: inline-block;" itemprop="employmenttype">full-time</div>// how write xpath job type 1 </div> <div style="display: none;" aria-hidden="true"> <div class="greytext info" style="display: inline-block;">job level:</div> <div class="info-text" style="display: inline-block;"></div> </div> <div style="display: none;" aria-hidden="true"> <div class="greytext info" style="display: inline-block;">salary:</div> <div class="info-text" style="display: inline-block;"></div> </div> <div> <div class="greytext info" style="display: inline-block;">last updated:</div> <div class="info-text" style="display: inline-block;" itemprop="dateposted"> oct 17, 2014</div> // how write xpath posted date 1 </div> <div> <div class="greytext info" style="display: inline-block;">job location(s):</div> <div class="info-text" style="display: inline-block;">seattle, wa, usa; kirkland, wa, usa</div> // how write rejax extract city, state , country seprately </div> </div> </div>
Here is my spider code:
def parse_listing_page(self, response):
    selector = Selector(response)
    item = GoogleSpiderItem()
    item['companyname'] = "google"
    item['jobdetailurl'] = response.url
    item['title'] = selector.xpath("//a[@class='heading detail-title']/span[@itemprop='name title']/text()").extract()
    item['city'] = selector.xpath("//a[@class='source sr-filter']/span[@itemprop='name']/text()").extract().re('(.*)\,.*')
    item['state'] = selector.xpath("//a[@class='source sr-filter']/span[@itemprop='name']/text()").extract().re('.*\,(.*)')
    item['jobtype'] = selector.xpath(".//*[@id='75015001']/div[2]/div[7]/div[2]/div[5]/div[2]/text()").extract()
    description = selector.xpath("string(//div[@itemprop='description'])").extract()
    item['description'] = [d.encode('utf-8') for d in description]
    print "done!"
    yield item
The output is:
Traceback (most recent call last):
  File "/usr/lib64/python2.7/site-packages/twisted/internet/base.py", line 824, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "/usr/lib64/python2.7/site-packages/twisted/internet/task.py", line 638, in _tick
    taskObj._oneWorkUnit()
  File "/usr/lib64/python2.7/site-packages/twisted/internet/task.py", line 484, in _oneWorkUnit
    result = next(self._iterator)
  File "/usr/lib64/python2.7/site-packages/scrapy/utils/defer.py", line 57, in <genexpr>
    work = (callable(elem, *args, **named) for elem in iterable)
--- <exception caught here> ---
  File "/usr/lib64/python2.7/site-packages/scrapy/utils/defer.py", line 96, in iter_errback
    yield next(it)
  File "/usr/lib64/python2.7/site-packages/scrapy/contrib/spidermiddleware/offsite.py", line 26, in process_spider_output
    for x in result:
  File "/usr/lib64/python2.7/site-packages/scrapy/contrib/spidermiddleware/referer.py", line 22, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/usr/lib64/python2.7/site-packages/scrapy/contrib/spidermiddleware/urllength.py", line 33, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/usr/lib64/python2.7/site-packages/scrapy/contrib/spidermiddleware/depth.py", line 50, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/home/sureshp/downloads/wwwgooglecom/wwwgooglecom/spiders/googlepage.py", line 49, in parse_listing_page
    item['city'] = selector.xpath("//a[@class='source sr-filter']/span[@itemprop='name']/text()").extract().re('(.*)\,.*')
exceptions.AttributeError: 'list' object has no attribute 're'
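The AttributeError is raised because .extract() returns a plain Python list, and a list has no .re() method. In Scrapy, .re() is a method of the selector itself and already returns a list of strings, so it replaces .extract() rather than following it. A minimal sketch of the corrected calls (the regexes are assumptions based on the "Seattle, WA, USA" format):

# .re() runs the pattern on each matched text node and returns the captures
item['city']  = selector.xpath("//a[@class='source sr-filter']"
                               "/span[@itemprop='name']/text()").re(r'^(.*?),')
item['state'] = selector.xpath("//a[@class='source sr-filter']"
                               "/span[@itemprop='name']/text()").re(r'^.*?,\s*(.*?),')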
I noticed I had typo errors in the parse code and fixed them. The output is:
{'city': [u'Seattle, WA, USA', u'Kirkland, WA, USA'],
 'companyname': 'google',
 'description': [u"Google's software engineers develop the next-generation technologies that change how millions of users connect, explore, and interact with information and one another. Our ambitions reach far beyond search. Our products need to handle information at the scale of the web. We're looking for ideas from every area of computer science, including information retrieval, artificial intelligence, natural language processing, distributed computing, large-scale system design, networking, security, data compression, and user interface design; the list goes on and is growing every day. As a software engineer, you will work on a small team and can switch teams and projects as our fast-paced business grows and evolves. We need our engineers to be versatile and passionate to tackle new problems as we continue to push technology forward.\nWith your technical expertise you will manage individual projects' priorities, deadlines, and deliverables. You will design, develop, test, deploy, maintain, and enhance software solutions.\n\nSeattle/Kirkland engineering teams are involved in the development of several of Google's most popular products: Cloud Platform, Hangouts/Google+, Maps/Geo, Advertising, Chrome OS/Browser, Android, and Machine Intelligence. Our engineers need to be versatile and willing to tackle new problems as we continue to push technology forward."],
 'jobdetailurl': 'https://www.google.com/about/careers/search?_escaped_fragment_=t%3djo%26jid%3d34154%26',
 'jobtype': [],
 'state': [u'Seattle, WA, USA', u'Kirkland, WA, USA'],
 'title': [u'Software Engineer']}
Here is the modified code:
from scrapy.spider import Spider
from scrapy.selector import Selector
from google.items import GoogleItem
import re


class DmozSpider(Spider):
    name = "google"
    allowed_domains = ["google.com"]
    start_urls = [
        "https://www.google.com/about/careers/search#!t=jo&jid=34154&",
    ]

    def parse(self, response):
        selector = Selector(response)
        item = GoogleItem()
        item['description'] = selector.xpath("string(//div[@itemprop='description'])").extract()
        item['companyname'] = "google"
        item['jobdetailurl'] = response.url
        item['title'] = selector.xpath("//a[@class='heading detail-title']/span[@itemprop='name title']/text()").extract()
        item['city'] = selector.xpath("//a[@class='source sr-filter']/span[@itemprop='name']/text()").extract()
        item['state'] = selector.xpath("//a[@class='source sr-filter']/span[@itemprop='name']/text()").extract()
        item['jobtype'] = selector.xpath(".//*[@id='75015001']/div[2]/div[7]/div[2]/div[5]/div[2]/text()").extract()
        yield item
To get the city, state, and nation separately, you can loop over the selector results:
for p in selector.xpath("//a[@class='source sr-filter']/span[@itemprop='name']/text()").extract():
    # "Seattle, WA, USA" -> "Seattle", "WA", "USA"; strip() removes the spaces after the commas
    city, state, nation = [part.strip() for part in p.split(',')]
    item['city'] = city
    item['state'] = state
    item['nation'] = nation
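The question also asks to filter USA, Canada, and UK jobs separately. Once the nation is split out, a simple membership test does it; a minimal sketch (the whitelist strings are assumptions, match them to whatever the site actually emits):

WANTED_NATIONS = {'USA', 'Canada', 'UK'}  # assumed spellings; adjust to the site's data

for p in selector.xpath("//a[@class='source sr-filter']/span[@itemprop='name']/text()").extract():
    city, state, nation = [part.strip() for part in p.split(',')]
    if nation in WANTED_NATIONS:
        # note: a single item only keeps the last location; yield one item
        # (or append to lists) per matching location if you need all of them
        item['city'] = city
        item['state'] = state
        item['nation'] = nation
        yield item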