python - Calculating coverage of a Scrapy web spider -


I am writing web spiders that scrape products from websites using the Scrapy framework in Python. I am wondering what the best practices are for calculating the coverage and missing items of the written spiders.

What I'm doing right now is logging the cases where I am unable to parse something or where an exception is raised. For example: when I expect a specific format for the price of a product or the address of a place and the regular expressions I have written don't match the scraped strings, or when XPath selectors for specific data return nothing.

Sometimes, when products are listed on one page or spread across multiple ones, I use curl and grep to roughly count the number of products. I am wondering if there are better practices to handle this.

The common approach is, yes, to use logging to log the error and exit the callback by returning nothing.

Example (product price is required):

loader = ProductLoader(ProductItem(), response=response)
loader.add_xpath('price', '//span[@class="price"]/text()')
if not loader.get_output_value('price'):
    log.msg("Error fetching product price", level=log.ERROR)
    return
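The scrapy.log module used above comes from older Scrapy versions; in current releases the same idea is usually written with the spider's built-in self.logger. A minimal sketch, where ProductItem, ProductLoader, the spider name and the URL are assumptions made purely for illustration:

import scrapy
from scrapy.loader import ItemLoader


class ProductItem(scrapy.Item):
    # only the field needed for the example
    price = scrapy.Field()


class ProductLoader(ItemLoader):
    default_item_class = ProductItem


class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://example.com/products"]  # placeholder URL

    def parse(self, response):
        loader = ProductLoader(ProductItem(), response=response)
        loader.add_xpath('price', '//span[@class="price"]/text()')
        if not loader.get_output_value('price'):
            # log the problem and leave the callback without yielding an item
            self.logger.error("Error fetching product price: %s", response.url)
            return
        yield loader.load_item()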

You can also use signals to catch and log any kind of exception that happens while crawling; see the Scrapy signals documentation.

This follows the "easier to ask forgiveness than permission" principle: you let the spider fail, then catch and process the error in a single, particular place - the signal handler.
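As a minimal sketch of that approach, assuming a recent Scrapy version and its built-in spider_error signal (the spider name, URL and handler name below are made up):

import logging

import scrapy
from scrapy import signals

logger = logging.getLogger(__name__)


class SignalAwareSpider(scrapy.Spider):
    name = "signal_aware_products"
    start_urls = ["https://example.com/products"]  # placeholder URL

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        # spider_error fires whenever a spider callback raises an exception
        crawler.signals.connect(spider.handle_spider_error, signal=signals.spider_error)
        return spider

    def handle_spider_error(self, failure, response, spider):
        # the single place where every failed parse is recorded: URL plus traceback
        logger.error("Parsing failed for %s\n%s", response.url, failure.getTraceback())

    def parse(self, response):
        price = response.xpath('//span[@class="price"]/text()').get()
        if price is None:
            # just fail - the signal handler above will log it
            raise ValueError("missing product price on %s" % response.url)
        yield {"price": price}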


Other thoughts:

  • you can place response URLs and error tracebacks into a database for later review - it is still "logging", but in a structured manner that can be more convenient to go through afterwards
  • a good idea might be to create custom exceptions that represent different crawling errors, for instance MissingRequiredFieldError and InvalidFieldFormatError, which you can raise when crawled fields haven't passed validation (see the sketch after this list)
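To illustrate both points together, here is a sketch that raises custom validation exceptions from the callback and stores each failure in a database via the same spider_error signal as above. The exception names come from the list; everything else (SQLite as the store, the table layout, spider name, URL and XPath) is an assumption for illustration:

import sqlite3

import scrapy
from scrapy import signals


class CrawlValidationError(Exception):
    """Base class for errors raised when a crawled page fails validation."""


class MissingRequiredFieldError(CrawlValidationError):
    pass


class InvalidFieldFormatError(CrawlValidationError):
    pass


class ValidatedProductSpider(scrapy.Spider):
    name = "validated_products"
    start_urls = ["https://example.com/products"]  # placeholder URL

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.record_error, signal=signals.spider_error)
        return spider

    def record_error(self, failure, response, spider):
        # structured "logging": keep the URL, the exception and the traceback in SQLite
        conn = sqlite3.connect("crawl_errors.db")
        with conn:
            conn.execute(
                "CREATE TABLE IF NOT EXISTS errors (url TEXT, error TEXT, traceback TEXT)"
            )
            conn.execute(
                "INSERT INTO errors VALUES (?, ?, ?)",
                (response.url, repr(failure.value), failure.getTraceback()),
            )
        conn.close()

    def parse(self, response):
        price = response.xpath('//span[@class="price"]/text()').get()
        if price is None:
            raise MissingRequiredFieldError("price missing on %s" % response.url)
        if not price.strip().startswith("$"):
            raise InvalidFieldFormatError("unexpected price format: %r" % price)
        yield {"price": price}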
