python - Calculating coverage of a Scrapy web spider
I'm writing web spiders to scrape products from websites using the Scrapy framework in Python. I'm wondering what the best practices are to calculate the coverage and missing items of the written spiders.
What I'm doing right now is logging cases where the spider is unable to parse or raises an exception, for example when the specific format I expect for a product price or a place's address doesn't match the regular expressions I've written against the scraped strings, or when XPath selectors for specific data return nothing.
Sometimes, when the products are listed on one page or spread over multiple pages, I use curl and grep to roughly calculate the number of products. I'm wondering whether there are better practices to handle this.
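For reference, a rough Python equivalent of the curl/grep counting, assuming a hypothetical listing URL and product XPath:

import requests
from parsel import Selector

def count_listed_products(url):
    # fetch the listing page and count product containers, to compare
    # against the number of items the spider actually scraped
    html = requests.get(url, timeout=30).text
    return len(Selector(text=html).xpath('//div[@class="product"]'))

print(count_listed_products("https://example.com/products?page=1"))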
The common approach is, yes, to use logging: log the error and exit the callback by returning nothing.
Example (the product price is a required field):
loader = ProductLoader(ProductItem(), response=response)
loader.add_xpath('price', '//span[@class="price"]/text()')
if not loader.get_output_value('price'):
    log.msg("Error fetching product price", level=log.ERROR)
    return
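Note that log.msg comes from older Scrapy versions; recent versions use the standard logging module and give each spider its own logger. Below is a minimal sketch of the same pattern with that logger; ProductItem, the spider name, the start URL and the XPath are placeholder assumptions.

import scrapy
from scrapy.loader import ItemLoader

class ProductItem(scrapy.Item):
    price = scrapy.Field()

class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://example.com/products"]  # placeholder URL

    def parse(self, response):
        loader = ItemLoader(item=ProductItem(), response=response)
        loader.add_xpath("price", '//span[@class="price"]/text()')
        if not loader.get_output_value("price"):
            # required field missing: log it and bail out of the callback
            self.logger.error("Error fetching product price on %s", response.url)
            return
        yield loader.load_item()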
You can also use signals to catch and log any kind of exception that happens while crawling; see the spider_error signal in the Scrapy documentation on signals.
This follows the "easier to ask forgiveness than permission" principle: you let the spider fail, then catch and process the error in a single place, the signal handler.
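For example, a handler connected to the spider_error signal (sent whenever a spider callback raises an exception) can record every failure in one place. A minimal sketch, assuming a hypothetical spider name and that logging the failure is all you want to do:

from scrapy import Spider, signals

class ProductSpider(Spider):
    name = "products"

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        # spider_error fires when a spider callback raises an exception
        crawler.signals.connect(spider.handle_spider_error, signal=signals.spider_error)
        return spider

    def handle_spider_error(self, failure, response, spider):
        # one place to log (or store) every unhandled crawling error
        self.logger.error("Callback failed on %s: %r", response.url, failure.value)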
Other thoughts:
- You can write the response URLs and error tracebacks to a database for later review. This is still "logging", but in a structured manner that can be more convenient to go through afterwards.
- A good idea might be to create custom exceptions that represent different crawling errors, for instance MissingRequiredFieldError and InvalidFieldFormatError, which you can raise when crawled fields haven't passed validation (a sketch follows this list).
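A minimal sketch of such hypothetical exception classes, which the spider_error handler above would then receive as failure.value:

class CrawlingError(Exception):
    """Base class for validation errors on crawled data."""

class MissingRequiredFieldError(CrawlingError):
    """Raised when a required field (e.g. price) could not be extracted."""

class InvalidFieldFormatError(CrawlingError):
    """Raised when an extracted value does not match the expected format."""

# usage inside a callback (illustrative):
# if not loader.get_output_value('price'):
#     raise MissingRequiredFieldError(f"price missing on {response.url}")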