scrapy crawl bestcbooks.com

Goals and design

Goals

I found a website, http://bestcbooks.com/, that offers plenty of PDF books about computer science.
The site looks like this:
bestcbooks.com
I want to get every book's download link and password (the books are stored on pan.baidu.com).

Design

  1. Since I want all the books, and the books are organized by category, I first need to crawl all the category links;
  2. get each book's page link from the category pages;
  3. get the book's storage link (on pan.baidu.com) and its password.
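
The three-stage design above can be sketched, independent of Scrapy, as two nested loops. The PAGES dict and its URLs below are hypothetical stand-ins for real HTTP fetches and parsing; the actual spider later implements each stage as a Scrapy callback.

```python
# Fake site: the front page lists categories, each category lists book pages,
# and each book page carries a pan.baidu.com link plus a password.
PAGES = {
    "/": ["/cat/python", "/cat/linux"],
    "/cat/python": ["/book/py1"],
    "/cat/linux": ["/book/lx1"],
    "/book/py1": {"name": "Python Book", "link": "http://pan.baidu.com/s/a", "password": "ab12"},
    "/book/lx1": {"name": "Linux Book", "link": "http://pan.baidu.com/s/b", "password": "cd34"},
}

def crawl(start="/"):
    """Stage 1: category links; stage 2: book links; stage 3: book info."""
    for category_url in PAGES[start]:        # crawl all category links
        for book_url in PAGES[category_url]: # get book links from a category
            yield PAGES[book_url]            # get link + password from a book page

books = list(crawl())
```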

How to do

Create a scrapy project

  1. Install scrapy:

    $ sudo -H pip install scrapy
  2. Create the scrapy project:

    $ scrapy startproject tutorial
    New Scrapy project 'tutorial', using template directory '/usr/local/lib/python2.7/dist-packages/scrapy/templates/project', created in:
        /tmp/tutorial

    You can start your first spider with:
        cd tutorial
        scrapy genspider example example.com

Crawl the category page

  1. Create a spider named bestcbooks.py; its full path is tutorial/spiders/bestcbooks.py.
    The project structure looks like this:

    $ cd /tmp/tutorial ; tree
    .
    ├── scrapy.cfg
    └── tutorial
        ├── __init__.py
        ├── items.py
        ├── pipelines.py
        ├── settings.py
        └── spiders
            ├── bestcbooks.py
            └── __init__.py

    2 directories, 7 files
  2. Edit bestcbooks.py as below:

    #!/usr/bin/env python
    # encoding: utf-8

    from urlparse import urljoin

    from scrapy import Spider, Request

    from tutorial.items import CatagoryItems, BookPageItems, BookItems


    class BestCBooksSpider(Spider):
        name = 'bestcbooks'
        allowed_domains = ['bestcbooks.com']
        start_urls = [
            'http://bestcbooks.com/'
        ]
        meta = {'cookiejar': 1}
        headers = {
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
            'Accept-Encoding': 'gzip,deflate',
            'Accept-Language': 'en-US,en;q=0.8,zh;q=0.6,zh-CN;q=0.4,zh-TW;q=0.2',
            'Connection': 'keep-alive',
            'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
            'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.90 Safari/537.36'
        }
  3. Edit tutorial/items.py as below; it defines the containers for the data we scrape:

    # -*- coding: utf-8 -*-

    # Define here the models for your scraped items
    #
    # See documentation in:
    # http://doc.scrapy.org/en/latest/topics/items.html

    from scrapy import Item, Field


    class BestCBookItems(Item):
        # define the fields for your item here like:
        name = Field()
        url = Field()


    class CatagoryItems(Item):
        catagory = Field()
        url = Field()


    class BookPageItems(Item):
        name = Field()
        url = Field()


    class BookItems(Item):
        name = Field()
        link = Field()
        password = Field()
        orig_url = Field()

Crawl the category pages

  1. Analysis
    Use Google Chrome to inspect the category links and derive their XPath:
    Inspect category
    The XPath is: '//ul[@id="category-list"]/li'

  2. Implement the code
    Then add a parse method (Scrapy's default callback) to bestcbooks.py:

    def parse(self, response):
        """Parse the front page: save it locally and follow each category link."""
        filename = response.url.split('/')[-2]
        with open(filename, 'wb') as f:
            f.write(response.body)
        for sel in response.xpath('//ul[@id="category-list"]/li'):
            name = sel.xpath('a/text()').extract()[0]
            url = urljoin(response.url, sel.xpath('a/@href').extract()[0])
            item = CatagoryItems()
            item['url'] = url
            item['catagory'] = name
            yield item
            yield Request(url, callback=self.parse_catagory_page)

The parse_catagory_page method, which parses each category page, is defined below.
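
The category XPath can be sanity-checked outside Scrapy with lxml. The HTML fragment here is a hypothetical reconstruction of the site's markup, not the real page:

```python
from lxml import html
from urllib.parse import urljoin  # in the Python 2 tutorial code: from urlparse import urljoin

# Hypothetical markup matching the structure Chrome's inspector shows.
fragment = """
<html><body>
<ul id="category-list">
  <li><a href="/category/python">Python</a></li>
  <li><a href="/category/linux">Linux</a></li>
</ul>
</body></html>
"""
doc = html.fromstring(fragment)

# Same selector and extraction logic as the spider's parse method.
categories = [
    (li.xpath('a/text()')[0],
     urljoin('http://bestcbooks.com/', li.xpath('a/@href')[0]))
    for li in doc.xpath('//ul[@id="category-list"]/li')
]
```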

Crawl the book pages

  1. Analysis
    Inspect Book page
    The book links' XPath is: '//div[@class="categorywell"]/h4'

  2. Implement the code
    Let’s implement the parse_catagory_page method:

    def parse_catagory_page(self, response):
        """Parse a category page: record each book and follow its link."""
        for sel in response.xpath('//div[@class="categorywell"]/h4'):
            name = sel.xpath('a/text()').extract()[0]
            url = urljoin(response.url, sel.xpath('a/@href').extract()[0])
            item = BookPageItems()
            item['url'] = url
            item['name'] = name
            yield item
            yield Request(url, callback=self.parse_book_page)

The parse_book_page method is implemented below.

Crawl the book info

  1. Analysis
    Inspect Book Info
    The book title's XPath is: '//h1[@class="entry-title"]/text()'

  2. Implement the code
    Let’s implement the method parse_book_page:

def parse_book_page(self, response):
    """Parse a book's detail page: extract its name, download link and password."""
    orig_url = response.url
    name = response.xpath('//h1[@class="entry-title"]/text()').extract()
    for sel in response.xpath('//blockquote'):
        link = sel.xpath('p/a/@href').extract()
        try:
            # the password is the last 4 characters of the last token
            password = sel.xpath('p/text()').extract()[-1].split()[-1][-4:]
        except IndexError:
            password = ""
        item = BookItems()
        item['name'] = name
        item['link'] = link
        item['password'] = password
        item['orig_url'] = orig_url
        yield item
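
The password expression assumes the Baidu-pan password is the last 4 characters of the last whitespace-separated token of the text following the link. A quick check of that assumption, on hypothetical sample strings (real pages may of course vary):

```python
# Sample strings are made up to illustrate the cases the rule covers.
samples = [
    u"password: ab12",          # normal case: password token after a space
    u"password:ab12",           # no space: still the last 4 chars of the token
    u"\u5bc6\u7801\uff1aab12",  # Chinese label with a full-width colon
]
passwords = [s.split()[-1][-4:] for s in samples]
```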

How to run

Start scrapy

I use this command to start the crawl; the results are stored in items.json:

scrapy crawl bestcbooks -o items.json

Check the result

The content of items.json lists each book's download link and password.
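
A short script can verify the output. The sample record written here is hypothetical; a real items.json is produced by the crawl itself. Note that name and link come out as lists, because the spider stores raw .extract() results:

```python
import json

# Hypothetical sample of one record, shaped like the spider's BookItems.
sample = [{
    "name": ["Some Book"],
    "link": ["http://pan.baidu.com/s/xxxx"],
    "password": "ab12",
    "orig_url": "http://bestcbooks.com/some-book",
}]
with open("items.json", "w") as f:
    json.dump(sample, f)

# Load the crawl output and flatten each record to (name, link, password).
with open("items.json") as f:
    books = json.load(f)
rows = [(b["name"][0], b["link"][0], b["password"]) for b in books]
```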