Create the crawler project douban
scrapy startproject douban
Set up items.py, which defines the data types and field names to be saved
# -*- coding: utf-8 -*-
import scrapy


class DoubanItem(scrapy.Item):
    # Title
    title = scrapy.Field()
    # Content
    content = scrapy.Field()
    # Rating
    rating_num = scrapy.Field()
    # Short description
    quote = scrapy.Field()
Set up the spider file doubanmovies.py
# -*- coding: utf-8 -*-
import scrapy
from douban.items import DoubanItem


class DoubanmoviesSpider(scrapy.Spider):
    name = 'doubanmovies'
    allowed_domains = ['movie.douban.com']
    offset = 0
    url = 'https://movie.douban.com/top250?start='
    start_urls = [url + str(offset)]

    def parse(self, response):
        # Each movie on the page sits inside a <div class="info"> block
        info = response.xpath("//div[@class='info']")
        for each in info:
            # Create a fresh item per movie so yielded items do not share state
            item = DoubanItem()
            item['title'] = each.xpath(".//span[@class='title'][1]/text()").extract()
            item['content'] = each.xpath(".//div[@class='bd']/p[1]/text()").extract()
            item['rating_num'] = each.xpath(".//span[@class='rating_num']/text()").extract()
            item['quote'] = each.xpath(".//span[@class='inq']/text()").extract()
            yield item

        # Each page lists 25 movies, so the last page starts at offset 225
        self.offset += 25
        if self.offset < 250:
            yield scrapy.Request(self.url + str(self.offset), callback=self.parse)
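The paging in the spider relies on the Top 250 list showing 25 movies per page, so the `start` parameter runs 0, 25, ..., 225. As a rough sketch under that assumption, the same URLs can be generated up front, and the lists returned by `.extract()` can be collapsed into tidy strings. The helper names `page_urls` and `clean_text` are hypothetical and not part of the original project.

```python
BASE_URL = 'https://movie.douban.com/top250?start='


def page_urls(total=250, page_size=25):
    """Return one listing URL per page: start=0, 25, ..., total - page_size."""
    return [BASE_URL + str(offset) for offset in range(0, total, page_size)]


def clean_text(fragments):
    """Join the text fragments returned by .extract() into one tidy string.

    Runs of whitespace (including non-breaking spaces) collapse to a single
    space, and fragments that are empty after stripping are dropped.
    """
    parts = [' '.join(fragment.split()) for fragment in fragments]
    return ' '.join(part for part in parts if part)
```

If the spider used `start_urls = page_urls()`, the `offset` counter and the follow-up `scrapy.Request` in `parse` would no longer be needed. Wrapping a field in `clean_text(...)` stores a string instead of a list; whether that is preferable depends on how the data will be queried later.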
Set up the pipeline file, which uses a MongoDB database to store the scraped data. This is the key part.
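# -*- coding: utf-8 -*-
import pymongo
# The original used `from scrapy.conf import settings`, which is deprecated
# and may be unavailable in newer Scrapy releases; get_project_settings()
# reads the same values from settings.py.
from scrapy.utils.project import get_project_settings

settings = get_project_settings()


class DoubanPipeline(object):
    def __init__(self):
        self.host = settings['MONGODB_HOST']
        self.port = settings['MONGODB_PORT']

    def process_item(self, item, spider):
        # Create the MongoDB client. Host and port are read from settings.py
        # here, but they could also be written in directly.
        self.client = pymongo.MongoClient(self.host, self.port)
        # Select the database douban (MongoDB creates it on first write)
        self.mydb = self.client['douban']
        # Select the collection doubanmovies inside the database douban
        self.mysheetname = self.mydb['doubanmovies']
        # Convert the dict-like item into a plain Python dict
        content = dict(item)
        # Add the document to the collection (insert_one replaces insert(),
        # which PyMongo 4 removed)
        self.mysheetname.insert_one(content)
        return item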
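The pipeline above opens a new client for every item. A possible alternative, sketched here rather than taken from the original post, opens one connection when the spider starts and closes it when the spider finishes, using Scrapy's `from_crawler`, `open_spider`, and `close_spider` hooks. The class name `MongoPipeline` and the `client_factory` parameter are assumptions introduced for illustration; the database and collection names follow the original.

```python
class MongoPipeline:
    """Store each scraped item as one document in a MongoDB collection."""

    def __init__(self, client_factory, host, port,
                 db_name='douban', collection_name='doubanmovies'):
        # client_factory is a hypothetical hook: any callable(host, port)
        # that returns a MongoClient-like object.
        self.client_factory = client_factory
        self.host = host
        self.port = port
        self.db_name = db_name
        self.collection_name = collection_name
        self.client = None
        self.collection = None

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this hook to build the pipeline; crawler.settings
        # exposes the values defined in settings.py.
        import pymongo  # imported here so the class can be tested without it
        return cls(
            client_factory=pymongo.MongoClient,
            host=crawler.settings.get('MONGODB_HOST', '127.0.0.1'),
            port=crawler.settings.getint('MONGODB_PORT', 27017),
        )

    def open_spider(self, spider):
        # Called once when the spider starts: open a single shared connection.
        self.client = self.client_factory(self.host, self.port)
        self.collection = self.client[self.db_name][self.collection_name]

    def close_spider(self, spider):
        # Called once when the spider finishes: release the connection.
        self.client.close()

    def process_item(self, item, spider):
        # Convert the dict-like item into a plain dict before inserting.
        self.collection.insert_one(dict(item))
        return item
```

To try this variant, the `ITEM_PIPELINES` entry in settings.py would need to point at the new class, for example `'douban.pipelines.MongoPipeline'` if it were placed in pipelines.py.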
Set up settings.py
# -*- coding: utf-8 -*-
BOT_NAME = 'douban'
SPIDER_MODULES = ['douban.spiders']
NEWSPIDER_MODULE = 'douban.spiders'
USER_AGENT = 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0;'
# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
COOKIES_ENABLED = False
# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'douban.pipelines.DoubanPipeline': 300,
}
# MongoDB connection settings
MONGODB_HOST = '127.0.0.1'
MONGODB_PORT = 27017
Test from the terminal (the argument is the spider's `name`, doubanmovies, not the project name)
scrapy crawl doubanmovies
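Once the crawl has finished, a quick sanity check can confirm that the documents reached MongoDB. The sketch below assumes the database `douban` and collection `doubanmovies` used by the pipeline; the function name `summarize` and the expected total of 250 are assumptions for illustration.

```python
def summarize(collection, expected=250):
    """Report how many documents were stored and show one sample title."""
    total = collection.count_documents({})
    sample = collection.find_one() or {}
    return {
        'total': total,
        'complete': total == expected,
        'sample_title': sample.get('title'),
    }


# Example use against the live database (needs pymongo and a running MongoDB):
#   import pymongo
#   client = pymongo.MongoClient('127.0.0.1', 27017)
#   print(summarize(client['douban']['doubanmovies']))
```

Because the pipeline inserts a new document for every item, running the crawl twice stores every movie twice, so a total above 250 usually points to repeated runs rather than a scraping error.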
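Side note: do code snippets on cnblogs (博客园) really need four-space indentation? In my experience, only four spaces fixed the indentation of the code blocks.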




