
Scraping Ganji.com Housing Listings with Python

Reading this article requires some prior knowledge of XPath and basic Python. It is intended only to help Python beginners understand web crawling; using it for commercial gain is prohibited. Without further ado, a screenshot first:

![利用python爬取赶集网房源信息](http://p3.pstatp.com/large/pgc-image/3dcb0f43e9c9403db0911295fbc8f80f)

As shown above, we want to scrape each listing's title, image, district, address, total price, and unit price. Straight to the code.

This function simulates a browser visiting the URL and pressing Enter: it requests the page and returns its HTML source.

```python
# Fetch the HTML source for a URL
def getsource(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0)'}
    sourceHtml = requests.get(url, headers=headers)
    return sourceHtml.text
```

Next, the HTML source is parsed with XPath: each field is extracted and the results are collected into the list we need.

```python
# Scrape and parse the data
def spiderData(url):
    domtext = getsource(url)
    dom = etree.HTML(domtext)
    div_list = dom.xpath('//div[contains(@class, "js-tips-list")]/div[contains(@class, "f-list-item")]')
    data = []
    for item in div_list:
        res = {}
        res['title'] = item.xpath('.//dd[contains(@class, "title")]/a/text()')[0]
        res['address'] = item.xpath('.//dd[contains(@class, "address")]//a[@class="address-eara"]/text()')[0]
        res['address-eara'] = item.xpath('.//dd[contains(@class, "address")]//span[@class="address-eara"]/text()')[0]
        res['price'] = item.xpath('.//dd[contains(@class, "info")]//span[@class="num"]/text()')[0] + '万'
        res['singlePrice'] = item.xpath('.//dd[contains(@class, "info")]//div[@class="time"]/text()')[0]
        res['images'] = item.xpath('.//div[@class="img-wrap"]//img/@src')[0]
        data.append(res)
    return json.dumps(data, ensure_ascii=False)
```

Run it and inspect part of the JSON result (truncated, because the full output is long):

```json
[{
    "title": "9号线 外地人可买无需社保 70年产权 可落户上",
    "address-eara": "阳光理想城",
    "price": "53万",
    "singlePrice": "9636元/㎡",
    "address": "松江",
    "images": "http://pic7.58cdn.com.cn/anjuke_58/8205d4e70763f8a7bd7217d0e3cd574c?w=480&h=360&crop=1"
}, {
    "title": "崇明品质小区 ,外地人可买 ,精装修带地暖 ,首",
    "address-eara": "明南佳苑",
    "price": "101万",
    "singlePrice": "17127元/㎡",
    "address": "崇明",
    "images": "http://pic6.58cdn.com.cn/anjuke_58/940dfa7b71d5e2a651e4fccb584c6170?w=480&h=360&crop=1"
}, {
    "title": "经典小户型,外地可买,不受限购,送6万家具家电!",
    "address-eara": "临港17区",
    "price": "20万",
    "singlePrice": "15384元/㎡",
    "address": "浦东",
    "images": "http://pic2.58cdn.com.cn/anjuke_58/f7a9cd75d2d0439bf867d984ce06fcee?w=480&h=360&crop=1"
}, {
    "title": "上海临港自贸区,不限贷,中国的第二个香港,精装修",
    "address-eara": "临港17区",
    "price": "105万",
    "singlePrice": "16153元/㎡",
    "address": "浦东",
    "images": "http://pic1.58cdn.com.cn/anjuke_58/6553ebe8ccb5e062e7e727b2dd48e4e4?w=480&h=360&crop=1"
}
···
]
```

And with that we have the crawled data we wanted.

Fellow web-crawling enthusiasts are welcome to point out mistakes; let's learn from each other.

The complete source code follows:

```python
# -*- coding: UTF-8 -*-
from lxml import etree
import json
import requests

# Fetch the HTML source for a URL
def getsource(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0)'}
    sourceHtml = requests.get(url, headers=headers)
    return sourceHtml.text

# Scrape and parse the data
def spiderData(url):
    domtext = getsource(url)
    dom = etree.HTML(domtext)
    div_list = dom.xpath('//div[contains(@class, "js-tips-list")]/div[contains(@class, "f-list-item")]')
    data = []
    for item in div_list:
        res = {}
        res['title'] = item.xpath('.//dd[contains(@class, "title")]/a/text()')[0]
        res['address'] = item.xpath('.//dd[contains(@class, "address")]//a[@class="address-eara"]/text()')[0]
        res['address-eara'] = item.xpath('.//dd[contains(@class, "address")]//span[@class="address-eara"]/text()')[0]
        res['price'] = item.xpath('.//dd[contains(@class, "info")]//span[@class="num"]/text()')[0] + '万'
        res['singlePrice'] = item.xpath('.//dd[contains(@class, "info")]//div[@class="time"]/text()')[0]
        res['images'] = item.xpath('.//div[@class="img-wrap"]//img/@src')[0]
        data.append(res)
    return json.dumps(data, ensure_ascii=False)

result = spiderData('http://sh.ganji.com/ershoufang/')
print(result)
```

Finally, a word from the author: I am a Python developer and have put together a set of up-to-date Python study materials; follow me and send the message "01" if you would like them. I hope this helps.
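The XPath extraction above can be tried out without hitting the live site by running it against a small inline snippet. The HTML below is a hypothetical, simplified stand-in for Ganji's real listing markup (same class names the article's XPath expressions target; the real page has more structure and may change at any time):

```python
import json
from lxml import etree

# Hypothetical, simplified stand-in for Ganji's listing markup,
# using the class names the article's XPath expressions look for.
SAMPLE_HTML = """
<div class="js-tips-list">
  <div class="f-list-item">
    <div class="img-wrap"><img src="http://example.com/a.jpg"></div>
    <dd class="dd-item title"><a>经典小户型</a></dd>
    <dd class="dd-item address">
      <a class="address-eara">浦东</a>
      <span class="address-eara">临港17区</span>
    </dd>
    <dd class="dd-item info">
      <span class="num">20</span>
      <div class="time">15384元/㎡</div>
    </dd>
  </div>
</div>
"""

def parse_listings(html):
    # Same parsing logic as spiderData, but taking an HTML string
    # directly so it can be tested offline.
    dom = etree.HTML(html)
    items = dom.xpath('//div[contains(@class, "js-tips-list")]/div[contains(@class, "f-list-item")]')
    data = []
    for item in items:
        res = {}
        res['title'] = item.xpath('.//dd[contains(@class, "title")]/a/text()')[0]
        res['address'] = item.xpath('.//dd[contains(@class, "address")]//a[@class="address-eara"]/text()')[0]
        res['price'] = item.xpath('.//dd[contains(@class, "info")]//span[@class="num"]/text()')[0] + '万'
        res['images'] = item.xpath('.//div[@class="img-wrap"]//img/@src')[0]
        data.append(res)
    return data

print(json.dumps(parse_listings(SAMPLE_HTML), ensure_ascii=False))
```

Testing the selectors against a fixture like this is also a quick way to diagnose the most common failure mode of this kind of scraper: when the site redesigns its markup, the `[0]` indexing raises `IndexError` because an XPath query returns an empty list.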

Originally published as: 利用python爬取赶集网房源信息

Published by 熱鬧獨處; please credit the source when reposting: http://www.cxybcw.com/18169.html
