A Summary of Common Python Crawler Code for Quick Reference - 创新互联

Parsing pages with BeautifulSoup

from bs4 import BeautifulSoup
soup = BeautifulSoup(htmltxt, "lxml")
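# Three parsers; each one repairs the same malformed markup differently
# (sample markup assumed: one unclosed <a> and one stray </p>)
soup = BeautifulSoup("<a></p>", "html.parser")
### A tag with only a start tag is closed automatically; a tag with only an end tag is ignored
### Result: <a></a>
soup = BeautifulSoup("<a></p>", "lxml")
### Result: <html><body><a></a></body></html>
soup = BeautifulSoup("<a></p>", "html5lib")
### html5lib completes every tag it encounters, including the stray end tag
### Result: <html><head></head><body><a><p></p></a></body></html>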
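To check the parser differences on your own machine, the comparison can be sketched as a small helper. This is a minimal sketch, not part of the original listing: it assumes the sample markup `<a></p>` (one unclosed start tag and one stray end tag), `compare_parsers` is a hypothetical name, and parsers that are not installed are skipped rather than treated as errors.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def compare_parsers(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Parse the same markup with each parser and return {parser: serialized tree}.

    Hypothetical helper for illustration; a value of None means the parser
    is not installed in the current environment.
    """
    results = {}
    for name in parsers:
        try:
            results[name] = str(BeautifulSoup(markup, name))
        except FeatureNotFound:
            # lxml and html5lib are optional third-party packages.
            results[name] = None
    return results


if __name__ == "__main__":
    for name, tree in compare_parsers("<a></p>").items():
        shown = tree if tree is not None else "not installed"
        print(f"{name:12} -> {shown}")
```

Only `html.parser` ships with Python, so it is the one entry that should always produce a result; `lxml` and `html5lib` appear only if they have been installed separately.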

# Find tags by tag name, id, class, attributes, and so on
import re

### Query by tag type, class, id, and the value of the alog-action attribute
soup.find("a", class_="title", id="t1", attrs={"alog-action": "qb-ask-uname"})

### Read the value of one attribute on a tag
pubtime = soup.find("meta", attrs={"itemprop": "datePublished"}).attrs["content"]

### Get every tag whose class is title
for i in soup.find_all(class_="title"):
    print(i.get_text())

### Get a limited number of tags whose class is title
for i in soup.find_all(class_="title", limit=2):
    print(i.get_text())

### get_text() accepts a separator to place between the text of different tags,
### and can strip leading and trailing whitespace
# Illustrative markup (assumed): a <p class="title"> holding two <b> elements
soup = BeautifulSoup('<p class="title" id="p1"><b>The Dormouses story</b><b>The Dormouses story</b></p>', "html5lib")
soup.find(class_="title").get_text("|", strip=True)
# Result: The Dormouses story|The Dormouses story

### Get the id of the p tag whose class is title
soup.find(class_="title").get("id")

### Match class names with a regular expression
soup.find_all(class_=re.compile("tit"))

### recursive parameter: with recursive=False, only the direct children of the current tag are searched
# Illustrative markup (assumed): <title> nested under <head>, so the result is an empty list
soup = BeautifulSoup('<html><head><title>abc</title></head></html>', 'lxml')
soup.html.find_all("title", recursive=False)

Page URL: http://wjwzjz.com/article/hoihh.html
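The lookup calls in this listing can be tried end to end on a small inline page. The following is a minimal sketch under stated assumptions: the `SAMPLE` markup, its ids, and the date value are hypothetical, and `html.parser` is used only so that no extra parser needs to be installed. It also shows that `find()` returns `None` when nothing matches, which is worth guarding against before reading `.attrs`.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample page; tag names, ids and attribute values are illustrative.
SAMPLE = """
<html><head><title>demo</title>
<meta itemprop="datePublished" content="2020-01-01"></head>
<body>
<p class="title" id="p1"><b>first</b><b>second</b></p>
<p class="title" id="p2">third</p>
<p class="subtitle" id="p3">fourth</p>
</body></html>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")

# Attribute lookup: read one attribute from the matched tag.
pubtime = soup.find("meta", attrs={"itemprop": "datePublished"}).attrs["content"]

# Text of the first class="title" tag, joined with "|" and stripped.
first_text = soup.find(class_="title").get_text("|", strip=True)

# class_="title" matches the exact class name, so "subtitle" is not included.
title_ids = [p.get("id") for p in soup.find_all(class_="title")]

# limit stops the search after the given number of matches.
limited_ids = [p.get("id") for p in soup.find_all(class_="title", limit=1)]

# A compiled regex is searched within each class name, so "subtitle" matches too.
regex_ids = [p.get("id") for p in soup.find_all(class_=re.compile("tit"))]

# <title> sits under <head>, not directly under <html>, so nothing is found.
direct_titles = soup.html.find_all("title", recursive=False)

# find() returns None when there is no match; check before using the result.
missing = soup.find("a", class_="title")
link_text = missing.get_text() if missing is not None else ""

if __name__ == "__main__":
    print(pubtime, first_text, title_ids, limited_ids, regex_ids, direct_titles, repr(link_text))
```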