Python Web Scraping Notes

Import the required packages

  • requests : sends HTTP requests and downloads the page content (HTML)
  • BeautifulSoup : extracts the data you need from the HTML
import requests
from bs4 import BeautifulSoup

Use requests to fetch the page content, then hand it to BeautifulSoup for parsing

r = requests.get('https://alisa1114.github.io/about/')
soup = BeautifulSoup(r.text, 'html.parser')
print(soup.prettify())  # print the HTML with indentation
<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <link href="https://alisa1114.github.io/feed.xml" rel="alternate" title="Alisa Chen's Website" type="application/atom+xml"/>
  <!-- Begin Jekyll SEO tag v2.7.1 -->
  <title>
   About | Alisa Chen’s Website
  </title>
  <meta content="Jekyll v3.9.0" name="generator">
   <meta content="About" property="og:title">
    <meta content="en_US" property="og:locale"/>
    <link href="https://alisa1114.github.io/about/" rel="canonical"/>
    <meta content="https://alisa1114.github.io/about/" property="og:url"/>
    <meta content="Alisa Chen’s Website" property="og:site_name"/>
    <meta content="summary" name="twitter:card"/>
    <meta content="About" property="twitter:title"/>
    <script type="application/ld+json">
     {"headline":"About","url":"https://alisa1114.github.io/about/","@type":"WebSite","name":"Alisa Chen’s Website","@context":"https://schema.org"}
    </script>
    <!-- End Jekyll SEO tag -->
    <link crossorigin="anonymous" href="https://unpkg.com/purecss@2.0.5/build/pure-min.css" rel="stylesheet"/>
    <link href="https://unpkg.com/purecss@2.0.5/build/grids-responsive-min.css" rel="stylesheet"/>
    <link href="/assets/css/open-color.css" rel="stylesheet"/>
    <link href="/assets/css/hydure.css" rel="stylesheet"/>
    <script async="" src="https://use.fontawesome.com/releases/v5.0.12/js/all.js">
    </script>
    <!-- start custom head snippets -->
    <!-- insert favicons. use https://realfavicongenerator.net/ -->
    <link href="/_includes/apple-touch-icon.png" rel="apple-touch-icon" sizes="180x180"/>
    <link href="/_includes/favicon-32x32.png" rel="icon" sizes="32x32" type="image/png"/>
    <link href="/_includes/favicon-16x16.png" rel="icon" sizes="16x16" type="image/png"/>
    <link href="/_includes/manifest.json" rel="manifest"/>
    <link color="#5bbad5" href="/_includes/safari-pinned-tab.svg" rel="mask-icon"/>
    <link href="/_includes/favicon.ico" rel="shortcut icon"/>
    <meta content="#da532c" name="msapplication-TileColor"/>
    <meta content="/_includes/browserconfig.xml" name="msapplication-config"/>
    <meta content="#ffffff" name="theme-color"/>
    <!-- end custom head snippets -->
   </meta>
  </meta>
 </head>
 <body>
  <div class="pure-g" id="layout">
   <div class="sidebar pure-u-1 pure-u-md-1-4" style="background-image: url(https://cdn.jsdelivr.net/gh/zivong/jekyll-theme-hydure@master/cover.jpg);">
    <header class="header">
     <a class="brand-title" href="/">
      Alisa Chen's Website
     </a>
     <p class="brand-tagline">
      Study in computer science now
     </p>
     <nav class="nav pure-menu">
      <ul class="pure-menu-list">
       <li class="nav-item pure-menu-item">
        <a class="pure-menu-link" href="/">
         Home
        </a>
       </li>
       <li class="nav-item pure-menu-item">
        <a class="pure-menu-link current" href="/about/">
         About
        </a>
       </li>
       <li class="nav-item pure-menu-item">
        <a class="pure-menu-link" href="/archive/">
         Archive
        </a>
       </li>
      </ul>
     </nav>
     <div class="social pure-menu pure-menu-horizontal">
      <ul class="social-list pure-menu-list">
       <li class="social-item pure-menu-item">
        <a class="pure-menu-link pure-button" href="kafuchino0410@gmail.com" target="_blank">
         <i class="fas fa-envelope" title="Email">
         </i>
        </a>
       </li>
       <li class="social-item pure-menu-item">
        <a class="pure-menu-link pure-button" href="https://alisachen1114.medium.com/" target="_blank">
         <i class="fab fa-medium" title="Medium">
         </i>
        </a>
       </li>
       <li class="social-item pure-menu-item">
        <a class="pure-menu-link pure-button" href="https://github.com/Alisa1114" target="_blank">
         <i class="fab fa-github" title="GitHub">
         </i>
        </a>
       </li>
      </ul>
     </div>
    </header>
   </div>
   <div class="content pure-u-1 pure-u-md-3-4">
    <article class="page">
     <h1 class="page-title">
      About
     </h1>
     <p>
      <a href="https://i.imgur.com/y1oVZPe.png">
       <img height="35%" src="https://i.imgur.com/y1oVZPe.png" title="source: imgur.com" width="35%"/>
      </a>
     </p>
     <p>
      <strong>
       Diliogent Newbie in Deep Learning
      </strong>
      <br/>
      <strong>
       Alisa Chen
      </strong>
      <br/>
      Species: Human
      <br/>
      Ability: Mainly capable of program coding 
in 
C/C++, C#, Python, etc.
      <br/>
      A newbie in DL since her independent study 
project of computer sciene.
      <br/>
      Like coding, Touhou Project, and music of 
Japanese ACG.
      <br/>
      My Medium:
      <a href="https://alisachen1114.medium.com">
       Medium
      </a>
     </p>
    </article>
    <footer class="footer pure-g">
     <div class="pure-u-1 pure-u-md-1-2">
      <small>
       ©
       <time datetime="2021-05-30T07:45:09+00:00">
        2021
       </time>
       . All right reserved.
      </small>
     </div>
     <div class="pure-u-1 pure-u-md-1-2">
      <small>
       Powered by
       <a href="https://jekyllrb.com/" target="_blank">
        Jekyll
       </a>
       &amp;
       <a href="https://github.com/zivong/jekyll-theme-hydure" target="_blank">
        Hydure
       </a>
      </small>
     </div>
    </footer>
   </div>
  </div>
 </body>
</html>
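
The fetch above assumes the request succeeds. As a minimal sketch (the helper name fetch_soup, its session parameter, and the 10-second timeout are assumptions, not part of the original notes), the request can be wrapped so that HTTP errors surface before any parsing happens:

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, session=None, timeout=10):
    """Fetch url and return the parsed page (hypothetical helper)."""
    # requests.get and Session.get accept the same arguments here.
    http = requests if session is None else session
    response = http.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
    return BeautifulSoup(response.text, "html.parser")
```

Calling fetch_soup('https://alisa1114.github.io/about/') would then return the same kind of soup object used in the rest of these notes.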

What if a site requires input before you can enter, such as the over-18 confirmation on PTT's Gossiping board?

# Send the form data with a POST request
r = requests.Session()  # a Session keeps the cookies it receives
payload = {  # form fields to send
    "from": "/bbs/Gossiping/index.html",
    "yes": "yes"
}
r1 = r.post("https://www.ptt.cc/ask/over18?from=%2Fbbs%2FGossiping%2Findex.html",
            data=payload)  # submit the form
r2 = r.get("https://www.ptt.cc/bbs/Gossiping/index.html")  # the stored cookie is sent automatically
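
If the confirmation is stored in a cookie, the POST can sometimes be skipped by setting that cookie on the session directly. The sketch below assumes the cookie is named over18 with value 1. That name is an assumption about the site and may change, so it is worth inspecting the cookies returned by the POST before relying on it.

```python
import requests

session = requests.Session()
# Assumption: the site records the confirmation in a cookie named "over18".
session.cookies.set("over18", "1", domain="www.ptt.cc")

# Later session.get(...) calls to www.ptt.cc would send this cookie along.
print(session.cookies.get("over18"))
```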

BeautifulSoup provides several functions for searching the parsed content

1. find() : returns only the first node that matches


result = soup.find("title")
print(result)
<title>About | Alisa Chen’s Website</title>
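
When nothing matches, find() returns None rather than raising, so it helps to check the result before using it. A small sketch on an inline HTML string (the markup and the name demo are made up for illustration):

```python
from bs4 import BeautifulSoup

html = "<html><head><title>About</title></head><body><p>Hello</p></body></html>"
demo = BeautifulSoup(html, "html.parser")

title = demo.find("title")
missing = demo.find("table")  # there is no <table> in this markup

print(title.get_text())  # About
print(missing)           # None
```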

2. find_all("tag name", "CSS class", limit=number) : returns all nodes that match, up to limit

results = soup.find_all("li", class_="nav-item pure-menu-item", limit=2)  # class_ filters on the CSS class attribute
print(results)
[<li class="nav-item pure-menu-item">
<a class="pure-menu-link" href="/">
            Home
          </a>
</li>, <li class="nav-item pure-menu-item">
<a class="pure-menu-link current" href="/about/">
            About
          </a>
</li>]
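
One detail worth knowing about class_: a single class name matches any node that carries that class, while a string containing several classes is compared against the exact attribute value, so the order matters. The sketch below shows this on a trimmed-down copy of the navigation markup (the inline HTML and the name demo are for illustration only):

```python
from bs4 import BeautifulSoup

html = """
<ul>
  <li class="nav-item pure-menu-item"><a class="pure-menu-link" href="/">Home</a></li>
  <li class="nav-item pure-menu-item"><a class="pure-menu-link current" href="/about/">About</a></li>
</ul>
"""
demo = BeautifulSoup(html, "html.parser")

# A single class name matches every node that has it among its classes.
links = demo.find_all("a", class_="pure-menu-link")
hrefs = [a["href"] for a in links]

# A multi-class string must equal the attribute value exactly.
exact = demo.find_all("a", class_="pure-menu-link current")
reordered = demo.find_all("a", class_="current pure-menu-link")

print(hrefs)           # ['/', '/about/']
print(len(exact))      # 1
print(len(reordered))  # 0
```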

To search for several different tags at once, pass them as a list

results = soup.find_all(["h1", "p"], limit=5)
print(results)
[<p class="brand-tagline">Study in computer science now</p>, <h1 class="page-title">About</h1>, <p><a href="https://i.imgur.com/y1oVZPe.png"><img height="35%" src="https://i.imgur.com/y1oVZPe.png" title="source: imgur.com" width="35%"/></a></p>, <p><strong>Diliogent Newbie in Deep Learning</strong><br/>
<strong>Alisa Chen</strong><br/>
Species: Human<br/>
Ability: Mainly capable of program coding 
in 
C/C++, C#, Python, etc.<br/>
A newbie in DL since her independent study 
project of computer sciene.<br/>
Like coding, Touhou Project, and music of 
Japanese ACG.<br/>
My Medium: <a href="https://alisachen1114.medium.com">Medium</a></p>]

3. To search the descendants of a node, use select(), which takes a CSS selector and returns every matching descendant

select_one() returns only the first match

result = soup.find("li",class_="nav-item pure-menu-item")
print(result.select("a"))
[<a class="pure-menu-link" href="/">
            Home
          </a>]
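
Because select() accepts CSS selectors, the find-then-select step can often be written as a single selector. A sketch on a trimmed-down copy of the navigation markup (the inline HTML, the name demo, and the class external are for illustration only):

```python
from bs4 import BeautifulSoup

html = """
<ul class="pure-menu-list">
  <li class="nav-item pure-menu-item"><a class="pure-menu-link" href="/">Home</a></li>
  <li class="nav-item pure-menu-item"><a class="pure-menu-link current" href="/about/">About</a></li>
</ul>
"""
demo = BeautifulSoup(html, "html.parser")

links = demo.select("li.nav-item > a.pure-menu-link")  # all matching links
current = demo.select_one("a.current")                 # first match only
missing = demo.select_one("a.external")                # no match

print([a["href"] for a in links])  # ['/', '/about/']
print(current.get_text())          # About
print(missing)                     # None
```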

4. To read an attribute value, use get("attribute")

results = soup.find_all("a","pure-menu-link pure-button")
print(results)
[<a class="pure-menu-link pure-button" href="kafuchino0410@gmail.com" target="_blank">
<i class="fas fa-envelope" title="Email"></i>
</a>, <a class="pure-menu-link pure-button" href="https://alisachen1114.medium.com/" target="_blank">
<i class="fab fa-medium" title="Medium"></i>
</a>, <a class="pure-menu-link pure-button" href="https://github.com/Alisa1114" target="_blank">
<i class="fab fa-github" title="GitHub"></i>
</a>]
for result in results:
    print("href : "+result.get("href")+"\ntarget : "+result.get("target"))
href : kafuchino0410@gmail.com
target : _blank
href : https://alisachen1114.medium.com/
target : _blank
href : https://github.com/Alisa1114
target : _blank
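
get() returns None when the attribute is missing, and concatenating None to a string raises a TypeError. One way around this, sketched on made-up markup (the inline HTML, the name demo, and the "(none)" placeholder are assumptions for illustration), is to pass a default value:

```python
from bs4 import BeautifulSoup

html = (
    '<a class="pure-menu-link pure-button" href="https://github.com/Alisa1114">GitHub</a>'
    '<a class="pure-menu-link">No link</a>'
)
demo = BeautifulSoup(html, "html.parser")

rows = []
for a in demo.find_all("a"):
    href = a.get("href", "(none)")      # default instead of None
    target = a.get("target", "(none)")
    rows.append(f"href : {href}\ntarget : {target}")

print("\n".join(rows))

# Multi-valued attributes such as class come back as a list.
print(demo.find("a").get("class"))
```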

5. To get the text inside a node (link text, for example), use getText(), an alias of get_text()

results = soup.find_all("p")
print(results)
[<p class="brand-tagline">Study in computer science now</p>, <p><a href="https://i.imgur.com/y1oVZPe.png"><img height="35%" src="https://i.imgur.com/y1oVZPe.png" title="source: imgur.com" width="35%"/></a></p>, <p><strong>Diliogent Newbie in Deep Learning</strong><br/>
<strong>Alisa Chen</strong><br/>
Species: Human<br/>
Ability: Mainly capable of program coding 
in 
C/C++, C#, Python, etc.<br/>
A newbie in DL since her independent study 
project of computer sciene.<br/>
Like coding, Touhou Project, and music of 
Japanese ACG.<br/>
My Medium: <a href="https://alisachen1114.medium.com">Medium</a></p>]
for result in results:
    print(result.getText())
Study in computer science now

Diliogent Newbie in Deep Learning
Alisa Chen
Species: Human
Ability: Mainly capable of program coding 
in 
C/C++, C#, Python, etc.
A newbie in DL since her independent study 
project of computer sciene.
Like coding, Touhou Project, and music of 
Japanese ACG.
My Medium: Medium
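
The output above keeps the line breaks and blank lines from the HTML source. If cleaner text is wanted, get_text() accepts a separator and strip=True, and stripped_strings yields the pieces one at a time. A sketch on a shortened copy of the paragraph (the inline HTML is for illustration only):

```python
from bs4 import BeautifulSoup

html = (
    "<p><strong>Alisa Chen</strong><br/>\n"
    "Species: Human<br/>\n"
    'My Medium: <a href="https://alisachen1114.medium.com">Medium</a></p>'
)
p = BeautifulSoup(html, "html.parser").find("p")

text = p.get_text(" ", strip=True)  # join the stripped pieces with one space
pieces = list(p.stripped_strings)   # the same pieces as a list

print(text)    # Alisa Chen Species: Human My Medium: Medium
print(pieces)  # ['Alisa Chen', 'Species: Human', 'My Medium:', 'Medium']
```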

6. Download an image from the site

result = soup.find("img")  # the image URL is stored on the img node
print(result)
<img height="35%" src="https://i.imgur.com/y1oVZPe.png" title="source: imgur.com" width="35%"/>
img_url = result.get("src")  # extract the URL
print(img_url)

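img = requests.get(img_url)  # download the image
with open('images.png', 'wb') as file:  # the source is a PNG, so save it with a .png extension
    file.write(img.content)  # write the raw bytes; the with block closes the file on exit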
https://i.imgur.com/y1oVZPe.png
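
The image URL in this example ends in .png, so one way to keep the saved file's extension consistent with the source is to derive the file name from the URL. A minimal sketch using only the standard library (the helper name filename_from_url and the fallback name image.bin are assumptions):

```python
import posixpath
from urllib.parse import urlparse


def filename_from_url(url, default="image.bin"):
    """Return the last path segment of url, or default if it is empty (hypothetical helper)."""
    name = posixpath.basename(urlparse(url).path)
    return name or default


print(filename_from_url("https://i.imgur.com/y1oVZPe.png"))  # y1oVZPe.png
print(filename_from_url("https://example.com/"))              # image.bin
```

The result could then be passed to open() in place of a hard-coded file name.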