A web spider app that scrapes the NCKU CSIE website's announcements and downloads their attachments with beautifulsoup4 and selenium, using multithreading to speed things up.

NeroHin/spider_for_ncku_csie


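Motivation

  • My lab happened to be moving to a new official website, so I wondered whether to write a crawler to help download the materials from the old site, and that is how this repo came about.

    • Because there is a lot to download, I also looked into some speed-up techniques to make the crawler faster.
  • cchardet and lxml are both packages that can speed up bs4

    • Install them before use: pip3 install cchardet lxml
  • webdriver requires downloading chromedriver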


usage

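  • Three variables can currently be passed in
    • start_page: the first page to crawl
    • end_page: the last page to crawl
    • download: whether to download the files
      • True: download the files
      • False: only collect the links
  • Examples
    • python3 app.py --start_page=1 --end_page=20 # collect the links on pages 1~20 and download the files
    • python3 app.py --start_page=1 --end_page=20 --download=False # only collect the links on pages 1~20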
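
The flags above could be parsed along the following lines. This is only a sketch: how app.py actually reads its arguments is not shown in this README, so `build_parser` and `str_to_bool` are hypothetical names, and making `start_page` and `end_page` required is an assumption. The explicit converter is there because `argparse` with `type=bool` would turn the string "False" into `True` (any non-empty string is truthy), which would defeat `--download=False`.

```python
import argparse


def str_to_bool(value: str) -> bool:
    """Convert a command-line string such as 'False' into a real bool."""
    lowered = value.strip().lower()
    if lowered in {"true", "1", "yes", "y"}:
        return True
    if lowered in {"false", "0", "no", "n"}:
        return False
    raise argparse.ArgumentTypeError(f"expected True or False, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the three flags listed in the usage section (hypothetical)."""
    parser = argparse.ArgumentParser(description="Crawl announcement pages")
    # Assumption: both page bounds are required; the README does not state defaults.
    parser.add_argument("--start_page", type=int, required=True, help="first page to crawl")
    parser.add_argument("--end_page", type=int, required=True, help="last page to crawl")
    # The first README example downloads without passing --download, so default to True.
    parser.add_argument("--download", type=str_to_bool, default=True,
                        help="True: download the files, False: only collect the links")
    return parser


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(["--start_page=1", "--end_page=20", "--download=False"])
print(args.start_page, args.end_page, args.download)  # prints: 1 20 False
```

In a real script the list passed to `parse_args` would be dropped so that the values come from `sys.argv`.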

TODO

  • Use threading to collect the links of the files to download
  • Use threading + selenium to download the files
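
One possible shape for the first TODO item is sketched below, using the standard library's `concurrent.futures.ThreadPoolExecutor` rather than raw `threading.Thread` objects. It is not this repo's implementation: `collect_links` and `fake_fetch` are hypothetical names, and `fake_fetch` returns made-up strings in place of the real request and bs4 parsing step.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def collect_links(
    pages: List[int],
    fetch_page_links: Callable[[int], List[str]],
    max_workers: int = 8,
) -> Dict[int, List[str]]:
    """Call fetch_page_links for every page in a thread pool; return results keyed by page."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, so zip pairs each page with its own result.
        results = pool.map(fetch_page_links, pages)
        return dict(zip(pages, results))


def fake_fetch(page: int) -> List[str]:
    """Placeholder for the real network call; returns made-up link names."""
    return [f"page-{page}-attachment-{i}" for i in range(2)]


links = collect_links(list(range(1, 4)), fake_fetch)
print(links[1])  # prints: ['page-1-attachment-0', 'page-1-attachment-1']
```

Threads suit this job because each page fetch spends most of its time waiting on the network. Any exception raised inside `fetch_page_links` is re-raised when its result is read from `pool.map`, so a failing page is not silently dropped.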

requirements

beautifulsoup4
cchardet
lxml
requests
selenium
tqdm

reference:

  1. Python BeautifulSoup garbled Chinese text problem
  2. How to download a file using Selenium and Python
  3. [Day23] Beautiful Soup web page parsing!
  4. Rename downloaded files selenium
  5. https://thehftguy.com/2020/07/28/making-beautifulsoup-parsing-10-times-faster/
  6. Improve bs4 performance
