GitHub - NeroHin/spider_for_ncku_csie: a web-based spider app that scrapes and downloads the attachments of NCKU CSIE website announcements with beautifulsoup4 and selenium, using multithreading to speed things up.

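Motivation

Our lab is about to move to a new official website, so I thought about writing a crawler to help download the materials from the current site, and that is how this repo came about.
- Because there are quite a lot of items, I also looked into some speed-up techniques to make the crawler faster.
cchardet and lxml are packages that can speed up bs4
- Before use, run pip3 install cchardet lxml
webdriver requires chromedriver to be downloaded
- download link: https://chromedriver.chromium.org/downloads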
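
The multithreaded speed-up mentioned above could look roughly like the sketch below. It is not taken from app.py: `fetch_pages`, `fake_fetch`, and the `max_workers` default are hypothetical names and values chosen for illustration, and `fake_fetch` stands in for a real request so the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def fetch_pages(
    page_numbers: Iterable[int],
    fetch_one: Callable[[int], str],
    max_workers: int = 8,  # assumed value, tune for the target site
) -> Dict[int, str]:
    """Fetch several pages concurrently and return {page_number: html}."""
    pages = list(page_numbers)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map runs fetch_one in worker threads and yields results
        # in the same order as the input pages.
        results = pool.map(fetch_one, pages)
        return dict(zip(pages, results))


def fake_fetch(page: int) -> str:
    """Hypothetical stand-in for a real HTTP request to one listing page."""
    return f"<html>page {page}</html>"


pages = fetch_pages(range(1, 4), fake_fetch, max_workers=2)
```

Threads suit this kind of work because most of the time is spent waiting on the network rather than on the CPU. In a real crawler, `fake_fetch` would be replaced by a function that requests the page and returns its HTML.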

usage

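There are currently three input arguments
- start_page: the first page to crawl
- end_page: the last page to crawl
- download: whether to download the files
  - True: download the files
  - False: only collect the links
Examples
- python3 app.py --start_page=1 --end_page=20 # collect the links on pages 1~20 and download the files
- python3 app.py --start_page=1 --end_page=20 --download=False # only collect the links on pages 1~20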
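
As a rough illustration of how the three arguments above could be parsed, here is a minimal sketch using the standard library's argparse. How app.py actually does this is not shown in this README, so `build_parser`, `str_to_bool`, and the default values are assumptions. The explicit converter is needed because argparse's `type=bool` would treat the text "False" as true.

```python
import argparse


def str_to_bool(value: str) -> bool:
    """Convert command-line text such as 'True' or 'False' to a bool."""
    lowered = value.strip().lower()
    if lowered in {"true", "1", "yes", "y"}:
        return True
    if lowered in {"false", "0", "no", "n"}:
        return False
    raise argparse.ArgumentTypeError(f"expected True or False, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the three documented arguments (defaults are assumed)."""
    parser = argparse.ArgumentParser(description="Crawl announcement pages")
    parser.add_argument("--start_page", type=int, default=1,
                        help="first page to crawl")
    parser.add_argument("--end_page", type=int, default=1,
                        help="last page to crawl")
    # Assumed default: download unless --download=False is given,
    # which matches the first example command.
    parser.add_argument("--download", type=str_to_bool, default=True,
                        help="True: download files, False: only collect links")
    return parser


# Equivalent to: python3 app.py --start_page=1 --end_page=20 --download=False
args = build_parser().parse_args(
    ["--start_page=1", "--end_page=20", "--download=False"]
)
```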

TODO

requirements

beautifulsoup4
cchardet
lxml
requests
selenium
tqdm

reference:
