A Scala application that crawls and scrapes the web pages given as input, following rules passed to it through a DSL.

roostico/scooby

PPS-22-Scooby 🔍

Team:

👨‍💻 Giovanni Antonioni - giovanni.antonioni2@studio.unibo.it

👨‍💻 Valerio Di Zio - valerio.dizio@studio.unibo.it

👨‍💻 Francesco Magnani - francesco.magnani14@studio.unibo.it

👨‍💻 Luca Rubboli - luca.rubboli2@studio.unibo.it

Technologies:

🔄 Scrum

🛠 SBT

🔗 Git

🎯 YouTrack

🚀 GitHub Actions

Overview:

PPS-22-Scooby is a web scraping and crawling application. It enables users to extract data from web pages by crawling through links and scraping specific content according to predefined rules.

Features:

🕷 Crawling: The application navigates web pages, follows links, and retrieves content.

🔍 Scraping: Relevant data is extracted from HTML/XML pages using XPath, CSS selectors, or regular expressions.

🛠 Customization: Users can define custom scraping and crawling rules to suit their specific needs.

⚙️ Parallel Processing: Aspects of parallel programming are integrated for efficient execution.

📤 Export: Users can export extracted data in various formats according to their preferences.
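
To make the crawling and scraping features more concrete, here is a minimal sketch in plain Scala. It is not Scooby's DSL or API: the names `ScrapeSketch`, `Rule`, `regexRule`, `links`, `headings`, and `sample` are hypothetical, and the page is an in-memory string, so nothing is fetched from the network. It only illustrates the idea of a rule as a function from a page's HTML to the extracted values, here built from regular expressions.

```scala
import scala.util.matching.Regex

object ScrapeSketch {
  // Hypothetical rule type: takes a page's HTML and returns the extracted strings.
  type Rule = String => List[String]

  // Builds a rule from a regular expression that has exactly one capture group.
  def regexRule(pattern: Regex): Rule =
    html => pattern.findAllMatchIn(html).map(_.group(1)).toList

  // Crawling-style rule: the href targets of anchor tags, i.e. the links to follow next.
  val links: Rule = regexRule("""<a\s[^>]*href="([^"]*)"[^>]*>""".r)

  // Scraping-style rule: the text of first-level headings.
  val headings: Rule = regexRule("""<h1>([^<]*)</h1>""".r)

  // Hypothetical sample page used instead of a real HTTP response.
  val sample: String =
    """<html><body><h1>Title</h1><a href="/a">A</a> <a class="x" href="/b">B</a></body></html>"""

  def main(args: Array[String]): Unit = {
    println(links(sample))    // prints List(/a, /b)
    println(headings(sample)) // prints List(Title)
  }
}
```

Regular expressions are enough for this toy page, but real HTML is better handled by a proper parser with CSS selector or XPath support, which is the approach the feature list describes.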

Implementation:

PPS-22-Scooby is built in Scala and uses actor libraries to manage concurrency. The project uses Git for version control, YouTrack for project management, and GitHub Actions for continuous integration.
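
The idea of processing several pages concurrently can be sketched with standard-library `Future`s. This is a deliberately swapped-in technique, not the actor-based design that Scooby uses, and the names `ParallelSketch`, `scrape`, and `scrapeAll` are hypothetical. `scrape` is a stand-in that does no network access.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ParallelSketch {
  // Hypothetical stand-in for "fetch and scrape one page": returns the URL's length.
  def scrape(url: String): Int = url.length

  // Starts one Future per URL, then waits for all of them.
  // Future.sequence keeps the results in the same order as the input list.
  def scrapeAll(urls: List[String]): List[Int] = {
    val jobs: List[Future[Int]] = urls.map(u => Future(scrape(u)))
    Await.result(Future.sequence(jobs), 5.seconds)
  }

  def main(args: Array[String]): Unit =
    println(scrapeAll(List("a.org", "bb.org"))) // prints List(5, 6)
}
```

An actor-based design would instead model each crawler or scraper as an actor exchanging messages, which suits a crawl whose set of pages grows as new links are discovered.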

Get Started:

To use PPS-22-Scooby, see the Get Started section at https://pps-22-scooby.github.io/

Languages

  • Scala 92.9%
  • Gherkin 5.7%
  • HTML 1.4%