A RESTful API that fetches data from multiple web addresses (URLs) by parsing specific elements of their HTML structure. Requests run concurrently to speed up the process.

ceguez/web-scraper

Web Scraper: Java, Spring Framework & Jsoup

A RESTful API that fetches data from multiple web addresses (URLs) by parsing specific elements of their HTML structure. Requests run concurrently to speed up the process. The API is documented with Swagger.

Features

Completed

It scrapes data about movies from Wikipedia. Currently, it targets the pages linked from this Wikipedia page: List Of American Films

Note: This tool follows proper scraping etiquette and does not violate Wikipedia's Terms of Use. Wikipedia's license allows free and legal use of the site's public data.

  • It fetches the following data for each movie by year:

    • Title
    • Director
    • Genre
  • It returns the data as a set of strings, serialized as JSON.
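
As a rough sketch of the response shape described above, the snippet below flattens each movie into one string and collects the strings into a set. The `Movie` class, the `title | director | genre` layout, and the sample values are all hypothetical placeholders; the project's actual model and formatting may differ.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;

public class MovieSetSketch {

    // Hypothetical holder for the three scraped fields.
    static final class Movie {
        final String title;
        final String director;
        final String genre;

        Movie(String title, String director, String genre) {
            this.title = title;
            this.director = director;
            this.genre = genre;
        }
    }

    // Turns each movie into a single string and collects the strings into
    // a sorted set, so identical entries appear only once.
    static Set<String> toStrings(List<Movie> movies) {
        return movies.stream()
                .map(m -> m.title + " | " + m.director + " | " + m.genre)
                .collect(Collectors.toCollection(TreeSet::new));
    }

    public static void main(String[] args) {
        // Placeholder values, not real scraped data.
        List<Movie> movies = Arrays.asList(
                new Movie("Title A", "Director A", "Drama"),
                new Movie("Title B", "Director B", "Comedy"),
                new Movie("Title A", "Director A", "Drama"));
        System.out.println(toStrings(movies));
    }
}
```

A set drops repeated entries, which may be useful if the same movie is reachable from more than one scraped page. In a Spring controller, a returned `Set<String>` is normally serialized as a JSON array.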

In Progress

  • Integrate a PostgreSQL database.
  • Also fetch the cast, year, country, and notes for each movie.
  • Publish a public API.
  • Define requirements for a front-end.
    • Research and evaluate best technology options to develop a dashboard.

Tech stack

Back-End

  • REST API
  • Java 8
    • Stream API: used the Collectors class for faster, simpler manipulation of the data (map-reduce paradigm).
      • Functional programs are easier to parallelize.
      • Map-reduce lets much of the processing run in parallel, which makes processing large amounts of data more scalable.
    • Concurrency: used the CompletableFuture class, which implements the Future and CompletionStage interfaces.
  • Spring Framework
  • Jsoup (Java library) for fetching URLs and for extracting and manipulating data.
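
The Stream API and CompletableFuture items above can be combined roughly as follows. This is a minimal sketch, not the project's actual code: the `fetchAll` method, the `fetcher` function, the pool size, and the example URLs are assumptions made for illustration, and the stand-in fetcher only returns a string where real code would download and parse a page (for example with Jsoup).

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;
import java.util.stream.Collectors;

public class ConcurrentFetchSketch {

    // Starts one asynchronous task per URL, then waits for all of them
    // and merges the results into a set.
    static Set<String> fetchAll(List<String> urls,
                                Function<String, String> fetcher,
                                ExecutorService pool) {
        // Map step: submit every URL before waiting on any of them,
        // so the tasks can overlap.
        List<CompletableFuture<String>> futures = urls.stream()
                .map(url -> CompletableFuture.supplyAsync(() -> fetcher.apply(url), pool))
                .collect(Collectors.toList());

        // Reduce step: join() blocks until each result is ready.
        return futures.stream()
                .map(CompletableFuture::join)
                .collect(Collectors.toSet());
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            // Placeholder URLs; the stand-in fetcher never contacts them.
            List<String> urls = Arrays.asList(
                    "https://example.org/a",
                    "https://example.org/b");
            Set<String> results = fetchAll(urls, url -> "parsed:" + url, pool);
            System.out.println(results);
        } finally {
            pool.shutdown();
        }
    }
}
```

Collecting the futures into a list before calling `join()` matters: joining inside the first stream would wait for each task in turn and lose the concurrency.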

Run Program

  • From the command line, run inside the project folder: mvn spring-boot:run

Note: You must have the Java 8 JDK and Maven installed.

Testing & Documentation of API

Completed

  • API documentation with Swagger.

In Progress

  • Implement Monte Carlo testing for concurrency.

Demo Pictures

[Images: Demo 1 through Demo 6]
