The app.py file serves as the entry point of the application, defining three API endpoints for interacting with the SafeSurf web application.
- /: Homepage of the application where users can input a URL to assess its safety.
- /preview: Endpoint to view a preview of the website within SafeSurf.
- /source-code: Endpoint to view the source code of a website.
The file imports the following external libraries:
- Flask: A micro web framework for building web applications.
- requests: A library for making HTTP requests to external resources.
- BeautifulSoup: A library for parsing HTML and XML documents.
- urljoin: A function for resolving relative URLs against a base URL to produce absolute URLs.
The / endpoint:
- Handles GET and POST requests.
- Parses the input URL and passes it to the controller.main function for assessment.
- Renders the index.html template with the assessment result.
The /preview endpoint:
- Handles POST requests.
- Retrieves the HTML content of the input URL using the requests library.
- Parses the HTML content with BeautifulSoup and injects external resources into the HTML.
- Renders the preview.html template with the prettified HTML content.
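As a rough illustration of the resource-injection step, relative resource paths found in the fetched HTML can be resolved against the page URL with urljoin so that they still load inside the preview. The sketch below works on a plain list of paths; the function name resolve_resources and the sample URLs are hypothetical, and the actual app.py may handle this differently.

```python
from urllib.parse import urljoin


def resolve_resources(page_url, resource_paths):
    """Resolve each (possibly relative) resource path against the page URL."""
    return [urljoin(page_url, path) for path in resource_paths]


if __name__ == "__main__":
    # Hypothetical page and resource paths, for illustration only.
    page = "https://example.com/blog/post.html"
    for resolved in resolve_resources(page, ["css/style.css", "/img/logo.png"]):
        print(resolved)
```

Paths that are already absolute are returned unchanged, so the same call can be applied to every link or script reference without checking its form first.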
The /source-code endpoint:
- Handles GET and POST requests.
- Retrieves the HTML content of the input URL using the requests library.
- Renders the source_code.html template with the prettified HTML content.
The controller.py file serves as the main controller for the trustworthiness assessment of URLs. It orchestrates various checks and calculations to determine the trust score of a given URL. Below is a detailed explanation of how the code functions:
BASE_SCORE: This variable holds the default trust score for a URL out of 100. It is initialized to 50.
The main function is the entry point for URL assessment. It takes a URL as input and performs the following steps:
- Input Validation:
  - The URL is validated and formatted to include the protocol (HTTP or HTTPS) using the include_protocol() function from the model module.
- Default Data Initialization:
  - The domain of the URL is extracted using tldextract to determine the domain name.
  - A default response dictionary is initialized with 'SUCCESS' status and the input URL.
  - The trust score is initialized with the base score.
- URL Assessment:
  - Phishtank Check: It checks if the URL is listed as a phishing link in the Phishtank database.
  - Website Status: It determines the status of the website's response using the validate_url() function from the model module.
  - Domain Rank: It retrieves the domain rank using the get_domain_rank() function from the model module. Authentic websites typically have high traffic volume, indicating popularity and reliability.
  - Domain Age: It retrieves WHOIS data to determine the domain's age and includes it in the response. Domains less than 2 years old may raise suspicion.
  - URL Shortening: It checks for the use of URL shortening services, which can obscure the original URL and potentially indicate malicious intent.
  - HSTS Support: It checks whether the domain supports HTTPS and HTTP Strict Transport Security (HSTS). Legitimate domains often prioritize security with HSTS support.
  - IP Presence: It checks whether the URL uses an IP address in place of a domain name. Phishing links may use IP addresses, whereas authentic sites typically have dedicated domain names.
  - URL Redirects: It checks for URL redirects. Redirection to other pages can be a tactic to hide the original phishing link.
  - URL Length: It checks if the URL is too long. URLs exceeding 75 characters may indicate phishing attempts, as attackers may try to conceal suspicious parts in the address bar.
  - URL Depth: It checks if the URL is too deep. Excessive depth in the URL structure (number of '/') raises suspicion, as legitimate websites typically have simpler structures.
- IP Address and SSL Certificate:
  - It retrieves the IP address associated with the domain and includes it in the response.
  - It retrieves SSL certificate details using the get_certificate_details() function from the model module.
- Trust Score Calculation:
  - The trust score is calculated based on the results of the assessments.
- Response Generation:
  - A response JSON containing the assessment results, including the trust score, is returned.
- Error Handling:
  - Any exceptions that occur during the assessment process are caught, and an error response is generated with details of the error.
This function provides a comprehensive assessment of the trustworthiness of the input URL by conducting various checks and calculations, encapsulating the results in a response dictionary.
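The overall flow described above (initialize defaults, run checks, adjust the score, catch errors) can be sketched roughly as follows. This is not the actual controller code: the assess function, the shape of the checks argument, the 'ERROR' status label, the "msg" key, and the clamping of the score to 0..100 are all assumptions made for illustration.

```python
BASE_SCORE = 50  # default trust score out of 100, as described above


def assess(url, checks):
    """Run each check against the URL and collect results in a response dict.

    `checks` maps a result name to a (check_function, score_delta) pair; the
    delta is applied when the check returns a truthy value. This shape is
    hypothetical and only serves the illustration.
    """
    try:
        # Input validation: add a default protocol when none is given.
        if not url.startswith(("http://", "https://")):
            url = "https://" + url

        # Default data initialization.
        response = {"status": "SUCCESS", "url": url}
        trust_score = BASE_SCORE

        # Run every check and adjust the score accordingly.
        for name, (check, delta) in checks.items():
            result = check(url)
            response[name] = result
            if result:
                trust_score += delta

        # Keep the score within 0..100 (an assumption, not from the source).
        response["trust_score"] = max(0, min(100, trust_score))
        return response
    except Exception as exc:
        # Error handling: report the failure instead of raising.
        return {"status": "ERROR", "url": url, "msg": str(exc)}
```

Passing the checks in as data keeps the sketch short; the real controller calls the individual model functions directly.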
The model.py file contains various functions responsible for conducting different checks and calculations to determine the trustworthiness of a URL. Below is a detailed explanation of each function's purpose and functionality:
BASE_SCORE: This variable holds the default trust score for a URL out of 100. It is initialized to 50.
PROPERTY_SCORE_WEIGHTAGE: This dictionary holds the weightage of each property in determining the trust score. The score is adjusted according to the weightage assigned to each parameter. The weightage values were chosen through trials and by consulting published papers, since no standard values exist.
This function checks whether the provided URL is active by making a request to it. It returns the HTTP response status code if the URL is reachable; otherwise, it returns False.
This function ensures that the URL provided has a protocol prefix (HTTP or HTTPS). If not, it prepends 'https://' as the default protocol.
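A minimal sketch of this protocol check, assuming a simple prefix test is sufficient. The function name include_protocol comes from the controller description, but the body is an illustration and may differ from the real implementation.

```python
def include_protocol(url):
    """Return the URL unchanged if it has a protocol, else prepend 'https://'."""
    if url.lower().startswith(("http://", "https://")):
        return url
    return "https://" + url
```

For example, include_protocol("example.com") gives "https://example.com", while a URL that already starts with "http://" is left as it is.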
This function retrieves the domain rank from a precompiled list of the top 1 million domains. If the domain is found in the list, its rank is returned; otherwise, 0 is returned.
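The lookup can be pictured as a dictionary access with 0 as the fallback. In this sketch the rank data is passed in as an argument; the second parameter and the sample ranks are made up for illustration, and the real function presumably reads the precompiled files instead.

```python
def get_domain_rank(domain, domain_ranks):
    """Return the rank of the domain, or 0 if it is not in the list."""
    return domain_ranks.get(domain.lower(), 0)


if __name__ == "__main__":
    # Made-up sample data standing in for the top 1 million list.
    sample_ranks = {"example.com": 1, "example.org": 2}
    print(get_domain_rank("example.org", sample_ranks))
    print(get_domain_rank("unknown.example", sample_ranks))
```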
This function retrieves WHOIS data for the given domain, including creation date, expiration date, and other relevant information.
This function checks whether the provided URL supports HTTP Strict Transport Security (HSTS) by examining the response headers. It returns 1 if HSTS is supported; otherwise, it returns 0.
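HSTS is signalled by the Strict-Transport-Security response header, so the check reduces to looking for that header. The sketch below takes the headers as a plain mapping rather than making a request; the name supports_hsts is hypothetical.

```python
def supports_hsts(headers):
    """Return 1 if a Strict-Transport-Security header is present, else 0."""
    # Header names are case-insensitive, so compare in lower case.
    header_names = {name.lower() for name in headers}
    return 1 if "strict-transport-security" in header_names else 0
```

With the requests library, the headers attribute of a response could be passed straight into a function like this.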
This function checks whether the provided domain is a URL shortening service by comparing it against a list of known URL shorteners. It returns 1 if the domain is a URL shortener; otherwise, it returns 0.
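A sketch of the shortener check as a set membership test. The set below is a small illustrative sample rather than the list the project uses, and the function name is hypothetical.

```python
# Small illustrative sample; the project's actual list is likely much longer.
URL_SHORTENERS = {"bit.ly", "tinyurl.com", "t.co"}


def is_url_shortener(domain):
    """Return 1 if the domain is a known URL shortener, else 0."""
    return 1 if domain.lower() in URL_SHORTENERS else 0
```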
This function checks whether an IP address is present in the provided URL. It returns 1 if an IP address is present; otherwise, it returns 0.
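One way to implement such a check with the standard library is to extract the host part of the URL and try to parse it as an IP address. The function name has_ip_address is hypothetical, and the real code may use a different approach, such as a regular expression.

```python
import ipaddress
from urllib.parse import urlparse


def has_ip_address(url):
    """Return 1 if the host part of the URL is an IP address, else 0."""
    host = urlparse(url).hostname
    if host is None:
        return 0
    try:
        ipaddress.ip_address(host)  # raises ValueError for domain names
        return 1
    except ValueError:
        return 0
```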
This function checks whether the provided URL redirects to another URL. If redirection occurs, it returns an array of redirected URLs; otherwise, it returns 0.
This function checks whether the provided URL is too long by comparing its length to a predefined threshold. It returns 1 if the URL is too long; otherwise, it returns 0.
This function checks whether the provided URL has a deep hierarchy by counting the number of slashes in the URL path. If the number exceeds a predefined threshold, it returns 1; otherwise, it returns 0.
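The length and depth checks can be sketched together. The 75-character limit is the one mentioned in the controller description; the depth threshold of 5 and both function names are assumptions, since the documentation does not state them.

```python
from urllib.parse import urlparse

MAX_URL_LENGTH = 75  # threshold stated in the controller description
MAX_URL_DEPTH = 5    # hypothetical threshold; the real value is not documented


def is_url_too_long(url):
    """Return 1 if the URL exceeds the length threshold, else 0."""
    return 1 if len(url) > MAX_URL_LENGTH else 0


def is_url_too_deep(url):
    """Return 1 if the URL path has more slashes than the threshold, else 0."""
    depth = urlparse(url).path.count("/")
    return 1 if depth > MAX_URL_DEPTH else 0
```

Counting slashes in the parsed path, rather than in the whole URL, keeps the two slashes of "https://" out of the depth.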
This function analyzes the content of the provided URL's webpage for various potentially malicious elements, such as onmouseover events, right-click disabling, presence of forms, iframes, login keywords, and pop-ups. It returns a dictionary indicating the presence of each element.
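To give an idea of how such an analysis might work, the sketch below scans HTML for a few of the listed elements using the standard library parser. It covers only forms, iframes, onmouseover handlers, and a naive "login" keyword test; the names ContentScanner and scan_content and the dictionary keys are hypothetical, and the project itself may rely on BeautifulSoup instead.

```python
from html.parser import HTMLParser


class ContentScanner(HTMLParser):
    """Record whether forms, iframes, or onmouseover handlers appear."""

    def __init__(self):
        super().__init__()
        self.findings = {"onmouseover": 0, "form": 0, "iframe": 0}

    def handle_starttag(self, tag, attrs):
        # The parser lower-cases tag and attribute names before this call.
        if tag in ("form", "iframe"):
            self.findings[tag] = 1
        if any(name == "onmouseover" for name, _ in attrs):
            self.findings["onmouseover"] = 1


def scan_content(html):
    """Return a dict with 1/0 flags for each element looked for."""
    scanner = ContentScanner()
    scanner.feed(html)
    scanner.close()
    findings = dict(scanner.findings)
    # Naive keyword test; a real check would likely be more selective.
    findings["login_keyword"] = 1 if "login" in html.lower() else 0
    return findings
```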
This function searches the Phishtank database to determine if the provided URL is listed as a phishing link. It returns 1 if the URL is found in the database; otherwise, it returns 0.
This function retrieves the IP address associated with the provided domain.
This function retrieves SSL certificate details for the provided domain, including issuer information, validity period, revocation status, cipher suite, and SSL/TLS version.
This function calculates the trust score based on the provided parameters. It adjusts the current score based on the specific case and value passed to it.
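The score adjustment can be sketched as a lookup in the weightage dictionary followed by an addition. The property names, the weight values, and the clamping to the 0..100 range below are all hypothetical; the documentation does not list the real weights.

```python
BASE_SCORE = 50

# Hypothetical weights for illustration; the real values are not documented.
PROPERTY_SCORE_WEIGHTAGE = {
    "hsts_support": 10,
    "is_url_shortened": -10,
    "ip_present": -15,
}


def calculate_trust_score(current_score, case, value):
    """Adjust the score by the weight of `case` when `value` is truthy."""
    if value:
        current_score += PROPERTY_SCORE_WEIGHTAGE.get(case, 0)
    # Clamp to the 0..100 range (an assumption).
    return max(0, min(100, current_score))
```

Under these made-up weights, a URL that starts at the base score and supports HSTS would move from 50 to 60, while one that uses a raw IP address would drop to 35.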
This module provides a comprehensive set of functions for analyzing various aspects of a URL to determine its trustworthiness.
The onetimescript.py file is a one-time script used to update the sorted-top1million.txt and domain-rank.json files with the latest list of top 1 million websites. The purpose of this script is to maintain an updated list of the top 1 million websites, which can be used for the URL assessment.
Below is a detailed explanation of how the script functions:
This function is responsible for updating the two files with the latest data. It performs the following steps:
- Reading Data from CSV:
  - It reads the data from the top-1m.csv file, which contains the list of the top 1 million websites along with their ranks.
- Populating Arrays and Dictionary:
  - It populates two data structures:
    - domain_data_array: This list contains the domain names extracted from the CSV file.
    - domain_data_dict: This dictionary maps each domain name to its rank.
- Sorting:
  - It sorts the domain_data_array alphabetically to create a sorted list of domain names.
- Clearing Existing Files:
  - It clears the contents of the existing sorted-top1million.txt and domain-rank.json files.
- Writing Data to Files:
  - It writes the sorted domain names to the sorted-top1million.txt file.
  - It writes the domain-rank dictionary to the domain-rank.json file in JSON format.
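The steps above can be sketched as a single function. This is an illustration rather than the actual script: the function name update_domain_files and its parameters are hypothetical, the CSV is assumed to contain rank,domain rows without a header, and the text file is assumed to hold one domain per line.

```python
import csv
import json


def update_domain_files(csv_path, sorted_txt_path, rank_json_path):
    """Rebuild the sorted domain list and the domain-to-rank mapping."""
    domain_data_array = []
    domain_data_dict = {}

    # Read rank,domain rows (assumed layout) and fill both structures.
    with open(csv_path, newline="", encoding="utf-8") as csv_file:
        for row in csv.reader(csv_file):
            if len(row) < 2:
                continue  # skip blank or malformed rows
            rank, domain = row[0], row[1]
            domain_data_array.append(domain)
            domain_data_dict[domain] = int(rank)

    domain_data_array.sort()

    # Opening in "w" mode clears any existing content before writing.
    with open(sorted_txt_path, "w", encoding="utf-8") as txt_file:
        txt_file.write("\n".join(domain_data_array))
    with open(rank_json_path, "w", encoding="utf-8") as json_file:
        json.dump(domain_data_dict, json_file)

    return len(domain_data_array)
```

The return value is the number of domains processed, which gives a quick sanity check after a run.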
To update the files with new data:
- Download Latest Data: Obtain the latest top-1m.csv file from Tranco List. This CSV file is updated monthly.
- Update Local Files: Copy the downloaded top-1m.csv file to the /static/data/ directory.
- Execute Script: Run the onetimescript.py file. The script will automatically read the top-1m.csv file, extract relevant data, and update the sorted-top1million.txt and domain-rank.json files. The execution time of the script varies but typically takes about 10-20 seconds to complete.
The script was last executed with the latest top-1m.csv file on: 2024-03-04
This script provides a convenient way to update the files containing the list of top 1 million websites, facilitating the assessment of URLs based on their domain ranks.