URL Text Extraction and Analysis

Overview

This project extracts text from a list of URLs and performs a detailed textual analysis to compute various linguistic and sentiment metrics. The extracted data is saved to text files, and the analysis results are compiled into an Excel file.

Folder Structure

url-text-extraction-analysis/
├── extracted_texts/            # Directory to store extracted article texts
├── Input.xlsx                  # Input file with URLs
├── Output Data Structure.xlsx  # Output file with analysis results
├── Article_extrction.py        # Main script for extraction and analysis
├── requirements.txt            # List of required Python libraries
└── README.md                   # This README file

Features

Text Extraction: Extracts article text from a list of provided URLs.
Text Analysis: Computes various metrics such as sentiment scores, readability indices, and word counts.
Excel Output: Saves the analysis results to an Excel file for easy interpretation.
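
To illustrate the text extraction step, here is a minimal sketch that pulls the title and paragraph text out of an HTML page using only the Python standard library. It is not taken from Article_extrction.py: the class `ParagraphExtractor`, the function `extract_article_text`, and the choice of keeping only `<h1>` and `<p>` content are assumptions made for this example. Downloading each page and saving the result under extracted_texts/ is left out.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text found inside <h1> and <p> tags (hypothetical helper)."""

    KEEP = {"h1", "p"}  # assumption: the article body lives in these tags

    def __init__(self):
        super().__init__()
        self._depth = 0    # greater than 0 while inside a tag we keep
        self._chunks = []  # text fragments of the block being read
        self.blocks = []   # finished blocks, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.KEEP:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag in self.KEEP and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                # Join the fragments and collapse runs of whitespace.
                text = " ".join("".join(self._chunks).split())
                if text:
                    self.blocks.append(text)
                self._chunks = []

    def handle_data(self, data):
        if self._depth > 0:
            self._chunks.append(data)


def extract_article_text(html: str) -> str:
    """Return the kept blocks of an HTML document, one per line."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.blocks)


sample = (
    "<html><body><nav>Menu</nav><h1>Title</h1>"
    "<p>First <b>bold</b> line.</p><p>Second line.</p></body></html>"
)
print(extract_article_text(sample))
```

Navigation text such as the `<nav>` content in the sample is skipped because it sits outside the kept tags. Real pages often need site-specific selectors, so treat the tag set as a starting point.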

Installation

Clone the repository:

   git clone https://github.com/Blacksujit/Data-Extraction-and-NLP.git

Install the required Python libraries:

   pip install -r requirements.txt

Download the necessary NLTK data (run once in a Python session):

   import nltk
   nltk.download('punkt')
   nltk.download('cmudict')

Run the script:

   python Article_extrction.py

Results

The analysis results are saved in Output Data Structure.xlsx, containing the following metrics for each URL:

1. Positive Score
2. Negative Score
3. Polarity Score
4. Subjectivity Score
5. Average Sentence Length
6. Percentage of Complex Words
7. Fog Index
8. Average Words per Sentence
9. Complex Word Count
10. Word Count
11. Syllables per Word
12. Personal Pronouns Count
13. Average Word Length
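
As a rough illustration of how metrics like these can be computed, the sketch below uses commonly cited definitions. All of them are assumptions rather than the formulas in Article_extrction.py: a complex word is one with more than two syllables, the Fog Index is 0.4 × (average sentence length + percentage of complex words), polarity is (positive − negative) / (positive + negative), and subjectivity is (positive + negative) / word count. Average words per sentence is treated as the same ratio as average sentence length. The function names, the vowel-group syllable heuristic, and the tiny word lists are hypothetical stand-ins for whatever tokenizer and sentiment lexicon the script actually uses.

```python
import re

EPS = 1e-6  # keeps the sentiment ratios defined when a count is zero


def count_syllables(word: str) -> int:
    """Rough heuristic: count vowel groups, ignoring a trailing 'es'/'ed'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith(("es", "ed")) and count > 1:
        count -= 1
    return max(count, 1)


def count_personal_pronouns(text: str) -> int:
    """Count I, we, my, ours, us; skip upper-case 'US' (likely the country)."""
    hits = re.findall(r"\b(?:I|we|my|ours|us)\b", text, flags=re.IGNORECASE)
    return sum(1 for h in hits if h != "US")


def readability_metrics(text: str) -> dict:
    """Compute length and readability metrics for one article text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(len(words), 1)          # avoid division by zero
    n_sentences = max(len(sentences), 1)
    syllables = [count_syllables(w) for w in words]
    complex_count = sum(1 for s in syllables if s > 2)
    avg_sentence_length = len(words) / n_sentences
    pct_complex = 100 * complex_count / n_words
    return {
        "word_count": len(words),
        "avg_sentence_length": avg_sentence_length,
        "complex_word_count": complex_count,
        "pct_complex_words": pct_complex,
        "fog_index": 0.4 * (avg_sentence_length + pct_complex),
        "syllables_per_word": sum(syllables) / n_words,
        "avg_word_length": sum(len(w) for w in words) / n_words,
        "personal_pronouns": count_personal_pronouns(text),
    }


def sentiment_scores(text: str, positive: set, negative: set) -> dict:
    """Score a text against lower-case positive and negative word sets."""
    words = [w.lower() for w in re.findall(r"[A-Za-z']+", text)]
    pos = sum(1 for w in words if w in positive)
    neg = sum(1 for w in words if w in negative)
    return {
        "positive_score": pos,
        "negative_score": neg,
        "polarity_score": (pos - neg) / (pos + neg + EPS),
        "subjectivity_score": (pos + neg) / (len(words) + EPS),
    }


text = "We love clean data. Bad extraction is terrible."
print(readability_metrics(text))
print(sentiment_scores(text, positive={"love", "clean"}, negative={"terrible"}))
```

Each returned dictionary maps naturally to one row of the output spreadsheet, with one column per metric.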

Contributing

Feel free to contribute to this project by submitting a pull request or opening an issue.

License

This project is licensed under the MIT License.
