The Source Recommendation System takes an article from the user as input and returns the relevant articles from the 8.5 million articles in the dataset. It uses Apache Spark to handle this volume of articles.
This project uses the rake-nltk library to extract keywords. Install it with:
pip install rake-nltk
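A minimal sketch of keyword extraction with rake-nltk (the article text below is a placeholder; RAKE also needs NLTK's stopwords and punkt data):

```python
import nltk
from rake_nltk import Rake

# RAKE relies on NLTK's stopword list and sentence tokenizer.
nltk.download("stopwords")
nltk.download("punkt")

rake = Rake()
article_text = "Placeholder article text about distributed systems and news analysis."
rake.extract_keywords_from_text(article_text)

# Ranked keyword phrases, highest-scoring first.
print(rake.get_ranked_phrases())
```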
FakeNewsCorpus (27 GB) was used as the dataset of news articles. Apache Spark has been used to handle this huge dataset, and it needs to be correctly installed and configured. The configuration files for Spark can be found in the spark-2.4.4-bin-hadoop2.7 folder. Hadoop was used as the underlying distributed file system; its configuration can be found in the hadoop-conf folder. Both need to be changed according to your setup.
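As a rough illustration of the setup (the HDFS path and CSV read options below are assumptions, not the project's actual configuration), loading the corpus with PySpark looks roughly like:

```python
from pyspark.sql import SparkSession

# Hypothetical session; adjust master, memory, and other settings to your cluster.
spark = (SparkSession.builder
         .appName("SourceRecommendationSystem")
         .getOrCreate())

# FakeNewsCorpus is distributed as CSV; the HDFS path here is a placeholder.
corpus = spark.read.csv("hdfs:///fake_news_corpus/news.csv",
                        header=True, multiLine=True, escape='"')
corpus.printSchema()
```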
The source code can be found in the /src folder.
- The whole dataset was partitioned into smaller files. The code to partition the dataset can be found in the PartitionFakeNewsCorpus.py file.
- The code to extract keywords from the partitioned dataset can be found in the ExtractKeywordsFromFakeCorpus.py file.
- The main code that takes an input article and outputs the relevant articles can be found in the FindSimillarDocs.py file; a minimal sketch of this matching step follows the list.
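The sketch below assumes each corpus record already carries its extracted keyword set; the field layout and the Jaccard scoring are illustrative, not the exact logic in FindSimillarDocs.py:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("FindSimilarDocsSketch").getOrCreate()
sc = spark.sparkContext

# Assumed layout: (doc_id, keyword_set) pairs produced by the extraction step.
corpus_keywords = sc.parallelize([
    ("doc-1", {"election", "senate", "vote"}),
    ("doc-2", {"climate", "emissions", "policy"}),
])

def jaccard(a, b):
    """Overlap between two keyword sets, from 0.0 to 1.0."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

# Keywords extracted from the user's input article (placeholder values).
query_keywords = {"senate", "vote", "budget"}

# Score every document against the query and keep the best matches.
top_matches = (corpus_keywords
               .map(lambda kv: (kv[0], jaccard(kv[1], query_keywords)))
               .takeOrdered(10, key=lambda kv: -kv[1]))
print(top_matches)
```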
This idea was implemented as a project for the coursework of the Distributed Systems course at Colorado State University. A detailed description of the algorithm can be found here -