This project scrapes Amazon reviews for Intel processors, preprocesses the data, and applies machine learning techniques for sentiment analysis. We combine Term Frequency-Inverse Document Frequency (TF-IDF) vectorization with Logistic Regression, and use Word2Vec embeddings within a Long Short-Term Memory (LSTM) architecture to capture richer semantics. The workflow covers tokenization and cleaning with the Natural Language Toolkit (NLTK), feature extraction using TF-IDF, and model training with Keras. Key takeaways on processor strengths and weaknesses were compiled for Intel engineers, along with a performance comparison of Transformer and LSTM models. Word clouds and graphs visualize the findings.
- Developed by: Aviral Srivastava, Garv Bhaskar
- Institution: Vellore Institute of Technology, Chennai
- Open the Python notebook notebook.ipynb.
- Inside your terminal, run the following command to install all required packages (logging, re, and collections ship with the Python standard library and need no installation):
pip install scrapy pandas numpy ipykernel tensorflow keras langdetect scikit-learn nltk beautifulsoup4 matplotlib seaborn
- Use the "Run all cells" command in the notebook.
Our solution uses TF-IDF vectorization with Logistic Regression for baseline performance evaluation, and an LSTM initialized with pretrained Word2Vec embeddings for richer semantic understanding. We use NLTK for tokenization and cleaning, and TF-IDF for feature extraction. Logistic Regression is tuned via GridSearchCV, and the LSTM model is trained in Keras. Models are evaluated on accuracy, and word clouds visualize the results.
Real-time reviews were collected from various e-commerce websites and social media platforms using web scraping tools like BeautifulSoup and Scrapy.
Cleansed and tokenized the textual reviews, handled punctuation and stop words, and converted the reviews into numerical representations such as Bag of Words and Word2Vec.
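The cleaning steps can be sketched as follows. The project uses NLTK for tokenization and stop-word removal; to keep this sketch dependency-free, a regular-expression tokenizer and a tiny hand-written stop-word list stand in for NLTK. The names `STOP_WORDS`, `clean_review`, and `bag_of_words` are hypothetical and do not come from the notebook.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption); NLTK's English list is much larger.
STOP_WORDS = {"a", "an", "the", "is", "it", "this", "and", "for", "of", "to", "in", "my"}


def clean_review(text):
    """Lowercase the text, keep alphabetic tokens only, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())  # also discards punctuation and digits
    return [tok for tok in tokens if tok not in STOP_WORDS]


def bag_of_words(tokens):
    """Map each remaining token to its count (a minimal Bag-of-Words representation)."""
    return Counter(tokens)


tokens = clean_review("This CPU is great for gaming, and the price is great!")
counts = bag_of_words(tokens)
print(tokens)
print(counts.most_common(2))
```

Swapping in NLTK's tokenizer and stop-word corpus would change only the body of `clean_review`; the downstream counting step stays the same.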
Analyzed sentiment label distribution for balance, visualized brand-specific rating distributions, and examined statistical summaries of review lengths.
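A minimal sketch of the label-balance and review-length checks, using only the standard library. The sample reviews and labels are invented for illustration and are not taken from the scraped dataset.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical sample of (review text, sentiment label) pairs.
reviews = [
    ("Runs cool and fast", "positive"),
    ("Arrived damaged in a loose box", "negative"),
    ("Great value for gaming", "positive"),
    ("Overheats with the stock cooler", "negative"),
    ("Excellent upgrade", "positive"),
]

# Count how many reviews carry each label.
label_counts = Counter(label for _, label in reviews)

# Share of each label; values far from an even split suggest class imbalance.
label_share = {label: n / len(reviews) for label, n in label_counts.items()}

# Review length measured in whitespace-separated words.
lengths = [len(text.split()) for text, _ in reviews]
length_summary = {
    "min": min(lengths),
    "max": max(lengths),
    "mean": mean(lengths),
    "median": median(lengths),
}

print(label_share)
print(length_summary)
```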
- Benchmark Model: CountVectorizer with Multinomial Naive Bayes
- Other Models: TfidfVectorizer with Logistic Regression, tuned using Pipeline and GridSearchCV
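The two model families listed above can be sketched with scikit-learn as follows. The toy texts, the binary labels, and the parameter grid are assumptions chosen for illustration; the notebook's actual data and search space may differ.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data; 1 = positive, 0 = negative.
texts = [
    "excellent performance for gaming", "fast and efficient processor",
    "great value and runs cool", "smooth upgrade and works perfectly",
    "arrived damaged and defective", "overheats with the stock cooler",
    "poor packaging and used product", "too expensive for the performance",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Benchmark: raw token counts fed to Multinomial Naive Bayes.
benchmark = Pipeline([("vec", CountVectorizer()), ("clf", MultinomialNB())])
benchmark.fit(texts, labels)

# Tuned model: TF-IDF features fed to Logistic Regression.
tfidf_lr = Pipeline([
    ("vec", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Illustrative grid (assumption); the "step__param" prefix routes each
# parameter to the matching pipeline step.
param_grid = {
    "vec__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}
search = GridSearchCV(tfidf_lr, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)

best_params = search.best_params_
prediction = search.predict(["fast and great for gaming"])
print(best_params, prediction)
```

Wrapping the vectorizer and classifier in one `Pipeline` means each cross-validation fold refits the vectorizer on its own training split, so vocabulary statistics do not leak from the held-out fold.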
- Load pretrained word embedding model.
- Construct embedding layer using embedding matrix as weights.
- Train an LSTM with Word2Vec embedding (embedding layer => LSTM layer => dense layer).
- Compile and fit the model using the log loss (cross-entropy) function and the Adam optimizer.
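The steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the pretrained vectors are represented by a plain dict (a stand-in for a loaded Word2Vec model), the task is treated as binary sentiment, and the names `build_embedding_matrix`, `build_lstm_model`, and `lstm_units` are hypothetical.

```python
import numpy as np


def build_embedding_matrix(word_index, word_vectors, dim):
    """Row i holds the vector for the word with index i; row 0 is reserved for padding."""
    matrix = np.zeros((len(word_index) + 1, dim), dtype="float32")
    for word, idx in word_index.items():
        vector = word_vectors.get(word)
        if vector is not None:  # out-of-vocabulary words stay as zero vectors
            matrix[idx] = vector
    return matrix


def build_lstm_model(embedding_matrix, lstm_units=64):
    """Embedding layer => LSTM layer => dense layer, compiled with log loss and Adam."""
    # Imported lazily so the matrix helper above works without TensorFlow installed.
    from tensorflow import keras

    vocab_size, dim = embedding_matrix.shape
    model = keras.Sequential([
        keras.layers.Embedding(
            input_dim=vocab_size,
            output_dim=dim,
            embeddings_initializer=keras.initializers.Constant(embedding_matrix),
            trainable=False,  # assumption: keep the pretrained vectors frozen
        ),
        keras.layers.LSTM(lstm_units),
        keras.layers.Dense(1, activation="sigmoid"),  # assumption: binary sentiment
    ])
    model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
    return model


# Hypothetical 3-dimensional vectors standing in for a pretrained Word2Vec model.
word_vectors = {"fast": np.array([0.1, 0.2, 0.3]), "hot": np.array([0.9, 0.8, 0.7])}
word_index = {"fast": 1, "hot": 2, "unseen": 3}
embedding_matrix = build_embedding_matrix(word_index, word_vectors, dim=3)
print(embedding_matrix)
```

Calling `build_lstm_model(embedding_matrix)` requires TensorFlow. Freezing the embedding layer is one possible choice; setting `trainable=True` would instead fine-tune the vectors on the review data.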
Generated summaries to capture key insights from the sentiment analysis results.
- Performance and Efficiency:
  - Intel processors are praised for their excellent performance in gaming, video editing, and other demanding applications.
  - The energy efficiency of Intel processors is particularly appreciated, especially during times of energy crisis.
- Customer Satisfaction:
  - Users report high satisfaction with the smooth running and fast performance of Intel processors.
  - Many customers highlight the processors as the best choice for gaming and creative professionals.
- Features and Compatibility:
  - Intel processors are valued for their compatibility with various motherboards and components.
  - Features like multiple cores and high GHz ratings are highly regarded.
- Customer Experience:
  - Positive reviews often mention the enjoyable process of upgrading to Intel processors and the efficient performance once installed.
- Delivery and Packaging Issues:
  - Some customers experienced poor packaging, with processors rattling around loose inside the box or inadequately protected by bubble wrap.
  - Instances of receiving used or B-stock products instead of new items have been reported.
- Quality Control:
  - There are complaints about receiving defective or dysfunctional processors.
  - Customers have faced issues with processors not performing as expected, requiring multiple adjustments to settings.
- Performance Concerns:
  - Overheating and the need to replace the stock heatsink fan with more efficient air coolers have been mentioned.
- Cost vs. Performance:
  - Some customers feel the high price of Intel processors does not always match the performance gain, leading to buyer’s remorse.
LSTM strengths:
- Well-suited for tasks where order and past information are critical (e.g., sentiment analysis).
- Relatively easy to implement and interpret.
LSTM weaknesses:
- Can struggle with very long sequences due to the vanishing gradient problem.
- May not efficiently capture relationships between distant elements.
Transformer strengths:
- Handle long sequences well thanks to parallel processing and the attention mechanism.
- Effective in learning relationships between distant elements.
Transformer weaknesses:
- More computationally expensive to train compared to LSTMs.
- Interpretability can be challenging due to the "black box" nature of attention mechanisms.
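To illustrate why attention relates distant elements directly, here is a minimal NumPy sketch of scaled dot-product attention for a single sequence. It is not taken from the project's notebook; the function name, the random input, and the sizes are assumptions for illustration only.

```python
import numpy as np


def scaled_dot_product_attention(q, k, v):
    """Return the attended values and the attention weights for one sequence."""
    scores = q @ k.T / np.sqrt(k.shape[-1])        # similarity of every token pair
    scores -= scores.max(axis=-1, keepdims=True)   # stabilize the softmax numerically
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # each row sums to 1
    return weights @ v, weights


rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4))  # hypothetical sequence: 6 tokens, 4 features each
output, weights = scaled_dot_product_attention(x, x, x)
print(weights.shape, output.shape)
```

Every row of `weights` covers all six positions at once, so the first token can draw on the last in a single step, whereas an LSTM has to carry that information through each intermediate state. The full score matrix grows quadratically with sequence length, which is one source of the higher training cost noted above.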