OpenFoodFactClustering

This project explores the application of clustering algorithms to categorize food products based on their nutritional content. The goal is to identify distinct nutritional profiles within a diverse dataset using K-Means, Fuzzy C-Means, and DBSCAN clustering methods.

Introduction

The project addresses the challenge of clustering food products based on nutritional attributes to improve dietary recommendations and health outcomes. By leveraging unsupervised learning methods, this research aims to identify meaningful clusters in food data.

Installation

To set up the project environment, follow these steps:

Clone the repository:

```
git clone https://github.com/Wei-RongRong2/OpenFoodFactClustering.git
```

Navigate to the project directory:
```
cd OpenFoodFactClustering
```
Install the required Python packages:
```
pip install -r requirements.txt
```

Usage

To run the clustering analysis, follow these steps:

Ensure you have Jupyter Notebook installed. If not, you can install it using:
```
pip install notebook
```
Navigate to the project directory where the Jupyter Notebook is located:
```
cd OpenFoodFactClustering
```
Launch Jupyter Notebook:
```
jupyter notebook
```
In the Jupyter Notebook interface, open the OpenFoodFactClustering.ipynb file.
Download the dataset from Open Food Facts, rename it to en.openfoodfacts.org.products.tsv, and place it in the same directory as the Jupyter Notebook.
Run the cells in the notebook to execute the clustering analysis.
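
To check that the dataset file is in place before running the notebook, it can be loaded directly with pandas. The snippet below is a minimal sketch, not the notebook's own loading code: the helper name load_products is hypothetical, and it assumes only that the file is tab-separated, as its .tsv extension suggests.

```python
from pathlib import Path

import pandas as pd


def load_products(path="en.openfoodfacts.org.products.tsv"):
    """Read the tab-separated Open Food Facts export into a DataFrame."""
    # low_memory=False parses the file in one pass, so pandas infers one
    # dtype per column instead of guessing chunk by chunk.
    return pd.read_csv(Path(path), sep="\t", low_memory=False)


# Only attempt the load when the file is actually present.
DATA_FILE = Path("en.openfoodfacts.org.products.tsv")
if DATA_FILE.exists():
    products = load_products(DATA_FILE)
    print(products.shape)  # (rows, columns)
```

If the shape printed here differs greatly from the size given under Data Collection, the downloaded export is probably a different version from the one used in the report.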

Dashboard

A simple dashboard has been created using Streamlit to visualize the clustering results. You can view the dashboard online at the following URL:

OpenFoodFactClustering Dashboard

The code for the dashboard and the CSV files containing the results are located in the Dashboard folder within this repository.

To run the dashboard locally, follow these steps:

Navigate to the Dashboard folder:
```
cd Dashboard
```
If you have not installed the full set of dependencies for the project and only want to view the dashboard, install the required packages by running:
```
pip install -r requirements.txt
```
(This requirements.txt file is located in the Dashboard folder.)
Run the Streamlit application:
```
streamlit run Dashboard.py
```

Methodology

The project utilizes the Open Food Facts dataset and applies K-Means, Fuzzy C-Means, and DBSCAN algorithms to cluster food products. The dataset undergoes preprocessing, including missing value handling, data validation, and outlier removal.

Data Collection

Source: Open Food Facts dataset available on Open Food Facts
Size: 356,027 rows and 163 columns
Attributes: Product names, categories, nutritional information, ingredients, labels, and packaging details

Preprocessing

Missing Values: Removed columns with more than 20% missing values; imputed the remaining missing values.
Data Validation: Identified invalid data and extreme outliers, then corrected or removed them.
Data Types: One-hot encoded categorical variables; scaled numerical features.
Duplicate Data: Removed duplicate rows and redundant columns.
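
The preprocessing steps above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the notebook's actual code: the function name preprocess is hypothetical, median imputation is an assumed choice (the method is not named here), and the validation and outlier rules are left out because they are not specified.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler


def preprocess(df, max_missing=0.20):
    """Drop sparse columns, de-duplicate, impute, scale, and one-hot encode."""
    # Keep only columns whose share of missing values is at most max_missing.
    df = df.loc[:, df.isna().mean() <= max_missing]
    df = df.drop_duplicates().reset_index(drop=True).copy()

    numeric = df.select_dtypes(include="number").columns
    categorical = df.columns.difference(numeric)

    if len(numeric):
        # Assumption: median imputation; the README does not name the method.
        df[numeric] = df[numeric].fillna(df[numeric].median())
        # Standardize to zero mean and unit variance.
        df[numeric] = StandardScaler().fit_transform(df[numeric])

    if len(categorical):
        # One-hot encode categoricals; missing categories become all-zero rows.
        df = pd.get_dummies(df, columns=list(categorical))
    return df
```

Scaling matters here because K-Means, Fuzzy C-Means, and DBSCAN all rely on distances, so an unscaled feature with a large range would dominate the result.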

Clustering Algorithms

K-Means Clustering: Used for partitioning the data into k clusters based on nutritional attributes.
Fuzzy C-Means Clustering: Allows for overlapping clusters with varying degrees of membership.
DBSCAN Clustering: Density-based algorithm to identify clusters of varying shapes and sizes, with noise detection.
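
As a rough illustration of how the three algorithms differ in use, the sketch below runs each on synthetic two-dimensional data. It is not the notebook's code. K-Means and DBSCAN come from scikit-learn; since scikit-learn has no Fuzzy C-Means, fuzzy_c_means is a hypothetical minimal NumPy implementation of the standard update rules. All parameter values (n_clusters, eps, min_samples, m) are assumptions chosen for the toy data, not the tuned values from the report.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans


def fuzzy_c_means(X, n_clusters, m=2.0, n_iter=100, tol=1e-5, seed=0):
    """Minimal Fuzzy C-Means; returns (centers, membership matrix U)."""
    rng = np.random.default_rng(seed)
    # Random initial memberships; each row sums to 1.
    U = rng.random((X.shape[0], n_clusters))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        W = U ** m
        # Centers are membership-weighted means of the points.
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        # Distance from every point to every center, floored to avoid 0-division.
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        dist = np.fmax(dist, 1e-12)
        # Membership update: u_ik is proportional to d_ik^(-2 / (m - 1)).
        inv = dist ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return centers, U


# Synthetic stand-in for scaled nutritional features: two compact groups.
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(0.0, 0.3, size=(50, 2)),
    rng.normal(5.0, 0.3, size=(50, 2)),
])

# K-Means: hard assignment of each point to one of k clusters.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN: no cluster count needed; the label -1 marks noise points.
dbscan_labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)

# Fuzzy C-Means: each row of U holds degrees of membership that sum to 1.
centers, U = fuzzy_c_means(X, n_clusters=2)
fcm_labels = U.argmax(axis=1)
```

The membership matrix U is what distinguishes Fuzzy C-Means: a product that sits between two nutritional profiles gets partial membership in both rather than being forced into one.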

Results

The clustering analysis aimed to uncover distinct patterns within the dataset, though some challenges were encountered due to the complexity of the data. Here are the key findings:

K-Means: Four clusters were identified, but there was notable overlap, which may indicate the inherent complexity of the data.
Fuzzy C-Means: Clustering coherence and separation improved after tuning, yet some overlap persisted.
DBSCAN: Tuning led to better-defined clusters, although overlap remained a challenge.

These results suggest that while clustering algorithms provided some insights, the complexity of the data presented difficulties in achieving clear, non-overlapping clusters. Further refinement or alternative approaches may be needed to enhance cluster distinctiveness.
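
Cluster overlap of this kind can be quantified with the internal validation metrics listed under References. The sketch below shows how the three scores behave on synthetic data with well-separated and with overlapping groups. It uses scikit-learn's metric functions; the helper evaluate and the toy data are assumptions for illustration and do not reproduce the report's figures.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)


def evaluate(X, labels):
    """Return the three internal validation scores for one labelling."""
    return {
        "silhouette": silhouette_score(X, labels),  # in [-1, 1], higher is better
        "davies_bouldin": davies_bouldin_score(X, labels),  # lower is better
        "calinski_harabasz": calinski_harabasz_score(X, labels),  # higher is better
    }


rng = np.random.default_rng(0)
group_centers = np.array([[0.0, 0.0], [5.0, 5.0]])


def two_groups(std):
    """Two Gaussian groups around fixed centers with the given spread."""
    return np.vstack([rng.normal(c, std, size=(100, 2)) for c in group_centers])


datasets = {"separated": two_groups(0.3), "overlapping": two_groups(3.0)}

scores = {}
for name, X in datasets.items():
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    scores[name] = evaluate(X, labels)
    print(name, scores[name])
```

All three metrics require at least two distinct labels. For DBSCAN output, it is common to drop the noise points (label -1) before scoring, since they do not form a cluster.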

For a more detailed explanation of these steps and results, refer to the full report: Report - Clustering Food Products based on Nutritional Attributes.pdf.

Contributing

Contributions are welcome! Please fork this repository, make your changes in a new branch, and submit a pull request for review.

Fork the repo
Create a feature branch (git checkout -b feature-name)
Commit your changes (git commit -am 'Add some feature')
Push to the branch (git push origin feature-name)
Create a new Pull Request

Acknowledgments

This project was developed in collaboration with limjosun. We worked together on the clustering analysis, dashboard development, and project documentation.

License

This project is part of an academic course and is intended for educational purposes only. It may contain references to copyrighted materials, and the use of such materials is strictly for academic use. Please consult your instructor or institution for guidance on sharing or distributing this work.

For more details, see the LICENSE file.

Contact

Created by Wei-RongRong2. Feel free to contact me!
For any inquiries, you can also reach out to limjosun.

References

Open Food Facts Dataset: Kaggle Link
Machine Learning Algorithms: Scikit-Learn Documentation
Evaluation Metrics: "Silhouette Score," "Davies-Bouldin Index," "Calinski-Harabasz Index"
