
Clustering of the TCGA dataset. Data preprocessing, visualization, and clustering with different algorithms. Done mostly with Python 3.7 and the Scikit-Learn library.


adrian-aleks/TCGA-clasterization


TCGA-clasterization

This repo contains a Jupyter notebook that shows part of the work done for my bachelor thesis. The dataset and a more detailed description can be found at:

Short description

The data is part of the RNA-Seq (HiSeq) PANCAN data set. It is a random extraction of gene expressions from patients with different types of tumor: BRCA, KIRC, COAD, LUAD, and PRAD. Each of the 801 rows describes the gene expression profile of a particular patient. The analysis aimed to answer how well unsupervised learning algorithms can separate the different types of cancer within the dataset, and whether there are any other clusters within or between the different kinds of cancer.
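The question above can be illustrated with a minimal sketch. This is not the notebook's actual code: the data here is a synthetic stand-in generated with `make_blobs` (801 samples, 5 groups, 50 features), and the choices of scaling, PCA with 10 components, and K-Means are assumptions made only for illustration. The idea is to cluster without using the labels, then compare the clusters with the known groups using the adjusted Rand index.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the expression matrix (assumption, not the real data):
# 801 samples in 5 groups, mimicking 801 patients and 5 tumor types.
X, y_true = make_blobs(
    n_samples=801, n_features=50, centers=5, cluster_std=2.0, random_state=0
)

# Standardize features so no single feature dominates the distances.
X_scaled = StandardScaler().fit_transform(X)

# Reduce dimensionality before clustering (10 components is an arbitrary choice).
X_reduced = PCA(n_components=10, random_state=0).fit_transform(X_scaled)

# Cluster into 5 groups without looking at the true labels.
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X_reduced)

# Compare the found clusters with the known groups (1.0 means a perfect match).
ari = adjusted_rand_score(y_true, labels)
print(f"Adjusted Rand index: {ari:.3f}")
```

With the real data, `X` would be the patient-by-gene expression table and `y_true` the tumor type of each patient. Other clustering algorithms from Scikit-Learn can be swapped in for `KMeans` and scored the same way.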

How to run this

Simply click on this link!

Tools used

  • Python 3.7
  • Scikit-Learn library
  • Pandas
  • Matplotlib
