- PHO2
- PHR1
- WRKY75
- SIZ1
- PHT1
- RNS1
- SULTR2;1
- PHF
- PHOX (C. reinhardtii, putative phosphatase)
- EF1a
- EF1b
- EF4a
- Cyclophilin
- Hsp90-1
- Actin
- Tubulin
- Glyceraldehyde-3-phosphate dehydrogenase
- SuCoA (succinyl coenzyme A)
- FtsH protease
- 18S rRNA (both plants and algae)
- GAPDH
- 25S rRNA (plants only)
- 28S rRNA (C. reinhardtii)
To add:
- actually other housekeeping genes
- SULTR
- SNRK
- ???SAC1 (C. reinhardtii)
- ???SLT (C. reinhardtii)
- ???LPB1 (C. reinhardtii)
- SULP
- ATS
- SBP
- APK
- Eutrema salsugineum
- Arabidopsis thaliana
- Oryza sativa
- Brassica rapa
- Capsella rubella
- Brassica napus
- Medicago truncatula
- Solanum lycopersicum
- Zea mays (outgroup)
To add:
- Amborella trichopoda
- Chlamydomonas reinhardtii (new outgroup)
- Solanum tuberosum
- Selaginella moellendorffii
- Download transcriptomes from SRA
- Download genomes from NCBI or JGI
- Align transcripts to genomes
a) Get annotation files from Ensembl
b) Get transcript files from Ensembl
- Create annotation with existing annotation (RSEM?)
- Extract transcript sequences
a) BLAST transcript sequences known in E. salsugineum against the other transcripts
ls -1 transcripts > config && blast.sh $PWD/config
b) Parse top hit transcript
c) Manually go through each hit D:
i) Alternatively, somehow get ThaleMine's Phytozome homologs working (API?)
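Step b) could be sketched roughly as below. This assumes `blast.sh` writes BLAST tabular output (`-outfmt 6` with the 12 default columns), which these notes do not confirm. The function name `top_hits` and the sample rows are hypothetical and only illustrate the idea of keeping the highest bit score per query.

```python
import csv
import io

def top_hits(handle):
    """Return {query_id: (subject_id, bitscore)}, keeping the best bit score per query.

    Assumes BLAST -outfmt 6 default columns: column 1 is the query ID,
    column 2 the subject ID, and column 12 the bit score.
    """
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query, subject, bitscore = row[0], row[1], float(row[11])
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return best

# Made-up rows for illustration only (not real results).
example = io.StringIO(
    "q1\tsA\t98.0\t500\t10\t0\t1\t500\t1\t500\t0.0\t900\n"
    "q1\tsB\t80.0\t400\t80\t0\t1\t400\t1\t400\t1e-50\t300\n"
    "q2\tsC\t90.0\t200\t20\t0\t1\t200\t1\t200\t1e-80\t350\n"
)
hits = top_hits(example)
for query, (subject, score) in hits.items():
    print(query, subject, score)
```

Picking the top hit by bit score alone is a simplification; the manual check in step c) would still be needed to confirm that each hit is a plausible homolog.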
- Align transcript sequences
mafft --phylipout --nuc input > output
- Generate tree from alignment using conserved genes (using RAxML)
ls *.fa.phy | xargs -i raxmlHPC -f d -m GTRCAT -s {} -n {}.out -p 51
#Bootstrapping
ls *.phy | xargs -P 6 -i raxmlHPC -f d -m GTRCAT -s {} -N 1000 -b 51 -p 51 -n {}.bt
#Consensus tree (-f b draws bootstrap support from -z onto the best tree given with -t)
ls *.phy | xargs -P 6 -i raxmlHPC -f b -m GTRCAT -s {} -n {}.cons -t RAxML_bestTree.{}.out -z RAxML_bootstrap.{}.bt
- Infer species trees from bootstrap trees generated from conserved genes (consensus tree is a good way to go)
a) Don't forget to test for the nucleotide substitution model using jModelTest
- Determine molecular traits of each group: ORF length, GC content, # of exons, transcript length
- Calculate phylogenetic signal with the R package "phylosignal"
- Conserved region for IPS2 found but not IPS1
a) Possible incomplete genome assembly?
- Very short transcript (180 bp) mapped to a region with a 22 bp conserved region, instead of the usual ~520 bp
a) Possible incomplete transcriptome assembly?
- How do we know that these housekeeping genes don't interact with lncRNAs?
- Are there families of genes close together that modulate Pi homeostasis such as the PHO regulon in bacteria? (The answer is likely: no)
- *Update* Now that I can use the Phytozome API, I can find homologs (already annotated)
- What about genes that don't have homologs across all the species?
- At least I have to build a topology from conserved genes among the
- PSR1 (conserved MYB transcription factor)
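For the question above about genes that lack homologs in some species, one option is to separate genes found in every species (candidates for building the topology) from those with gaps, which are tracked separately. The sketch below assumes homolog lookups have already been collected into a dictionary. The function name `split_by_coverage`, the data layout, and the example entries are hypothetical placeholders, not results.

```python
def split_by_coverage(homologs, species):
    """Partition genes by whether a homolog was found in every species.

    homologs: {gene: {species: homolog_id}} (hypothetical layout).
    Returns (complete, missing), where missing maps each incomplete gene
    to the sorted list of species in which no homolog was found.
    """
    complete, missing = [], {}
    for gene, found in homologs.items():
        absent = sorted(s for s in species if not found.get(s))
        if absent:
            missing[gene] = absent
        else:
            complete.append(gene)
    return complete, missing

# Placeholder gene names and IDs, for illustration only.
species = ["A. thaliana", "O. sativa", "Z. mays"]
homologs = {
    "geneA": {"A. thaliana": "id1", "O. sativa": "id2", "Z. mays": "id3"},
    "geneB": {"A. thaliana": "id4", "Z. mays": "id5"},
}
complete, missing = split_by_coverage(homologs, species)
print("complete:", complete)
print("missing:", missing)
```

Whether genes in `missing` should be dropped or analyzed on a pruned tree is left open here, as it is in the notes.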
How to use the Python API script
phytozome.py -h
This is a version made for phylogenetic signal analysis
Functions to implement:
- Finding all the homologous genes (as many as possible)
- Find the transcript sequences (required for alignment)
- Possible? Number of exons
- Counting GC%
- ORF length (probably through counting the CDS)
- Transcript length (relatively EZ)
- Will have to keep track of ones that don't have any homologs...(which is a lot)
- GO analysis to determine what role the transcript has
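Three of the traits listed above (GC%, ORF length, transcript length) can be computed directly from transcript FASTA records. The following is a minimal sketch using only the standard library, not the contents of `phytozome.py`. All function names are hypothetical. Note that ORF length is estimated here by scanning the forward strand for the longest ATG-to-stop stretch, which is a stand-in for counting annotated CDS and may disagree with it.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)

def gc_percent(seq):
    """GC content as a percentage of A/C/G/T bases (ambiguous bases ignored)."""
    acgt = sum(seq.count(base) for base in "ACGT")
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0

STOPS = {"TAA", "TAG", "TGA"}

def longest_orf(seq):
    """Length in nt of the longest forward-strand ATG..stop ORF (stop included)."""
    best = 0
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                best = max(best, i + 3 - start)
                start = None
    return best

# Toy record for illustration only (not a real transcript).
records = list(read_fasta([">t1 toy", "ccatgaa", "atagggg"[:6]]))
for name, seq in records:
    print(name, len(seq), round(gc_percent(seq), 2), longest_orf(seq))
```

Exon counts are not covered by this sketch, since they require the annotation (for example a GFF file) rather than the transcript sequence.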