- PHO2
- PHR1
- WRKY75
- SIZ1
- PHT1
- RNS1
- SULTR2;1
- PHOX (C. reinhardtii, putative phosphatase)
- EF1a
- EF1b
- EF4a
- Cyclophilin
- Hsp90-1
- Actin
- Tubulin
- Glyceraldehyde-3-phosphate dehydrogenase
- SuCoA (succinyl coenzyme A)
- FtsH protease
- 18S (both plants and algae)
- 25S rRNA (Plants only)
- 28S rRNA (C. reinhardtii)
To add:
- Other housekeeping genes
- ???SAC1 (C. reinhardtii)
- ???SLT (C. reinhardtii)
- ???LPB1 (C. reinhardtii)
- Eutrema salsugineum
- Arabidopsis thaliana
- Oryza sativa
- Brassica rapa
- Capsella rubella
- Brassica napus
- Medicago truncatula
- Solanum lycopersicum
- Zea mays (Outgroup)
To add:
- Amborella trichopoda
- Chlamydomonas reinhardtii (new outgroup)
- Solanum tuberosum
- Selaginella moellendorffii
Download transcriptomes from SRA
Download genomes from NCBI or JGI
Align transcripts to genomes
a) Get annotation files from Ensembl
b) Get transcript files from Ensembl
Create annotation with existing annotation (RSEM?)
Extract transcript sequences
a) Blast transcript sequences known in E. salsugineum with other transcripts
ls -1 transcripts > config && blast.sh $PWD/config
b) Parse top hit transcript
c) Manually go through each hit D:
i) Alternatively, somehow get ThaleMine's Phytozome homologs working (API?)
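The "parse top hit" step above could be sketched as follows. This is only a sketch: it assumes `blast.sh` writes standard BLAST tabular output (`-outfmt 6`, 12 tab-separated columns with the bit score last), which these notes do not confirm. The function name `top_hits` and the sequence IDs in the usage example are hypothetical.

```python
import csv
import io


def top_hits(blast_tsv):
    """Return {query_id: (subject_id, bitscore)} for the best hit per query.

    Assumes BLAST tabular format (-outfmt 6): qseqid, sseqid, pident,
    length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore.
    """
    best = {}
    for row in csv.reader(blast_tsv, delimiter="\t"):
        # Skip blank lines, comment lines, and rows that are too short.
        if len(row) < 12 or row[0].startswith("#"):
            continue
        query, subject, bitscore = row[0], row[1], float(row[11])
        # Keep only the highest-scoring subject for each query.
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return best


if __name__ == "__main__":
    # Made-up rows for illustration only; not real BLAST results.
    demo = io.StringIO(
        "query1\tsubjA\t90.0\t100\t5\t0\t1\t100\t1\t100\t1e-20\t150\n"
        "query1\tsubjB\t95.0\t100\t2\t0\t1\t100\t1\t100\t1e-30\t200\n"
    )
    for query, (subject, score) in top_hits(demo).items():
        print(query, subject, score)
```

Ranking by bit score rather than by row order avoids relying on how the hits happen to be sorted in the file. The manual check in step c) is still needed for ambiguous hits.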
Align transcript sequences
mafft --phylipout --nuc input > output
Generate a tree from the alignment of conserved genes using RAxML
# Best ML tree
ls *.fa.phy | xargs -I {} raxmlHPC -f d -m GTRCAT -s {} -n {}.out -p 51
# Bootstrapping
ls *.phy | xargs -P 6 -I {} raxmlHPC -f d -m GTRCAT -s {} -N 1000 -b 51 -p 51 -n {}.bt
# Consensus tree (bootstrap support drawn on the best tree; -f b needs both -t and -z)
ls *.phy | xargs -P 6 -I {} raxmlHPC -f b -m GTRCAT -s {} -n {}.cons -t RAxML_bestTree.{}.out -z RAxML_bootstrap.{}.bt
Infer species trees from bootstrap trees generated from conserved genes (consensus tree is a good way to go)
a) Don't forget to test for the nucleotide substitution model using
Determine molecular traits of each group: ORF length, GC content, # of exons, transcript length
Calculate phylogenetic signal with the R package "phylosignal"
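The sequence-based traits listed above (GC content, transcript length, ORF length) could be computed roughly as follows. This is a sketch under stated assumptions, not the pipeline's actual code. All function names are hypothetical, the ORF search only scans the forward strand for ATG-to-stop reading frames, and GC% is taken over unambiguous A/C/G/T bases only.

```python
import io

STOP_CODONS = {"TAA", "TAG", "TGA"}


def read_fasta(handle):
    """Parse FASTA text into {sequence_id: sequence}."""
    records, name = {}, None
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            name = line[1:].split()[0]  # ID is the first word of the header
            records[name] = []
        elif name is not None and line:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def gc_content(seq):
    """GC percentage over unambiguous bases (N and other codes ignored)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    total = gc + seq.count("A") + seq.count("T")
    return 100.0 * gc / total if total else 0.0


def longest_orf_length(seq):
    """Length (nt, stop codon included) of the longest forward-strand ORF."""
    seq = seq.upper()
    best = 0
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)
                start = None
    return best


if __name__ == "__main__":
    # Toy sequence for illustration only.
    demo = read_fasta(io.StringIO(">toy\nCCATGAAATAGGG\n"))
    for name, seq in demo.items():
        print(name, len(seq), round(gc_content(seq), 1), longest_orf_length(seq))
```

Transcript length is simply `len(seq)`. For annotated genes, summing CDS features from the annotation is likely more reliable than this naive ORF scan, which matters mostly for unannotated transcripts.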
Conserved region for IPS2 found but not IPS1
a) Possible incomplete genome assembly?
Very short transcript (180 bp) mapped to a region with only a 22 bp conserved region, instead of the usual ~520 bp
a) Possible incomplete transcriptome assembly?
- How do we know that these housekeeping genes don't interact with lncRNAs?
- Are there families of genes close together that modulate Pi homeostasis such as the PHO regulon in bacteria? (The answer is likely: no)
- *Update* Now that I can use Phytozome API, I can find homologs (already annotated)
- What about genes that don't have homologs across all the species?
- At least I have to build a topology from conserved genes among the
- PSR1 (conserved MYB transcription factor)
How to use the Python API script
phytozome.py -h
This is a version made for phylogenetic signal analysis
Functions to implement:
- Finding all the homologous genes (as much as possible)
- Find the transcript sequences (required for alignment)
- Possible? Number of exons
- Counting GC%
- ORF length (probably through counting the CDS)
- Transcript length (relatively EZ)
- Will have to keep track of ones that don't have any homologs...(which is a lot)
- GO analysis to determine what role the transcript has
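For the exon-count and CDS-based ORF length items above, one possible approach is to tally features from a GFF3 annotation. The sketch below assumes the annotation is standard 9-column GFF3 in which `exon` and `CDS` rows carry a `Parent=` attribute naming the transcript. The function name `feature_summary` is hypothetical and not part of `phytozome.py`.

```python
from collections import Counter


def feature_summary(gff_lines):
    """Return (exon_counts, cds_lengths), both keyed by parent transcript ID."""
    exon_counts, cds_lengths = Counter(), Counter()
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] not in ("exon", "CDS"):
            continue
        # Column 9 holds semicolon-separated key=value attributes.
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        # GFF3 coordinates are 1-based and inclusive.
        length = int(cols[4]) - int(cols[3]) + 1
        # A feature shared by several transcripts lists them comma-separated.
        for parent in attrs.get("Parent", "").split(","):
            if not parent:
                continue
            if cols[2] == "exon":
                exon_counts[parent] += 1
            else:
                cds_lengths[parent] += length
    return exon_counts, cds_lengths


if __name__ == "__main__":
    # Made-up annotation rows for illustration only.
    demo = [
        "chr1\tsrc\texon\t1\t100\t.\t+\t.\tID=e1;Parent=t1\n",
        "chr1\tsrc\texon\t201\t300\t.\t+\t.\tID=e2;Parent=t1\n",
        "chr1\tsrc\tCDS\t51\t100\t.\t+\t0\tID=c1;Parent=t1\n",
    ]
    exons, cds = feature_summary(demo)
    print(dict(exons), dict(cds))
```

Transcripts that never appear as a `Parent` simply get no entry, which gives one way to flag genes lacking annotation alongside those lacking homologs.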