“Bioinformatics” homework for
undergraduate (2016)
#1
How many nucleotide sequences from maize (Zea mays) have been stored in a public DNA database (such as GenBank)? How many Waxy (granule-bound starch synthase) gene sequences from maize are in the database?
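Answer: On Sunday, 6 November 2016, I visited NCBI (https://www.ncbi.nlm.nih.gov/) and searched the Nucleotide database for zea mays, with Species set to Plants and Molecule types set to genomic DNA/RNA. The result showed that 446964 Zea mays nucleotide sequences were stored in NCBI. Entering (zea mays[Organism]) AND waxy[Gene name] in the search box returned 175 Waxy gene sequences from maize. The steps and results are shown in the figure below: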
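The same two queries can also be issued programmatically. The following is a minimal sketch, assuming the NCBI E-utilities ESearch endpoint and the `nuccore` database name for Nucleotide; the helper names `esearch_url` and `fetch_count` are hypothetical. It does not reproduce the Species and Molecule types filters chosen in the web interface, and the counts returned today will differ from those recorded on 6 November 2016.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# ESearch endpoint of the NCBI E-utilities.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(term, db="nuccore"):
    """Build an ESearch URL that asks only for the number of matching records."""
    params = {"db": db, "term": term, "rettype": "count", "retmode": "json"}
    return ESEARCH + "?" + urlencode(params)


def fetch_count(term, db="nuccore"):
    """Send the query to NCBI and return the hit count (requires network access)."""
    with urlopen(esearch_url(term, db)) as response:
        payload = json.load(response)
    return int(payload["esearchresult"]["count"])


# Queries taken from the answer above; only the URLs are printed here,
# so the example runs without contacting NCBI.
for query in ("zea mays[Organism]", "(zea mays[Organism]) AND waxy[Gene name]"):
    print(esearch_url(query))
```

Calling `fetch_count(query)` on either string would return the current count, which is expected to be larger than the 2016 figures as the database grows.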
#2
A sequence was generated by a suppression subtractive hybridization (SSH) experiment. Please find the best hit(s) of the unknown sequence in the public database and predict its potential function.
>an unknown sequence
CCTCGGAGATCTTCATGGGGGGCAAGAGCACCATCGTGCTgCACAACACCTGCGAGGACTCGCTCCTCGCTGCACCCATCATTCTTGATCTGGTGCTCCTGGCGGAGCTCAGCACCAGGATTCAGCTGAAGGCCGAGGGAGAGGTAAGAGTCTGACGAGATATGTTGCTAGTCTACTCTGTAGTCGAGATATACTTTGGGAGCCAAACTGAAGATTTCGCTGCTCCACTTGCATTTGTGCAGGACAAGTTCCATTCCTTCCATCCGGTTGCCACCATCCTGAGCTACCTCACCAAGGCACCCCTGGTAAGAAACAATTCTCGACTGTTTGCTCTAAATAACCTATAGATAAATAAAGACGATTAACTGACGTGCCACTGAATTCCTCTGTTAACAGGTTCCTCCTGGCACGCCGGTGGTGAACGCCCTGGCGAAGCAAAGGGCGATGCTGGAGAACATCATGAGGGCGTGTGTCGGCCTGGCGCCCGAAAACAACATGATCCTGGAGTACAAGTGAGGAGCGTGGCCCAAGCTCGCGGAGCCGAGAGCGACCGTACGTACGTAGCAAGTGGCGAGGGGCGACGGGAGGGCAGGACGAAGAAGAAGGCGAGATCGGCTGTGGAATTATTTGGCGGCTTGTCTTTAGTTTCCTTTGCGAATCTTTCCCTGGTTAAGTTTACCCCAGTGAGTGTGTGTCCTTGCGAGAAAAG
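Answer: I ran BLAST at NCBI (http://blast.ncbi.nlm.nih.gov/Blast.cgi), selected Blastx, pasted the sequence above into the query box, and ran the search with default parameters. The best hit was Inositol-3-phosphate synthase [Dichanthelium oligosanthes]. I then repeated the search with the BLAST service at EMBL (http://www.ebi.ac.uk/Tools/sss/ncbiblast/), again selecting Blastx and the default parameters; the best hit was also Inositol-3-phosphate synthase. Since the two searches agree, the unknown sequence probably encodes a protein with the same function as Inositol-3-phosphate synthase.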
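Blastx compares the six-frame conceptual translation of a nucleotide query against a protein database. The sketch below illustrates only that translation step, assuming the standard genetic code; it is not the BLAST implementation, and the function names are hypothetical. Codons containing characters other than A, C, G, T are rendered as X.

```python
from itertools import product

# Standard genetic code, listed in TCAG order for the first, second and third base.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string (case-insensitive)."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from position 0; '*' marks a stop codon."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Return the three forward and three reverse-strand translations."""
    frames = {}
    for strand, seq in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


# Short made-up example; the unknown sequence above can be passed in the same way.
for name, protein in six_frames("ATGGCCTAA").items():
    print(name, protein)
```

Because the input is upper-cased first, the lowercase g in the query sequence is handled like any other base.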
#3
Use the dynamic programming method (the Needleman-Wunsch algorithm) to perform a global alignment of the sequences:
P1=HEAGAWGHEP
P2=EPAWHEAEAG
Scoring system: BLOSUM50 scoring matrix with gap penalty 8.
BLOSUM50 (partial)
      A    E    G    H    P    W
A     5   -1    0   -2   -1   -3
E          6   -3    0   -1   -3
G               8   -2   -2   -3
H                   10   -2   -3
P                        10   -4
W                             15
Answer: The full dynamic programming computation is shown in the table below; in the original figure, the cells on the path to the optimal alignment are highlighted in yellow.
P2          E    P    A    W    H    E    A    E    A    G
P1    0   -8  -16  -24  -32  -40  -48  -56  -64  -72  -80
H    -8    0   -8  -16  -24  -22  -30  -38  -46  -54  -62
E   -16   -2   -1   -9  -17  -24  -16  -24  -32  -40  -48
A   -24  -10   -3    4   -4  -12  -20  -11  -19  -27  -35
G   -32  -18  -11   -3    1   -6  -14  -19  -14  -19  -19
A   -40  -26  -19   -6   -6   -1   -7   -9  -17   -9  -17
W   -48  -34  -27  -14    9    1   -4  -10  -12  -17  -12
G   -56  -42  -35  -22    1    7   -1   -4  -12  -12   -9
H   -64  -50  -43  -30   -7   11    7   -1   -4  -12  -14
E   -72  -58  -51  -38  -15    3   17    9    5   -3  -11
P   -80  -66  -48  -46  -23   -5    9   16    8    4   -4
The resulting optimal alignment is shown below, where an underscore denotes a gap.
P1: HEAGAWGH_EP_
P2: _EP_AWHEAEAG
Score: -8 + 6 - 1 - 8 + 5 + 15 - 2 + 0 - 8 + 6 - 1 - 8 = -4
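The hand calculation can be cross-checked with a short program. The following is a minimal sketch of the Needleman-Wunsch recurrence with a linear gap penalty, using only the partial BLOSUM50 values given in the question. The names `SCORE` and `needleman_wunsch` are illustrative, and when several predecessors tie, the traceback prefers the diagonal move, then a gap in P2, then a gap in P1, which is an arbitrary choice.

```python
# Partial BLOSUM50 scores from the question (upper triangle of a symmetric matrix).
RESIDUES = "AEGHPW"
UPPER = [
    [5, -1, 0, -2, -1, -3],
    [6, -3, 0, -1, -3],
    [8, -2, -2, -3],
    [10, -2, -3],
    [10, -4],
    [15],
]
SCORE = {}
for i, row in enumerate(UPPER):
    for k, value in enumerate(row):
        a, b = RESIDUES[i], RESIDUES[i + k]
        SCORE[a, b] = SCORE[b, a] = value


def needleman_wunsch(p1, p2, gap=8):
    """Return the score matrix F and one optimal global alignment of p1 and p2."""
    n, m = len(p1), len(p2)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = -gap * i
    for j in range(1, m + 1):
        F[0][j] = -gap * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + SCORE[p1[i - 1], p2[j - 1]],  # align two residues
                F[i - 1][j] - gap,                              # gap in p2
                F[i][j - 1] - gap,                              # gap in p1
            )
    # Trace back from the bottom-right cell to the origin.
    a1, a2 = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + SCORE[p1[i - 1], p2[j - 1]]:
            a1.append(p1[i - 1]); a2.append(p2[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] - gap:
            a1.append(p1[i - 1]); a2.append("_"); i -= 1
        else:
            a1.append("_"); a2.append(p2[j - 1]); j -= 1
    return F, "".join(reversed(a1)), "".join(reversed(a2))


F, aligned1, aligned2 = needleman_wunsch("HEAGAWGHEP", "EPAWHEAEAG")
print("P1:", aligned1)   # P1: HEAGAWGH_EP_
print("P2:", aligned2)   # P2: _EP_AWHEAEAG
print("Score:", F[-1][-1])  # Score: -4
```

Printing the rows of `F` gives every intermediate cell, which makes it easy to compare against a hand-filled table.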
#4
Please find genes in a genomic segment of bamboo (Download).
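Answer: I opened http://www.softberry.com/berry.phtml?topic=fgenesh&group=programs&subgroup=gfind , set Organism to Monocot plants (there is no bamboo-specific option), and ran the online program. The output, shown below, lists the predicted genes and the sequences of the proteins they encode.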
FGENESH 2.6 Prediction of potential genes in Monocot genomic DNA
Time : Sun Nov 6 04:05:56 2016
Seq name: test sequence
Length of sequence: 49600
Number of predicted genes 10: in +chain 2, in -chain 8.
Number of predicted exons 22: in +chain 10, in -chain 12.
Positions of predicted genes and exons: Variant 1 from 1, Score:299.051538
   G Str    Feature  Start       End    Score         ORF           Len
   1 -      PolA    6777               0.44
   1 -    1 CDSo    6884 -    7057     4.33    6884 -    7057    174
   1 -      TSS     8311              -1.78
   2 +      TSS    17591              -4.18
   2 +    1 CDSf   17742 -   17811    16.41   17742 -   17810     69
   2 +    2 CDSl   19834 -   20792    87.86   19836 -   20792    957
   2 +      PolA   21649               0.44
   3 +      TSS    21801              -7.58
   3 +    1 CDSf   22009 -   22085    19.33   22009 -   22083     75
   3 +    2 CDSi   22583 -   22652     3.65   22584 -   22652     69
   3 +    3 CDSi   23070 -   23145     6.56   23070 -   23144     75
   3 +    4 CDSi   23236 -   23353    18.37   23238 -   23351    114
   3 +    5 CDSi   24144 -   24233     8.37   24145 -   24231     87
   3 +    6 CDSi   24306 -   24381     6.47   24307 -   24381     75
   3 +    7 CDSi   24523 -   24650     5.26   24523 -   24648    126
   3 +    8 CDSl   24731 -   24800     8.36   24732 -   24800     69
   3 +      PolA   25006               0.44
   4 -      PolA   26777              -1.06
   4 -    1 CDSl   27135 -   28019    59.89   27135 -   28019    885
   4 -    2 CDSf   28097 -   28504    40.08   28097 -   28504    408
   4 -      TSS    28623              -6.38
   5 -      PolA   30964               0.44
   5 -    1 CDSl   30993 -   31177     8.88   30993 -   31175    183
   5 -    2 CDSi   31212 -   31431    -7.39   31213 -   31431    219
   5 -    3 CDSf   31504 -   31548     8.15   31504 -   31548     45
   5 -      TSS    31608              -1.28
   6 -      PolA   33364               0.44
   6 -    1 CDSo   33766 -   33954     4.42   33766 -   33954    189
   6 -      TSS    34021              -3.18
   7 -      PolA   34094              -1.06
   7 -    1 CDSo   34700 -   34975    17.27   34700 -   34975    276
   7 -      TSS    35444              -6.08
   8 -      PolA   35848               0.44
   8 -    1 CDSo   36075 -   36458    20.75   36075 -   36458    384
   8 -      TSS    37019              -5.38
   9 -      PolA   40341              -1.06
   9 -    1 CDSo   40879 -   41067     9.51   40879 -   41067    189
   9 -      TSS    41777              -5.68
  10 -      PolA   43349              -1.06