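I. Background
While designing a bioinformatics lab exercise for undergraduates this week, I had the idea of using Galaxy to compute some basic facts about the human genome: the gene density of each chromosome; the average length of features such as exons and introns; and the SNP density in different feature regions such as UTRs, coding regions, and introns.
II. Tools
Platform: Linux (Ubuntu 12.04, AMD64).
UCSC Table Browser: downloading genome data to the local machine.
Galaxy: processing genome data online.
BEDTools (v2.16.2): processing genome data locally.
R (v2.15.1): plotting figures.
Other: Vim (v7.3.429).
III. Databases
Human genome: hg19
dbSNP: build 135
IV. Results
1. Number and density of genes on each chromosome.
- Data table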
chromosome | length.bp | length.100kb | geneNumber | geneDensity.numberPer100Mb |
---|---|---|---|---|
chr1 | 249250621 | 2492.50621 | 4177 | 1675.82330718 |
chr2 | 243199373 | 2431.99373 | 2563 | 1053.86784858 |
chr3 | 198022430 | 1980.2243 | 2251 | 1136.73991376 |
chr4 | 191154276 | 1911.54276 | 1592 | 832.835149343 |
chr5 | 180915260 | 1809.1526 | 1739 | 961.223503203 |
chr6 | 171115067 | 1711.15067 | 2071 | 1210.29669468 |
chr7 | 159138663 | 1591.38663 | 1940 | 1219.06264853 |
chrX | 155270560 | 1552.7056 | 2083 | 1341.5292635 |
chr8 | 146364022 | 1463.64022 | 1437 | 981.79865541 |
chr9 | 141213431 | 1412.13431 | 1573 | 1113.91670669 |
chr10 | 135534747 | 1355.34747 | 1749 | 1290.4439922 |
chr11 | 135006516 | 1350.06516 | 2485 | 1840.65189861 |
chr12 | 133851895 | 1338.51895 | 2102 | 1570.39241021 |
chr13 | 115169878 | 1151.69878 | 711 | 617.348921738 |
chr14 | 107349540 | 1073.4954 | 1334 | 1242.66950748 |
chr15 | 102531392 | 1025.31392 | 1357 | 1323.49710028 |
chr16 | 90354753 | 903.54753 | 1600 | 1770.79782399 |
chr17 | 81195210 | 811.9521 | 2331 | 2870.85900757 |
chr18 | 78077248 | 780.77248 | 599 | 767.188925511 |
chr20 | 63025520 | 630.2552 | 1167 | 1851.63089491 |
chrY | 59373566 | 593.73566 | 347 | 584.435167664 |
chr19 | 59128983 | 591.28983 | 2716 | 4593.34807095 |
chr22 | 51304566 | 513.04566 | 924 | 1801.00929028 |
chr21 | 48129895 | 481.29895 | 534 | 1109.49753786 |
- Bar chart
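The density column can be reproduced from the other two: divide the gene count by the chromosome length in bp and scale the result to 100 Mb. A minimal Python sketch, using three rows copied from the table above; the function name `genes_per_100mb` is only illustrative and is not taken from the original scripts:

```python
# (chromosome, length in bp, gene count), copied from three rows of the table
rows = [
    ("chr1", 249250621, 4177),
    ("chr19", 59128983, 2716),
    ("chrY", 59373566, 347),
]


def genes_per_100mb(length_bp, gene_count):
    """Gene count normalised to a 100 Mb stretch of sequence."""
    return gene_count / length_bp * 1e8


density = {chrom: genes_per_100mb(length, n) for chrom, length, n in rows}

for chrom, value in density.items():
    print(f"{chrom}\t{value:.2f}")
```

With these three rows, chr19 comes out as the densest chromosome and chrY as the sparsest, in line with the table.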
2. Length statistics for different features of the human genome.
Features used: gene, intergenic, exon, intron, 5' UTR exon, coding exon, and 3' UTR exon.
- Basic statistics
feature | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
---|---|---|---|---|---|---|
gene | 20 | 6442 | 20250 | 56380 | 57220 | 2305000 |
intergenic | 1 | 4359 | 18160 | 90600 | 58490 | 31220000 |
exon | 2.0 | 93.0 | 133.0 | 307.4 | 199.0 | 91670.0 |
intron | 1 | 473 | 1516 | 6127 | 4228 | 1044000 |
utr5 | 1.0 | 63.0 | 114.0 | 203.3 | 204.0 | 37030.0 |
coding | 1.0 | 85.0 | 122.0 | 165.3 | 169.0 | 21690.0 |
utr3 | 1.0 | 128.0 | 395.0 | 989.5 | 1318.0 | 91670.0 |
- Box plot
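The statistics above follow the layout of R's `summary()` output. As a rough illustration of how such a six-number summary is obtained from interval data, here is a Python sketch. The exon coordinates are made up for the example (0-based, half-open, as in BED files), and `summarize` is a hypothetical helper, not part of the original scripts. `method="inclusive"` is used because, to my understanding, it interpolates the same way as R's default quantile type 7.

```python
import statistics


def summarize(lengths):
    """Six-number summary: Min., 1st Qu., Median, Mean, 3rd Qu., Max."""
    # "inclusive" treats the data as the full population, like R's type 7
    q1, median, q3 = statistics.quantiles(lengths, n=4, method="inclusive")
    return {
        "Min.": min(lengths),
        "1st Qu.": q1,
        "Median": median,
        "Mean": statistics.mean(lengths),
        "3rd Qu.": q3,
        "Max.": max(lengths),
    }


# Hypothetical exon intervals (chrom, start, end); not real annotation data
exons = [
    ("chr1", 100, 102),
    ("chr1", 500, 593),
    ("chr1", 900, 1033),
    ("chr2", 50, 249),
    ("chr2", 1000, 2000),
]

# With half-open coordinates the feature length is simply end - start
lengths = [end - start for _, start, end in exons]
summary = summarize(lengths)

for name, value in summary.items():
    print(f"{name}\t{value}")
```

A single long exon pulls the mean well above the median here, which is the same skew visible in the real statistics.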
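3. SNP density on each chromosome and in different features.
Features used: gene, intergenic, 200 bp upstream of genes, 200 bp downstream of genes, exon, intron, 5' UTR exon, coding exon, and 3' UTR exon. Densities are given as SNPs per kb.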
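The 200 bp upstream and downstream regions can be derived from the gene coordinates and strand. The sketch below shows one plausible way to do this in Python, assuming 0-based half-open coordinates and strand-aware flanks. The original analysis may have defined the flanks differently (for example with a BEDTools command), and `Gene` and `flank` are hypothetical names introduced only for this example.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    chrom: str
    start: int   # 0-based, inclusive
    end: int     # exclusive
    strand: str  # "+" or "-"


def flank(gene, size=200, upstream=True):
    """Return (chrom, start, end) of the flank on the requested side."""
    # Upstream is to the left on the plus strand and to the right on the
    # minus strand; downstream is the opposite side.
    left_side = upstream == (gene.strand == "+")
    if left_side:
        # Clamp at 0 so a gene near the chromosome start stays valid
        return (gene.chrom, max(0, gene.start - size), gene.start)
    # The right end is not clamped: chromosome lengths are not known here
    return (gene.chrom, gene.end, gene.end + size)


plus_gene = Gene("chr1", 1000, 5000, "+")
minus_gene = Gene("chr1", 1000, 5000, "-")

print(flank(plus_gene))                   # upstream of a plus-strand gene
print(flank(plus_gene, upstream=False))   # downstream of a plus-strand gene
print(flank(minus_gene))                  # upstream of a minus-strand gene
```

If every gene contributes exactly one 200 bp flank per side, the total flank length per chromosome is 200 times the number of genes.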
- Data table
chromosome | length.feature | number.snp | density.snp | density.snp.deviationFromMean | feature |
---|---|---|---|---|---|
chr1 | 213911052 | 761239 | 3.55867 | -0.102761224115423 | gene |
chr10 | 149514361 | 587986 | 3.93264 | 0.271208775884577 | gene |
chr11 | 118226679 | 446851 | 3.77961 | 0.118178775884576 | gene |
chr12 | 125921633 | 467697 | 3.71419 | 0.0527587758845764 | gene |
chr13 | 57288717 | 219129 | 3.82499 | 0.163558775884577 | gene |
chr14 | 76719243 | 291346 | 3.79756 | 0.136128775884576 | gene |
chr15 | 73034780 | 266568 | 3.64988 | -0.0115512241154234 | gene |
chr16 | 66082724 | 313525 | 4.74443 | 1.08299877588458 | gene |
chr17 | 82934662 | 305530 | 3.68398 | 0.0225487758845766 | gene |
chr18 | 59512691 | 226227 | 3.80132 | 0.139888775884577 | gene |
chr19 | 52373248 | 220849 | 4.21683 | 0.555398775884576 | gene |
chr2 | 193485650 | 695370 | 3.59391 | -0.0675212241154233 | gene |
chr20 | 52658205 | 202952 | 3.85414 | 0.192708775884577 | gene |
chr21 | 31028365 | 144481 | 4.65642 | 0.994988775884576 | gene |
chr22 | 37155901 | 152010 | 4.09114 | 0.429708775884577 | gene |
chr3 | 175007776 | 645859 | 3.69046 | 0.0290287758845764 | gene |
chr4 | 128246414 | 491009 | 3.82864 | 0.167208775884577 | gene |
chr5 | 125381870 | 433527 | 3.45765 | -0.203781224115423 | gene |
chr6 | 128486960 | 514399 | 4.00351 | 0.342078775884577 | gene |
chr7 | 145574888 | 541402 | 3.71906 | 0.0576287758845764 | gene |
chr8 | 108462645 | 431213 | 3.97568 | 0.314248775884577 | gene |
chr9 | 92206982 | 359917 | 3.90336 | 0.241928775884577 | gene |
chrX | 118936354 | 206613 | 1.73717 | -1.92426122411542 | gene |
chrY | 26795282 | 4338 | 0.161894 | -3.49953722411542 | gene |
chr1 | 142348117 | 480587 | 3.37614 | -0.233800339848831 | intergenic |
chr10 | 73986912 | 313907 | 4.24274 | 0.632799660151169 | intergenic |
chr11 | 75974507 | 318865 | 4.197 | 0.587059660151169 | intergenic |
chr12 | 74944890 | 308142 | 4.11158 | 0.501639660151169 | intergenic |
chr13 | 80027662 | 264360 | 3.30336 | -0.306580339848831 | intergenic |
chr14 | 71735267 | 228874 | 3.19054 | -0.419400339848831 | intergenic |
chr15 | 61859579 | 172907 | 2.79515 | -0.814790339848831 | intergenic |
chr16 | 54941077 | 191431 | 3.4843 | -0.125640339848831 | intergenic |
chr17 | 38206463 | 142810 | 3.73785 | 0.127909660151169 | intergenic |
chr18 | 48128078 | 198986 | 4.13451 | 0.524569660151168 | intergenic |
chr19 | 30201709 | 124720 | 4.12957 | 0.519629660151169 | intergenic |
chr2 | 143672524 | 560801 | 3.90333 | 0.293389660151169 | intergenic |
chr20 | 35578133 | 143678 | 4.03838 | 0.428439660151169 | intergenic |
chr21 | 34366932 | 100481 | 2.92377 | -0.686170339848831 | intergenic |
chr22 | 32785971 | 78250 | 2.38669 | -1.22325033984883 | intergenic |
chr3 | 109260742 | 449832 | 4.11705 | 0.507109660151169 | intergenic |
chr4 | 123988589 | 537758 | 4.33716 | 0.727219660151169 | intergenic |
chr5 | 115934956 | 470533 | 4.05859 | 0.448649660151168 | intergenic |
chr6 | 101899062 | 465443 | 4.56769 | 0.957749660151169 | intergenic |
chr7 | 86317978 | 357387 | 4.14035 | 0.530409660151169 | intergenic |
chr8 | 89280197 | 376829 | 4.22075 | 0.610809660151169 | intergenic |
chr9 | 91692703 | 283428 | 3.09106 | -0.518880339848831 | intergenic |
chrX | 106492843 | 205906 | 1.93352 | -1.67642033984883 | intergenic |
chrY | 54878730 | 5371 | 0.0978703 | -3.51207003984883 | intergenic |
chr1 | 835400 | 3463 | 4.14532 | 0.0605169074655709 | up200 |
chr10 | 350800 | 1448 | 4.12771 | 0.0429069074655715 | up200 |
chr11 | 497000 | 2321 | 4.67002 | 0.585216907465571 | up200 |
chr12 | 420400 | 1714 | 4.07707 | -0.007733092534429 | up200 |
chr13 | 142200 | 569 | 4.00141 | -0.0833930925344291 | up200 |
chr14 | 266800 | 1024 | 3.83808 | -0.246723092534429 | up200 |
chr15 | 271400 | 1090 | 4.01621 | -0.0685930925344289 | up200 |
chr16 | 320000 | 1345 | 4.20312 | 0.118316907465571 | up200 |
chr17 | 466200 | 1859 | 3.98756 | -0.0972430925344288 | up200 |
chr18 | 119800 | 528 | 4.40735 | 0.322546907465571 | up200 |
chr19 | 543200 | 2544 | 4.68336 | 0.598556907465571 | up200 |
chr2 | 512600 | 2127 | 4.14943 | 0.0646269074655708 | up200 |
chr20 | 233400 | 947 | 4.05741 | -0.027393092534429 | up200 |
chr21 | 107000 | 567 | 5.29907 | 1.21426690746557 | up200 |
chr22 | 184800 | 951 | 5.1461 | 1.06129690746557 | up200 |
chr3 | 450200 | 1750 | 3.88716 | -0.197643092534429 | up200 |
chr4 | 318600 | 1244 | 3.90458 | -0.180223092534429 | up200 |
chr5 | 348000 | 1291 | 3.70977 | -0.375033092534429 | up200 |
chr6 | 414200 | 2542 | 6.13713 | 2.05232690746557 | up200 |
chr7 | 388000 | 1497 | 3.85825 | -0.226553092534429 | up200 |
chr8 | 287400 | 1174 | 4.0849 | 9.69074655712276e-05 | up200 |
chr9 | 314600 | 1273 | 4.04641 | -0.0383930925344291 | up200 |
chrX | 416600 | 536 | 1.28661 | -2.79819309253443 | up200 |
chrY | 69400 | 10 | 0.144092 | -3.94071109253443 | up200 |
chr1 | 835400 | 3155 | 3.77663 | 0.0788769195457841 | down200 |
chr10 | 350800 | 1355 | 3.8626 | 0.164846919545784 | down200 |
chr11 | 497000 | 1880 | 3.7827 | 0.0849469195457844 | down200 |
chr12 | 420400 | 1497 | 3.56089 | -0.136863080454216 | down200 |
chr13 | 142200 | 492 | 3.45992 | -0.237833080454216 | down200 |
chr14 | 266800 | 995 | 3.72939 | 0.0316369195457842 | down200 |
chr15 | 271400 | 1014 | 3.73618 | 0.0384269195457843 | down200 |
chr16 | 320000 | 1247 | 3.89688 | 0.199126919545784 | down200 |
chr17 | 466200 | 1682 | 3.60789 | -0.089863080454216 | down200 |
chr18 | 119800 | 467 | 3.89816 | 0.200406919545784 | down200 |
chr19 | 543200 | 2269 | 4.1771 | 0.479346919545784 | down200 |
chr2 | 512600 | 1740 | 3.39446 | -0.303293080454216 | down200 |
chr20 | 233400 | 917 | 3.92888 | 0.231126919545784 | down200 |
chr21 | 107000 | 544 | 5.08411 | 1.38635691954578 | down200 |
chr22 | 184800 | 896 | 4.84848 | 1.15072691954578 | down200 |
chr3 | 450200 | 1587 | 3.5251 | -0.172653080454216 | down200 |
chr4 | 318600 | 1193 | 3.74451 | 0.0467569195457842 | down200 |
chr5 | 348000 | 1316 | 3.78161 | 0.0838569195457843 | down200 |
chr6 | 414200 | 2205 | 5.32352 | 1.62576691954578 | down200 |
chr7 | 388000 | 1382 | 3.56186 | -0.135893080454216 | down200 |
chr8 | 287400 | 1073 | 3.73347 | 0.0357169195457843 | down200 |
chr9 | 314600 | 1182 | 3.75715 | 0.0593969195457844 | down200 |
chrX | 416600 | 517 | 1.241 | -2.45675308045422 | down200 |
chrY | 69400 | 5 | 0.0720461 | -3.62570698045422 | down200 |
chr1 | 12660160 | 38512 | 3.04198 | 0.019269884972156 | exon |
chr10 | 5621527 | 17326 | 3.08208 | 0.0593698849721558 | exon |
chr11 | 7073434 | 21703 | 3.06824 | 0.0455298849721557 | exon |
chr12 | 6869799 | 19605 | 2.8538 | -0.168910115027844 | exon |
chr13 | 2212460 | 6447 | 2.91395 | -0.108760115027844 | exon |
chr14 | 3912101 | 12014 | 3.07098 | 0.0482698849721559 | exon |
chr15 | 4099640 | 11383 | 2.77659 | -0.246120115027844 | exon |
chr16 | 4539134 | 14600 | 3.21647 | 0.193759884972156 | exon |
chr17 | 6744232 | 20834 | 3.08916 | 0.066449884972156 | exon |
chr18 | 2177258 | 6513 | 2.99138 | -0.0313301150278442 | exon |
chr19 | 6462276 | 25009 | 3.87 | 0.847289884972156 | exon |
chr2 | 9173129 | 24809 | 2.70453 | -0.318180115027844 | exon |
chr20 | 3188751 | 10342 | 3.24328 | 0.220569884972156 | exon |
chr21 | 1396473 | 5449 | 3.90197 | 0.879259884972156 | exon |
chr22 | 2733051 | 10098 | 3.69477 | 0.672059884972156 | exon |
chr3 | 7504631 | 21055 | 2.8056 | -0.217110115027844 | exon |
chr4 | 5140247 | 15248 | 2.96639 | -0.0563201150278441 | exon |
chr5 | 5935961 | 17200 | 2.89759 | -0.125120115027844 | exon |
chr6 | 6264781 | 25638 | 4.0924 | 1.06968988497216 | exon |
chr7 | 5951915 | 18436 | 3.09749 | 0.0747798849721559 | exon |
chr8 | 4476053 | 14421 | 3.22181 | 0.199099884972156 | exon |
chr9 | 4822501 | 14241 | 2.95303 | -0.0696801150278441 | exon |
chrX | 5621251 | 8240 | 1.46587 | -1.55684011502784 | exon |
chrY | 968489 | 376 | 0.388234 | -2.63447611502784 | exon |
chr1 | 201250892 | 722730 | 3.59119 | -0.104952486392963 | intron |
chr10 | 143892834 | 570675 | 3.96597 | 0.269827513607037 | intron |
chr11 | 111153245 | 425149 | 3.82489 | 0.128747513607037 | intron |
chr12 | 119051834 | 448094 | 3.76386 | 0.0677175136070374 | intron |
chr13 | 55076257 | 212682 | 3.86159 | 0.165447513607037 | intron |
chr14 | 72807142 | 279332 | 3.8366 | 0.140457513607037 | intron |
chr15 | 68935140 | 255210 | 3.70218 | 0.00603751360703697 | intron |
chr16 | 61543590 | 298926 | 4.85714 | 1.16099751360704 | intron |
chr17 | 76190430 | 284701 | 3.7367 | 0.0405575136070371 | intron |
chr18 | 57335433 | 219714 | 3.83208 | 0.135937513607037 | intron |
chr19 | 45910972 | 195843 | 4.26571 | 0.569567513607038 | intron |
chr2 | 184312521 | 670562 | 3.63818 | -0.0579624863929626 | intron |
chr20 | 49469454 | 192610 | 3.89351 | 0.197367513607037 | intron |
chr21 | 29631892 | 139038 | 4.69217 | 0.996027513607037 | intron |
chr22 | 34422850 | 141914 | 4.12267 | 0.426527513607037 | intron |
chr3 | 167503145 | 624810 | 3.73014 | 0.0339975136070372 | intron |
chr4 | 123106167 | 475761 | 3.86464 | 0.168497513607037 | intron |
chr5 | 119445909 | 416340 | 3.48559 | -0.210552486392963 | intron |
chr6 | 122222179 | 488769 | 3.99902 | 0.302877513607037 | intron |
chr7 | 139622973 | 522980 | 3.74566 | 0.0495175136070372 | intron |
chr8 | 103986592 | 416792 | 4.00813 | 0.311987513607038 | intron |
chr9 | 87384481 | 345681 | 3.95586 | 0.259717513607037 | intron |
chrX | 113315103 | 198373 | 1.75063 | -1.94551248639296 | intron |
chrY | 25826793 | 3962 | 0.153407 | -3.54273548639296 | intron |
chr1 | 1501412 | 5467 | 3.64124 | 0.0411945865830874 | utr5 |
chr10 | 655271 | 2453 | 3.74349 | 0.143444586583088 | utr5 |
chr11 | 801123 | 2720 | 3.39523 | -0.204815413416912 | utr5 |
chr12 | 765747 | 2796 | 3.65134 | 0.0512945865830874 | utr5 |
chr13 | 261673 | 937 | 3.58081 | -0.0192354134169124 | utr5 |
chr14 | 474814 | 1768 | 3.72356 | 0.123514586583088 | utr5 |
chr15 | 554992 | 1497 | 2.69734 | -0.902705413416912 | utr5 |
chr16 | 499164 | 1687 | 3.37965 | -0.220395413416913 | utr5 |
chr17 | 777051 | 3063 | 3.94183 | 0.341784586583088 | utr5 |
chr18 | 210888 | 831 | 3.94048 | 0.340434586583088 | utr5 |
chr19 | 732863 | 3185 | 4.34597 | 0.745924586583088 | utr5 |
chr2 | 927321 | 3094 | 3.33649 | -0.263555413416912 | utr5 |
chr20 | 408422 | 1505 | 3.68491 | 0.0848645865830875 | utr5 |
chr21 | 214951 | 901 | 4.19165 | 0.591604586583088 | utr5 |
chr22 | 395782 | 1556 | 3.93146 | 0.331414586583088 | utr5 |
chr3 | 835162 | 3070 | 3.67593 | 0.0758845865830877 | utr5 |
chr4 | 566024 | 2169 | 3.83199 | 0.231944586583087 | utr5 |
chr5 | 634733 | 2157 | 3.39828 | -0.201765413416912 | utr5 |
chr6 | 704241 | 3917 | 5.56202 | 1.96197458658309 | utr5 |
chr7 | 747133 | 2604 | 3.48532 | -0.114725413416912 | utr5 |
chr8 | 497581 | 1917 | 3.85264 | 0.252594586583088 | utr5 |
chr9 | 554969 | 1992 | 3.58939 | -0.0106554134169126 | utr5 |
chrX | 659875 | 976 | 1.47907 | -2.12097541341691 | utr5 |
chrY | 151958 | 58 | 0.381684 | -3.21836141341691 | utr5 |
chr1 | 5648070 | 15077 | 2.66941 | 0.0520799475195366 | coding |
chr10 | 2458928 | 6376 | 2.593 | -0.0243300524804635 | coding |
chr11 | 3250750 | 9041 | 2.7812 | 0.163869947519537 | coding |
chr12 | 3047119 | 7065 | 2.31858 | -0.298750052480464 | coding |
chr13 | 968554 | 2221 | 2.29311 | -0.324220052480463 | coding |
chr14 | 1711626 | 4461 | 2.60629 | -0.0110400524804635 | coding |
chr15 | 1860352 | 4322 | 2.32322 | -0.294110052480463 | coding |
chr16 | 2262655 | 6356 | 2.80909 | 0.191759947519536 | coding |
chr17 | 3235121 | 8608 | 2.6608 | 0.0434699475195366 | coding |
chr18 | 888606 | 2253 | 2.53543 | -0.0819000524804636 | coding |
chr19 | 3431310 | 12051 | 3.51207 | 0.894739947519537 | coding |
chr2 | 4531228 | 10513 | 2.32012 | -0.297210052480463 | coding |
chr20 | 1366887 | 3955 | 2.89344 | 0.276109947519537 | coding |
chr21 | 610222 | 1926 | 3.15623 | 0.538899947519536 | coding |
chr22 | 1190786 | 3784 | 3.17773 | 0.560399947519536 | coding |
chr3 | 3347788 | 7766 | 2.31974 | -0.297590052480464 | coding |
chr4 | 2345492 | 5416 | 2.30911 | -0.308220052480463 | coding |
chr5 | 2620918 | 6490 | 2.47623 | -0.141100052480463 | coding |
chr6 | 2863757 | 10749 | 3.75346 | 1.13612994751954 | coding |
chr7 | 2588695 | 6762 | 2.61213 | -0.0052000524804634 | coding |
chr8 | 1915541 | 5190 | 2.70942 | 0.0920899475195367 | coding |
chr9 | 2293613 | 5755 | 2.50914 | -0.108190052480464 | coding |
chrX | 2569877 | 3686 | 1.43431 | -1.18302005248046 | coding |
chrY | 289808 | 144 | 0.496881 | -2.12044905248046 | coding |
chr1 | 5510678 | 17968 | 3.26058 | -0.0385129867774729 | utr3 |
chr10 | 2507328 | 8507 | 3.39285 | 0.0937570132225272 | utr3 |
chr11 | 3021561 | 9942 | 3.29035 | -0.00874298677747287 | utr3 |
chr12 | 3056933 | 9744 | 3.18751 | -0.111582986777473 | utr3 |
chr13 | 982233 | 3289 | 3.34849 | 0.049397013222527 | utr3 |
chr14 | 1725661 | 5785 | 3.35234 | 0.0532470132225269 | utr3 |
chr15 | 1684296 | 5564 | 3.30346 | 0.00436701322252686 | utr3 |
chr16 | 1777315 | 6557 | 3.68927 | 0.390177013222527 | utr3 |
chr17 | 2732060 | 9163 | 3.35388 | 0.0547870132225272 | utr3 |
chr18 | 1077764 | 3429 | 3.18159 | -0.117502986777473 | utr3 |
chr19 | 2298103 | 9773 | 4.25264 | 0.953547013222527 | utr3 |
chr2 | 3714580 | 11202 | 3.01568 | -0.283412986777473 | utr3 |
chr20 | 1413442 | 4882 | 3.45398 | 0.154887013222527 | utr3 |
chr21 | 571300 | 2622 | 4.58953 | 1.29043701322253 | utr3 |
chr22 | 1146483 | 4758 | 4.15008 | 0.850987013222527 | utr3 |
chr3 | 3321681 | 10219 | 3.07645 | -0.222642986777473 | utr3 |
chr4 | 2228731 | 7663 | 3.43828 | 0.139187013222527 | utr3 |
chr5 | 2680310 | 8553 | 3.19105 | -0.108042986777473 | utr3 |
chr6 | 2696783 | 10972 | 4.06855 | 0.769457013222527 | utr3 |
chr7 | 2616087 | 9070 | 3.46701 | 0.167917013222527 | utr3 |
chr8 | 2062931 | 7314 | 3.54544 | 0.246347013222527 | utr3 |
chr9 | 1973919 | 6494 | 3.2899 | -0.00919298677747316 | utr3 |
chrX | 2391499 | 3578 | 1.49613 | -1.80296298677747 | utr3 |
chrY | 526723 | 174 | 0.330344 | -2.96874898677747 | utr3 |
- Bar chart
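The `density.snp` values can be recomputed as SNP count divided by feature length, times 1000, and subtracting `density.snp.deviationFromMean` from the density recovers the reference mean used for that feature. The Python sketch below does this for three `gene` rows copied from the table. How the original scripts computed that mean is not stated, so the code only checks that the rows imply the same value; `snps_per_kb` is an illustrative name.

```python
# (chromosome, feature length in bp, SNP count, deviation from mean),
# copied from three "gene" rows of the table
rows = [
    ("chr1", 213911052, 761239, -0.102761224115423),
    ("chr10", 149514361, 587986, 0.271208775884577),
    ("chrX", 118936354, 206613, -1.92426122411542),
]


def snps_per_kb(length_bp, snp_count):
    """SNP density expressed as SNPs per 1000 bp of feature."""
    return snp_count / length_bp * 1000


results = []
for chrom, length, n_snp, deviation in rows:
    density = snps_per_kb(length, n_snp)
    # density - deviation gives the mean the deviation was measured against
    implied_mean = density - deviation
    results.append((chrom, density, implied_mean))
    print(f"{chrom}\tdensity={density:.5f}\timplied mean={implied_mean:.5f}")
```

All three rows point to a mean of about 3.66 SNPs per kb for the `gene` feature, and the low chrX density stands out clearly against it.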
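V. Extensions
VI. Download
All data (statistics, scripts, and result figures) are packaged for download.
PS: The genome-wide SNP density (total SNPs / genome length; including the sex chromosomes and the mitochondrial genome, excluding chr*_* sequences) is 11329891/3095693983*1000 = 3.659887, i.e. roughly 3 to 4 SNPs per kb.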