对FASTA格式的简单处理与统计

  • 缘起

模仿lh3开发的网页版工具SeqTools

  • 功能

1.格式化FASTA文件。
2.反向互补FASTA序列。
3.获取FASTA序列的长度。
4.计算GC含量并对ATGC计数。
5.搜索模式(子序列、motif等)。

  • 改进

1.添加了GC含量的计算和ATGC的计数。
2.扩展了search功能(得益于Perl的正则表达式)。

  • 不足

没有网页界面。
(有熟悉perl与网页编程的朋友可以改写一下:只要在浏览器中打开就可以使用)

  • 源码

  • 1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    31
    32
    33
    34
    35
    36
    37
    38
    39
    40
    41
    42
    43
    44
    45
    46
    47
    48
    49
    50
    51
    52
    53
    54
    55
    56
    57
    58
    59
    60
    61
    62
    63
    64
    65
    66
    67
    68
    69
    70
    71
    72
    73
    74
    75
    76
    77
    78
    79
    80
    81
    82
    83
    84
    85
    86
    87
    88
    89
    90
    91
    92
    93
    94
    95
    96
    97
    98
    99
    100
    101
    102
    103
    104
    105
    106
    107
    108
    109
    110
    111
    112
    113
    114
    115
    116
    117
    118
    119
    120
    121
    122
    123
    124
    125
    126
    127
    128
    129
    130
    131
    132
    133
    134
    135
    136
    137
    138
    139
    140
    141
    142
    143
    144
    145
    146
    147
    148
    149
    150
    151
    152
    153
    154
    155
    156
    157
    158
    159
    160
    161
    162
    163
    164
    165
    166
    167
    168
    169
    
    #!/usr/bin/perl 
     
    #Please replace all "《" and "》" with "",respectively.
     
    use strict;
    use warnings;
     
    use Getopt::Std;
    use Bio::SeqIO;
     
    &main;
    exit;
     
    sub main {
        &usage if ( @ARGV1 );
        my $command  = shift(@ARGV);
        my %function = (
            format  =&format,
            revcom  =&revcom,
            length  =&length,
            content =&content,
            search  =&search
        );
        die("Unknown command "$command".n")
          if ( !defined( $function{$command} ) );
        };
    }
     
    sub format {
        my %opts = ( w =60 );
        getopts( 'i:o:w:', %opts );
        die(
            qq/
    Usage:    seqTools.pl format [options]
     
    Options:  -i STR   Input file in FASTA format.
              -o STR   Output file in FASTA format.
              -w INT   The line width for FASTA output. [$opts{w}]
    n/
        ) unless ( exists $opts{i} && exists $opts{o} );
     
        my ( $fi, $fo, $width ) = ( $opts{i}, $opts{o}, $opts{w} );
        my $in  = Bio::SeqIO-new( -file ="$fi",  -format ='fasta' );
        my $out = Bio::SeqIO-new( -file ="》$fo", -format ='fasta' );
        while ( my $seq = $in-》next_seq() ) {
            $out-》width($width);
            $out-》write_seq($seq);
        }
    }
     
    sub revcom {
        my %opts = ( w =60 );
        getopts( 'i:o:', %opts );
        die(
            qq/
    Usage:    seqTools.pl revcom [options]
     
    Options:  -i STR   Input file in FASTA format.
              -o STR   Output file in FASTA format.
              -w INT   The line width for FASTA output. [$opts{w}]
    n/
        ) unless ( exists $opts{i} && exists $opts{o} );
     
        my ( $fi, $fo, $width ) = ( $opts{i}, $opts{o}, $opts{w} );
        my $in  = Bio::SeqIO-new( -file ="$fi",  -format ='fasta' );
        my $out = Bio::SeqIO-new( -file ="》$fo", -format ='fasta' );
        while ( my $seq = $in-》next_seq() ) {
            my $rc_seq = $seq-》revcom;
            $out-》width($width);
            $out-》write_seq($rc_seq);
        }
    }
     
    sub length {
        my %opts;
        getopts( 'i:', %opts );
        die(
            qq/
    Usage:    seqTools.pl length [options]
     
    Options:  -i STR   Input file in FASTA format.
    n/
        ) unless ( exists $opts{i} );
     
        my $fi = $opts{i};
        my $in = Bio::SeqIO-new( -file ="$fi", -format ='fasta' );
        while ( my $seq = $in-》next_seq() ) {
            print $seq-》id, "t", $seq-length, "n";
        }
    }
     
    sub content {
        my %opts;
        getopts( 'i:p:', %opts );
        die(
            qq/
    Usage:    seqTools.pl content [options]
     
    Options:  -i STR   Input file in FASTA format.
    n/
        ) unless ( exists $opts{i} );
     
        my $fi = $opts{i};
        my $in = Bio::SeqIO-new( -file ="$fi", -format ='fasta' );
        while ( my $seq = $in-》next_seq() ) {
            print $seq-》id, "t";
            my $length   = $seq-length;
            my $sequence = $seq-》seq;
            my $num_a    = $sequence =~ tr/A//;
            my $num_c    = $sequence =~ tr/C//;
            my $num_g    = $sequence =~ tr/G//;
            my $num_t    = $sequence =~ tr/T//;
            my $num_o    = $length - $num_a - $num_c - $num_g - $num_t;
            my $gc       = sprintf( "%.2f", ( $num_c + $num_g ) / $length * 100 );
            print
    "Length=$lengthtGC=${gc}%tA=$num_a,C=$num_c,G=$num_g,T=$num_t,Others=$num_on";
        }
    }
     
    sub search {
        my %opts;
        getopts( 'i:p:', %opts );
        die(
            qq/
    Usage:    seqTools.pl search [options]
     
    Options:  -i STR   Input file in FASTA format.
              -p STR   Pattern you want to search.
     
    Notes:    1. You can use RegExp supported by Perl.
              2. It uses case-sensitive mode.
    n/
        ) unless ( exists $opts{i} && $opts{p} );
     
        my ( $fi, $pattern ) = ( $opts{i}, $opts{p} );
        my $in = Bio::SeqIO-new( -file ="$fi", -format ='fasta' );
        while ( my $seq = $in-》next_seq() ) {
            print $seq-》id, "t";
            my @starts;
            my $sequence = $seq-》seq;
            while ( $sequence =~ m/($pattern)/g ) {
                my $start = pos($sequence) - CORE::length($1) + 1;
                pos($sequence) = $start;
                push @starts, $start;
            }
            my $num = @starts;
            my $str = join ",", @starts;
            print "$numt$strn";
        }
    }
     
    sub usage {
        die(
            qq/
    Usage:   seqTools.pl 《command》 [《arguments》]n
    Command: format   Format the FASTA record(s).
             revcom   Reverse and complement the FASTA record(s).
             length   Get the length of FASTA record(s).
             content  Calculate the GC content of FASTA record(s).
             search   Find subseq in the FASTA record(s).
     
    Author:  Yixf, yixf1986@gmail.com
    Version: 20110601
    Notes:  1. A similar webtool developed by lh3: http://lh3lh3.users.sourceforge.net/fasta.shtml
            2. When you search, you can use RegExp supported by Perl.
            3. The search function use case-sensitive mode.
    n/
        );
    }
    • 这个程序是我模仿lh3老师的一个Perl程序写的,我自己一点不懂C。
      PS:你的BioPerl教程啥时候继续呀?俺还一直期待着呢。
      现在国内基本上没有BioPerl的系统教程,希望你能尽快完工,填补此项空白。

    • PS2:如果你不抓紧时间,等7.7《仙剑奇侠传5》上市了,估计忙着玩游戏,更没时间了。嘿嘿。

  1. 呵呵,我没那么痴迷的。一般每周就玩一两小时左右吧。
    BioPerl的其他内容涉及得比较深,不是很容易写出来。而且我发现以前写的内容还有些问题。现正在整理中……很感谢你的鼓励!
    建议你给博客加一个评论回复和邮件通知功能,否则我根本看不到回复呢~