newick 格式和phyloxml格式是遗传发育树的两种最为常见的格式,所以它们两者的互转非常重要。
首先讲讲两种文件的特点:
newick: http://evolution.genetics.washington.edu/phylip/newicktree.html
phyloxml: http://www.phyloxml.org
两者除了格式上一个采用括号,一个采用XML标签的差别外,主要是newick一般包含的信息比较少,只有id和branch length, 而phyloxml 包含的信息相当丰富,包括 scientific name, name, gene accession, sequence, 等等,一切能想到到的信息都可以通过拓展phyloxml schema来实现(详情可phyloxml的说明)。
newick转phyloxml我这里提供2种方式:
1. forester decorater: https://sites.google.com/site/cmzmasek/home/software/forester/decorator (可能被屏蔽,若不能访问得翻墙)
因为newick包含的信息少于phyloxml的信息,所以最重的就是根据newick的ID通过其他方式提取到其他你需要添加到phyloxml的信息,并按decorater说明各项以Tab键分隔。下面的这个程序就是从uniport下载的specieslist.txt中提取信息输出为一个以Tab分隔的信息文件。
#! /usr/bin/perl use strict; use warnings; open IN,"<","$ARGV[0]"; open OUT,">","$ARGV[1]"; open IN2,"<","/Users/yangl/Documents/test/specieslist.txt"; my %sci_name; my %tax_id; while(){ #ALCCR E 85095: N=Alcedo cristata chomp; if(/(\w+)\s+\w+\s+(\d+):\s+N=(.+)$/){ $tax_id{$1}=$2; $sci_name{$1}=$3; } } while(){ my @name = split/:/,$_; foreach my $name(@name){ if ($name=~/[\(|\)|,]([A-Z]+.*)$/){ my $accession = $1; # print $accession . "\n"; if ($1=~/(.+)_(\w+)/){ #print OUT "$accession\tTAXONOMY_CODE:$2\tTAXONOMY_ID:$tax_id{$2}\tTAXONOMY_SN:$sci_name{$2}\tSEQ_ACCESSION:$1\tSEQ_ACCESSION_SOURCE:ncbi\tSEQ_NAME:$1\n"; print OUT "$accession\t"; #print "$2\n"; print OUT "TAXONOMY_SN:$sci_name{$2}\t"; print OUT "SEQ_ACCESSION:$1\t"; print OUT "SEQ_ACCESSION_SOURCE:ncbi\t"; print OUT "SEQ_NAME:$1\n"; } } } }
2.perl 包: Bio::TreeIO::phyloxml (利用perl包可进行newick和phyloxml格式的互转),程序示例如下:
#! /usr/bin/perl use strict; use warnings; use Bio::TreeIO; use Data::Dumper; ### #注意: "Bio::TreeIO 的write_tree重新生成的树文件格式需要重排" -已修改Pm文件 ### open OUT,">","name_accession_sciname.txt"; my $treeio = Bio::TreeIO->new(-format => 'phyloxml', -file => '/Users/yangl/Documents/test/wnt_gene_tree.xml'); my $tree = $treeio->next_tree; my $newfile="newfile.newick"; #print Dumper($treeio); my @nodes = $tree->get_nodes(); my @not_leaf_nodes; #print scalar @nodes; foreach my $node (@nodes){ if($node->is_Leaf){ my ($sci_name) = $treeio -> read_annotation('-obj'=>$node, '-path'=>'taxonomy/scientific_name',-attr=>0); my $accession = $node -> {"_sequence"}[0] -> {"_annotation"} -> {"_annotation"} -> {"accession"}[0] -> {"_annotation"} -> {"_text"}[0] -> {"value"}; #print $accession . "\n"; my $seq_name = $node -> {"_sequence"}[0] -> {"_annotation"} -> {"_annotation"} -> {"name"}[0] -> {"_annotation"} -> {"_text"}[0] -> {"value"}; my $acc = $accession; $acc=~s/\s/\-/g; $node->id($acc); print OUT "$acc\tTAXONOMY_SN:$sci_name\tSEQ_ACCESSION_SOURCE:ncbi\tSEQ_ACCESSION:$accession\tSEQ_NAME:$seq_name\n"; #$node = $treeio->add_phyloXML_annotation(-obj=>$node,-attr=>"name = $accession"); $node -> add_tag_value("name",$sci_name); }else{ #push @not_leaf_nodes,$node; } } my $treeout = Bio::TreeIO->new (-format => 'phyloxml',-file=>"> newfile2.xml"); $treeout->write_tree($tree); #phyloxml转phyloxml(修改phyloxml的信息后重新生成phyloxml文件,如修改scitific name),其实newick转phyloxml也一样,只是需要添加你想要的其他信息而不是修改信息 #my $treeout = Bio::TreeIO->new (-format => 'newick',-file=>"> newfile_newick.xml"); #$treeout->write_tree($tree) #phyloxml转newick
注意:利用phyloxml包转换成phyloxml,会导致以下问题:1.失去缩进,整个文件变成一行; 2.各XML标签的顺序会随机出现
1. 失去缩进:没有修改程序所进,但可以通过将模式 ” >< ” 替换成 ” >\n< ” ,来换行显示
2. 已经修改 AnnotatableNode.pm 和 phyloxml.pm 使得我们能自定义标签顺序。
a. 在 AnnotatableNode.pm 倒数第二行(” 1;” 这一行之前)加入自定义排序函数:
=head2 _ann_sort (written by yangli) Title : _ann_sort Usage : @ann = $tree->($order, @ann) Function: Rearrange Annotation to requested order. Returns : An array of annotation tag in requested order. Args : An array of annotationI. =cut sub _ann_sort { my ($self, $order,@ann) = @_; my @ordered; my @order_index; my %ann; my @other; foreach my $ann(@ann){ my $key = $ann->tagname; $ann{$key}=$ann; } foreach my $order_key (@$order){ foreach my $key (keys %ann){ if ($key eq $order_key){ push @ordered,$ann{$key}; delete $ann{$key}; } } } @other = values %ann; return (@ordered,@other); }
b.在phyloxml.pm对应位置调用。
在print_annotation函数中调用自定义函数_ann_sort():
=head2 print_annotation Title : print_annotation Usage : $str = $annotatablenode->print_annotation($str, $annotationcollection) Function: prints the annotationCollection in a phyloXML format. Returns : string of annotation information Args : string to which the Annotation should be concatenated to, annotationCollection that holds the Annotations =cut # Again, it may be more appropriate to make a separate Annotation::phyloXML object # and have this subroutine within that object so it can handle the # reading and writing of the values and attributes. # especially since this function is used both by # Bio::TreeIO::phyloxml (through write_tree) and # Bio::Node::AnnotatableNode (through node_to_string). # but since tagTree is a temporary stub and I didn't want to make # a redundant object I've put it here for now. sub print_annotation { my ($self, $str, $ac) = @_; my @all_anns = $ac->get_Annotations(); #print blessed($self); #调用自定义_ann_sorted函数,规定顺序依次是name branch_length taxonomy,若还存在其他标签其他标签顺序随机 my @all_anns_sorted = $self->_ann_sort([qw(name branch_length taxonomy)],@all_anns); foreach my $ann (@all_anns_sorted) { my $key = $ann->tagname; if ($key eq '_attr') { next; } # attributes are already printed in the previous level if ($ann->isa('Bio::Annotation::SimpleValue')) { if ($key eq '_text') { $str .= $ann->value; } else { $str .= "<$key>"; $str .= $ann->value; $str .= ""; } } elsif ($ann->isa('Bio::Annotation::Collection')) { my @attrs = $ann->get_Annotations('_attr'); if (@attrs) { # if there is a attribute collection $str .= "<$key"; $str = print_attr($self, $str, $attrs[0]); $str .= ">"; } else { $str .= "<$key>"; } $str = print_annotation($self, $str, $ann); $str .= ""; } } return $str; }
在print_seq_annotation函数中调用自定义函数_ann_sort():
=head2 print_sequence_annotation Title : print_sequence_annotation Usage : $str = $node->print_seq_annotation( $str, $seq ); Function: prints the Bio::Seq object associated with the node in a phyloXML format. Returns : string that describes the sequence Args : string to which the Annotation should be concatenated to, Seq object to print in phyloXML =cut # Again, it may be more appropriate to make a separate Annotation::phyloXML object # and have this subroutine within that object so it can handle the # reading and writing of the values and attributes. # especially since this function is used both by # Bio::TreeIO::phyloxml (through write_tree) and # Bio::Node::AnnotatableNode (through node_to_string). # but since tagTree is a temporary stub and I didn't want to make # a redundant object I've put it here for now. sub print_seq_annotation { my ($self, $str, $seq) = @_; $str .= "annotation->get_Annotations('_attr'); # check id_source if ($attr) { my ($id_source) = $attr->get_Annotations('id_source'); if ($id_source) { $str .= " id_source=\"".$id_source->value."\""; } } $str .= ">"; my @all_anns = $seq->annotation->get_Annotations(); #调用排序子程序_ann_sort(),若有accession和name,先排accession然后name,再排其他标签。 my @all_anns_sorted = $self->_ann_sort([qw(accession name)],@all_anns); foreach my $ann (@all_anns_sorted) { my $key = $ann->tagname; if ($key eq '_attr') { next; } # attributes are already printed in the previous level if ($ann->isa('Bio::Annotation::SimpleValue')) { if ($key eq '_text') { $str .= $ann->value; } else { $str .= "<$key>"; $str .= $ann->value; $str .= "$key>"; } } elsif ($ann->isa('Bio::Annotation::Collection')) { my @attrs = $ann->get_Annotations('_attr'); if (@attrs) { # if there is a attribute collection $str .= "<$key"; $str = print_attr($self, $str, $attrs[0]); $str .= ">"; } else { $str .= "<$key>"; } $str = print_annotation($self, $str, $ann); $str .= "$key>"; } } # print mol_seq if ($seq->seq()) { $str .= " "; return $str; }"; $str .= $seq->seq(); $str .= " "; } $str .= "
修改以后就能使XML标签按你自己的顺序排序,修改后的.pm文件下载地址:链接http://pan.baidu.com/s/1bndh9D1 密码: xbe6,对应覆盖Perl/5.18/Bio/Tree目录中的AnnotatableNode.pm和Perl/5.18/Bio/TreeIO目录中的phyloxml.pm即可。
尊重他人劳动成果,转载请注明出处:Bluesky's blog » newick转phyloxml