Footsteps on my way !
perl/linux/测序分析

newick转phyloxml

newick 格式和phyloxml格式是遗传发育树的两种最为常见的格式,所以它们两者的互转非常重要。

首先讲讲两种文件的特点:

newick: http://evolution.genetics.washington.edu/phylip/newicktree.html

phyloxml: http://www.phyloxml.org

两者除了格式上一个采用括号,一个采用XML标签的差别外,主要是newick一般包含的信息比较少,只有id和branch length, 而phyloxml 包含的信息相当丰富,包括 scientific name, name, gene accession, sequence, 等等,一切能想到到的信息都可以通过拓展phyloxml schema来实现(详情可phyloxml的说明)。

newick转phyloxml我这里提供2种方式:

1. forester decorater: https://sites.google.com/site/cmzmasek/home/software/forester/decorator (可能被屏蔽,若不能访问得翻墙)

因为newick包含的信息少于phyloxml的信息,所以最重的就是根据newick的ID通过其他方式提取到其他你需要添加到phyloxml的信息,并按decorater说明各项以Tab键分隔。下面的这个程序就是从uniport下载的specieslist.txt中提取信息输出为一个以Tab分隔的信息文件。

#! /usr/bin/perl
use strict;
use warnings;

open IN,"<","$ARGV[0]"; open OUT,">","$ARGV[1]";
open IN2,"<","/Users/yangl/Documents/test/specieslist.txt";

my %sci_name;
my %tax_id;
while(){
	#ALCCR E   85095: N=Alcedo cristata
	chomp;
	if(/(\w+)\s+\w+\s+(\d+):\s+N=(.+)$/){
		$tax_id{$1}=$2;
		$sci_name{$1}=$3;
	}
	
}

while(){
	my @name = split/:/,$_;
	foreach my $name(@name){

		if ($name=~/[\(|\)|,]([A-Z]+.*)$/){
			my $accession = $1;
#			print $accession . "\n";
			if ($1=~/(.+)_(\w+)/){
				#print OUT "$accession\tTAXONOMY_CODE:$2\tTAXONOMY_ID:$tax_id{$2}\tTAXONOMY_SN:$sci_name{$2}\tSEQ_ACCESSION:$1\tSEQ_ACCESSION_SOURCE:ncbi\tSEQ_NAME:$1\n";
				print OUT "$accession\t";
				#print "$2\n";
				print OUT "TAXONOMY_SN:$sci_name{$2}\t";
				print OUT "SEQ_ACCESSION:$1\t";
				print OUT "SEQ_ACCESSION_SOURCE:ncbi\t";
				print OUT "SEQ_NAME:$1\n";
			}
		}
	}
}

2.perl 包: Bio::TreeIO::phyloxml (利用perl包可进行newick和phyloxml格式的互转),程序示例如下:

#! /usr/bin/perl
use strict;
use warnings;
use Bio::TreeIO;
use Data::Dumper;
###
#注意: "Bio::TreeIO 的write_tree重新生成的树文件格式需要重排" -已修改Pm文件
###

open OUT,">","name_accession_sciname.txt";
my $treeio = Bio::TreeIO->new(-format => 'phyloxml',
								-file => '/Users/yangl/Documents/test/wnt_gene_tree.xml');
my $tree = $treeio->next_tree;
my $newfile="newfile.newick";
#print Dumper($treeio);


my @nodes = $tree->get_nodes();

my @not_leaf_nodes;
#print scalar @nodes;
foreach my $node (@nodes){
	if($node->is_Leaf){
		my ($sci_name) = $treeio -> read_annotation('-obj'=>$node, '-path'=>'taxonomy/scientific_name',-attr=>0);
		my $accession = $node -> {"_sequence"}[0] -> {"_annotation"} -> {"_annotation"} -> {"accession"}[0] -> {"_annotation"} -> {"_text"}[0] -> {"value"};
		#print $accession . "\n";
		my $seq_name = $node -> {"_sequence"}[0] -> {"_annotation"} -> {"_annotation"} -> {"name"}[0] -> {"_annotation"} -> {"_text"}[0] -> {"value"};
		my $acc = $accession;
		$acc=~s/\s/\-/g;
		$node->id($acc);
		print OUT "$acc\tTAXONOMY_SN:$sci_name\tSEQ_ACCESSION_SOURCE:ncbi\tSEQ_ACCESSION:$accession\tSEQ_NAME:$seq_name\n";
		#$node = $treeio->add_phyloXML_annotation(-obj=>$node,-attr=>"name = $accession");
		$node -> add_tag_value("name",$sci_name);
	}else{
		#push @not_leaf_nodes,$node;
	}
}
my $treeout = Bio::TreeIO->new (-format => 'phyloxml',-file=>"> newfile2.xml");
$treeout->write_tree($tree); #phyloxml转phyloxml(修改phyloxml的信息后重新生成phyloxml文件,如修改scitific name),其实newick转phyloxml也一样,只是需要添加你想要的其他信息而不是修改信息
#my $treeout = Bio::TreeIO->new (-format => 'newick',-file=>"> newfile_newick.xml");
#$treeout->write_tree($tree) #phyloxml转newick

注意:利用phyloxml包转换成phyloxml,会导致以下问题:1.失去缩进,整个文件变成一行; 2.各XML标签的顺序会随机出现

1. 失去缩进:没有修改程序所进,但可以通过将模式 ” >< ” 替换成 ” >\n< ” ,来换行显示

2. 已经修改 AnnotatableNode.pm 和 phyloxml.pm 使得我们能自定义标签顺序。

a. 在 AnnotatableNode.pm 倒数第二行(” 1;” 这一行之前)加入自定义排序函数:

=head2 _ann_sort (written by yangli)

Title   : _ann_sort
Usage   : @ann = $tree-&gt;($order, @ann)
Function: Rearrange Annotation to requested order.
Returns : An array of annotation tag in requested order.
Args    : An array of annotationI.

=cut

sub _ann_sort
{
	my ($self, $order,@ann) = @_;
	my @ordered;
	my @order_index;
	my %ann;
	my @other;

	foreach my $ann(@ann){
		my $key = $ann-&gt;tagname;
		$ann{$key}=$ann;
	}

	foreach my $order_key (@$order){
		foreach my $key (keys %ann){
			if ($key eq $order_key){
				push @ordered,$ann{$key};
				delete $ann{$key};
			}
		}
	}
	@other = values %ann;
	return (@ordered,@other);
}

b.在phyloxml.pm对应位置调用。

在print_annotation函数中调用自定义函数_ann_sort():

=head2 print_annotation

 Title   : print_annotation
 Usage   : $str = $annotatablenode->print_annotation($str, $annotationcollection)
 Function: prints the annotationCollection in a phyloXML format.
 Returns : string of annotation information
 Args    : string to which the Annotation should be concatenated to,
           annotationCollection that holds the Annotations

=cut

# Again, it may be more appropriate to make a separate Annotation::phyloXML object
# and have this subroutine within that object so it can handle the 
# reading and writing of the values and attributes.
# especially since this function is used both by 
# Bio::TreeIO::phyloxml (through write_tree) and 
# Bio::Node::AnnotatableNode (through node_to_string).
# but since tagTree is a temporary stub and I didn't want to make 
# a redundant object I've put it here for now.

sub print_annotation 
{
  my ($self, $str, $ac) = @_; 
 
  my @all_anns = $ac->get_Annotations();
  
  #print blessed($self);
  #调用自定义_ann_sorted函数,规定顺序依次是name branch_length taxonomy,若还存在其他标签其他标签顺序随机
  my @all_anns_sorted = $self->_ann_sort([qw(name branch_length taxonomy)],@all_anns);
  foreach my $ann (@all_anns_sorted) {
    my $key = $ann->tagname;
    if ($key eq '_attr') { next; } # attributes are already printed in the previous level 
    if  ($ann->isa('Bio::Annotation::SimpleValue')) 
    {
      if ($key eq '_text') {
        $str .= $ann->value;
      }
      else {
        $str .= "<$key>";
        $str .= $ann->value;
        $str .= "";
      }
    }
    elsif ($ann->isa('Bio::Annotation::Collection')) 
    {
      my @attrs = $ann->get_Annotations('_attr');
      if (@attrs) {   # if there is a attribute collection
        $str .= "<$key"; $str = print_attr($self, $str, $attrs[0]); $str .= ">";
      }
      else {
        $str .= "<$key>";
      }
      $str = print_annotation($self, $str, $ann);
      $str .= "";
    }
  } 
  return $str;
}

在print_seq_annotation函数中调用自定义函数_ann_sort():

=head2 print_sequence_annotation

 Title   : print_sequence_annotation
 Usage   : $str = $node->print_seq_annotation( $str, $seq );
 Function: prints the Bio::Seq object associated with the node 
           in a phyloXML format.
 Returns : string that describes the sequence
 Args    : string to which the Annotation should be concatenated to,
           Seq object to print in phyloXML

=cut

# Again, it may be more appropriate to make a separate Annotation::phyloXML object
# and have this subroutine within that object so it can handle the 
# reading and writing of the values and attributes.
# especially since this function is used both by 
# Bio::TreeIO::phyloxml (through write_tree) and 
# Bio::Node::AnnotatableNode (through node_to_string).
# but since tagTree is a temporary stub and I didn't want to make 
# a redundant object I've put it here for now.

sub print_seq_annotation 
{
  my ($self, $str, $seq) = @_; 
  
  $str .= "annotation->get_Annotations('_attr'); # check id_source
  if ($attr) { 
    my ($id_source) = $attr->get_Annotations('id_source');
    if ($id_source) {
      $str .= " id_source=\"".$id_source->value."\"";
    }
  }
  $str .= ">";

  my @all_anns = $seq->annotation->get_Annotations();
   #调用排序子程序_ann_sort(),若有accession和name,先排accession然后name,再排其他标签。
   my @all_anns_sorted = $self->_ann_sort([qw(accession name)],@all_anns);
  foreach my $ann (@all_anns_sorted) {
    my $key = $ann->tagname;
    if ($key eq '_attr') { next; } # attributes are already printed in the previous level 
    if  ($ann->isa('Bio::Annotation::SimpleValue')) 
    {
      if ($key eq '_text') {
        $str .= $ann->value;
      }
      else {
        $str .= "<$key>";
        $str .= $ann->value;
        $str .= "";
      }
    }
    elsif ($ann->isa('Bio::Annotation::Collection')) 
    {
      my @attrs = $ann->get_Annotations('_attr');
      if (@attrs) {   # if there is a attribute collection
        $str .= "<$key";
        $str = print_attr($self, $str, $attrs[0]);
        $str .= ">";
      }
      else {
        $str .= "<$key>";
      }
      $str = print_annotation($self, $str, $ann);
      $str .= "";
    }
  }
  # print mol_seq 
  if ($seq->seq()) {
    $str .= "";
    $str .= $seq->seq();
    $str .= "";
  }

  $str .= "";
  return $str;
}

修改以后就能使XML标签按你自己的顺序排序,修改后的.pm文件下载地址:链接http://pan.baidu.com/s/1bndh9D1 密码: xbe6,对应覆盖Perl/5.18/Bio/Tree目录中的AnnotatableNode.pm和Perl/5.18/Bio/TreeIO目录中的phyloxml.pm即可。

尊重他人劳动成果,转载请注明出处:Bluesky's blog » newick转phyloxml

分享到:更多 ()

评论 抢沙发

  • 昵称 (必填)
  • 邮箱 (必填)
  • 网址