Footsteps on my way !

从SAM文件中判断unique reads

在NGS数据中,Unique reads指的是只能map到一个位置的reads。

要判断一条read是否unique,需要看SAM文件的optional fields。
optional fields的格式是TAG:TYPE:VALUE,在SAM文件的第12列。多个optional fields由空格分开。
与unique read相关的主要TAG是“NH”,代表一条read贴在了基因组的多少个位置,数据类型为i,即整数,所以”NH:i:1″就代表此条read只贴到了基因组上的一个位置。(但align的目标未必总是基因组,也有可能是转录组,小RNA组等,取决于实验设计。)


BWA的结果比Tophat的结果在mapping信息上更为详细。在BWA结果中,”X0″代表打开gap的个数(BWA不支持splice junction,但支持gap),数据类型为i。”XT”代表mapping类型,数据类型为”A”,即可打印字符,其值可以是Unique/Repeat/N/Mate-sw,有”XT:A:U”就代表这一条是unique read。由于BWA还提供了”AS”(alignment score)和”XS”(suboptimal alignment score)这两项,所以即使有些reads没有被Uniquely mapped,只要AS的值和XS的值相差足够大,也有一定价值。

补充1 :

We prefer to say an alignment is ‘reliable’ rather than ‘unique’ as ‘uniqueness’ is not well defined in general cases. You can get reliable alignments by setting a threshold on mapping quality:

samtools view -bq 1 aln.bam > aln-reliable.bam

You may want to set a more stringent threshold to get more reliable alignments.

补充2 :
I’d suggest you follow bwa’s FAQ advice and rely more in the MAPQ. Still, if you want to

filter uniquely mapped:

$ samtools view bwa.bam | grep "XT:A:U"
Unique alignments in bowtie2 have MAPQ>=2, so you can just filter the results by that.
补充3 :
10-07-2011, 08:26 AM

If you use BWA, the output bam or sam will have an optional flag = XT:A:U set in case of unique reads. So, you could just write a script to pull those lines containing XT:A:U. Similarly, for tophat there’s an optional NH:i:1 (1 for unique hit).True, but you also want the M reads. Those are the ones where the read wasn’t unique, but the mate was, so that made the mapping of the repetative bit sure. And I’ve unfortunately seen R reads where the mate mapped uniquely. I think those are cases where it should have been changed to M, but it wasn’t.

So I don’t think that pulling from the flags alone will do it. You’ll have to pull down pairs, and check to see if either end maps uniquely.

In my case, if both ends don’t map uniquely, I throw them. If one of the pair maps uniquely (that would have XT:A:U and its mate with XT:A:M), I consider the unique hit as a single-end read and remove the other one alone.

couple questions, 1) what do you mean here? “so that made the mapping of the repetative bit sure”. 2) why do you want the other pair when its not uniquely mapped? (assuming you remove non-unique reads otherwise).


尊重他人劳动成果,转载请注明出处:Bluesky's blog » 从SAM文件中判断unique reads

分享到:更多 ()

评论 抢沙发

  • 昵称 (必填)
  • 邮箱 (必填)
  • 网址