Skip to content
Snippets Groups Projects
Commit a5dc7e00 authored by James Bonfield's avatar James Bonfield
Browse files

Improved round-trip support for NM and MD tags in CRAM.

See samtools/samtools#717 for discussion.

The NM tag is ambiguous and infact differs in implementation between
htsjdk and htslib.  Specifically N in both ref and seq is considered
to be a mismatch for samtools and a match in picard.

If we detect this case we now also store NM and MD verbatim, along
with the suspect case of falling off the end of the reference (who
knows what people write to these fields in that ill-defined case).

This makes it more likely that a round-trip from SAM -> CRAM -> SAM
will work even when the input SAM was produced via htsjdk.

To be ultra careful, we also add store_md and store_nm options to
always store this data verbatim.  When combined with decode_md (note
this implicitly also implies decode_nm) this means it is possible to
round-trip while keeping these fields perfect even when they are set
to complete hogwash that neither picard nor samtools accepts, and also
to distinguish between the case of some reads having these fields
while others do not.

For example:

    samtools view -O cram,store_md=1,store_nm=1 in.sam -o out.cram
    samtools view -I decode_md=0 out.cram -o out.cram.sam

I thought about having a join store_md field that covers NM too, but
there are reasons why we'd want to store NM verbatim and not MD such
as NM being tiny in comparison to MD and MD being more tightly defined
in the spec.
parent 11813d54
Branches
Tags
No related merge requests found
Loading
0% Loading or .
You are about to add 0 people to the discussion. Proceed with caution.
Please register or to comment