Commits · develop · Thierry Schuepbach / htslib

May 24, 2018
- Optimisation for multithreaded uncompressed bgzf · a13ac99e
  Rob Davies authored 7 years ago
  
  Eliminates a memcpy from the uncompressed to compressed buffer.
  a13ac99e
- Faster writing of uncompressed data for builds without libdeflate · f49e916e
  Rob Davies authored 7 years ago
  
  f49e916e
- Improve hwrite efficiency for large writes when the buffer is empty · aba6ca46
  Rob Davies authored 7 years ago
  
  When the buffer is empty and the request won't fit, go straight to hwrite2() without trying to copy data into the buffer first. Avoids a memcpy and an extra call to fp->backend->write().
  aba6ca46
- Only signal input queue not full if it really is not full · 49b6d959
  Rob Davies authored 7 years ago
  
  49b6d959
- Only broadcast result_avail if it's the next one to be picked up · 979e6f21
  Rob Davies authored 7 years ago
  
  Reduces the number of times the output consumer will be woken up, only to be disappointed.
  979e6f21
- Fix incorrect bracket placement · 699ed53d
  Rob Davies authored 7 years ago
  
  Compiler warning was correct.
  699ed53d
- Compiler warning whack-a-mole. · 96defa1c
  James Bonfield authored 7 years ago
  
  Silence a clang 5.0.0 warning caused by the recent gcc 8.1.0 warning fixes.
  96defa1c
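
To illustrate the hwrite change in aba6ca46 above, here is a minimal sketch of the fast path, assuming a toy buffered writer. `sketch_writer`, `backend_write`, `sketch_flush` and `sketch_write` are hypothetical names made up for this example and are not the hFILE API; the counters only exist so the effect can be observed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical buffered writer; NOT the hFILE struct from htslib. */
typedef struct {
    char buf[16];          /* tiny buffer so the example is easy to follow */
    size_t used;           /* bytes currently held in buf */
    size_t backend_calls;  /* number of times the backend was invoked */
    size_t backend_bytes;  /* total bytes handed to the backend */
} sketch_writer;

/* Stand-in for fp->backend->write(): only counts what it receives. */
static void backend_write(sketch_writer *w, const char *data, size_t n) {
    (void) data;
    w->backend_calls++;
    w->backend_bytes += n;
}

/* Push any buffered bytes to the backend. */
static void sketch_flush(sketch_writer *w) {
    if (w->used > 0) {
        backend_write(w, w->buf, w->used);
        w->used = 0;
    }
}

static void sketch_write(sketch_writer *w, const char *data, size_t n) {
    assert(w != NULL && data != NULL);
    size_t space = sizeof(w->buf) - w->used;
    if (n <= space) {                 /* fits: ordinary buffered copy */
        memcpy(w->buf + w->used, data, n);
        w->used += n;
        return;
    }
    if (w->used == 0) {               /* empty buffer, oversized request: */
        backend_write(w, data, n);    /* one backend call, no memcpy */
        return;
    }
    sketch_flush(w);                  /* general case: flush, then retry */
    sketch_write(w, data, n);
}
```

With an empty buffer, a 40-byte write reaches the backend in a single call and leaves the buffer untouched.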
May 18, 2018

Rob Davies authored 7 years ago

Change kputsn() and kputsn_() to take size_t instead of int to
fix possible integer overflow in kputs.  They are both static inline
so we can do this without breaking the ABI.  Guards on integer
wrap-around should catch attempts to add over- or negatively- long
strings.
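
A minimal sketch of the kind of wrap-around guard described above, assuming a simplified growable string. `sketch_str` and `sketch_putsn` are hypothetical names for illustration only; this is not the kstring.h code.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable string; field names loosely follow kstring_t. */
typedef struct { size_t l, m; char *s; } sketch_str;

/* Append n bytes of p. Returns 0 on success, -1 on overflow or OOM. */
static int sketch_putsn(sketch_str *str, const char *p, size_t n) {
    assert(str != NULL && p != NULL);
    /* Guard: l + n + 1 must not wrap. A negative int converted to
       size_t becomes a huge value and is rejected here too. */
    if (n >= SIZE_MAX - str->l) return -1;
    size_t need = str->l + n + 1;
    if (need > str->m) {
        size_t cap = str->m ? str->m : 16;
        while (cap < need) {
            if (cap > SIZE_MAX / 2) { cap = need; break; }
            cap *= 2;
        }
        char *tmp = realloc(str->s, cap);
        if (!tmp) return -1;
        str->s = tmp;
        str->m = cap;
    }
    memcpy(str->s + str->l, p, n);
    str->l += n;
    str->s[str->l] = '\0';
    return 0;
}
```

Because the length is unsigned, one comparison covers both the over-long and the "negative" case, and the string is left unchanged on failure.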

Possible buffer overflow in cram_populate_ref().

Possible use of uninitialised value in bcf_sr_regions_next().  In
that function it is possible that `from` and `to` may not be set if
either `ifrom` or `ito` is negative as that could cause
_regions_parse_line() to skip setting them.

String truncation warning in bam_hdr_write() when compiled with
no optimisation.  There was no need to copy the string anyway,
it can just be written out directly.

Not caught by gcc: Possible overflow in expand_cache_path().

Silence false positive uninitialised value warning in cram_encode.c
process_one_read() when using optimisation -Og or -Os.  gcc fails
to spot that `new = 1` prevents `k` from being used if not set.

55d7bef5

May 16, 2018

New interfaces to add or update bam integer, float and array aux tags (#694) · bceff25a

daviesrob authored 7 years ago

* Pull repeated code to expand bam data to its own function

Adds some missing overflow checks and fixes a few places where
l_data was incremented before trying to expand the data buffer
so it would no longer be valid on failure.

* Add bam_aux_update_int() interface

Makes adding or changing the values of integer tags much easier.
Updated tags will grow in size if needed, including moving
any following data.  They will not shrink - if the new data fits
in the old space the size will remain unchanged even if it is
bigger than strictly necessary.

* Add bam_aux_update_float() interface

* Add bam_aux_update_array() interface
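
The "grow but never shrink" rule described for bam_aux_update_int() can be sketched as follows. This models only the stored byte width (1, 2 or 4 bytes for BAM integer tags); `min_width` and `updated_width` are hypothetical helpers, and the real function also has to deal with signedness and with moving the data that follows the tag.

```c
#include <assert.h>
#include <stdint.h>

/* Smallest BAM integer width in bytes that can hold v, or -1 if v is
   outside the range BAM integer tags can represent. */
static int min_width(int64_t v) {
    if (v >= INT8_MIN  && v <= UINT8_MAX)  return 1;
    if (v >= INT16_MIN && v <= UINT16_MAX) return 2;
    if (v >= INT32_MIN && v <= UINT32_MAX) return 4;
    return -1;
}

/* Width of the tag after an update: grows when needed, never shrinks. */
static int updated_width(int old_width, int64_t new_val) {
    int need = min_width(new_val);
    if (need < 0) return -1;
    return need > old_width ? need : old_width;
}
```

So a 4-byte tag updated to the value 5 stays 4 bytes wide, while a 1-byte tag updated to 300 grows to 2 bytes.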

bceff25a

Apr 30, 2018

Undo accidental loss of "static" keyword. · 19a66ce8
James Bonfield authored 7 years ago

19a66ce8

Small speed up to cram indexing. · 53177bfb

James Bonfield authored 7 years ago

With slices that use the multi-ref mode (ref id -2 and tid encoded
using RI data series) we need to decode the slice, but only enough to
get rname, pos and alignment end.  We use the required fields
parameter to prevent wasting time decompressing quality values and
auxiliary tags.

53177bfb

Apr 27, 2018
- Fixes a bug in the creation of the multi-region iterator, that · eeda089c
  Valeriu Ohan authored 7 years ago
  
  caused some reads to be wrongly skipped when reading a BAM file.
  eeda089c
Apr 26, 2018

Bugfix in VCF record REF length update · 71b00a89

Petr Danecek authored 7 years ago

When alleles are updated, bcf1_t.rlen also needs to be updated.
In absence of INFO/END tag, the length of the REF allele is used,
otherwise rlen is calculated from INFO/END. However, by mistake
the END coordinate was used instead of the allele length.
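
A sketch of the intended rule, under stated assumptions: the start position is 0-based (as in bcf1_t.pos), INFO/END is 1-based and inclusive, and a negative `end1` stands for "no END tag". `sketch_rlen` is a hypothetical helper, not the code changed by this commit.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* pos0: 0-based start; ref: REF allele; end1: 1-based inclusive
   INFO/END value, or a negative number when the tag is absent. */
static int64_t sketch_rlen(int64_t pos0, const char *ref, int64_t end1) {
    assert(ref != NULL);
    if (end1 < 0)
        return (int64_t) strlen(ref);  /* no END: length of REF allele */
    return end1 - pos0;                /* END present: span up to END */
}
```

For a record starting at 1-based position 100 (`pos0 = 99`), REF `ACGT` without END gives 4, and END=150 gives 51.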

71b00a89

Make bcf_hdr_set_samples(..,NULL,..) work also in write mode. Resolves #692 · ac9c273d
Petr Danecek authored 7 years ago

ac9c273d

Apr 24, 2018
- Update internal state on synced reader's seek · 5fae73fd
  Petr Danecek authored 7 years ago
  
  Fixes #691
  5fae73fd
Apr 19, 2018

Replace tabs with spaces to the next tab stop (multiples of 8). · f1000743

James Bonfield authored 7 years ago

Plus also a few manual tweaks to indentation levels, mostly caused by
failure to keep indentation correct after old search and replaces.

See also e770a1b3 for the earlier commit
to change these elsewhere in htslib.  Use git blame -w to compare beyond
these commits.

For reference, the command used to do this was perl:

    perl -e 'while(<>){while(s/^(.*?)\t/"$1".(" "x(8-length($1)%8))/e){};s/\s*$/\n/;print}'

However GNU "expand | sed 's/ *$//'" also works if you have it
installed.
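
For readers who prefer C to the perl one-liner, the same transformation for a single line might look like the sketch below. `expand_tabs` is a hypothetical helper written for this note; it expands each tab to the next multiple of 8 and strips trailing spaces and carriage returns.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Expand tabs in one line of `in` into `out` (capacity `cap`), then
   strip trailing whitespace. Stops at '\n' or end of string.
   Returns the output length, or -1 if `out` is too small. */
static long expand_tabs(const char *in, char *out, size_t cap) {
    assert(in != NULL && out != NULL && cap > 0);
    size_t col = 0;
    for (; *in != '\0' && *in != '\n'; in++) {
        if (*in == '\t') {
            size_t pad = 8 - col % 8;       /* distance to next tab stop */
            if (col + pad >= cap) return -1;
            memset(out + col, ' ', pad);
            col += pad;
        } else {
            if (col + 1 >= cap) return -1;
            out[col++] = *in;
        }
    }
    while (col > 0 && (out[col - 1] == ' ' || out[col - 1] == '\r'))
        col--;                              /* drop trailing whitespace */
    out[col] = '\0';
    return (long) col;
}
```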

f1000743

Apr 18, 2018

Fixes `cram_pseek` function by clearing the `ctr_mt` pointer. · c4a7c728

Valeriu Ohan authored 7 years ago

Fixes `cram_ptell` function by using the correct indicator of
maximum records in a container.
Minor code improvements.

c4a7c728

Apr 11, 2018
- Improves the error reporting. · f31f2c00
  Valeriu Ohan authored 7 years ago
  
  f31f2c00
Apr 05, 2018
- fail is uncompressed input file is detected · e90ad09e
  jrayner authored 7 years ago
  
  e90ad09e
Apr 04, 2018
- allow integrity check on files without .gz suffix · 830e0225
  jrayner authored 7 years ago
  
  830e0225
- add integrity check option · ef895567
  jrayner authored 7 years ago
  
  ef895567
Apr 03, 2018

Merge version number bump and NEWS file from master · 5b60dd3f
Rob Davies authored 7 years ago

5b60dd3f
Release 1.8 · be22a2a1
Rob Davies authored 7 years ago

be22a2a1

More tweaks to cram threading. · 107e7d17

James Bonfield authored 7 years ago

The previous commit, while valid, also revealed more woes of multi-slice
containers, specifically when cancelling the current read by seeking
(draining the read-ahead decode queue).

107e7d17

Memory leak iterating multiple queries over cram · 651a936b

rpetrovski authored 7 years ago

The following code consumes memory indefinitely. Memory leak is gone once the change is applied.
Steps:
1. build htslib
2. compile test.c with:
gcc -O0  -ggdb -I htslib/install/include test.c -L htslib/install/lib/ -l:libhts.a -lz -lpthread -llzma -lbz2
3. run ./a.out some.cram chr1
4. watch virtual memory going up in top
5. apply patch, rebuild test.c, notice virtual memory does not change

test.c:
```c
#include <stdio.h>
#include <htslib/hts.h>
#include <htslib/sam.h>

int main(int argc,char** argv)
{
        hts_itr_t *iter=NULL;
        hts_idx_t *idx=NULL;
        samFile *in = NULL;
        bam1_t *b= NULL;
        bam_hdr_t *header = NULL;
        if(argc!=3) return -1;
        in = sam_open(argv[1], "r");

        if(in==NULL) return -1;
        if ((header = sam_hdr_read(in)) == 0) return -1;

        idx = sam_index_load(in,  argv[1]);
        if(idx==NULL) return -1;

        b = bam_init1();
        fputs("reading\n",stdout);
        do
        {
                if (iter) hts_itr_destroy(iter);
                iter = sam_itr_querys(idx, header, argv[2]);
                if(!iter) return -1;
//              fputs("DO STUFF\n",stdout);
        }
        while (sam_itr_next(in, iter, b) >= 0);

        fputs("done reading\n",stdout);

        hts_itr_destroy(iter);
        bam_destroy1(b);
        hts_idx_destroy(idx);
        bam_hdr_destroy(header);
        sam_close(in);
        return 0;
}
```

651a936b

Mar 29, 2018
- Prevent export of .appveyor.yml [minor] · 816a220c
  Rob Davies authored 7 years ago
  
  816a220c
- Pass in desired compression level on bgzf_open · 6b85e52b
  Rob Davies authored 7 years ago
  
  Avoids fiddling with the internals of the BGZF struct.
  6b85e52b
Mar 28, 2018
- Add bgzip manual page and NEWS item · 2d5fa5df
  Rob Davies authored 7 years ago
  
  Also minor tweak to bgzip usage.
  2d5fa5df
- Add -l / --compress-level option to bgzip · effcb1e1
  Nathan T. Weeks authored 7 years ago
  
  effcb1e1
Mar 26, 2018

Improved round-trip support for NM and MD tags in CRAM. · a5dc7e00

James Bonfield authored 7 years ago

See samtools/samtools#717 for discussion.

The NM tag is ambiguous and in fact differs in implementation between
htsjdk and htslib.  Specifically N in both ref and seq is considered
to be a mismatch for samtools and a match in picard.

If we detect this case we now also store NM and MD verbatim, along
with the suspect case of falling off the end of the reference (who
knows what people write to these fields in that ill-defined case).

This makes it more likely that a round-trip from SAM -> CRAM -> SAM
will work even when the input SAM was produced via htsjdk.

To be ultra careful, we also add store_md and store_nm options to
always store this data verbatim.  When combined with decode_md (note
this implicitly also implies decode_nm) this means it is possible to
round-trip while keeping these fields perfect even when they are set
to complete hogwash that neither picard nor samtools accepts, and also
to distinguish between the case of some reads having these fields
while others do not.

For example:

    samtools view -O cram,store_md=1,store_nm=1 in.sam -o out.cram
    samtools view -I decode_md=0 out.cram -o out.cram.sam

I thought about having a joint store_md field that covers NM too, but
there are reasons why we'd want to store NM verbatim and not MD such
as NM being tiny in comparison to MD and MD being more tightly defined
in the spec.

a5dc7e00

Mar 21, 2018
- News for spring release. · 11813d54
  Andrew Whitwham authored 7 years ago
  
  11813d54
Mar 15, 2018

Suppress unused function warnings in kseq · 53241915
Rob Davies authored 7 years ago

Left over work from commit 5ffc4a20

53241915

Avoid unintended macro expansion in KSORT_INIT_GENERIC · b49bb2d7

Rob Davies authored 7 years ago

NetBSD's libc #defines uint16_t to __uint16_t (and similarly for
other stdint types).  This was expanded by KSORT_INIT_GENERIC()
resulting in functions being defined with slightly different names
compared to the ones produced by using KSORT_INIT() directly.
The names also no longer matched the results of expanding
ks_mergesort() and friends.

Fix this by adjusting where the underscore gets pasted into the
names.  This means KSORT_INIT_GENERIC can use the argument
in a token pasting operation, which prevents it from being
expanded.
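
The preprocessor behaviour behind this fix can be reproduced in isolation. In the sketch below, `my_t`, `hidden_t` and the `NAME_*` macros are hypothetical stand-ins for uint16_t, __uint16_t and the ksort macros; only the expansion rule itself comes from the C standard: an argument that is an operand of `##` is not macro-expanded, while an argument forwarded to another macro is.

```c
#include <assert.h>
#include <string.h>

#define my_t hidden_t   /* mimics "#define uint16_t __uint16_t" */

#define STR_(x) #x
#define STR(x)  STR_(x)

/* Argument is pasted directly, so it is not expanded first. */
#define NAME_DIRECT(name)  fn_ ## name
/* Argument is forwarded, so it is fully expanded before pasting. */
#define NAME_FORWARD(name) NAME_DIRECT(name)
/* Fix: paste the underscore at the outer level, keeping the argument
   inside a ## operation at every step. */
#define NAME_INNER(suffix) fn ## suffix
#define NAME_FIXED(name)   NAME_INNER(_ ## name)

static const char *direct_name(void)  { return STR(NAME_DIRECT(my_t)); }
static const char *forward_name(void) { return STR(NAME_FORWARD(my_t)); }
static const char *fixed_name(void)   { return STR(NAME_FIXED(my_t)); }

/* The fixed macro must agree with the direct one. */
static void check_fix(void) {
    assert(strcmp(direct_name(), fixed_name()) == 0);
    assert(strcmp(direct_name(), forward_name()) != 0);
}
```

Here the forwarded form produces `fn_hidden_t`, whereas the direct and fixed forms both produce `fn_my_t`.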

b49bb2d7

Fix use of (possibly) signed char with ctype functions · 738c16d2

Rob Davies authored 7 years ago

For portability with platforms that still implement isspace() etc.
as an array.

cram/cram_io.c, vcf.c and hfile_libcurl.c use the wrappers in
textutils_internal.h

knetfile.c and kstring.c use casts so we can push the changes
upstream if we want to.
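
The cast-based approach can be sketched like this. `sketch_isspace` and `count_leading_space` are hypothetical names for illustration and are not the wrappers from textutils_internal.h; the point is only that a plain `char` may be negative, and passing a negative value other than EOF to a ctype function is undefined.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Safe wrapper: convert to unsigned char before calling isspace(). */
static inline int sketch_isspace(char c) {
    return isspace((unsigned char) c);
}

/* Count leading whitespace bytes in s. */
static size_t count_leading_space(const char *s) {
    assert(s != NULL);
    size_t n = 0;
    while (s[n] != '\0' && sketch_isspace(s[n]))
        n++;
    return n;
}
```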

738c16d2

Turn off format warnings in appveyor for mingw-w64 · 5fef1be2

Rob Davies authored 7 years ago

It appears -Wformat is broken in mingw-w64's gcc at present.
For example, see discussion at:
https://github.com/facebook/rocksdb/pull/2052#discussion_r108785499

As there is no easy fix, turn off format warnings in appveyor
configuration to reduce clutter in the output.

5fef1be2

Change some thread workers to use return instead of pthread_exit · fc6a5f93
Rob Davies authored 7 years ago

The effect is the same and it fixes some warnings in MinGW's
gcc.

fc6a5f93

Fix SunStudio compiler warnings · 2796b437

Rob Davies authored 7 years ago

Mainly unreachable code, a couple of integer issues and anonymous
unions.  Suppressed an anonymous union warning in cram_codecs.h
as it would need a lot of changes to fix.  A bogus 'end of loop not
reached' error is suppressed in knetfile.c (the compiler does
not like macros of the form `do { return; } while (0)`).

Replaced use of `diff -q` with `cmp` in test.pl as the -q option
is not supported in Solaris-derivatives.

2796b437

More general CRAM multi-threading fixes (not related to multi-slice). · 1c4acb41

James Bonfield authored 7 years ago

These were spotted by gcc -fsanitize=thread.

1. Decoding: fd->no_ref is set while decoding the compression header
and used in subsequent slice decodes, but we may then decode the next
container compression header before the previous slices have finished
decoding.

2. Decoding: avoid race when limiting data via required_fields.

fd->decode_md is now copied to s->decode_md, so the slice can disable
this itself if required (such as when the user asked for MD tag to be
filled out while also asking not to return any auxiliary blocks).

Although this technically fixes a threading violation, in practice it
is just a tidy-up rather than a real change in behaviour.

3. Encoding: cram_encode_aux needed an extra guard surrounding the
fd->tags_used field.  This is used to hold tag types seen so far in
the file, within any container or block, so we can keep track of the
compression methods that work best for any specific tag type.

1c4acb41

Fixed trivial memory leak in test harness · 18931338
James Bonfield authored 7 years ago

18931338

Bug fixes to multi-slice containers (mostly threading related). · 6b6c8a3f

James Bonfield authored 7 years ago

1) When skipping past slices to find one that overlaps the start of
our region, don't free container here (affects non-threading also).
Fixes a crash reported by xiaofeng.liu@sentieon.com.

2) Don't cache cram_get_block_by_id values in the codec as multiple
slices may be decoding in parallel using the same codec.  This also
removes the need for the reset function.

Instead we use the already existing per-slice lookup array, but
improved so it works (mostly without linear scan) on large ID aux
blocks too.  We could conceivably go the whole hog of using a hash
table, but I think it's overkill and this is minimum code.

3) We now distinguish between fd->ctr, c->curr_slice (being consumed
by get_seq calls) and fd->ctr_mt, c->curr_slice_mt (the read-ahead
for dispatching thread tasks).  Similarly for EOF / OOC (out of
container) parameters.

4) Cram multi-threaded flush now does the freeing of containers
better.

5) Added a larger input file of 1000 reads and a test using
multi-slice containers.
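
As a rough illustration of item 2 above, one way to build a per-slice lookup that avoids a linear scan in most cases is a direct-indexed table for small IDs plus a small bucket area for large ones, with a scan only on collision. Everything here (`sketch_block`, `sketch_slice`, the sizes 256 and 251) is an assumption made for the example and should not be read as the layout used in the CRAM code.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { int content_id; } sketch_block;

enum { DIRECT = 256, BUCKETS = 251 };

typedef struct {
    sketch_block *blocks;                   /* blocks owned by the slice */
    int nblocks;
    sketch_block *by_id[DIRECT + BUCKETS];  /* per-slice lookup table */
} sketch_slice;

/* Small IDs index directly; others hash into the bucket area. */
static int slot_for(int id) {
    if (id >= 0 && id < DIRECT) return id;
    unsigned int u = id < 0 ? 0u - (unsigned int) id : (unsigned int) id;
    return DIRECT + (int) (u % BUCKETS);
}

/* Fill the table; on a collision the first block keeps the slot. */
static void sketch_index(sketch_slice *s) {
    for (int i = 0; i < DIRECT + BUCKETS; i++) s->by_id[i] = NULL;
    for (int i = 0; i < s->nblocks; i++) {
        int slot = slot_for(s->blocks[i].content_id);
        if (s->by_id[slot] == NULL) s->by_id[slot] = &s->blocks[i];
    }
}

static sketch_block *sketch_lookup(sketch_slice *s, int id) {
    assert(s != NULL);
    sketch_block *b = s->by_id[slot_for(id)];
    if (b != NULL && b->content_id == id) return b;  /* fast path */
    for (int i = 0; i < s->nblocks; i++)             /* rare fallback */
        if (s->blocks[i].content_id == id) return &s->blocks[i];
    return NULL;
}
```

Because the table belongs to the slice rather than to the codec, slices decoded in parallel do not share any cached pointers.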

Also added the ability to debug the test harness with e.g. valgrind.

Set the TEST_PRECMD first.  For example:

    TEST_PRECMD="valgrind --leak-check=full" make check

6b6c8a3f
