chipseq

  • ChIP-Seq

    The ChIP-seq software provides methods for the analysis of ChIP-seq data and other types of mass genome annotation data.

    Description

    DNA sequencing has entered a new era with the development of massively parallel sequencing technologies. Chromatin immunoprecipitation (ChIP) enriches genomic DNA fragments based on their interaction with specific proteins. In combination with high-throughput sequencing of these fragments (ChIP-seq), the technique generates millions of short sequence reads (generally 30 to 50 bp in length) that are subsequently mapped back to the reference genome. The ChIP-seq protocol thereby generates a comprehensive definition of genomic loci sharing a common binding site or a particular epigenetic modification. The exploitation of such high-throughput experiments consequently calls for the development of new computational tools for handling ChIP-seq data as well as other types of next-generation sequencing (NGS) data.

    We propose a set of useful tools performing common ChIP-Seq data analysis tasks, including positional correlation analysis, peak detection, and genome partitioning into signal-rich and signal-poor regions.

    These tools exist as stand-alone programs and perform the following tasks:

    1. Positional correlation and generation of an aggregation plot for two genomic features (chipcor);
    2. Extraction of specific genome annotation features around reference genomic anchor points (chipextract);
    3. Read shifting (chipcenter);
    4. Narrow peak caller that uses a fixed width peak size (chippeak);
    5. Broad peak caller algorithm used for broad regions of enrichment (e.g. histone marks) (chippart);
    6. Feature selection tool based on a read count threshold (chipscore).

    The C programs are primarily optimized for speed. For this reason, they use their own compact format for ChIP-Seq data representation called SGA (Simplified Genome Annotation). SGA is a single-line-oriented and tab-delimited format, very similar to BED, with the following five obligatory fields:

    1. Sequence name/ID (Char String),
    2. Feature (Char String),
    3. Sequence Position (Integer),
    4. Strand (+/- or 0),
    5. Read Counts (Integer).

    Additional fields may be added containing application-specific information used by other programs. In the case of ChIP-seq data, SGA files represent genome-wide read count distributions from one or several experiments. The 'feature' field (field 2) contains a short code which identifies an experiment. It often corresponds to the name of the molecular target of a ChIP-seq experiment. Sequences (field 1) are identified by NCBI/RefSeq chromosome IDs, which are assembly-specific in order to prevent mixing of different assemblies. The position field (field 3) represents the start position of the sequence read. The strand field (field 4) indicates the strand to which the feature has been mapped. The read count field (field 5) gives the number of sequence reads that have been mapped to a specific position in the genome.

    Input features may be ChIP-seq read positions, peaks identified by ChIP-peak, or any type of genome annotation that can be mapped to a single base on a chromosome.

    An example of an SGA-formatted file is shown below:

    NC_000001.9     H3K4me3 4794    +       1
    NC_000001.9     H3K4me3 6090    +       1
    NC_000001.9     H3K4me3 6099    +       1
    NC_000001.9     H3K4me3 6655    +       1
    NC_000001.9     H3K4me3 18453   -       1
    NC_000001.9     H3K4me3 19285   +       1
    NC_000001.9     H3K4me3 44529   +       1
    NC_000001.9     H3K4me3 46333   +       1
    NC_000001.9     H3K4me3 46349   -       1
    NC_000001.9     H3K4me3 52929   +       1
    NC_000001.9     H3K4me3 59412   +       1
    ...
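
    To make the field layout concrete, here is a minimal Python sketch of a reader for lines like the ones above. It is not part of the ChIP-Seq distribution; the names `SgaRecord` and `parse_sga` are hypothetical, and the sketch assumes tab-delimited input with the five obligatory fields in the order listed earlier.

```python
from typing import Iterable, Iterator, NamedTuple


class SgaRecord(NamedTuple):
    """One SGA line: the five obligatory fields plus any extra fields."""
    seq_id: str     # field 1: sequence name/ID
    feature: str    # field 2: feature code (e.g. the ChIP-seq target)
    position: int   # field 3: sequence position
    strand: str     # field 4: '+', '-' or '0'
    count: int      # field 5: read count
    extra: tuple    # optional application-specific fields, kept as strings


def parse_sga(lines: Iterable[str]) -> Iterator[SgaRecord]:
    """Yield one SgaRecord per non-empty, tab-delimited line."""
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) < 5:
            raise ValueError(f"line {number}: expected at least 5 fields")
        seq_id, feature, position, strand, count = fields[:5]
        if strand not in ("+", "-", "0"):
            raise ValueError(f"line {number}: invalid strand {strand!r}")
        yield SgaRecord(seq_id, feature, int(position), strand,
                        int(count), tuple(fields[5:]))


if __name__ == "__main__":
    sample = [
        "NC_000001.9\tH3K4me3\t4794\t+\t1\n",
        "NC_000001.9\tH3K4me3\t18453\t-\t1\n",
    ]
    for record in parse_sga(sample):
        print(record)
```

    Keeping the optional fields as an untyped tuple is a deliberate simplification, since their meaning depends on the program that wrote them.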

    ChIP-Seq programs require SGA input files to be sorted by sequence name, position, and strand. In a UNIX environment, the following command sorts SGA files properly:

    sort -s -k1,1 -k3,3n -k4,4 <SGA file>
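
    For readers who handle SGA data in scripts, the same ordering can be sketched in Python. The function names below are hypothetical. The sketch assumes tab-delimited lines and compares strings by code point, which may differ from a locale-aware `sort`; Python's `sorted` is stable, like `sort -s`.

```python
def sga_sort_key(line):
    """Key mirroring -k1,1 -k3,3n -k4,4: sequence name, numeric position, strand."""
    fields = line.rstrip("\n").split("\t")
    return (fields[0], int(fields[2]), fields[3])


def sort_sga_lines(lines):
    """Return the lines sorted by sequence name, position and strand (stable)."""
    return sorted(lines, key=sga_sort_key)


def is_sga_sorted(lines):
    """Check whether the lines are already in the required order."""
    keys = [sga_sort_key(line) for line in lines]
    return all(a <= b for a, b in zip(keys, keys[1:]))


if __name__ == "__main__":
    unsorted = [
        "NC_000001.9\tH3K4me3\t6090\t+\t1",
        "NC_000001.9\tH3K4me3\t4794\t+\t1",
    ]
    print(is_sga_sorted(unsorted))
    print(sort_sga_lines(unsorted))
```

    For large files the `sort` command remains the practical choice, because this sketch holds all lines in memory.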

    SGA is a generic format that can be used to represent a large variety of genome annotations, e.g. the location of transcription start sites (TSS), matches to consensus sequences, or sequence conservation scores. Orientation-less features are assigned a strand value of 0.
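
    As an illustration of how other annotation types might be written in SGA form, the following Python sketch formats a single feature as an SGA line and falls back to strand 0 when no orientation is given. The function name `format_sga_line`, the feature codes `TSS` and `CONS`, and the positions are hypothetical and chosen only for the example.

```python
def format_sga_line(seq_id, feature, position, strand=None, count=1):
    """Build one tab-delimited SGA line; a missing strand becomes '0'."""
    if strand is None:
        strand = "0"  # orientation-less feature
    if strand not in ("+", "-", "0"):
        raise ValueError(f"invalid strand {strand!r}")
    return "\t".join([seq_id, feature, str(position), strand, str(count)])


if __name__ == "__main__":
    # An oriented feature, e.g. a transcription start site (made-up position).
    print(format_sga_line("NC_000001.9", "TSS", 12345, "+"))
    # An orientation-less feature, written with strand 0.
    print(format_sga_line("NC_000001.9", "CONS", 67890))
```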

    An example of use of the ChIP-Seq correlation tool (chipcor) is the following:

    chipcor -A "H3K4me3 +" -B "H3K4me3 -" -b -1000 -e 1000 -w 1 -c 20 -n 1 H3K4me3.sga > H3K4me3_fc_n1.out

    Here, 'H3K4me3.sga' is the file containing the ChIP-Seq sequence read distribution, which corresponds to the H3K4me3 histone modification data. The '-c' option specifies the cut-off in input counts.

    Reads corresponding to the histone modification on the positive strand (option '-A "H3K4me3 +"') are correlated with reads corresponding to the same histone modification on the opposite strand (option '-B "H3K4me3 -"'), and their relative distances are distributed in a histogram within the range [-1000; +1000] (options '-b -1000' and '-e 1000').

    The output file (H3K4me3_fc_n1.out) contains all histogram entries in simple text format. Histogram entries show count density values (option '-n 1') of the target feature at relative distances to the reference feature; that is, all bin entries are normalized by the total number of reference read counts and by the histogram window width.

    Such types of histograms are also called aggregation plots (APs). An aggregation plot shows the distribution of a particular genomic feature (e.g. a ChIP-seq signal) relative to a specified anchor point (e.g. a transcription start site) within a set of genomic regions.
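
    The idea behind such a histogram can be sketched in a few lines of Python. This is an illustration under stated assumptions, not chipcor's implementation: the function name is hypothetical, both inputs are taken to be on the same sequence and sorted by position, each reference/target pair is weighted by the product of its read counts, and strand orientation and the count cut-off are ignored. Only the normalization (total reference counts times window width) follows the description above.

```python
from bisect import bisect_left, bisect_right


def count_density_histogram(reference, target, begin, end, width):
    """Histogram of target positions relative to reference positions.

    reference, target: lists of (position, count), sorted by position,
    on the same sequence. Distances in [begin, end] are binned with the
    given window width and normalized to count densities.
    """
    n_bins = (end - begin) // width + 1
    hist = [0.0] * n_bins
    target_positions = [pos for pos, _ in target]
    total_ref = 0
    for ref_pos, ref_count in reference:
        total_ref += ref_count
        # Locate the targets whose distance to ref_pos lies in [begin, end].
        lo = bisect_left(target_positions, ref_pos + begin)
        hi = bisect_right(target_positions, ref_pos + end)
        for tgt_pos, tgt_count in target[lo:hi]:
            bin_index = (tgt_pos - ref_pos - begin) // width
            hist[bin_index] += ref_count * tgt_count
    if total_ref == 0:
        return hist
    # Normalize by total reference counts and window width.
    return [value / (total_ref * width) for value in hist]


if __name__ == "__main__":
    reference = [(100, 1), (200, 1)]
    target = [(101, 1), (199, 2), (300, 1)]
    for offset, density in zip(range(-2, 3),
                               count_density_histogram(reference, target, -2, 2, 1)):
        print(offset, density)
```

    The binary search keeps the inner loop restricted to targets inside the window, which is why sorted input is assumed.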


    ChIP-Seq has a web interface which is freely available at:

    https://epd.expasy.org/chipseq/