Skip to content
Snippets Groups Projects
Commit 71e277e1 authored by Giovanna Ambrosini's avatar Giovanna Ambrosini
Browse files

Changes to be committed:

	modified:   README.txt
	modified:   examples/README.html
	modified:   examples/README.txt
	new file:   examples/ctcf_real.mat
	new file:   examples/stat1_jaspar.mat
	new file:   examples/stat1_lpm.mat
	new file:   examples/stat1_pfm.mat
	new file:   examples/statx_transfac.mat
	modified:   perl_tools/lpmconvert.pl
	modified:   perl_tools/pwmconvert.pl
	modified:   pwm_bowtie_wrapper
	new file:   pwm_convert
	modified:   pwm_scan
parent 1e948e4e
Branches
Tags v1.1.2
No related merge requests found
......@@ -132,6 +132,10 @@ We also provide three (bash) shell wrapper scripts that embed the entire analysi
- pwm_bowtie_wrapper Scan a genome with a PWM and a p-value using Bowtie
- pwm_mscan_wrapper Scan a genome with a PWM and a p-value using matrix_scan
For matrix conversion into integer log-odds, we also provide the following bash script:
- pwm_convert Convert JASPAR, TRANSFAC, PFM, LPM and real PWM formats to integer log-odds
PWMScan pipeline and application example
============================================================================
......
......@@ -217,7 +217,7 @@
</li>
<p class="code">
mba -c 1128 -l 19 chen10_ctcf.mat | sort -k2,2 -nr &gt chen10_ctcf_co1128.dat<br><br>
mba -c 1128 -l 19 chen10_ctcf.mat &gt chen10_ctcf_co1128.dat<br>
<font color=firebrick># Convert the tag list into FASTA format (for bowtie):</font><br>
awk '{print "&gt"$2"\n"$1}' chen10_ctcf_co1128.dat &gt chen10_ctcf_co1128_taglist.fa<br>
</p>
......@@ -244,12 +244,12 @@
<p>
The command to map the list of PWM tags to hg19 is the follwing:
<p class="code">
<font color=firebrick># It takes of the order of 20 seconds for a full scan of the hg19 genome assembly (with the exclusion of the mitochondrial chromosome). The total number of hits is 143597.</font><br><br>
bowtie -l 19 -n0 -a BOWTIE_DIR/h_sapiens_hg19 -f chen10_ctcf_co1128_taglist.fa --un unmapped.dat &gt chen10_ctcf_co1128_bowtie.out<br>
<br>
bowtie -l 19 -n0 -a /home/local/db/bowtie/h_sapiens_hg19 -f chen10_ctcf_co1128_taglist.fa --un unmapped.dat &gt chen10_ctcf_co1128_bowtie.out<br>
<font color=firebrick># It takes of the order of 12 seconds for a full scan of the hg19 genome assembly (with the exclusion of the mitochondrial chromosome). The total number of hits is 143597.</font><br>
bowtie --threads 4 -l 19 -n0 -a BOWTIE_DIR/h_sapiens_hg19 -f chen10_ctcf_co1128_taglist.fa --un unmapped.dat &gt chen10_ctcf_co1128_bowtie.out<br>
<font color=firebrick># Example:</font>
<br>
<font color=firebrick># Convert the bowtie output to <b><span style='font-size:13px'>BEDdetail</span></b> via the) following command:</font><br>
bowtie --threads 4 -l 19 -n0 -a /home/local/db/bowtie/h_sapiens_hg19 -f chen10_ctcf_co1128_taglist.fa --un unmapped.dat &gt chen10_ctcf_co1128_bowtie.out<br>
<font color=firebrick># Convert the bowtie output to <b><span style='font-size:13px'>BEDdetail</span></b> via the following command:</font><br>
sort -s -k3,3 -k4,4n chen10_ctcf_co1128_bowtie.out | bowtie2bed -s hg19 -l 19 -i chr_NC_PATH | awk 'BEGIN { while((getline line < "chen10_ctcf_co1128_scoretab.txt") &gt 0 ) {split(line,f," "); pvalue[f[1]]=f[2]} close("chen10_ctcf_co1128_scoretab.txt")} {print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t"$6"\t""chen10_ctcf""\t""P-value="pvalue[$5]}' &gt chen10_ctcf_co1128_bowtie.bed<br>
<font color=firebrick># Example (use default chr_NC_PATH):</font><br>
......@@ -261,18 +261,13 @@
# Also note that the bowtie ouput reports the match score in the first field so that it can be easily retrieved.<br>
<br>
# To run the bowtie-based pipeline, type:</font><br>
bowtie -l 19 -n0 -a BOWTIE_DIR/h_sapiens_hg19 -f chen10_ctcf_co1128_taglist.fa --un unmapped.dat | sort -s -k3,3 -k4,4n |
bowtie --threads 4 -l 19 -n0 -a BOWTIE_DIR/h_sapiens_hg19 -f chen10_ctcf_co1128_taglist.fa --un unmapped.dat | sort -s -k3,3 -k4,4n |
bowtie2bed -s hg19 -l 19 -i chr_NC_PATH | awk 'BEGIN { while((getline line < "chen10_ctcf_co1128_scoretab.txt") &gt 0 ) {split(line,f," "); pvalue[f[1]]=f[2]} close("chen10_ctcf_co1128_scoretab.txt")} {print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t"$6"\t""chen10_ctcf""\t""P-value="pvalue[$5]}' &gt chen10_ctcf_co1128_bowtie.bed<br>
<br>
<font color=firebrick># Bowtie can read input files from stdin. You should specify "-" for stdin.<br>
# In such case, you can run the entire pipeline as follows:</font><br>
mba -c 1128 -l 19 chen10_ctcf.mat | sort -k2,2 -nr | awk '{print ">"$2"\n"$1}' | bowtie -l 19 -n0 -a BOWTIE_DIR/h_sapiens_hg19 -f - --un unmapped.dat | sort -s -k3,3 -k4,4n | bowtie2bed -s hg19 -l 19 -i chr_NC_PATH | awk 'BEGIN { while((getline line < "chen10_ctcf_co1128_scoretab.txt") &gt 0 ) {split(line,f," "); pvalue[f[1]]=f[2]} close("chen10_ctcf_co1128_scoretab.txt")} {print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t"$6"\t""chen10_ctcf""\t""P-value="pvalue[$5]}' &gt chen10_ctcf_co1128_bowtie.bed<br>
<br>
mba -c 1128 -l 19 chen10_ctcf.mat | awk '{print ">"$2"\n"$1}' | bowtie --threads 4 -l 19 -n0 -a BOWTIE_DIR/h_sapiens_hg19 -f - --un unmapped.dat | sort -s -k3,3 -k4,4n | bowtie2bed -s hg19 -l 19 -i chr_NC_PATH | awk 'BEGIN { while((getline line < "chen10_ctcf_co1128_scoretab.txt") &gt 0 ) {split(line,f," "); pvalue[f[1]]=f[2]} close("chen10_ctcf_co1128_scoretab.txt")} {print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t"$6"\t""chen10_ctcf""\t""P-value="pvalue[$5]}' &gt chen10_ctcf_co1128_bowtie.bed<br>
<font color=firebrick># Example:</font><br>
bowtie -l 19 -n0 -a /home/local/db/bowtie/h_sapiens_hg19 -f chen10_ctcf_co1128_taglist.fa --un unmapped.dat | sort -s -k3,3 -k4,4n | bowtie2bed -s hg19 -l 19 | awk 'BEGIN { while((getline line < "chen10_ctcf_co1128_scoretab.txt") &gt 0 ) {split(line,f," "); pvalue[f[1]]=f[2]} close("chen10_ctcf_co1128_scoretab.txt")} {print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t"$6"\t""chen10_ctcf""\t""P-value="pvalue[$5]}' &gt chen10_ctcf_co1128_bowtie.bed<br>
<br>
<font color=firebrick># or</font><br>
mba -c 1128 -l 19 chen10_ctcf.mat | sort -k2,2 -nr | awk '{print "&gt"$2"\n"$1}' | bowtie -l 19 -n0 -a /home/local/db/bowtie/h_sapiens_hg19 -f - --un unmapped.dat | sort -s -k3,3 -k4,4n | bowtie2bed -s hg19 -l 19 | awk 'BEGIN { while((getline line < "chen10_ctcf_co1128_scoretab.txt") &gt 0 ) {split(line,f," "); pvalue[f[1]]=f[2]} close("chen10_ctcf_co1128_scoretab.txt")} {print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t"$6"\t""chen10_ctcf""\t""P-value="pvalue[$5]}' &gt chen10_ctcf_co1128_bowtie.bed<br>
mba -c 1128 -l 19 chen10_ctcf.mat | awk '{print "&gt"$2"\n"$1}' | bowtie --threads 4 -l 19 -n0 -a /home/local/db/bowtie/h_sapiens_hg19 -f - --un unmapped.dat | sort -s -k3,3 -k4,4n | bowtie2bed -s hg19 -l 19 | awk 'BEGIN { while((getline line < "chen10_ctcf_co1128_scoretab.txt") &gt 0 ) {split(line,f," "); pvalue[f[1]]=f[2]} close("chen10_ctcf_co1128_scoretab.txt")} {print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t"$6"\t""chen10_ctcf""\t""P-value="pvalue[$5]}' &gt chen10_ctcf_co1128_bowtie.bed<br>
</p>
</li>
<p>
......@@ -298,7 +293,7 @@
chr|NC_000004|NC_000004.11
</pre></font>
<p>
<font color=firebrick># Parallelization of matrix_scan<br></font>
<font color=firebrick># Parallelization of matrix_scan</font><br>
<font color=black>To improve performance, we can run matrix_scan in parallel (on each chromosome file) using a simple python script called <b>matrix_scan_parallel.py</b> that dispatches the jobs to available CPU-cores. In this way, matrix_scan competes with the bowtie-based approach.</font>
<p>
<p class="code">
......@@ -307,20 +302,20 @@
<br>
<font color=firebrick># To use matrix_scan_parallel.py, type the following command:<br> </font>
<i>python /local/bin/matrix_scan_parallel.py -m chen10_ctcf.mat -f "$(ls /home/local/db/genome/hg19/chrom*.seq|grep -v chromMt)" -c 1128 -p 15 | sort -s -k1,1 -k2,2n -k6,6 &gt chen10_ctcf_co1128_matrix_scan.out</i>
<br>
<font color=firebrick># The -p option is used to set the maximum number of parallel processes for execution of matrix_scan. The above example takes only about 20 seconds to complete the task in agreement with Bowtie.<br>
<br><br>
<font color=firebrick># The -p option is used to set the maximum number of parallel processes for execution of matrix_scan. The above example takes only about 20 seconds to complete the task in agreement with Bowtie.<br><br>
# Convert the matrix scan output to <b><span style='font-size:13px'>BEDdetail</span></b> by issuing the following command:</font>
<br>
<i>mscan2bed -s hg19 -i GENOME_DIR chen10_ctcf_co1128_matrix_scan.out | awk 'BEGIN { while((getline line < "chen10_ctcf_co1128_scoretab.txt") &gt 0 ) {split(line,f," "); pvalue[f[1]]=f[2]} close("chen10_ctcf_co1128_scoretab.txt")} {print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t"$6"\t""chen10_ctcf""\t""P-value="pvalue[$5]}' &gt chen10_ctcf_co1128_matrix_scan.bed</i>
<br>
<font color=firebrick># Example:<br></font>
<i>mscan2bed -s hg19 chen10_ctcf_co1128_matrix_scan.out | awk 'BEGIN { while((getline line < "chen10_ctcf_co1128_scoretab.txt") &gt 0 ) {split(line,f," "); pvalue[f[1]]=f[2]} close("chen10_ctcf_co1128_scoretab.txt")} {print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t"$6"\t""chen10_ctcf""\t""P-value="pvalue[$5]}' &gt chen10_ctcf_co1128_matrix_scan.bed</i>
<br>
<br><br>
<font color=firebrick>
# Note : to convert matrix_scan output to BED format, we use the chr_NC_gi file for assembly hg19.<br>
# By default, the chr_NC_gi is located at chr_NC_PATH/hg19 = /home/local/db/genome/hg19.<br>
&nbsp;&nbsp;chr_NC_PATH can be changed by using the &lt-i&gt option.
<br>
<br><br>
# To run the entire pipeline:
<br></font>
<i>cat GENOME_DIR/hg19/chrom*.seq | matrix_scan -m chen10_ctcf.mat -c 1128 | sort -s -k1,1 -k2,2n -k6,6 | mscan2bed -s hg19 -i GENOME_DIR | awk 'BEGIN { while((getline line < "chen10_ctcf_co1128_scoretab.txt") &gt 0 ) {split(line,f," "); pvalue[f[1]]=f[2]} close("chen10_ctcf_co1128_scoretab.txt")} {print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t"$6"\t""chen10_ctcf""\t""P-value="pvalue[$5]}' &gt chen10_ctcf_co1128_matrix_scan.bed</i>
......@@ -372,7 +367,7 @@
<ul style='list-style-type: none;'>
<li><font color=firebrick><b>1) Bowtie (pwm_bowtie_wrapper)</b> script</font>
<p class="code">
# Usage: `basename $0` &ltmatrix-file&gt &ltp-value&gt &ltbowtie-dir&gt &ltidx-file&gt &ltassembly[hg19|mm9|..]&gt [&ltnon-overlapping matches [1/0]&gt]
# Usage: pwm_bowtie_wrapper &ltmatrix-file&gt &ltp-value&gt &ltbowtie-dir&gt &ltidx-file&gt &ltassembly[hg19|mm9|..]&gt [&ltnon-overlapping matches [1/0]&gt]
<br>
<font color=firebrick># Example:</font>
<br>
......@@ -384,7 +379,7 @@
</li><li><font color=firebrick><b>2) Matrix Scan (pwm_mscan_wrapper)</b> script</font>
<p class="code">
# Usage: `basename $0` &ltmatrix-file&gt &ltp-value&gt &ltgenome-dir&gt &ltassembly[hg19|mm9|..]&gt [&ltnon-overlapping matches [1/0]&gt] [&ltparallel [1/0]&gt]
# Usage: pwm_mscan_wrapper &ltmatrix-file&gt &ltp-value&gt &ltgenome-dir&gt &ltassembly[hg19|mm9|..]&gt [&ltnon-overlapping matches [1/0]&gt] [&ltparallel [1/0]&gt]
<br>
<font color=firebrick># Example:</font>
<br>
......@@ -397,7 +392,7 @@
</li><li><font color=firebrick><b>3) PWM Scan (pwm_scan)</b> script</font>
<p class="code">
# Usage: `basename $0` &ltmatrix-file&gt &ltp-value&gt &ltassembly[hg19|mm9|..]&gt &ltgenome-root-dir [ex:/db]&gt [&ltnon-overlapping matches [1/0]&gt] [&ltparallel [1/0]&gt]
# Usage: pwm_scan &ltmatrix-file&gt &ltp-value&gt &ltassembly[hg19|mm9|..]&gt &ltgenome-root-dir [ex:/db]&gt [&ltnon-overlapping matches [1/0]&gt] [&ltparallel [1/0]&gt]
<br>
<font color=firebrick># Examples:</font>
<br>
......@@ -425,13 +420,66 @@
Of course, this only works on a multi-core processor machine.
<br>
</li></ul>
<p>
<h3> How to build Bowtie index files</h3>
<h3> Shell (bash) Wrapper for matrix conversion</h3>
<p class="tab">
The pwm_convert bash script is used to convert JASPAR, TRANSFAC, PFM, LPM and real PWM formats to integer log-odds format.
<ul style='list-style-type: none;'>
<p class="code">
# Usage: pwm_convert &ltmatrix-file&gt -f=&ltformat [jaspar|transfac|lpm|pfm|real]&gt
<br>
&nbsp;&nbsp; OPTIONS:
<br>
&nbsp;&nbsp;&nbsp;&nbsp; -h show help message
<br>
&nbsp;&nbsp;&nbsp;&nbsp; -b=<bg-freq-file> bg nucleotide composition file
<br>
&nbsp;&nbsp;&nbsp;&nbsp; -c=<pseudo-cnt> pseudo-count fraction [def=0]
<br>
&nbsp;&nbsp;&nbsp;&nbsp; -m=<minscore> minimal score value [def=-10000]
<br>
&nbsp;&nbsp;&nbsp;&nbsp; -n=<log-scaling> log scaling factor [def=100]
<br>
&nbsp;&nbsp;&nbsp;&nbsp; -s=<mult> matrix multiplication factor (for real PWMs) [def=100]
<br>
&nbsp;&nbsp;&nbsp;&nbsp; Convert JASPAR, TRANSFAC, PFM, LPM and real PWM formats to integer log likelihoods (log-odds) format.
<p>
Let's define BOWTIE_DIR as as the directory containing the bowtie indices.
<font color=black>By default, the prior background nucleotide frequencies are set to 0.25.</font>
<br>
<p class="code">
<font color=firebrick># Examples</font>
<br>
<font color=firebrick># Convert JASPAR to log-odds:</font>
<br>
<i>pwm_convert stat1_jaspar.mat -f=jaspar 2&gt/dev/null</i><br>
<font color=firebrick># If we want to use different backgroung nucleotide frequencies:</font>
<br>
<i>pwm_convert stat1_jaspar.mat -f=jaspar -b=bg_probs_2.txt 2&gt/dev/null</i><br>
<font color=black># Where the bg_probs_2.txt file includes the prior bg frequencies, defined as follows:</font>
<br>
&nbsp;&nbsp;A 0.29 <br>
&nbsp;&nbsp;C 0.21 <br>
&nbsp;&nbsp;G 0.21 <br>
&nbsp;&nbsp;T 0.29 <br>
<font color=firebrick># Convert TRANSFAC to log-odds:</font>
<br>
<i>pwm_convert statx_transfac.mat -f=transfac -c=0.0000001 2&gt/dev/null</i><br>
<font color=firebrick># Convert PFM (Position Frequency Matrix) to log-odds:</font>
<br>
<i>pwm_convert pwm_convert stat1_pfm.mat -f=pfm 2&gt/dev/null</i><br>
<br>
</ul>
<p>
<h3> How to build Bowtie index files</h3>
<p class="tab">
<ul style='list-style-type: none;'>
<p>
Let's define BOWTIE_DIR as as the directory containing the bowtie indices.<br>
Go to the BOWTIE_DIR directory, and concatenate all chromosome files into a sequence file that includes the entire hg19 genome assembly:
<p class="code">
ls -1 -v DB_DIR/chrom*.seq | xargs cat &gt h_sapiens_hg19.fa
......
......@@ -153,7 +153,7 @@ matrix_prob --bg "0.29,0.21,0.21,0.29" chen10_ctcf.mat > chen10_ctcf_co1128_scor
This step is only necessary if fast string-matchig tools (such as Bowtie) are used to scan the genome.
mba -c 1128 -l 19 chen10_ctcf.mat | sort -k2,2 -nr > chen10_ctcf_co1128.dat
mba -c 1128 -l 19 chen10_ctcf.mat > chen10_ctcf_co1128.dat
Remove the score column:
......@@ -182,11 +182,11 @@ The reason for generating a match list in FASTA format is that we want to carry
bowtie -l 19 -n0 -a BOWTIE_DIR/h_sapiens_hg19 -f chen10_ctcf_co1128_taglist.fa --un unmapped.dat > chen10_ctcf_co1128_bowtie.out
# It takes of the order of 20 seconds for a full scan of the hg19 genome assembly (with the exclusion of the mitochondrial chromosome).
# It takes of the order of 12-15 seconds for a full scan of the hg19 genome assembly (with the exclusion of the mitochondrial chromosome).
The total number of hits is 143597.
# Example:
bowtie -l 19 -n0 -a /home/local/db/bowtie/h_sapiens_hg19 -f chen10_ctcf_co1128_taglist.fa --un unmapped.dat > chen10_ctcf_co1128_bowtie.out
bowtie --threads 4 -l 19 -n0 -a /home/local/db/bowtie/h_sapiens_hg19 -f chen10_ctcf_co1128_taglist.fa --un unmapped.dat > chen10_ctcf_co1128_bowtie.out
# Convert the bowtie output to BEDdetail via the following command:
......@@ -202,20 +202,16 @@ The reason for generating a match list in FASTA format is that we want to carry
# To run the bowtie-based pipeline, type:
bowtie -l 19 -n0 -a BOWTIE_DIR/h_sapiens_hg19 -f chen10_ctcf_co1128_taglist.fa --un unmapped.dat | sort -s -k3,3 -k4,4n | bowtie2bed -s hg19 -l 19 -i chr_NC_PATH | awk 'BEGIN { while((getline line < "chen10_ctcf_co1128_scoretab.txt") > 0 ) {split(line,f," "); pvalue[f[1]]=f[2]} close("chen10_ctcf_co1128_scoretab.txt")} {print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t"$6"\t""chen10_ctcf""\t""P-value="pvalue[$5]}' > chen10_ctcf_co1128_bowtie.bed
bowtie --threads 4 -l 19 -n0 -a BOWTIE_DIR/h_sapiens_hg19 -f chen10_ctcf_co1128_taglist.fa --un unmapped.dat | sort -s -k3,3 -k4,4n | bowtie2bed -s hg19 -l 19 -i chr_NC_PATH | awk 'BEGIN { while((getline line < "chen10_ctcf_co1128_scoretab.txt") > 0 ) {split(line,f," "); pvalue[f[1]]=f[2]} close("chen10_ctcf_co1128_scoretab.txt")} {print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t"$6"\t""chen10_ctcf""\t""P-value="pvalue[$5]}' > chen10_ctcf_co1128_bowtie.bed
# Bowtie can read input files from stdin. You should specify "-" for stdin.
In such case, you can run the entire pipeline as follows:
mba -c 1128 -l 19 chen10_ctcf.mat | sort -k2,2 -nr | awk '{print ">"$2"\n"$1}' | bowtie -l 19 -n0 -a BOWTIE_DIR/h_sapiens_hg19 -f - --un unmapped.dat | sort -s -k3,3 -k4,4n | bowtie2bed -s hg19 -l 19 -i chr_NC_PATH | awk 'BEGIN { while((getline line < "chen10_ctcf_co1128_scoretab.txt") > 0 ) {split(line,f," "); pvalue[f[1]]=f[2]} close("chen10_ctcf_co1128_scoretab.txt")} {print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t"$6"\t""chen10_ctcf""\t""P-value="pvalue[$5]}' > chen10_ctcf_co1128_bowtie.bed
mba -c 1128 -l 19 chen10_ctcf.mat | awk '{print ">"$2"\n"$1}' | bowtie --threads 4 -l 19 -n0 -a BOWTIE_DIR/h_sapiens_hg19 -f - --un unmapped.dat | sort -s -k3,3 -k4,4n | bowtie2bed -s hg19 -l 19 -i chr_NC_PATH | awk 'BEGIN { while((getline line < "chen10_ctcf_co1128_scoretab.txt") > 0 ) {split(line,f," "); pvalue[f[1]]=f[2]} close("chen10_ctcf_co1128_scoretab.txt")} {print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t"$6"\t""chen10_ctcf""\t""P-value="pvalue[$5]}' > chen10_ctcf_co1128_bowtie.bed
# Example:
bowtie -l 19 -n0 -a /home/local/db/bowtie/h_sapiens_hg19 -f chen10_ctcf_co1128_taglist.fa --un unmapped.dat | sort -s -k3,3 -k4,4n | bowtie2bed -s hg19 -l 19 | awk 'BEGIN { while((getline line < "chen10_ctcf_co1128_scoretab.txt") > 0 ) {split(line,f," "); pvalue[f[1]]=f[2]} close("chen10_ctcf_co1128_scoretab.txt")} {print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t"$6"\t""chen10_ctcf""\t""P-value="pvalue[$5]}' > chen10_ctcf_co1128_bowtie.bed
# Or:
mba -c 1128 -l 19 chen10_ctcf.mat | sort -k2,2 -nr | awk '{print ">"$2"\n"$1}' | bowtie -l 19 -n0 -a /home/local/db/bowtie/h_sapiens_hg19 -f - --un unmapped.dat | sort -s -k3,3 -k4,4n | bowtie2bed -s hg19 -l 19 | awk 'BEGIN { while((getline line < "chen10_ctcf_co1128_scoretab.txt") > 0 ) {split(line,f," "); pvalue[f[1]]=f[2]} close("chen10_ctcf_co1128_scoretab.txt")} {print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t"$6"\t""chen10_ctcf""\t""P-value="pvalue[$5]}' > chen10_ctcf_co1128_bowtie.bed
mba -c 1128 -l 19 chen10_ctcf.mat | awk '{print ">"$2"\n"$1}' | bowtie --threads 4 -l 19 -n0 -a /home/local/db/bowtie/h_sapiens_hg19 -f - --un unmapped.dat | sort -s -k3,3 -k4,4n | bowtie2bed -s hg19 -l 19 | awk 'BEGIN { while((getline line < "chen10_ctcf_co1128_scoretab.txt") > 0 ) {split(line,f," "); pvalue[f[1]]=f[2]} close("chen10_ctcf_co1128_scoretab.txt")} {print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t"$6"\t""chen10_ctcf""\t""P-value="pvalue[$5]}' > chen10_ctcf_co1128_bowtie.bed
5) Using matrix_scan
......@@ -249,7 +245,7 @@ The reason for generating a match list in FASTA format is that we want to carry
python /local/bin/matrix_scan_parallel.py -m chen10_ctcf.mat -f "$(ls /home/local/db/genome/hg19/chrom*.seq|grep -v chromMt)" -c 1128 -p 15 | sort -s -k1,1 -k2,2n -k6,6 > chen10_ctcf_co1128_matrix_scan.out
The -p option is used to set the maximum number of parallel processes for execution of matrix_scan. The above example takes only about 20 seconds to complete the task in agreement with Bowtie.
The -p option is used to set the maximum number of parallel processes for execution of matrix_scan. The above example takes only about 15 seconds to complete the task in agreement with Bowtie.
# Convert the matrix scan output to BEDdetail by issuing the following command:
......@@ -287,7 +283,7 @@ Shell (bash) Wrappers to execute the analysis pipeline
The output file is a match list in BEDdetail format.
-------------------------------------------------------------------------------
------------------------------------------------------------------------------
A few remarks on the general pwm_scan script.
......@@ -303,33 +299,33 @@ Shell (bash) Wrappers to execute the analysis pipeline
# The chrNC_dir variable (used for locating the the chr_NC_gi/chr_hdr files) is set to chrNC_dir=<genome-root-dir>/genome.
# The bin_dir varaible (used for locating the matrix_scan_parallel.py script) is set to /home/local/bin.
-------------------------------------------------------------------------------
------------------------------------------------------------------------------
1) Bowtie wrapper script
# Usage: `basename $0` <matrix-file> <p-value> <bowtie-dir> <genome-idx-file> <assembly[hg19|mm9|..]> [<non-overlapping matches [1/0]>]
# Usage: pwm_bowtie_wrapper <matrix-file> <p-value> <bowtie-dir> <genome-idx-file> <assembly[hg19|mm9|..]> [<non-overlapping matches [1/0]>]
# Example:
pwm_bowtie_wrapper chen10_ctcf.mat 0.00001 /home/local/db/bowtie h_sapiens_hg19 hg19 0
# Output file (BED) : chen10_ctcf_co1128_bowtie.bed
-------------------------------------------------------------------------------
------------------------------------------------------------------------------
2) Matrix Scan (matrix_scan) wrapper script
# Usage: `basename $0` <matrix-file> <p-value> <genome-dir> <assembly[hg19|mm9|..]> [<non-overlapping matches [1/0]>] [<parallel [1/0]>]
# Usage: pwm_mscan_wrapper <matrix-file> <p-value> <genome-dir> <assembly[hg19|mm9|..]> [<non-overlapping matches [1/0]>] [<parallel [1/0]>]
# Example:
pwm_mscan_wrapper chen10_ctcf.mat 0.00001 /home/local/db/genome hg19 0 1
# Output file (BED) : chen10_ctcf_co1128_matrix_scan.bed
-------------------------------------------------------------------------------
------------------------------------------------------------------------------
3) PWM Scan (pwm_scan) wrapper script
# Usage: `basename $0` <matrix-file> <p-value> <assembly[hg19|mm9|..]> <genome-root-dir [ex:/db]> [<non-overlapping matches [1/0]>] [<parallel [1/0>]>]
# Usage: pwm_scan <matrix-file> <p-value> <assembly[hg19|mm9|..]> <genome-root-dir [ex:/db]> [<non-overlapping matches [1/0]>] [<parallel [1/0>]>]
# Examples:
......@@ -354,6 +350,52 @@ Shell (bash) Wrappers to execute the analysis pipeline
# Of course it only works on a multi-core processor machine.
Shell (bash) Wrapper for matrix conversion
==============================================================================
The pwm_convert bash script is used to convert JASPAR, TRANSFAC, PFM, LPM and real PWM formats to integer log-odds format.
# Usage: pwm_convert <matrix-file> -f=<format [jaspar|transfac|lpm|pfm|real]>
OPTIONS:
-h show help message
-b=<bg-freq-file> bg nucleotide composition file
-c=<pseudo-cnt> pseudo-count fraction [def=0]
-m=<minscore> minimal score value [def=-10000]
-n=<log-scaling> log scaling factor [def=100]
-s=<mult> matrix multiplication factor (for real PWMs) [def=100]
Convert JASPAR, TRANSFAC, PFM, LPM and real PWM formats to integer log likelihoods (log-odds) format.
By default, the prior background nucleotide frequencies are set to 0.25.
# Examples
# Convert JASPAR to log-odds:
pwm_convert stat1_jaspar.mat -f=jaspar 2>/dev/null
# If we want to use different backgroung nucleotide frequencies:
pwm_convert stat1_jaspar.mat -f=jaspar -b=bg_probs_2.txt 2>/dev/null
# Where the bg_probs_2.txt file includes the prior bg frequencies, defined as follows:
A 0.29
C 0.21
G 0.21
T 0.29
# Convert TRANSFAC to log-odds:
pwm_convert statx_transfac.mat -f=transfac -c=0.0000001 2>/dev/null
# Convert PFM (Position Frequency Matrix) to to log-odds:
pwm_convert stat1_pfm.mat -f=pfm 2>/dev/null
------------------------------------------------------------------------------
How to build Bowtie index files
==============================================================================
......
>CTCF-motif
-0.96 -0.89 0.00 -0.72
-0.59 -1.83 0.00 -1.22
-2.64 0.00 -3.50 -3.10
-4.48 0.00 -4.70 -4.71
0.00 -2.54 -2.29 -1.84
-2.12 -0.12 0.00 -3.09
-0.31 0.00 -2.13 -0.32
0.00 -3.88 -2.90 -2.74
-4.15 -4.69 0.00 -5.28
0.00 -4.05 -0.05 -3.56
-3.20 -4.33 0.00 -0.89
-4.02 -5.28 0.00 -4.95
-3.36 -2.68 0.00 -2.74
-1.83 0.00 -2.99 -2.21
0.00 -2.37 -0.04 -2.77
-1.73 0.00 -0.53 -1.17
-0.96 -0.08 -1.95 0.00
0.00 -0.74 -0.54 -1.63
>MA0137.2 STAT1
A [ 208 859 251 10 8 106 23 528 696 53 1900 2030 954 336 417 ]
C [1076 496 574 22 14 1921 1900 762 31 30 124 17 263 760 804 ]
G [ 415 279 144 11 38 14 7 115 1292 1700 29 23 552 270 425 ]
T [ 378 446 1112 2038 2023 44 155 680 66 302 32 15 315 714 431 ]
> letter-probability matrix V_STAT1_01: alength= 4 w= 21 nsites= 100.1 E= 0
0.080000 0.180000 0.419000 0.321000
0.232232 0.097097 0.372372 0.298298
0.138723 0.312375 0.197605 0.351297
0.175175 0.379379 0.381381 0.064064
0.595405 0.134865 0.150849 0.118881
0.311688 0.262737 0.093906 0.331668
0.000000 0.000000 0.000000 1.000000
0.000000 0.000000 0.000000 1.000000
0.039000 0.909000 0.020000 0.032000
0.000000 1.000000 0.000000 0.000000
0.000000 0.325000 0.675000 0.000000
0.000000 0.000000 1.000000 0.000000
0.038038 0.000000 0.949950 0.012012
1.000000 0.000000 0.000000 0.000000
1.000000 0.000000 0.000000 0.000000
0.362637 0.087912 0.294705 0.254745
0.111000 0.111000 0.111000 0.667000
0.198198 0.105105 0.696697 0.000000
0.248000 0.127000 0.466000 0.159000
0.125125 0.367367 0.492492 0.015015
0.317682 0.125874 0.396603 0.159840
> M00224 STAT1
8.0 18.0 41.9 32.1
23.2 9.7 37.2 29.8
13.9 31.3 19.8 35.2
17.5 37.9 38.1 6.4
59.6 13.5 15.1 11.9
31.2 26.3 9.4 33.2
0 0 0 100
0 0 0 100
3.9 90.9 2 3.2
0 100 0 0
0 32.5 67.5 0
0 0 100 0
3.8 0 94.9 1.2
100 0 0 0
100 0 0 0
36.3 8.8 29.5 25.5
11.1 11.1 11.1 66.7
19.8 10.5 69.6 0
24.8 12.7 46.6 15.9
12.5 36.7 49.2 1.5
31.8 12.5 39.7 16
AC M00223
XX
ID V$STAT_01
XX
DT 29.11.1995 (created); ewi.
DT 11.03.2003 (updated); dtc.
CO Copyright (C), Biobase GmbH.
XX
NA STATx
XX
DE signal transducers and activators of transcription
XX
BF T01575 STAT1alpha; Species: mouse, Mus musculus.
BF T01492 STAT1alpha; Species: human, Homo sapiens.
BF T01573 STAT1beta; Species: human, Homo sapiens.
XX
PO A C G T
01 0 0 0 14 T
02 0 0 0 14 T
03 2 12 0 0 C
04 0 12 0 2 C
05 0 9 2 3 C
06 4 2 8 0 G
07 0 0 8 6 K
08 13 1 0 0 A
09 14 0 0 0 A
XX
BA 14 genomic sequences of 14 different genes
XX
CC compiled sequences
XX
RN [1]; RE0003481.
RX PUBMED: 7774815.
RA Horvath C. M., Wen Z., Darnell jr J. E.
RT A STAT protein domain that determines DNA sequence recognition suggests a novel DNA-binding domain
RL Genes Dev. 9:984-994 (1995).
XX
......@@ -207,18 +207,26 @@ while ($line = <MF>) {
for ($i_motif = 0; $i_motif < $len; $i_motif++) {
for ($i_base = 0; $i_base < $num_bases; $i_base++) {
if ($ofile) {
if ($motif{$i_base, $i_motif} > 0) {
if ($c > 0) {
printf(OF "%7d ", round((log( ($motif{$i_base, $i_motif} + $bg{$bases[$i_base]} * $c) / ($bg{$bases[$i_base]} * (1 + $c)) )/log(2.0))*$logscl) );
} else {
if ($motif{$i_base, $i_motif} > 0) {
printf(OF "%7d ", round((log( $motif{$i_base, $i_motif} / $bg{$bases[$i_base]} )/log(2.0))*$logscl) );
} else {
printf(OF "%7d ", $minscore);
}
}
if ($motif{$i_base, $i_motif} > 0) {
}
if ($c > 0) {
printf("%7d ", round((log( ($motif{$i_base, $i_motif} + $bg{$bases[$i_base]} * $c) / ($bg{$bases[$i_base]} * (1 + $c)) )/log(2.0))*$logscl) );
} else {
if ($motif{$i_base, $i_motif} > 0) {
printf("%7d ", round((log( $motif{$i_base, $i_motif} / $bg{$bases[$i_base]} )/log(2.0))*$logscl) );
} else {
printf("%7d ", $minscore);
}
}
}
if ($ofile) {
print(OF "\n");
}
......
......@@ -26,7 +26,6 @@ $bg{"T"} = 0.25;
my $usage = "USAGE: pwmconvert [options] <matrix file>
Options:
-bg <background file> set of f_a
-c check format : check whether PWM scores are Real or Integer
and print out the corresponding Real/Integer flag
-n <scaling factor> set scaling factor (int) for real format only
......@@ -58,9 +57,7 @@ my $id_list = "";
my $header = 1;
while (scalar(@ARGV) > 1) {
$next_arg = shift(@ARGV);
if ($next_arg eq "-bg") {
$bg_file = shift(@ARGV);
} elsif ($next_arg eq "-c") {
if ($next_arg eq "-c") {
$check_flag = 1;
} elsif ($next_arg eq "-n") {
$scale = shift(@ARGV);
......@@ -83,34 +80,6 @@ while (scalar(@ARGV) > 1) {
}
($matrix_file) = @ARGV;
# read the background file
if (defined($bg_file)) {
open($bg_file, "<$bg_file") || die("Can't open $bg_file.\n");
$total_bg = 0;
while (<$bg_file>) {
next if (/^#/); # skip comments
($a, $f) = split;
if ($a eq "A" || $a eq "a") {
$bg{"A"} = $f;
$total_bg += $f;
} elsif ($a eq "C" || $a eq "c") {
$bg{"C"} = $f;
$total_bg += $f;
} elsif ($a eq "G" || $a eq "g") {
$bg{"G"} = $f;
$total_bg += $f;
} elsif ($a eq "T" || $a eq "t") {
$bg{"T"} = $f;
$total_bg += $f;
}
}
# make sure they sum to 1
foreach $key (keys %bg) {
$bg{$key} /= $total_bg;
#printf STDERR "$key $bg{$key}\n";
}
} # background file
# Open the matrix file for reading.
open(MF, "<$matrix_file") || die("Can't open $matrix_file.\n");
# Open the output file
......
......@@ -126,13 +126,13 @@ pwmout_bed=${matrix_name}_co${matrix_score}_bowtie.bed
echo "======== Bowtie-based pipeline ========"
if [ $non_overlapping == 0 ]
then
echo "mba -c $matrix_score -l $matrix_len $matrix_file | sort -k2,2 -nr | awk '{print \">\"\$2\"\n\"\$1}' | bowtie --threads 4 -l $matrix_len -n0 -a $bowtie_dir/$genome_idx_file -f - --un unmapped.dat | sort -s -k3,3 -k4,4n | bowtie2bed -s $assembly -l $matrix_len -i $chrNC_dir | awk 'BEGIN { while((getline line < \"$pwmScore_tab\") > 0 ) {split(line,f,\" \"); pvalue[f[1]]=f[2]} close(\"$pwmScore_tab\")} {print \$1\"\t\"\$2\"\t\"\$3\"\t\"\$4\"\t\"\$5\"\t\"\$6\"\t\"\"$matrix_name\"\"\t\"\"P-value=\"pvalue[\$5]}' > $pwmout_bed"
echo "mba -c $matrix_score -l $matrix_len $matrix_file | awk '{print \">\"\$2\"\n\"\$1}' | bowtie --threads 4 -l $matrix_len -n0 -a $bowtie_dir/$genome_idx_file -f - --un unmapped.dat | sort -s -k3,3 -k4,4n | bowtie2bed -s $assembly -l $matrix_len -i $chrNC_dir | awk 'BEGIN { while((getline line < \"$pwmScore_tab\") > 0 ) {split(line,f,\" \"); pvalue[f[1]]=f[2]} close(\"$pwmScore_tab\")} {print \$1\"\t\"\$2\"\t\"\$3\"\t\"\$4\"\t\"\$5\"\t\"\$6\"\t\"\"$matrix_name\"\"\t\"\"P-value=\"pvalue[\$5]}' > $pwmout_bed"
echo "..."
mba -c $matrix_score -l $matrix_len $matrix_file | sort -k2,2 -nr | awk '{print ">"$2"\n"$1}' | bowtie --threads 4 -l $matrix_len -n0 -a $bowtie_dir/$genome_idx_file -f - --un unmapped.dat | sort -s -k3,3 -k4,4n | bowtie2bed -s $assembly -l $matrix_len -i $chrNC_dir | awk -v scoretab="$pwmScore_tab" -v pwmname="$matrix_name" 'BEGIN { while((getline line < scoretab) > 0 ) {split(line,f," "); pvalue[f[1]]=f[2]} close(scoretab)} {print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t"$6"\t"pwmname"\t""P-value="pvalue[$5]}' > $pwmout_bed
mba -c $matrix_score -l $matrix_len $matrix_file | awk '{print ">"$2"\n"$1}' | bowtie --threads 4 -l $matrix_len -n0 -a $bowtie_dir/$genome_idx_file -f - --un unmapped.dat | sort -s -k3,3 -k4,4n | bowtie2bed -s $assembly -l $matrix_len -i $chrNC_dir | awk -v scoretab="$pwmScore_tab" -v pwmname="$matrix_name" 'BEGIN { while((getline line < scoretab) > 0 ) {split(line,f," "); pvalue[f[1]]=f[2]} close(scoretab)} {print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t"$6"\t"pwmname"\t""P-value="pvalue[$5]}' > $pwmout_bed
else
echo "mba -c $matrix_score -l $matrix_len $matrix_file | sort -k2,2 -nr | awk '{print \">\"\$2\"\n\"\$1}' | bowtie --threads 4 -l $matrix_len -n0 -a $bowtie_dir/$genome_idx_file -f - --un unmapped.dat | sort -s -k3,3 -k4,4n | bowtie2bed -s $assembly -l $matrix_len -i $chrNC_dir | filterOverlaps -l$matrix_len | awk 'BEGIN { while((getline line < \"$pwmScore_tab\") > 0 ) {split(line,f,\" \"); pvalue[f[1]]=f[2]} close(\"$pwmScore_tab\")} {print \$1\"\t\"\$2\"\t\"\$3\"\t\"\$4\"\t\"\$5\"\t\"\$6\"\t\"\"$matrix_name\"\"\t\"\"P-value=\"pvalue[\$5]}' > $pwmout_bed"
echo "mba -c $matrix_score -l $matrix_len $matrix_file | awk '{print \">\"\$2\"\n\"\$1}' | bowtie --threads 4 -l $matrix_len -n0 -a $bowtie_dir/$genome_idx_file -f - --un unmapped.dat | sort -s -k3,3 -k4,4n | bowtie2bed -s $assembly -l $matrix_len -i $chrNC_dir | filterOverlaps -l$matrix_len | awk 'BEGIN { while((getline line < \"$pwmScore_tab\") > 0 ) {split(line,f,\" \"); pvalue[f[1]]=f[2]} close(\"$pwmScore_tab\")} {print \$1\"\t\"\$2\"\t\"\$3\"\t\"\$4\"\t\"\$5\"\t\"\$6\"\t\"\"$matrix_name\"\"\t\"\"P-value=\"pvalue[\$5]}' > $pwmout_bed"
echo "..."
mba -c $matrix_score -l $matrix_len $matrix_file | sort -k2,2 -nr | awk '{print ">"$2"\n"$1}' | bowtie --threads 4 -l $matrix_len -n0 -a $bowtie_dir/$genome_idx_file -f - --un unmapped.dat | sort -s -k3,3 -k4,4n | bowtie2bed -s $assembly -l $matrix_len -i $chrNC_dir | filterOverlaps -l$matrix_len | awk -v scoretab="$pwmScore_tab" -v pwmname="$matrix_name" 'BEGIN { while((getline line < scoretab) > 0 ) {split(line,f," "); pvalue[f[1]]=f[2]} close(scoretab)} {print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t"$6"\t"pwmname"\t""P-value="pvalue[$5]}' > $pwmout_bed
mba -c $matrix_score -l $matrix_len $matrix_file | awk '{print ">"$2"\n"$1}' | bowtie --threads 4 -l $matrix_len -n0 -a $bowtie_dir/$genome_idx_file -f - --un unmapped.dat | sort -s -k3,3 -k4,4n | bowtie2bed -s $assembly -l $matrix_len -i $chrNC_dir | filterOverlaps -l$matrix_len | awk -v scoretab="$pwmScore_tab" -v pwmname="$matrix_name" 'BEGIN { while((getline line < scoretab) > 0 ) {split(line,f," "); pvalue[f[1]]=f[2]} close(scoretab)} {print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t"$6"\t"pwmname"\t""P-value="pvalue[$5]}' > $pwmout_bed
fi
echo "List of matches (BED format) : $pwmout_bed"
......
#!/bin/bash
# pwm_convert: Matrix conversion utility
#
# Convert JASPAR, TRANSFAC, PFM, LPM, and real PWM formats
# to integer log likelihoods (log-odds) format.
#
# arguments:
# matrix-file
# matrix format
# bg base composition
# pseudo-count fraction
# minimal log score
# log scaling factor
#
# 02.11.2107 Giovanna Ambrosini
ARGS=2 # Script takes 6 arguments, but at least 2 arguments are mandatory.
E_BADARGS=85 # Wrong number of arguments passed to script.
bg_freq=""
pseudocnt=0
minscore=-10000
logscl=100
mult=100
format=""
function usage()
{
echo "Usage: `basename $0` <matrix-file> -f=<format [jaspar|transfac|lpm|pfm|real]>"
echo ""
echo "OPTIONS:"
echo -e "\t-h show help message"
echo -e "\t-b=<bg-freq-file> bg nucleotide composition file"
echo -e "\t-c=<pseudo-cnt> pseudo-count fraction [def=$pseudocnt]"
echo -e "\t-m=<minscore> minimal score value [def=$minscore]"
echo -e "\t-n=<log-scaling> log scaling factor [def=$logscl]"
echo -e "\t-s=<mult> matrix multiplication factor (for real PWMs) [def=$mult]"
echo -e ""
echo -e "\tConvert JASPAR, TRANSFAC, PFM, LPM and real PWM formats to integer log likelihoods (log-odds) format."
echo ""
}
if [ $# -lt "$ARGS" ]
then
usage
exit $E_BADARGS
fi
if [ -f "$1" ]
then
matrix_file=$1
else
echo "File \"$1\" does not exist."
exit $E_BADARGS
fi
bg_freq="bg_probs.txt"
echo -e "A 0.25\nC 0.25\nG 0.25\nT 0.25" > $bg_freq
if [ "$2" != "" ]
then
format=`echo $2 | awk -F= '{print $2}'`
fi
while [ "$3" != "" ]; do
PARAM=`echo $3 | awk -F= '{print $1}'`
VALUE=`echo $3 | awk -F= '{print $2}'`
case $PARAM in
-h)
usage
exit
;;
-b)
bg_freq=$VALUE
;;
-c)
pseudocnt=$VALUE
;;
-m)
minscore=$VALUE
;;
-n)
logscl=$VALUE
;;
-s)
mult=$VALUE
;;
*)
echo "ERROR: unknown parameter \"$PARAM\""
usage
exit 1
;;
esac
shift
done
echo "BG base composition file: $bg_freq" >&2;
cat $bg_freq >&2;
echo "Pseudo-count fraction: $pseudocnt" >&2;
echo "Monimal score value: $minscore" >&2;
echo "Log scaling factor: $logscl" >&2;
# Extract basename from matrix file (without path)
matrix_name=$(basename "$matrix_file")
extension="${matrix_name##*.}"
matrix_name="${matrix_name%.*}"
echo "PWM name: $matrix_name" >&2
echo "PWM format: $format" >&2
echo "======== Computing PWM length ========" >&2
matrix_len=$(cat $matrix_file | perl -ane 'next if (/^#/ or /^>/); print;' | wc -l)
file_len=$(cat $matrix_file | wc -l)
if [ $file_len -gt $matrix_len ]
then
sed 1d $matrix_file > $matrix_file".tmp"
matrix_file=$matrix_file".tmp"
fi
echo "PWM file: $matrix_file" >&2
echo "PWM length: $matrix_len" >&2
echo "PWM file length : $file_len" >&2
matrix_file_tmp=$matrix_file".tmp"
echo "======== PWM conversion ========" >&2
if [ $format == 'jaspar' ]
then
if [ $(echo "$pseudocnt > 0" | bc) -eq 1 ]
then
echo "jasparconvert.pl -c $pseudocnt -m $minscore -n $logscl -bg $bg_freq -w l -noheader -o $matrix_file_tmp $matrix_file 1>/dev/null 2>/dev/null" >&2
jasparconvert.pl -c $pseudocnt -m $minscore -n $logscl -bg $bg_freq -w l -noheader -o $matrix_file_tmp $matrix_file 1>/dev/null 2>/dev/null; cat $matrix_file_tmp
else
echo "jasparconvert.pl -m $minscore -n $logscl -bg $bg_freq -w l -noheader -o $matrix_file_tmp $matrix_file 1>/dev/null 2>/dev/null" >&2
jasparconvert.pl -m $minscore -n $logscl -bg $bg_freq -w l -noheader -o $matrix_file_tmp $matrix_file 1>/dev/null 2>/dev/null; cat $matrix_file_tmp
fi
fi
if [ $format == 'transfac' ]
then
if [ $(echo "$pseudocnt > 0" | bc) -eq 1 ]
then
echo "transfaconvert.pl -c $pseudocnt -m $minscore -n $logscl -bg $bg_freq -w l -noheader -o $matrix_file_tmp $matrix_file 1>/dev/null 2>/dev/null" >&2
transfaconvert.pl -c $pseudocnt -m $minscore -n $logscl -bg $bg_freq -w l -noheader -o $matrix_file_tmp $matrix_file 1>/dev/null 2>/dev/null; cat $matrix_file_tmp
else
echo "transfaconvert.pl -m $minscore -n $logscl -bg $bg_freq -w l -noheader -o $matrix_file_tmp $matrix_file 1>/dev/null 2>/dev/null" >&2
transfaconvert.pl -m $minscore -n $logscl -bg $bg_freq -w l -noheader -o $matrix_file_tmp $matrix_file 1>/dev/null 2>/dev/null; cat $matrix_file_tmp
fi
fi
if [ $format == 'pfm' ]
then
if [ $(echo "$pseudocnt > 0" | bc) -eq 1 ]
then
echo "pfmconvert.pl -c $pseudocnt -m $minscore -n $logscl -bg $bg_freq -noheader -o $matrix_file_tmp $matrix_file 1>/dev/null 2>/dev/null" >&2
pfmconvert.pl -c $pseudocnt -m $minscore -n $logscl -bg $bg_freq -noheader -o $matrix_file_tmp $matrix_file 1>/dev/null 2>/dev/null; cat $matrix_file_tmp
else
echo "pfmconvert.pl -m $minscore -n $logscl -bg $bg_freq -noheader -o $matrix_file_tmp $matrix_file 1>/dev/null 2>/dev/null" >&2
pfmconvert.pl -m $minscore -n $logscl -bg $bg_freq -noheader -o $matrix_file_tmp $matrix_file 1>/dev/null 2>/dev/null; cat $matrix_file_tmp
fi
fi
if [ $format == 'lpm' ]
then
if [ $(echo "$pseudocnt > 0" | bc) -eq 1 ]
then
echo "lpmconvert.pl -c $pseudocnt -m $minscore -n $logscl -bg $bg_freq -noheader -o $matrix_file_tmp $matrix_file 1>/dev/null 2>/dev/null" >&2
lpmconvert.pl -c $pseudocnt -m $minscore -n $logscl -bg $bg_freq -noheader -o $matrix_file_tmp $matrix_file 1>/dev/null 2>/dev/null; cat $matrix_file_tmp
else
echo "lpmconvert.pl -m $minscore -n $logscl -bg $bg_freq -noheader -o $matrix_file_tmp $matrix_file 1>/dev/null 2>/dev/null" >&2
lpmconvert.pl -m $minscore -n $logscl -bg $bg_freq -noheader -o $matrix_file_tmp $matrix_file 1>/dev/null 2>/dev/null; cat $matrix_file_tmp
fi
fi
if [ $format == 'real' ]
then
echo "pwmconvert.pl -n $mult -noheader -o $matrix_file_tmp $matrix_file 1>/dev/null 2>/dev/null" >&2
pwmconvert.pl -n $mult -noheader -o $matrix_file_tmp $matrix_file 1>/dev/null 2>/dev/null; cat $matrix_file_tmp
fi
# Remove temporary files
if [ $file_len -gt $matrix_len ]
then
rm $matrix_file
fi
rm $matrix_file_tmp $bg_freq
......@@ -220,13 +220,13 @@ then
echo "======== Bowtie-based pipeline ========"
if [ $non_overlapping == 0 ]
then
echo "mba -c $matrix_score -l $matrix_len $matrix_file | sort -k2,2 -nr | awk '{print \">\"\$2\"\n\"\$1}' | bowtie --threads 4 -l $matrix_len -n0 -a $bowtie_dir/$genome_idx_file -f - --un unmapped.dat | sort -s -k3,3 -k4,4n | bowtie2bed -s $assembly -l $matrix_len $chrNC_dir | awk 'BEGIN { while((getline line < \"$pwmScore_tab\") > 0 ) {split(line,f,\" \"); pvalue[f[1]]=f[2]} close(\"$pwmScore_tab\")} {print \$1\"\t\"\$2\"\t\"\$3\"\t\"\$4\"\t\"\$5\"\t\"\$6\"\t\"\"$matrix_name\"\"\t\"\"P-value=\"pvalue[\$5]}' > $pwmout_bed"
echo "mba -c $matrix_score -l $matrix_len $matrix_file | awk '{print \">\"\$2\"\n\"\$1}' | bowtie --threads 4 -l $matrix_len -n0 -a $bowtie_dir/$genome_idx_file -f - --un unmapped.dat | sort -s -k3,3 -k4,4n | bowtie2bed -s $assembly -l $matrix_len $chrNC_dir | awk 'BEGIN { while((getline line < \"$pwmScore_tab\") > 0 ) {split(line,f,\" \"); pvalue[f[1]]=f[2]} close(\"$pwmScore_tab\")} {print \$1\"\t\"\$2\"\t\"\$3\"\t\"\$4\"\t\"\$5\"\t\"\$6\"\t\"\"$matrix_name\"\"\t\"\"P-value=\"pvalue[\$5]}' > $pwmout_bed"
echo "..."
mba -c $matrix_score -l $matrix_len $matrix_file | sort -k2,2 -nr | awk '{print ">"$2"\n"$1}' | bowtie --threads 4 -l $matrix_len -n0 -a $bowtie_dir/$genome_idx_file -f - --un unmapped.dat | sort -s -k3,3 -k4,4n | bowtie2bed -s $assembly -l $matrix_len -i $chrNC_dir | awk -v scoretab="$pwmScore_tab" -v pwmname="$matrix_name" 'BEGIN { while((getline line < scoretab) > 0 ) {split(line,f," "); pvalue[f[1]]=f[2]} close(scoretab)} {print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t"$6"\t"pwmname"\t""P-value="pvalue[$5]}' > $pwmout_bed
mba -c $matrix_score -l $matrix_len $matrix_file | awk '{print ">"$2"\n"$1}' | bowtie --threads 4 -l $matrix_len -n0 -a $bowtie_dir/$genome_idx_file -f - --un unmapped.dat | sort -s -k3,3 -k4,4n | bowtie2bed -s $assembly -l $matrix_len -i $chrNC_dir | awk -v scoretab="$pwmScore_tab" -v pwmname="$matrix_name" 'BEGIN { while((getline line < scoretab) > 0 ) {split(line,f," "); pvalue[f[1]]=f[2]} close(scoretab)} {print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t"$6"\t"pwmname"\t""P-value="pvalue[$5]}' > $pwmout_bed
else
echo "mba -c $matrix_score -l $matrix_len $matrix_file | sort -k2,2 -nr | awk '{print \">\"\$2\"\n\"\$1}' | bowtie --threads 4 -l $matrix_len -n0 -a $bowtie_dir/$genome_idx_file -f - --un unmapped.dat | sort -s -k3,3 -k4,4n | bowtie2bed -s $assembly -l $matrix_len $chrNC_dir | filterOverlaps -l$matrix_len | awk 'BEGIN { while((getline line < \"$pwmScore_tab\") > 0 ) {split(line,f,\" \"); pvalue[f[1]]=f[2]} close(\"$pwmScore_tab\")} {print \$1\"\t\"\$2\"\t\"\$3\"\t\"\$4\"\t\"\$5\"\t\"\$6\"\t\"\"$matrix_name\"\"\t\"\"P-value=\"pvalue[\$5]}' > $pwmout_bed"
echo "mba -c $matrix_score -l $matrix_len $matrix_file | awk '{print \">\"\$2\"\n\"\$1}' | bowtie --threads 4 -l $matrix_len -n0 -a $bowtie_dir/$genome_idx_file -f - --un unmapped.dat | sort -s -k3,3 -k4,4n | bowtie2bed -s $assembly -l $matrix_len $chrNC_dir | filterOverlaps -l$matrix_len | awk 'BEGIN { while((getline line < \"$pwmScore_tab\") > 0 ) {split(line,f,\" \"); pvalue[f[1]]=f[2]} close(\"$pwmScore_tab\")} {print \$1\"\t\"\$2\"\t\"\$3\"\t\"\$4\"\t\"\$5\"\t\"\$6\"\t\"\"$matrix_name\"\"\t\"\"P-value=\"pvalue[\$5]}' > $pwmout_bed"
echo "..."
mba -c $matrix_score -l $matrix_len $matrix_file | sort -k2,2 -nr | awk '{print ">"$2"\n"$1}' | bowtie --threads 4 -l $matrix_len -n0 -a $bowtie_dir/$genome_idx_file -f - --un unmapped.dat | sort -s -k3,3 -k4,4n | bowtie2bed -s $assembly -l $matrix_len -i $chrNC_dir | filterOverlaps -l$matrix_len | awk -v scoretab="$pwmScore_tab" -v pwmname="$matrix_name" 'BEGIN { while((getline line < scoretab) > 0 ) {split(line,f," "); pvalue[f[1]]=f[2]} close(scoretab)} {print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t"$6"\t"pwmname"\t""P-value="pvalue[$5]}' > $pwmout_bed
mba -c $matrix_score -l $matrix_len $matrix_file | awk '{print ">"$2"\n"$1}' | bowtie --threads 4 -l $matrix_len -n0 -a $bowtie_dir/$genome_idx_file -f - --un unmapped.dat | sort -s -k3,3 -k4,4n | bowtie2bed -s $assembly -l $matrix_len -i $chrNC_dir | filterOverlaps -l$matrix_len | awk -v scoretab="$pwmScore_tab" -v pwmname="$matrix_name" 'BEGIN { while((getline line < scoretab) > 0 ) {split(line,f," "); pvalue[f[1]]=f[2]} close(scoretab)} {print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t"$6"\t"pwmname"\t""P-value="pvalue[$5]}' > $pwmout_bed
fi
echo "List of matches (BED format) : $pwmout_bed"
......
0% Loading or .
You are about to add 0 people to the discussion. Proceed with caution.
Please register or to comment