diff --git a/examples/README.html b/examples/README.html index 2f406ce58b7e84e0eab8f7e7d1f85b856ea32390..c527dc21086b913e75cacb5893b234d7ebf7232b 100644 --- a/examples/README.html +++ b/examples/README.html @@ -224,7 +224,7 @@ <p> The command to map the list of PWM tags to hg19 is the follwing: <p class="code"> - <font color=firebrick># It takes of the order of 30 seconds for a full scan of the hg19 genome assembly (with the exclusion of the mitochondrial chromosome). The total number of hits is 143597.</font><br><br> + <font color=firebrick># It takes of the order of 20 seconds for a full scan of the hg19 genome assembly (with the exclusion of the mitochondrial chromosome). The total number of hits is 143597.</font><br><br> bowtie -l 19 -n0 -a BOWTIE_DIR/h_sapiens_hg19 -f chen10_ctcf_co1128_taglist.fa --un unmapped.dat > chen10_ctcf_co1128_bowtie.out<br> <br> bowtie -l 19 -n0 -a /home/local/db/bowtie/h_sapiens_hg19 -f chen10_ctcf_co1128_taglist.fa --un unmapped.dat > chen10_ctcf_co1128_bowtie.out<br> @@ -280,18 +280,19 @@ Ex: chr|NC_000004|NC_000004.11 </pre></font> + <p> + <font color=firebrick># Parallelization of matrix_scan<br></font> + <font color=black>To improve performance, we can run matrix_scan in parallel (on each chromosome file) using a simple perl script called <b>matrix_scan_parallel.pl</b>. In this way, matrix_scan competes with the bowtie-based approach.</font> <p> <p class="code"> <font color=firebrick># Example:<br></font> - <i>cat /home/local/db/genome/hg19/chrom*.seq | matrix_scan -m chen10_ctcf.mat -c 1128 | sort -s -k1,1 -k2,2n -k6,6 > chen10_ctcf_co1128_matrix_scan.out</i></font> + <i>cat /home/local/db/genome/hg19/chrom*.seq | matrix_scan -m chen10_ctcf.mat -c 1128 | sort -s -k1,1 -k2,2n -k6,6 > chen10_ctcf_co1128_matrix_scan.out</i> <br> - <font color=firebrick># Parallelization of matrix_scan<br></font> - <font color=black>To improve performance, we can run matrix_scan in parallel (on each chromosome file) using a simple perl script called matrix_scan_parallel.pl. - The above example takes only 30-40 seconds to complete the task, thus competing with the bowtie-based approach.<br> - To use matrix_scan_parallel.pl, type the following command:<br> </font> + <font color=firebrick># To use matrix_scan_parallel.pl, type the following command:<br> </font> <i>matrix_scan_parallel.pl /home/local/db/genome/hg19/chrom\*.seq chen10_ctcf.mat 1128 | sort -s -k1,1 -k2,2n -k6,6 > chen10_ctcf_co1128_matrix_scan.out</i> <br> - <font color=firebrick># Convert the matrix scan output to BED by issuing the following command:</font> + <font color=firebrick># The above example takes only about 20 seconds to complete the task in agreement with Bowtie.<br> + # Convert the matrix scan output to BED by issuing the following command:</font> <br> <i>mscan2bed -s hg19 -i GENOME_DIR chen10_ctcf_co1128_matrix_scan.out > chen10_ctcf_co1128_matrix_scan.bed</i> <br> @@ -301,7 +302,7 @@ <font color=firebrick> # Note : to convert matrix_scan output to BED format, we use the chr_NC_gi file for assembly hg19.<br> # By default, the chr_NC_gi is located at chr_NC_PATH/hg19 = /home/local/db/genome/hg19.<br> - chr_NC_PATH can be changed by using option -i. + chr_NC_PATH can be changed by using the <-i> option. <br> # To run the entire pipeline: <br></font> @@ -386,19 +387,18 @@ <br> <font color=firebrick># Output file (BED) : chen10_ctcf_co1128_matrix_scan.bed</font> <br> - <font color=firebrick># Try also:</font> + <font color=firebrick># Also try the following commands (with p-value=10<sup>-6</sup>):</font> <br> - <i>pwm_scan chen10_ctcf.mat 0.000001 hg19 /home/local/db 1</i></font> + <i>pwm_scan chen10_ctcf.mat 0.000001 hg19 /home/local/db 1</i></font><br> <i>pwm_scan chen10_ctcf.mat 0.000001 hg19 /home/local/db </i></font> <br> <font color=firebrick># Output file (BED) : chen10_ctcf_co1549.bed</font> <br> <font color=firebrick># N.B.</font> <br> - <font color=black> If the <parallel> option is specified, matrix_scan is run in parallel by splitting the processing of the entire genome - on multiple processes for each chromosome file in parallel. + <font color=black> If the <parallel> option is specified, matrix_scan is run in parallel by splitting the processing of the entire genome on multiple processes for each chromosome file in parallel. <br> - Of course it only works on a multi-core processor machine.</font> + Of course, this only works on a multi-core processor machine.</font> <br> </p> </ul> diff --git a/examples/README.txt b/examples/README.txt index 3951b17f523101d9b6135c5582f0fa2f0f6fc6f4..4a77d6e3d317645a30e7785f66f35e31262d0988 100644 --- a/examples/README.txt +++ b/examples/README.txt @@ -157,7 +157,7 @@ The reason for generating a match list in FASTA format is that we want to carry bowtie -l 19 -n0 -a BOWTIE_DIR/h_sapiens_hg19 -f chen10_ctcf_co1128_taglist.fa --un unmapped.dat > chen10_ctcf_co1128_bowtie.out - # It takes of the order of 30 seconds for a full scan of the hg19 genome assembly (with the exclusion of the mitochondrial chromosome). + # It takes of the order of 20 seconds for a full scan of the hg19 genome assembly (with the exclusion of the mitochondrial chromosome). The total number of hits is 143597. # Example: @@ -212,17 +212,18 @@ The reason for generating a match list in FASTA format is that we want to carry Ex: chr|NC_000004|NC_000004.11 - # Example: - cat /home/local/db/genome/hg19/chrom*.seq | matrix_scan -m chen10_ctcf.mat -c 1128 | sort -s -k1,1 -k2,2n -k6,6 > chen10_ctcf_co1128_matrix_scan.out - # Parallelization of matrix_scan - To improve performance, we can run matrix_scan in parallel (on each chromosome file) using a simple perl script called matrix_scan_parallel.pl. + To improve performance, we can run matrix_scan in parallel (on each chromosome file) using a simple perl script called matrix_scan_parallel.pl that dispatches the jobs to available cores. In this way, matrix_scan competes with the bowtie-based approach. + + # Example: + + cat /home/local/db/genome/hg19/chrom*.seq | matrix_scan -m chen10_ctcf.mat -c 1128 | sort -s -k1,1 -k2,2n -k6,6 > chen10_ctcf_co1128_matrix_scan.out - The above example takes only 30-40 seconds to complete the task, thus competing with the bowtie-based approach. To use matrix_scan_parallel.pl, type the following command: matrix_scan_parallel.pl /home/local/db/genome/hg19/chrom\*.seq chen10_ctcf.mat 1128 | sort -s -k1,1 -k2,2n -k6,6 > chen10_ctcf_co1128_matrix_scan.out + The above example takes only about 20 seconds to complete the task in agreement with Bowtie. # Convert the matrix scan output to BED by issuing the following command: @@ -233,7 +234,7 @@ The reason for generating a match list in FASTA format is that we want to carry # Note : to convert matrix_scan output to BED format, we use the chr_NC_gi file for assembly hg19. # By default, the chr_NC_gi is located at chr_NC_PATH/hg19 = /home/local/db/genome/hg19. - chr_NC_PATH can be changed by using option -i. + chr_NC_PATH can be changed by using the <-i> option. # To run the entire pipeline: @@ -303,7 +304,7 @@ Shell (bash) Wrappers to execute the analysis pipeline # Output file (BED) : chen10_ctcf_co1128_matrix_scan.bed - # Try also: + # Also try the following commands (with p-value=10^6): pwm_scan chen10_ctcf.mat 0.000001 hg19 /home/local/db 1 pwm_scan chen10_ctcf.mat 0.000001 hg19 /home/local/db @@ -311,8 +312,7 @@ Shell (bash) Wrappers to execute the analysis pipeline # Output file (BED) : chen10_ctcf_co1549.bed # N.B. - # If the <parallel> option is specified, matrix_scan is run in parallel by splitting the processing of the entire genome on - # multiple processes for each chromosome file in parallel. + # If the <parallel> option is specified, matrix_scan is run in parallel by splitting the processing of the entire genome on multiple processes for each chromosome file in parallel. # Of course it only works on a multi-core processor machine. diff --git a/examples/results/chen10_ctcf_co1128_taglist.fa.gz b/examples/results/chen10_ctcf_co1128_taglist.fa.gz deleted file mode 100644 index 467ceee51a8f331b1372e75589b3f018862c87ba..0000000000000000000000000000000000000000 Binary files a/examples/results/chen10_ctcf_co1128_taglist.fa.gz and /dev/null differ diff --git a/pwm_bowtie_wrapper b/pwm_bowtie_wrapper index 7d1bb4520ffdf14921e8f9ee84360721a6c849c9..71d9b40481982c1f76aa19c7fffc73171d539ccb 100755 --- a/pwm_bowtie_wrapper +++ b/pwm_bowtie_wrapper @@ -19,6 +19,35 @@ then exit $E_BADARGS fi +# Declare associative array for background base composition +declare -A BGFREQ +BGFREQ[hg19]=0.29,0.21,0.21,0.29 +BGFREQ[hg18]=0.29,0.21,0.21,0.29 +BGFREQ[hg38]=0.29,0.21,0.21,0.29 +BGFREQ[mm9]=0.29,0.21,0.21,0.29 +BGFREQ[mm10]=0.29,0.21,0.21,0.29 +BGFREQ[bosTau3]=0.29,0.21,0.21,0.29 +BGFREQ[bosTau8]=0.29,0.21,0.21,0.29 +BGFREQ[canFam2]=0.29,0.21,0.21,0.29 +BGFREQ[canFam3]=0.29,0.21,0.21,0.29 +BGFREQ[panTro2]=0.29,0.21,0.21,0.29 +BGFREQ[panTro5]=0.29,0.21,0.21,0.29 +BGFREQ[rn5]=0.28,0.22,0.22,0.28 +BGFREQ[rn6]=0.28,0.22,0.22,0.28 +BGFREQ[dm3]=0.29,0.21,0.21,0.29 +BGFREQ[dm6]=0.29,0.21,0.21,0.29 +BGFREQ[ce6]=0.32,0.18,0.18,0.32 +BGFREQ[ce10]=0.32,0.18,0.18,0.32 +BGFREQ[ce11]=0.32,0.18,0.18,0.32 +BGFREQ[danRer5]=0.32,0.18,0.18,0.32 +BGFREQ[danRer7]=0.32,0.18,0.18,0.32 +BGFREQ[danRer10]=0.32,0.18,0.18,0.32 +BGFREQ[sacCer2]=0.31,0.19,0.19,0.31 +BGFREQ[sacCer3]=0.31,0.19,0.19,0.31 + +echo "BG nucleotide composition: ${BGFREQ[$5]}" +bg_freq=${BGFREQ[$5]} + chrNC_dir=/home/local/db/genome if [ -f "$1" ] @@ -65,7 +94,7 @@ echo "PWM length: $matrix_len" echo "PWM file length : $file_len lines" echo "======== Calculating PWM score ========" -matrix_score=$(matrix_prob -e $p_value --bg "0.29,0.21,0.21,0.29" $matrix_file \ +matrix_score=$(matrix_prob -e $p_value --bg "$bg_freq" $matrix_file \ | grep SCORE | sed 's/:/\ /'\ | awk -F " " '{print $2}') diff --git a/pwm_mscan_wrapper b/pwm_mscan_wrapper index 5ce3c52dcc3289ba2cebc98127272018a2f2a023..fa3a3cd56ca567d6e678d3fdd1556ca4ddfca766 100755 --- a/pwm_mscan_wrapper +++ b/pwm_mscan_wrapper @@ -19,6 +19,35 @@ then exit $E_BADARGS fi +# Declare associative array for background base composition +declare -A BGFREQ +BGFREQ[hg19]=0.29,0.21,0.21,0.29 +BGFREQ[hg18]=0.29,0.21,0.21,0.29 +BGFREQ[hg38]=0.29,0.21,0.21,0.29 +BGFREQ[mm9]=0.29,0.21,0.21,0.29 +BGFREQ[mm10]=0.29,0.21,0.21,0.29 +BGFREQ[bosTau3]=0.29,0.21,0.21,0.29 +BGFREQ[bosTau8]=0.29,0.21,0.21,0.29 +BGFREQ[canFam2]=0.29,0.21,0.21,0.29 +BGFREQ[canFam3]=0.29,0.21,0.21,0.29 +BGFREQ[panTro2]=0.29,0.21,0.21,0.29 +BGFREQ[panTro5]=0.29,0.21,0.21,0.29 +BGFREQ[rn5]=0.28,0.22,0.22,0.28 +BGFREQ[rn6]=0.28,0.22,0.22,0.28 +BGFREQ[dm3]=0.29,0.21,0.21,0.29 +BGFREQ[dm6]=0.29,0.21,0.21,0.29 +BGFREQ[ce6]=0.32,0.18,0.18,0.32 +BGFREQ[ce10]=0.32,0.18,0.18,0.32 +BGFREQ[ce11]=0.32,0.18,0.18,0.32 +BGFREQ[danRer5]=0.32,0.18,0.18,0.32 +BGFREQ[danRer7]=0.32,0.18,0.18,0.32 +BGFREQ[danRer10]=0.32,0.18,0.18,0.32 +BGFREQ[sacCer2]=0.31,0.19,0.19,0.31 +BGFREQ[sacCer3]=0.31,0.19,0.19,0.31 + +echo "BG nucleotide composition: ${BGFREQ[$4]}" +bg_freq=${BGFREQ[$4]} + parallel=0 if [ $# == 5 ] @@ -65,7 +94,7 @@ echo "PWM length: $matrix_len" echo "PWM file length : $file_len" echo "======== Calculating PWM score ========" -matrix_score=$(matrix_prob -e $p_value --bg "0.29,0.21,0.21,0.29" $matrix_file \ +matrix_score=$(matrix_prob -e $p_value --bg "$bg_freq" $matrix_file \ | grep SCORE | sed 's/:/\ /'\ | awk -F " " '{print $2}') diff --git a/pwm_scan b/pwm_scan index 8afa02e107be6b0ac441f5684d2a2a75b6980d6d..575bb61304da5b98e53e4c310b8bd0b09a099cd6 100755 --- a/pwm_scan +++ b/pwm_scan @@ -63,9 +63,37 @@ BOWTIEIDX[sacCer2]=s_cerevisiae_sacCer2 BOWTIEIDX[sacCer3]=s_cerevisiae_sacCer3 #echo ${BOWTIEIDX[$3]} - genome_idx_file=${BOWTIEIDX[$3]} +# Declare associative array for background base composition +declare -A BGFREQ +BGFREQ[hg19]=0.29,0.21,0.21,0.29 +BGFREQ[hg18]=0.29,0.21,0.21,0.29 +BGFREQ[hg38]=0.29,0.21,0.21,0.29 +BGFREQ[mm9]=0.29,0.21,0.21,0.29 +BGFREQ[mm10]=0.29,0.21,0.21,0.29 +BGFREQ[bosTau3]=0.29,0.21,0.21,0.29 +BGFREQ[bosTau8]=0.29,0.21,0.21,0.29 +BGFREQ[canFam2]=0.29,0.21,0.21,0.29 +BGFREQ[canFam3]=0.29,0.21,0.21,0.29 +BGFREQ[panTro2]=0.29,0.21,0.21,0.29 +BGFREQ[panTro5]=0.29,0.21,0.21,0.29 +BGFREQ[rn5]=0.28,0.22,0.22,0.28 +BGFREQ[rn6]=0.28,0.22,0.22,0.28 +BGFREQ[dm3]=0.29,0.21,0.21,0.29 +BGFREQ[dm6]=0.29,0.21,0.21,0.29 +BGFREQ[ce6]=0.32,0.18,0.18,0.32 +BGFREQ[ce10]=0.32,0.18,0.18,0.32 +BGFREQ[ce11]=0.32,0.18,0.18,0.32 +BGFREQ[danRer5]=0.32,0.18,0.18,0.32 +BGFREQ[danRer7]=0.32,0.18,0.18,0.32 +BGFREQ[danRer10]=0.32,0.18,0.18,0.32 +BGFREQ[sacCer2]=0.31,0.19,0.19,0.31 +BGFREQ[sacCer3]=0.31,0.19,0.19,0.31 + +echo "BG nucleotide composition: ${BGFREQ[$3]}" +bg_freq=${BGFREQ[$3]} + parallel=0 if [ $# == 5 ] then @@ -111,7 +139,7 @@ echo "PWM length: $matrix_len" echo "PWM file length : $file_len lines" echo "======== Calculating PWM score ========" -matrix_score=$(matrix_prob -e $p_value --bg "0.29,0.21,0.21,0.29" $matrix_file \ +matrix_score=$(matrix_prob -e $p_value --bg "$bg_freq" $matrix_file \ | grep SCORE | sed 's/:/\ /'\ | awk -F " " '{print $2}') @@ -164,21 +192,15 @@ echo "Use matrix_scan: $use_matrix_scan" if [ $use_matrix_scan -eq 0 ] then - echo "======== Generating tag list ========" - tagscore=${matrix_name}_co$matrix_score.dat - echo "mba -c $matrix_score -l $matrix_len $matrix_file | sort -k2,2 -nr > $tagscore" - echo "..." - # Generate list of all tags that represent all possible matrix hits - mba -c $matrix_score -l $matrix_len $matrix_file | sort -k2,2 -nr > $tagscore # Run Bowtie pipeline pwmout_bed=${matrix_name}_co${matrix_score}_bowtie.bed echo "======== Bowtie-based pipeline ========" - echo "awk '{print \">\"\$2\"\n\"\$1}' $tagscore | bowtie -l $matrix_len -n0 -a $bowtie_dir/$genome_idx_file -f - --un unmapped.dat | sort -s -k3,3 -k4,4n | bowtie2bed -s $assembly -l $matrix_len $chrNC_dir > $pwmout_bed" + echo "mba -c $matrix_score -l $matrix_len $matrix_file | sort -k2,2 -nr | awk '{print \">\"\$2\"\n\"\$1}' | bowtie --threads 4 -l $matrix_len -n0 -a $bowtie_dir/$genome_idx_file -f - --un unmapped.dat | sort -s -k3,3 -k4,4n | bowtie2bed -s $assembly -l $matrix_len $chrNC_dir > $pwmout_bed" echo "..." - awk '{print ">"$2"\n"$1}' $tagscore | bowtie -l $matrix_len -n0 -a $bowtie_dir/$genome_idx_file -f - --un unmapped.dat | sort -s -k3,3 -k4,4n | bowtie2bed -s $assembly -l $matrix_len -i $chrNC_dir > $pwmout_bed + mba -c $matrix_score -l $matrix_len $matrix_file | sort -k2,2 -nr | awk '{print ">"$2"\n"$1}' | bowtie --threads 4 -l $matrix_len -n0 -a $bowtie_dir/$genome_idx_file -f - --un unmapped.dat | sort -s -k3,3 -k4,4n | bowtie2bed -s $assembly -l $matrix_len -i $chrNC_dir > $pwmout_bed echo "List of matches (BED format) : $pwmout_bed" - rm $tagscore unmapped.dat + rm unmapped.dat else pwmout_bed=${matrix_name}_co${matrix_score}_matrix_scan.bed echo "======== matrix_scan-based pipeline ========"