Command Reference
copy_snppipeline_data.py
usage: copy_snppipeline_data.py [-h] whichData [destDirectory]
Copy SNP Pipeline data to a specified directory.
positional arguments:
whichData Which of the supplied data sets to copy. The choices are:
lambdaVirusInputs : Input reference and fastq files
lambdaVirusExpectedResults : Expected results files
agonaInputs : Input reference file
agonaExpectedResults : Expected results files
listeriaInputs : Input reference file
listeriaExpectedResults : Expected results files
configurationFile : File of parameters to customize the
SNP pipeline
Note: the lambda virus data set is complete with input data and expected
results. The agona and listeria data sets have the reference genome and
the expected results, but not the input fastq files, because the files are
too large to include with the package. (default: None)
destDirectory Destination directory into which the SNP pipeline data files will be copied.
The data files are copied into the destination directory if the directory
already exists. Otherwise the destination directory is created and the
data files are copied there. (default: current directory)
optional arguments:
-h, --help show this help message and exit
Example:
# create a new directory "testLambdaVirus" and copy the input data there
$ copy_snppipeline_data.py lambdaVirusInputs testLambdaVirus
run_snp_pipeline.sh
usage: run_snp_pipeline.sh [-h] [-f] [-m MODE] [-c FILE] [-Q torque|grid] [-o DIR] (-s DIR|-S FILE)
referenceFile
Run the SNP Pipeline on a specified data set.
Positional arguments:
referenceFile : Relative or absolute path to the reference fasta file.
Options:
-h : Show this help message and exit.
-f : Force processing even when result files already exist and
are newer than inputs.
-m MODE : Create a mirror copy of the reference directory and all the sample
directories. Use this option to avoid polluting the reference directory and
sample directories with the intermediate files generated by the snp pipeline.
A "reference" subdirectory and a "samples" subdirectory are created under
the output directory (see the -o option). One directory per sample is created
under the "samples" directory. Three suboptions allow a choice of how the
reference and samples are mirrored.
-m soft : creates soft links to the fasta and fastq files instead of copying
-m hard : creates hard links to the fasta and fastq files instead of copying
-m copy : copies the fasta and fastq files
-c FILE : Relative or absolute path to a configuration file for overriding defaults
and defining extra parameters for the tools and scripts within the pipeline.
Note: A default parameter configuration file named "snppipeline.conf" is
used whenever the pipeline is run without the -c option. The
configuration file used for each run is copied into the log directory,
capturing the parameters used during the run.
-Q torque|grid : Job queue manager for remote parallel job execution in an HPC environment.
Currently "torque" and "grid" are supported. If not specified, the pipeline
will execute locally.
-o DIR : Output directory for the snp list, snp matrix, and reference snp files.
Additional subdirectories are automatically created under the output
directory for log files and the mirrored samples and reference files
(see the -m option). The output directory will be created if it does
not already exist. If not specified, the output files are written to
the current working directory. If you re-run the pipeline on previously
processed samples, and specify a different output directory, the
pipeline will not rebuild everything unless you either force a rebuild
(see the -f option) or you request mirrored inputs (see the -m option).
-s DIR : Relative or absolute path to the parent directory of all the sample
directories. The -s option should be used when all the sample directories
are in subdirectories immediately below a parent directory.
Note: You must specify either the -s or -S option, but not both.
Note: The specified directory should contain only a collection of sample
directories, nothing else.
Note: Unless you request mirrored inputs (see the -m option), additional files
will be written to each of the sample directories during the execution
of the SNP Pipeline.
-S FILE : Relative or absolute path to a file listing all of the sample directories.
The -S option should be used when the samples are not under a common parent
directory.
Note: If you are not mirroring the samples (see the -m option), you can
improve parallel processing performance by sorting the list of
directories descending by size, largest first. The -m option
automatically generates a sorted directory list.
Note: You must specify either the -s or -S option, but not both.
Note: Unless you request mirrored inputs (see the -m option), additional files
will be written to each of the sample directories during the execution
of the SNP Pipeline.
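The -S note above recommends listing the sample directories largest first. The following is a minimal sketch of one way to produce such a list. The function names and the output file name are hypothetical, and directory size is assumed here to mean the total bytes of the files directly inside each directory, which may differ from how the pipeline measures size when mirroring.

```python
import os


def dir_size(path):
    """Total size in bytes of the regular files directly inside path."""
    total = 0
    for entry in os.scandir(path):
        if entry.is_file():
            total += entry.stat().st_size
    return total


def write_sorted_sample_dirs(sample_dirs, list_path):
    """Write sample_dirs to list_path, one per line, largest first."""
    ordered = sorted(sample_dirs, key=dir_size, reverse=True)
    with open(list_path, "w") as handle:
        for sample_dir in ordered:
            handle.write(sample_dir + "\n")
    return ordered
```

The resulting file can then be passed to run_snp_pipeline.sh with the -S option.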
prepReference.sh
usage: prepReference.sh [-h] [-f] referenceFile
Index the reference genome for subsequent alignment, and create
the faidx index file for subsequent pileups. The output is written
to the reference directory.
Positional arguments:
referenceFile : Relative or absolute path to the reference fasta file
Options:
-h : Show this help message and exit
-f : Force processing even when result files already exist and
are newer than inputs
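The -f option appears on most commands in this reference. Its description implies a make-style freshness check: a step is skipped when its result files exist and are newer than its inputs. The sketch below illustrates that idea only. The function name needs_rebuild is hypothetical and this is not the pipeline's own code.

```python
import os


def needs_rebuild(input_paths, output_paths, force=False):
    """Return True when the outputs should be regenerated."""
    if force:
        return True
    # Any missing output means the step has not completed yet.
    if not all(os.path.exists(p) for p in output_paths):
        return True
    newest_input = max(os.path.getmtime(p) for p in input_paths)
    oldest_output = min(os.path.getmtime(p) for p in output_paths)
    # Rebuild when some input was modified after an output was written.
    return newest_input > oldest_output
```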
alignSampleToReference.sh
usage: alignSampleToReference.sh [-h] [-f] referenceFile sampleFastqFile1 [sampleFastqFile2]
Align the sequence reads for a specified sample to a specified reference genome.
The output is written to the file "reads.sam" in the sample directory.
Positional arguments:
referenceFile : Relative or absolute path to the reference fasta file
sampleFastqFile1 : Relative or absolute path to the fastq file
sampleFastqFile2 : Optional relative or absolute path to the mate fastq file, if paired
Options:
-h : Show this help message and exit
-f : Force processing even when result files already exist and
are newer than inputs
prepSamples.sh
usage: prepSamples.sh [-h] [-f] referenceFile sampleDir
Find variants in a specified sample.
The output files are written to the sample directory.
Positional arguments:
referenceFile : Relative or absolute path to the reference fasta file
sampleDir : Relative or absolute directory of the sample
Options:
-h : Show this help message and exit
-f : Force processing even when result files already exist and
are newer than inputs
snp_filter.py
usage: snp_filter.py [-h] [-f] [-n NAME] [-l EDGE_LENGTH] [-w WINDOW_SIZE]
[-m MAX_NUM_SNPs] [-g OUT_GROUP] [-v 0..5] [--version]
sampleDirsFile refFastaFile
Remove abnormally dense SNPs from the input VCF file, save the preserved SNPs
into a new VCF file, and save the removed SNPs into another VCF file.
positional arguments:
sampleDirsFile Relative or absolute path to file containing a list of
directories -- one per sample
refFastaFile Relative or absolute path to the reference fasta file
optional arguments:
-h, --help show this help message and exit
-f, --force Force processing even when result files already exist
and are newer than inputs (default: False)
-n NAME, --vcfname NAME
File name of the input VCF files which must exist in
each of the sample directories (default: var.flt.vcf)
-l EDGE_LENGTH, --edge_length EDGE_LENGTH
The length of the edge regions in a contig, in which
all SNPs will be removed. (default: 500)
-w WINDOW_SIZE, --window_size WINDOW_SIZE
The length of the window in which the number of SNPs
should be no more than max_num_snp. (default: 1000)
-m MAX_NUM_SNPs, --max_snp MAX_NUM_SNPs
The maximum number of SNPs allowed in a window.
(default: 3)
-g OUT_GROUP, --out_group OUT_GROUP
Relative or absolute path to the file indicating
outgroup samples, one sample ID per line. (default:
None)
-v 0..5, --verbose 0..5
Verbose message level (0=no info, 5=lots) (default: 1)
--version show program's version number and exit
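The edge_length, window_size, and max_snp options interact, so a small sketch may help. It applies two rules to the SNP positions on a single contig: drop SNPs in the edge regions, and drop SNPs that fall in a window holding more than the allowed number. The function name filter_dense_snps is hypothetical. How snp_filter.py anchors its windows, and whether edge SNPs count toward density, are assumptions made here for illustration and are not taken from the tool.

```python
import bisect


def filter_dense_snps(positions, contig_length, edge_length=500,
                      window_size=1000, max_snps=3):
    """Split 1-based SNP positions on one contig into (kept, removed)."""
    positions = sorted(positions)
    removed = set()

    # Rule 1: drop every SNP inside the edge regions at either contig end.
    for pos in positions:
        if pos <= edge_length or pos > contig_length - edge_length:
            removed.add(pos)

    # Rule 2 (assumed anchoring): start a window at each SNP; when the
    # window holds more than max_snps SNPs, drop all SNPs inside it.
    # Edge SNPs are counted toward density here, which is an assumption.
    for i, start in enumerate(positions):
        end = bisect.bisect_left(positions, start + window_size)
        if end - i > max_snps:
            removed.update(positions[i:end])

    kept = [p for p in positions if p not in removed]
    return kept, sorted(removed)
```

With the defaults, four SNPs within 1000 bases of each other are all removed, while three are kept.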
create_snp_list.py
usage: create_snp_list.py [-h] [-f] [-n NAME] [--maxsnps INT] [-o FILE]
[-v 0..5] [--version]
sampleDirsFile filteredSampleDirsFile
Combine the SNP positions across all samples into a single unified SNP list
file identifying the positions and sample names where SNPs were called.
positional arguments:
sampleDirsFile Relative or absolute path to file containing a list of
directories -- one per sample
filteredSampleDirsFile
Relative or absolute path to the output file that will
be created containing the filtered list of sample
directories -- one per sample. The samples in this
file are those without an excessive number of snps.
See the --maxsnps parameter.
optional arguments:
-h, --help show this help message and exit
-f, --force Force processing even when result file already exists
and is newer than inputs (default: False)
-n NAME, --vcfname NAME
File name of the VCF files which must exist in each of
the sample directories (default: var.flt.vcf)
--maxsnps INT Exclude samples having more than this maximum allowed
number of SNPs. Set to -1 to disable this function.
(default: -1)
-o FILE, --output FILE
Output file. Relative or absolute path to the SNP list
file (default: snplist.txt)
-v 0..5, --verbose 0..5
Verbose message level (0=no info, 5=lots) (default: 1)
--version show program's version number and exit
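As a rough illustration of the merge described above, the sketch below combines per-sample SNP positions into one list and applies the --maxsnps limit. The function name build_snp_list, the in-memory input format, and the row layout are all hypothetical. The real script reads VCF files and writes its own file format.

```python
def build_snp_list(snps_by_sample, max_snps=-1):
    """Merge per-sample SNP positions into one list.

    snps_by_sample maps a sample name to a set of (chrom, pos) tuples.
    Returns (rows, kept_samples), where each row is
    (chrom, pos, [names of samples with a SNP at that position]).
    """
    # A max_snps of -1 disables the per-sample limit.
    kept_samples = [
        name for name, snps in sorted(snps_by_sample.items())
        if max_snps == -1 or len(snps) <= max_snps
    ]
    samples_at = {}
    for name in kept_samples:
        for site in snps_by_sample[name]:
            samples_at.setdefault(site, []).append(name)
    rows = [(chrom, pos, samples_at[(chrom, pos)])
            for chrom, pos in sorted(samples_at)]
    return rows, kept_samples
```

The second return value plays the role of filteredSampleDirsFile: the samples that remain after those with too many SNPs are excluded.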
create_snp_pileup.py
usage: create_snp_pileup.py [-h] [-f] [-l FILE] [-a FILE] [-o FILE] [-v 0..5]
[--version]
Create the SNP pileup file for a sample -- the pileup file at the positions
where SNPs were called in any of the samples.
optional arguments:
-h, --help show this help message and exit
-f, --force Force processing even when result file already exists
and is newer than inputs (default: False)
-l FILE, --snpListFile FILE
Relative or absolute path to the SNP list file across
all samples (default: snplist.txt)
-a FILE, --allPileupFile FILE
Relative or absolute path to the genome-wide pileup
file for this sample (default: reads.all.pileup)
-o FILE, --output FILE
Output file. Relative or absolute path to the sample
SNP pileup file (default: reads.snp.pileup)
-v 0..5, --verbose 0..5
Verbose message level (0=no info, 5=lots) (default: 1)
--version show program's version number and exit
call_consensus.py
usage: call_consensus.py [-h] [-f] [-l FILE] [-e FILE] [-o FILE] [-q INT]
[-c FREQ] [-d INT] [-b FREQ] [--vcfFileName NAME]
[--vcfRefName NAME] [--vcfAllPos]
[--vcfPreserveRefCase] [--vcfFailedSnpGt {.,0,1}]
[-v 0..5] [--version]
allPileupFile
Call the consensus base for a sample at the specified positions where SNPs
were previously called in any of the samples. Generates a single-sequence
fasta file with one base per specified position.
positional arguments:
allPileupFile Relative or absolute path to the genome-wide pileup
file for this sample.
optional arguments:
-h, --help show this help message and exit
-f, --force Force processing even when result file already exists
and is newer than inputs. (default: False)
-l FILE, --snpListFile FILE
Relative or absolute path to the SNP list file across
all samples. (default: snplist.txt)
-e FILE, --excludeFile FILE
VCF file of positions to exclude. (default: None)
-o FILE, --output FILE
Output file. Relative or absolute path to the
consensus fasta file for this sample. (default:
consensus.fasta)
-q INT, --minBaseQual INT
Minimum base quality score to count a read. All other
snp filters take effect after the low-quality reads
are discarded. (default: 0)
-c FREQ, --minConsFreq FREQ
Consensus frequency. Minimum fraction of high-quality
reads supporting the consensus to make a call.
(default: 0.6)
-d INT, --minConsStrdDpth INT
Consensus strand depth. Minimum number of high-quality
reads supporting the consensus which must be present
on both the forward and reverse strands to make a
call. (default: 0)
-b FREQ, --minConsStrdBias FREQ
Strand bias. Minimum fraction of the high-quality
consensus-supporting reads which must be present on
both the forward and reverse strands to make a call.
The numerator of this fraction is the number of high-
quality consensus-supporting reads on one strand. The
denominator is the total number of high-quality
consensus-supporting reads on both strands combined.
(default: 0)
--vcfFileName NAME VCF Output file name. If specified, a VCF file with
this file name will be created in the same directory
as the consensus fasta file for this sample. (default:
None)
--vcfRefName NAME Name of the reference file. This is only used in the
generated VCF file header. (default: Unknown
reference)
--vcfAllPos Flag to cause VCF file generation at all positions,
not just the snp positions. This has no effect on the
consensus fasta file, it only affects the VCF file.
This capability is intended primarily as a diagnostic
tool and enabling this flag will greatly increase
execution time. (default: False)
--vcfPreserveRefCase Flag to cause the VCF file generator to emit each
reference base in uppercase/lowercase as it appears in
the reference sequence file. If not specified, the
reference base is emitted in uppercase. (default:
False)
--vcfFailedSnpGt {.,0,1}
Controls the VCF file GT data element when a snp fails
filters. Possible values: .) The GT element will be a
dot, indicating unable to make a call. 0) The GT
element will be 0, indicating the reference base. 1)
The GT element will be the ALT index of the most
commonly occurring base, usually 1. (default: .)
-v 0..5, --verbose 0..5
Verbose message level (0=no info, 5=lots) (default: 1)
--version show program's version number and exit
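The three consensus filters (minConsFreq, minConsStrdDpth, minConsStrdBias) are easier to follow with a worked sketch. The function below applies them to the per-strand counts of high-quality reads at one position, after the minBaseQual filter has already discarded low-quality reads. The function name, the input format, the alphabetical tie-break, and the use of "-" for a failed call are assumptions for illustration. This is not the script's actual implementation.

```python
def call_consensus_base(fwd_counts, rev_counts, min_cons_freq=0.6,
                        min_cons_strand_depth=0, min_cons_strand_bias=0.0):
    """Return the consensus base, or "-" when no call can be made.

    fwd_counts and rev_counts map each base to the number of high-quality
    reads supporting it on the forward and reverse strand respectively.
    """
    bases = set(fwd_counts) | set(rev_counts)
    depth = {b: fwd_counts.get(b, 0) + rev_counts.get(b, 0) for b in bases}
    total = sum(depth.values())
    if total == 0:
        return "-"

    # Most common base; ties are broken alphabetically (arbitrary choice).
    consensus = max(sorted(depth), key=depth.get)
    fwd = fwd_counts.get(consensus, 0)
    rev = rev_counts.get(consensus, 0)

    # Consensus frequency: fraction of reads supporting the consensus.
    if depth[consensus] / total < min_cons_freq:
        return "-"
    # Strand depth: supporting reads required on each strand.
    if min(fwd, rev) < min_cons_strand_depth:
        return "-"
    # Strand bias: reads on the weaker strand over reads on both strands.
    if min(fwd, rev) / depth[consensus] < min_cons_strand_bias:
        return "-"
    return consensus
```

For example, 9 forward and 1 reverse reads supporting "A" give a strand fraction of 0.1, so the call fails when min_cons_strand_bias is 0.2.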
mergeVcf.sh
usage: mergeVcf.sh [-h] [-f] [-n NAME] [-o FILE] sampleDirsFile
Merge the vcf files from all samples into a single multi-vcf file for all samples.
Before running this command, the vcf file for each sample must be created by the
call_consensus.py script.
Positional arguments:
sampleDirsFile : Relative or absolute path to file containing a list of
directories -- one per sample
Options:
-h : Show this help message and exit
-f : Force processing even when result files already exist and
are newer than inputs
-n NAME : File name of the vcf files which must exist in each of
the sample directories. (default: consensus.vcf)
-o FILE : Output file. Relative or absolute path to the merged
multi-vcf file. (default: snpma.vcf)
create_snp_matrix.py
usage: create_snp_matrix.py [-h] [-f] [-c NAME] [-o FILE] [-v 0..5]
[--version]
sampleDirsFile
Create the SNP matrix containing the consensus base for each of the samples at
the positions where SNPs were called in any of the samples. The matrix
contains one row per sample and one column per SNP position. Non-SNP positions
are not included in the matrix. The matrix is formatted as a fasta file, with
each sequence (all of identical length) corresponding to the SNPs in the
correspondingly named sample.
positional arguments:
sampleDirsFile Relative or absolute path to file containing a list of
directories -- one per sample
optional arguments:
-h, --help show this help message and exit
-f, --force Force processing even when result file already exists
and is newer than inputs (default: False)
-c NAME, --consFileName NAME
File name of the previously created consensus SNP call
file which must exist in each of the sample
directories (default: consensus.fasta)
-o FILE, --output FILE
Output file. Relative or absolute path to the SNP
matrix file (default: snpma.fasta)
-v 0..5, --verbose 0..5
Verbose message level (0=no info, 5=lots) (default: 1)
--version show program's version number and exit
calculate_snp_distances.py
usage: calculate_snp_distances.py [-h] [-f] [-p FILE] [-m FILE] [-v 0..5]
[--version]
snpMatrixFile
Calculate pairwise SNP distances from the multi-fasta SNP matrix. Generates a
file of pairwise distances and a file containing a matrix of distances.
positional arguments:
snpMatrixFile Relative or absolute path to the input multi-fasta SNP
matrix file.
optional arguments:
-h, --help show this help message and exit
-f, --force Force processing even when result file already exists
and is newer than inputs (default: False)
-p FILE, --pairs FILE
Relative or absolute path to the pairwise distance
output file. (default: None)
-m FILE, --matrix FILE
Relative or absolute path to the distance matrix
output file. (default: None)
-v 0..5, --verbose 0..5
Verbose message level (0=no info, 5=lots) (default: 1)
--version show program's version number and exit
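The pairwise distance between two samples is the number of SNP matrix columns at which their sequences differ. The sketch below shows that calculation on a multi-fasta matrix. The function names are hypothetical, and the sketch counts every differing character, including any missing-data symbols, which may not match how calculate_snp_distances.py treats such positions.

```python
from itertools import combinations


def read_fasta(lines):
    """Parse fasta text lines into an ordered {name: sequence} dict."""
    sequences = {}
    name = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            name = line[1:].split()[0]
            sequences[name] = ""
        elif line and name is not None:
            sequences[name] += line
    return sequences


def pairwise_snp_distances(sequences):
    """Count differing columns for every pair of equal-length sequences."""
    distances = {}
    for (name1, seq1), (name2, seq2) in combinations(sequences.items(), 2):
        distances[(name1, name2)] = sum(a != b for a, b in zip(seq1, seq2))
    return distances
```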
create_snp_reference_seq.py
usage: create_snp_reference_seq.py [-h] [-f] [-l FILE] [-o FILE] [-v 0..5]
[--version]
referenceFile
Write reference sequence bases at SNP locations to a fasta file.
positional arguments:
referenceFile Relative or absolute path to the reference bases file
in fasta format
optional arguments:
-h, --help show this help message and exit
-f, --force Force processing even when result file already exists
and is newer than inputs (default: False)
-l FILE, --snpListFile FILE
Relative or absolute path to the SNP list file
(default: snplist.txt)
-o FILE, --output FILE
Output file. Relative or absolute path to the SNP
reference sequence file (default: referenceSNP.fasta)
-v 0..5, --verbose 0..5
Verbose message level (0=no info, 5=lots) (default: 1)
--version show program's version number and exit
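Conceptually, this step looks up the reference base at each position in the SNP list. The sketch below is a minimal illustration with a hypothetical function name and in-memory inputs. It assumes 1-based positions and emits bases in sorted (chrom, pos) order, which may differ from the order the script uses.

```python
def reference_bases_at_snps(reference, snp_sites):
    """Return the reference bases at the given SNP sites as one string.

    reference maps a chromosome name to its full sequence; snp_sites is an
    iterable of (chrom, pos) tuples with 1-based positions.
    """
    return "".join(reference[chrom][pos - 1]
                   for chrom, pos in sorted(snp_sites))
```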
collectSampleMetrics.sh
usage: collectSampleMetrics.sh [-h] [-f] [-c FILE] [-C FILE] [-m INT] [-o FILE] [-v FILE] [-V FILE] sampleDir referenceFile
Collect alignment, coverage, and variant metrics for a single specified sample.
Positional arguments:
sampleDir : Relative or absolute directory of the sample
referenceFile : Relative or absolute path to the reference fasta file
Options:
-h : Show this help message and exit
-f : Force processing even when result files already exist and
are newer than inputs
-c FILE : Relative or absolute path to the consensus fasta file
(default: consensus.fasta in the sampleDir)
-C FILE : Relative or absolute path to the consensus preserved fasta file
(default: consensus_preserved.fasta in the sampleDir)
-m INT : Maximum allowed number of SNPs per sample. (default: -1)
-o FILE : Output file. Relative or absolute path to the metrics file
(default: metrics in the sampleDir)
-v FILE : Relative or absolute path to the consensus vcf file
(default: consensus.vcf in the sampleDir)
-V FILE : Relative or absolute path to the consensus preserved vcf file
(default: consensus_preserved.vcf in the sampleDir)
combineSampleMetrics.sh
usage: combineSampleMetrics.sh [-h] [-n NAME] [-o FILE] [-s] sampleDirsFile
Combine the metrics from all samples into a single table of metrics for all samples.
The output is a tab-separated-values file with a row for each sample and a column
for each metric.
Before running this command, the metrics for each sample must be created by the
collectSampleMetrics.sh script.
Positional arguments:
sampleDirsFile : Relative or absolute path to file containing a list of
directories -- one per sample
Options:
-h : Show this help message and exit
-n NAME : File name of the metrics files which must exist in each of
the sample directories. (default: metrics)
-o FILE : Output file. Relative or absolute path to the combined metrics
file. (default: stdout)
-s : Emit column headings with spaces instead of underscores