Command Reference
run_snp_pipeline.sh
usage: run_snp_pipeline.sh [-h] [-f] [-m MODE] [-c FILE] [-Q torque|grid] [-o DIR] (-s DIR|-S FILE)
referenceFile
Run the SNP Pipeline on a specified data set.
Positional arguments:
referenceFile : Relative or absolute path to the reference fasta file.
Options:
-h : Show this help message and exit.
-f : Force processing even when result files already exist and
are newer than inputs.
-m MODE : Create a mirror copy of the reference directory and all the sample
directories. Use this option to avoid polluting the reference directory and
sample directories with the intermediate files generated by the snp pipeline.
A "reference" subdirectory and a "samples" subdirectory are created under
the output directory (see the -o option). One directory per sample is created
under the "samples" directory. Three suboptions allow a choice of how the
reference and samples are mirrored.
-m soft : creates soft links to the fasta and fastq files instead of copying
-m hard : creates hard links to the fasta and fastq files instead of copying
-m copy : copies the fasta and fastq files
-c FILE : Relative or absolute path to a configuration file for overriding defaults
and defining extra parameters for the tools and scripts within the pipeline.
Note: A default parameter configuration file named "snppipeline.conf" is
used whenever the pipeline is run without the -c option. The
configuration file used for each run is copied into the log directory,
capturing the parameters used during the run.
-Q torque|grid : Job queue manager for remote parallel job execution in an HPC environment.
Currently "torque" and "grid" are supported. If not specified, the pipeline
will execute locally.
-o DIR : Output directory for the snp list, snp matrix, and reference snp files.
Additional subdirectories are automatically created under the output
directory for log files and the mirrored samples and reference files
(see the -m option). The output directory will be created if it does
not already exist. If not specified, the output files are written to
the current working directory. If you re-run the pipeline on previously
processed samples, and specify a different output directory, the
pipeline will not rebuild everything unless you either force a rebuild
(see the -f option) or you request mirrored inputs (see the -m option).
-s DIR : Relative or absolute path to the parent directory of all the sample
directories. The -s option should be used when all the sample directories
are in subdirectories immediately below a parent directory.
Note: You must specify either the -s or -S option, but not both.
Note: The specified directory should contain only a collection of sample
directories, nothing else.
Note: Unless you request mirrored inputs (see the -m option), additional files
will be written to each of the sample directories during the execution
of the SNP Pipeline.
-S FILE : Relative or absolute path to a file listing all of the sample directories.
The -S option should be used when the samples are not under a common parent
directory.
Note: If you are not mirroring the samples (see the -m option), you can
improve parallel processing performance by sorting the list of
directories descending by size, largest first. The -m option
automatically generates a sorted directory list.
Note: You must specify either the -s or -S option, but not both.
Note: Unless you request mirrored inputs (see the -m option), additional files
will be written to each of the sample directories during the execution
of the SNP Pipeline.
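Example: the size-sorted directory list recommended for the -S option can be built with standard tools. This is a minimal sketch, not part of the pipeline; the helper name list_samples_by_size and the file name sampleDirectories.txt are illustrative assumptions.

```shell
# list_samples_by_size PARENT_DIR
# Print each sample directory directly under PARENT_DIR, largest first.
# (Hypothetical helper, not shipped with the SNP Pipeline.)
list_samples_by_size() {
    for dir in "$1"/*/; do
        [ -d "$dir" ] || continue
        # du -sk prints "<size in KB><TAB><path>"
        du -sk "${dir%/}"
    done | sort -k1,1nr | cut -f2-
}

# Possible use, assuming the sample directories live under ./samples:
#   list_samples_by_size samples > sampleDirectories.txt
#   run_snp_pipeline.sh -S sampleDirectories.txt -o outputDir referenceFile
```

Sorting largest first lets the longest-running samples start early, which is the reason the note above gives for ordering the list.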
cfsan_snp_pipeline
usage: cfsan_snp_pipeline [-h] [--version] subcommand ...
The CFSAN SNP Pipeline is a collection of tools using reference-based
alignments to call SNPs for a set of samples.
positional arguments:
subcommand
data Copy included data to a specified directory
index_ref Index the reference
map_reads Align reads to the reference
call_sites Find the sites with high-confidence SNPs in a sample
filter_regions Remove abnormally dense SNPs from all samples
merge_sites Prepare the list of sites having SNPs
call_consensus Call the consensus base at high-confidence sites
merge_vcfs Merge the per-sample VCF files
snp_matrix Create a matrix of SNPs
distance Calculate the SNP distances between samples
snp_reference Write reference bases at SNP locations to a fasta file
collect_metrics Collect quality and SNP metrics for a sample
combine_metrics Merge the per-sample metrics
optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit
data
usage: cfsan_snp_pipeline data [-h] [--version] whichData [destDirectory]
Copy data included with the CFSAN SNP Pipeline to a specified directory.
positional arguments:
whichData Which of the supplied data sets to copy. The choices are:
lambdaVirusInputs : Input reference and fastq files
lambdaVirusExpectedResults : Expected results files
agonaInputs : Input reference file
agonaExpectedResults : Expected results files
listeriaInputs : Input reference file
listeriaExpectedResults : Expected results files
configurationFile : File of parameters to customize the
SNP pipeline
Note: the lambda virus data set is complete with input data and expected
results. The agona and listeria data sets have the reference genome and
the expected results, but not the input fastq files, because the files are
too large to include with the package.
destDirectory Destination directory into which the SNP pipeline data files will be copied.
The data files are copied into the destination directory if the directory
already exists. Otherwise the destination directory is created and the
data files are copied there. (default: current directory)
optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit
Example:
# create a new directory "testLambdaVirus" and copy the Lambda virus input data there
$ cfsan_snp_pipeline data lambdaVirusInputs testLambdaVirus
index_ref
usage: cfsan_snp_pipeline index_ref [-h] [-f] [-v 0..5] [--version]
referenceFile
Index the reference genome for subsequent read mapping, and create the faidx
index file for subsequent pileups. The output is written to the reference
directory.
positional arguments:
referenceFile Relative or absolute path to the reference fasta file
optional arguments:
-h, --help show this help message and exit
-f, --force Force processing even when result files already exist
and are newer than inputs (default: False)
-v 0..5, --verbose 0..5
Verbose message level (0=no info, 5=lots) (default: 1)
--version show program's version number and exit
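Example: the -f option overrides a freshness check that most commands share. The rule "result files already exist and are newer than inputs" can be approximated with the shell's -nt file test. This is only a sketch under that reading; the helper name needs_rebuild is hypothetical and the pipeline's own check may differ in detail.

```shell
# needs_rebuild OUTPUT INPUT...
# Succeeds (returns 0) when OUTPUT is missing or any INPUT is newer than it.
# (Hypothetical helper illustrating the rule that -f overrides.)
needs_rebuild() {
    output=$1
    shift
    [ -e "$output" ] || return 0
    for input in "$@"; do
        # -nt is true when the left file was modified more recently
        [ "$input" -nt "$output" ] && return 0
    done
    return 1
}

# Possible use, with placeholder file names:
#   if needs_rebuild someResultFile referenceFile; then
#       echo "result is stale; the command would do its work without -f"
#   fi
```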
map_reads
usage: cfsan_snp_pipeline map_reads [-h] [-f] [-v 0..5] [--version]
referenceFile sampleFastqFile1
[sampleFastqFile2]
Align the sequence reads for a specified sample to a specified reference
genome. The output is written to the file "reads.sam" in the sample directory.
positional arguments:
referenceFile Relative or absolute path to the reference fasta file
sampleFastqFile1 Relative or absolute path to the fastq file
sampleFastqFile2 Optional relative or absolute path to the mate fastq
file, if paired (default: None)
optional arguments:
-h, --help show this help message and exit
-f, --force Force processing even when result files already exist
and are newer than inputs (default: False)
-v 0..5, --verbose 0..5
Verbose message level (0=no info, 5=lots) (default: 1)
--version show program's version number and exit
call_sites
usage: cfsan_snp_pipeline call_sites [-h] [-f] [-v 0..5] [--version]
referenceFile sampleDir
Find the sites with high-confidence SNPs in a sample.
positional arguments:
referenceFile Relative or absolute path to the reference fasta file
sampleDir Relative or absolute path to the sample directory
optional arguments:
-h, --help show this help message and exit
-f, --force Force processing even when result files already exist
and are newer than inputs (default: False)
-v 0..5, --verbose 0..5
Verbose message level (0=no info, 5=lots) (default: 1)
--version show program's version number and exit
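Example: map_reads and call_sites both operate on one sample at a time, so they are typically run in a loop over the sample directories after index_ref has been run once. The sketch below prints the commands by default (DRY_RUN=1) instead of executing them. The helper names and the *_1.fastq / *_2.fastq naming convention are assumptions made for illustration; adapt them to your own file layout.

```shell
# step CMD ARGS...: echo the command when DRY_RUN=1 (the default), else run it.
step() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"
    else
        "$@" < /dev/null   # keep the command from consuming the directory list
    fi
}

# run_per_sample_steps REFERENCE_FILE SAMPLE_DIRS_FILE
# SAMPLE_DIRS_FILE lists one sample directory per line.
run_per_sample_steps() {
    reference=$1
    sample_dirs_file=$2
    while IFS= read -r sample_dir; do
        [ -n "$sample_dir" ] || continue
        # Assumed naming: <name>_1.fastq plus an optional mate <name>_2.fastq
        fastq1=$(ls "$sample_dir"/*_1.fastq 2>/dev/null | head -n 1)
        fastq2=$(ls "$sample_dir"/*_2.fastq 2>/dev/null | head -n 1)
        if [ -z "$fastq1" ]; then
            echo "no fastq file found in $sample_dir" >&2
            continue
        fi
        step cfsan_snp_pipeline map_reads "$reference" "$fastq1" ${fastq2:+"$fastq2"}
        step cfsan_snp_pipeline call_sites "$reference" "$sample_dir"
    done < "$sample_dirs_file"
}
```

Running with DRY_RUN=0 executes the same commands instead of printing them.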
filter_regions
usage: cfsan_snp_pipeline filter_regions [-h] [-f] [-n NAME] [-l EDGE_LENGTH]
[-w WINDOW_SIZE] [-m MAX_NUM_SNPs]
[-g OUT_GROUP] [-v 0..5] [--version]
sampleDirsFile refFastaFile
Remove abnormally dense SNPs from the input VCF file, save the retained SNPs
into a new VCF file, and save the removed SNPs into another VCF file.
positional arguments:
sampleDirsFile Relative or absolute path to file containing a list of
directories -- one per sample
refFastaFile Relative or absolute path to the reference fasta file
optional arguments:
-h, --help show this help message and exit
-f, --force Force processing even when result files already exist
and are newer than inputs (default: False)
-n NAME, --vcfname NAME
File name of the input VCF files which must exist in
each of the sample directories (default: var.flt.vcf)
-l EDGE_LENGTH, --edge_length EDGE_LENGTH
The length of the edge regions in a contig, in which
all SNPs will be removed. (default: 500)
-w WINDOW_SIZE, --window_size WINDOW_SIZE
The length of the window in which the number of SNPs
should be no more than the --max_snp value. (default: 1000)
-m MAX_NUM_SNPs, --max_snp MAX_NUM_SNPs
The maximum number of SNPs allowed in a window.
(default: 3)
-g OUT_GROUP, --out_group OUT_GROUP
Relative or absolute path to the file indicating
outgroup samples, one sample ID per line. (default:
None)
-v 0..5, --verbose 0..5
Verbose message level (0=no info, 5=lots) (default: 1)
--version show program's version number and exit
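Example: the interaction of --edge_length, --window_size, and --max_snp can be illustrated on a plain list of SNP positions for a single contig. The sketch below is one plausible reading of the option descriptions, not the pipeline's implementation: it ignores VCF parsing, multiple samples, and outgroups, and the helper name classify_snps is hypothetical.

```shell
# classify_snps CONTIG_LENGTH [EDGE_LENGTH] [WINDOW_SIZE] [MAX_SNPS]
# Reads ascending SNP positions (one per line, single contig) on stdin and
# prints "<position> <keep|edge|dense>". Defaults mirror the documented ones.
classify_snps() {
    awk -v len="$1" -v edge="${2:-500}" -v win="${3:-1000}" -v max="${4:-3}" '
        { pos[++n] = $1 + 0 }
        END {
            len += 0; edge += 0; win += 0; max += 0
            # If max+1 consecutive SNPs fit inside one window, that window
            # holds more than max SNPs, so mark all of them as dense.
            for (i = 1; i + max <= n; i++)
                if (pos[i + max] - pos[i] < win)
                    for (j = i; j <= i + max; j++) dense[j] = 1
            for (i = 1; i <= n; i++) {
                label = "keep"
                if (pos[i] <= edge || pos[i] > len - edge) label = "edge"
                else if (i in dense) label = "dense"
                print pos[i], label
            }
        }'
}

# Possible use for a 10000-base contig with the default thresholds:
#   printf "%s\n" 100 2000 2100 2200 2300 5000 | classify_snps 10000
```

With the defaults, four SNPs within 1000 bases exceed the limit of 3 and are all marked dense, while three SNPs in the same span would be kept.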
merge_sites
usage: cfsan_snp_pipeline merge_sites [-h] [-f] [-n NAME] [--maxsnps INT]
[-o FILE] [-v 0..5] [--version]
sampleDirsFile filteredSampleDirsFile
Combine the SNP positions across all samples into a single unified SNP list
file identifying the positions and sample names where SNPs were called.
positional arguments:
sampleDirsFile Relative or absolute path to file containing a list of
directories -- one per sample
filteredSampleDirsFile
Relative or absolute path to the output file that will
be created containing the filtered list of sample
directories -- one per sample. The samples in this
file are those without an excessive number of snps.
See the --maxsnps parameter.
optional arguments:
-h, --help show this help message and exit
-f, --force Force processing even when result file already exists
and is newer than inputs (default: False)
-n NAME, --vcfname NAME
File name of the VCF files which must exist in each of
the sample directories (default: var.flt.vcf)
--maxsnps INT Exclude samples having more than this maximum allowed
number of SNPs. Set to -1 to disable this function.
(default: -1)
-o FILE, --output FILE
Output file. Relative or absolute path to the SNP list
file (default: snplist.txt)
-v 0..5, --verbose 0..5
Verbose message level (0=no info, 5=lots) (default: 1)
--version show program's version number and exit
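Example: to preview which samples a given --maxsnps value might exclude, you can count the records in each sample's VCF file yourself. This sketch assumes every non-header line in the VCF counts as one SNP, which may not match how the pipeline counts; the helper names are hypothetical.

```shell
# count_snps VCF_FILE: number of non-header, non-empty lines in a VCF file.
count_snps() {
    awk '!/^#/ && NF { n++ } END { print n + 0 }' "$1"
}

# list_samples_over_limit SAMPLE_DIRS_FILE MAX_SNPS [VCF_NAME]
# Prints "<sample dir> <count>" for each sample with more than MAX_SNPS records.
list_samples_over_limit() {
    max=$2
    vcf_name=${3:-var.flt.vcf}
    # A negative limit (the documented default is -1) disables the check
    [ "$max" -ge 0 ] || return 0
    while IFS= read -r dir; do
        [ -f "$dir/$vcf_name" ] || continue
        n=$(count_snps "$dir/$vcf_name")
        if [ "$n" -gt "$max" ]; then
            echo "$dir $n"
        fi
    done < "$1"
    return 0
}

# Possible use:
#   list_samples_over_limit sampleDirectories.txt 1000
```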
call_consensus
usage: cfsan_snp_pipeline call_consensus [-h] [-f] [-l FILE] [-e FILE]
[-o FILE] [-q INT] [-c FREQ] [-d INT]
[-b FREQ] [--vcfFileName NAME]
[--vcfRefName NAME] [--vcfAllPos]
[--vcfPreserveRefCase]
[--vcfFailedSnpGt {.,0,1}] [-v 0..5]
[--version]
allPileupFile
Call the consensus base for a sample at the specified positions where high-
confidence SNPs were previously called in any of the samples. Generates a
single-sequence fasta file with one base per specified position.
positional arguments:
allPileupFile Relative or absolute path to the genome-wide pileup
file for this sample.
optional arguments:
-h, --help show this help message and exit
-f, --force Force processing even when result file already exists
and is newer than inputs. (default: False)
-l FILE, --snpListFile FILE
Relative or absolute path to the SNP list file across
all samples. (default: snplist.txt)
-e FILE, --excludeFile FILE
VCF file of positions to exclude. (default: None)
-o FILE, --output FILE
Output file. Relative or absolute path to the
consensus fasta file for this sample. (default:
consensus.fasta)
-q INT, --minBaseQual INT
Minimum base quality score to count a read. All other
snp filters take effect after the low-quality reads
are discarded. (default: 0)
-c FREQ, --minConsFreq FREQ
Consensus frequency. Minimum fraction of high-quality
reads supporting the consensus to make a call.
(default: 0.6)
-d INT, --minConsStrdDpth INT
Consensus strand depth. Minimum number of high-quality
reads supporting the consensus which must be present
on both the forward and reverse strands to make a
call. (default: 0)
-b FREQ, --minConsStrdBias FREQ
Strand bias. Minimum fraction of the high-quality
consensus-supporting reads which must be present on
both the forward and reverse strands to make a call.
The numerator of this fraction is the number of high-
quality consensus-supporting reads on one strand. The
denominator is the total number of high-quality
consensus-supporting reads on both strands combined.
(default: 0)
--vcfFileName NAME VCF Output file name. If specified, a VCF file with
this file name will be created in the same directory
as the consensus fasta file for this sample. (default:
None)
--vcfRefName NAME Name of the reference file. This is only used in the
generated VCF file header. (default: Unknown
reference)
--vcfAllPos Flag to cause VCF file generation at all positions,
not just the snp positions. This has no effect on the
consensus fasta file, it only affects the VCF file.
This capability is intended primarily as a diagnostic
tool and enabling this flag will greatly increase
execution time. (default: False)
--vcfPreserveRefCase Flag to cause the VCF file generator to emit each
reference base in uppercase/lowercase as it appears in
the reference sequence file. If not specified, the
reference base is emitted in uppercase. (default:
False)
--vcfFailedSnpGt {.,0,1}
Controls the VCF file GT data element when a snp fails
filters. Possible values: .) The GT element will be a
dot, indicating unable to make a call. 0) The GT
element will be 0, indicating the reference base. 1)
The GT element will be the ALT index of the most
commonly occurring base, usually 1. (default: .)
-v 0..5, --verbose 0..5
Verbose message level (0=no info, 5=lots) (default: 1)
--version show program's version number and exit
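Example: the three consensus thresholds (--minConsFreq, --minConsStrdDpth, --minConsStrdBias) can be worked through by hand for a single position. The sketch below follows the option descriptions above and assumes "both strands" means the strand with fewer supporting reads must meet the threshold; it is not the pipeline's code, and the helper name consensus_passes is hypothetical.

```shell
# consensus_passes TOTAL FWD REV [MIN_FREQ] [MIN_STRAND_DEPTH] [MIN_STRAND_BIAS]
#   TOTAL     high-quality reads covering the position
#   FWD, REV  high-quality reads supporting the consensus on each strand
# Prints "pass" or "fail: <option that rejected the call>".
consensus_passes() {
    awk -v total="$1" -v fwd="$2" -v rev="$3" \
        -v min_freq="${4:-0.6}" -v min_depth="${5:-0}" -v min_bias="${6:-0}" '
        BEGIN {
            total += 0; fwd += 0; rev += 0
            support = fwd + rev
            weaker = (fwd < rev) ? fwd : rev     # strand with fewer reads
            if (total == 0 || support == 0) { print "fail: no supporting reads"; exit }
            if (support / total < min_freq)  { print "fail: minConsFreq"; exit }
            if (weaker < min_depth)          { print "fail: minConsStrdDpth"; exit }
            if (weaker / support < min_bias) { print "fail: minConsStrdBias"; exit }
            print "pass"
        }'
}

# Possible use: 30 reads, 8 forward and 12 reverse supporting the consensus
#   consensus_passes 30 8 12
```

For instance, with 20 high-quality reads of which 18 forward and 2 reverse support the consensus, a strand bias threshold of 0.25 rejects the call because 2/20 = 0.1.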
merge_vcfs
usage: cfsan_snp_pipeline merge_vcfs [-h] [-f] [-n NAME] [-o FILE] [-v 0..5]
[--version]
sampleDirsFile
Merge the consensus vcf files from all samples into a single multi-vcf file
for all samples.
positional arguments:
sampleDirsFile Relative or absolute path to file containing a list of
directories -- one per sample
optional arguments:
-h, --help show this help message and exit
-f, --force Force processing even when result files already exist
and are newer than inputs (default: False)
-n NAME, --vcfname NAME
File name of the vcf files which must exist in each of
the sample directories (default: consensus.vcf)
-o FILE, --output FILE
Output file. Relative or absolute path to the merged
multi-vcf file (default: snpma.vcf)
-v 0..5, --verbose 0..5
Verbose message level (0=no info, 5=lots) (default: 1)
--version show program's version number and exit
snp_matrix
usage: cfsan_snp_pipeline snp_matrix [-h] [-f] [-c NAME] [-o FILE] [-v 0..5]
[--version]
sampleDirsFile
Create the SNP matrix containing the consensus base for each of the samples at
the positions where high-confidence SNPs were found in any of the samples. The
matrix contains one row per sample and one column per SNP position. Non-SNP
positions are not included in the matrix. The matrix is formatted as a fasta
file, with each sequence (all of identical length) holding the bases at the
SNP positions for the correspondingly named sample.
positional arguments:
sampleDirsFile Relative or absolute path to file containing a list of
directories -- one per sample
optional arguments:
-h, --help show this help message and exit
-f, --force Force processing even when result file already exists
and is newer than inputs (default: False)
-c NAME, --consFileName NAME
File name of the previously created consensus SNP call
file which must exist in each of the sample
directories (default: consensus.fasta)
-o FILE, --output FILE
Output file. Relative or absolute path to the SNP
matrix file (default: snpma.fasta)
-v 0..5, --verbose 0..5
Verbose message level (0=no info, 5=lots) (default: 1)
--version show program's version number and exit
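Example: because every sequence in the SNP matrix should have the same length, a quick sanity check on the output file is easy to script. This is a hedged sketch using awk; the helper name check_matrix_lengths is hypothetical and the check is not something the pipeline itself is documented to perform.

```shell
# check_matrix_lengths FASTA_FILE
# Prints "<sequence name> <length>" per sequence and returns non-zero
# if the sequences are not all the same length.
check_matrix_lengths() {
    awk '
        function report() {
            print name, len
            if (!seen) { first = len; seen = 1 }
            else if (len != first) bad = 1
        }
        /^>/ { if (name != "") report(); name = substr($1, 2); len = 0; next }
        { len += length($0) }
        END { if (name != "") report(); exit bad + 0 }
    ' "$1"
}

# Possible use on the default output file name:
#   check_matrix_lengths snpma.fasta || echo "sequence lengths differ" >&2
```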
distance
usage: cfsan_snp_pipeline distance [-h] [-f] [-p FILE] [-m FILE] [-v 0..5]
[--version]
snpMatrixFile
Calculate pairwise SNP distances from the multi-fasta SNP matrix. Generates a
file of pairwise distances and a file containing a matrix of distances.
positional arguments:
snpMatrixFile Relative or absolute path to the input multi-fasta SNP
matrix file.
optional arguments:
-h, --help show this help message and exit
-f, --force Force processing even when result file already exists
and is newer than inputs (default: False)
-p FILE, --pairs FILE
Relative or absolute path to the pairwise distance
output file. (default: None)
-m FILE, --matrix FILE
Relative or absolute path to the distance matrix
output file. (default: None)
-v 0..5, --verbose 0..5
Verbose message level (0=no info, 5=lots) (default: 1)
--version show program's version number and exit
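Example: a pairwise SNP distance is, in essence, the number of matrix columns at which two samples differ. The sketch below computes that from a multi-fasta file with awk. How the distance command treats gaps and ambiguous bases is not stated here, so this sketch assumes that only positions where both samples have A, C, G, or T are compared; the output layout and the helper name pairwise_distances are likewise illustrative, not the command's actual format.

```shell
# pairwise_distances FASTA_FILE
# Prints "<sample1> <sample2> <distance>" for every unordered pair of sequences.
pairwise_distances() {
    awk '
        /^>/  { name[++n] = substr($1, 2); next }
        n > 0 { seq[n] = seq[n] toupper($0) }      # join wrapped sequence lines
        END {
            for (i = 1; i <= n; i++)
                for (j = i + 1; j <= n; j++) {
                    d = 0
                    for (k = 1; k <= length(seq[i]); k++) {
                        a = substr(seq[i], k, 1)
                        b = substr(seq[j], k, 1)
                        # Assumption: skip gaps, N, and other non-ACGT symbols
                        if (a != b && a ~ /[ACGT]/ && b ~ /[ACGT]/) d++
                    }
                    print name[i], name[j], d
                }
        }' "$1"
}

# Possible use on the default SNP matrix file name:
#   pairwise_distances snpma.fasta
```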
snp_reference
usage: cfsan_snp_pipeline snp_reference [-h] [-f] [-l FILE] [-o FILE]
[-v 0..5] [--version]
referenceFile
Write reference sequence bases at SNP locations to a fasta file.
positional arguments:
referenceFile Relative or absolute path to the reference bases file
in fasta format
optional arguments:
-h, --help show this help message and exit
-f, --force Force processing even when result file already exists
and is newer than inputs (default: False)
-l FILE, --snpListFile FILE
Relative or absolute path to the SNP list file
(default: snplist.txt)
-o FILE, --output FILE
Output file. Relative or absolute path to the SNP
reference sequence file (default: referenceSNP.fasta)
-v 0..5, --verbose 0..5
Verbose message level (0=no info, 5=lots) (default: 1)
--version show program's version number and exit
collect_metrics
usage: cfsan_snp_pipeline collect_metrics [-h] [-f] [-o FILE] [-m INT]
[-c NAME] [-C NAME] [-v NAME]
[-V NAME] [--verbose 0..5]
[--version]
sampleDir referenceFile
Collect alignment, coverage, and variant metrics for a single specified
sample.
positional arguments:
sampleDir Relative or absolute path to the sample directory
referenceFile Relative or absolute path to the reference fasta file
optional arguments:
-h, --help show this help message and exit
-f, --force Force processing even when result files already exist
and are newer than inputs (default: False)
-o FILE, --output FILE
Output file. Relative or absolute path to the metrics
file (default: metrics)
-m INT, --maxsnps INT
Maximum allowed number of SNPs per sample (default:
-1)
-c NAME File name of the consensus fasta file which must exist
in the sample directory (default: consensus.fasta)
-C NAME File name of the consensus preserved fasta file which
must exist in the sample directory (default:
consensus_preserved.fasta)
-v NAME File name of the consensus vcf file which must exist
in the sample directory (default: consensus.vcf)
-V NAME File name of the consensus preserved vcf file which
must exist in the sample directory (default:
consensus_preserved.vcf)
--verbose 0..5 Verbose message level (0=no info, 5=lots) (default: 1)
--version show program's version number and exit
combine_metrics
usage: cfsan_snp_pipeline combine_metrics [-h] [-f] [-n NAME] [-o FILE] [-s]
[-v 0..5] [--version]
sampleDirsFile
Combine the metrics from all samples into a single table of metrics for all
samples. The output is a tab-separated-values file with a row for each sample
and a column for each metric. Before running this command, the metrics for
each sample must be created with the collect_metrics command.
positional arguments:
sampleDirsFile Relative or absolute path to file containing a list of
directories -- one per sample
optional arguments:
-h, --help show this help message and exit
-f, --force Force processing even when result files already exist
and are newer than inputs (default: False)
-n NAME, --metrics NAME
File name of the metrics files which must exist in
each of the sample directories. (default: metrics)
-o FILE, --output FILE
Output file. Relative or absolute path to the combined
metrics file. (default: metrics.tsv)
-s, --spaces Emit column headings with spaces instead of
underscores (default: False)
-v 0..5, --verbose 0..5
Verbose message level (0=no info, 5=lots) (default: 1)
--version show program's version number and exit