Command Reference
cfsan_snp_pipeline
usage: cfsan_snp_pipeline [-h] [--version] subcommand ...
The CFSAN SNP Pipeline is a collection of tools using reference-based
alignments to call SNPs for a set of samples.
positional arguments:
subcommand
run This do-it-all script runs all the pipeline steps
data Copy included data to a specified directory
index_ref Index the reference
map_reads Align reads to the reference
call_sites Find the sites with high-confidence SNPs in a sample
filter_regions Remove abnormally dense SNPs from all samples
merge_sites Prepare the list of sites having SNPs
call_consensus Call the consensus base at high-confidence sites
merge_vcfs Merge the per-sample VCF files
snp_matrix Create a matrix of SNPs
distance Calculate the SNP distances between samples
snp_reference Write reference bases at SNP locations to a fasta file
collect_metrics Collect quality and SNP metrics for a sample
combine_metrics Merge the per-sample metrics
purge Purge the intermediate output files
optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit
run
usage: cfsan_snp_pipeline run [-h] [-f] [-m MODE] [-c FILE]
[-Q grid|slurm|torque] [-o DIR]
(-s DIR | -S FILE) [--purge] [-v 0..5]
[--version]
referenceFile
Run the SNP Pipeline on a specified data set.
positional arguments:
referenceFile Relative or absolute path to the reference fasta file
optional arguments:
-h, --help show this help message and exit
-f, --force Force processing even when result files already exist
and are newer than inputs (default: False)
-m MODE, --mirror MODE
Create a mirror copy of the reference directory and
all the sample directories. Use this option to avoid
polluting the reference directory and sample
directories with the intermediate files generated by
the snp pipeline. A "reference" subdirectory and a
"samples" subdirectory are created under the output
directory (see the -o option). One directory per
sample is created under the "samples" directory.
Three suboptions allow a choice of how the reference
and samples are mirrored.
-m soft : creates soft links to the fasta and fastq
files instead of copying
-m hard : creates hard links to the fasta and fastq
files instead of copying
-m copy : copies the fasta and fastq files
(default: None)
-c FILE, --conf FILE Relative or absolute path to a configuration file for
overriding defaults and defining extra parameters for
the tools and scripts within the pipeline.
Note: A default parameter configuration file named
"snppipeline.conf" is used whenever the pipeline
is run without the -c option. The configuration
file used for each run is copied into the log
directory, capturing the parameters used during
the run. (default: None)
-Q grid|slurm|torque, --queue_mgr grid|slurm|torque
Job queue manager for remote parallel job execution in
an HPC environment. Currently "grid", "slurm", and
"torque" are supported. If not specified, the pipeline
will execute locally. (default: None)
-o DIR, --out_dir DIR
Output directory for the result files. Additional
subdirectories are automatically created under the
output directory for log files and the mirrored
samples and reference files (see the -m option). The
output directory will be created if it does not
already exist. If not specified, the output files are
written to the current working directory. If you re-
run the pipeline on previously processed samples, and
specify a different output directory, the pipeline
will not rebuild everything unless you either force a
rebuild (see the -f option) or you request mirrored
inputs (see the -m option). (default: .)
-s DIR, --samples_dir DIR
Relative or absolute path to the parent directory of
all the sample directories. The -s option should be
used when all the sample directories are in
subdirectories immediately below a parent directory.
Note: You must specify either the -s or -S option, but
not both.
Note: The specified directory should contain only a
collection of sample directories, nothing else.
Note: Unless you request mirrored inputs (see the
-m option), additional files will be written to
each of the sample directories during the
execution of the SNP Pipeline. (default: None)
-S FILE, --samples_file FILE
Relative or absolute path to a file listing all of the
sample directories. The -S option should be used when
the samples are not under a common parent directory.
Note: If you are not mirroring the samples (see the
-m option), you can improve parallel processing
performance by sorting the list of
directories descending by size, largest first.
The -m option automatically generates a sorted
directory list.
Note: You must specify either the -s or -S option, but
not both.
Note: Unless you request mirrored inputs (see the
-m option), additional files will be written to
each of the sample directories during the
execution of the SNP Pipeline. (default: None)
--purge Purge the intermediate output files (the entire
"samples" directory) when the pipeline completes
successfully. (default: False)
-v 0..5, --verbose 0..5
Verbose message level (0=no info, 5=lots) (default: 1)
--version show program's version number and exit
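The -S notes above recommend listing the sample directories largest first when the inputs are not mirrored. The following is a minimal Python sketch of preparing such a file and the matching argument list. The helper names (directory_size, write_sorted_samples_file, build_run_command) are illustrative and not part of the pipeline; only the run subcommand and the -S and -o options come from the usage text above. "Size" is assumed here to mean the total bytes of the files under each directory.

```python
import os


def directory_size(path):
    """Total size in bytes of all files found under path (assumed meaning of 'size')."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


def write_sorted_samples_file(sample_dirs, samples_file):
    """Write one sample directory per line, largest directory first."""
    ordered = sorted(sample_dirs, key=directory_size, reverse=True)
    with open(samples_file, "w") as handle:
        for sample_dir in ordered:
            handle.write(sample_dir + "\n")
    return ordered


def build_run_command(reference_file, samples_file, out_dir):
    """Argument list for the run subcommand, using only documented options."""
    return ["cfsan_snp_pipeline", "run", "-S", samples_file,
            "-o", out_dir, reference_file]
```

The returned list can be handed to subprocess.run once the pipeline is installed; the sketch itself never launches the pipeline.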
data
usage: cfsan_snp_pipeline data [-h] [--version] whichData [destDirectory]
Copy data included with the CFSAN SNP Pipeline to a specified directory.
positional arguments:
whichData Which of the supplied data sets to copy. The choices are:
lambdaVirusInputs : Input reference and fastq files
lambdaVirusExpectedResults : Expected results files
agonaInputs : Input reference file
agonaExpectedResults : Expected results files
listeriaInputs : Input reference file
listeriaExpectedResults : Expected results files
configurationFile : File of parameters to customize the
SNP pipeline
Note: the lambda virus data set is complete with input data and expected
results. The agona and listeria data sets have the reference genome and
the expected results, but not the input fastq files, because the files are
too large to include with the package.
destDirectory Destination directory into which the SNP pipeline data files will be copied.
The data files are copied into the destination directory if the directory
already exists. Otherwise the destination directory is created and the
data files are copied there. (default: current directory)
optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit
Example:
# create a new directory "testLambdaVirus" and copy the Lambda virus input data there
$ cfsan_snp_pipeline data lambdaVirusInputs testLambdaVirus
index_ref
usage: cfsan_snp_pipeline index_ref [-h] [-f] [-v 0..5] [--version]
referenceFile
Index the reference genome for subsequent read mapping, and create the faidx
index file for subsequent pileups. The output is written to the reference
directory.
positional arguments:
referenceFile Relative or absolute path to the reference fasta file
optional arguments:
-h, --help show this help message and exit
-f, --force Force processing even when result files already exist
and are newer than inputs (default: False)
-v 0..5, --verbose 0..5
Verbose message level (0=no info, 5=lots) (default: 1)
--version show program's version number and exit
map_reads
usage: cfsan_snp_pipeline map_reads [-h] [-f] [-v 0..5] [--threads INT]
[--version]
referenceFile sampleFastqFile1
[sampleFastqFile2]
Align the sequence reads for a specified sample to a specified reference
genome. The reads are sorted, duplicates marked, and realigned around indels.
The output is written to the file "reads.sorted.deduped.indelrealigned.bam" in
the sample directory.
positional arguments:
referenceFile Relative or absolute path to the reference fasta file
sampleFastqFile1 Relative or absolute path to the fastq file
sampleFastqFile2 Optional relative or absolute path to the mate fastq
file, if paired (default: None)
optional arguments:
-h, --help show this help message and exit
-f, --force Force processing even when result files already exist
and are newer than inputs (default: False)
-v 0..5, --verbose 0..5
Verbose message level (0=no info, 5=lots) (default: 1)
--threads INT Number of CPU cores to use (default: 8)
--version show program's version number and exit
call_sites
usage: cfsan_snp_pipeline call_sites [-h] [-f] [-v 0..5] [--version]
referenceFile sampleDir
Find the sites with high-confidence SNPs in a sample.
positional arguments:
referenceFile Relative or absolute path to the reference fasta file
sampleDir Relative or absolute directory of the sample
optional arguments:
-h, --help show this help message and exit
-f, --force Force processing even when result files already exist
and are newer than inputs (default: False)
-v 0..5, --verbose 0..5
Verbose message level (0=no info, 5=lots) (default: 1)
--version show program's version number and exit
filter_regions
usage: cfsan_snp_pipeline filter_regions [-h] [-f] [-n NAME] [-l EDGE_LENGTH]
[-w [WINDOW_SIZE [WINDOW_SIZE ...]]]
[-m [MAX_NUM_SNPs [MAX_NUM_SNPs ...]]]
[-g OUT_GROUP] [-M {all,each}]
[-v 0..5] [--version]
sampleDirsFile refFastaFile
Remove abnormally dense SNPs from the input VCF file, save the preserved SNPs
into a new VCF file, and save the removed SNPs into another VCF file.
positional arguments:
sampleDirsFile Relative or absolute path to file containing a list of
directories -- one per sample
refFastaFile Relative or absolute path to the reference fasta file
optional arguments:
-h, --help show this help message and exit
-f, --force Force processing even when result files already exist
and are newer than inputs (default: False)
-n NAME, --vcfname NAME
File name of the input VCF files which must exist in
each of the sample directories (default: var.flt.vcf)
-l EDGE_LENGTH, --edge_length EDGE_LENGTH
The length of the edge regions in a contig, in which
all SNPs will be removed. (default: 500)
-w [WINDOW_SIZE [WINDOW_SIZE ...]], --window_size [WINDOW_SIZE [WINDOW_SIZE ...]]
The length of the window in which the number of SNPs
should be no more than max_num_snp. (default: [1000])
-m [MAX_NUM_SNPs [MAX_NUM_SNPs ...]], --max_snp [MAX_NUM_SNPs [MAX_NUM_SNPs ...]]
The maximum number of SNPs allowed in a window.
(default: [3])
-g OUT_GROUP, --out_group OUT_GROUP
Relative or absolute path to the file indicating
outgroup samples, one sample ID per line. (default:
None)
-M {all,each}, --mode {all,each}
Control whether dense snp regions found in any sample
are filtered from all of the samples, or each sample
independently. (default: all)
-v 0..5, --verbose 0..5
Verbose message level (0=no info, 5=lots) (default: 1)
--version show program's version number and exit
You can filter SNPs more than once by specifying multiple window sizes and max
SNP counts, which are paired in order. For example, "-m 3 2 -w 1000 100" will
filter more than 3 SNPs in 1000 bases and also more than 2 SNPs in 100 bases.
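The pairing of window sizes with maximum SNP counts can be sketched in Python as follows. This is only an illustration of the idea, not the pipeline's implementation: the function name dense_snp_positions is hypothetical, and the window is assumed to be a half-open interval of the given length starting at each SNP position on a single contig. The real tool may place its windows differently and also applies the edge-length filter, which is omitted here.

```python
from bisect import bisect_left


def dense_snp_positions(positions, window_sizes=(1000,), max_snps=(3,)):
    """Return the set of SNP positions that fall in an overly dense window.

    window_sizes and max_snps are paired in order, like -w and -m.
    """
    positions = sorted(set(positions))
    dense = set()
    for window, limit in zip(window_sizes, max_snps):
        for i, start in enumerate(positions):
            # Index of the first SNP at or beyond start + window, so
            # positions[i:j] are the SNPs inside [start, start + window).
            j = bisect_left(positions, start + window)
            if j - i > limit:
                dense.update(positions[i:j])
    return dense


snps = [100, 200, 300, 400, 5000, 5010, 5020, 9000]
flagged = dense_snp_positions(snps, window_sizes=(1000, 100), max_snps=(3, 2))
```

In this example the first four SNPs exceed 3 per 1000 bases and the next three exceed 2 per 100 bases, so only position 9000 is left unflagged.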
merge_sites
usage: cfsan_snp_pipeline merge_sites [-h] [-f] [-n NAME] [--maxsnps INT]
[-o FILE] [-v 0..5] [--version]
sampleDirsFile filteredSampleDirsFile
Combine the SNP positions across all samples into a single unified SNP list
file identifying the positions and sample names where SNPs were called.
positional arguments:
sampleDirsFile Relative or absolute path to file containing a list of
directories -- one per sample
filteredSampleDirsFile
Relative or absolute path to the output file that will
be created containing the filtered list of sample
directories -- one per sample. The samples in this
file are those without an excessive number of snps.
See the --maxsnps parameter.
optional arguments:
-h, --help show this help message and exit
-f, --force Force processing even when result file already exists
and is newer than inputs (default: False)
-n NAME, --vcfname NAME
File name of the VCF files which must exist in each of
the sample directories (default: var.flt.vcf)
--maxsnps INT Exclude samples having more than this maximum allowed
number of SNPs. Set to -1 to disable this function.
(default: -1)
-o FILE, --output FILE
Output file. Relative or absolute path to the SNP list
file (default: snplist.txt)
-v 0..5, --verbose 0..5
Verbose message level (0=no info, 5=lots) (default: 1)
--version show program's version number and exit
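A minimal Python sketch of the merge follows, working on in-memory data rather than VCF files. The function name merge_snp_sites and the data layout (a dict mapping each sample name to a set of (contig, position) pairs) are assumptions made for illustration. The sketch also assumes that samples excluded by the maximum SNP count contribute nothing to the merged list, which the help text above does not state.

```python
def merge_snp_sites(sites_by_sample, max_snps=-1):
    """Merge per-sample SNP sites into one sorted list.

    Returns (kept_samples, snp_list) where snp_list holds
    (contig, position, [sample names]) tuples. max_snps=-1 disables
    the per-sample limit, mirroring the documented default.
    """
    kept = [sample for sample, sites in sites_by_sample.items()
            if max_snps == -1 or len(sites) <= max_snps]
    merged = {}
    for sample in kept:
        for site in sites_by_sample[sample]:
            merged.setdefault(site, []).append(sample)
    snp_list = [(contig, pos, sorted(samples))
                for (contig, pos), samples in sorted(merged.items())]
    return kept, snp_list
```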
call_consensus
usage: cfsan_snp_pipeline call_consensus [-h] [-f] [-l FILE] [-e FILE]
[-o FILE] [-q INT] [-c FREQ] [-D INT]
[-d INT] [-b FREQ]
[--vcfFileName NAME]
[--vcfRefName NAME] [--vcfAllPos]
[--vcfPreserveRefCase]
[--vcfFailedSnpGt {.,0,1}] [-v 0..5]
[--version]
allPileupFile
Call the consensus base for a sample at the specified positions where high-
confidence SNPs were previously called in any of the samples. Generates a
single-sequence fasta file with one base per specified position.
positional arguments:
allPileupFile Relative or absolute path to the genome-wide pileup
file for this sample.
optional arguments:
-h, --help show this help message and exit
-f, --force Force processing even when result file already exists
and is newer than inputs. (default: False)
-l FILE, --snpListFile FILE
Relative or absolute path to the SNP list file across
all samples. (default: snplist.txt)
-e FILE, --excludeFile FILE
VCF file of positions to exclude. (default: None)
-o FILE, --output FILE
Output file. Relative or absolute path to the
consensus fasta file for this sample. (default:
consensus.fasta)
-q INT, --minBaseQual INT
Minimum base quality score to count a read. All other
snp filters take effect after the low-quality reads
are discarded. (default: 0)
-c FREQ, --minConsFreq FREQ
Consensus frequency. Minimum fraction of high-quality
reads supporting the consensus to make a call.
(default: 0.6)
-D INT, --minConsDpth INT
Consensus depth. Minimum number of high-quality reads
supporting the consensus to make a call. (default: 1)
-d INT, --minConsStrdDpth INT
Consensus strand depth. Minimum number of high-quality
reads supporting the consensus which must be present
on both the forward and reverse strands to make a
call. (default: 0)
-b FREQ, --minConsStrdBias FREQ
Strand bias. Minimum fraction of the high-quality
consensus-supporting reads which must be present on
both the forward and reverse strands to make a call.
The numerator of this fraction is the number of high-
quality consensus-supporting reads on one strand. The
denominator is the total number of high-quality
consensus-supporting reads on both strands combined.
(default: 0)
--vcfFileName NAME VCF Output file name. If specified, a VCF file with
this file name will be created in the same directory
as the consensus fasta file for this sample. (default:
None)
--vcfRefName NAME Name of the reference file. This is only used in the
generated VCF file header. (default: Unknown
reference)
--vcfAllPos Flag to cause VCF file generation at all positions,
not just the snp positions. This has no effect on the
consensus fasta file, it only affects the VCF file.
This capability is intended primarily as a diagnostic
tool and enabling this flag will greatly increase
execution time. (default: False)
--vcfPreserveRefCase Flag to cause the VCF file generator to emit each
reference base in uppercase/lowercase as it appears in
the reference sequence file. If not specified, the
reference base is emitted in uppercase. (default:
False)
--vcfFailedSnpGt {.,0,1}
Controls the VCF file GT data element when a snp fails
filters. Possible values: .) The GT element will be a
dot, indicating unable to make a call. 0) The GT
element will be 0, indicating the reference base. 1)
The GT element will be the ALT index of the most
commonly occurring base, usually 1. (default: .)
-v 0..5, --verbose 0..5
Verbose message level (0=no info, 5=lots) (default: 1)
--version show program's version number and exit
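How the four consensus thresholds interact at a single position can be sketched in Python as follows. This is an illustration under stated assumptions, not the pipeline's code: the function name call_consensus_base is hypothetical, the per-strand base counts are assumed to already exclude reads below the minimum base quality, ties between bases are broken alphabetically, and a failed call is represented by None.

```python
def call_consensus_base(fwd_counts, rev_counts, min_cons_freq=0.6,
                        min_cons_dpth=1, min_cons_strd_dpth=0,
                        min_cons_strd_bias=0.0):
    """Return the consensus base at one position, or None if any filter fails.

    fwd_counts / rev_counts map base -> number of high-quality reads
    on the forward / reverse strand.
    """
    bases = sorted(set(fwd_counts) | set(rev_counts))
    totals = {b: fwd_counts.get(b, 0) + rev_counts.get(b, 0) for b in bases}
    depth = sum(totals.values())
    if depth == 0:
        return None
    base = max(bases, key=lambda b: totals[b])
    fwd = fwd_counts.get(base, 0)
    rev = rev_counts.get(base, 0)
    support = fwd + rev
    if support / depth < min_cons_freq:          # consensus frequency
        return None
    if support < min_cons_dpth:                  # consensus depth
        return None
    if min(fwd, rev) < min_cons_strd_dpth:       # depth required on each strand
        return None
    if min(fwd, rev) / support < min_cons_strd_bias:  # strand bias fraction
        return None
    return base
```

With 8 forward and 6 reverse reads supporting A out of 16 high-quality reads, the defaults call A, while a strand bias threshold of 0.5 rejects the call because 6/14 falls below it.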
merge_vcfs
usage: cfsan_snp_pipeline merge_vcfs [-h] [-f] [-n NAME] [-o FILE] [-v 0..5]
[--version]
sampleDirsFile
Merge the consensus vcf files from all samples into a single multi-vcf file
for all samples.
positional arguments:
sampleDirsFile Relative or absolute path to file containing a list of
directories -- one per sample
optional arguments:
-h, --help show this help message and exit
-f, --force Force processing even when result files already exist
and are newer than inputs (default: False)
-n NAME, --vcfname NAME
File name of the vcf files which must exist in each of
the sample directories (default: consensus.vcf)
-o FILE, --output FILE
Output file. Relative or absolute path to the merged
multi-vcf file (default: snpma.vcf)
-v 0..5, --verbose 0..5
Verbose message level (0=no info, 5=lots) (default: 1)
--version show program's version number and exit
snp_matrix
usage: cfsan_snp_pipeline snp_matrix [-h] [-f] [-c NAME] [-o FILE] [-v 0..5]
[--version]
sampleDirsFile
Create the SNP matrix containing the consensus base for each of the samples at
the positions where high-confidence SNPs were found in any of the samples. The
matrix contains one row per sample and one column per SNP position. Non-SNP
positions are not included in the matrix. The matrix is formatted as a fasta
file, with one sequence per sample (all of identical length) holding the bases
at the SNP positions in the correspondingly named sample.
positional arguments:
sampleDirsFile Relative or absolute path to file containing a list of
directories -- one per sample
optional arguments:
-h, --help show this help message and exit
-f, --force Force processing even when result file already exists
and is newer than inputs (default: False)
-c NAME, --consFileName NAME
File name of the previously created consensus SNP call
file which must exist in each of the sample
directories (default: consensus.fasta)
-o FILE, --output FILE
Output file. Relative or absolute path to the SNP
matrix file (default: snpma.fasta)
-v 0..5, --verbose 0..5
Verbose message level (0=no info, 5=lots) (default: 1)
--version show program's version number and exit
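The layout of the matrix can be sketched in Python as follows, starting from consensus strings already held in memory. The function name build_snp_matrix is hypothetical, and ordering the rows by sample name is an assumption; the help text above does not specify the row order.

```python
def build_snp_matrix(consensus_by_sample):
    """Return multi-fasta text with one equal-length sequence per sample."""
    lengths = {len(seq) for seq in consensus_by_sample.values()}
    if len(lengths) > 1:
        raise ValueError("all consensus sequences must have the same length")
    lines = []
    for sample in sorted(consensus_by_sample):
        lines.append(">" + sample)
        lines.append(consensus_by_sample[sample])
    return "".join(line + "\n" for line in lines)
```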
distance
usage: cfsan_snp_pipeline distance [-h] [-f] [-p FILE] [-m FILE] [-v 0..5]
[--version]
snpMatrixFile
Calculate pairwise SNP distances from the multi-fasta SNP matrix. Generates a
file of pairwise distances and a file containing a matrix of distances.
positional arguments:
snpMatrixFile Relative or absolute path to the input multi-fasta SNP
matrix file.
optional arguments:
-h, --help show this help message and exit
-f, --force Force processing even when result file already exists
and is newer than inputs (default: False)
-p FILE, --pairs FILE
Relative or absolute path to the pairwise distance
output file. (default: None)
-m FILE, --matrix FILE
Relative or absolute path to the distance matrix
output file. (default: None)
-v 0..5, --verbose 0..5
Verbose message level (0=no info, 5=lots) (default: 1)
--version show program's version number and exit
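A minimal Python sketch of computing pairwise distances from a multi-fasta SNP matrix follows. The function names are illustrative, and the handling of missing data is an assumption: positions where either sample has "-" or "N" are skipped, which may differ from what the pipeline does.

```python
from itertools import combinations


def read_fasta(text):
    """Parse multi-fasta text into a dict of name -> sequence."""
    seqs = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            seqs[name] = ""
        elif name is not None:
            seqs[name] += line
    return seqs


def snp_distance(seq_a, seq_b, skip="-N"):
    """Count differing positions, ignoring any position holding a skip character."""
    return sum(1 for a, b in zip(seq_a.upper(), seq_b.upper())
               if a != b and a not in skip and b not in skip)


def pairwise_distances(seqs):
    """Map each unordered pair of sample names to its SNP distance."""
    return {(a, b): snp_distance(seqs[a], seqs[b])
            for a, b in combinations(sorted(seqs), 2)}


def distance_matrix(seqs):
    """Return (names, square matrix of distances) in sorted name order."""
    names = sorted(seqs)
    return names, [[snp_distance(seqs[r], seqs[c]) for c in names]
                   for r in names]
```

The two return shapes correspond loosely to the two outputs described above: a list of pairs and a square matrix.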
snp_reference
usage: cfsan_snp_pipeline snp_reference [-h] [-f] [-l FILE] [-o FILE]
[-v 0..5] [--version]
referenceFile
Write reference sequence bases at SNP locations to a fasta file.
positional arguments:
referenceFile Relative or absolute path to the reference bases file
in fasta format
optional arguments:
-h, --help show this help message and exit
-f, --force Force processing even when result file already exists
and is newer than inputs (default: False)
-l FILE, --snpListFile FILE
Relative or absolute path to the SNP list file
(default: snplist.txt)
-o FILE, --output FILE
Output file. Relative or absolute path to the SNP
reference sequence file (default: referenceSNP.fasta)
-v 0..5, --verbose 0..5
Verbose message level (0=no info, 5=lots) (default: 1)
--version show program's version number and exit
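The extraction itself can be sketched in a few lines of Python. The function name snp_reference_fasta and the sequence name in the fasta header are hypothetical, and positions are assumed to be 1-based, as in VCF files; the layout of the pipeline's actual output file may differ.

```python
def snp_reference_fasta(reference_seqs, snp_sites, name="referenceSNP"):
    """Return fasta text holding the reference base at each SNP site.

    reference_seqs maps contig name -> sequence string.
    snp_sites is an ordered list of (contig, 1-based position) pairs.
    """
    bases = "".join(reference_seqs[contig][pos - 1]
                    for contig, pos in snp_sites)
    return ">" + name + "\n" + bases + "\n"
```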
collect_metrics
usage: cfsan_snp_pipeline collect_metrics [-h] [-f] [-o FILE] [-m INT]
[-c NAME] [-C NAME] [-v NAME]
[-V NAME] [--verbose 0..5]
[--version]
sampleDir referenceFile
Collect alignment, coverage, and variant metrics for a single specified
sample.
positional arguments:
sampleDir Relative or absolute directory of the sample
referenceFile Relative or absolute path to the reference fasta file
optional arguments:
-h, --help show this help message and exit
-f, --force Force processing even when result files already exist
and are newer than inputs (default: False)
-o FILE, --output FILE
Output file. Relative or absolute path to the metrics
file (default: metrics)
-m INT, --maxsnps INT
Maximum allowed number of SNPs per sample (default:
-1)
-c NAME File name of the consensus fasta file which must exist
in the sample directory (default: consensus.fasta)
-C NAME File name of the consensus preserved fasta file which
must exist in the sample directory (default:
consensus_preserved.fasta)
-v NAME File name of the consensus vcf file which must exist
in the sample directory (default: consensus.vcf)
-V NAME File name of the consensus preserved vcf file which
must exist in the sample directory (default:
consensus_preserved.vcf)
--verbose 0..5 Verbose message level (0=no info, 5=lots) (default: 1)
--version show program's version number and exit
combine_metrics
usage: cfsan_snp_pipeline combine_metrics [-h] [-f] [-n NAME] [-o FILE] [-s]
[-v 0..5] [--version]
sampleDirsFile
Combine the metrics from all samples into a single table of metrics for all
samples. The output is a tab-separated-values file with a row for each sample
and a column for each metric. Before running this command, the metrics for
each sample must be created with the collect_metrics command.
positional arguments:
sampleDirsFile Relative or absolute path to file containing a list of
directories -- one per sample
optional arguments:
-h, --help show this help message and exit
-f, --force Force processing even when result files already exist
and are newer than inputs (default: False)
-n NAME, --metrics NAME
File name of the metrics files which must exist in
each of the sample directories. (default: metrics)
-o FILE, --output FILE
Output file. Relative or absolute path to the combined
metrics file. (default: metrics.tsv)
-s, --spaces Emit column headings with spaces instead of
underscores (default: False)
-v 0..5, --verbose 0..5
Verbose message level (0=no info, 5=lots) (default: 1)
--version show program's version number and exit
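The shape of the combined table, including the effect of the -s option on headings, can be sketched in Python as follows. The function name combine_metrics_table and the metric names in the example are hypothetical, and the per-sample metrics are assumed to be parsed into dicts already, since the format of the per-sample metrics file is not described here.

```python
import csv
import io


def combine_metrics_table(metrics_by_sample, spaces=False):
    """Return tab-separated text with one row per sample, one column per metric."""
    columns = []
    for metrics in metrics_by_sample.values():
        for name in metrics:
            if name not in columns:
                columns.append(name)
    # Like -s: headings use spaces instead of underscores when requested.
    headings = [c.replace("_", " ") if spaces else c for c in columns]
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(headings)
    for metrics in metrics_by_sample.values():
        writer.writerow([metrics.get(c, "") for c in columns])
    return out.getvalue()


example = {
    "s1": {"Sample": "s1", "Average_Depth": "30.5"},
    "s2": {"Sample": "s2", "Average_Depth": "12.0"},
}
table = combine_metrics_table(example, spaces=True)
```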
purge
usage: cfsan_snp_pipeline purge [-h] [-v 0..5] [--version] work_dir
Purge the intermediate output files in the "samples" directory after a SNP
Pipeline run has completed without errors.
positional arguments:
work_dir Path to the working directory containing the "samples"
directory to be recursively deleted
optional arguments:
-h, --help show this help message and exit
-v 0..5, --verbose 0..5
Verbose message level (0=no info, 5=lots) (default: 1)
--version show program's version number and exit