Command Reference

cfsan_snp_pipeline

usage: cfsan_snp_pipeline [-h] [--version] subcommand        ...

The CFSAN SNP Pipeline is a collection of tools using reference-based
alignments to call SNPs for a set of samples.

positional arguments:
  subcommand
    run              This do-it-all script runs all the pipeline steps
    data             Copy included data to a specified directory
    index_ref        Index the reference
    map_reads        Align reads to the reference
    call_sites       Find the sites with high-confidence SNPs in a sample
    filter_regions   Remove abnormally dense SNPs from all samples
    merge_sites      Prepare the list of sites having SNPs
    call_consensus   Call the consensus base at high-confidence sites
    merge_vcfs       Merge the per-sample VCF files
    snp_matrix       Create a matrix of SNPs
    distance         Calculate the SNP distances between samples
    snp_reference    Write reference bases at SNP locations to a fasta file
    collect_metrics  Collect quality and SNP metrics for a sample
    combine_metrics  Merge the per-sample metrics
    purge            Purge the intermediate output files

optional arguments:
  -h, --help         show this help message and exit
  --version          show program's version number and exit

run

usage: cfsan_snp_pipeline run [-h] [-f] [-m MODE] [-c FILE] [-Q grid|torque]
                              [-o DIR] (-s DIR | -S FILE) [--purge] [-v 0..5]
                              [--version]
                              referenceFile

Run the SNP Pipeline on a specified data set.

positional arguments:
  referenceFile         Relative or absolute path to the reference fasta file

optional arguments:
  -h, --help            show this help message and exit
  -f, --force           Force processing even when result files already exist
                        and are newer than inputs (default: False)
  -m MODE, --mirror MODE
                        Create a mirror copy of the reference directory and
                        all the sample directories.  Use this option to avoid
                        polluting the reference directory and sample
                        directories with the intermediate files generated by
                        the snp pipeline.  A "reference" subdirectory and a
                        "samples" subdirectory are created under the output
                        directory (see the -o option).  One directory per
                        sample is created under the "samples" directory.
                        Three suboptions allow a choice of how the reference
                        and samples are mirrored.
                          -m soft : creates soft links to the fasta and fastq
                                    files instead of copying
                          -m hard : creates hard links to the fasta and fastq
                                    files instead of copying
                          -m copy : copies the fasta and fastq files
                        (default: None)
  -c FILE, --conf FILE  Relative or absolute path to a configuration file for
                        overriding defaults and defining extra parameters for
                        the tools and scripts within the pipeline.
                        Note: A default parameter configuration file named
                              "snppipeline.conf" is used whenever the pipeline
                              is run without the -c option.  The configuration
                              file used for each run is copied into the log
                              directory, capturing the parameters used during
                              the run. (default: None)
  -Q grid|torque, --queue_mgr grid|torque
                        Job queue manager for remote parallel job execution in
                        an HPC environment. Currently "grid" and "torque" are
                        supported. If not specified, the pipeline will execute
                        locally. (default: None)
  -o DIR, --out_dir DIR
                        Output directory for the result files. Additional
                        subdirectories are automatically created under the
                        output directory for log files and the mirrored
                        samples and reference files (see the -m option). The
                        output directory will be created if it does not
                        already exist. If not specified, the output files are
                        written to the current working directory. If you re-
                        run the pipeline on previously processed samples, and
                        specify a different output directory, the pipeline
                        will not rebuild everything unless you either force a
                        rebuild (see the -f option) or you request mirrored
                        inputs (see the -m option). (default: .)
  -s DIR, --samples_dir DIR
                        Relative or absolute path to the parent directory of
                        all the sample directories.  The -s option should be
                        used when all the sample directories are in
                        subdirectories immediately below a parent directory.
                        Note: You must specify either the -s or -S option, but
                              not both.
                        Note: The specified directory should contain only a
                              collection of sample directories, nothing else.
                        Note: Unless you request mirrored inputs (see the
                              -m option), additional files will be written to
                              each of the sample directories during the
                              execution of the SNP Pipeline (default: None)
  -S FILE, --samples_file FILE
                        Relative or absolute path to a file listing all of the
                        sample directories.  The -S option should be used when
                        the samples are not under a common parent directory.
                        Note: If you are not mirroring the samples (see the
                              -m option), you can improve parallel processing
                              performance by sorting the list of
                              directories descending by size, largest first.
                              The -m option automatically generates a sorted
                              directory list.
                        Note: You must specify either the -s or -S option, but
                              not both.
                        Note: Unless you request mirrored inputs (see the
                              -m option), additional files will be written to
                              each of the sample directories during the
                              execution of the SNP Pipeline (default: None)
  --purge               Purge the intermediate output files (the entire
                        "samples" directory) when the pipeline completes
                        successfully. (default: False)
  -v 0..5, --verbose 0..5
                        Verbose message level (0=no info, 5=lots) (default: 1)
  --version             show program's version number and exit
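
The note under -S recommends listing sample directories largest first when the
samples are not mirrored. The following is a minimal Python sketch of building
such a list. The function names are hypothetical and not part of the pipeline,
and "size" is assumed here to mean the total bytes of the files directly inside
each sample directory.

```python
import os


def directory_size(path):
    """Total size in bytes of the regular files directly inside path."""
    total = 0
    for entry in os.scandir(path):
        if entry.is_file():
            total += entry.stat().st_size
    return total


def write_sorted_sample_list(sample_dirs, list_path):
    """Write sample directories to list_path, one per line, largest first."""
    ordered = sorted(sample_dirs, key=directory_size, reverse=True)
    with open(list_path, "w") as handle:
        for sample_dir in ordered:
            handle.write(sample_dir + "\n")
    return ordered
```

The file written this way could then be passed to the run command with the -S
option.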

data

usage: cfsan_snp_pipeline data [-h] [--version] whichData [destDirectory]

Copy data included with the CFSAN SNP Pipeline to a specified directory.

positional arguments:
  whichData          Which of the supplied data sets to copy.  The choices are:
                         lambdaVirusInputs          : Input reference and fastq files
                         lambdaVirusExpectedResults : Expected results files
                         agonaInputs                : Input reference file
                         agonaExpectedResults       : Expected results files
                         listeriaInputs             : Input reference file
                         listeriaExpectedResults    : Expected results files
                         configurationFile          : File of parameters to customize the
                                                      SNP pipeline

                     Note: the lambda virus data set is complete with input data and expected
                     results.  The agona and listeria data sets have the reference genome and
                     the expected results, but not the input fastq files, because the files are
                     too large to include with the package.

  destDirectory      Destination directory into which the SNP pipeline data files will be copied.
                     The data files are copied into the destination directory if the directory
                     already exists.  Otherwise the destination directory is created and the
                     data files are copied there.  (default: current directory)

optional arguments:
  -h, --help     show this help message and exit
  --version      show program's version number and exit

Example:
# create a new directory "testLambdaVirus" and copy the Lambda virus input data there
$ cfsan_snp_pipeline data lambdaVirusInputs testLambdaVirus

index_ref

usage: cfsan_snp_pipeline index_ref [-h] [-f] [-v 0..5] [--version]
                                    referenceFile

Index the reference genome for subsequent read mapping, and create the faidx
index file for subsequent pileups. The output is written to the reference
directory.

positional arguments:
  referenceFile         Relative or absolute path to the reference fasta file

optional arguments:
  -h, --help            show this help message and exit
  -f, --force           Force processing even when result files already exist
                        and are newer than inputs (default: False)
  -v 0..5, --verbose 0..5
                        Verbose message level (0=no info, 5=lots) (default: 1)
  --version             show program's version number and exit
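
The -f option above implies that a step is normally skipped when its result
files already exist and are newer than its inputs. A minimal Python sketch of
that kind of freshness check follows. It assumes file modification times are
the basis for the comparison; the function name is hypothetical and this is not
the pipeline's own code.

```python
import os


def needs_rebuild(input_paths, output_paths, force=False):
    """Return True when outputs are missing, stale, or force is set."""
    if force or not output_paths:
        return True
    if not all(os.path.exists(path) for path in output_paths):
        return True
    # Compare the newest input against the oldest output.
    newest_input = max((os.path.getmtime(path) for path in input_paths),
                       default=float("-inf"))
    oldest_output = min(os.path.getmtime(path) for path in output_paths)
    return oldest_output < newest_input
```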

map_reads

usage: cfsan_snp_pipeline map_reads [-h] [-f] [-v 0..5] [--threads INT]
                                    [--version]
                                    referenceFile sampleFastqFile1
                                    [sampleFastqFile2]

Align the sequence reads for a specified sample to a specified reference
genome. The reads are sorted, duplicates marked, and realigned around indels.
The output is written to the file "reads.sorted.deduped.indelrealigned.bam" in
the sample directory.

positional arguments:
  referenceFile         Relative or absolute path to the reference fasta file
  sampleFastqFile1      Relative or absolute path to the fastq file
  sampleFastqFile2      Optional relative or absolute path to the mate fastq
                        file, if paired (default: None)

optional arguments:
  -h, --help            show this help message and exit
  -f, --force           Force processing even when result files already exist
                        and are newer than inputs (default: False)
  -v 0..5, --verbose 0..5
                        Verbose message level (0=no info, 5=lots) (default: 1)
  --threads INT         Number of CPU cores to use (default: 8)
  --version             show program's version number and exit

call_sites

usage: cfsan_snp_pipeline call_sites [-h] [-f] [-v 0..5] [--version]
                                     referenceFile sampleDir

Find the sites with high-confidence SNPs in a sample.

positional arguments:
  referenceFile         Relative or absolute path to the reference fasta file
  sampleDir             Relative or absolute directory of the sample

optional arguments:
  -h, --help            show this help message and exit
  -f, --force           Force processing even when result files already exist
                        and are newer than inputs (default: False)
  -v 0..5, --verbose 0..5
                        Verbose message level (0=no info, 5=lots) (default: 1)
  --version             show program's version number and exit

filter_regions

usage: cfsan_snp_pipeline filter_regions [-h] [-f] [-n NAME] [-l EDGE_LENGTH]
                                         [-w [WINDOW_SIZE [WINDOW_SIZE ...]]]
                                         [-m [MAX_NUM_SNPs [MAX_NUM_SNPs ...]]]
                                         [-g OUT_GROUP] [-M {all,each}]
                                         [-v 0..5] [--version]
                                         sampleDirsFile refFastaFile

Remove abnormally dense SNPs from the input VCF file, save the preserved SNPs
into a new VCF file, and save the removed SNPs into another VCF file.

positional arguments:
  sampleDirsFile        Relative or absolute path to file containing a list of
                        directories -- one per sample
  refFastaFile          Relative or absolute path to the reference fasta file

optional arguments:
  -h, --help            show this help message and exit
  -f, --force           Force processing even when result files already exist
                        and are newer than inputs (default: False)
  -n NAME, --vcfname NAME
                        File name of the input VCF files which must exist in
                        each of the sample directories (default: var.flt.vcf)
  -l EDGE_LENGTH, --edge_length EDGE_LENGTH
                        The length of the edge regions in a contig, in which
                        all SNPs will be removed. (default: 500)
  -w [WINDOW_SIZE [WINDOW_SIZE ...]], --window_size [WINDOW_SIZE [WINDOW_SIZE ...]]
                        The length of the window in which the number of SNPs
                        should be no more than max_num_snp. (default: [1000])
  -m [MAX_NUM_SNPs [MAX_NUM_SNPs ...]], --max_snp [MAX_NUM_SNPs [MAX_NUM_SNPs ...]]
                        The maximum number of SNPs allowed in a window.
                        (default: [3])
  -g OUT_GROUP, --out_group OUT_GROUP
                        Relative or absolute path to the file indicating
                        outgroup samples, one sample ID per line. (default:
                        None)
  -M {all,each}, --mode {all,each}
                        Control whether dense snp regions found in any sample
                        are filtered from all of the samples, or each sample
                        independently. (default: all)
  -v 0..5, --verbose 0..5
                        Verbose message level (0=no info, 5=lots) (default: 1)
  --version             show program's version number and exit

You can apply more than one density filter by specifying multiple window sizes
and SNP limits. For example, "-m 3 2 -w 1000 100" filters regions with more
than 3 SNPs in 1000 bases and also regions with more than 2 SNPs in 100 bases.
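
The pairing of window sizes with SNP limits can be illustrated with a short
Python sketch. This is a simplified illustration, not the pipeline's
implementation: the function name is hypothetical, it handles a single contig,
and it does not model the edge length, outgroup, or mode options.

```python
def dense_snp_positions(positions, window_sizes, max_snps):
    """Return the set of SNP positions lying in abnormally dense windows.

    positions    : SNP positions on one contig
    window_sizes : window lengths in bases, e.g. [1000, 100]
    max_snps     : per-window SNP limits, paired in order with window_sizes
    """
    ordered = sorted(positions)
    flagged = set()
    for window, limit in zip(window_sizes, max_snps):
        left = 0
        for right in range(len(ordered)):
            # Advance the left edge until the span fits inside the window.
            while ordered[right] - ordered[left] >= window:
                left += 1
            # More SNPs than allowed in this window: flag all of them.
            if right - left + 1 > limit:
                flagged.update(ordered[left:right + 1])
    return flagged


# Equivalent of "-m 3 2 -w 1000 100" on a toy set of positions.
print(sorted(dense_snp_positions([100, 110, 120, 5000, 9000],
                                 [1000, 100], [3, 2])))
# prints [100, 110, 120]
```

In this toy input the three clustered SNPs pass the 1000-base filter (3 is not
more than 3) but are caught by the 100-base filter.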

merge_sites

usage: cfsan_snp_pipeline merge_sites [-h] [-f] [-n NAME] [--maxsnps INT]
                                      [-o FILE] [-v 0..5] [--version]
                                      sampleDirsFile filteredSampleDirsFile

Combine the SNP positions across all samples into a single unified SNP list
file identifying the positions and sample names where SNPs were called.

positional arguments:
  sampleDirsFile        Relative or absolute path to file containing a list of
                        directories -- one per sample
  filteredSampleDirsFile
                        Relative or absolute path to the output file that will
                        be created containing the filtered list of sample
                        directories -- one per sample. The samples in this
                        file are those without an excessive number of snps.
                        See the --maxsnps parameter.

optional arguments:
  -h, --help            show this help message and exit
  -f, --force           Force processing even when result file already exists
                        and is newer than inputs (default: False)
  -n NAME, --vcfname NAME
                        File name of the VCF files which must exist in each of
                        the sample directories (default: var.flt.vcf)
  --maxsnps INT         Exclude samples having more than this maximum allowed
                        number of SNPs. Set to -1 to disable this function.
                        (default: -1)
  -o FILE, --output FILE
                        Output file. Relative or absolute path to the SNP list
                        file (default: snplist.txt)
  -v 0..5, --verbose 0..5
                        Verbose message level (0=no info, 5=lots) (default: 1)
  --version             show program's version number and exit
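
The effect of --maxsnps on the filtered sample list can be sketched in Python
as follows. This is an illustration under assumptions: each non-header VCF line
is counted as one SNP, the function names are hypothetical, and the pipeline's
own counting may differ.

```python
import os


def count_vcf_records(vcf_path):
    """Count the non-blank, non-header lines in a VCF file."""
    with open(vcf_path) as handle:
        return sum(1 for line in handle
                   if line.strip() and not line.startswith("#"))


def filter_sample_dirs(sample_dirs, max_snps=-1, vcf_name="var.flt.vcf"):
    """Keep the sample directories whose VCF has at most max_snps records."""
    if max_snps == -1:
        # -1 disables the filter, so every sample is kept.
        return list(sample_dirs)
    return [sample_dir for sample_dir in sample_dirs
            if count_vcf_records(os.path.join(sample_dir, vcf_name)) <= max_snps]
```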

call_consensus

usage: cfsan_snp_pipeline call_consensus [-h] [-f] [-l FILE] [-e FILE]
                                         [-o FILE] [-q INT] [-c FREQ] [-D INT]
                                         [-d INT] [-b FREQ]
                                         [--vcfFileName NAME]
                                         [--vcfRefName NAME] [--vcfAllPos]
                                         [--vcfPreserveRefCase]
                                         [--vcfFailedSnpGt {.,0,1}] [-v 0..5]
                                         [--version]
                                         allPileupFile

Call the consensus base for a sample at the specified positions where high-
confidence SNPs were previously called in any of the samples. Generates a
single-sequence fasta file with one base per specified position.

positional arguments:
  allPileupFile         Relative or absolute path to the genome-wide pileup
                        file for this sample.

optional arguments:
  -h, --help            show this help message and exit
  -f, --force           Force processing even when result file already exists
                        and is newer than inputs. (default: False)
  -l FILE, --snpListFile FILE
                        Relative or absolute path to the SNP list file across
                        all samples. (default: snplist.txt)
  -e FILE, --excludeFile FILE
                        VCF file of positions to exclude. (default: None)
  -o FILE, --output FILE
                        Output file. Relative or absolute path to the
                        consensus fasta file for this sample. (default:
                        consensus.fasta)
  -q INT, --minBaseQual INT
                        Minimum base quality score to count a read. All other
                        snp filters take effect after the low-quality reads
                        are discarded. (default: 0)
  -c FREQ, --minConsFreq FREQ
                        Consensus frequency. Minimum fraction of high-quality
                        reads supporting the consensus to make a call.
                        (default: 0.6)
  -D INT, --minConsDpth INT
                        Consensus depth. Minimum number of high-quality reads
                        supporting the consensus to make a call. (default: 1)
  -d INT, --minConsStrdDpth INT
                        Consensus strand depth. Minimum number of high-quality
                        reads supporting the consensus which must be present
                        on both the forward and reverse strands to make a
                        call. (default: 0)
  -b FREQ, --minConsStrdBias FREQ
                        Strand bias. Minimum fraction of the high-quality
                        consensus-supporting reads which must be present on
                        both the forward and reverse strands to make a call.
                        The numerator of this fraction is the number of high-
                        quality consensus-supporting reads on one strand. The
                        denominator is the total number of high-quality
                        consensus-supporting reads on both strands combined.
                        (default: 0)
  --vcfFileName NAME    VCF Output file name. If specified, a VCF file with
                        this file name will be created in the same directory
                        as the consensus fasta file for this sample. (default:
                        None)
  --vcfRefName NAME     Name of the reference file. This is only used in the
                        generated VCF file header. (default: Unknown
                        reference)
  --vcfAllPos           Flag to cause VCF file generation at all positions,
                        not just the snp positions. This has no effect on the
                        consensus fasta file, it only affects the VCF file.
                        This capability is intended primarily as a diagnostic
                        tool and enabling this flag will greatly increase
                        execution time. (default: False)
  --vcfPreserveRefCase  Flag to cause the VCF file generator to emit each
                        reference base in uppercase/lowercase as it appears in
                        the reference sequence file. If not specified, the
                        reference base is emitted in uppercase. (default:
                        False)
  --vcfFailedSnpGt {.,0,1}
                        Controls the VCF file GT data element when a snp fails
                        filters. Possible values: .) The GT element will be a
                        dot, indicating unable to make a call. 0) The GT
                        element will be 0, indicating the reference base. 1)
                        The GT element will be the ALT index of the most
                        commonly occurring base, usually 1. (default: .)
  -v 0..5, --verbose 0..5
                        Verbose message level (0=no info, 5=lots) (default: 1)
  --version             show program's version number and exit
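
How the four consensus filters (-c, -D, -d, -b) might combine at a single
position can be sketched in Python as follows. This is one reading of the
option descriptions above, not the pipeline's code. The function name and the
input layout are hypothetical, the counts are assumed to already exclude reads
below the minimum base quality, and returning None for a failed call is an
assumption of this sketch.

```python
def call_consensus_base(strand_counts, min_cons_freq=0.6, min_cons_dpth=1,
                        min_cons_strd_dpth=0, min_cons_strd_bias=0.0):
    """Return the consensus base, or None when any filter fails.

    strand_counts maps each base to a (forward, reverse) tuple of
    high-quality read counts at one position.
    """
    total = sum(fwd + rev for fwd, rev in strand_counts.values())
    if total == 0:
        return None
    # The consensus candidate is the base with the most supporting reads.
    base, (fwd, rev) = max(strand_counts.items(),
                           key=lambda item: sum(item[1]))
    depth = fwd + rev
    if depth < min_cons_dpth:                     # -D, consensus depth
        return None
    if depth / total < min_cons_freq:             # -c, consensus frequency
        return None
    if min(fwd, rev) < min_cons_strd_dpth:        # -d, depth on each strand
        return None
    if min(fwd, rev) / depth < min_cons_strd_bias:  # -b, strand bias
        return None
    return base
```

For example, with 8 forward and 2 reverse reads supporting A and one forward
read supporting G, the defaults yield A, while requiring a strand depth of 3
rejects the call because only 2 reads support A on the reverse strand.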

merge_vcfs

usage: cfsan_snp_pipeline merge_vcfs [-h] [-f] [-n NAME] [-o FILE] [-v 0..5]
                                     [--version]
                                     sampleDirsFile

Merge the consensus vcf files from all samples into a single multi-vcf file
for all samples.

positional arguments:
  sampleDirsFile        Relative or absolute path to file containing a list of
                        directories -- one per sample

optional arguments:
  -h, --help            show this help message and exit
  -f, --force           Force processing even when result files already exist
                        and are newer than inputs (default: False)
  -n NAME, --vcfname NAME
                        File name of the vcf files which must exist in each of
                        the sample directories (default: consensus.vcf)
  -o FILE, --output FILE
                        Output file. Relative or absolute path to the merged
                        multi-vcf file (default: snpma.vcf)
  -v 0..5, --verbose 0..5
                        Verbose message level (0=no info, 5=lots) (default: 1)
  --version             show program's version number and exit

snp_matrix

usage: cfsan_snp_pipeline snp_matrix [-h] [-f] [-c NAME] [-o FILE] [-v 0..5]
                                     [--version]
                                     sampleDirsFile

Create the SNP matrix containing the consensus base for each of the samples at
the positions where high-confidence SNPs were found in any of the samples. The
matrix contains one row per sample and one column per SNP position. Non-SNP
positions are not included in the matrix. The matrix is formatted as a fasta
file, with each sequence (all of identical length) corresponding to the SNPs
in the correspondingly named sequence.

positional arguments:
  sampleDirsFile        Relative or absolute path to file containing a list of
                        directories -- one per sample

optional arguments:
  -h, --help            show this help message and exit
  -f, --force           Force processing even when result file already exists
                        and is newer than inputs (default: False)
  -c NAME, --consFileName NAME
                        File name of the previously created consensus SNP call
                        file which must exist in each of the sample
                        directories (default: consensus.fasta)
  -o FILE, --output FILE
                        Output file. Relative or absolute path to the SNP
                        matrix file (default: snpma.fasta)
  -v 0..5, --verbose 0..5
                        Verbose message level (0=no info, 5=lots) (default: 1)
  --version             show program's version number and exit

distance

usage: cfsan_snp_pipeline distance [-h] [-f] [-p FILE] [-m FILE] [-v 0..5]
                                   [--version]
                                   snpMatrixFile

Calculate pairwise SNP distances from the multi-fasta SNP matrix. Generates a
file of pairwise distances and a file containing a matrix of distances.

positional arguments:
  snpMatrixFile         Relative or absolute path to the input multi-fasta SNP
                        matrix file.

optional arguments:
  -h, --help            show this help message and exit
  -f, --force           Force processing even when result file already exists
                        and is newer than inputs (default: False)
  -p FILE, --pairs FILE
                        Relative or absolute path to the pairwise distance
                        output file. (default: None)
  -m FILE, --matrix FILE
                        Relative or absolute path to the distance matrix
                        output file. (default: None)
  -v 0..5, --verbose 0..5
                        Verbose message level (0=no info, 5=lots) (default: 1)
  --version             show program's version number and exit
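
The idea of a pairwise SNP distance can be sketched in Python as a count of
mismatched columns between two rows of the SNP matrix. This is an illustration
only: the function names are hypothetical, and it assumes every differing
character counts as one difference, including gap or missing-call characters,
which may not match how the pipeline treats them.

```python
from itertools import combinations


def read_fasta(path):
    """Parse a multi-fasta file into a dict of sequence name -> sequence."""
    sequences = {}
    name = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                name = line[1:].strip()
                sequences[name] = ""
            elif name is not None and line:
                sequences[name] += line
    return sequences


def pairwise_snp_distances(sequences):
    """Count mismatched columns for every pair of equal-length sequences."""
    distances = {}
    for name_a, name_b in combinations(sorted(sequences), 2):
        seq_a, seq_b = sequences[name_a], sequences[name_b]
        distances[(name_a, name_b)] = sum(
            1 for a, b in zip(seq_a, seq_b) if a != b)
    return distances
```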

snp_reference

usage: cfsan_snp_pipeline snp_reference [-h] [-f] [-l FILE] [-o FILE]
                                        [-v 0..5] [--version]
                                        referenceFile

Write reference sequence bases at SNP locations to a fasta file.

positional arguments:
  referenceFile         Relative or absolute path to the reference bases file
                        in fasta format

optional arguments:
  -h, --help            show this help message and exit
  -f, --force           Force processing even when result file already exists
                        and is newer than inputs (default: False)
  -l FILE, --snpListFile FILE
                        Relative or absolute path to the SNP list file
                        (default: snplist.txt)
  -o FILE, --output FILE
                        Output file. Relative or absolute path to the SNP
                        reference sequence file (default: referenceSNP.fasta)
  -v 0..5, --verbose 0..5
                        Verbose message level (0=no info, 5=lots) (default: 1)
  --version             show program's version number and exit

collect_metrics

usage: cfsan_snp_pipeline collect_metrics [-h] [-f] [-o FILE] [-m INT]
                                          [-c NAME] [-C NAME] [-v NAME]
                                          [-V NAME] [--verbose 0..5]
                                          [--version]
                                          sampleDir referenceFile

Collect alignment, coverage, and variant metrics for a single specified
sample.

positional arguments:
  sampleDir             Relative or absolute directory of the sample
  referenceFile         Relative or absolute path to the reference fasta file

optional arguments:
  -h, --help            show this help message and exit
  -f, --force           Force processing even when result files already exist
                        and are newer than inputs (default: False)
  -o FILE, --output FILE
                        Output file. Relative or absolute path to the metrics
                        file (default: metrics)
  -m INT, --maxsnps INT
                        Maximum allowed number of SNPs per sample (default:
                        -1)
  -c NAME               File name of the consensus fasta file which must exist
                        in the sample directory (default: consensus.fasta)
  -C NAME               File name of the consensus preserved fasta file which
                        must exist in the sample directory (default:
                        consensus_preserved.fasta)
  -v NAME               File name of the consensus vcf file which must exist
                        in the sample directory (default: consensus.vcf)
  -V NAME               File name of the consensus preserved vcf file which
                        must exist in the sample directory (default:
                        consensus_preserved.vcf)
  --verbose 0..5        Verbose message level (0=no info, 5=lots) (default: 1)
  --version             show program's version number and exit

combine_metrics

usage: cfsan_snp_pipeline combine_metrics [-h] [-f] [-n NAME] [-o FILE] [-s]
                                          [-v 0..5] [--version]
                                          sampleDirsFile

Combine the metrics from all samples into a single table of metrics for all
samples. The output is a tab-separated-values file with a row for each sample
and a column for each metric. Before running this command, the metrics for
each sample must be created with the collect_metrics command.

positional arguments:
  sampleDirsFile        Relative or absolute path to file containing a list of
                        directories -- one per sample

optional arguments:
  -h, --help            show this help message and exit
  -f, --force           Force processing even when result files already exist
                        and are newer than inputs (default: False)
  -n NAME, --metrics NAME
                        File name of the metrics files which must exist in
                        each of the sample directories. (default: metrics)
  -o FILE, --output FILE
                        Output file. Relative or absolute path to the combined
                        metrics file. (default: metrics.tsv)
  -s, --spaces          Emit column headings with spaces instead of
                        underscores (default: False)
  -v 0..5, --verbose 0..5
                        Verbose message level (0=no info, 5=lots) (default: 1)
  --version             show program's version number and exit

purge

usage: cfsan_snp_pipeline purge [-h] [-v 0..5] [--version] work_dir

Purge the intermediate output files in the "samples" directory upon successful
completion of a SNP Pipeline run if no errors are encountered.

positional arguments:
  work_dir              Path to the working directory containing the "samples"
                        directory to be recursively deleted

optional arguments:
  -h, --help            show this help message and exit
  -v 0..5, --verbose 0..5
                        Verbose message level (0=no info, 5=lots) (default: 1)
  --version             show program's version number and exit