Command Reference

run_snp_pipeline.sh

usage: run_snp_pipeline.sh [-h] [-f] [-m MODE] [-c FILE] [-Q torque|grid] [-o DIR] (-s DIR|-S FILE)
                           referenceFile

Run the SNP Pipeline on a specified data set.

Positional arguments:
  referenceFile  : Relative or absolute path to the reference fasta file.

Options:
  -h             : Show this help message and exit.

  -f             : Force processing even when result files already exist and
                   are newer than inputs.

  -m MODE        : Create a mirror copy of the reference directory and all the sample
                   directories.  Use this option to avoid polluting the reference directory and
                   sample directories with the intermediate files generated by the snp pipeline.
                   A "reference" subdirectory and a "samples" subdirectory are created under
                   the output directory (see the -o option).  One directory per sample is created
                   under the "samples" directory.  Three suboptions allow a choice of how the
                   reference and samples are mirrored.
                     -m soft : creates soft links to the fasta and fastq files instead of copying
                     -m hard : creates hard links to the fasta and fastq files instead of copying
                     -m copy : copies the fasta and fastq files

  -c FILE        : Relative or absolute path to a configuration file for overriding defaults
                   and defining extra parameters for the tools and scripts within the pipeline.
                   Note: A default parameter configuration file named "snppipeline.conf" is
                         used whenever the pipeline is run without the -c option.  The
                         configuration file used for each run is copied into the log directory,
                         capturing the parameters used during the run.

  -Q torque|grid : Job queue manager for remote parallel job execution in an HPC environment.
                   Currently "torque" and "grid" are supported.  If not specified, the pipeline
                   will execute locally.

  -o DIR         : Output directory for the snp list, snp matrix, and reference snp files.
                   Additional subdirectories are automatically created under the output
                   directory for log files and the mirrored samples and reference files
                   (see the -m option).  The output directory will be created if it does
                   not already exist.  If not specified, the output files are written to
                   the current working directory.  If you re-run the pipeline on previously
                   processed samples, and specify a different output directory, the
                   pipeline will not rebuild everything unless you either force a rebuild
                   (see the -f option) or you request mirrored inputs (see the -m option).

  -s DIRECTORY   : Relative or absolute path to the parent directory of all the sample
                   directories.  The -s option should be used when all the sample directories
                   are in subdirectories immediately below a parent directory.
                   Note: You must specify either the -s or -S option, but not both.
                   Note: The specified directory should contain only a collection of sample
                         directories, nothing else.
                   Note: Unless you request mirrored inputs (see the -m option), additional files
                         will be written to each of the sample directories during the execution
                         of the SNP Pipeline.

  -S FILE        : Relative or absolute path to a file listing all of the sample directories.
                   The -S option should be used when the samples are not under a common parent
                   directory.
                   Note: If you are not mirroring the samples (see the -m option), you can
                         improve parallel processing performance by sorting the list of
                         directories descending by size, largest first.  The -m option
                         automatically generates a sorted directory list.
                   Note: You must specify either the -s or -S option, but not both.
                   Note: Unless you request mirrored inputs (see the -m option), additional files
                         will be written to each of the sample directories during the execution
                         of the SNP Pipeline.
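
The note on the -S option recommends listing the sample directories largest first. A minimal Python sketch of one way to build such a list follows. The helper names (directory_size, write_sorted_sample_list) are hypothetical and not part of the pipeline, and measuring size as the total bytes of all files in each directory is an assumption; the pipeline may measure size differently.

```python
import os


def directory_size(path):
    """Return the total size in bytes of the regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if os.path.isfile(file_path):
                total += os.path.getsize(file_path)
    return total


def write_sorted_sample_list(sample_dirs, list_path):
    """Write the sample directories to list_path, largest first, one per line.

    Returns the ordered list so the caller can inspect it.
    """
    ordered = sorted(sample_dirs, key=directory_size, reverse=True)
    with open(list_path, "w") as handle:
        for sample_dir in ordered:
            handle.write(sample_dir + "\n")
    return ordered
```

The resulting file could then be passed to the -S option.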

cfsan_snp_pipeline

usage: cfsan_snp_pipeline [-h] [--version] subcommand ...

The CFSAN SNP Pipeline is a collection of tools using reference-based
alignments to call SNPs for a set of samples.

positional arguments:
  subcommand
    data             Copy included data to a specified directory
    index_ref        Index the reference
    map_reads        Align reads to the reference
    call_sites       Find the sites with high-confidence SNPs in a sample
    filter_regions   Remove abnormally dense SNPs from all samples
    merge_sites      Prepare the list of sites having SNPs
    call_consensus   Call the consensus base at high-confidence sites
    merge_vcfs       Merge the per-sample VCF files
    snp_matrix       Create a matrix of SNPs
    distance         Calculate the SNP distances between samples
    snp_reference    Write reference bases at SNP locations to a fasta file
    collect_metrics  Collect quality and SNP metrics for a sample
    combine_metrics  Merge the per-sample metrics

optional arguments:
  -h, --help         show this help message and exit
  --version          show program's version number and exit

data

usage: cfsan_snp_pipeline data [-h] [--version] whichData [destDirectory]

Copy data included with the CFSAN SNP Pipeline to a specified directory.

positional arguments:
  whichData          Which of the supplied data sets to copy.  The choices are:
                         lambdaVirusInputs          : Input reference and fastq files
                         lambdaVirusExpectedResults : Expected results files
                         agonaInputs                : Input reference file
                         agonaExpectedResults       : Expected results files
                         listeriaInputs             : Input reference file
                         listeriaExpectedResults    : Expected results files
                         configurationFile          : File of parameters to customize the
                                                      SNP pipeline

                     Note: the lambda virus data set is complete with input data and expected
                     results.  The agona and listeria data sets have the reference genome and
                     the expected results, but not the input fastq files, because the files are
                     too large to include with the package.

  destDirectory      Destination directory into which the SNP pipeline data files will be copied.
                     The data files are copied into the destination directory if the directory
                     already exists.  Otherwise the destination directory is created and the
                     data files are copied there.  (default: current directory)

optional arguments:
  -h, --help     show this help message and exit
  --version      show program's version number and exit

Example:
# create a new directory "testLambdaVirus" and copy the Lambda virus input data there
$ cfsan_snp_pipeline data lambdaVirusInputs testLambdaVirus

index_ref

usage: cfsan_snp_pipeline index_ref [-h] [-f] [-v 0..5] [--version]
                                    referenceFile

Index the reference genome for subsequent read mapping, and create the faidx
index file for subsequent pileups. The output is written to the reference
directory.

positional arguments:
  referenceFile         Relative or absolute path to the reference fasta file

optional arguments:
  -h, --help            show this help message and exit
  -f, --force           Force processing even when result files already exist
                        and are newer than inputs (default: False)
  -v 0..5, --verbose 0..5
                        Verbose message level (0=no info, 5=lots) (default: 1)
  --version             show program's version number and exit
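
The -f option implies that, by default, a step is skipped when its result files already exist and are newer than its inputs. The Python sketch below shows one way such a freshness check could work, based on file modification times. The function name needs_rebuild is hypothetical, and the exact rules the pipeline applies (for example, how ties in timestamps are handled) are assumptions.

```python
import os


def needs_rebuild(input_paths, output_paths, force=False):
    """Decide whether a step should run.

    Returns True when force is set, when any output is missing, or when any
    output is not strictly newer than every input (an assumed tie rule).
    """
    if force:
        return True
    if not all(os.path.exists(path) for path in output_paths):
        return True
    newest_input = max(os.path.getmtime(path) for path in input_paths)
    oldest_output = min(os.path.getmtime(path) for path in output_paths)
    return oldest_output <= newest_input
```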

map_reads

usage: cfsan_snp_pipeline map_reads [-h] [-f] [-v 0..5] [--version]
                                    referenceFile sampleFastqFile1
                                    [sampleFastqFile2]

Align the sequence reads for a specified sample to a specified reference
genome. The output is written to the file "reads.sam" in the sample directory.

positional arguments:
  referenceFile         Relative or absolute path to the reference fasta file
  sampleFastqFile1      Relative or absolute path to the fastq file
  sampleFastqFile2      Optional relative or absolute path to the mate fastq
                        file, if paired (default: None)

optional arguments:
  -h, --help            show this help message and exit
  -f, --force           Force processing even when result files already exist
                        and are newer than inputs (default: False)
  -v 0..5, --verbose 0..5
                        Verbose message level (0=no info, 5=lots) (default: 1)
  --version             show program's version number and exit

call_sites

usage: cfsan_snp_pipeline call_sites [-h] [-f] [-v 0..5] [--version]
                                     referenceFile sampleDir

Find the sites with high-confidence SNPs in a sample.

positional arguments:
  referenceFile         Relative or absolute path to the reference fasta file
  sampleDir             Relative or absolute path to the sample directory

optional arguments:
  -h, --help            show this help message and exit
  -f, --force           Force processing even when result files already exist
                        and are newer than inputs (default: False)
  -v 0..5, --verbose 0..5
                        Verbose message level (0=no info, 5=lots) (default: 1)
  --version             show program's version number and exit

filter_regions

usage: cfsan_snp_pipeline filter_regions [-h] [-f] [-n NAME] [-l EDGE_LENGTH]
                                         [-w WINDOW_SIZE] [-m MAX_NUM_SNPs]
                                         [-g OUT_GROUP] [-v 0..5] [--version]
                                         sampleDirsFile refFastaFile

Remove abnormally dense SNPs from the input VCF file, save the preserved SNPs
into a new VCF file, and save the removed SNPs into another VCF file.

positional arguments:
  sampleDirsFile        Relative or absolute path to file containing a list of
                        directories -- one per sample
  refFastaFile          Relative or absolute path to the reference fasta file

optional arguments:
  -h, --help            show this help message and exit
  -f, --force           Force processing even when result files already exist
                        and are newer than inputs (default: False)
  -n NAME, --vcfname NAME
                        File name of the input VCF files which must exist in
                        each of the sample directories (default: var.flt.vcf)
  -l EDGE_LENGTH, --edge_length EDGE_LENGTH
                        The length of the edge regions in a contig, in which
                        all SNPs will be removed. (default: 500)
  -w WINDOW_SIZE, --window_size WINDOW_SIZE
                        The length of the window in which the number of SNPs
                        should be no more than max_num_snp. (default: 1000)
  -m MAX_NUM_SNPs, --max_snp MAX_NUM_SNPs
                        The maximum number of SNPs allowed in a window.
                        (default: 3)
  -g OUT_GROUP, --out_group OUT_GROUP
                        Relative or absolute path to the file indicating
                        outgroup samples, one sample ID per line. (default:
                        None)
  -v 0..5, --verbose 0..5
                        Verbose message level (0=no info, 5=lots) (default: 1)
  --version             show program's version number and exit
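
The edge_length, window_size, and max_snp options together define which SNPs count as abnormally dense. The Python sketch below illustrates one plausible reading of those options for a single contig: SNPs inside either edge region are removed, and so is every SNP that falls in a window holding more than the allowed number of SNPs. The function name filter_dense_snps is hypothetical, positions are assumed to be 1-based, and the pipeline's actual algorithm (including outgroup handling and whether edge SNPs count toward window density) may differ.

```python
def filter_dense_snps(positions, contig_length, edge_length=500,
                      window_size=1000, max_snps=3):
    """Split SNP positions on one contig into (kept, removed) lists.

    positions are 1-based coordinates. Defaults mirror the documented
    defaults of the -l, -w, and -m options.
    """
    positions = sorted(positions)
    removed = set()

    # Remove SNPs in the edge regions at the start and end of the contig.
    for pos in positions:
        if pos <= edge_length or pos > contig_length - edge_length:
            removed.add(pos)

    # Slide a window over the sorted positions; when a window of
    # window_size bases holds more than max_snps SNPs, remove them all.
    start = 0
    for end in range(len(positions)):
        while positions[end] - positions[start] >= window_size:
            start += 1
        if end - start + 1 > max_snps:
            removed.update(positions[start:end + 1])

    kept = [pos for pos in positions if pos not in removed]
    return kept, sorted(removed)
```

With the defaults, four SNPs within 300 bases of each other would all be removed, while an isolated SNP in the middle of the contig would be kept.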

merge_sites

usage: cfsan_snp_pipeline merge_sites [-h] [-f] [-n NAME] [--maxsnps INT]
                                      [-o FILE] [-v 0..5] [--version]
                                      sampleDirsFile filteredSampleDirsFile

Combine the SNP positions across all samples into a single unified SNP list
file identifying the positions and sample names where SNPs were called.

positional arguments:
  sampleDirsFile        Relative or absolute path to file containing a list of
                        directories -- one per sample
  filteredSampleDirsFile
                        Relative or absolute path to the output file that will
                        be created containing the filtered list of sample
                        directories -- one per sample. The samples in this
                        file are those without an excessive number of snps.
                        See the --maxsnps parameter.

optional arguments:
  -h, --help            show this help message and exit
  -f, --force           Force processing even when result file already exists
                        and is newer than inputs (default: False)
  -n NAME, --vcfname NAME
                        File name of the VCF files which must exist in each of
                        the sample directories (default: var.flt.vcf)
  --maxsnps INT         Exclude samples having more than this maximum allowed
                        number of SNPs. Set to -1 to disable this function.
                        (default: -1)
  -o FILE, --output FILE
                        Output file. Relative or absolute path to the SNP list
                        file (default: snplist.txt)
  -v 0..5, --verbose 0..5
                        Verbose message level (0=no info, 5=lots) (default: 1)
  --version             show program's version number and exit
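
The --maxsnps option excludes samples with too many SNPs from the filtered list of sample directories. A minimal Python sketch of that filtering rule follows. The function name and the dictionary of per-sample SNP counts are hypothetical; how the pipeline actually counts SNPs per sample is not shown here.

```python
def filter_samples_by_snp_count(snp_counts, max_snps=-1):
    """Return the samples whose SNP count does not exceed max_snps.

    snp_counts maps a sample directory to its SNP count. A max_snps of -1
    disables the filter, matching the documented default.
    """
    if max_snps == -1:
        return list(snp_counts)
    return [sample for sample, count in snp_counts.items()
            if count <= max_snps]
```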

call_consensus

usage: cfsan_snp_pipeline call_consensus [-h] [-f] [-l FILE] [-e FILE]
                                         [-o FILE] [-q INT] [-c FREQ] [-d INT]
                                         [-b FREQ] [--vcfFileName NAME]
                                         [--vcfRefName NAME] [--vcfAllPos]
                                         [--vcfPreserveRefCase]
                                         [--vcfFailedSnpGt {.,0,1}] [-v 0..5]
                                         [--version]
                                         allPileupFile

Call the consensus base for a sample at the specified positions where high-
confidence SNPs were previously called in any of the samples. Generates a
single-sequence fasta file with one base per specified position.

positional arguments:
  allPileupFile         Relative or absolute path to the genome-wide pileup
                        file for this sample.

optional arguments:
  -h, --help            show this help message and exit
  -f, --force           Force processing even when result file already exists
                        and is newer than inputs. (default: False)
  -l FILE, --snpListFile FILE
                        Relative or absolute path to the SNP list file across
                        all samples. (default: snplist.txt)
  -e FILE, --excludeFile FILE
                        VCF file of positions to exclude. (default: None)
  -o FILE, --output FILE
                        Output file. Relative or absolute path to the
                        consensus fasta file for this sample. (default:
                        consensus.fasta)
  -q INT, --minBaseQual INT
                        Minimum base quality score to count a read. All other
                        snp filters take effect after the low-quality reads
                        are discarded. (default: 0)
  -c FREQ, --minConsFreq FREQ
                        Consensus frequency. Minimum fraction of high-quality
                        reads supporting the consensus to make a call.
                        (default: 0.6)
  -d INT, --minConsStrdDpth INT
                        Consensus strand depth. Minimum number of high-quality
                        reads supporting the consensus which must be present
                        on both the forward and reverse strands to make a
                        call. (default: 0)
  -b FREQ, --minConsStrdBias FREQ
                        Strand bias. Minimum fraction of the high-quality
                        consensus-supporting reads which must be present on
                        both the forward and reverse strands to make a call.
                        The numerator of this fraction is the number of high-
                        quality consensus-supporting reads on one strand. The
                        denominator is the total number of high-quality
                        consensus-supporting reads on both strands combined.
                        (default: 0)
  --vcfFileName NAME    VCF Output file name. If specified, a VCF file with
                        this file name will be created in the same directory
                        as the consensus fasta file for this sample. (default:
                        None)
  --vcfRefName NAME     Name of the reference file. This is only used in the
                        generated VCF file header. (default: Unknown
                        reference)
  --vcfAllPos           Flag to cause VCF file generation at all positions,
                        not just the snp positions. This has no effect on the
                        consensus fasta file, it only affects the VCF file.
                        This capability is intended primarily as a diagnostic
                        tool and enabling this flag will greatly increase
                        execution time. (default: False)
  --vcfPreserveRefCase  Flag to cause the VCF file generator to emit each
                        reference base in uppercase/lowercase as it appears in
                        the reference sequence file. If not specified, the
                        reference base is emitted in uppercase. (default:
                        False)
  --vcfFailedSnpGt {.,0,1}
                        Controls the VCF file GT data element when a snp fails
                        filters. Possible values: .) The GT element will be a
                        dot, indicating unable to make a call. 0) The GT
                        element will be 0, indicating the reference base. 1)
                        The GT element will be the ALT index of the most
                        commonly occurring base, usually 1. (default: .)
  -v 0..5, --verbose 0..5
                        Verbose message level (0=no info, 5=lots) (default: 1)
  --version             show program's version number and exit
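
The consensus frequency, strand depth, and strand bias filters are easier to follow with numbers. The Python sketch below applies the three thresholds to per-strand read counts at one position. It assumes low-quality reads have already been discarded, and it reads "one strand" in the strand bias description as the strand with fewer supporting reads. The function name and input layout are hypothetical, not the pipeline's internals.

```python
def call_consensus_base(strand_counts, min_cons_freq=0.6,
                        min_strand_depth=0, min_strand_bias=0.0):
    """Pick the consensus base and list the filters it fails.

    strand_counts maps each base to a (forward, reverse) tuple of
    high-quality read counts. Returns (base, failed_filters); an empty
    list means the call passes all three filters.
    """
    total = sum(fwd + rev for fwd, rev in strand_counts.values())
    if total == 0:
        return None, ["no reads"]

    # The consensus base is the base with the most supporting reads.
    base, (fwd, rev) = max(strand_counts.items(),
                           key=lambda item: item[1][0] + item[1][1])
    support = fwd + rev

    failed = []
    if support / total < min_cons_freq:
        failed.append("frequency")
    if min(fwd, rev) < min_strand_depth:
        failed.append("strand depth")
    # Assumed reading: the weaker strand's share of the supporting reads.
    if min(fwd, rev) / support < min_strand_bias:
        failed.append("strand bias")
    return base, failed
```

For example, with 8 forward and 2 reverse reads supporting A, plus 2 reads supporting G, the consensus frequency is 10/12 and the strand bias fraction is 2/10, so a strand bias threshold of 0.25 would reject the call.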

merge_vcfs

usage: cfsan_snp_pipeline merge_vcfs [-h] [-f] [-n NAME] [-o FILE] [-v 0..5]
                                     [--version]
                                     sampleDirsFile

Merge the consensus vcf files from all samples into a single multi-vcf file
for all samples.

positional arguments:
  sampleDirsFile        Relative or absolute path to file containing a list of
                        directories -- one per sample

optional arguments:
  -h, --help            show this help message and exit
  -f, --force           Force processing even when result files already exist
                        and are newer than inputs (default: False)
  -n NAME, --vcfname NAME
                        File name of the vcf files which must exist in each of
                        the sample directories (default: consensus.vcf)
  -o FILE, --output FILE
                        Output file. Relative or absolute path to the merged
                        multi-vcf file (default: snpma.vcf)
  -v 0..5, --verbose 0..5
                        Verbose message level (0=no info, 5=lots) (default: 1)
  --version             show program's version number and exit

snp_matrix

usage: cfsan_snp_pipeline snp_matrix [-h] [-f] [-c NAME] [-o FILE] [-v 0..5]
                                     [--version]
                                     sampleDirsFile

Create the SNP matrix containing the consensus base for each of the samples at
the positions where high-confidence SNPs were found in any of the samples. The
matrix contains one row per sample and one column per SNP position. Non-SNP
positions are not included in the matrix. The matrix is formatted as a fasta
file, with each sequence (all of identical length) holding the SNP bases of the
correspondingly named sample.

positional arguments:
  sampleDirsFile        Relative or absolute path to file containing a list of
                        directories -- one per sample

optional arguments:
  -h, --help            show this help message and exit
  -f, --force           Force processing even when result file already exists
                        and is newer than inputs (default: False)
  -c NAME, --consFileName NAME
                        File name of the previously created consensus SNP call
                        file which must exist in each of the sample
                        directories (default: consensus.fasta)
  -o FILE, --output FILE
                        Output file. Relative or absolute path to the SNP
                        matrix file (default: snpma.fasta)
  -v 0..5, --verbose 0..5
                        Verbose message level (0=no info, 5=lots) (default: 1)
  --version             show program's version number and exit

distance

usage: cfsan_snp_pipeline distance [-h] [-f] [-p FILE] [-m FILE] [-v 0..5]
                                   [--version]
                                   snpMatrixFile

Calculate pairwise SNP distances from the multi-fasta SNP matrix. Generates a
file of pairwise distances and a file containing a matrix of distances.

positional arguments:
  snpMatrixFile         Relative or absolute path to the input multi-fasta SNP
                        matrix file.

optional arguments:
  -h, --help            show this help message and exit
  -f, --force           Force processing even when result file already exists
                        and is newer than inputs (default: False)
  -p FILE, --pairs FILE
                        Relative or absolute path to the pairwise distance
                        output file. (default: None)
  -m FILE, --matrix FILE
                        Relative or absolute path to the distance matrix
                        output file. (default: None)
  -v 0..5, --verbose 0..5
                        Verbose message level (0=no info, 5=lots) (default: 1)
  --version             show program's version number and exit
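
To make the idea of a pairwise SNP distance concrete, the Python sketch below reads a multi-fasta SNP matrix and counts the positions at which each pair of sequences differs. It is an illustration only: the function names are hypothetical, and it counts every differing character as one difference, whereas the distance command may treat gaps or missing calls differently.

```python
from itertools import combinations


def read_fasta(path):
    """Read a fasta file into a dict mapping sequence name to sequence."""
    parts = {}
    name = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                name = line[1:].strip()
                parts[name] = []
            elif name is not None:
                parts[name].append(line)
    return {seq_name: "".join(chunks) for seq_name, chunks in parts.items()}


def pairwise_snp_distances(sequences):
    """Return a dict mapping (name_a, name_b) to the count of differing positions."""
    distances = {}
    for name_a, name_b in combinations(sorted(sequences), 2):
        seq_a, seq_b = sequences[name_a], sequences[name_b]
        distances[(name_a, name_b)] = sum(
            1 for base_a, base_b in zip(seq_a, seq_b) if base_a != base_b)
    return distances
```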

snp_reference

usage: cfsan_snp_pipeline snp_reference [-h] [-f] [-l FILE] [-o FILE]
                                        [-v 0..5] [--version]
                                        referenceFile

Write reference sequence bases at SNP locations to a fasta file.

positional arguments:
  referenceFile         Relative or absolute path to the reference bases file
                        in fasta format

optional arguments:
  -h, --help            show this help message and exit
  -f, --force           Force processing even when result file already exists
                        and is newer than inputs (default: False)
  -l FILE, --snpListFile FILE
                        Relative or absolute path to the SNP list file
                        (default: snplist.txt)
  -o FILE, --output FILE
                        Output file. Relative or absolute path to the SNP
                        reference sequence file (default: referenceSNP.fasta)
  -v 0..5, --verbose 0..5
                        Verbose message level (0=no info, 5=lots) (default: 1)
  --version             show program's version number and exit
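
Conceptually, the snp_reference command keeps only the reference bases found at SNP positions. The Python sketch below shows that extraction on in-memory data. The function name is hypothetical, the input is assumed to be (sequence name, 1-based position) pairs, and the actual SNP list file format is not modeled.

```python
def reference_bases_at_snps(reference, snp_sites):
    """Return the reference bases at the given SNP sites.

    reference maps a sequence name to its full sequence. snp_sites is an
    iterable of (sequence name, 1-based position) pairs. The result maps
    each sequence name to the bases at its SNP positions, in position order.
    """
    bases = {}
    for chrom, pos in sorted(set(snp_sites)):
        bases.setdefault(chrom, []).append(reference[chrom][pos - 1])
    return {chrom: "".join(chars) for chrom, chars in bases.items()}
```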

collect_metrics

usage: cfsan_snp_pipeline collect_metrics [-h] [-f] [-o FILE] [-m INT]
                                          [-c NAME] [-C NAME] [-v NAME]
                                          [-V NAME] [--verbose 0..5]
                                          [--version]
                                          sampleDir referenceFile

Collect alignment, coverage, and variant metrics for a single specified
sample.

positional arguments:
  sampleDir             Relative or absolute path to the sample directory
  referenceFile         Relative or absolute path to the reference fasta file

optional arguments:
  -h, --help            show this help message and exit
  -f, --force           Force processing even when result files already exist
                        and are newer than inputs (default: False)
  -o FILE, --output FILE
                        Output file. Relative or absolute path to the metrics
                        file (default: metrics)
  -m INT, --maxsnps INT
                        Maximum allowed number of SNPs per sample (default:
                        -1)
  -c NAME               File name of the consensus fasta file which must exist
                        in the sample directory (default: consensus.fasta)
  -C NAME               File name of the consensus preserved fasta file which
                        must exist in the sample directory (default:
                        consensus_preserved.fasta)
  -v NAME               File name of the consensus vcf file which must exist
                        in the sample directory (default: consensus.vcf)
  -V NAME               File name of the consensus preserved vcf file which
                        must exist in the sample directory (default:
                        consensus_preserved.vcf)
  --verbose 0..5        Verbose message level (0=no info, 5=lots) (default: 1)
  --version             show program's version number and exit

combine_metrics

usage: cfsan_snp_pipeline combine_metrics [-h] [-f] [-n NAME] [-o FILE] [-s]
                                          [-v 0..5] [--version]
                                          sampleDirsFile

Combine the metrics from all samples into a single table of metrics for all
samples. The output is a tab-separated-values file with a row for each sample
and a column for each metric. Before running this command, the metrics for
each sample must be created with the collect_metrics command.

positional arguments:
  sampleDirsFile        Relative or absolute path to file containing a list of
                        directories -- one per sample

optional arguments:
  -h, --help            show this help message and exit
  -f, --force           Force processing even when result files already exist
                        and are newer than inputs (default: False)
  -n NAME, --metrics NAME
                        File name of the metrics files which must exist in
                        each of the sample directories. (default: metrics)
  -o FILE, --output FILE
                        Output file. Relative or absolute path to the combined
                        metrics file. (default: metrics.tsv)
  -s, --spaces          Emit column headings with spaces instead of
                        underscores (default: False)
  -v 0..5, --verbose 0..5
                        Verbose message level (0=no info, 5=lots) (default: 1)
  --version             show program's version number and exit