This page describes some of the data files you received from your service provider.

Primary Analysis Data

This is data generated directly by a PacBio RS II run. The Primary directory includes one subdirectory for each run. Each run directory includes a subdirectory for each SMRT Cell used in the run. Each SMRT Cell directory includes an Analysis_Results subdirectory, which contains the output files of interest. Example:

/path/to/secondary/storage/2420294/0011
├── Analysis_Results
│   ├── m140415_143853_42175_c100635972550000001823121909121417_s1_p0.1.bax.h5
│   ├── m140415_143853_42175_c100635972550000001823121909121417_s1_p0.1.log
│   ├── m140415_143853_42175_c100635972550000001823121909121417_s1_p0.1.subreads.fasta
│   ├── m140415_143853_42175_c100635972550000001823121909121417_s1_p0.1.subreads.fastq
│   ├── m140415_143853_42175_c100635972550000001823121909121417_s1_p0.2.bax.h5
│   ├── m140415_143853_42175_c100635972550000001823121909121417_s1_p0.2.log
│   ├── m140415_143853_42175_c100635972550000001823121909121417_s1_p0.2.subreads.fasta
│   ├── m140415_143853_42175_c100635972550000001823121909121417_s1_p0.2.subreads.fastq
│   ├── m140415_143853_42175_c100635972550000001823121909121417_s1_p0.3.bax.h5
│   ├── m140415_143853_42175_c100635972550000001823121909121417_s1_p0.3.log
│   ├── m140415_143853_42175_c100635972550000001823121909121417_s1_p0.3.subreads.fasta
│   ├── m140415_143853_42175_c100635972550000001823121909121417_s1_p0.3.subreads.fastq
│   ├── m140415_143853_42175_c100635972550000001823121909121417_s1_p0.bas.h5
│   ├── m140415_143853_42175_c100635972550000001823121909121417_s1_p0.sts.csv
│   └── m140415_143853_42175_c100635972550000001823121909121417_s1_p0.sts.xml
├── m140415_143853_42175_c100635972550000001823121909121417_s1_p0.1.xfer.xml
├── m140415_143853_42175_c100635972550000001823121909121417_s1_p0.2.xfer.xml
├── m140415_143853_42175_c100635972550000001823121909121417_s1_p0.3.xfer.xml
├── m140415_143853_42175_c100635972550000001823121909121417_s1_p0.mcd.h5
└── m140415_143853_42175_c100635972550000001823121909121417_s1_p0.metadata.xml
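
The listing above follows a regular naming pattern: a movie name, an optional part number (1, 2, or 3), and a file-type suffix. As a rough sketch of how a script might group the bax.h5 files by movie, assuming only the suffixes visible in the example listing (the function names `split_name` and `bax_files_by_movie` are hypothetical helpers, not part of any PacBio tool):

```python
from collections import defaultdict
from pathlib import Path

# Suffixes taken from the example listing above; other releases may differ.
KNOWN_SUFFIXES = (
    ".bax.h5", ".bas.h5", ".subreads.fasta", ".subreads.fastq",
    ".sts.csv", ".sts.xml", ".log",
)


def split_name(filename):
    """Split a file name into (movie, part, suffix), or return None.

    part is the part number as a string ("1", "2", "3"), or None for
    per-movie files such as bas.h5 and sts.xml.
    """
    for suffix in KNOWN_SUFFIXES:
        if filename.endswith(suffix):
            stem = filename[: -len(suffix)]
            movie, dot, tail = stem.rpartition(".")
            if dot and tail.isdigit():
                return movie, tail, suffix
            return stem, None, suffix
    return None


def bax_files_by_movie(analysis_dir):
    """Map each movie name to the sorted list of its bax.h5 paths."""
    groups = defaultdict(list)
    for path in sorted(Path(analysis_dir).iterdir()):
        parsed = split_name(path.name)
        if parsed is not None and parsed[2] == ".bax.h5":
            groups[parsed[0]].append(path)
    return dict(groups)
```

Calling `bax_files_by_movie("/path/to/secondary/storage/2420294/0011/Analysis_Results")` on the example directory would return a single movie name mapped to its three bax.h5 files.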

For information on the main files of interest, see:

  • [bas.h5 Reference Guide] (https://s3.amazonaws.com/files.pacb.com/software/instrument/2.0.0/bas.h5+Reference+Guide.pdf) (PDF): Describes the main output files produced by the primary analysis pipeline: bas.h5, .1.bax.h5, .2.bax.h5, and .3.bax.h5. The bax.h5 files contain base call information from the sequencing run. The bas.h5 file is essentially a pointer to the three bax.h5 files.

  • [Metadata Output Guide] (https://s3.amazonaws.com/files.pacb.com/software/instrument/2.0.0/Metadata+Output+Guide.pdf) (PDF): Describes the file metadata.xml, which contains top-level information about the data, including what sequencing enzyme and chemistry were used, sample name, and other metadata.

  • [Statistics Output Guide] (https://s3.amazonaws.com/files.pacb.com/software/instrument/1.3.1/Statistics+Output+Guide.pdf) (PDF): Describes the file sts.xml, which includes summary statistics from a single movie acquisition.

Secondary Analysis Data

This is data produced by secondary analysis, which is performed on the primary analysis data generated by the instrument.

  • All files for a specific job reside in a single directory named after the job ID number.
  • Every SMRT Portal job has the following structure. Example:

    /path/to/smrtanalysis/userdata/jobs/016/016234
    ├── data/
    ├── results/
    ├── log/
    ├── workflow/
    ├── job.sh
    ├── input.xml
    └── settings.xml

    • data is a directory that contains intermediate and final data files for the analysis job.
    • results is a directory that contains summary statistics and plots for the analysis job.
    • log is a directory that contains all log files for the analysis job.
    • workflow is a directory that contains all the executables for the analysis job.
    • job.sh is an executable file used by SMRT Portal to run the smrtpipe.py analysis job.
    • input.xml is an .xml file containing a list of input bax.h5 files used to run the analysis job.
    • settings.xml is an .xml file containing the parameters needed to perform the analysis job.

  • For more detail on specific protocol outputs, see [[Navigating the SMRT Pipe Job Directory]].
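
Before working with a job directory you received, it can help to confirm that it has the layout described above. The following is a minimal sketch, assuming the seven entries listed in this section are all present at the top level of the job directory; the function name `missing_job_entries` is a hypothetical helper, not part of SMRT Analysis.

```python
from pathlib import Path

# Entry names taken from the job layout described in this section.
EXPECTED_DIRS = ("data", "results", "log", "workflow")
EXPECTED_FILES = ("job.sh", "input.xml", "settings.xml")


def missing_job_entries(job_dir):
    """Return the expected entries that are absent from job_dir.

    Directories are reported with a trailing slash so they are easy
    to tell apart from plain files.
    """
    job_dir = Path(job_dir)
    missing = [name + "/" for name in EXPECTED_DIRS
               if not (job_dir / name).is_dir()]
    missing += [name for name in EXPECTED_FILES
                if not (job_dir / name).is_file()]
    return missing
```

An empty list means that all expected entries were found, for example when calling `missing_job_entries("/path/to/smrtanalysis/userdata/jobs/016/016234")` on a complete job.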

Within the data directory are several types of output files. You can use these data files as input for further downstream processing, pass them on to collaborators, or upload them to public genome sites. Depending on the protocol being performed, the data directory contains files in the following formats:

  • cmp.h5: The primary sequence alignment file for SMRT sequencing data. (Click [here] (https://s3.amazonaws.com/files.pacb.com/software/smrtanalysis/1.4/doc/cmp.h5+Reference+Guide.pdf) (PDF) for further details.)
  • H5: Hierarchical Data Format; a file-system-like data format. (Click [here] (http://www.hdfgroup.org/HDF5/doc/H5.intro.html) for further details.)
  • SAM: Sequence Alignment Map is a generic nucleotide alignment format that describes the alignment of query sequences or sequencing reads to a reference sequence or assembly. (Click [here] (http://samtools.sourceforge.net/) for further details.)
  • BAM: Binary version of the Sequence Alignment Map (SAM) format. (Click [here] (http://genome.ucsc.edu/goldenPath/help/bam.html) for further details.)
  • BAI: The index file for a file generated in the BAM format. (This is a non-standard file type.)
  • FASTA: FASTA-formatted sequence files contain either nucleic acid sequence (such as DNA) or protein sequence information. A FASTA file can store multiple sequences. (Click [here] (http://en.wikipedia.org/wiki/FASTA_format) for further details.)
  • GFF: General Feature Format, used for describing genes and other features associated with DNA, RNA and Protein sequences. (Click [here] (http://genome.ucsc.edu/FAQ/FAQformat#format3) for further details.)
  • VCF: Variant Call Format, a text format for storing sequence variants (such as SNPs and indels) relative to a reference sequence. (Click [here] (http://en.wikipedia.org/wiki/Variant_Call_Format) for further details.)
  • BED: Format that defines the data lines displayed in an annotation track. (Click [here] (http://genome.ucsc.edu/FAQ/FAQformat#format1) for further details.)
  • CSV: Comma-Separated Values file. Can be viewed using Microsoft Excel or a text editor.
  • GML: An XML representation of the scaffold graph that results from scaffolding contigs using the AHA hybrid assembly algorithm.
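
Several of the text formats above can be read with a few lines of code. As an illustration, here is a minimal FASTA reader written in plain Python. It is a sketch that assumes a well-formed file in which every record starts with a ">" header line; the function name `read_fasta` is hypothetical and is not part of SMRT Analysis. For production work, an established library is usually a better choice.

```python
def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file.

    The header is returned without the leading ">", and sequence
    lines that are wrapped across several lines are joined together.
    """
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # A new record begins; emit the previous one, if any.
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line and header is not None:
                chunks.append(line)
    # Emit the final record once the end of the file is reached.
    if header is not None:
        yield header, "".join(chunks)
```

For example, `sum(len(seq) for _, seq in read_fasta(path))` gives the total number of bases in a subreads.fasta file from the Analysis_Results directory.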

SMRT Portal Reports

Your service provider included secondary analysis reports generated using SMRT Portal. For an explanation of the report fields, click [here] (https://s3.amazonaws.com/files.pacb.com/software/smrtanalysis/2.2.0/Reports+-+Terminology.pdf) (PDF).

Downloading SMRT Analysis Software

  • The latest version of the SMRT Analysis software is available [here] (http://www.pacb.com/devnet/).

  • Pacific Biosciences provides a free Amazon Machine Image that you can use to run SMRT Portal in the cloud. See [here] (https://github.com/PacificBiosciences/Bioinformatics-Training/wiki/%22Installing%22-SMRT-Portal-the-easy-way---Launching-A-SMRT-Portal-AMI) for details.