Software

Statistical Structural Variation ANalyzer

Finding SVs by Combining Multiple Signal Sources

INTRODUCTION

We develop, implement, and validate a novel statistical algorithm, SWAN, for identifying genomic structural variants (SVs) from next-generation sequencing (NGS) data of mixed tumor samples (see Figure 1). Paired-end NGS data have been shown to be the most informative for identifying genomic SVs. Computational SV calling methods based on a single sequence feature, such as insert size, coverage depth, or hanging read pairs, have been developed under the assumption of homogeneous sampling [17-19]. Yet these methods have had limited success, with high false discovery rates, and are not suited to the analysis of heterogeneous samples, in particular tumor samples.

SWAN is the first to introduce a statistically verifiable heterogeneity SV model to the community. In SWAN, the sampled genetic material is no longer viewed as a homogeneous mutant or reference sample but is explicitly modeled as a mixture of mutant and reference sequences whose fractions can be estimated. This is an important step forward compared with the non-probabilistic homogeneous modeling in other tools, and it is crucial for reflecting what really happens in tumor samples, where normal tissue is intermixed with tumor cells. This modeling is the basis for developing powerful statistical tests downstream.
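
The mixture idea can be illustrated with a minimal Python sketch. It is not SWAN's actual model or code: it assumes, purely for illustration, that insert sizes in a window follow a two-component Gaussian mixture (reference pairs centered at `mean`, mutant pairs shifted by a hypothetical deletion size `shift`) and estimates the mutant fraction `p` by a grid search over the likelihood. All function names and parameter values here are hypothetical.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def mixture_loglik(insert_sizes, p, mean, sd, shift):
    """Log-likelihood of insert sizes under a reference/mutant mixture.

    A fraction (1 - p) of pairs is assumed to come from the reference
    (centered at `mean`), a fraction p from the mutant (centered at
    `mean + shift`). This Gaussian form is an illustrative assumption.
    """
    return sum(
        math.log(
            (1.0 - p) * normal_pdf(x, mean, sd)
            + p * normal_pdf(x, mean + shift, sd)
        )
        for x in insert_sizes
    )


def estimate_mutant_fraction(insert_sizes, mean, sd, shift, grid_steps=100):
    """Grid-search estimate of p and the log-likelihood ratio against p = 0."""
    null_ll = mixture_loglik(insert_sizes, 0.0, mean, sd, shift)
    best_p, best_ll = 0.0, null_ll
    for i in range(1, grid_steps + 1):
        p = i / grid_steps
        ll = mixture_loglik(insert_sizes, p, mean, sd, shift)
        if ll > best_ll:
            best_p, best_ll = p, ll
    return best_p, best_ll - null_ll


def simulate_window(n_pairs, p, mean, sd, shift, seed=0):
    """Simulate insert sizes with exactly round(n_pairs * p) mutant pairs."""
    rng = random.Random(seed)
    n_mut = round(n_pairs * p)
    sizes = [rng.gauss(mean + shift, sd) for _ in range(n_mut)]
    sizes += [rng.gauss(mean, sd) for _ in range(n_pairs - n_mut)]
    return sizes


# Hypothetical window: 1000 read pairs, 30% of them from the mutant sequence.
sizes = simulate_window(1000, 0.3, mean=300.0, sd=30.0, shift=500.0)
p_hat, llr = estimate_mutant_fraction(sizes, mean=300.0, sd=30.0, shift=500.0)
print("estimated mutant fraction:", p_hat)
print("log-likelihood ratio vs. pure reference:", llr)
```

A homogeneous model corresponds to fixing `p` at 0 or 1. Leaving `p` free lets the same test respond to events that are present in only part of the sampled cells.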

SWAN is also designed to statistically combine signals from multiple sequence features. Different features give signals of different quality for a given SV event. For instance, the coverage signal is usually strong for large deletions but almost non-existent for insertions. The insert size signal is most significant when the SV is far larger than the mean insert size. Conventional approaches that rely solely on one signal or another will inevitably miss the events for which that signal is absent. SWAN, instead, is based on the scan theory of marked Poisson process mixtures, and we design three likelihood ratio scan statistics, lW, lC and lD, to capture all three signals. We then combine them to improve statistical power. The combination has shown promising results in our simulation studies: individual signals from lW, lC and lD are often not distinguishable from their background noise, yet when combined they make an evident call for the underlying SV event (see illustration in Figure 1).
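
The benefit of combining signals can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not SWAN's actual combination rule: the per-window scores are made-up numbers, the three statistics are simply summed (which would be the natural combination if the signals were independent log-likelihood ratios), and both thresholds are hypothetical.

```python
def combine_scores(l_w, l_c, l_d):
    """Sum three per-window scan statistics (assumed combination rule)."""
    if not (len(l_w) == len(l_c) == len(l_d)):
        raise ValueError("all score tracks must cover the same windows")
    return [w + c + d for w, c, d in zip(l_w, l_c, l_d)]


def call_windows(scores, threshold):
    """Return the indices of windows whose score exceeds the threshold."""
    return [i for i, s in enumerate(scores) if s > threshold]


# Made-up scores for five consecutive windows; window 2 carries a weak SV signal.
l_w = [0.5, 1.0, 4.0, 0.8, 0.2]
l_c = [0.3, 0.7, 3.5, 1.1, 0.4]
l_d = [0.9, 0.2, 4.2, 0.6, 0.1]

SINGLE_THRESHOLD = 5.0    # hypothetical cutoff for any one statistic
COMBINED_THRESHOLD = 9.0  # hypothetical cutoff for the summed statistic

single_calls = [call_windows(track, SINGLE_THRESHOLD) for track in (l_w, l_c, l_d)]
combined_calls = call_windows(combine_scores(l_w, l_c, l_d), COMBINED_THRESHOLD)

print("calls from individual statistics:", single_calls)
print("calls from the combined statistic:", combined_calls)
```

In this toy example no single track crosses its cutoff, while the summed score in window 2 clearly does, which mirrors the behavior described for the simulation studies.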

IMPLEMENTATION

Figure 1. The workflow of the proposed SWAN algorithm.

Currently, SWAN is implemented in R and C++. It is available as a standard Rcpp extension (see AVAILABILITY).

AVAILABILITY

  • Download the released master-branch source code package here and install it. See the README.txt file within the package (also viewable at https://bitbucket.org/charade/swan) for detailed installation instructions and other information.
  • Git-based development source code is available at https://bitbucket.org/charade/swan. The SWAN Rcpp package is open source so that advanced users can build pipelines around the analysis or implement their own variants.

WIKI

SWAN’s page is a growing resource for manuals, FAQs and other information. It is a MUST-read before you start using the SWAN tool. The documentation is openly editable, and you are more than welcome to contribute to it.

CONTACTS

Questions and comments should be addressed to lixia@stanford.edu.

REFERENCES

A genome-wide approach for detecting novel insertion-deletion variants of mid-range size. LC Xia, S Sakshuwong, E Hopmans, J Bell, S Grimes, D Siegmund, H Ji, NR Zhang. Nucleic Acids Research 44 (15), e126 (2016) https://doi.org/10.1093/nar/gkw481

Scan statistics on Poisson random fields with applications in genomics. NR Zhang, B Yakir, LC Xia, D Siegmund. Annals of Applied Statistics 10 (2), 726-752 (2016)