Introduction to scPloidy

Introduction

scPloidy is a package to compute ploidy of single cells (or nuclei) based on single-cell (or single-nucleus) ATAC-seq data. In ATAC-seq, open chromatin regions are excised and sequenced. For any site on the genome, ATAC-seq could read 0, 1 or 2 DNA fragments, if the cell was diploid. If the cell was tetraploid, ATAC-seq could read 0, 1, 2, 3 or 4 fragments from the same site. This is the basic idea used in scPloidy. We model the depth of DNA sequencing at one site by binomial distribution.

Usage

library(scPloidy)
library(GenomicRanges)
#> Loading required package: stats4
#> Loading required package: BiocGenerics
#> 
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:dplyr':
#> 
#>     combine, intersect, setdiff, union
#> The following objects are masked from 'package:stats':
#> 
#>     IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#> 
#>     Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
#>     as.data.frame, basename, cbind, colnames, dirname, do.call,
#>     duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
#>     lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
#>     pmin.int, rank, rbind, rownames, sapply, saveRDS, setdiff, table,
#>     tapply, union, unique, unsplit, which.max, which.min
#> Loading required package: S4Vectors
#> 
#> Attaching package: 'S4Vectors'
#> The following object is masked from 'package:gplots':
#> 
#>     space
#> The following object is masked from 'package:tidyr':
#> 
#>     expand
#> The following objects are masked from 'package:dplyr':
#> 
#>     first, rename
#> The following object is masked from 'package:utils':
#> 
#>     findMatches
#> The following objects are masked from 'package:base':
#> 
#>     I, expand.grid, unname
#> Loading required package: IRanges
#> 
#> Attaching package: 'IRanges'
#> The following objects are masked from 'package:dplyr':
#> 
#>     collapse, desc, slice
#> Loading required package: GenomeInfoDb
library(IRanges)
library(readr)

See description.

?fragmentoverlapcount
?ploidy

Analyzing sample data

In this section, we demonstrate the package by using a dataset included in the package. See next section on how to analyze your own data.

Compute overlaps of DNA fragments

We use small dataset for single-nucleus ATAC-seq of rat liver. To limit the file size, it only includes 10 cells and fragments from chromosomes 19 and 20. The fragments were mapped to the rat rn7 genome.

We first set the regions where overlaps are counted. This is usually all of the autosomes.

targetregions =
  GenomicRanges::GRanges(
    c("chr19", "chr20"),
    IRanges::IRanges(c(1, 1), width = 500000000))

Simple repeats in the genome can generate false overlaps. We exclude such regions. The regions were downloaded from USCS genome browser.

simpleRepeat = readr::read_tsv(
  system.file("extdata", "simpleRepeat.chr19_20.txt.gz", package = "scPloidy"),
  col_names = c("chrom", "chromStart", "chromEnd"))
#> Rows: 107037 Columns: 3
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: "\t"
#> chr (1): chrom
#> dbl (2): chromStart, chromEnd
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
simpleRepeat[, 2] = simpleRepeat[, 2] + 1 # convert from 0-based position to 1-based
simpleRepeat = GenomicRanges::makeGRangesFromDataFrame(
  as.data.frame(simpleRepeat),
  seqnames.field = "chrom",
  start.field    = "chromStart",
  end.field      = "chromEnd")

Now compute the overlaps. The input file SHR_m154211.10cells.chr19_20.fragments.txt.gz records the fragments sequenced in ATAC-seq.

fragmentoverlap =
  fragmentoverlapcount(
    system.file("extdata", "SHR_m154211.10cells.chr19_20.fragments.txt.gz", package = "scPloidy"),
    targetregions,
    excluderegions = simpleRepeat,
    Tn5offset = c(4, -5))
#>   |                                                                              |                                                                      |   0%  |                                                                              |===================================                                   |  50%  |                                                                              |======================================================================| 100%
fragmentoverlap
#> # A tibble: 10 × 8
#>    barcode     nfrags depth1 depth2 depth3 depth4 depth5 depth6
#>    <chr>        <int>  <int>  <int>  <int>  <int>  <int>  <int>
#>  1 BC00019_N01   6501   6288    201      9      2      1      0
#>  2 BC00022_N01   8006   7689    310      6      1      0      0
#>  3 BC00025_N01   5904   5767    133      3      1      0      0
#>  4 BC00026_N01   5257   5102    148      6      1      0      0
#>  5 BC00035_N01   6360   6198    158      4      0      0      0
#>  6 BC00055_N01   6271   6084    181      6      0      0      0
#>  7 BC00057_N01   5570   5413    152      4      1      0      0
#>  8 BC00077_N01   5997   5795    191     10      1      0      0
#>  9 BC00086_N01   5416   5261    151      4      0      0      0
#> 10 BC00087_N01   3470   3409     60      1      0      0      0
rm(fragmentoverlap)

Infer ploidy

Instead of the small dataset computed above, we here use a complete dataset for single-nucleus ATAC-seq of a rat liver.

data(SHR_m154211)
?SHR_m154211
fragmentoverlap = SHR_m154211$fragmentoverlap
fragmentoverlap
#> # A tibble: 3,572 × 8
#>    barcode     nfrags depth1 depth2 depth3 depth4 depth5 depth6
#>    <chr>        <int>  <int>  <int>  <int>  <int>  <int>  <int>
#>  1 BC00025_N01 129201 126908   2237     53      3      0      0
#>  2 BC00026_N01 130564 127401   3065     91      6      1      0
#>  3 BC00035_N01 123100 120093   2940     66      1      0      0
#>  4 BC00057_N01 110602 108061   2472     67      2      0      0
#>  5 BC00086_N01 110108 107374   2667     66      1      0      0
#>  6 BC00087_N01 108661 106717   1908     33      2      1      0
#>  7 BC00099_N01 107028 104685   2264     76      3      0      0
#>  8 BC00106_N01 100659  97576   2985     92      6      0      0
#>  9 BC00115_N01 105057 102889   2104     64      0      0      0
#> 10 BC00116_N01 105085 103519   1535     31      0      0      0
#> # ℹ 3,562 more rows

Compute ploidy, assuming a mixture of 2x (diploid), 4x (tetraploid) and 8x cells. We recommend using ploidy.moment which is based on moment method.

p = ploidy(fragmentoverlap,
           c(2, 4, 8))
#> number of iterations= 178
head(p)
#>       barcode ploidy.moment ploidy.momentfractional ploidy.kmeans ploidy.em
#> 1 BC00025_N01             8                5.706704             4         4
#> 2 BC00026_N01             8                7.485560             8         8
#> 3 BC00035_N01             8                7.391306             4         4
#> 4 BC00057_N01             8                7.083094             8         8
#> 5 BC00086_N01             8                7.522005             8         4
#> 6 BC00087_N01             8                5.703602             4         4

The cell type of the cells are stored in dataframe cells. There are many hepatocytes of 4x and 8x, but other cell types are mostly 2x.

cells = SHR_m154211$cells
table(cells$celltype,
      p$ploidy.moment[match(cells$barcode, p$barcode)])
#>              
#>                  2    4    8
#>   B             31   24    2
#>   T_NK          65   55    1
#>   endothelial  384  196    2
#>   hepatocyte   203 1343  780
#>   macrophage   140  139    1
#>   stellate     107   93    6

Analyzing your own data

Using fragments.tsv.gz generated by 10x Cell Ranger

In the Cell Ranger output directory, you should have files fragments.tsv.gz and fragments.tsv.gz.tbi. The file fragments.tsv.gz can be processed with the fragmentoverlapcount function, specifying the option Tn5offset = c(1, 0)

Using BAM file

Although this requires several steps, you can start from a BAM file and generate fragments file for scPloidy. You need to install samtools, bgzip and tabix, and run the following in your shell.

First generate a BAM file in which the cell barcode is prepended to read name, separated by ‘:’. For example, suppose in your input file bap.bam your barcode BCxxxxxx was stored in the field with DB tag as DB:Z:atac_possorted_bam_BCxxxxxx. We generate a BAM file snap.bam.

# extract the header file
samtools view bap.bam -H > header.sam

# create a bam file with the barcode embedded into the read name
cat <( cat header.sam ) \
<( samtools view bap.bam | awk '{for (i=12; i<=NF; ++i) { if ($i ~ "^DB:Z:atac_possorted_bam_BC"){ td[substr($i,1,2)] = substr($i,25,length($i)-24); } }; printf "%s:%s\n", td["DB"], $0 }' ) \
| samtools view -bS - > snap.bam

We next sort the reads by barcode, and obtain the file snap.nsrt.bam.

samtools sort -n -@ 20 -m 10G snap.bam -o snap.nsrt.bam

Finally, we extract fragments from the name-sorted BAM file, and obtain the file fragments.txt.gz.

samtools view -f 0x2 -F 0x900 -q 30 snap.nsrt.bam  \
  | samtofragmentbed.pl  \
  | perl -ne 'chomp; @a=split; $a[3]=~s/:.*//; print join("\t",@a), "\n"'  \
  | LC_ALL=C sort -T ./ -S 50% --parallel=12 -k1,1 -k2,3n -k4,4 -t$'\t' | uniq  \
  | bgzip > fragments.txt.gz
tabix -b 2 -e 3 fragments.txt.gz

The Perl script samtofragmentbed.pl is included in this package as this file:

system.file("perl", "samtofragmentbed.pl", package = "scPloidy")

The file fragments.txt.gz can be processed with the fragmentoverlapcount function, specifying the option Tn5offset = c(4, -5)