scPloidy
is a package to compute ploidy of single cells
(or nuclei) based on single-cell (or single-nucleus) ATAC-seq data. In
ATAC-seq, open chromatin regions are excised and sequenced. For any site
on the genome, ATAC-seq could read 0, 1 or 2 DNA fragments, if the cell
was diploid. If the cell was tetraploid, ATAC-seq could read 0, 1, 2, 3
or 4 fragments from the same site. This is the basic idea used in
scPloidy
. We model the depth of DNA sequencing at one site
by binomial distribution.
library(scPloidy)
library(GenomicRanges)
#> Loading required package: stats4
#> Loading required package: BiocGenerics
#>
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:dplyr':
#>
#> combine, intersect, setdiff, union
#> The following objects are masked from 'package:stats':
#>
#> IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#>
#> Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
#> as.data.frame, basename, cbind, colnames, dirname, do.call,
#> duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
#> lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
#> pmin.int, rank, rbind, rownames, sapply, saveRDS, setdiff, table,
#> tapply, union, unique, unsplit, which.max, which.min
#> Loading required package: S4Vectors
#>
#> Attaching package: 'S4Vectors'
#> The following object is masked from 'package:gplots':
#>
#> space
#> The following object is masked from 'package:tidyr':
#>
#> expand
#> The following objects are masked from 'package:dplyr':
#>
#> first, rename
#> The following object is masked from 'package:utils':
#>
#> findMatches
#> The following objects are masked from 'package:base':
#>
#> I, expand.grid, unname
#> Loading required package: IRanges
#>
#> Attaching package: 'IRanges'
#> The following objects are masked from 'package:dplyr':
#>
#> collapse, desc, slice
#> Loading required package: GenomeInfoDb
library(IRanges)
library(readr)
See description.
In this section, we demonstrate the package by using a dataset included in the package. See next section on how to analyze your own data.
We use small dataset for single-nucleus ATAC-seq of rat liver. To limit the file size, it only includes 10 cells and fragments from chromosomes 19 and 20. The fragments were mapped to the rat rn7 genome.
We first set the regions where overlaps are counted. This is usually all of the autosomes.
targetregions =
GenomicRanges::GRanges(
c("chr19", "chr20"),
IRanges::IRanges(c(1, 1), width = 500000000))
Simple repeats in the genome can generate false overlaps. We exclude such regions. The regions were downloaded from USCS genome browser.
simpleRepeat = readr::read_tsv(
system.file("extdata", "simpleRepeat.chr19_20.txt.gz", package = "scPloidy"),
col_names = c("chrom", "chromStart", "chromEnd"))
#> Rows: 107037 Columns: 3
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: "\t"
#> chr (1): chrom
#> dbl (2): chromStart, chromEnd
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
simpleRepeat[, 2] = simpleRepeat[, 2] + 1 # convert from 0-based position to 1-based
simpleRepeat = GenomicRanges::makeGRangesFromDataFrame(
as.data.frame(simpleRepeat),
seqnames.field = "chrom",
start.field = "chromStart",
end.field = "chromEnd")
Now compute the overlaps. The input file
SHR_m154211.10cells.chr19_20.fragments.txt.gz
records the
fragments sequenced in ATAC-seq.
fragmentoverlap =
fragmentoverlapcount(
system.file("extdata", "SHR_m154211.10cells.chr19_20.fragments.txt.gz", package = "scPloidy"),
targetregions,
excluderegions = simpleRepeat,
Tn5offset = c(4, -5))
#> | | | 0% | |=================================== | 50% | |======================================================================| 100%
fragmentoverlap
#> # A tibble: 10 × 8
#> barcode nfrags depth1 depth2 depth3 depth4 depth5 depth6
#> <chr> <int> <int> <int> <int> <int> <int> <int>
#> 1 BC00019_N01 6501 6288 201 9 2 1 0
#> 2 BC00022_N01 8006 7689 310 6 1 0 0
#> 3 BC00025_N01 5904 5767 133 3 1 0 0
#> 4 BC00026_N01 5257 5102 148 6 1 0 0
#> 5 BC00035_N01 6360 6198 158 4 0 0 0
#> 6 BC00055_N01 6271 6084 181 6 0 0 0
#> 7 BC00057_N01 5570 5413 152 4 1 0 0
#> 8 BC00077_N01 5997 5795 191 10 1 0 0
#> 9 BC00086_N01 5416 5261 151 4 0 0 0
#> 10 BC00087_N01 3470 3409 60 1 0 0 0
rm(fragmentoverlap)
Instead of the small dataset computed above, we here use a complete dataset for single-nucleus ATAC-seq of a rat liver.
data(SHR_m154211)
?SHR_m154211
fragmentoverlap = SHR_m154211$fragmentoverlap
fragmentoverlap
#> # A tibble: 3,572 × 8
#> barcode nfrags depth1 depth2 depth3 depth4 depth5 depth6
#> <chr> <int> <int> <int> <int> <int> <int> <int>
#> 1 BC00025_N01 129201 126908 2237 53 3 0 0
#> 2 BC00026_N01 130564 127401 3065 91 6 1 0
#> 3 BC00035_N01 123100 120093 2940 66 1 0 0
#> 4 BC00057_N01 110602 108061 2472 67 2 0 0
#> 5 BC00086_N01 110108 107374 2667 66 1 0 0
#> 6 BC00087_N01 108661 106717 1908 33 2 1 0
#> 7 BC00099_N01 107028 104685 2264 76 3 0 0
#> 8 BC00106_N01 100659 97576 2985 92 6 0 0
#> 9 BC00115_N01 105057 102889 2104 64 0 0 0
#> 10 BC00116_N01 105085 103519 1535 31 0 0 0
#> # ℹ 3,562 more rows
Compute ploidy, assuming a mixture of 2x (diploid), 4x (tetraploid)
and 8x cells. We recommend using ploidy.moment
which is
based on moment method.
p = ploidy(fragmentoverlap,
c(2, 4, 8))
#> number of iterations= 178
head(p)
#> barcode ploidy.moment ploidy.momentfractional ploidy.kmeans ploidy.em
#> 1 BC00025_N01 8 5.706704 4 4
#> 2 BC00026_N01 8 7.485560 8 8
#> 3 BC00035_N01 8 7.391306 4 4
#> 4 BC00057_N01 8 7.083094 8 8
#> 5 BC00086_N01 8 7.522005 8 4
#> 6 BC00087_N01 8 5.703602 4 4
The cell type of the cells are stored in dataframe
cells
. There are many hepatocytes of 4x and 8x, but other
cell types are mostly 2x.
In the Cell Ranger output directory, you should have files
fragments.tsv.gz
and fragments.tsv.gz.tbi
. The
file fragments.tsv.gz
can be processed with the
fragmentoverlapcount
function, specifying the option
Tn5offset = c(1, 0)
Although this requires several steps, you can start from a BAM file
and generate fragments file for scPloidy
. You need to
install samtools
, bgzip
and
tabix
, and run the following in your shell.
First generate a BAM file in which the cell barcode is prepended to
read name, separated by ‘:’. For example, suppose in your input file
bap.bam
your barcode BCxxxxxx
was stored in
the field with DB
tag as
DB:Z:atac_possorted_bam_BCxxxxxx
. We generate a BAM file
snap.bam
.
# extract the header file
samtools view bap.bam -H > header.sam
# create a bam file with the barcode embedded into the read name
cat <( cat header.sam ) \
<( samtools view bap.bam | awk '{for (i=12; i<=NF; ++i) { if ($i ~ "^DB:Z:atac_possorted_bam_BC"){ td[substr($i,1,2)] = substr($i,25,length($i)-24); } }; printf "%s:%s\n", td["DB"], $0 }' ) \
| samtools view -bS - > snap.bam
We next sort the reads by barcode, and obtain the file
snap.nsrt.bam
.
Finally, we extract fragments from the name-sorted BAM file, and
obtain the file fragments.txt.gz
.
samtools view -f 0x2 -F 0x900 -q 30 snap.nsrt.bam \
| samtofragmentbed.pl \
| perl -ne 'chomp; @a=split; $a[3]=~s/:.*//; print join("\t",@a), "\n"' \
| LC_ALL=C sort -T ./ -S 50% --parallel=12 -k1,1 -k2,3n -k4,4 -t$'\t' | uniq \
| bgzip > fragments.txt.gz
tabix -b 2 -e 3 fragments.txt.gz
The Perl script samtofragmentbed.pl
is included in this
package as this file:
The file fragments.txt.gz
can be processed with the
fragmentoverlapcount
function, specifying the option
Tn5offset = c(4, -5)