--- title: "Introduction to scPloidy" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{intro} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) has_pkg = requireNamespace("GenomicRanges", quietly = TRUE) && requireNamespace("IRanges", quietly = TRUE) && requireNamespace("readr", quietly = TRUE) knitr::opts_chunk$set(eval = has_pkg) ``` ## Introduction `scPloidy` is a package to compute ploidy of single cells (or nuclei) based on single-cell (or single-nucleus) ATAC-seq data. In ATAC-seq, open chromatin regions are excised and sequenced. For any site on the genome, ATAC-seq could read 0, 1 or 2 DNA fragments, if the cell was diploid. If the cell was tetraploid, ATAC-seq could read 0, 1, 2, 3 or 4 fragments from the same site. This is the basic idea used in `scPloidy`. We model the depth of DNA sequencing at one site by binomial distribution. ## Usage ```{r setup} library(scPloidy) library(GenomicRanges) library(IRanges) library(readr) ``` See description. ```{r} ?fragmentoverlapcount ?ploidy ``` ## Analyzing sample data In this section, we demonstrate the package by using a dataset included in the package. See next section on how to analyze your own data. ### Compute overlaps of DNA fragments We use small dataset for single-nucleus ATAC-seq of rat liver. To limit the file size, it only includes 10 cells and fragments from chromosomes 19 and 20. The fragments were mapped to the rat rn7 genome. We first set the regions where overlaps are counted. This is usually all of the autosomes. ```{r} targetregions = GenomicRanges::GRanges( c("chr19", "chr20"), IRanges::IRanges(c(1, 1), width = 500000000)) ``` Simple repeats in the genome can generate false overlaps. We exclude such regions. The regions were downloaded from USCS genome browser. ```{r} simpleRepeat = readr::read_tsv( system.file("extdata", "simpleRepeat.chr19_20.txt.gz", package = "scPloidy"), col_names = c("chrom", "chromStart", "chromEnd")) simpleRepeat[, 2] = simpleRepeat[, 2] + 1 # convert from 0-based position to 1-based simpleRepeat = GenomicRanges::makeGRangesFromDataFrame( as.data.frame(simpleRepeat), seqnames.field = "chrom", start.field = "chromStart", end.field = "chromEnd") ``` Now compute the overlaps. The input file `SHR_m154211.10cells.chr19_20.fragments.txt.gz` records the fragments sequenced in ATAC-seq. ```{r} fragmentoverlap = fragmentoverlapcount( system.file("extdata", "SHR_m154211.10cells.chr19_20.fragments.txt.gz", package = "scPloidy"), targetregions, excluderegions = simpleRepeat, Tn5offset = c(4, -5)) fragmentoverlap rm(fragmentoverlap) ``` ### Infer ploidy Instead of the small dataset computed above, we here use a complete dataset for single-nucleus ATAC-seq of a rat liver. ```{r} data(SHR_m154211) ?SHR_m154211 fragmentoverlap = SHR_m154211$fragmentoverlap fragmentoverlap ``` Compute ploidy, assuming a mixture of 2x (diploid), 4x (tetraploid) and 8x cells. We recommend using `ploidy.moment` which is based on moment method. ```{r} p = ploidy(fragmentoverlap, c(2, 4, 8)) head(p) ``` The cell type of the cells are stored in dataframe `cells`. There are many hepatocytes of 4x and 8x, but other cell types are mostly 2x. ```{r} cells = SHR_m154211$cells table(cells$celltype, p$ploidy.moment[match(cells$barcode, p$barcode)]) ``` ## Analyzing your own data ### Using fragments.tsv.gz generated by 10x Cell Ranger In the Cell Ranger output directory, you should have files `fragments.tsv.gz` and `fragments.tsv.gz.tbi`. The file `fragments.tsv.gz` can be processed with the `fragmentoverlapcount` function, specifying the option `Tn5offset = c(1, 0)` ### Using BAM file Although this requires several steps, you can start from a BAM file and generate fragments file for `scPloidy`. You need to install `samtools`, `bgzip` and `tabix`, and run the following in your shell. First generate a BAM file in which the cell barcode is prepended to read name, separated by ':'. For example, suppose in your input file `bap.bam` your barcode `BCxxxxxx` was stored in the field with `DB` tag as `DB:Z:atac_possorted_bam_BCxxxxxx`. We generate a BAM file `snap.bam`. ```{bash, eval = FALSE} # extract the header file samtools view bap.bam -H > header.sam # create a bam file with the barcode embedded into the read name cat <( cat header.sam ) \ <( samtools view bap.bam | awk '{for (i=12; i<=NF; ++i) { if ($i ~ "^DB:Z:atac_possorted_bam_BC"){ td[substr($i,1,2)] = substr($i,25,length($i)-24); } }; printf "%s:%s\n", td["DB"], $0 }' ) \ | samtools view -bS - > snap.bam ``` We next sort the reads by barcode, and obtain the file `snap.nsrt.bam`. ```{bash, eval = FALSE} samtools sort -n -@ 20 -m 10G snap.bam -o snap.nsrt.bam ``` Finally, we extract fragments from the name-sorted BAM file, and obtain the file `fragments.txt.gz`. ```{bash, eval = FALSE} samtools view -f 0x2 -F 0x900 -q 30 snap.nsrt.bam \ | samtofragmentbed.pl \ | perl -ne 'chomp; @a=split; $a[3]=~s/:.*//; print join("\t",@a), "\n"' \ | LC_ALL=C sort -T ./ -S 50% --parallel=12 -k1,1 -k2,3n -k4,4 -t$'\t' | uniq \ | bgzip > fragments.txt.gz tabix -b 2 -e 3 fragments.txt.gz ``` The Perl script `samtofragmentbed.pl` is included in this package as this file: ```{r, eval = FALSE} system.file("perl", "samtofragmentbed.pl", package = "scPloidy") ``` The file `fragments.txt.gz` can be processed with the `fragmentoverlapcount` function, specifying the option `Tn5offset = c(4, -5)`