A RetroSearch Logo

Home - News ( United States | United Kingdom | Italy | Germany ) - Football scores

Search Query:

Showing content from https://github.com/stuart-lab/signac/issues/521 below:

GeneActivity() duplicates genes on pseudoautosomal regions of XY chromosomes · Issue #521 · stuart-lab/signac · GitHub

My goal is to create a gene activity matrix that perfectly aligns with the features of a scRNA seurat object. All data, ATAC and RNA, was mapped to GRCh38p13. As you know, duplicate gene names are given a decimal point suffix in seurat objects. To deal with this, I manually added these suffixes to my gene annotations:
See:

rna <- readRDS('rna99.rds')
atac <- readRDS(atac99.rds'))

hg38 <- rtracklayer::import(hg38.path)
genome(hg38) <- "hg38"
seqlevelsStyle(hg38) <- "UCSC"

#make all genes 'protein coding' so they are included in gene activities
hg38$gene_biotype <- "protein_coding"

Since Seurat adds decimal points to duplicate genes, need to do the same for annotations so that gene activity matrix aligns with scRNA Seurat object

#features.tsv is the 10X cellranger output with gene names and ensembl ID's
features.tsv <- read.table('features.tsv')

#filter out genes not present in scRNA data. 
hg38 <- hg38[hg38$gene_id %in% features.tsv$ensg,]

#use ensg IDs to identify which duplicate gene has added suffix in the scRNA Seurat objects
#rename genes in the annotation according to RNA object names
df <- data.frame("rownames"=rownames(rna), "feature"=features.tsv$gene_name, "ensg" = features.tsv$ensg)
df2 <- df[df[,1] != df[,2],]
for (i in 1:nrow(df2)){
  hg38$gene_name[which(hg38$gene_id == df2[i,3])] <- df2[i,1]
}

Annotation(atac) <- hg38

mtx <- GeneActivity(atac)

As you can see, hg38 has all the same genes as RNA Seurat objects, with the same names and added decimal points:

nrow(rna)
36601
length(unique(hg38$gene_name))
36601
table(hg38$gene_name %in% rownames(rna))
TRUE
2767648

The gene activity matrix is still not aligning

nrow(mtx)
36588
length(unique(rownames(mtx))
36559

It makes sense that there are 36601-13 = 36588 rows (mitochondrial genes excluded from gene activity matrix).
However, you can see that there are duplicates in the matrix, even after I manually changed the gene names

**Which means the gene activity matrix is duplicating some genes and leaving out others, even after I made sure all genes in the annotations were unique. **

Most of the left out genes come from unlocalized contiguous sequences, like KI270721.1.
Reads from these sequences are not present in my 10X cellranger data, so I am not worried about them.

I am concerned about why there are duplicate genes in my gene activity matrix even after I manually changed the gene names for each duplicate in the annotations:

rownames(mtx)[duplicated(rownames(mtx))]
 [1] "PLCXD1"     "GTPBP6"     "LINC00685"  "PPP2R3B"    "AL732314.6"
 [6] "AL732314.4" "SHOX"       "AL672277.1" "CRLF2"      "CSF2RA"
[11] "IL3RA"      "SLC25A6"    "LINC00106"  "ASMTL-AS1"  "ASMTL"
[16] "P2RY8"      "AKAP17A"    "ASMT"       "AL683807.1" "AL683807.2"
[21] "DHRSX"      "DHRSX-IT1"  "ZBED1"      "LINC00102"  "CD99"
[26] "SPRY3"      "VAMP7"      "IL9R"       "WASIR1"

Upon inspection of these genes, each one maps to both the X and the Y chromosome. Even though X and Y are homologous, the computer treats them as separate sets, and thus will add a suffix to the end of them. Is there a way to work around this?


RetroSearch is an open source project built by @garambo | Open a GitHub Issue

Search and Browse the WWW like it's 1997 | Search results from DuckDuckGo

HTML: 3.2 | Encoding: UTF-8 | Version: 0.7.4