My goal is to create a gene activity matrix that perfectly aligns with the features of a scRNA seurat object. All data, ATAC and RNA, was mapped to GRCh38p13. As you know, duplicate gene names are given a decimal point suffix in seurat objects. To deal with this, I manually added these suffixes to my gene annotations:
See:
rna <- readRDS('rna99.rds')
atac <- readRDS(atac99.rds'))
hg38 <- rtracklayer::import(hg38.path)
genome(hg38) <- "hg38"
seqlevelsStyle(hg38) <- "UCSC"
#make all genes 'protein coding' so they are included in gene activities
hg38$gene_biotype <- "protein_coding"
Since Seurat adds decimal points to duplicate genes, need to do the same for annotations so that gene activity matrix aligns with scRNA Seurat object
#features.tsv is the 10X cellranger output with gene names and ensembl ID's
features.tsv <- read.table('features.tsv')
#filter out genes not present in scRNA data.
hg38 <- hg38[hg38$gene_id %in% features.tsv$ensg,]
#use ensg IDs to identify which duplicate gene has added suffix in the scRNA Seurat objects
#rename genes in the annotation according to RNA object names
df <- data.frame("rownames"=rownames(rna), "feature"=features.tsv$gene_name, "ensg" = features.tsv$ensg)
df2 <- df[df[,1] != df[,2],]
for (i in 1:nrow(df2)){
hg38$gene_name[which(hg38$gene_id == df2[i,3])] <- df2[i,1]
}
Annotation(atac) <- hg38
mtx <- GeneActivity(atac)
As you can see, hg38 has all the same genes as RNA Seurat objects, with the same names and added decimal points:
nrow(rna)
36601
length(unique(hg38$gene_name))
36601
table(hg38$gene_name %in% rownames(rna))
TRUE
2767648
The gene activity matrix is still not aligning
nrow(mtx)
36588
length(unique(rownames(mtx))
36559
It makes sense that there are 36601-13 = 36588 rows (mitochondrial genes excluded from gene activity matrix).
However, you can see that there are duplicates in the matrix, even after I manually changed the gene names
**Which means the gene activity matrix is duplicating some genes and leaving out others, even after I made sure all genes in the annotations were unique. **
Most of the left out genes come from unlocalized contiguous sequences, like KI270721.1.
Reads from these sequences are not present in my 10X cellranger data, so I am not worried about them.
I am concerned about why there are duplicate genes in my gene activity matrix even after I manually changed the gene names for each duplicate in the annotations:
rownames(mtx)[duplicated(rownames(mtx))]
[1] "PLCXD1" "GTPBP6" "LINC00685" "PPP2R3B" "AL732314.6"
[6] "AL732314.4" "SHOX" "AL672277.1" "CRLF2" "CSF2RA"
[11] "IL3RA" "SLC25A6" "LINC00106" "ASMTL-AS1" "ASMTL"
[16] "P2RY8" "AKAP17A" "ASMT" "AL683807.1" "AL683807.2"
[21] "DHRSX" "DHRSX-IT1" "ZBED1" "LINC00102" "CD99"
[26] "SPRY3" "VAMP7" "IL9R" "WASIR1"
Upon inspection of these genes, each one maps to both the X and the Y chromosome. Even though X and Y are homologous, the computer treats them as separate sets, and thus will add a suffix to the end of them. Is there a way to work around this?
RetroSearch is an open source project built by @garambo | Open a GitHub Issue
Search and Browse the WWW like it's 1997 | Search results from DuckDuckGo
HTML:
3.2
| Encoding:
UTF-8
| Version:
0.7.4