The program align.py uses mappy to align reads in Python across multiple worker threads. After loading the index, memory usage jumps quickly to >20 GB and then continues to climb steadily through 40 GB and beyond.
This issue was first discovered in bonito and isolated to mappy. The data flow in the example mirrors that in bonito, but is reduced to using only Python stdlib functionality.
mappy: v2.24
pysam: v0.18 (just for optionally reading fastq inputs)
python: v3.8.6
Run the program, creating query sequences from the index on the fly:

```
python align.py GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta.mmi --threads 48
```

or using a directory containing `*.fastq*` files:

```
python align.py GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta.mmi --fastq_dir FAQ32498 --threads 48
```
The inputs I am using are available in the AWS S3 bucket at:
s3://ont-research/misc/mappy-mem/FAQ32498.tar
s3://ont-research/misc/mappy-mem/GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta.mmi
I've not fully ascertained whether using many threads exacerbates the problem or simply makes the symptom apparent more quickly.