A RetroSearch Logo

Home - News ( United States | United Kingdom | Italy | Germany ) - Football scores

Search Query:

Showing content from https://github.com/vgteam/vg/wiki/VCF-export-with-vg-deconstruct below:

VCF export with vg deconstruct · vgteam/vg Wiki · GitHub

VCF is a standard, text-based format for representing genome variation. Since VCF is so widely used and supported, it's common to convert from vg to VCF to leverage existing linear-reference-based tools to study variation from a graph.

Graphs are exported to VCF using vg deconstruct. This tools creates a site (VCF line) for every snarl (bubble) in the graph. The allele information (reference and alts) for these sites is taken from paths in the graph. This is important: if your graph does not have sufficient path information for both reference and alt alleles, and/or these paths are not named correctly, then vg deconstruct will probably not work properly.

vg deconstruct will work best if all paths in your graph follow the "PanSN" naming convention described here since writing VCF requires parsing out SAMPLE, HAPLOTYPE and CONTIG fields from the path names.

The reference path forms the basis of all the coordinates in the VCF. A single path can be specified with -p or a set of paths with a common prefix with -P. By default, all REFERENCE-sense paths in the graph are used. This behaviour makes sense when running on a GBZ with a single reference, such as those often produced by Minigraph-Cactus or vg autoindex, but it's usually prodent to explicitly set the reference with -P.

The ALT alleles will be formed from all other paths in the graph (that were not selected as reference paths as described above). Unlike for reference paths, these paths can be HAPLOTYPE-sense paths from GBZ or GBWTs. To get meaningful genotypes in the output VCF, these paths must include sample and haplotype information either with the proper naming scheme or by virtue of being encoded in a valid GBZ/GBWT.

Cycles in the reference path as found, for example, in graphs from PGGB are supported. A site is written for each loop through the path, with alt alleles assigned to the appropriate sites by measuring similarity in their flanking context.

The ID field in VCF is the two node IDs and their orientations in the graph that correspond to the two endpoints of the snarl. Note that when running on GBZ input that was derived from a GFA, you can have IDs written in terms of the original GFA coordinates with -OP.

By default, all top-level sites (ie not nested in other sites) along the chosen reference path(s) are output. The -a option will write a site for every single snarl that is spanned by a reference path. Nesting information is stored in the following INFO tags in the VCF

Clustering similar alleles [experimental]

vg deconstruct writes base-level resolved alleles. So two SV insertions that differ by a single base will have separate alleles in the output. This can lead to a proliferation of multi-allelic sites and make SVs difficult to count.

-L is a new option that clusters together alleles based on how similar they are in the graph. More precisely, two alleles are merged if the Jaccard coefficient of their (basewide) graph positions is greater or equal to the value given. By default it is 1, meaning alleles are only merged if they are identical.

When -L is specified, and it is <1, then two additional FORMAT tags are written - these apply to each genotyped allele:

Nested decomposition [experimental]

-n is a new option for exporting nested sites that addresses some shortcomings with the -a option:

The new INFO TAGs are (work in progress):

Example 1: SNP inside deletion

vg deconstruct test/nesting/nested_snp_in_del.gfa -p x -n | tail -4

##contig=<ID=x,length=5>
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	a
x	1	>1>6	CATG	CAAG,C	60	.	AC=1,1;AF=0.5,0.5;AN=2;AT=>1>2>3>5>6,>1>2>4>5>6,>1>6;NS=1;PA=0;PL=4;PR=4;RC=x;RD=4;RL=4;RS=1;LV=0	GT	1|2
x	3	>2>5	T	A,*	60	.	AC=1,1;AF=0.5,0.5;AN=2;AT=>2>3>5,>2>4>5,.;NS=1;PA=0;PL=4;PR=4;RC=x;RD=4;RL=4;RS=1;LV=1;PS=>1>6	GT	1|2
Example 2: SNP inside insertion

vg deconstruct test/nesting/nested_snp_in_ins2.gfa -p x -n

##contig=<ID=a#1#y0,length=5>
##contig=<ID=x,length=2>
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	a	b
a#1#y0	3	>2>5	A	T,*	60	.	AC=1,1;AF=0.25,0.25;AN=4;AT=>2>4>5,>2>3>5,.;NS=2;PA=1;PL=4;PR=1;RC=x;RD=1;RL=4;RS=1;LV=1;PS=>1>6	GT	0|1	0|2
x	1	>1>6	C	CAAG,CATG	60	.	AC=2,1;AF=0.5,0.25;AN=4;AT=>1>6,>1>2>4>5>6,>1>2>3>5>6;NS=2;PA=0;PL=1;PR=1;RC=x;RD=1;RL=1;RS=1;LV=0	GT	1|2	1|0

RetroSearch is an open source project built by @garambo | Open a GitHub Issue

Search and Browse the WWW like it's 1997 | Search results from DuckDuckGo

HTML: 3.2 | Encoding: UTF-8 | Version: 0.7.4