Physcraper module
The core blasting and new sequence integration module
physcraper.scrape.
PhyscraperScrape
(data_obj, ids_obj=None, search_taxon=None)[source]¶
This is the class that does the perpetual updating
To build the class the following is needed:
- data_obj: Object of class ATT (see above)
- ids_obj: Object of class IdDict (see above)
During the initializing process the following self.objects are generated:
- self.workdir: path to working directory retrieved from ATT object = data_obj.workdir
- self.logfile: path of logfile
- self.data: ATT object
- self.ids: IdDict object
- self.config: Config object
- self.new_seqs: dictionary that contains the newly found seq using blast:
- key: gi id
- value: corresponding seq
- self.new_seqs_otu_id: dictionary that contains
the new sequences that passed the remove_identical_seq() step:
- key: otu_id
- value: see otu_dict, is a subset of the otu_dict, all sequences that will be newly added to aln and tre
- self.mrca_ncbi: int ncbi identifier of mrca
- self.blast_subdir: path to folder that contains the files written during blast
- self.newseqs_file: filename of files that contains the sequences from self.new_seqs_otu_id
- self.date: Date of the run - may lag behind real date!
- self.repeat: either 1 or 0, used to determine whether we continue updating the tree; no new seqs found = 0
- self.newseqs_acc: list of all gi_ids that were passed into remove_identical_seq(). Used to speed up the adding process
- self.blocklist: list of gi_ids of sequences that shall not be added or need to be removed. Supplied by the user.
- self.seq_filter: list of words that may occur in otu_dict.status and which shall not be used in the building of FilterBlast.sp_d (that's the main function); it is also used in assert statements to make sure unwanted seqs are not added.
- self.unpublished: True/False. Used to look for local unpublished seq that shall be added if True.
- self.path_to_local_seq: Usually False, contains path to unpublished sequences if option is used.
Following functions are called during the init-process:
- self.reset_markers(): adds status markers to self; they are used to make sure certain functions run if the program crashed and the pickle file is read back in.
- self._blasted: 0/1, if run_blast_wrapper() was called, it is set to 1 for the round.
- self._blast_read: 0/1, if read_blast_wrapper() was called, it is set to 1 for the round.
- self._identical_removed: 0
- self._query_seqs_written: 0/1, if write_query_seqs() was called, it is set to 1 for the round.
- self._query_seqs_aligned: 0
- self._query_seqs_placed: 0/1, if place_query_seqs() was called, it is set to 1 for the round.
- self._reconciled: 0
- self._full_tree_est: 0/1, if est_full_tree() was called, it is set to 1 for the round.
align_new_seqs
(aligner='muscle')[source]¶
Align the new sequences against each other
calculate_bootstrap
(alignment='default', num_reps='100')[source]¶
Calculates bootstrap and consensus trees.
- -p: random seed
- -s: aln file
- -n: output fn
- -t: starting tree
- -b: bootstrap random seed
- -#: bootstrap stopping criteria
- -z: specifies file with multiple trees
calculate_final_tree
(boot_reps=100)[source]¶
Calculates the final tree using a trimmed alignment.
check_complement
(match, seq, gb_id)[source]¶
Double-check whether the blast match is to the sequence, its complement, or its reverse complement, and return the correct seq
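The orientation check above can be sketched with plain strings; `complement`, `reverse_complement`, and `orient_match` are hypothetical helper names for illustration, not the actual physcraper implementation:

```python
# Hypothetical helpers illustrating the orientation check; not the
# actual physcraper implementation.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def complement(seq):
    """Base-wise complement of a DNA string (unknown bases become N)."""
    return "".join(COMPLEMENT.get(base, "N") for base in seq.upper())

def reverse_complement(seq):
    """Reverse complement of a DNA string."""
    return complement(seq)[::-1]

def orient_match(match, seq):
    """Return seq in whichever orientation contains the blast match,
    or None if it is found in none of them."""
    for candidate in (seq, complement(seq), reverse_complement(seq)):
        if match.upper() in candidate.upper():
            return candidate
    return None
```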
est_full_tree
(alignment='default', startingtree=None)[source]¶
Full RAxML run from the placement tree as starting tree. The PTHREADS version is the faster one; hopefully people install it, and if not it falls back to the standard RAxML.
filter_seqs
(tmp_dict, selection='random', threshold=None)[source]¶
Subselect sequences down to a threshold number of seqs per species
get_full_seq
(gb_id, blast_seq)[source]¶
Get full sequence from gb_acc that was retrieved via blast.
Currently only used for local searches; GenBank database sequences are retrieved in batch mode, which is hopefully faster.
Returns: the full sequence, i.e. the whole submitted sequence, not only the part that matched the blast query sequence
make_sp_dict
(otu_list=None)[source]¶
Makes dict of OT_ids by species
map_taxa_to_ncbi
()[source]¶
Find NCBI ids for taxa from OpenTree
read_blast_wrapper
(blast_dir=None)[source]¶
reads in and processes the blast xml files
Parameters: blast_dir – path to directory which contains blast files
Returns: fills different dictionaries with information from the blast files
read_local_blast_query
(fn_path)[source]¶
Implementation to read in results of local blast searches.
Parameters: fn_path – path to file containing the local blast searches
Returns: updated self.new_seqs and self.data.gb_dict dictionaries
read_webbased_blast_query
(fn_path)[source]¶
Implementation to read in results of web blast searches.
Parameters: fn_path – path to file containing the web blast searches
Returns: updated self.new_seqs and self.data.gb_dict dictionaries
remove_blocklistitem
()[source]¶
This removes items from aln and tre if the corresponding GenBank identifiers were added to the blocklist.
Note that seqs that were not added because they were similar to the one being removed here are lost (that should not be a major issue though, as in a new blast run, new seqs from the taxon can be added.)
remove_identical_seqs
()[source]¶
Goes through the new seqs pulled down, removes the ones that are shorter than LENGTH_THRESH percent of the original seq lengths, chooses the longer of two that are otherwise identical, and puts them in a dict with the new name as gi_ott_id.
replace_aln
(filename, schema='fasta')[source]¶
Replace the alignment in the data object with the new alignment
replace_tre
(filename, schema='newick')[source]¶
Replace the tree in the data object with the new tree
reset_markers
()[source]¶
set completion markers back to 0 for a re-run
run_blast_wrapper
()[source]¶
generates the blast queries and saves them depending on the blasting method to different file formats
It runs blast if the sequence was not blasted within the user-defined threshold in the config file (delay).
Returns: writes blast queries to file
run_local_blast_cmd
(query, taxon_label, fn_path)[source]¶
Contains the cmds used to run a local blast query, which is different from the web-queries.
Returns: runs local blast query and writes it to file
run_muscle
(input_aln_path=None, new_seqs_path=None, outname='all_align')[source]¶
Aligns the new sequences and then profile-aligns them to the existing alignment
run_web_blast_query
(query, equery, fn_path)[source]¶
Equivalent to run_local_blast_cmd() but for web queries, which need to be implemented differently.
Returns: runs web blast query and writes it to file
select_seq_at_random
(otu_list, count)[source]¶
Selects sequences at random if there are more than the threshold.
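A minimal stdlib sketch of this subsampling step (the function body here is illustrative, not physcraper's code):

```python
import random

def select_seq_at_random(otu_list, count):
    # Illustrative sketch: when a taxon has more candidate sequences than
    # the per-species threshold, keep `count` of them chosen at random.
    if len(otu_list) <= count:
        return list(otu_list)
    return random.sample(otu_list, count)
```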
seq_dict_build
(seq, new_otu_label, seq_dict)[source]¶
Takes a sequence, a label (the otu_id) and a dictionary, and adds the sequence to the dict only if it is not a subsequence of a sequence already in the dict. If the new sequence is a supersequence of one in the dict, it removes that sequence and replaces it.
Returns: updated seq_dict
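The subsequence/supersequence logic can be sketched on plain strings (simplified: the real method stores per-otu metadata rather than bare sequences):

```python
def seq_dict_build(seq, new_otu_label, seq_dict):
    """Illustrative sketch of the subsequence filter: keep a sequence only
    if it is not contained in one already kept; if it contains one already
    kept, replace the shorter one."""
    new = seq.replace("-", "").upper()
    for label, existing in list(seq_dict.items()):
        kept = existing.replace("-", "").upper()
        if new in kept:
            # new sequence is a subsequence of one already kept: drop it
            return seq_dict
        if kept in new:
            # new sequence is a supersequence: remove the shorter one
            del seq_dict[label]
    seq_dict[new_otu_label] = seq
    return seq_dict
```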
summarize_boot
(besttreepath, bootpath, min_clade_freq=0.2)[source]¶
Summarize the bootstrap proportions onto the ML tree
write_mrca
()[source]¶
Write out search info to file
write_new_seqs
(filename='date')[source]¶
writes out the query sequence file
physcraper.scrape.
debug
(msg)[source]¶
short debugging command
physcraper.scrape.
set_verbose
()[source]¶
Set output to verbose
AlignTreeTax: The core data object for Physcraper. Holds and links name spaces for a tree, an alignment, the taxa and their metadata.
physcraper.aligntreetax.
AlignTreeTax
(tree, otu_dict, alignment, search_taxon, workdir, configfile=None, tree_schema='newick', aln_schema='fasta', tag=None)[source]¶
Wrap up the key parts together, requires OTT_id, and names must already match. Hypothetically, all the keys in the otu_dict should be clean.
- alignment: a dendropy :class:`DnaCharacterMatrix <dendropy.datamodel.charmatrixmodel.DnaCharacterMatrix>` object
- search_taxon: OToL identifier of the group of interest, either a subclade as defined by the user or of all tip labels in the phylogeny
- workdir: the path to the corresponding working directory
- config_obj: Config class
- schema: optional argument to define the tre file schema, if different from 'newick'
self.aln: contains the alignment and which will be updated during the run
self.tre: contains the phylogeny, which will be updated during the run
self.otu_dict: dictionary with the metadata for every otu
- key: otu_id, a unique identifier
- value: dictionary with the metadata for that otu, e.g.:
- '^physcraper:status': whether the sequence was 'original', 'queried', 'removed', 'added during filtering process'
- '^ot:ottTaxonName': OToL taxon name
- '^physcraper:last_blasted': contains the date when the sequence was blasted
- '^user:TaxonName': optional, user-given label from OtuJsonDict
- '^ot:originalLabel': optional, user-given tip label of phylogeny
self.ps_otu: iterator for new otu IDs, is used as key for self.otu_dict
self.workdir: contains the path to the working directory; if the folder does not exist, it is generated.
self.mrca_ott: OToL taxon Id for the most recent common ancestor of the ingroup
self.orig_seqlen: list of the original sequence length of the input data
self.gi_dict: dictionary that has all information from sequences found during the blasting.
- key: GenBank sequence identifier
- value: dictionary; content depends on the blast option and differs between web queries and local blast queries
- key - value pairs for local blast:
- '^ncbi:gi': GenBank sequence identifier
- 'accession': GenBank accession number
- 'staxids': Taxon identifier
- 'sscinames': Taxon species name
- 'pident': Blast percentage of identical matches
- 'evalue': Blast e-value
- 'bitscore': Blast bitscore, used for FilterBlast
- 'sseq': corresponding sequence
- 'title': title of GenBank sequence submission
- key - value pairs for web query:
- 'accession': GenBank accession number
- 'length': length of sequence
- 'title': string combination of hit_id and hit_def
- 'hit_id': string combination of gi id and accession number
- 'hsps': Bio.Blast.Record.HSP object
- 'hit_def': title from GenBank sequence
- optional key - value pairs for unpublished option:
- 'localID': local sequence identifier
self._reconciled: True/False,
self.unpubl_otu_json: optional, will contain the OTU-dict for unpublished data, if that option is used
add_otu
(gb_id, ids_obj)[source]¶
Generates an otu_id for new sequences and adds them into self.otu_dict. Needs to be passed an IdDict to do the mapping.
Returns: the unique otu_id – the key from self.otu_dict of the corresponding sequence
check_tre_in_aln
()[source]¶
Makes sure that everything which is in tre is also found in aln.
Extracted method from trim. Not sure we actually need it there.
get_otu_for_acc
(gb_id)[source]¶
A reverse search to find the unique OTU ID for a given accession number :param gb_id: the Genbank identifier
prune_short
()[source]¶
Prunes sequences from alignment if they are shorter than specified in the config file, or if tip is only present in tre.
Sometimes in the de-concatenating of the original alignment taxa with no sequence are generated or in general if certain sequences are really short. This removes those from both the tre and the alignment.
has test: test_prune_short.py
Returns: prunes aln and tre
read_in_aln
(alignment, aln_schema)[source]¶
Reads in an alignment to the object taxon namespace.
read_in_tree
(tree, tree_schema=None)[source]¶
Imports a tree either from a file or a dendropy data object. Adds records in OTU dictionary if not already present.
remove_taxa_aln_tre
(taxon_label)[source]¶
Removes taxa from aln and tre and updates otu_dict, takes a single taxon_label as input.
note: has test, test_remove_taxa_aln_tre.py
Parameters: taxon_label – taxon_label from dendropy object (aln or phy)
Returns: removes information/data from taxon_label
trim
(min_taxon_perc)[source]¶
Removes bases at the start and end of the alignment if they are represented by fewer sequences than the value specified, e.g. with 0.75, 75% of the sequences need to have a base present.
Ensures that whole chromosomes do not get dragged in. It cuts the ends of long sequences.
has test: test_trim.py
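The trimming rule can be sketched on a dict of aligned strings (an illustration of the rule only; the real method works on dendropy objects):

```python
def trim(aln, min_taxon_perc=0.75):
    """Drop leading and trailing alignment columns where fewer than
    min_taxon_perc of the sequences have a base (a non-gap character).
    `aln` maps label -> aligned sequence; all sequences same length."""
    n_taxa = len(aln)
    n_cols = len(next(iter(aln.values())))

    def dense(col):
        # a column is kept if enough taxa have a base (not '-' or '?')
        present = sum(1 for s in aln.values() if s[col] not in "-?")
        return present / n_taxa >= min_taxon_perc

    start = 0
    while start < n_cols and not dense(start):
        start += 1
    end = n_cols
    while end > start and not dense(end - 1):
        end -= 1
    return {label: s[start:end] for label, s in aln.items()}
```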
write_aln
(filename=None, alnschema='fasta', direc='workdir')[source]¶
Output alignment with unique otu ids as labels.
write_files
(treefilename=None, treeschema='newick', alnfilename=None, alnschema='fasta', direc='workdir')[source]¶
Outputs both the streaming files, labeled with OTU ids. Can be mapped to original labels using otu_dict.json or otu_seq_info.csv
write_labelled
(label, filename='labelled', direc='workdir', norepeats=True, add_gb_id=False)[source]¶
Output tree and alignment with human readable labels. Jumps through a bunch of hoops to make labels unique.
NOT MEMORY EFFICIENT AT ALL
Has different options available for different desired outputs
Returns: writes out labelled phylogeny and alignment to file
write_labelled_aln
(label, filename='labelled', direc='workdir', norepeats=True, add_gb_id=False)[source]¶
A wrapper for the write_labelled aln function to maintain older functionalities
write_labelled_tree
(label, filename='labelled', direc='workdir', norepeats=True, add_gb_id=False)[source]¶
A wrapper for the write_labelled tree function to maintain older functionalities
write_otus
(filename='otu_info', schema='table', direc='workdir')[source]¶
Output all of the OTU information as either json or csv
write_papara_files
(treefilename='random_resolve.tre', alnfilename='aln_ott.phy')[source]¶
This writes out needed files for papara (except query sequences). Papara is finicky about trees and needs phylip format for the alignment.
NOTE: names for tree and aln files should not be changed, as they are hardcoded in align_query_seqs().
Is only used within func align_query_seqs.
write_random_resolve_tre
(treefilename='random_resolve.tre', direc='workdir')[source]¶
Randomly resolve polytomies, because some downstream approaches require that, e.g. Papara.
physcraper.aligntreetax.
generate_ATT_from_files
(workdir, configfile, alnfile, aln_schema, treefile, otu_json, tree_schema, search_taxon=None)[source]¶
Build an ATT object without phylesystem, use your own files instead.
Spaces vs underscores kept being an issue, so all spaces are coerced to underscores when data are read in.
Note: has test -> test_owndata.py
Returns: object of class ATT
physcraper.aligntreetax.
generate_ATT_from_run
(workdir, start_files='output', tag=None, configfile=None, run=True)[source]¶
Build an ATT object from the outputs of a previous run, without phylesystem. :return: object of class ATT
physcraper.aligntreetax.
set_verbose
()[source]¶
Set verbosity of outputs
physcraper.aligntreetax.
write_labelled_aln
(aligntreetax, label, filepath, schema='fasta', norepeats=True, add_gb_id=False)[source]¶
Output tree and alignment with human readable labels. Jumps through a bunch of hoops to make labels unique.
NOT MEMORY EFFICIENT AT ALL
Has different options available for different desired outputs.
Returns: writes out labelled phylogeny and alignment to file
physcraper.aligntreetax.
write_labelled_tree
(treetax, label, filepath, schema='newick', norepeats=True, add_gb_id=False)[source]¶
Output tree and alignment with human readable labels. Jumps through a bunch of hoops to make labels unique.
NOT MEMORY EFFICIENT AT ALL
Has different options available for different desired outputs.
Returns: writes out labelled phylogeny and alignment to file
physcraper.aligntreetax.
write_otu_file
(treetax, filepath, schema='table')[source]¶
Writes out OTU dict as json or table. :param treetax: either a TreeTax object or an AlignTreeTax object :param filepath: filename :param schema: either table or json format :return: writes out otu_dict to file
Linker Functions to get data from OpenTree
physcraper.opentree_helpers.
OtuJsonDict
(id_to_spn, id_dict)[source]¶
Makes an OTU json dictionary, which is also produced within the openTreeLife-query.
This function is used, if files that shall be updated are not part of the OpenTreeofLife project. It reads in the file that contains the tip names and the corresponding species names. It then tries to get the unique identifier from the OpenTree project or from NCBI.
Reads input file into the var sp_info_dict, translates using an IdDict object using web to call OpenTree, then NCBI if not found.
Returns: dictionary with key 'otu_tiplabel'; the value is another dict with the keys '^ncbi:taxon', '^ot:ottTaxonName', '^ot:ottId', '^ot:originalLabel', '^user:TaxonName', '^physcraper:status', '^physcraper:last_blasted'
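An illustrative shape of one entry in the returned dictionary; the taxon values here are placeholders, only the key names come from the description above:

```python
# Hypothetical OtuJsonDict entry; values are made up for illustration.
otu_dict = {
    "otu_tiplabel1": {
        "^ncbi:taxon": 12345,
        "^ot:ottTaxonName": "Example taxon",
        "^ot:ottId": 67890,
        "^ot:originalLabel": "tiplabel1",
        "^user:TaxonName": "Example_taxon",
        "^physcraper:status": "original",
        "^physcraper:last_blasted": None,
    }
}
```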
physcraper.opentree_helpers.
bulk_tnrs_load
(filename)[source]¶
Reads in output from OpenTree Bulk TNRS and translates it to a Physcraper OTU dictionary. :param filename: input json file
physcraper.opentree_helpers.
check_if_ottid_in_synth
(ottid)[source]¶
Web call to check if OTT id in synthetic tree. NOT USED.
physcraper.opentree_helpers.
conflict_tree
(inputtree, otu_dict)[source]¶
Write out a tree with labels that work for the OpenTree Conflict API
physcraper.opentree_helpers.
count_match_tree_to_aln
(tree, dataset)[source]¶
Assess how many taxa match between multiple genes in an alignment data set and input tree.
physcraper.opentree_helpers.
debug
(msg)[source]¶
short debugging command
physcraper.opentree_helpers.
deconcatenate_aln
(aln_obj, filename, direc)[source]¶
Split out separate concatenated alignments. NOT TESTED
physcraper.opentree_helpers.
generate_ATT_from_phylesystem
(alnfile, aln_schema, workdir, configfile, study_id, tree_id, search_taxon=None, tip_label='^ot:originalLabel')[source]¶
Gathers together tree, alignment, and study info; forces names to OTT ids.
Study and tree IDs can be obtained by using python ./scripts/find_trees.py LINEAGE_NAME
Spaces vs underscores kept being an issue, so all spaces are coerced to underscores when data are read in.
Parameters: aln – dendropy :class:`DnaCharacterMatrix <dendropy.datamodel.charmatrixmodel.DnaCharacterMatrix>` alignment object. :param workdir: Path to working directory. :param config_obj: Config class containing the settings. :param study_id: OpenTree study id of the phylogeny to update. :param tree_id: OpenTree tree id of the phylogeny to update; some studies have several phylogenies. :param phylesystem_loc: Access the GitHub version of the OpenTree data store, or a local clone. :param search_taxon: optional. OTT id of the MRCA of the clade that shall be updated. :return: Object of class ATT.
physcraper.opentree_helpers.
get_citations_from_json
(synth_response, citations_file)[source]¶
Get citations for studies in an induced synthetic tree response. :param synth_response: Web service call record :param citations_file: Output file
physcraper.opentree_helpers.
get_dataset_from_treebase
(study_id)[source]¶
Given a phylogeny in OpenTree with mapped tip labels, this function gets an alignment from the corresponding study on TreeBASE, if available. By default, it first tries getting the alignment from the supertreebase repository at https://github.com/TreeBASE/supertreebase. If that fails, it tries getting the alignment directly from TreeBASE at https://treebase.org. If both fail, it exits with a message.
physcraper.opentree_helpers.
get_max_match_aln
(tree, dataset, min_match=3)[source]¶
Select an alignment from a DNA dataset
physcraper.opentree_helpers.
get_mrca_ott
(ott_ids)[source]¶
Finds the MRCA of taxa in the ingroup of the original tree. The BLAST search later is limited to descendants of this MRCA according to the NCBI taxonomy.
Only used in the functions that generate the ATT object.
Parameters: ott_ids – List of all OTT ids for tip labels in phylogeny
Returns: OTT id of most recent common ancestor
physcraper.opentree_helpers.
get_nexson
(study_id)[source]¶
Grabs nexson from phylesystem.
physcraper.opentree_helpers.
get_ott_taxon_info
(spp_name)[source]¶
Get OTT id, taxon name, and NCBI id (if present) from the OpenTree Taxonomy. Only works with version 3 of OpenTree APIs
Parameters: spp_name – Species name
Returns: OTT id, taxon name, and NCBI id (if present)
physcraper.opentree_helpers.
get_ottid_from_gbifid
(gbif_id)[source]¶
Returns a dictionary mapping GBIF ids to OTT ids. ott_id is set to 'None' if the GBIF id is not found in the Open Tree Taxonomy
physcraper.opentree_helpers.
get_tree_from_study
(study_id, tree_id, label_format='ot:originallabel')[source]¶
Create a dendropy Tree object from OpenTree data. :param study_id: OpenTree Study Id :param tree_id: OpenTree tree id :param label_format: One of 'id', 'name', 'ot:originallabel', 'ot:ottid', 'ot:otttaxonname'; defaults to 'ot:originallabel'
physcraper.opentree_helpers.
get_tree_from_synth
(ott_ids, label_format='name', citation='cites.txt')[source]¶
Wrapper for OT.synth_induced_tree that also pulls citations
physcraper.opentree_helpers.
ottids_in_synth
(synthfile=None)[source]¶
Checks if OTT ids are present in current synthetic tree, using a file listing all current OTT ids in synth (v12.3) :param synthfile: defaults to taxonomy/ottids_in_synth.txt
physcraper.opentree_helpers.
root_tree_from_synth
(tree, otu_dict, base='ott')[source]¶
Uses information from OpenTree of Life to suggest a root. :param tree: dendropy Tree :param otu_dict: a dictionary of tip label metadata, including an '^ot:ottId' attribute :param base: either 'synth' or 'ott'. If 'synth', will use OpenTree synthetic tree relationships to root the input tree; if 'ott', will use the OpenTree taxonomy.
physcraper.opentree_helpers.
scraper_from_opentree
(study_id, tree_id, alnfile, workdir, aln_schema, configfile=None)[source]¶
Pull tree from OpenTree to create a physcraper object.
physcraper.opentree_helpers.
set_verbose
()[source]¶
Set output verbosity
Physcraper run Configuration object generator
physcraper.configobj.
ConfigObj
(configfile=None, run=True)[source]¶
To build the class the following is needed:
- configfi: a configuration file in a specific format, e.g. to read in self.e_value_thresh.
During the initializing process the following self objects are generated:
- self.e_value_thresh: the defined threshold for the e-value during Blast searches,
check out: https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=FAQ
self.hitlist_size: the maximum number of sequences retrieved by a single blast search
self.minlen: value from 0 to 1. Defines how much shorter new seq can be compared to input
- self.trim_perc: value that determines how many seq need to be present before the beginning
and end of alignment will be trimmed
self.maxlen: max length for values to add to aln
self.get_ncbi_taxonomy: Path to sh file doing something…
self.ott_ncbi: file containing OTT id, ncbi and taxon name (??)
self.email: email address used for blast queries
self.blast_loc: defines which blasting method to use:
- either web-query (=remote)
- from a local blast database (=local)
self.num_threads: number of cores to be used during a run
self.url_base:
- if blastloc == remote: it defines the url for the blast queries.
- if blastloc == local: url_base = None
self.delay: defines when to reblast sequences in days
optional self.objects:
if blastloc == local:
- self.blastdb: this defines the path to the local blast database
- self.ncbi_nodes: path to 'nodes.dmp' file, which contains the hierarchical information
- self.ncbi_names: path to 'names.dmp' file, which contains the different IDs
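A hedged sketch of reading such settings with the stdlib configparser; the section and option names below are illustrative, not the exact format physcraper's ConfigObj expects:

```python
import configparser

# Hypothetical section/option names; the real physcraper config format
# may differ. This only shows the stdlib parsing pattern.
CONFIG_TEXT = """
[blast]
e_value_thresh = 0.001
hitlist_size = 10
location = remote

[physcraper]
min_length = 0.8
trim_perc = 0.75
max_len = 2.5
"""

parser = configparser.ConfigParser()
parser.read_string(CONFIG_TEXT)

e_value_thresh = parser.getfloat("blast", "e_value_thresh")
hitlist_size = parser.getint("blast", "hitlist_size")
blast_loc = parser.get("blast", "location")
trim_perc = parser.getfloat("physcraper", "trim_perc")
```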
check_taxonomy
()[source]¶
Locates a taxonomy directory in the physcraper repo, or if not available (often because the module was pip installed), generates one.
config_str
()[source]¶
Write out the current config values. DOES NOT INCLUDE SOME HIDDEN CONFIGURABLE ATTRIBUTES
read_config
(configfi)[source]¶
Reads the config file and sets configuration params. Any params not listed will be set to default values in set_defaults(). * configfile: path to input file.
set_defaults
()[source]¶
In the absence of an input configuration file, sets default values.
set_local
()[source]¶
Checks that all appropriate files etc are in place for local blast db.
write_file
(direc, filename='run.config')[source]¶
writes config params to file * direc: path to write file * filename: filename to use. Default = run.config
physcraper.configobj.
is_number
(inputstr)[source]¶
Test if string can be coerced to float
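The coercion test can be sketched in a few lines:

```python
def is_number(inputstr):
    """Return True if the string can be coerced to a float."""
    try:
        float(inputstr)
        return True
    except ValueError:
        return False
```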
Link together NCBI and Open Tree identifiers and names, with Gen Bank information for updated sequences
physcraper.ids.
IdDicts
(configfile=None)[source]¶
Class contains different taxonomic identifiers and helps to find the corresponding ids between ncbi and OToL
To build the class the following is needed:
- config_obj: Object of class config (see above)
- workdir: the path to the assigned working directory
During the initializing process the following self objects are generated:
self.workdir: contains path of working directory
self.config: contains the Config class object
self.ott_to_ncbi: dictionary
- key: OToL taxon identifier
- value: ncbi taxon identifier
self.ncbi_to_ott: dictionary
- key: ncbi taxon identifier
- value: OToL taxon identifier
self.ott_to_name: dictionary
- key: OToL taxon identifier
- value: OToL taxon name
self.acc_ncbi_dict: dictionary
- key: Genbank identifier
- value: ncbi taxon identifier
self.spn_to_ncbiid: dictionary
- key: OToL taxon name
- value: ncbi taxon identifier
self.ncbiid_to_spn: dictionary
- key: ncbi taxon identifier
- value: ncbi taxon name
user-defined list of mrca OTT ids #TODO this is flipped from the data obj .ott_mrca. On purpose?
#removed mrcas from ids, and put them into the scrape object
- Optional:
- depending on blasting method:
- self.ncbi_parser: for local blast,
- initializes the ncbi_parser class, that contains information about rank and identifiers
entrez_efetch
(gb_id)[source]¶
It adds information to various id_dicts.
Parameters: gb_id – GenBank identifier
Returns: read_handle
get_ncbiid_from_acc
(acc)[source]¶
Checks local dicts, and then runs efetch to get the ncbi id for an accession
get_tax_seq_acc
(acc)[source]¶
Pulls the taxon ID and the full sequences from NCBI
Uses ncbi databases to easily retrieve taxonomic information.
Parts are altered from https://github.com/zyxue/ncbitax2lin/blob/master/ncbitax2lin.py
physcraper.ncbi_data_parser.
Parser
(names_file, nodes_file)[source]¶
Reads in databases from ncbi to connect species names with the taxonomic identifier and the corresponding hierarchical information. It provides a much faster way to get that information than using web queries. We use those files to become independent of web requests (the implementation in BioPython was not really reliable). Nodes includes the hierarchical information; names holds the scientific names and IDs. The files need to be updated regularly, ideally whenever a new blast database is loaded.
get_downtorank_id
(tax_id, downtorank='species')[source]¶
Recursive function to find the parent id of a taxon as defined by downtorank.
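The recursive parent walk can be sketched over plain dicts standing in for the nodes table (parents[id] -> parent id, ranks[id] -> rank name; the real method reads these from nodes.dmp):

```python
def get_downtorank_id(tax_id, parents, ranks, downtorank="species"):
    """Climb the taxonomy from tax_id toward the root until a node of
    rank `downtorank` is reached; return None if the root is hit first."""
    if ranks.get(tax_id) == downtorank:
        return tax_id
    parent = parents.get(tax_id)
    if parent is None or parent == tax_id:  # reached the root
        return None
    return get_downtorank_id(parent, parents, ranks, downtorank)
```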
get_id_from_name
(tax_name)[source]¶
Find the ID for a given taxonomic name.
get_id_from_synonym
(tax_name)[source]¶
Find the ID for a given taxonomic name, which is not an accepted name.
get_name_from_id
(tax_id)[source]¶
Find the scientific name for a given ID.
get_rank
(tax_id)[source]¶
Get rank for given ncbi tax id.
match_id_to_mrca
(tax_id, mrca_id)[source]¶
Recursive function to find out if tax_id is part of mrca_id.
physcraper.ncbi_data_parser.
get_acc_from_blast
(query_string)[source]¶
Get the accession number from a blast query. :param query_string: string that contains acc and gi from local blast query result :return: gb_acc
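A sketch of the accession parsing, assuming the classic `gi|<gi>|gb|<accession>|` subject-id layout; the exact query format is an assumption, not guaranteed by the source:

```python
import re

def get_acc_from_blast(query_string):
    """Pull a GenBank-style accession (e.g. AF123456.1) out of a blast
    subject id string. Assumes the 'gi|...|gb|ACC.V|' layout; returns
    None when no accession is found."""
    match = re.search(r"\|(?:gb|emb|dbj|ref)\|([A-Z]+_?\d+\.\d+)\|", query_string)
    if match:
        return match.group(1)
    return None
```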
physcraper.ncbi_data_parser.
get_gi_from_blast
(query_string)[source]¶
Get the gi number from a blast query. Getting the acc is more difficult now, as new seqs do not always have a gi number; then the query changes.
If not available, returns None.
Parameters: query_string – string that contains acc and gi from local blast query result
Returns: gb_id if available
physcraper.ncbi_data_parser.
get_ncbi_tax_id
(handle)[source]¶
Get the taxon ID from ncbi. Only used for web queries.
Parameters: handle – NCBI read.handle
Returns: ncbi_id
physcraper.ncbi_data_parser.
get_ncbi_tax_name
(handle)[source]¶
Get the sp name from ncbi. Could be replaced by a direct lookup to ott_ncbi.
Parameters: handle – NCBI read.handle
Returns: ncbi_spn
physcraper.ncbi_data_parser.
get_tax_info_from_acc
(gb_id, ids_obj)[source]¶
Takes an accession number and returns the ncbi_id and the taxon name
physcraper.ncbi_data_parser.
load_names
(names_file)[source]¶
Loads names.dmp and converts it into a pandas.DataFrame. Includes only names which are accepted as scientific name by ncbi.
physcraper.ncbi_data_parser.
load_nodes
(nodes_file)[source]¶
Loads nodes.dmp and converts it into a pandas.DataFrame. Contains the information about the taxonomic hierarchy of names.
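A stdlib sketch of parsing the nodes.dmp layout (fields separated by tab-pipe-tab); physcraper itself builds a pandas.DataFrame, but the leading columns are the same: tax_id, parent_tax_id, rank:

```python
import io

# Two sample lines in nodes.dmp layout: tax_id | parent | rank | ...
SAMPLE_NODES = (
    "1\t|\t1\t|\tno rank\t|\n"
    "9606\t|\t9605\t|\tspecies\t|\n"
)

def load_nodes(handle):
    """Parse tax_id -> parent id and tax_id -> rank from a nodes.dmp-style
    stream (stdlib stand-in for the pandas version described above)."""
    parents, ranks = {}, {}
    for line in handle:
        fields = [field.strip() for field in line.split("\t|")]
        tax_id, parent_id, rank = int(fields[0]), int(fields[1]), fields[2]
        parents[tax_id] = parent_id
        ranks[tax_id] = rank
    return parents, ranks

parents, ranks = load_nodes(io.StringIO(SAMPLE_NODES))
```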
physcraper.ncbi_data_parser.
load_synonyms
(names_file)[source]¶
Loads names.dmp and converts it into a pandas.DataFrame. Includes only names which are viewed as synonym by ncbi.
physcraper.ncbi_data_parser.
strip
(inputstr)[source]¶
Strips blank characters from strings in the pd dataframe.
Work in progress to pull apart the linked tree and taxon objects from the alignment-based ATT object
physcraper.treetaxon.
TreeTax
(otu_json, treefrom, schema='newick')[source]¶
Wrap up the key parts together; requires OTT_id, and names must already match.
write_labelled
(label, path, norepeats=True, add_gb_id=False)[source]¶
Output tree and alignment with human-readable labels. Jumps through a bunch of hoops to make labels unique.
NOT MEMORY EFFICIENT AT ALL
Has different options available for different desired outputs
Returns: writes out labelled phylogeny and alignment to file
physcraper.treetaxon.
generate_TreeTax_from_run
(workdir, start_files='output', tag=None)[source]¶
Build a Tree + Taxon object from the outputs of a run. :return: object of class TreeTax