CBuildDatabase
Build BlastDB format databases from various data sources.
#include <objtools/blast/seqdb_writer/build_db.hpp>
This class provides an API for building BlastDB format databases. The WriteDB library is used internally to produce the actual database; the functionality provided by this class helps to bridge the gap between the WriteDB API and the needs of a command line database construction tool.
Definition at line 136 of file build_db.hpp.
◆ CBuildDatabase() [1/2] CBuildDatabase::CBuildDatabase ( const string & dbname, const string & title, bool is_protein, CWriteDB::TIndexType indexing, bool use_gi_mask, ostream * logfile, bool long_seqids = false, EBlastDbVersion dbver = eBDB_Version4, bool limit_defline = false, Uint8 oid_masks = EOidMaskType::fNone, bool scan_bioseq_4_cfastareader_usrobj = true )
Constructor.
Create a database with the specified name, type, and other characteristics. The database will use the specified dbname as the base name for database volumes. Note that the indexing argument will be combined with either eSparseIndex or eDefault, depending on the "sparse" flag.
Definition at line 1073 of file build_db.cpp.
References CTime::AsString(), CDirEntry::CreateAbsolutePath(), CreateDirectories(), dbname(), DeleteBlastDb(), CTime::eCurrent, CWriteDB::eNucleotide, CWriteDB::eProtein, m_LogFile, m_LongIDs, m_OutputDb, m_OutputDbName, m_ParseIDs, ParseMoleculeTypeString(), CRef< C, Locker >::Reset(), and CWriteDB::SetMaxFileSize().
◆ CBuildDatabase() [2/2] CBuildDatabase::CBuildDatabase ( const string & dbname, const string & title, bool is_protein, bool sparse, bool parse_seqids, bool use_gi_mask, ostream * logfil, bool long_seqids = false, EBlastDbVersion dbver = eBDB_Version4, bool limit_defline = false, Uint8 oid_masks = EOidMaskType::fNone, bool scan_bioseq_4_cfastareader_usrobj = true )
Constructor.
Create a database with the specified name, type, and other characteristics. The database will use the specified dbname as the base name for database volumes. Note that the indexing argument will be combined with either eSparseIndex or eDefault, depending on the "sparse" flag.
Definition at line 1136 of file build_db.cpp.
References CTime::AsString(), CDirEntry::CreateAbsolutePath(), CreateDirectories(), dbname(), DeleteBlastDb(), CTime::eCurrent, CWriteDB::eDefault, CWriteDB::eNucleotide, CWriteDB::eProtein, CWriteDB::eSparseIndex, m_LogFile, m_OutputDb, m_OutputDbName, m_ParseIDs, ParseMoleculeTypeString(), CRef< C, Locker >::Reset(), and CWriteDB::SetMaxFileSize().
◆ ~CBuildDatabase() CBuildDatabase::~CBuildDatabase ( )
◆ AddFasta()
◆ AddIds()
Add the specified sequences from the source database.
The strings in the list are interpreted as GIs if they are composed only of numeric digits, or as Seq-ids otherwise. The sequence IDs will be resolved, and a sequence corresponding to each ID will be added to the output database. If remote resolution is enabled, it will be used to find up-to-date versions for any ambiguously versioned IDs (i.e. unversioned IDs of versioned Seq-id types). Then local fetching will be used to process IDs using the source database if one was specified. If any sequences have not been found, and remote services are enabled, remote fetching will be used for IDs not resolved locally. If any IDs are not found at all, they will be reported as part of the logging output.
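The digits-only rule described above can be sketched in standalone C++. This is an illustration only: `IsNumericId`, `EIdKind`, and `ClassifyId` are hypothetical names that do not appear in build_db.cpp, and treating an empty string as a Seq-id is a choice made for this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical helper: true when the string is non-empty and consists
// only of decimal digits.
bool IsNumericId(const std::string& id)
{
    return !id.empty() &&
           std::all_of(id.begin(), id.end(),
                       [](unsigned char c) { return std::isdigit(c) != 0; });
}

// Hypothetical classification of an input ID string.
enum class EIdKind { eGi, eSeqId };

// Digits-only strings are treated as GIs; everything else as a Seq-id.
EIdKind ClassifyId(const std::string& id)
{
    return IsNumericId(id) ? EIdKind::eGi : EIdKind::eSeqId;
}
```

Under these assumptions, `ClassifyId("129295")` gives `EIdKind::eGi`, while `ClassifyId("NP_001234.1")` gives `EIdKind::eSeqId`.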
Definition at line 1321 of file build_db.cpp.
References _ASSERT, map_checker< Container >::end(), map_checker< Container >::find(), CSeqDB::GetDBNameList(), CSeqDBGiList::GetGiOid(), CSeqDBGiList::GetNumGis(), CSeqDBGiList::GetNumSis(), CSeqDB::GetSequenceType(), CSeqDBGiList::SGiOid::gi, i, m_LogFile, m_SourceDb, m_UseRemote, m_Verbose, CRef< C, Locker >::NotEmpty(), CSeqDBGiList::SGiOid::oid, x_AddRemoteSequences(), x_DupLocal(), x_ReportUnresolvedIds(), and x_ResolveGis().
Referenced by BOOST_AUTO_TEST_CASE(), and Build().
◆ AddSequences() [1/2]
Add sequences from an IBioseqSource object.
The provided `src' object is queried using GetNext() to get a Bioseq object. The Bioseq is added to the output database (with appropriate modifications of taxid, membership bits, and linkout bits, as configured here). This process repeats until the GetNext() method returns NULL.
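The pull loop described above, which calls GetNext() until it returns NULL, might look like the following sketch. It does not use the toolkit: `SRecord`, `IRecordSource`, `CVectorSource`, and `AddAll` are hypothetical stand-ins, and the real method also edits taxids, membership bits, and linkout bits, which this sketch omits.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a Bioseq; only what the sketch needs.
struct SRecord {
    std::string id;
};

// Hypothetical stand-in for the source interface: GetNext() yields the
// next record, or a null pointer once the source is exhausted.
struct IRecordSource {
    virtual ~IRecordSource() = default;
    virtual std::shared_ptr<const SRecord> GetNext() = 0;
};

// Simple in-memory source used to exercise the loop.
class CVectorSource : public IRecordSource {
public:
    explicit CVectorSource(std::vector<std::string> ids)
        : m_Ids(std::move(ids)) {}

    std::shared_ptr<const SRecord> GetNext() override
    {
        if (m_Pos >= m_Ids.size()) {
            return nullptr;
        }
        return std::make_shared<SRecord>(SRecord{m_Ids[m_Pos++]});
    }

private:
    std::vector<std::string> m_Ids;
    std::size_t m_Pos = 0;
};

// Drain the source, appending each record's ID to `out`.
// Returns the number of records consumed.
std::size_t AddAll(IRecordSource& src, std::vector<std::string>& out)
{
    std::size_t count = 0;
    while (auto rec = src.GetNext()) {
        out.push_back(rec->id);
        ++count;
    }
    return count;
}
```

The loop owns no knowledge of where records come from, which is why the same shape works for FASTA readers, database copies, and test fixtures alike.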
Definition at line 794 of file build_db.cpp.
References CBioseq_Base::CanGetId(), count, debug_mode, CSeq_id_Base::e_Local, CStopWatch::Elapsed(), CStopWatch::eStart, CSeq_id::fAcc_nuc, CSeq_id::fAcc_prot, CBioseq_Base::GetId(), CBioseq::GetLength(), IBioseqSource::GetNext(), CConstRef< C, Locker >::GetNonNullPointer(), GI_CONST, info, CBioseq::IsAa(), label, m_IsProtein, m_LogFile, m_LongIDs, m_SkipLargeGis, m_Verbose, NCBI_THROW, CConstRef< C, Locker >::NotEmpty(), NULL, CBioseq_Base::SetId(), sw, t, and x_EditAndAddBioseq().
Referenced by AddFasta(), BOOST_AUTO_TEST_CASE(), s_TestReadPDBAsn1(), CMakeBlastDBApp::x_AddSeqEntries(), CMakeClusterDBApp::x_BuildDatabase(), BlastdbCopyApplication::x_CopyDB(), BlastdbCopyApplication::x_MakeDBwIDList(), and CMakeBlastDBApp::x_ProcessInputData().
◆ AddSequences() [2/2]
Add sequences from an IRawSequenceSource object.
The provided `src' object is queried using GetNext() to get various "raw format" sequence data and metadata components. These pieces of data are added to the output database (with appropriate modifications of taxid, membership bits, and linkout bits, as configured here). This process repeats until the GetNext() method returns false.
Definition at line 904 of file build_db.cpp.
References _ASSERT, CWriteDB::AddColumnMetaData(), CWriteDB::AddSequence(), CBlastDbBlob::Clear(), count, CWriteDB::CreateUserColumn(), CTempString::data(), done, CStopWatch::Elapsed(), CMaskedRangesVector::empty(), CTempString::empty(), CRef< C, Locker >::Empty(), map_checker< Container >::end(), CStopWatch::eStart, map_checker< Container >::find(), CWriteDB::FindColumn(), CBlast_def_line_set_Base::Get(), IRawSequenceSource::GetColumnId(), IRawSequenceSource::GetColumnMetaData(), IRawSequenceSource::GetColumnNames(), IRawSequenceSource::GetNext(), IMaskDataSource::GetRanges(), i, int, ITERATE, m_FoundMatchingMasks, m_IsProtein, m_LogFile, m_MaskData, m_OutputDb, NCBI_THROW, CWriteDB::SetBlobData(), CWriteDB::SetDeflines(), CWriteDB::SetMaskData(), ncbi::grid::netcache::search::fields::size, CTempString::size(), sw, t, CBlastDbBlob::WriteRaw(), x_AddPig(), and x_EditHeaders().
◆ Build()
Build the database.
This method builds a database from the given list of Sequence IDs and the provided file, which should contain FASTA format data. It is equivalent to calling StartBuild(), AddIds(), AddFasta(), and EndBuild() in that order (except that a little additional logging is done with summary information).
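The stated call order can be illustrated with a recording stub. This page does not show the argument or return types of these methods, so the signatures below are assumptions; `SRecordingBuilder` and `BuildInOrder` are hypothetical and are not part of CBuildDatabase.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stub that records the order of calls. The method names
// mirror the ones listed above; the signatures are assumed.
struct SRecordingBuilder {
    std::vector<std::string> calls;

    void StartBuild() { calls.push_back("StartBuild"); }
    bool AddIds(const std::vector<std::string>&)
    {
        calls.push_back("AddIds");
        return true;
    }
    bool AddFasta(std::istream&)
    {
        calls.push_back("AddFasta");
        return true;
    }
    bool EndBuild()
    {
        calls.push_back("EndBuild");
        return true;
    }
};

// Run the four steps in the documented order. Every step is attempted
// even if an earlier one reported failure (a choice of this sketch).
template <class TBuilder>
bool BuildInOrder(TBuilder& db,
                  const std::vector<std::string>& ids,
                  std::istream& fasta)
{
    db.StartBuild();
    bool ok = db.AddIds(ids);
    ok = db.AddFasta(fasta) && ok;
    ok = db.EndBuild() && ok;
    return ok;
}
```

Calling the steps individually instead of Build() is useful when sequences also need to be added between AddIds() and EndBuild(), for example through AddSequences().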
Definition at line 1289 of file build_db.cpp.
References AddFasta(), AddIds(), CStopWatch::Elapsed(), EndBuild(), CStopWatch::eStart, m_DeflineCount, m_LogFile, m_OIDCount, StartBuild(), sw, and t.
Referenced by BOOST_AUTO_TEST_CASE().
◆ CreateDirectories() void CBuildDatabase::CreateDirectories ( const string & dbname ) static
Create Directory for blast db.
Definition at line 1051 of file build_db.cpp.
References CDirEntry::CheckAccess(), CDir::CreatePath(), dbname(), CDirEntry::eIfEmptyPath_Empty, CDir::Exists(), CDirEntry::fWrite, CDirEntry::GetDir(), CDirEntry::GetName(), msg(), and NCBI_THROW.
Referenced by CBuildDatabase(), CBlastdbConvertApp::Run(), and CMakeProfileDBApp::x_Run().
◆ EndBuild()
Finish building a new database.
This method closes the newly constructed database, flushing any unflushed volumes, creating an alias file to tie the volumes together, and so on.
Definition at line 1423 of file build_db.cpp.
References CWriteDB::Close(), eUnknown, m_OutputDb, NCBI_EXCEPTION_VAR, NULL, CException::what(), and x_EndBuild().
Referenced by AddFasta(), BOOST_AUTO_TEST_CASE(), Build(), s_TestReadPDBAsn1(), CMakeBlastDBApp::x_BuildDatabase(), CMakeClusterDBApp::x_BuildDatabase(), BlastdbCopyApplication::x_CopyDB(), and BlastdbCopyApplication::x_MakeDBwIDList().
◆ GetOutputDbName() string CBuildDatabase::GetOutputDbName ( ) const inline
◆ RegisterMaskingAlgorithm() [1/2]
Define a masking algorithm.
The returned integer ID will be defined as corresponding to the provided program enumeration (e.g. DUST, SEG, etc.) and options string, for subject masking. Each program enumeration (such as DUST) may be used several times with different options strings; however, the combination of program and options should be unique for each algorithm ID. The options string is a free-form string (at least from this class's point of view).
Definition at line 1597 of file build_db.cpp.
References m_OutputDb, and CWriteDB::RegisterMaskAlgorithm().
◆ RegisterMaskingAlgorithm() [2/2]
Define a masking algorithm.
The returned integer ID will be defined as corresponding to the provided program enumeration (e.g. DUST, SEG, etc.) and options string, for subject masking. Each program enumeration (such as DUST) may be used several times with different options strings; however, the combination of program and options should be unique for each algorithm ID. The options string is a free-form string (at least from this class's point of view).
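The uniqueness rule for (program, options) pairs can be sketched with a small registry. This is not the WriteDB implementation: `EMaskProgram` and `CMaskAlgoRegistry` are hypothetical, IDs starting at 0 is an assumption, and throwing on a duplicate is a choice made here because this page does not say how the library reacts.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical program enumeration; the real one is defined by the toolkit.
enum class EMaskProgram { eDust, eSeg };

class CMaskAlgoRegistry {
public:
    // Returns a fresh ID for a (program, options) pair.
    // A pair that was already registered is rejected.
    int Register(EMaskProgram program, const std::string& options)
    {
        const auto key = std::make_pair(program, options);
        if (m_Ids.count(key) != 0) {
            throw std::invalid_argument("duplicate masking algorithm");
        }
        const int id = m_NextId++;
        m_Ids.emplace(key, id);
        return id;
    }

private:
    std::map<std::pair<EMaskProgram, std::string>, int> m_Ids;
    int m_NextId = 0;  // assumption: IDs are handed out from 0 upward
};
```

Registering DUST twice with different options strings yields two distinct IDs, while repeating an identical pair is refused.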
Definition at line 1584 of file build_db.cpp.
References m_OutputDb, and CWriteDB::RegisterMaskAlgorithm().
Referenced by CClusterDBSource::CClusterDBSource(), CRawSeqDBSource::CRawSeqDBSource(), and CMakeBlastDBApp::x_ProcessMaskData().
◆ SetLeafTaxIds()
◆ SetLinkouts()
◆ SetMaskDataSource()
Specify an object mapping Seq-id to subject masking data.
Masking data is provided to CBuildDatabase by implementing an interface that can produce masking data given the Seq-ids for the sequence that is to be masked. This object could wrap a simple lookup table, an algorithm that produces the data on the fly, or a wrapper around an existing database that fetches the masking data from that database.
Definition at line 1609 of file build_db.cpp.
References m_MaskData, and CRef< C, Locker >::Reset().
Referenced by CMakeBlastDBApp::x_ProcessMaskData().
◆ SetMaskLetters() void CBuildDatabase::SetMaskLetters ( const string & mask_letters )
Specify letters to mask out of protein sequence data.
Protein sequences sometimes contain rare (or recently defined) letters that cause trouble for some algorithms. This method specifies a list of protein letters that might be found in the input sequences, but which should be replaced by "X" before adding those sequences to the database.
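As a rough illustration of the replacement described above, the following standalone function rewrites every listed letter to "X". `MaskLetters` is a hypothetical helper; the real work is delegated to CWriteDB::SetMaskedLetters(), whose internals are not shown on this page.

```cpp
#include <cassert>
#include <string>

// Return a copy of `seq` in which every character that appears in
// `mask_letters` has been replaced by 'X'.
std::string MaskLetters(std::string seq, const std::string& mask_letters)
{
    for (char& c : seq) {
        if (mask_letters.find(c) != std::string::npos) {
            c = 'X';
        }
    }
    return seq;
}
```

For instance, masking the letters "UO" in "MKUOLA" gives "MKXXLA", and an empty mask list leaves the sequence unchanged.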
Definition at line 1221 of file build_db.cpp.
References m_OutputDb, and CWriteDB::SetMaskedLetters().
◆ SetMaxFileSize() void CBuildDatabase::SetMaxFileSize ( Uint8 max_file_size )
◆ SetMembBits()
◆ SetSkipCopyingGis() void CBuildDatabase::SetSkipCopyingGis ( bool v ) inline
◆ SetSourceDb() [1/2] void CBuildDatabase::SetSourceDb ( const string & src_db_name )
◆ SetSourceDb() [2/2]
◆ SetTaxids() void CBuildDatabase::SetTaxids ( CTaxIdSet & taxids )
◆ SetUseRemote() void CBuildDatabase::SetUseRemote ( bool use_remote ) inline
Specify whether to use remote fetching for locally absent IDs.
If identifiers in the list provided to Build or to AddIds are not found in the source database (if any), remote sequence fetching APIs can be used to fetch those sequences. Normally this happens in two cases. First, sequences in the list of IDs are sometimes too new to be found in the source database. Second, sequences may be found in the source database, but newer versions might be available in the remote database.
If the use_remote flag is set to true, this class finds the latest version number for unversioned IDs (but only of types that can have versions in the first place), and will attempt to remotely fetch any sequences for which the source database does not have the latest version. If the flag is specified as false, no remote lookups will be done, and sequences found in ids but not found in the source database will not be added to the output database.
Note: This does not affect the AddSequences, AddRawSequences, or AddFasta methods; in those cases, all provided sequences are added in the form they are provided in.
The default value for this flag is "true".
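The effect of the flag on a single ID can be summarized as a small decision function. This is a reading of the description above rather than code from the class: `EFetchPlan` and `PlanFetch` are hypothetical, and the sketch ignores what happens when a remote fetch itself fails.

```cpp
#include <cassert>

// Hypothetical outcome for one requested ID.
enum class EFetchPlan { eLocal, eRemote, eSkip };

// found_locally:   the ID resolved in the source database.
// local_is_latest: the source database holds the newest version.
// use_remote:      the value passed to SetUseRemote().
EFetchPlan PlanFetch(bool found_locally, bool local_is_latest, bool use_remote)
{
    if (use_remote) {
        // Remote services fill in anything missing or out of date.
        return (found_locally && local_is_latest) ? EFetchPlan::eLocal
                                                  : EFetchPlan::eRemote;
    }
    // Without remote access, IDs absent from the source are not added.
    return found_locally ? EFetchPlan::eLocal : EFetchPlan::eSkip;
}
```

With `use_remote` set to false, an older local version is still copied, since no remote lookup is made to discover that a newer one exists.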
Definition at line 385 of file build_db.hpp.
References m_UseRemote.
Referenced by BOOST_AUTO_TEST_CASE(), BlastdbCopyApplication::x_CopyDB(), and BlastdbCopyApplication::x_MakeDBwIDList().
◆ SetVerbosity() void CBuildDatabase::SetVerbosity ( bool v ) inline
◆ StartBuild() void CBuildDatabase::StartBuild ( )
◆ x_AddMasksForSeqId()
◆ x_AddOneRemoteSequence() void CBuildDatabase::x_AddOneRemoteSequence ( const objects::CSeq_id & seqid, bool & found, bool & error ) private
◆ x_AddPig() void CBuildDatabase::x_AddPig ( CRef< objects::CBlast_def_line_set > headers ) private
◆ x_AddRemoteSequences()
Add sequences from remote services for IDs not found locally.
This method iterates over the list of IDs; any IDs that were not found in the source database are added by fetching the sequence from remote services. (Whether an ID was found locally can be determined by whether the OID found in the GI list is valid.)
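The OID-validity test mentioned above can be sketched as a filter over the ID list. `SIdEntry` and `IdsNeedingRemoteFetch` are hypothetical, and using -1 as the "invalid OID" marker is an assumption of this sketch, not something stated on this page.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Assumption: -1 marks an entry with no valid OID, i.e. an ID that was
// not found in the source database.
const int kInvalidOid = -1;

// Hypothetical stand-in for one entry of the resolved ID list.
struct SIdEntry {
    std::string id;
    int oid;
};

// Collect the IDs that still need to be fetched from remote services.
std::vector<std::string>
IdsNeedingRemoteFetch(const std::vector<SIdEntry>& entries)
{
    std::vector<std::string> pending;
    for (const SIdEntry& e : entries) {
        if (e.oid == kInvalidOid) {
            pending.push_back(e.id);
        }
    }
    return pending;
}
```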
Definition at line 555 of file build_db.cpp.
References count, CStopWatch::Elapsed(), CStopWatch::eStart, CSeqDBGiList::GetGiOid(), CSeqDBGiList::GetKey(), CSeqDBGiList::GetNumGis(), CSeqDBGiList::GetNumSis(), CSeqDBGiList::GetSiOid(), i, m_LogFile, m_Verbose, CSeqDBGiList::SGiOid::oid, CSeqDBGiList::SSiOid::oid, sw, t, and x_AddOneRemoteSequence().
Referenced by AddIds().
◆ x_DupLocal() void CBuildDatabase::x_DupLocal ( ) private
Duplicate IDs from local databases.
This method iterates over the list of IDs, copying sequences found in the source databases to the output database.
Definition at line 235 of file build_db.cpp.
References CWriteDB::AddSequence(), ambig(), buffer, CSeqDB::CheckOrFindOID(), count, CStopWatch::Elapsed(), CStopWatch::eStart, CTaxIdSet::FixTaxId(), CBlast_def_line_set_Base::Get(), CSeqDB::GetHdr(), CSeqDBExpert::GetRawSeqAndAmbig(), m_DeflineCount, m_LogFile, m_OIDCount, m_OutputDb, m_SourceDb, m_Taxids, CWriteDB::SetDeflines(), sw, t, and x_SetLinkAndMbit().
Referenced by AddIds().
◆ x_EditAndAddBioseq() bool CBuildDatabase::x_EditAndAddBioseq ( CConstRef< objects::CBioseq > bs, objects::CSeqVector * sv, bool add_pig = false ) private
Modify a Bioseq as needed and add it to the database.
The provided Bioseq is added to the database. Modifications are made to the data as needed (but the input object is not affected). In particular, the taxid is set (0 is used if no taxid is known), and linkout and membership bits are set.
Definition at line 469 of file build_db.cpp.
References CWriteDB::AddSequence(), CWriteDB::ExtractBioseqDeflines(), CBlast_def_line_set_Base::Get(), m_DeflineCount, m_LongIDs, m_OIDCount, m_OutputDb, m_ParseIDs, m_ScanBioseq4CFastaReaderUsrObjct, s_FixBioseqDeltas(), CWriteDB::SetDeflines(), x_AddMasksForSeqId(), x_AddPig(), and x_EditHeaders().
Referenced by AddSequences(), and x_AddOneRemoteSequence().
◆ x_EditHeaders() void CBuildDatabase::x_EditHeaders ( CRef< objects::CBlast_def_line_set > headers ) private
◆ x_EndBuild()
◆ x_GetScope() CScope & CBuildDatabase::x_GetScope ( ) private
◆ x_ReportUnresolvedIds()
◆ x_ResolveFromSource()
Determine if this string ID can be found in the source database.
The provided string will be looked up as an accession in the source database. If a corresponding sequence is found, it will be returned in the `id' field. The resolution is only considered a match if the provided string is a substring of the FASTA representation of the provided Seq-id, and if that substring seems to represent whole components (so that it's surrounded by delimiters such as `|' and `.' rather than by alphanumeric characters, which may be part of another ID).
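The whole-component check can be sketched as a bounded substring search. This is an approximation of the rule as described, not the code at line 185 of build_db.cpp: `IsIdChar` and `MatchesWholeComponent` are hypothetical, and treating the underscore as part of an identifier (as in RefSeq-style accessions) is an assumption of this sketch.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Assumption: letters, digits, and '_' can all belong to an identifier,
// so none of them counts as a delimiter.
bool IsIdChar(char c)
{
    return std::isalnum(static_cast<unsigned char>(c)) != 0 || c == '_';
}

// True when `query` occurs in `fasta_id` with a delimiter (or the end
// of the string) on both sides.
bool MatchesWholeComponent(const std::string& fasta_id,
                           const std::string& query)
{
    if (query.empty()) {
        return false;
    }
    for (std::size_t pos = fasta_id.find(query);
         pos != std::string::npos;
         pos = fasta_id.find(query, pos + 1)) {
        const std::size_t end = pos + query.size();
        const bool left_ok = (pos == 0) || !IsIdChar(fasta_id[pos - 1]);
        const bool right_ok =
            (end == fasta_id.size()) || !IsIdChar(fasta_id[end]);
        if (left_ok && right_ok) {
            return true;
        }
    }
    return false;
}
```

Given the FASTA ID "gi|129295|sp|P01013.1|OVAX_CHICK", the query "P01013" matches because it sits between `|' and `.', whereas "P0101" does not because the next character is a digit.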
Definition at line 185 of file build_db.cpp.
References CSeqDB::AccessionToOids(), CSeq_id::AsFastaString(), done, CRef< C, Locker >::Empty(), CSeqDB::GetSeqIDs(), ITERATE, and m_SourceDb.
Referenced by x_ResolveGis().
◆ x_ResolveGis()
Resolve various input IDs (as strings) to GIs.
The input IDs are examined, the type of each is determined as a GI or some other kind of Seq-id, and each ID is resolved to a GI where possible. The list of GIs and other Seq-ids found is returned in a GI list.
Definition at line 116 of file build_db.cpp.
References CInputGiList::AppendGi(), CInputGiList::AppendSi(), CheckAccession(), debug_mode, ITERATE, m_LogFile, m_SourceDb, m_UseRemote, CRef< C, Locker >::NotEmpty(), x_ResolveFromSource(), x_ResolveRemoteId(), and ZERO_GI.
Referenced by AddIds().
◆ x_ResolveRemoteId() void CBuildDatabase::x_ResolveRemoteId ( CRef< objects::CSeq_id > & seqid, TGi & gi ) private
Resolve an ID remotely.
This method looks up the given ID via remote services in order to find an ID for the most up-to-date version of the sequence. The remote service will return a list of Seq-ids; if at least one of these is a GI, that will be returned in `gi'. If no GI is found, but at least one of the returned IDs is of the same type as the input Seq-id, the version number of the input Seq-id will be updated.
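The selection rule described above (prefer a GI, otherwise take the version from an ID of the same type) might be sketched as follows. `EIdType`, `SId`, and `ResolveRemote` are simplified hypothetical stand-ins for the toolkit's Seq-id classes, and a `gi` of 0 stands for "no GI found", mirroring the ZERO_GI reference listed below.

```cpp
#include <cassert>
#include <vector>

// Hypothetical, simplified ID model.
enum class EIdType { eGi, eGenbank, eOther };

struct SId {
    EIdType type;
    long long gi;   // meaningful only when type == eGi
    int version;    // meaningful only for versioned text IDs
};

// `returned` plays the role of the ID list sent back by the remote
// service. On exit, `gi` holds a GI if one was present; otherwise the
// version of `input` is updated from the first ID of the same type.
void ResolveRemote(const std::vector<SId>& returned, SId& input, long long& gi)
{
    gi = 0;
    for (const SId& id : returned) {
        if (id.type == EIdType::eGi) {
            gi = id.gi;
            return;
        }
    }
    for (const SId& id : returned) {
        if (id.type == input.type) {
            input.version = id.version;
            return;
        }
    }
}
```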
Definition at line 65 of file build_db.cpp.
References debug_mode, CSeq_id::GetTextseq_Id(), CSeq_id_Base::IsGi(), CTextseq_id_Base::IsSetVersion(), ITERATE, m_LogFile, NULL, CRef< C, Locker >::Reset(), CSeq_id_Base::Which(), x_GetScope(), and ZERO_GI.
Referenced by x_ResolveGis().
◆ x_SetLeafTaxids() void CBuildDatabase::x_SetLeafTaxids ( CRef< objects::CBlast_def_line_set > headers ) private
Store leaf taxids in provided headers.
True to keep linkout bits from source dbs, false to discard.
DEPRECATED
Definition at line 601 of file build_db.hpp.
Referenced by SetLinkouts().
◆ m_KeepMbits bool CBuildDatabase::m_KeepMbits private
◆ m_LogFile ostream& CBuildDatabase::m_LogFile private
Logfile.
Definition at line 638 of file build_db.hpp.
Referenced by AddIds(), AddSequences(), Build(), CBuildDatabase(), SetLeafTaxIds(), SetLinkouts(), SetMembBits(), SetSourceDb(), x_AddOneRemoteSequence(), x_AddRemoteSequences(), x_DupLocal(), x_EndBuild(), x_ReportUnresolvedIds(), x_ResolveGis(), and x_ResolveRemoteId().
◆ m_LongIDs bool CBuildDatabase::m_LongIDs private
◆ m_MaskData
◆ m_ObjMgr CRef<objects::CObjectManager> CBuildDatabase::m_ObjMgr private
◆ m_OIDCount int CBuildDatabase::m_OIDCount private
◆ m_OutputDb
Database being produced here.
Definition at line 629 of file build_db.hpp.
Referenced by AddSequences(), CBuildDatabase(), EndBuild(), RegisterMaskingAlgorithm(), SetMaskLetters(), SetMaxFileSize(), x_AddMasksForSeqId(), x_AddPig(), x_DupLocal(), x_EditAndAddBioseq(), and x_EndBuild().
◆ m_OutputDbName string CBuildDatabase::m_OutputDbName private
◆ m_ParseIDs bool CBuildDatabase::m_ParseIDs private
◆ m_ScanBioseq4CFastaReaderUsrObjct bool CBuildDatabase::m_ScanBioseq4CFastaReaderUsrObjct private
◆ m_Scope CRef<objects::CScope> CBuildDatabase::m_Scope private
◆ m_SkipCopyingGis bool CBuildDatabase::m_SkipCopyingGis private
◆ m_SkipLargeGis bool CBuildDatabase::m_SkipLargeGis private
◆ m_SourceDb
◆ m_Taxids
◆ m_UseRemote bool CBuildDatabase::m_UseRemote private
◆ m_Verbose bool CBuildDatabase::m_Verbose private
The documentation for this class was generated from the following files: