Showing content from http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/doxyhtml/classCBuildDatabase.html below:

NCBI C++ ToolKit: CBuildDatabase Class Reference

Build BlastDB format databases from various data sources. More...

#include <objtools/blast/seqdb_writer/build_db.hpp>

  CBuildDatabase (const string &dbname, const string &title, bool is_protein, CWriteDB::TIndexType indexing, bool use_gi_mask, ostream *logfile, bool long_seqids=false, EBlastDbVersion dbver=eBDB_Version4, bool limit_defline=false, Uint8 oid_masks=EOidMaskType::fNone, bool scan_bioseq_4_cfastareader_usrobj=true)   Constructor. More...
    CBuildDatabase (const string &dbname, const string &title, bool is_protein, bool sparse, bool parse_seqids, bool use_gi_mask, ostream *logfil, bool long_seqids=false, EBlastDbVersion dbver=eBDB_Version4, bool limit_defline=false, Uint8 oid_masks=EOidMaskType::fNone, bool scan_bioseq_4_cfastareader_usrobj=true)   Constructor. More...
    ~CBuildDatabase ()
  void  SetTaxids (CTaxIdSet &taxids)   Specify a mapping of sequence ids to taxonomic ids. More...
  void  SetMaskLetters (const string &mask_letters)   Specify letters to mask out of protein sequence data. More...
  void  SetSourceDb (const string &src_db_name)   Specify source database(s) via the database name(s). More...
  void  SetSourceDb (CRef< CSeqDBExpert > src_db)   Specify source database. More...
  void  SetLinkouts (const TLinkoutMap &linkouts, bool keep_links)   Specify a linkout bit lookup object. More...
  void  SetMembBits (const TLinkoutMap &membbits, bool keep_mbits)   Specify a membership bit lookup object. More...
  void  SetLeafTaxIds (const TIdToLeafs &taxids, bool keep_taxids)   Specify a leaf-taxids object. More...
  bool  Build (const vector< string > &ids, CNcbiIstream *fasta_file)   Build the database. More...
  void  StartBuild ()   Start building a new database. More...
  bool  AddIds (const vector< string > &ids)   Add the specified sequences from the source database. More...
  bool  AddFasta (CNcbiIstream &fasta_file)   Add sequences from a file containing FASTA data. More...
  bool  AddSequences (IBioseqSource &src, bool add_pig=false)   Add sequences from an IBioseqSource object. More...
  bool  AddSequences (IRawSequenceSource &src)   Add sequences from an IRawSequenceSource object. More...
  bool  EndBuild (bool erase=false)   Finish building a new database. More...
  void  SetUseRemote (bool use_remote)   Specify whether to use remote fetching for locally absent IDs. More...
  void  SetVerbosity (bool v)   Specify level of output verbosity. More...
  void  SetSkipCopyingGis (bool v)
  void  SetMaxFileSize (Uint8 max_file_size)   Set the maximum size of database component files. More...
  int  RegisterMaskingAlgorithm (EBlast_filter_program program, const string &options, const string &name="")   Define a masking algorithm. More...
  int  RegisterMaskingAlgorithm (const string &program, const string &description, const string &options)   Define a masking algorithm. More...
  void  SetMaskDataSource (IMaskDataSource &ranges)   Specify an object mapping Seq-id to subject masking data. More...
  string  GetOutputDbName () const
    CObject (void)   Constructor. More...
    CObject (const CObject &src)   Copy constructor. More...
  virtual  ~CObject (void)   Destructor. More...
  CObject &  operator= (const CObject &src) THROWS_NONE   Assignment operator. More...
  bool  CanBeDeleted (void) const THROWS_NONE   Check if object can be deleted. More...
  bool  IsAllocatedInPool (void) const THROWS_NONE   Check if object is allocated in memory pool (not system heap) More...
  bool  Referenced (void) const THROWS_NONE   Check if object is referenced. More...
  bool  ReferencedOnlyOnce (void) const THROWS_NONE   Check if object is referenced only once. More...
  void  AddReference (void) const   Add reference to object. More...
  void  RemoveReference (void) const   Remove reference to object. More...
  void  ReleaseReference (void) const   Remove reference without deleting object. More...
  virtual void  DoNotDeleteThisObject (void)   Mark this object as not allocated in heap – do not delete this object. More...
  virtual void  DoDeleteThisObject (void)   Mark this object as allocated in heap – object can be deleted. More...
  void *  operator new (size_t size)   Define new operator for memory allocation. More...
  void *  operator new[] (size_t size)   Define new[] operator for 'array' memory allocation. More...
  void  operator delete (void *ptr)   Define delete operator for memory deallocation. More...
  void  operator delete[] (void *ptr)   Define delete[] operator for memory deallocation. More...
  void *  operator new (size_t size, void *place)   Define new operator. More...
  void  operator delete (void *ptr, void *place)   Define delete operator. More...
  void *  operator new (size_t size, CObjectMemoryPool *place)   Define new operator using memory pool. More...
  void  operator delete (void *ptr, CObjectMemoryPool *place)   Define delete operator. More...
  virtual void  DebugDump (CDebugDumpContext ddc, unsigned int depth) const   Define method for dumping debug information. More...
    CDebugDumpable (void)
  virtual  ~CDebugDumpable (void)
  void  DebugDumpText (ostream &out, const string &bundle, unsigned int depth) const
  void  DebugDumpFormat (CDebugDumpFormatter &ddf, const string &bundle, unsigned int depth) const
  void  DumpToConsole (void) const
  bool  m_IsProtein   True for a protein database, false for nucleotide. More...
  bool  m_KeepLinks   True to keep linkout bits from source dbs, false to discard. More...
  TIdToBits  m_Id2Links   Table of linkout bits to apply to sequences. More...
  bool  m_KeepMbits   True to keep membership bits from source dbs, false to discard. More...
  TIdToBits  m_Id2Mbits   Table of membership bits to apply to sequences. More...
  bool  m_KeepLeafs   True to keep leaf taxids from source dbs, false to discard. More...
  TIdToLeafs  m_Id2Leafs   Table of leaf taxids to apply to sequences. More...
  CRef< objects::CObjectManager >  m_ObjMgr   Object manager, used for remote fetching. More...
  CRef< objects::CScope >  m_Scope   Sequence scope, used for remote fetching. More...
  CRef< CTaxIdSet >  m_Taxids   Set of TaxIDs configured to apply to sequences. More...
  CRef< CWriteDB >  m_OutputDb   Database being produced here. More...
  CRef< CSeqDBExpert >  m_SourceDb   Database for duplicating sequences locally (-sourcedb option.) More...
  CRef< IMaskDataSource >  m_MaskData   Subject masking data. More...
  ostream &  m_LogFile   Logfile. More...
  bool  m_UseRemote   Whether to use remote resolution and sequence fetching. More...
  int  m_DeflineCount   Define count. More...
  int  m_OIDCount   Number of OIDs stored in this database. More...
  bool  m_Verbose   If true, more detailed log messages will be produced. More...
  bool  m_ParseIDs   If true, string IDs found in FASTA input will be parsed as Seq-ids. More...
  bool  m_LongIDs   If true, use long sequence ids (database|accession) More...
  bool  m_FoundMatchingMasks   If true, there were sequences whose IDs matched those in the provided masking locations (via SetMaskDataSource). More...
  bool  m_SkipCopyingGis   If set to true, when copying BLASTDBs, skip the GIs. More...
  bool  m_SkipLargeGis   If set to true, skip GIs with value > 0x7FFFFFFF. More...
  string  m_OutputDbName
  bool  m_ScanBioseq4CFastaReaderUsrObjct

Build BlastDB format databases from various data sources.

This class provides an API for building BlastDB format databases. The WriteDB library is used internally to produce the actual database; the functionality provided by this class helps to bridge the gap between the WriteDB API and the needs of a command line database construction tool.

Definition at line 136 of file build_db.hpp.

◆ CBuildDatabase() [1/2] CBuildDatabase::CBuildDatabase ( const string &  dbname, const string &  title, bool  is_protein, CWriteDB::TIndexType  indexing, bool  use_gi_mask, ostream *  logfile, bool  long_seqids = false, EBlastDbVersion  dbver = eBDB_Version4, bool  limit_defline = false, Uint8  oid_masks = EOidMaskType::fNone, bool  scan_bioseq_4_cfastareader_usrobj = true  )

Constructor.

Create a database with the specified name, type, and other characteristics. The database will use the specified dbname as the base name for database volumes. Note that the indexing argument will be combined with either eSparseIndex or eDefault, depending on the "sparse" flag.

Parameters
dbname Name of the database to create. [in]
title Title to use for newly created database. [in]
is_protein Use true for protein, false for nucleotide. [in]
sparse Specify true to use sparse Seq-id indexing. [in] Logging will be done to this stream. [in]
use_gi_mask if true will generate GI-based mask files [in]
logfile file to write the log to [in]
long_seqids if true, requires long sequence ids (database|accession) when parsing fasta sequences [in]
dbver version of BLAST database to generate [in]
scan_bioseq_4_cfastareader_usrobj [in] If true, scan the Bioseq objects for a CFastaReader-created User-object containing a defline

Definition at line 1073 of file build_db.cpp.

References CTime::AsString(), CDirEntry::CreateAbsolutePath(), CreateDirectories(), dbname(), DeleteBlastDb(), CTime::eCurrent, CWriteDB::eNucleotide, CWriteDB::eProtein, m_LogFile, m_LongIDs, m_OutputDb, m_OutputDbName, m_ParseIDs, ParseMoleculeTypeString(), CRef< C, Locker >::Reset(), and CWriteDB::SetMaxFileSize().

◆ CBuildDatabase() [2/2] CBuildDatabase::CBuildDatabase ( const string &  dbname, const string &  title, bool  is_protein, bool  sparse, bool  parse_seqids, bool  use_gi_mask, ostream *  logfil, bool  long_seqids = false, EBlastDbVersion  dbver = eBDB_Version4, bool  limit_defline = false, Uint8  oid_masks = EOidMaskType::fNone, bool  scan_bioseq_4_cfastareader_usrobj = true  )

Constructor.

Create a database with the specified name, type, and other characteristics. The database will use the specified dbname as the base name for database volumes. Note that the indexing argument will be combined with either eSparseIndex or eDefault, depending on the "sparse" flag.

Parameters
dbname Name of the database to create. [in]
title Title to use for newly created database. [in]
is_protein Use true for protein, false for nucleotide. [in]
sparse Specify true to use sparse Seq-id indexing. [in]
parse_seqids specify true to parse the sequence IDs [in]
use_gi_mask if true will generate GI-based mask files [in]
indexing index fields to add to database. [in]
long_seqids if true, requires long sequence ids (database|accession) when parsing fasta sequences [in]
scan_bioseq_4_cfastareader_usrobj [in] If true, scan the Bioseq objects for a CFastaReader-created User-object containing a defline

Definition at line 1136 of file build_db.cpp.

References CTime::AsString(), CDirEntry::CreateAbsolutePath(), CreateDirectories(), dbname(), DeleteBlastDb(), CTime::eCurrent, CWriteDB::eDefault, CWriteDB::eNucleotide, CWriteDB::eProtein, CWriteDB::eSparseIndex, m_LogFile, m_OutputDb, m_OutputDbName, m_ParseIDs, ParseMoleculeTypeString(), CRef< C, Locker >::Reset(), and CWriteDB::SetMaxFileSize().

◆ ~CBuildDatabase() CBuildDatabase::~CBuildDatabase ( )

◆ AddFasta()

◆ AddIds()

Add the specified sequences from the source database.

The strings in the list are interpreted as GIs if they're composed only of numeric digits, or as Seq-ids otherwise. The sequence IDs will be resolved, and a sequence corresponding to each ID will be added to the output database. If remote resolution is enabled, it will be used to find up-to-date versions for any ambiguously versioned IDs (i.e. unversioned IDs of versioned Seq-id types). Then local fetching will be used to process IDs using the source database if one was specified. If any sequences have not been found, and remote services are enabled, remote fetching will be used for IDs not resolved locally. If any IDs are not found at all, they will be reported as part of the logging output.

Parameters
ids List of sequence IDs as strings.
Returns
true if all sequences were found locally or remotely.

Definition at line 1321 of file build_db.cpp.

References _ASSERT, map_checker< Container >::end(), map_checker< Container >::find(), CSeqDB::GetDBNameList(), CSeqDBGiList::GetGiOid(), CSeqDBGiList::GetNumGis(), CSeqDBGiList::GetNumSis(), CSeqDB::GetSequenceType(), CSeqDBGiList::SGiOid::gi, i, m_LogFile, m_SourceDb, m_UseRemote, m_Verbose, CRef< C, Locker >::NotEmpty(), CSeqDBGiList::SGiOid::oid, x_AddRemoteSequences(), x_DupLocal(), x_ReportUnresolvedIds(), and x_ResolveGis().

Referenced by BOOST_AUTO_TEST_CASE(), and Build().
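
The digits-only rule described for AddIds() can be illustrated with a small standalone sketch. IsGiString and PartitionIds are hypothetical helper names, not toolkit functions (the toolkit does this work in x_ResolveGis()), and treating an empty string as "not a GI" is an assumption of this sketch.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: true if the string is non-empty and consists only
// of decimal digits, which is how the text says a GI is recognized.
// Assumption: an empty string is not treated as a GI.
bool IsGiString(const std::string& id)
{
    if (id.empty()) return false;
    for (unsigned char c : id) {
        if (!std::isdigit(c)) return false;
    }
    return true;
}

// Hypothetical helper: split a list of IDs into (GI strings, other Seq-id
// strings), preserving the input order within each group.
std::pair<std::vector<std::string>, std::vector<std::string>>
PartitionIds(const std::vector<std::string>& ids)
{
    std::pair<std::vector<std::string>, std::vector<std::string>> out;
    for (const std::string& id : ids) {
        (IsGiString(id) ? out.first : out.second).push_back(id);
    }
    return out;
}
```

With the input {"129295", "gi|555", "P01013"}, only "129295" lands in the GI group; the other two are left to be parsed as Seq-ids.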

◆ AddSequences() [1/2]

Add sequences from an IBioseqSource object.

The provided `src' object is queried using GetNext() to get a Bioseq object. The Bioseq is added to the output database (with appropriate modifications of taxid, membership bits, and linkout bits, as configured here). This process repeats until the GetNext() method returns NULL.

Parameters
src An object providing one or more Bioseq objects.
add_pig true if PIG should be added if available
Returns
True if at least one sequence was added.

Definition at line 794 of file build_db.cpp.

References CBioseq_Base::CanGetId(), count, debug_mode, CSeq_id_Base::e_Local, CStopWatch::Elapsed(), CStopWatch::eStart, CSeq_id::fAcc_nuc, CSeq_id::fAcc_prot, CBioseq_Base::GetId(), CBioseq::GetLength(), IBioseqSource::GetNext(), CConstRef< C, Locker >::GetNonNullPointer(), GI_CONST, info, CBioseq::IsAa(), label, m_IsProtein, m_LogFile, m_LongIDs, m_SkipLargeGis, m_Verbose, NCBI_THROW, CConstRef< C, Locker >::NotEmpty(), NULL, CBioseq_Base::SetId(), sw, t, and x_EditAndAddBioseq().

Referenced by AddFasta(), BOOST_AUTO_TEST_CASE(), s_TestReadPDBAsn1(), CMakeBlastDBApp::x_AddSeqEntries(), CMakeClusterDBApp::x_BuildDatabase(), BlastdbCopyApplication::x_CopyDB(), BlastdbCopyApplication::x_MakeDBwIDList(), and CMakeBlastDBApp::x_ProcessInputData().
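
The GetNext() loop described above can be sketched without the toolkit. FakeBioseq, FakeBioseqSource and AddAll are stand-ins invented for this illustration; they only mirror the documented contract (pull until GetNext() returns null, report whether anything was added) and skip the taxid, membership bit and linkout edits.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Stand-in for a Bioseq; only an ID label is kept for illustration.
struct FakeBioseq { std::string id; };

// Stand-in for IBioseqSource: GetNext() returns null once exhausted.
class FakeBioseqSource {
public:
    explicit FakeBioseqSource(std::vector<std::string> ids)
        : m_Ids(std::move(ids)) {}

    std::shared_ptr<const FakeBioseq> GetNext()
    {
        if (m_Pos >= m_Ids.size()) return nullptr;
        return std::make_shared<FakeBioseq>(FakeBioseq{m_Ids[m_Pos++]});
    }

private:
    std::vector<std::string> m_Ids;
    std::size_t m_Pos = 0;
};

// Drain the source into a stand-in "database" (a vector of IDs) and
// return true if at least one sequence was added.
bool AddAll(FakeBioseqSource& src, std::vector<std::string>& output_db)
{
    bool added = false;
    while (std::shared_ptr<const FakeBioseq> bs = src.GetNext()) {
        output_db.push_back(bs->id);
        added = true;
    }
    return added;
}
```

An empty source makes AddAll return false, matching the "at least one sequence was added" return rule.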

◆ AddSequences() [2/2]

Add sequences from an IRawSequenceSource object.

The provided `src' object is queried using GetNext() to get various "raw format" sequence data and metadata components. These pieces of data are added to the output database (with appropriate modifications of taxid, membership bits, and linkout bits, as configured here). This process repeats until the GetNext() method returns false.

Parameters
src An object providing one or more "raw" sequences.
Returns
True if at least one sequence was added.

Definition at line 904 of file build_db.cpp.

References _ASSERT, CWriteDB::AddColumnMetaData(), CWriteDB::AddSequence(), CBlastDbBlob::Clear(), count, CWriteDB::CreateUserColumn(), CTempString::data(), done, CStopWatch::Elapsed(), CMaskedRangesVector::empty(), CTempString::empty(), CRef< C, Locker >::Empty(), map_checker< Container >::end(), CStopWatch::eStart, map_checker< Container >::find(), CWriteDB::FindColumn(), CBlast_def_line_set_Base::Get(), IRawSequenceSource::GetColumnId(), IRawSequenceSource::GetColumnMetaData(), IRawSequenceSource::GetColumnNames(), IRawSequenceSource::GetNext(), IMaskDataSource::GetRanges(), i, int, ITERATE, m_FoundMatchingMasks, m_IsProtein, m_LogFile, m_MaskData, m_OutputDb, NCBI_THROW, CWriteDB::SetBlobData(), CWriteDB::SetDeflines(), CWriteDB::SetMaskData(), ncbi::grid::netcache::search::fields::size, CTempString::size(), sw, t, CBlastDbBlob::WriteRaw(), x_AddPig(), and x_EditHeaders().

◆ Build()

Build the database.

This method builds a database from the given list of Sequence IDs and the provided file, which should contain FASTA format data. It is equivalent to calling StartBuild(), AddIds(), AddFasta(), and EndBuild() in that order (except that a little additional logging is done with summary information).

Parameters
ids List of identifiers to add to the database.
fasta_file FASTA format data for

Definition at line 1289 of file build_db.cpp.

References AddFasta(), AddIds(), CStopWatch::Elapsed(), EndBuild(), CStopWatch::eStart, m_DeflineCount, m_LogFile, m_OIDCount, StartBuild(), sw, and t.

Referenced by BOOST_AUTO_TEST_CASE().
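
The call order that Build() is described as equivalent to can be made explicit with a stand-in class. BuilderSketch is hypothetical and only records which step ran; the return values of its AddIds() and AddFasta() are placeholders, and how the real Build() combines results or handles a null fasta_file is not covered here.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Stand-in builder that only records the name of each step, in order.
class BuilderSketch {
public:
    void StartBuild() { m_Steps.push_back("StartBuild"); }

    bool AddIds(const std::vector<std::string>& ids)
    {
        m_Steps.push_back("AddIds");
        return !ids.empty();   // placeholder result, not the real rule
    }

    bool AddFasta(std::istream& fasta)
    {
        m_Steps.push_back("AddFasta");
        return fasta.good();   // placeholder result, not the real rule
    }

    bool EndBuild() { m_Steps.push_back("EndBuild"); return true; }

    // The four calls in the order the text gives for Build().
    void Build(const std::vector<std::string>& ids, std::istream& fasta)
    {
        StartBuild();
        AddIds(ids);
        AddFasta(fasta);
        EndBuild();
    }

    const std::vector<std::string>& Steps() const { return m_Steps; }

private:
    std::vector<std::string> m_Steps;
};
```

Callers that need to interleave other sources, such as AddSequences(), would issue the individual calls themselves between StartBuild() and EndBuild() instead of using Build().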

◆ CreateDirectories() void CBuildDatabase::CreateDirectories ( const string &  dbname ) static

Create Directory for blast db.

Parameters
dbname output blast db name (with path)

Definition at line 1051 of file build_db.cpp.

References CDirEntry::CheckAccess(), CDir::CreatePath(), dbname(), CDirEntry::eIfEmptyPath_Empty, CDir::Exists(), CDirEntry::fWrite, CDirEntry::GetDir(), CDirEntry::GetName(), msg(), and NCBI_THROW.

Referenced by CBuildDatabase(), CBlastdbConvertApp::Run(), and CMakeProfileDBApp::x_Run().

◆ EndBuild()

Finish building a new database.

This method closes the newly constructed database, flushing any unflushed volumes, creating an alias file to tie the volumes together, and so on.

Parameters
erase Will erase all files created if true.

Definition at line 1423 of file build_db.cpp.

References CWriteDB::Close(), eUnknown, m_OutputDb, NCBI_EXCEPTION_VAR, NULL, CException::what(), and x_EndBuild().

Referenced by AddFasta(), BOOST_AUTO_TEST_CASE(), Build(), s_TestReadPDBAsn1(), CMakeBlastDBApp::x_BuildDatabase(), CMakeClusterDBApp::x_BuildDatabase(), BlastdbCopyApplication::x_CopyDB(), and BlastdbCopyApplication::x_MakeDBwIDList().

◆ GetOutputDbName() string CBuildDatabase::GetOutputDbName ( ) const inline

◆ RegisterMaskingAlgorithm() [1/2]

Define a masking algorithm.

The returned integer ID will be defined as corresponding to the provided program enumeration (e.g. DUST, SEG, etc) and options string, for subject masking. Each program enumeration (such as DUST) may be used several times with different options strings, however, the combination of program and options should be unique for each algorithm ID. The options string is a free-form string (at least from this class's point of view).

Parameters
program A string to identify the filtering algorithm [in]
description A free-form string describing the data [in]
options A free-form string describing the options used [in]

Definition at line 1597 of file build_db.cpp.

References m_OutputDb, and CWriteDB::RegisterMaskAlgorithm().

◆ RegisterMaskingAlgorithm() [2/2]

Define a masking algorithm.

The returned integer ID will be defined as corresponding to the provided program enumeration (e.g. DUST, SEG, etc) and options string, for subject masking. Each program enumeration (such as DUST) may be used several times with different options strings, however, the combination of program and options should be unique for each algorithm ID. The options string is a free-form string (at least from this class's point of view).

Parameters
program One of the predefined masking types (dust etc). [in]
options A free-form string describing this type of data. The empty string should be used to indicate default parameters. [in]
name Name of the GI-based mask file [in]

Definition at line 1584 of file build_db.cpp.

References m_OutputDb, and CWriteDB::RegisterMaskAlgorithm().

Referenced by CClusterDBSource::CClusterDBSource(), CRawSeqDBSource::CRawSeqDBSource(), and CMakeBlastDBApp::x_ProcessMaskData().
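
The uniqueness requirement on (program, options) pairs can be modeled with a small registry. MaskAlgoRegistrySketch is a hypothetical class written for this illustration; numbering IDs from 0 and throwing on a duplicate pair are assumptions of the sketch, not documented behavior of CWriteDB::RegisterMaskAlgorithm().

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical registry: one integer ID per unique (program, options) pair.
class MaskAlgoRegistrySketch {
public:
    int Register(const std::string& program, const std::string& options)
    {
        const auto key = std::make_pair(program, options);
        if (m_Ids.count(key) != 0) {
            // Assumption: a repeated pair is rejected outright.
            throw std::invalid_argument("duplicate program/options pair");
        }
        const int id = m_NextId++;   // assumption: IDs are handed out from 0
        m_Ids.emplace(key, id);
        return id;
    }

private:
    std::map<std::pair<std::string, std::string>, int> m_Ids;
    int m_NextId = 0;
};
```

Registering "dust" twice with different options strings yields two distinct IDs, while repeating an identical pair is refused.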

◆ SetLeafTaxIds()

◆ SetLinkouts()

◆ SetMaskDataSource()

Specify an object mapping Seq-id to subject masking data.

Masking data is provided to CBuildDatabase by implementing an interface that can produce masking data given the Seq-ids for the sequence that is to be masked. This object could wrap a simple lookup table, an algorithm that produces the data on the fly, or a wrapper around an existing database that fetches the masking data from that database.

Parameters
ranges An object mapping Seq-ids to their masking data.

Definition at line 1609 of file build_db.cpp.

References m_MaskData, and CRef< C, Locker >::Reset().

Referenced by CMakeBlastDBApp::x_ProcessMaskData().

◆ SetMaskLetters() void CBuildDatabase::SetMaskLetters ( const string &  mask_letters )

Specify letters to mask out of protein sequence data.

Protein sequences sometimes contain rare (or recently defined) letters that cause trouble for some algorithms. This method specifies a list of protein letters that might be found in the input sequences, but which should be replaced by "X" before adding those sequences to the database.

Parameters
mask_letters Protein letters to replace with "X". [in]

Definition at line 1221 of file build_db.cpp.

References m_OutputDb, and CWriteDB::SetMaskedLetters().
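
The effect of masking letters can be shown on a plain string. MaskLettersSketch is a hypothetical free function, not the toolkit call; it assumes residues are single characters and that matching is case-sensitive.

```cpp
#include <cassert>
#include <string>

// Replace every residue that appears in mask_letters with 'X'.
// Assumption: matching is case-sensitive and works on one-letter codes.
std::string MaskLettersSketch(std::string sequence,
                              const std::string& mask_letters)
{
    for (char& residue : sequence) {
        if (mask_letters.find(residue) != std::string::npos) {
            residue = 'X';
        }
    }
    return sequence;
}
```

For example, masking "UO" in "MKUOAL" gives "MKXXAL", and an empty mask list leaves the sequence unchanged.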

◆ SetMaxFileSize() void CBuildDatabase::SetMaxFileSize ( Uint8  max_file_size )

◆ SetMembBits()

◆ SetSkipCopyingGis() void CBuildDatabase::SetSkipCopyingGis ( bool  v ) inline

◆ SetSourceDb() [1/2] void CBuildDatabase::SetSourceDb ( const string &  src_db_name )

◆ SetSourceDb() [2/2]

◆ SetTaxids() void CBuildDatabase::SetTaxids ( CTaxIdSet &  taxids )

◆ SetUseRemote() void CBuildDatabase::SetUseRemote ( bool  use_remote ) inline

Specify whether to use remote fetching for locally absent IDs.

If identifiers in the list provided to Build or to AddIds are not found in the source database (if any), remote sequence fetching APIs can be used to fetch those sequences. Normally this happens in two cases. First, sequences listed in the list of IDs are sometimes too new to be found in the source database. Second, sequences may be found in the source database, but newer versions might be available in the remote database.

If the use_remote flag is set to true, this class finds the latest version number for unversioned IDs (but only of types that can have versions in the first place), and will attempt to remotely fetch any sequences for which the source database does not have the latest version. If the flag is specified as false, no remote lookups will be done, and sequences found in ids but not found in the source database will not be added to the output database.

Note: This does not affect the AddSequences, AddRawSequences, or AddFasta methods; in those cases, all provided sequences are added in the form they are provided in.

The default value for this flag is "true".

Parameters
use_remote Specify true for remote checking & fetching.

Definition at line 385 of file build_db.hpp.

References m_UseRemote.

Referenced by BOOST_AUTO_TEST_CASE(), BlastdbCopyApplication::x_CopyDB(), and BlastdbCopyApplication::x_MakeDBwIDList().

◆ SetVerbosity() void CBuildDatabase::SetVerbosity ( bool  v ) inline

◆ StartBuild() void CBuildDatabase::StartBuild ( )

◆ x_AddMasksForSeqId()

◆ x_AddOneRemoteSequence() void CBuildDatabase::x_AddOneRemoteSequence ( const objects::CSeq_id &  seqid, bool &  found, bool &  error  ) private

◆ x_AddPig() void CBuildDatabase::x_AddPig ( CRef< objects::CBlast_def_line_set >  headers ) private

◆ x_AddRemoteSequences()

Duplicate IDs from local databases.

This method iterates over the list of IDs; any IDs that were not found in the source database are added by fetching the sequence from remote services. (Whether an ID was found locally can be determined by whether the OID found in the GI list is valid.)

Parameters
gi_list A list of GIs and Seq-ids.
Returns
True if all IDs could be added.

Definition at line 555 of file build_db.cpp.

References count, CStopWatch::Elapsed(), CStopWatch::eStart, CSeqDBGiList::GetGiOid(), CSeqDBGiList::GetKey(), CSeqDBGiList::GetNumGis(), CSeqDBGiList::GetNumSis(), CSeqDBGiList::GetSiOid(), i, m_LogFile, m_Verbose, CSeqDBGiList::SGiOid::oid, CSeqDBGiList::SSiOid::oid, sw, t, and x_AddOneRemoteSequence().

Referenced by AddIds().
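
The "valid OID means found locally" test can be sketched as a filter over the ID list. IdEntrySketch and IdsNeedingRemoteFetch are hypothetical names, and using -1 as the "not found" OID is an assumption of this sketch rather than something stated on this page.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Stand-in for one entry of the GI list: an ID plus the OID found locally.
struct IdEntrySketch {
    std::string id;
    int oid;   // assumption: -1 marks "not found in the source database"
};

// Collect the IDs that were not resolved locally and therefore still
// need a remote fetch.
std::vector<std::string>
IdsNeedingRemoteFetch(const std::vector<IdEntrySketch>& gi_list)
{
    std::vector<std::string> pending;
    for (const IdEntrySketch& entry : gi_list) {
        if (entry.oid < 0) {
            pending.push_back(entry.id);
        }
    }
    return pending;
}
```

Only entries without a valid OID are handed on; everything else is assumed to have been copied from the source database already.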

◆ x_DupLocal() void CBuildDatabase::x_DupLocal ( ) private

Duplicate IDs from local databases.

This method iterates over the list of IDs, copying sequences found in the source databases to the output database.

Definition at line 235 of file build_db.cpp.

References CWriteDB::AddSequence(), ambig(), buffer, CSeqDB::CheckOrFindOID(), count, CStopWatch::Elapsed(), CStopWatch::eStart, CTaxIdSet::FixTaxId(), CBlast_def_line_set_Base::Get(), CSeqDB::GetHdr(), CSeqDBExpert::GetRawSeqAndAmbig(), m_DeflineCount, m_LogFile, m_OIDCount, m_OutputDb, m_SourceDb, m_Taxids, CWriteDB::SetDeflines(), sw, t, and x_SetLinkAndMbit().

Referenced by AddIds().

◆ x_EditAndAddBioseq() bool CBuildDatabase::x_EditAndAddBioseq ( CConstRef< objects::CBioseq >  bs, objects::CSeqVector *  sv, bool  add_pig = false  ) private

Modify a Bioseq as needed and add it to the database.

The provided Bioseq is added to the database. Modifications are made to the data as needed (but the input object is not affected). In particular, the taxid is set (0 is used if no taxid is known), and linkout and membership bits are set.

Parameters
bs Bioseq to add to the database.
sv Sequence data to add to the database.
add_pig true if PIG should be added if available
Returns
true if the Bioseq has been added, otherwise false

Definition at line 469 of file build_db.cpp.

References CWriteDB::AddSequence(), CWriteDB::ExtractBioseqDeflines(), CBlast_def_line_set_Base::Get(), m_DeflineCount, m_LongIDs, m_OIDCount, m_OutputDb, m_ParseIDs, m_ScanBioseq4CFastaReaderUsrObjct, s_FixBioseqDeltas(), CWriteDB::SetDeflines(), x_AddMasksForSeqId(), x_AddPig(), and x_EditHeaders().

Referenced by AddSequences(), and x_AddOneRemoteSequence().

◆ x_EditHeaders() void CBuildDatabase::x_EditHeaders ( CRef< objects::CBlast_def_line_set >  headers ) private

◆ x_EndBuild()

◆ x_GetScope() CScope & CBuildDatabase::x_GetScope ( ) private

◆ x_ReportUnresolvedIds()

◆ x_ResolveFromSource()

Determine if this string ID can be found in the source database.

The provided string will be looked up as an accession in the source database. If a corresponding sequence is found, it will be returned in the `id' field. The resolution is only considered a match if the provided string is a substring of the FASTA representation of the provided Seq-id, and if that substring seems to represent whole components (so that it's surrounded by delimiters such as `|' and `.' rather than by alphanumeric characters, which may be part of another ID).

Parameters
acc The accession or ID to look up. [in]
id The returned Seq-id if one is found. [out]
Returns
true if the resolution was successful.

Definition at line 185 of file build_db.cpp.

References CSeqDB::AccessionToOids(), CSeq_id::AsFastaString(), done, CRef< C, Locker >::Empty(), CSeqDB::GetSeqIDs(), ITERATE, and m_SourceDb.

Referenced by x_ResolveGis().
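
The whole-component check can be approximated on plain strings. MatchesWholeComponent is a hypothetical helper; it treats any non-alphanumeric neighbor (or the string boundary) as a delimiter, which is a simplification of whatever rule the toolkit applies.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// True if `acc` occurs in `fasta_id` as a whole component, i.e. the
// characters just before and just after the match are not alphanumeric
// (or the match touches the start or end of the string).
bool MatchesWholeComponent(const std::string& fasta_id,
                           const std::string& acc)
{
    if (acc.empty()) return false;
    std::string::size_type pos = fasta_id.find(acc);
    while (pos != std::string::npos) {
        const std::string::size_type end = pos + acc.size();
        const bool left_ok = pos == 0 ||
            !std::isalnum(static_cast<unsigned char>(fasta_id[pos - 1]));
        const bool right_ok = end == fasta_id.size() ||
            !std::isalnum(static_cast<unsigned char>(fasta_id[end]));
        if (left_ok && right_ok) return true;
        pos = fasta_id.find(acc, pos + 1);   // try the next occurrence
    }
    return false;
}
```

Against the FASTA-style ID "gb|AAA12345.1|", both "AAA12345" and "AAA12345.1" match, while fragments such as "12345" or "AA1234" do not because a letter or digit sits directly next to them.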

◆ x_ResolveGis()

Resolve various input IDs (as strings) to GIs.

The input IDs are examined, the type of each is determined as a GI or some other kind of Seq-id, and each ID is resolved to a GI where possible. The list of GIs and other Seq-ids found is returned in a GI list.

Parameters
ids List of strings representing IDs to resolve.
Returns
GI list produced from the input ids.

Definition at line 116 of file build_db.cpp.

References CInputGiList::AppendGi(), CInputGiList::AppendSi(), CheckAccession(), debug_mode, ITERATE, m_LogFile, m_SourceDb, m_UseRemote, CRef< C, Locker >::NotEmpty(), x_ResolveFromSource(), x_ResolveRemoteId(), and ZERO_GI.

Referenced by AddIds().

◆ x_ResolveRemoteId() void CBuildDatabase::x_ResolveRemoteId ( CRef< objects::CSeq_id > &  seqid, TGi &  gi  ) private

Resolve an ID remotely.

This method looks up the given ID via remote services in order to find an ID for the most up-to-date version of the sequence. The remote service will return a list of Seq-ids; if at least one of these is a GI, that will be returned in `gi'. If no GI is found, but at least one of the returned IDs is of the same type as the input Seq-id, the version number of the input Seq-id will be updated.

Parameters
seqid Sequence identifier to look up remotely. [in|out]
gi Genomic ID if one is found, otherwise 0. [out]

Definition at line 65 of file build_db.cpp.

References debug_mode, CSeq_id::GetTextseq_Id(), CSeq_id_Base::IsGi(), CTextseq_id_Base::IsSetVersion(), ITERATE, m_LogFile, NULL, CRef< C, Locker >::Reset(), CSeq_id_Base::Which(), x_GetScope(), and ZERO_GI.

Referenced by x_ResolveGis().

◆ x_SetLeafTaxids() void CBuildDatabase::x_SetLeafTaxids ( CRef< objects::CBlast_def_line_set >  headers ) private

Store leaf taxids in provided headers.

Parameters
headers These deflines will be modified. [in|out]

◆ x_SetLinkAndMbit() void CBuildDatabase::x_SetLinkAndMbit ( CRef< objects::CBlast_def_line_set >  headers ) private

◆ m_DeflineCount int CBuildDatabase::m_DeflineCount private

◆ m_FoundMatchingMasks bool CBuildDatabase::m_FoundMatchingMasks private

◆ m_Id2Leafs

◆ m_Id2Links

◆ m_Id2Mbits

◆ m_IsProtein bool CBuildDatabase::m_IsProtein private

◆ m_KeepLeafs bool CBuildDatabase::m_KeepLeafs private

◆ m_KeepLinks bool CBuildDatabase::m_KeepLinks private

True to keep linkout bits from source dbs, false to discard.

DEPRECATED

Definition at line 601 of file build_db.hpp.

Referenced by SetLinkouts().

◆ m_KeepMbits bool CBuildDatabase::m_KeepMbits private

◆ m_LogFile ostream& CBuildDatabase::m_LogFile private

Logfile.

Definition at line 638 of file build_db.hpp.

Referenced by AddIds(), AddSequences(), Build(), CBuildDatabase(), SetLeafTaxIds(), SetLinkouts(), SetMembBits(), SetSourceDb(), x_AddOneRemoteSequence(), x_AddRemoteSequences(), x_DupLocal(), x_EndBuild(), x_ReportUnresolvedIds(), x_ResolveGis(), and x_ResolveRemoteId().

◆ m_LongIDs bool CBuildDatabase::m_LongIDs private

◆ m_MaskData

◆ m_ObjMgr CRef<objects::CObjectManager> CBuildDatabase::m_ObjMgr private

◆ m_OIDCount int CBuildDatabase::m_OIDCount private

◆ m_OutputDb

Database being produced here.

Definition at line 629 of file build_db.hpp.

Referenced by AddSequences(), CBuildDatabase(), EndBuild(), RegisterMaskingAlgorithm(), SetMaskLetters(), SetMaxFileSize(), x_AddMasksForSeqId(), x_AddPig(), x_DupLocal(), x_EditAndAddBioseq(), and x_EndBuild().

◆ m_OutputDbName string CBuildDatabase::m_OutputDbName private

◆ m_ParseIDs bool CBuildDatabase::m_ParseIDs private

◆ m_ScanBioseq4CFastaReaderUsrObjct bool CBuildDatabase::m_ScanBioseq4CFastaReaderUsrObjct private

◆ m_Scope CRef<objects::CScope> CBuildDatabase::m_Scope private

◆ m_SkipCopyingGis bool CBuildDatabase::m_SkipCopyingGis private

◆ m_SkipLargeGis bool CBuildDatabase::m_SkipLargeGis private

◆ m_SourceDb

◆ m_Taxids

◆ m_UseRemote bool CBuildDatabase::m_UseRemote private

◆ m_Verbose bool CBuildDatabase::m_Verbose private

The documentation for this class was generated from the following files:

