
Showing content from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3025693 below:

PubMed and beyond: a survey of web tools for searching biomedical literature

Abstract

The past decade has witnessed the modern advances of high-throughput technology and rapid growth of research capacity in producing large-scale biological data, both of which were concomitant with an exponential growth of biomedical literature. This wealth of scholarly knowledge is of significant importance for researchers in making scientific discoveries and healthcare professionals in managing health-related matters. However, the acquisition of such information is becoming increasingly difficult due to its large volume and rapid growth. In response, the National Center for Biotechnology Information (NCBI) is continuously making changes to its PubMed Web service for improvement. Meanwhile, different entities have devoted themselves to developing Web tools for helping users quickly and efficiently search and retrieve relevant publications. These practices, together with maturity in the field of text mining, have led to an increase in the number and quality of various Web tools that provide comparable literature search service to PubMed. In this study, we review 28 such tools, highlight their respective innovations, compare them to the PubMed system and one another, and discuss directions for future development. Furthermore, we have built a website dedicated to tracking existing systems and future advances in the field of biomedical literature search. Taken together, our work serves information seekers in choosing tools for their needs and service providers and developers in keeping current in the field.

Database URL: http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/search

Introduction and background

Literature search refers to the process in which people use tools to search for literature relevant to their individual needs. In the context of this review, tools are Web-based online systems; literature is limited to the biomedical domain; and typical user information needs include, but are not limited to, finding the bibliographic information about a specific article, or searching for publications pertinent to a specific topic (e.g. a disease). With the ease of Internet access, the amount of biomedical literature in electronic format is on the rise. As pointed out in previous work and shown in Figure 1, the size of the bibliome has grown exponentially over the past few years (1). As of 2010, there are over 20 million citations indexed through PubMed, a free Web literature search service developed and maintained by the National Center for Biotechnology Information (NCBI). PubMed is part of NCBI’s Entrez retrieval system, which provides access to a diverse set of 38 databases (2). PubMed currently includes citations and abstracts from over 5000 life science journals for biomedical articles back to 1948. Since its inception, PubMed has served as the primary tool for electronically searching and retrieving biomedical literature. Millions of queries are issued each day by users around the globe (3), who rely on such access to keep abreast of the state of the art and make discoveries in their own fields.

Figure 1.

Growth of PubMed citations from 1986 to 2010. Over the past 20 years, the total number of citations in PubMed has increased at a ∼4% growth rate. There are currently over 20 million citations in PubMed. The 2010 data are partial (through December 1).

Although PubMed provides a broad, up-to-date and efficient search interface, it has become more and more challenging for its users to quickly identify information relevant to their individual needs, owing mainly to the ever-growing biomedical literature. As a result, users are often overwhelmed by the long list of search results: over one-third of PubMed queries result in 100 or more citations (3). In response to this problem of information overload, the NCBI has made efforts (see detailed discussion in ‘Changes to PubMed and looking into the future’ section) to enhance standard PubMed searches by suggesting more specific queries (4). At the same time, the free availability of MEDLINE data and Entrez Programming Utilities (2) makes it possible for external entities, from either academia or industry, to create alternative Web tools that are complementary to PubMed.

We present herein a list of 28 such systems, group them by their unique features, compare their differences (with PubMed and one another), and highlight their individual innovations. First and foremost, we aim to provide general readers an overview of PubMed and its recent development, as well as short summaries for other comparable systems that are freely accessible from the Internet. The second objective is to provide researchers, developers and service providers a summary of innovative aspects in recently developed systems, as well as a comparison of different systems. Finally, we have developed a website that is dedicated to online biomedical literature search systems. In addition to the systems discussed in this article, we will keep it updated with new systems so that readers can always be informed of the most current advances in the field.

We believe this work represents the most comprehensive review of systems for seeking information in biomedical literature to date. Unlike many other review articles on text-mining systems (5–11), we limited our focus exclusively to systems that are: (i) for biomedical literature search and (ii) comparable to the PubMed system. The most comparable work is an earlier survey of 18 tools in 2008 (12). However, our review is significantly different in several major aspects. First, the majority of the systems (19/28) in our review were not previously discussed due to different selection criteria or emergence since 2008. Second, we use different classification criteria for categorizing and comparing systems so readers can find discussion from different perspectives. Third, we provide a more detailed overview of each system and its unique features. In particular, we describe PubMed and its recent development in greater detail based on our own experience. Lastly, we have built a website with links to existing systems and mechanisms for registering future systems. All together, our work complements the previous survey, and more importantly it provides one-stop shopping for biomedical literature search systems.

PubMed: the primary tool for searching biomedical literature

Contents and intended audience

PubMed’s intended users include researchers, healthcare professionals and the general public, who either need specific articles (e.g. searching with an article title) or, more generally, search for the most relevant articles pertaining to their individual interests (e.g. information about a disease). A general workflow of how users interact with PubMed is displayed in Figure 2: a user queries PubMed or other similar systems for a particular biomedical information need. Offered a set of retrieved documents, the user can browse the result set and subsequently click to view abstracts or full-text articles, issue a new query, or abandon the current search.

Figure 2.

Overview of general user interactions with PubMed (or similar systems) for searching biomedical literature. Adapted from Islamaj Dogan et al., (3).

From a search perspective, PubMed takes as input natural language, free-text keywords and returns a list of citations that match the input keywords (PubMed ignores stopwords). Its search strategy has two major characteristics. First, by default it adds Boolean operators into user queries and uses automatic term mapping (ATM). Specifically, the Boolean operator ‘AND’ is inserted between the terms of multi-term user queries to require retrieved documents to contain all the user keywords. For example, if a user issued the query ‘pubmed search’, the Boolean operator ‘AND’ would be automatically inserted between the two words as ‘pubmed AND search’.

In addition, PubMed automatically compares and maps keywords from a user query to lists of pre-indexed terms (e.g. Medical Subject Headings, MeSH®) through its ATM process (http://www.nlm.nih.gov/pubs/techbull/mj08/mj08_pubmed_atm_cite_sensor.html; 13). That is, if a user query can be mapped to one or more MeSH concepts, PubMed will automatically add the corresponding MeSH term(s) to the original query. As a result, in addition to retrieving documents containing the query terms, PubMed also retrieves documents indexed with those MeSH terms. Taking the earlier example ‘pubmed search’ for illustration: because the word ‘pubmed’ can be mapped to MeSH, the final executed search is ‘[“pubmed” (MeSH terms) OR “pubmed” (all fields)] AND “search” (all fields)’, where the PubMed search tags (all fields) and (MeSH terms) indicate that the preceding word will be searched in all indexed fields or only in the MeSH indexing field, respectively.
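The two query-rewriting steps described above (implicit AND insertion and MeSH-based expansion) can be sketched roughly as follows. This is an illustration only, not PubMed's actual implementation: the names `TOY_MESH_TERMS`, `TOY_STOPWORDS` and `expand_query` are hypothetical, the lookup tables are tiny stand-ins for PubMed's real translation tables and stopword list, and the bracketed tags mirror the field-tag notation PubMed displays for translated queries.

```python
# Hypothetical stand-ins for PubMed's pre-indexed term lists and stopword list.
TOY_MESH_TERMS = {"pubmed"}
TOY_STOPWORDS = {"a", "an", "the", "of", "in", "for"}


def expand_query(query: str) -> str:
    """Rewrite a free-text query: drop stopwords, expand terms that map to
    a (toy) MeSH entry, and join the resulting clauses with AND."""
    clauses = []
    for term in query.lower().split():
        if term in TOY_STOPWORDS:
            continue  # stopwords are ignored
        if term in TOY_MESH_TERMS:
            # Term maps to MeSH: search the MeSH index OR all fields.
            clauses.append(f'("{term}"[MeSH Terms] OR "{term}"[All Fields])')
        else:
            # No mapping: search the term in all fields only.
            clauses.append(f'"{term}"[All Fields]')
    # Implicit Boolean AND between the clauses of a multi-term query.
    return " AND ".join(clauses)


print(expand_query("pubmed search"))
# ("pubmed"[MeSH Terms] OR "pubmed"[All Fields]) AND "search"[All Fields]
```

Because the expansion only ever adds an OR branch for a mapped term, the expanded query retrieves a superset of what the plain keyword query would retrieve, which is why this kind of mapping tends to raise recall.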

The second major characteristic of PubMed is its choice of ranking and displaying search results in reverse chronological order. More specifically, by default PubMed returns matched citations ordered by the date on which they were first entered in PubMed, most recent first. This date is formally termed the Entrez Date (EDAT) in PubMed.
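As a minimal sketch of this default ordering, the snippet below sorts a handful of citations by Entrez Date, newest first. The `Citation` record, the function name and the PMIDs and dates are all made up for illustration; they are not taken from PubMed.

```python
from datetime import date
from typing import List, NamedTuple


class Citation(NamedTuple):
    pmid: int   # hypothetical identifier
    edat: date  # Entrez Date: when the citation was first entered


def sort_by_entrez_date(citations: List[Citation]) -> List[Citation]:
    """Return citations in reverse chronological order of their EDAT."""
    return sorted(citations, key=lambda c: c.edat, reverse=True)


hits = [
    Citation(101, date(2009, 5, 1)),
    Citation(102, date(2010, 8, 23)),
    Citation(103, date(2008, 1, 15)),
]
print([c.pmid for c in sort_by_entrez_date(hits)])  # [102, 101, 103]
```

Note that this ordering is independent of the query: two citations that match equally well, or very differently, are separated only by their entry dates.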

Other tools comparable to PubMed

Standards for selecting comparable systems

In this work, we selected systems for review based on the following three criteria. First, they should be Web-based and operate on content equivalent or similar to that of PubMed. Systems that are designed to search beyond abstracts, such as full text (e.g. PubMed Central; Google Scholar) or figures/tables [e.g. BioText (14); Yale image finder (15)], are thus not included for consideration in this work. Moreover, we focus on tools developed specifically for the biomedical domain. Hence, some general Web-based services such as Google Scholar are excluded from the discussion. Second, a system should be capable of searching an arbitrary topic in the biomedical literature as opposed to some limited areas. Although most citations in PubMed are of biologically relevant subjects (e.g. gene or disease), the topics in the entire biomedical literature are of a much broader coverage. For example, they include a number of interdisciplinary subjects such as bioinformatics. In other words, the proposed system needs to be developed generally enough so that different kinds of topics can be searched. Third, the online Web system should require no installation or subscription fee (i.e. freely accessible), which allows users to readily experience the service. By these three standards, a total of 28 qualified systems were found and they are listed in Tables 1 and 2 below. Moreover, we classified them into four categories depending on the best match between their most notable features and the category theme. Note that some systems may have features belonging to multiple groups and that within each group, we list systems in reverse chronological order. In Table 1, we show the year when a system was first introduced and highlight major features that distinguish different systems from the technology development perspective. In Table 2, we compare a set of features that affect the value and utility of different tools from a user perspective.
For instance, we report the last content update time for each system, as most users would like to stay informed of the latest publications. Specifically, we used the PubMed content as the study control and searched for the latest PubMed citation (PMID: 20726112 on 23 August 2010) in all the systems during comparison. When the citation can be found in a system, we consider its content as ‘current’ with PubMed. Otherwise, either an exact date (if such information is provided on the website) or an approximate year is given.
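The currency check described above amounts to a simple probe-and-fallback rule, which might be sketched as below. Only the probe PMID comes from the text; the function name, its parameters and the toy lookups are assumptions made for illustration, and in the actual study each system was queried through its own Web interface.

```python
from typing import Callable, Optional

PROBE_PMID = 20726112  # latest PubMed citation on 23 August 2010 (from the text)


def content_status(
    contains_pmid: Callable[[int], bool],
    stated_update: Optional[str] = None,
    approx_year: Optional[int] = None,
) -> str:
    """Label a system's content: 'Current' if the probe citation is found,
    otherwise the stated update date, otherwise an approximate year."""
    if contains_pmid(PROBE_PMID):
        return "Current"
    if stated_update is not None:
        return stated_update  # exact date published by the system itself
    if approx_year is not None:
        return str(approx_year)
    return "Unknown"


# Toy lookups standing in for real searches against each system.
up_to_date_index = {20726112, 19000000}
print(content_status(lambda pmid: pmid in up_to_date_index))         # Current
print(content_status(lambda pmid: False, stated_update="8 June 2010"))  # 8 June 2010
print(content_status(lambda pmid: False, approx_year=2007))          # 2007
```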

Table 1.

PubMed derivatives are grouped according to their most notable features

| Systems | Year | Major features |
|---|---|---|
| Ranking search results | | |
| RefMed | 2010 | Featuring multi-level relevance feedback for ranking |
| Quertle | 2009 | Allowing searches with concept categories |
| MedlineRanker | 2009 | Finding relevant documents through classification |
| MiSearch | 2009 | Using implicit feedback for improving ranking |
| Hakia | 2008 | Powered by Hakia’s proprietary semantic search technology |
| SemanticMEDLINE | 2008 | Powered by Cognition’s proprietary search technology |
| MScanner | 2008 | Finding relevant documents through classification |
| eTBLAST | 2007 | Finding documents similar to input text |
| PubFocus | 2006 | Sorting by impact factor and citation volume |
| Twease | 2005 | Query expansion with relevance ranking technique |
| Clustering results into topics | | |
| Anne O’Tate | 2008 | Clustering by important words, topics, journals, authors, etc. |
| McSyBi | 2007 | Clustering by MeSH or UMLS concepts |
| GoPubMed | 2005 | Clustering by MeSH or GO terms |
| ClusterMed | 2004 | Clustering by MeSH, title/abstract, author, affiliation, or date |
| XplorMed | 2001 | Clustering by extracted keywords from abstracts |
| Extracting and displaying semantics and relations | | |
| MedEvi | 2008 | Providing textual evidence of semantic relations in output |
| EBIMed | 2007 | Displaying proteins, GO annotations, drugs and species |
| CiteXplore | 2006 | EBI’s tool for integrating biomedical literature and data |
| MEDIE | 2006 | Extracting text fragments matching queried semantics |
| PubNet | 2005 | Visualizing literature-derived network of bio-entities |
| Improving search interface and retrieval experience | | |
| iPubMed | 2010 | Allow fuzzy search and approximate match |
| PubGet | 2007 | Retrieving results in PDFs |
| BabelMeSH | 2006 | Multi-language search interface |
| HubMed | 2006 | Export data in multiple format; visualization; etc |
| askMEDLINE | 2005 | Converting questions into formulated search as PICO |
| SLIM | 2005 | Slider interface for PubMed searches |
| PICO | 2004 | Search with patient, intervention, comparison, outcome |
| PubCrawler | 1999 | Alerting users with new articles based on saved searches |

Table 2.

Comparison of system features

| Systems | Content last update | Service provider profile | Source code available | System output format | PubMed ID links | Full-text links | Related article links | Export search results |
|---|---|---|---|---|---|---|---|---|
| RefMed | 2010 | Academic | × | List | ✓ | × | × | × |
| Quertle | 2010 | Private | × | List | ✓ | ✓ | × | ✓ |
| MedlineRanker | Current | Academic | × | List | ✓ | × | × | × |
| MiSearch | Current | Academic | × | List | ✓ | × | × | × |
| Hakia | 2010 | Private | × | List | ✓ | × | × | × |
| SemanticMEDLINE | 8 June 2010 | Private | × | List | ✓ | × | × | × |
| MScanner | 2007 | Academic | ✓ | List | ✓ | × | × | × |
| eTBLAST | 2010 | Academic | × | List | ✓ | × | × | × |
| PubFocus | Current | Private | × | List | × | × | × | × |
| Twease | Current | Academic | ✓ | List | ✓ | × | ✓ | × |
| Anne O’Tate | Current | Academic | × | List | ✓ | × | ✓ | × |
| McSyBi | Current | Academic | × | List | ✓ | × | × | × |
| GoPubMed | Current | Private | × | List | ✓ | ✓ | ✓ | ✓ |
| ClusterMed | Current | Private | × | List | ✓ | × | × | ✓ |
| XplorMed | Current | Academic | × | List | ✓ | × | × | × |
| MedEvi | 2010 | Govn’t | × | Table | ✓ | × | × | × |
| EBIMed | 2010 | Govn’t | × | Table | ✓ | × | × | × |
| CiteXplore | Current | Govn’t | × | List | ✓ | ✓ | × | ✓ |
| MEDIE | 12 October 2009 | Academic | × | List | ✓ | × | × | × |
| PubNet | Current | Academic | × | Graph | ✓ | × | × | ✓ |
| iPubMed | Current | Academic | × | List | ✓ | × | × | × |
| PubGet | Current | Private | × | List | ✓ | ✓ | × | ✓ |
| BabelMeSH | 2010 | Govn’t | × | List | ✓ | ✓ | × | × |
| HubMed | Current | Private | × | List | ✓ | ✓ | ✓ | ✓ |
| askMEDLINE | 2010 | Govn’t | × | List | ✓ | ✓ | ✓ | × |
| SLIM | Current | Govn’t | × | List | ✓ | ✓ | ✓ | × |
| PICO | Current | Govn’t | × | List | ✓ | ✓ | ✓ | × |
| PubCrawler | Current | Academic | × | List | ✓ | × | ✓ | ✓ |

Based on the content of both tables, we have the following observations:

  1. The majority (16/28) of systems contain either ‘Pub’ or ‘Med’ in their names, indicating their strong bond to the PubMed system.

  2. The reviewed systems have been developed continuously during the past 10 or so years, from the introduction of PubCrawler in 1999 to iPubMed, the newest member, in 2010. This is roughly the same period in which significant advances and maturation took place in the fields of text mining and Web technology. Many novel techniques in those two fields (e.g. named entity recognition techniques) were driving forces in the development of various systems reviewed in this work.

  3. Most systems were developed by academic researchers. Yet, several systems also came from the private sector (i.e. Hakia, Cognition, ClusterMed, Quertle) or the public sector (e.g. CiteXplore from the European Bioinformatics Institute). In addition to free access (a requirement for all the systems), the source code of two academic systems (MScanner and Twease) is freely available at their websites under the GNU General Public License.

  4. Similar to the general Web search engines such as Google, the presentation of search results in the reviewed tools is primarily list based. For some systems that perform result clustering, the list can be further grouped into different topics. Other output formats include tabular and graph presentations, which are designed for systems that are able to extract and display semantic relations.

  5. Although only a few systems offer links to full-text and related articles, and allow export to bibliographic management software after searches (desirable functions in literature search), one can always (except in one system) follow the PubMed link to use those utilities.

  6. When comparing the four different development themes, improving ranking and the user interface seem to be the more popular directions. In the following sections, we describe each of the 28 systems in greater detail.

Ranking search results

PubMed returns search results in reverse chronological order by default. In other words, the most recent publications are always returned first. Although returning results in time order has its own advantages, several systems are devoted to seeking alternative strategies for ranking results.

Clustering results into topics

The common theme of the five systems in the second group is the categorization of search results, aiming for quicker navigation and easier management of large numbers of returned results. Such a technique was developed in response to the problem of information overload: users are often overwhelmed by a long list of returned documents. As pointed out in ref. (31), this technique is generally shown to be effective and useful for seeking relevant information from medical journal articles. As discussed in detail below, the five systems mainly differ in the manner by which search results are clustered.

Enriching results with semantics and visualization

The five systems in this group aim to analyze search results and present summarized knowledge of semantics (biomedical concepts and their relationships) based on information extraction techniques. They differ in three aspects: (i) the types of biomedical concepts and relations to be extracted; (ii) the computational techniques used for information extraction; and (iii) how they present extraction results.

Improving search interface and retrieval experience

Systems in this group provide alternative interfaces to the standard PubMed searches. They aim to improve the efficiency of literature search and often take advantage of new Web technologies. They feature novel search/retrieval functions that are currently not available through PubMed, which may be preferred by some users in practice.

Other honorable mentions

Several other systems are noteworthy even though they are not listed in Table 1 due to failing to meet one or more of our predefined requirements:

Use cases beyond typical PubMed searches

Based on the novel features in each system described above, we show in Figure 3 a list of specific use scenarios that are beyond typical searches in PubMed. Specifically, we first identified a diverse set of 12 use cases, to each of which we then attached the applicable systems. For instance, one can use tools surveyed in this work to search for experts on a specific topic or to visualize search results in networks. Although PubMed traditionally could not meet many of the listed special user needs, its recent development has allowed it to perform certain tasks such as identifying similar publications, alerting users with updates and providing feedback in query refinement. More details are presented in ‘Changes to PubMed and looking into the future’ section.

Figure 3.

A diverse set of use cases in which different tools may be used.

Discussions on new features

Comparing the 28 systems to PubMed and each other, we see novel proposals mainly in three areas: searching, results analysis and interface/usability.

Searching

Since most users only examine a few returned results on the first result page [Figure 7 in ref. (3)], displaying citations by relevance is clearly a desirable feature in literature search. The 10 systems listed in ‘Ranking search results’ section differ from PubMed in this regard. Although most of those systems take user keywords as input, they differ from each other in how they process the keywords and subsequently use them to retrieve relevant citations. Like PubMed’s ATM, Twease also has its own query expansion component where additional MeSH terms and others can be added to the original user keywords. This technique can typically boost recall and is especially useful when the original query retrieves few or zero results (13). On the other hand, the other systems listed in ‘Ranking search results’ section mostly aim for improved precision over PubMed’s default reverse-chronological sorting scheme. Their ranking strategies are very different from one another, ranging from traditional IR techniques like explicit/implicit feedback (RefMed/MiSearch) and relevance ranking (Twease), to domain-specific importance factors like journal impact factors and citation numbers (PubFocus), to undisclosed proprietary semantic NLP technologies (Hakia and SemanticMEDLINE).

Results analysis

By default, PubMed returns 20 search results in a page and displays the title, abstract and other bibliographic information when a result is clicked. Recent studies focus on two kinds of extensions to the standard PubMed output. First, because a PubMed search typically results in a long list of citations for manual inspection, systems mentioned in ‘Clustering results into topics’ section aim to provide an aid with a short list of major topics summarized from the retrieved articles. Thus, users can navigate and choose to focus on the subjects of interest. This is similar to building filters for the result set (66). In this regard, choosing appropriate topic terms to cluster search results into meaningful groups is the key to the success of such approaches. Currently, most systems rely on selecting either important words from title/abstract or terms from biomedical controlled vocabularies/ontologies (e.g. MeSH) as representative topic terms.
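One way to picture the controlled-vocabulary variant of this idea is to group retrieved citations under the MeSH headings they are indexed with and show the largest groups first. The sketch below is a simplification under stated assumptions: the function name, the PMIDs and the heading assignments are invented, and none of the reviewed systems is claimed to work exactly this way.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def cluster_by_mesh(
    results: Sequence[Tuple[int, Sequence[str]]]
) -> List[Tuple[str, List[int]]]:
    """Group PMIDs under each MeSH heading assigned to them and return the
    groups ordered by size (largest first), ties broken alphabetically."""
    clusters: Dict[str, List[int]] = defaultdict(list)
    for pmid, mesh_terms in results:
        for term in mesh_terms:
            clusters[term].append(pmid)  # a citation may join several topics
    return sorted(clusters.items(), key=lambda item: (-len(item[1]), item[0]))


# Hypothetical search results: (PMID, MeSH headings).
results = [
    (1, ["Neoplasms", "Humans"]),
    (2, ["Humans"]),
    (3, ["Neoplasms", "Mice"]),
]
for topic, pmids in cluster_by_mesh(results):
    print(topic, pmids)
# Humans [1, 2]
# Neoplasms [1, 3]
# Mice [3]
```

Selecting a topic then acts like a filter on the result set: choosing ‘Mice’ above would narrow three results down to one.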

The second extension to the standard PubMed output is due to the advances in text-mining techniques. In particular, semantic annotation is believed to be one of the probable cornerstones in future scientific publishing (67) despite the fact that its full benefits are yet to be determined. Thus with the development and maturity of techniques in named entity recognition and biomedical information extraction, some systems present summarized results of deep semantic enrichment. Existing systems (‘Enriching results with semantics and visualization’ section) have mostly focused on finding genes, proteins, drugs, diseases and species in free text and their biological relationships such as protein–protein interactions. Problems in these areas have received the most attention in the text mining community (68,69).

Interface and usability

In addition to providing improved search quality, a number of systems strive to provide a better search interface, including various changes to input and output. An innovative feature in iPubMed is ‘search-as-you-type’, which enables users to dynamically choose queries while inspecting retrieved results. Other proposals for alternative input interfaces facilitate user-specific questions (PICO, askMedline), allow non-English queries (BabelMeSH), and promote the use of sliders to set limits (SLIM). With respect to changes to output, there are two major directions. First, two systems employ additional components to make summarized results visible in graphs (ALiBaba and PubNet). Second, several systems provide easier access to PDFs (PubGet) and external citation managers (PubMed assistant; HubMed).

Changes to PubMed and looking into the future

In response to the great need and challenge in literature search, PubMed has also gone through a series of significant changes to better serve its users. As shown in Figure 4, many of the recent changes happened during the same time period the 28 reviewed systems were developed. So they may have learned from each other. Indeed, some features were first developed in PubMed (e.g. related articles) while others in third party applications (e.g. email alerts).

Figure 4.

Technology development timeline for PubMed (in light green color) and other biomedical literature search tools (in light orange color). For PubMed, it shows the starting year when various recent changes (limited to those mentioned in ‘Changes to PubMed and looking into the future’ section) were introduced. For other tools, we show the time period in which tools with various features first appeared.

A new initiative geared towards promoting scientific discoveries was introduced to PubMed a few years ago. Specifically, by providing global search across NCBI’s different databases through the Entrez System (http://www.ncbi.nlm.nih.gov/gquery/), users now have integrated access to all the information stored in different databases about a biological entity, be it related publications, DNA sequences or protein structures. Furthermore, inter-database links have been established and made obvious in search result pages, making related data readily accessible between the literature and NCBI’s other biological databases. For instance, through integrated links originating in PubMed results, users can access information about chemicals in PubChem or protein structures in the Structure database. Another category of discovery components is known as sensors (http://www.nlm.nih.gov/pubs/techbull/nd08/nd08_pm_gene_sensor.html; http://www.nlm.nih.gov/pubs/techbull/mj08/mj08_pubmed_atm_cite_sensor.html). A sensor detects certain types of search terms and provides access to relevant information other than literature. For instance, PubMed’s gene sensor detects gene mentions in user queries and shows links directing users to the associated gene records in Entrez Gene. Although these new additions are specific to PubMed and were developed independently, they nevertheless all reflect the idea of semantically enriching the literature with biological data of various kinds, to achieve the goal of more efficient acquisition of knowledge.

With respect to research and retrieval, there are also several noteworthy endeavors in PubMed development although its default sorting scheme has been kept intact. First, the related article feature was integrated into PubMed so that users can readily examine articles similar in content. eTBLAST has a similar feature, but as explained earlier, the two systems rely on different techniques for obtaining similar documents. Second, specific tools were added into PubMed for different information needs. For instance, the citation matcher is designed for those who search for specific articles. Another example is clinical queries, an interface designed to serve the specific needs of clinicians. It is fundamentally akin to the idea of categorizing search results (‘Ranking search results’ section) because the tool essentially discards any non-clinical results using a set of predefined filters. Finally, in order to help users avoid a long list of returned results and narrow their searches, a new feature named ‘also try’ was recently introduced, which offers query suggestions from the most popular PubMed queries that contain the user search term (4).
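The ‘also try’ idea, suggesting popular past queries that contain the user's search term, can be approximated with a frequency count over a query log. This is only a sketch of the general principle stated above; the function name, the log contents and the cut-off `k` are hypothetical, and the method actually used in PubMed (4) may differ in how it selects and filters suggestions.

```python
from collections import Counter
from typing import Iterable, List


def also_try(term: str, query_log: Iterable[str], k: int = 3) -> List[str]:
    """Suggest up to k of the most frequent logged queries that contain
    the search term as a whole word and are more specific than it."""
    term = term.lower()
    counts = Counter(q.lower() for q in query_log)
    candidates = [
        (q, n) for q, n in counts.items()
        if term in q.split() and q != term
    ]
    # Most popular first; alphabetical order breaks ties deterministically.
    candidates.sort(key=lambda pair: (-pair[1], pair[0]))
    return [q for q, _ in candidates[:k]]


# Hypothetical query log.
log = [
    "breast cancer", "breast cancer", "cancer", "lung cancer",
    "Breast Cancer", "cancer therapy", "lung cancer",
]
print(also_try("cancer", log, k=2))  # ['breast cancer', 'lung cancer']
```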

Regarding the user interface and usability, the My NCBI tool was introduced to PubMed, which lets users select and create filter options, save search results, apply personal preferences like highlighting search terms in results, and share collections of citations. Similar to PubCrawler, it also allows users to set up automatic emails for receiving updates of saved searches. Additional search aids such as a spell checker and query auto-complete have also been deployed in PubMed. Finally, in 2009, the PubMed interface including its homepage was substantially redesigned such that it is now simplified and easier to navigate and use.

Literature search is a fundamentally important problem in research and it will only become harder as the literature grows at a faster speed and broader scope (across the traditional disciplinary boundaries). Therefore we expect continuous developments and new emerging systems in this field. In particular, with the advances in search and Web technologies in general, we are likely to see progress in literature search as well. With the maturity of biomedical text-mining techniques in recognizing biological entities and their relations, better semantic identification and summarization of search results may be achieved, especially for such entities as author names, disorders, genes/proteins and chemicals/drugs as they are repeatedly and heavily sought topics (3,70) in biomedicine. In addition, one key factor for future system developers is the need to keep their content current with the growth of the literature, as literature search has a recency effect: most users still prefer to be informed of the most current findings in the literature. Finally, to be able to provide one-stop shopping for all 28 reviewed systems plus the ones in the ‘Other honorable mentions’ section and keep track of future developments in this area, we have built a website at http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/search. It contains, for every system, a highlight and short description of its unique features, one or more related publications, and a link to the actual system on the Internet. To help busy scientists quickly find appropriate tools for their specific search needs, we have built a set of search filters. For instance, one can narrow down the entire list of systems to only those that keep their content current with PubMed. Future systems will be added to the website either through our quarterly update or by individual request. On the website, we have set up a mechanism for registering future systems.
Once we receive such a request, we will curate the necessary information (e.g. system highlights) about the submitted system and make it immediately available at the website.

Conclusions

By our three selection standards, a total of 28 Web systems were included in this review. They are comparable to PubMed given that they are designed for the same purpose and make use of full or partial PubMed data. We first provided a general description of PubMed including its content and unique characteristics. Next, according to their different features, we classified the 28 systems into four major groups in which we further described each of them in greater detail and showed their differences. Finally we reviewed the 28 systems as a whole and discussed their innovative aspects with respect to searching, result analysis and enrichment, and user interface/usability. This review can directly serve both non-experts and expert users when they wish to find systems other than PubMed. Moreover, the review provides a detailed summary of the recent advances in the field of biomedical literature search. This is particularly useful for existing service providers and anyone interested in future development in the field. Finally, the constructed website offers integrated and ready access to all reviewed systems and provides a venue for registering future systems.

Acknowledgements

The author is grateful for helpful discussions with John Wilbur, Minlie Huang and Natalie Xie.

Funding

Funding for this work and open access charge: Intramural Research Program of the National Institutes of Health, National Library of Medicine.

Conflict of interest: None declared.

