Creating A Hub Package: ExperimentHub or AnnotationHub

Overview

First, one must decide if an ExperimentHub or AnnotationHub package is appropriate.

The AnnotationHubData package provides tools to acquire, annotate, convert and store data for use in Bioconductor’s AnnotationHub. BED files from the ENCODE project, GTF files from Ensembl, or annotation tracks from UCSC are examples of data that can be downloaded, described with metadata, transformed to standard Bioconductor data types, and stored so that they may be conveniently served up on demand to users via the AnnotationHub client. While data are often manipulated into a more R-friendly form, the data themselves retain their raw content and are not normally filtered or curated like those in ExperimentHub. Each resource has associated metadata that can be searched through the AnnotationHub client interface.

ExperimentHubData provides tools to add or modify resources in Bioconductor’s ExperimentHub. This ‘hub’ houses curated data from courses, publications, or experiments. It is often convenient to store data used in package examples, tests, or vignettes in the ExperimentHub. The resources can be raw data files but are more often R / Bioconductor objects such as GRanges, SummarizedExperiment, data.frame, etc. Each resource has associated metadata that can be searched through the ExperimentHub client interface.
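As a quick illustration of the client interface, a resource can be discovered and loaded roughly as follows (a sketch; the search terms and EH id are hypothetical):

``` r
library(ExperimentHub)
eh <- ExperimentHub()
## search resource metadata; these terms are hypothetical
res <- query(eh, c("scRNAseq", "Mus musculus"))
res
## download and load one resource by its (hypothetical) id
dat <- eh[["EH1234"]]
```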

It is advisable to create a separate package for annotations or experiment data rather than an all-encompassing package of data and code. In some cases a software package may also serve as the front end for the hubs, although this is generally not recommended; if you think you have such a use case, please reach out to hubs@bioconductor.org to confirm before proceeding with a single package rather than the accompanying-package approach.

Setting up a package to use a Hub

New Hub package

Related resources are added to AnnotationHub or ExperimentHub by creating a package. The package should minimally contain the resource metadata, man pages describing the resources, and a vignette. It may also contain supporting R functions the author wants to provide. This is a similar design to the existing Bioconductor experiment data packages or annotation packages, except the data is stored in the Microsoft Azure Genomic Data Lake or another publicly accessible site (like Amazon S3 buckets or institutional servers) instead of the data/ or inst/extdata/ directory of the package. This keeps the package lightweight and allows users to download only the necessary data files.

Below are the steps required for creating the package and adding new resources:

Notify a Bioconductor team member

The man page and vignette examples in the package will not work until the data are available in AnnotationHub or ExperimentHub. If you are not hosting the data on a stable web server (GitHub and Dropbox do not suffice), you should look into a stable option. We highly recommend Zenodo; other options include Cloudflare, S3 buckets, Microsoft Azure Data Lake, or an institutional server. If you do not have access to a suitable location, you can reach out to a Bioconductor team member at hubs@bioconductor.org to discuss. Adding data to the live location also requires contacting hubs@bioconductor.org. To have the data go live in the appropriate hub, the metadata.csv file will have to be created (see the inst/extdata section below) and the DESCRIPTION file of the package will need to be accurate.

Building the package

When a resource is downloaded from one of the hubs, the associated package is loaded in the workspace, making the man pages and vignettes readily available. Because documentation plays an important role in understanding these resources, please take the time to develop clear man pages and a detailed vignette. These documents provide essential background to the user and guide appropriate use of the resources.

Below is an outline of package organization. The files listed are required unless otherwise stated.

- inst/extdata/
- inst/scripts/
- vignettes/
- R/
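The inst/extdata/ directory holds the metadata.csv file describing each resource, and inst/scripts/ holds the make-data.R script describing how the data were produced. A minimal, illustrative metadata.csv fragment is below; all values are hypothetical and only a subset of the required columns is shown (see the Validating section for how to check the full set):

```
Title,Description,BiocVersion,SourceUrl,RDataClass,DispatchClass,Location_Prefix,RDataPath
dataSet1 v1.0.0,An example dataset,3.18,ftp://mylocalserver/singlecellExperiments/dataSet1.Rds,SummarizedExperiment,Rds,ftp://mylocalserver/,singlecellExperiments/dataSet1.Rds
```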

For ExperimentHub resources only:

- zzz.R: Optional. You can include a .onLoad() function in a zzz.R file that exports each resource name (i.e., the metadata.csv field Title) as a function. This allows the data to be loaded by name, e.g., resource123().

``` r
.onLoad <- function(libname, pkgname) {
   ## read the resource titles from the package's metadata.csv
   fl <- system.file("extdata", "metadata.csv", package=pkgname)
   titles <- read.csv(fl, stringsAsFactors=FALSE)$Title
   ## create one accessor function per resource title
   createHubAccessors(pkgname, titles)
}
```

`ExperimentHub::createHubAccessors()` and
`ExperimentHub:::.hubAccessorFactory()` provide the internal
detail. Each resource-named function has a single 'metadata'
argument. When metadata=TRUE, only the metadata are loaded
(equivalent to the single-bracket method on an ExperimentHub
object); when FALSE, the full resource is loaded (equivalent to
the double-bracket method).
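
For example, assuming metadata.csv contains a resource titled 'resource123' (a hypothetical name), the generated accessor behaves as follows:

``` r
## load the full resource (equivalent to the double-bracket method)
x <- resource123()
## retrieve only the metadata (equivalent to the single-bracket method)
info <- resource123(metadata = TRUE)
```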
- man/
- DESCRIPTION / NAMESPACE

Data objects

Large data are not formally part of the software package and are stored separately on a publicly accessible hosted site.

Package review

Once the metadata have been added to the production database, the man pages and vignette can be finalized. When the package passes R CMD build and check, it can be submitted to the package tracker for review. The package should be submitted without any of the data that is now located remotely; this keeps the package lightweight and minimal in size while still providing access to the key large data files stored remotely. If the data files were previously added to the GitHub repository, please see removing large data files and clean git tree to remove the large files and reduce the package size.

Often these data packages are created as a supplement to a software package. There is a process for submitting multiple packages under the same issue.

Adding additional resources to an existing Hub package

Metadata for new versions of the data can be added to the same package as they become available.

Contact hubs@bioconductor.org or maintainer@bioconductor.org with any questions.

Converting a non-AnnotationHub annotation package or non-ExperimentHub experiment data package to utilize the Hubs

The concepts and directory structure of the package stay the same. The main steps involved are:

  1. Restructure the inst/extdata and inst/scripts to include metadata.csv and make-data.R as described in the section above for creating new packages. Ensure the metadata.csv file is formatted correctly by running AnnotationHubData::makeAnnotationHubMetadata() or ExperimentHubData::makeExperimentHubMetadata() on your package.

  2. Add biocViews term “AnnotationHub” or “ExperimentHub” to DESCRIPTION (or “AnnotationHubSoftware”, “ExperimentHubSoftware” if appropriate).

  3. Upload the data to a publicly accessible site and remove the data from the package. See the section on “Storage of Data Files” below.

  4. Once the data is officially added to the hub, update any code to utilize AnnotationHub or ExperimentHub for retrieving the data, as in the sketch after this list.

  5. Push all changes, with a version bump, back to the Bioconductor git.bioconductor.org location.
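
A minimal sketch of the retrieval code in step 4, assuming an ExperimentHub package named MyHubPackage and a hypothetical id:

``` r
library(ExperimentHub)
eh <- ExperimentHub()
## find the resources contributed by this package
res <- query(eh, "MyHubPackage")
## download and load one resource by its id
dat <- res[["EH1234"]]
```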

Bug fixes

A bug fix may involve a change to the metadata, data resource or both.

Update the resource

Remove resources

Removing resources should be done with caution. The intent is that resources in the Hubs support ‘reproducible’ research by providing a stable snapshot of the data. Data made available in Bioconductor version x.y.z should be available for all versions greater than x.y.z. Unfortunately this is not always possible. If you find it necessary to remove data from AnnotationHub/ExperimentHub, please contact hubs@bioconductor.org or maintainer@bioconductor.org for assistance.

When a resource is removed from ExperimentHub or AnnotationHub two things happen: the ‘rdatadateremoved’ field is populated with a date and the ‘status’ field is populated with a reason why the resource is no longer available. Once these changes are made, the ExperimentHub() or AnnotationHub() constructor will not list the resource among the available ids. An attempt to extract the resource with ‘[[’ and the EH/AH id will return an error along with the status message. The function getInfoOnIds() will display metadata information for any resource including resources still in the database but no longer available.
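For example, metadata for a removed resource can still be inspected (a sketch; the AH id is hypothetical):

``` r
library(AnnotationHub)
ah <- AnnotationHub()
## works even for ids the constructor no longer lists
getInfoOnIds(ah, "AH1234")
```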

In general, resources are only removed when they are no longer available (e.g., moved from web location, no longer provided etc.).

To remove a resource from AnnotationHub contact hubs@bioconductor.org or maintainer@bioconductor.org.

Versioning

Versioning of resources is handled by the maintainer. If you plan to provide incremental updates to a file for the same organism / genome build, we recommend including a version in the title of the resource so it is easy to distinguish which is most current. We also recommend, when uploading the data to the genomic data lake or your publicly accessible site, using a directory structure that accounts for versioning.

If you do not include a version, or otherwise make the title unique, multiple files with the same title will be listed in the ExperimentHub or AnnotationHub object. The user will then have to use the ‘rdatadateadded’ metadata field to determine which file is the most current, or try to infer it from the ids, which can lead to confusion.
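
As an illustration, a versioned directory layout on the hosted site with matching titles might look like this (names are hypothetical):

```
MyHubPackage/
    v1.0.0/dataSet1.Rds    <- metadata.csv Title: "dataSet1 v1.0.0"
    v1.1.0/dataSet1.Rds    <- metadata.csv Title: "dataSet1 v1.1.0"
```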

Visibility

Several metadata fields control which resources are visible when a user invokes ExperimentHub()/AnnotationHub(). Records are filtered based on ‘biocVersion’, ‘rdatadateadded’, and ‘rdatadateremoved’:

Once a record is added to ExperimentHub/AnnotationHub it is visible from that point forward until stamped with ‘rdatadateremoved’. For example, a record added on May 1, 2017 with ‘biocVersion’ 3.6 will be visible in all snapshots >= May 1, 2017 and in all Bioconductor versions >= 3.6.

A special filter for OrgDb resources is used in AnnotationHub. Only one OrgDb is available per release/devel cycle, so contributed OrgDb resources added during a devel cycle are masked until the following release. There are options for debugging these masked resources; see ?setAnnotationHubOption.

Storage of Data Files

The data should not be included in the package; this keeps the package lightweight and quick for a user to install, and allows the user to investigate functions and documentation without downloading large data files, proceeding with the download only when necessary. Whenever possible, data should be hosted on a publicly accessible site designated by the package maintainer. If this is not possible, contact a core team member at hubs@bioconductor.org to request options for hosting.

Hosting Data on a Publicly Accessible Site

Data can be accessed through the hubs from any publicly accessible site. The metadata.csv file[s] will need the column Location_Prefix to indicate the hosted site. See the description of the metadata columns/fields below, but as a quick example: if the link to the data file is ftp://mylocalserver/singlecellExperiments/dataSet1.Rds, the breakdown of this entry in the metadata.csv file would be ftp://mylocalserver/ for the Location_Prefix and singlecellExperiments/dataSet1.Rds for the RDataPath. GitHub and Dropbox are not acceptable hosting platforms for data. We highly recommend Zenodo; other possibilities include Cloudflare, S3 buckets, Microsoft Azure Data Lake, or possibly a server located at your home institution.
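
The corresponding metadata.csv columns for this example would therefore be:

```
Location_Prefix       RDataPath
ftp://mylocalserver/  singlecellExperiments/dataSet1.Rds
```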

Uploading Data to Microsoft Azure Genomic Data Lake

In some cases we may allow access to a Bioconductor Microsoft Azure Genomic Data Lake. Instead of providing the data files via Dropbox, FTP, GitHub, etc., we will grant temporary access to an S3 staging directory where you can upload your data for preprocessing.

If interested in hosting on Bioconductor, please email hubs@bioconductor.org and provide the following information:

  1. A description of, or link to, the package and a description of the data to be hosted. Include why existing data already provided in similar packages or in the hubs is not appropriate and why you would like to host your own.

  2. Why a location like Zenodo is not appropriate, and/or verification that you do not have access to an institutional-level hosting location and that something like an S3 bucket or Azure is not available.

  3. The number and size of files to be hosted and the total size to be uploaded.

Please upload the data with the appropriate directory structure, including subdirectories as necessary (i.e., the top directory must be the software package name, then, if applicable, subdirectories for versions, …).

Once the upload is complete, email hubs@bioconductor.org to continue the process. To officially add the data to the hub, the metadata.csv file will need to be created in the GitHub repository.

R Interface via the BiocHubsIngestR package

In some cases we may allow access to a Bioconductor Microsoft Azure Genomic Data Lake. Please email hubs@bioconductor.org to obtain the necessary information. In the examples below we assume the data on your system is in a directory called YourLocalDataDir, and we use the following example credentials, which would be provided by the core team:

  1. username: hubtest1
  2. key: 102da5beeebe1339ef50dd9138589d8e46a354d1ad69a7b909f165d265f38a33
  3. core team bucket name: userdata

A helper R package called BiocHubsIngestR has been created to assist with the upload; it is currently on GitHub. A contributor can use the following commands in R to upload data:

``` r
## install package
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("Bioconductor/BiocHubsIngestR")

## set up authentication
BiocHubsIngestR::auth(username = "hubtest1", password = "102da5beeebe1339ef50dd9138589d8e46a354d1ad69a7b909f165d265f38a33")

## upload data
BiocHubsIngestR::upload("/Local/Path/To/YourLocalDataDir", bucket = "userdata")
```

Command Line via AWS S3 CLI

In some cases we may allow access to a Bioconductor Microsoft Azure Genomic Data Lake. Instead of providing the data files via Dropbox, FTP, GitHub, etc., we will grant temporary access to an S3 bucket where you can upload your data for preprocessing. The command line interface for upload is the AWS S3 Command Line Interface; you should install the AWS CLI on your machine. Please email hubs@bioconductor.org to obtain the necessary information. In the examples below we assume the data on your system is in a directory called YourLocalDataDir, and we use the following example credentials, which would be provided by the core team:

  1. username: hubtest1
  2. key: 102da5beeebe1339ef50dd9138589d8e46a354d1ad69a7b909f165d265f38a33
  3. core team bucket name: userdata

To set up credentials on your system, use the command aws configure --profile <username>. It will take you through prompts for the AWS Access Key ID, AWS Secret Access Key, Default region name, and Default output format. Using our example information, it would look something like the following:

```
> aws configure --profile hubtest1
AWS Access Key ID:  hubtest1
AWS Secret Access Key: 102da5beeebe1339ef50dd9138589d8e46a354d1ad69a7b909f165d265f38a33
Default region name: <leave blank>
Default output format: <leave blank>
```

You would then be able to access the userdata bucket that was set up for you. Use s3 cp to upload data, adding --recursive to upload directories. The general form is:

```
aws --profile <username> \
    --endpoint-url https://<username>.hubsingest.bioconductor.org/ \
    s3 cp --recursive <path to your local directory> \
    s3://<coreteam bucket name>/<local directory name>
```

So using our example data:

```
aws --profile hubtest1 \
    --endpoint-url https://hubtest1.hubsingest.bioconductor.org/ \
    s3 cp --recursive /path/to/YourLocalDataDir \
    s3://userdata/YourLocalDataDir
```

You can check the upload with s3 ls. With our example data it would look something like:

```
aws --profile hubtest1 \
    --endpoint-url https://hubtest1.hubsingest.bioconductor.org/ \
    s3 ls --recursive s3://userdata/
```

In general, all files should be in a folder that matches your package name. Only upload data files; subdirectories may optionally be included to distinguish versions or characteristics of the data (e.g., species, tissue type). Do not upload your entire package directory (i.e., DESCRIPTION, NAMESPACE, R/, etc.).

Once the upload is complete, email hubs@bioconductor.org to continue the process. To officially add the data to the hub, the metadata.csv file will need to be created in the GitHub repository.

Utilizing the Bioconductor Docker container

Coming soon!

Validating

The best way to validate record metadata is to read inst/extdata/metadata.csv (or the aptly named csv file in inst/extdata) using AnnotationHubData::makeAnnotationHubMetadata() or ExperimentHubData::makeExperimentHubMetadata(). If that runs successfully, the metadata are valid and can be entered into the database.
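
A minimal sketch, run against a local checkout of your package (the path is hypothetical):

``` r
## for an experiment data package
ExperimentHubData::makeExperimentHubMetadata("/path/to/MyHubPackage")
## or, for an annotation package
AnnotationHubData::makeAnnotationHubMetadata("/path/to/MyHubPackage")
```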

