According to best practices, we need to create a protocol that completely specifies how we plan to execute our study. This protocol will be assessed by the governance boards of the data sources participating in your network study. A template could be used for this, but we prefer to automate the process as much as possible by adding functionality to automatically generate the study protocol from a study specification. We will discuss this in more detail later.
Study implementation

Now that we have completely designed our study, we have to implement it. We have to generate the target and outcome cohorts, and we need to develop the R code that will run against our CDM and execute the full study.
Cohort instantiation

For our study we need to know when a person enters the target and outcome cohorts. This information is stored in a table on the server that contains the cohort start date and cohort end date for all subjects for a specific cohort definition. This cohort table has a very simple structure, as shown below:
* cohort_definition_id, a unique identifier for distinguishing between different types of cohorts, e.g. cohorts of interest and outcome cohorts.
* subject_id, a unique identifier corresponding to the person_id in the CDM.
* cohort_start_date, the date the subject enters the cohort.
* cohort_end_date, the date the subject leaves the cohort.

How do we fill this table according to our cohort definitions? There are two options for this:
* use the interactive cohort builder tool in ATLAS, which can be used to create cohorts based on inclusion criteria and will automatically populate this cohort table.
* write your own custom SQL statements to fill the cohort table.
Both methods are described below for our example prediction problem.
ATLAS cohort builder

Target Cohort: ACE inhibitors
ATLAS allows you to define cohorts interactively by specifying cohort entry and cohort exit criteria. Cohort entry criteria involve selecting one or more initial events, which determine the start date for cohort entry, and optionally specifying additional inclusion criteria which filter the qualifying events. Cohort exit criteria are applied to each cohort entry record to determine the end date when the person’s episode no longer qualifies for the cohort. For the outcome cohort the end date is less relevant. As an example, Figure 6 shows how we created the ACE inhibitors cohort and Figure 7 shows how we created the angioedema cohort in ATLAS.
Outcome Cohort: Angioedema
The T and O cohorts can be found here:
An in-depth explanation of cohort creation in ATLAS is out of scope of this vignette but can be found on the OHDSI wiki pages (link).
Note that when a cohort is created in ATLAS, the cohort ID is needed to extract the data in R. The cohort ID can be found at the top of the ATLAS screen, e.g. 1770617 in Figure 6.
Custom cohorts

It is also possible to create cohorts without the use of ATLAS. Using custom cohort code (SQL) you can create more advanced cohorts if needed.
For our example study, we need to create a table to hold the cohort data, and we need SQL code to instantiate this table for both the ACE inhibitor and angioedema cohorts. Therefore, we create a file called AceAngioCohorts.sql with the following contents:
/***********************************
File AceAngioCohorts.sql
***********************************/
/*
Create a table to store the persons in the T and O cohorts
*/
IF OBJECT_ID('@resultsDatabaseSchema.AceAngioCohort', 'U') IS NOT NULL
DROP TABLE @resultsDatabaseSchema.AceAngioCohort;
CREATE TABLE @resultsDatabaseSchema.AceAngioCohort
(
cohort_definition_id INT,
subject_id BIGINT,
cohort_start_date DATE,
cohort_end_date DATE
);
/*
T cohort: [PatientLevelPrediction vignette]: T : patients who are newly
dispensed an ACE inhibitor
- persons with a drug exposure record of any 'ACE inhibitor' or
any descendants, indexed at the first dispensing
- who have >364 days of prior observation before their first dispensing
*/
INSERT INTO @resultsDatabaseSchema.AceAngioCohort (cohort_definition_id,
subject_id,
cohort_start_date,
cohort_end_date)
SELECT 1 AS cohort_definition_id,
Ace.person_id AS subject_id,
Ace.drug_start_date AS cohort_start_date,
observation_period.observation_period_end_date AS cohort_end_date
FROM
(
SELECT person_id, min(drug_exposure_date) as drug_start_date
FROM @cdmDatabaseSchema.drug_exposure
WHERE drug_concept_id IN (SELECT descendant_concept_id FROM
@cdmDatabaseSchema.concept_ancestor WHERE ancestor_concept_id IN
(1342439,1334456, 1331235, 1373225, 1310756, 1308216, 1363749, 1341927, 1340128, 1335471 /*ace inhibitors*/))
GROUP BY person_id
) Ace
INNER JOIN @cdmDatabaseSchema.observation_period
ON Ace.person_id = observation_period.person_id
AND Ace.drug_start_date >= dateadd(dd,364,
observation_period.observation_period_start_date)
AND Ace.drug_start_date <= observation_period.observation_period_end_date
;
/*
O cohort: [PatientLevelPrediction vignette]: O: Angioedema
*/
INSERT INTO @resultsDatabaseSchema.AceAngioCohort (cohort_definition_id,
subject_id,
cohort_start_date,
cohort_end_date)
SELECT 2 AS cohort_definition_id,
angioedema.person_id AS subject_id,
angioedema.condition_start_date AS cohort_start_date,
angioedema.condition_start_date AS cohort_end_date
FROM
(
SELECT person_id, condition_start_date
FROM @cdmDatabaseSchema.condition_occurrence
WHERE condition_concept_id IN (SELECT DISTINCT descendant_concept_id FROM
@cdmDatabaseSchema.concept_ancestor WHERE ancestor_concept_id IN
(432791 /*angioedema*/) OR descendant_concept_id IN
(432791 /*angioedema*/))
) angioedema
;
This is parameterized SQL which can be used by the SqlRender package. We use parameterized SQL so we do not have to pre-specify the names of the CDM and result schemas. That way, if we want to run the SQL on a different schema, we only need to change the parameter values; we do not have to change the SQL code. By also making use of the translation functionality in SqlRender, we can make sure the SQL code can be run in many different environments.
To execute this SQL against our CDM we first need to tell R how to connect to the server. PatientLevelPrediction uses the DatabaseConnector package, which provides a function called createConnectionDetails. Type ?createConnectionDetails for the specific settings required for the various database management systems (DBMS). For example, one might connect to a PostgreSQL database using this code:
library(DatabaseConnector)
connectionDetails <- createConnectionDetails(
dbms = "postgresql",
server = "localhost/ohdsi",
user = "joe",
password = "supersecret"
)
cdmDatabaseSchema <- "my_cdm_data"
cohortsDatabaseSchema <- "my_results"
cdmVersion <- "5"
The last three lines define the cdmDatabaseSchema and cohortsDatabaseSchema variables, as well as the CDM version. We will use these later to tell R where the data in CDM format live, where we want to create the cohorts of interest, and what version of the CDM is used. Note that for Microsoft SQL Server, database schemas need to specify both the database and the schema, so for example cdmDatabaseSchema <- "my_cdm_data.dbo".
library(SqlRender)
sql <- readSql("AceAngioCohorts.sql")
# the SQL file uses the @cdmDatabaseSchema and @resultsDatabaseSchema parameters
sql <- render(sql,
  cdmDatabaseSchema = cdmDatabaseSchema,
  resultsDatabaseSchema = cohortsDatabaseSchema
)
sql <- translate(sql, targetDialect = connectionDetails$dbms)
connection <- connect(connectionDetails)
executeSql(connection, sql)
In this code, we first read the SQL from the file into memory. In the next line, we replace the two parameter names with the actual values. We then translate the SQL into the dialect appropriate for the DBMS we already specified in the connectionDetails. Next, we connect to the server and submit the rendered and translated SQL.
If all went well, we now have a table with the events of interest. We can check how many events there are per cohort:
sql <- paste(
"SELECT cohort_definition_id, COUNT(*) AS count",
"FROM @cohortsDatabaseSchema.AceAngioCohort",
"GROUP BY cohort_definition_id"
)
sql <- render(sql, cohortsDatabaseSchema = cohortsDatabaseSchema)
sql <- translate(sql, targetDialect = connectionDetails$dbms)
querySql(connection, sql)
## cohort_definition_id count
## 1 1 0
## 2 2 0
Study script creation
In this section we assume that our cohorts have been created, either by using ATLAS or a custom SQL script. We will first explain how to create an R script yourself that will execute the study as we defined it earlier.
Now we can tell PatientLevelPrediction to extract all necessary data for our analysis. This is done using the FeatureExtraction package, which allows you to specify which features (covariates) need to be extracted, e.g. all conditions and drug exposures. It also supports the creation of custom covariates. For more detailed information on the FeatureExtraction package see its vignettes. For our example study we decided to use these settings:
library(PatientLevelPrediction)
library(FeatureExtraction) # provides createCovariateSettings
covariateSettings <- createCovariateSettings(
useDemographicsGender = TRUE,
useDemographicsAge = TRUE,
useConditionGroupEraLongTerm = TRUE,
useConditionGroupEraAnyTimePrior = TRUE,
useDrugGroupEraLongTerm = TRUE,
useDrugGroupEraAnyTimePrior = TRUE,
useVisitConceptCountLongTerm = TRUE,
longTermStartDays = -365,
endDays = -1
)
The final step for extracting the data is to run the getPlpData function. We input the connection details, the database schema where the cohorts are stored, the cohort definition IDs for the target and outcome cohorts, and the washoutPeriod, which is the minimum number of days prior to the cohort index date that a person must have been observed to be included in the data. Finally, we input the previously constructed covariate settings.
databaseDetails <- createDatabaseDetails(
connectionDetails = connectionDetails,
cdmDatabaseSchema = cdmDatabaseSchema,
cohortDatabaseSchema = cohortsDatabaseSchema,
cohortTable = "AceAngioCohort",
cohortId = 1,
outcomeDatabaseSchema = cohortsDatabaseSchema,
outcomeTable = "AceAngioCohort",
outcomeIds = 2
)
restrictPlpDataSettings <- createRestrictPlpDataSettings(
sampleSize = 10000
)
plpData <- getPlpData(
databaseDetails = databaseDetails,
covariateSettings = covariateSettings,
restrictPlpDataSettings = restrictPlpDataSettings
)
Note that if the cohorts are created in ATLAS, the corresponding cohort database schema needs to be selected. There are many additional parameters for the getPlpData function, which are all documented in the PatientLevelPrediction manual. The resulting plpData object uses the Andromeda package to store the information on disk in a way that ensures R does not run out of memory, even when the data are large.
Creating the plpData object can take considerable computing time, and it is probably a good idea to save it for future sessions. Because plpData uses Andromeda, we cannot use R's regular save function. Instead, we have to use the savePlpData() function:
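# save the extracted data to a folder (the folder name is illustrative)
savePlpData(plpData, "angio_in_ace_data")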
We can use the loadPlpData() function to load the data in a future session:
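# load the previously saved data (same illustrative folder name as above)
plpData <- loadPlpData("angio_in_ace_data")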
To completely define the prediction problem, the final study population is obtained by applying additional constraints on the two earlier defined cohorts. For example, a minimum time at risk can be enforced (requireTimeAtRisk, minTimeAtRisk) and we can specify whether this also applies to patients with the outcome (includeAllOutcomes). Here we also specify the start and end of the risk window relative to the target cohort start. For example, if we want the risk window to start 30 days after the at-risk cohort start and end a year later, we can set riskWindowStart = 30 and riskWindowEnd = 365. In some cases the risk window needs to start at the cohort end date. This can be achieved by setting addExposureToStart = TRUE, which adds the cohort (exposure) time to the start date.
In Appendix 1, we demonstrate the effect of these settings on the subset of the persons in the target cohort that end up in the final study population.
In the example below all the settings we defined for our study are imposed:
populationSettings <- createStudyPopulationSettings(
washoutPeriod = 364,
firstExposureOnly = FALSE,
removeSubjectsWithPriorOutcome = TRUE,
priorOutcomeLookback = 9999,
riskWindowStart = 1,
riskWindowEnd = 365,
minTimeAtRisk = 364,
startAnchor = "cohort start",
endAnchor = "cohort start",
requireTimeAtRisk = TRUE,
includeAllOutcomes = TRUE
)
Splitting the data into training/validation/testing datasets
When developing a prediction model using supervised learning (when you have features paired with labels for a set of patients), the first step is to design the development/internal validation process. This requires specifying how to select the model hyper-parameters, how to learn the model parameters and how to fairly evaluate the model. In general, the validation set is used to pick hyper-parameters, the training set is used to learn the model parameters and the test set is used to perform fair internal validation. However, cross-validation can be implemented to pick the hyper-parameters on the training data (so a validation data set is not required). Cross validation can also be used to estimate internal validation (so a testing data set is not required).
In small data, the best approach for internal validation has been shown to be bootstrapping. However, in big data (many patients and many features) bootstrapping is generally not feasible. In big data, our research has shown that the key is to have some form of fair evaluation (a test set or cross-validation). For full details see our BMJ Open paper.
In the PatientLevelPrediction package, the splitSettings define how the plpData are partitioned into training/validation/testing data. Cross validation is always done, but using a test set is optional (when the data are small, it may be optimal to not use a test set). For the splitSettings we can use the type (stratified/time/subject) and testFraction parameters to split the data in a 75%-25% split and run the patient-level prediction pipeline:
splitSettings <- createDefaultSplitSetting(
trainFraction = 0.75,
testFraction = 0.25,
type = "stratified",
nfold = 2,
splitSeed = 1234
)
Note: it is possible to add a custom method to specify how the plpData are partitioned into training/validation/testing data, see vignette('AddingCustomSplitting').
There are numerous data processing settings that a user must specify when developing a prediction model. These are:

* whether to under-sample or over-sample the training data (this may be useful when there is class imbalance, e.g., the outcome is very rare or very common)
* whether to perform feature engineering or feature selection (e.g., create latent variables that are not observed in the data or reduce the dimensionality of the data)
* whether to remove redundant features and normalize the data (this is required for some models)
The default sample settings do nothing; they simply return the trainData as input, see below:
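# default sample settings: the trainData is returned unchanged
sampleSettings <- createSampleSettings()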
However, the current package contains methods for under-sampling the non-outcome patients. To perform under-sampling, the type input should be 'underSample' and numberOutcomestoNonOutcomes must be specified (an integer specifying the number of non-outcomes per outcome). It is possible to add any custom function for over/under sampling, see vignette('AddingCustomSamples'). For example (the object name and ratio below are illustrative):
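# illustrative alternative to the default settings above:
# under-sample so that there is one non-outcome per outcome
underSampleSettings <- createSampleSettings(
  type = "underSample",
  numberOutcomestoNonOutcomes = 1
)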
It is possible to specify a combination of feature engineering functions that take as input the trainData and output a new trainData with different features. The default feature engineering setting does nothing:
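# default feature engineering settings: the features are left as they are
featureEngineeringSettings <- createFeatureEngineeringSettings()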
However, it is possible to add custom feature engineering functions into the pipeline, see vignette('AddingCustomFeatureEngineering').
Finally, the preprocessing settings are required. Here the user can define minFraction: any feature that is observed in the training data for less than this fraction of the patients is removed. So, if minFraction = 0.01, any feature that is seen in less than 1 percent of the target population is removed. The input normalize specifies whether the features are scaled between 0 and 1; this is required for certain models (e.g., LASSO logistic regression). The input removeRedundancy specifies whether features that are observed in all of the target population are removed. For example, using the values discussed above:
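# drop features seen in <1% of patients, scale features to [0, 1],
# and remove features observed for everyone
preprocessSettings <- createPreprocessSettings(
  minFraction = 0.01,
  normalize = TRUE,
  removeRedundancy = TRUE
)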
In the set function of an algorithm the user can specify a list of eligible values for each hyper-parameter. All possible combinations of the hyper-parameters are included in a so-called grid search using cross-validation on the training set. If a user does not specify any value then the default value is used instead.
For example, if we use the following settings for the gradientBoostingMachine: ntrees = c(100, 200), maxDepth = 4, the grid search will apply the gradient boosting machine algorithm with ntrees = 100 and maxDepth = 4 plus the default settings for the other hyper-parameters, and with ntrees = 200 and maxDepth = 4 plus the default settings for the other hyper-parameters. The hyper-parameters that lead to the best cross-validation performance will then be chosen for the final model. For our problem we choose to build a gradient boosting machine model, as sketched below.
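A minimal sketch of the model settings, using the setGradientBoostingMachine function with the ntrees and maxDepth values from the example above (the grid is illustrative, not a recommendation):

# define the model settings passed to runPlp as modelSettings below
gbmModel <- setGradientBoostingMachine(
  ntrees = c(100, 200), # candidate values included in the grid search
  maxDepth = 4 # fixed tree depth
)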
The runPlp function requires the plpData, the outcomeId specifying the outcome being predicted, and the settings populationSettings, splitSettings, sampleSettings, featureEngineeringSettings, preprocessSettings and modelSettings to train and evaluate the model.
gbmResults <- runPlp(
plpData = plpData,
outcomeId = 2,
analysisId = "singleDemo2",
analysisName = "Demonstration of runPlp for training single PLP models",
populationSettings = populationSettings,
splitSettings = splitSettings,
sampleSettings = sampleSettings,
featureEngineeringSettings = featureEngineeringSettings,
preprocessSettings = preprocessSettings,
modelSettings = gbmModel,
logSettings = createLogSettings(),
executeSettings = createExecuteSettings(
runSplitData = TRUE,
runSampleData = TRUE,
runFeatureEngineering = TRUE,
runPreprocessData = TRUE,
runModelDevelopment = TRUE,
runCovariateSummary = TRUE
),
saveDirectory = file.path(tempdir(), "singlePlpExample2")
)
Under the hood the package will now use the R xgboost package to fit a gradient boosting machine model using 75% of the data and will evaluate the model on the remaining 25%. A results data structure is returned containing information about the model, its performance, etc.
You can save the model using:
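# sketch: persist the trained model (the folder path is illustrative)
savePlpModel(gbmResults$model, dirPath = file.path(tempdir(), "model"))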
You can load the model using:
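# reload the model from the same folder
plpModel <- loadPlpModel(file.path(tempdir(), "model"))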
You can also save the full results structure using:
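# save the full results structure to the folder used by the load call below
savePlpResult(gbmResults, file.path(tempdir(), "gbm"))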
To load the full results structure use:
gbmResults <- loadPlpResult(file.path(tempdir(), "gbm"))