According to best practices, we need to create a protocol that completely specifies how we plan to execute our study. This protocol will be assessed by the governance boards of the data sources participating in your network study. A template could be used for this, but we prefer to automate the process as much as possible by adding functionality to automatically generate the study protocol from a study specification. We will discuss this in more detail later.
Study implementation

Now that we have completely designed our study, we have to implement it. We have to generate the target and outcome cohorts, and we need to develop the R code that will run against our CDM and execute the full study.
Cohort instantiation

For our study we need to know when a person enters the target and outcome cohorts. This information is stored in a table on the server that contains the cohort start date and cohort end date for all subjects for a specific cohort definition. This cohort table has a very simple structure, as shown below:
* cohort_definition_id, a unique identifier for distinguishing between different types of cohorts, e.g. cohorts of interest and outcome cohorts.
* subject_id, a unique identifier corresponding to the person_id in the CDM.
* cohort_start_date, the date the subject enters the cohort.
* cohort_end_date, the date the subject leaves the cohort.

How do we fill this table according to our cohort definitions? There are two options for this:
* use the interactive cohort builder tool in ATLAS, which can be used to create cohorts based on inclusion criteria and will automatically populate this cohort table.
* write your own custom SQL statements to fill the cohort table.
Both methods are described below for our example prediction problem.
ATLAS cohort builder

Target Cohort: ACE inhibitors
ATLAS allows you to define cohorts interactively by specifying cohort entry and cohort exit criteria. Cohort entry criteria involve selecting one or more initial events, which determine the start date for cohort entry, and optionally specifying additional inclusion criteria which filter the qualifying events. Cohort exit criteria are applied to each cohort entry record to determine the end date when the person’s episode no longer qualifies for the cohort. For the outcome cohort the end date is less relevant. As an example, Figure 6 shows how we created the ACE inhibitors cohort and Figure 7 shows how we created the angioedema cohort in ATLAS.
Outcome Cohort: Angioedema
The T and O cohorts can be found here:
An in-depth explanation of cohort creation in ATLAS is out of scope of this vignette but can be found on the OHDSI wiki pages (link).
Note that when a cohort is created in ATLAS, the cohort ID is needed to extract the data in R. The cohort ID can be found at the top of the ATLAS screen, e.g. 1770617 in Figure 6.
Custom cohorts

It is also possible to create cohorts without the use of ATLAS. Using custom cohort code (SQL) you can create more advanced cohorts if needed.
For our example study, we need to create a table to hold the cohort data, and we need SQL code to instantiate this table for both the ACE inhibitor and angioedema cohorts. Therefore, we create a file called AceAngioCohorts.sql with the following contents:
/***********************************
File AceAngioCohorts.sql
***********************************/
/*
Create a table to store the persons in the T and O cohorts
*/
IF OBJECT_ID('@resultsDatabaseSchema.AceAngioCohort', 'U') IS NOT NULL
DROP TABLE @resultsDatabaseSchema.AceAngioCohort;
CREATE TABLE @resultsDatabaseSchema.AceAngioCohort
(
cohort_definition_id INT,
subject_id BIGINT,
cohort_start_date DATE,
cohort_end_date DATE
);
/*
T cohort: [PatientLevelPrediction vignette]: T : patients who are newly
dispensed an ACE inhibitor
- persons with a drug exposure record of any 'ACE inhibitor' or
any descendants, indexed at the first dispensing
- who have >364 days of prior observation before their first dispensing
*/
INSERT INTO @resultsDatabaseSchema.AceAngioCohort (cohort_definition_id,
subject_id,
cohort_start_date,
cohort_end_date)
SELECT 1 AS cohort_definition_id,
Ace.person_id AS subject_id,
Ace.drug_start_date AS cohort_start_date,
observation_period.observation_period_end_date AS cohort_end_date
FROM
(
SELECT person_id, min(drug_exposure_date) as drug_start_date
FROM @cdmDatabaseSchema.drug_exposure
WHERE drug_concept_id IN (SELECT descendant_concept_id FROM
@cdmDatabaseSchema.concept_ancestor WHERE ancestor_concept_id IN
(1342439,1334456, 1331235, 1373225, 1310756, 1308216, 1363749, 1341927, 1340128, 1335471 /*ace inhibitors*/))
GROUP BY person_id
) Ace
INNER JOIN @cdmDatabaseSchema.observation_period
ON Ace.person_id = observation_period.person_id
AND Ace.drug_start_date >= dateadd(dd,364,
observation_period.observation_period_start_date)
AND Ace.drug_start_date <= observation_period.observation_period_end_date
;
/*
O cohort: [PatientLevelPrediction vignette]: O: Angioedema
*/
INSERT INTO @resultsDatabaseSchema.AceAngioCohort (cohort_definition_id,
subject_id,
cohort_start_date,
cohort_end_date)
SELECT 2 AS cohort_definition_id,
angioedema.person_id AS subject_id,
angioedema.condition_start_date AS cohort_start_date,
angioedema.condition_start_date AS cohort_end_date
FROM
(
SELECT person_id, condition_start_date
FROM @cdmDatabaseSchema.condition_occurrence
WHERE condition_concept_id IN (SELECT DISTINCT descendant_concept_id FROM
@cdmDatabaseSchema.concept_ancestor WHERE ancestor_concept_id IN
(432791 /*angioedema*/) OR descendant_concept_id IN
(432791 /*angioedema*/))
) angioedema
;
This is parameterized SQL which can be used by the SqlRender package. We use parameterized SQL so we do not have to pre-specify the names of the CDM and result schemas. That way, if we want to run the SQL on a different schema, we only need to change the parameter values; we do not have to change the SQL code. By also making use of the translation functionality in SqlRender, we can make sure the SQL code can be run in many different environments.
To execute this SQL against our CDM we first need to tell R how to connect to the server. PatientLevelPrediction uses the DatabaseConnector package, which provides a function called createConnectionDetails. Type ?createConnectionDetails for the specific settings required for the various database management systems (DBMS). For example, one might connect to a PostgreSQL database using this code:
library(DatabaseConnector)
connectionDetails <- createConnectionDetails(
dbms = "postgresql",
server = "localhost/ohdsi",
user = "joe",
password = "supersecret"
)
cdmDatabaseSchema <- "my_cdm_data"
cohortsDatabaseSchema <- "my_results"
cdmVersion <- "5"
The last three lines define the cdmDatabaseSchema and cohortsDatabaseSchema variables, as well as the CDM version. We will use these later to tell R where the data in CDM format live, where we want to create the cohorts of interest, and what version of the CDM is used. Note that for Microsoft SQL Server, database schemas need to specify both the database and the schema, so for example cdmDatabaseSchema <- "my_cdm_data.dbo".
library(SqlRender)
sql <- readSql("AceAngioCohorts.sql")
# the SQL file uses the @cdmDatabaseSchema and @resultsDatabaseSchema parameters
sql <- render(sql,
  cdmDatabaseSchema = cdmDatabaseSchema,
  resultsDatabaseSchema = cohortsDatabaseSchema
)
sql <- translate(sql, targetDialect = connectionDetails$dbms)
connection <- connect(connectionDetails)
executeSql(connection, sql)
In this code, we first read the SQL from the file into memory. In the next line, we replace the two parameter names with the actual values. We then translate the SQL into the dialect appropriate for the DBMS we already specified in the connectionDetails. Next, we connect to the server and submit the rendered and translated SQL.
If all went well, we now have a table with the events of interest. We can check how many events there are per cohort:
sql <- paste(
"SELECT cohort_definition_id, COUNT(*) AS count",
"FROM @cohortsDatabaseSchema.AceAngioCohort",
"GROUP BY cohort_definition_id"
)
sql <- render(sql, cohortsDatabaseSchema = cohortsDatabaseSchema)
sql <- translate(sql, targetDialect = connectionDetails$dbms)
querySql(connection, sql)
## cohort_definition_id count
## 1 1 0
## 2 2 0
Study script creation
In this section we assume that our cohorts have been created, either by using ATLAS or a custom SQL script. We will first explain how to create an R script yourself that will execute the study as we defined it earlier.
Now we can tell PatientLevelPrediction to extract all necessary data for our analysis. This is done using the FeatureExtraction package, which allows you to specify which features (covariates) need to be extracted, e.g. all conditions and drug exposures. It also supports the creation of custom covariates. For more detailed information on the FeatureExtraction package see its vignettes. For our example study we decided to use these settings:
library(PatientLevelPrediction)
library(FeatureExtraction) # provides createCovariateSettings
covariateSettings <- createCovariateSettings(
useDemographicsGender = TRUE,
useDemographicsAge = TRUE,
useConditionGroupEraLongTerm = TRUE,
useConditionGroupEraAnyTimePrior = TRUE,
useDrugGroupEraLongTerm = TRUE,
useDrugGroupEraAnyTimePrior = TRUE,
useVisitConceptCountLongTerm = TRUE,
longTermStartDays = -365,
endDays = -1
)
The final step for extracting the data is to run the getPlpData function. We input the connection details, the database schema where the cohorts are stored, the cohort definition IDs for the target and outcome cohorts, and the washoutPeriod, which is the minimum number of days prior to the cohort index date that a person must have been observed to be included in the data. Finally, we input the previously constructed covariate settings.
databaseDetails <- createDatabaseDetails(
connectionDetails = connectionDetails,
cdmDatabaseSchema = cdmDatabaseSchema,
cohortDatabaseSchema = cohortsDatabaseSchema,
cohortTable = "AceAngioCohort",
cohortId = 1,
outcomeDatabaseSchema = cohortsDatabaseSchema,
outcomeTable = "AceAngioCohort",
outcomeIds = 2
)
restrictPlpDataSettings <- createRestrictPlpDataSettings(
sampleSize = 10000
)
plpData <- getPlpData(
databaseDetails = databaseDetails,
covariateSettings = covariateSettings,
restrictPlpDataSettings = restrictPlpDataSettings
)
Note that if the cohorts are created in ATLAS, the corresponding cohort database schema needs to be selected. There are many additional parameters for the getPlpData function, which are all documented in the PatientLevelPrediction manual. The resulting plpData object uses the Andromeda package to store the information on disk in a way that ensures R does not run out of memory, even when the data are large.
Creating the plpData object can take considerable computing time, and it is probably a good idea to save it for future sessions. Because plpData uses Andromeda, we cannot use R's regular save function. Instead, we have to use the savePlpData() function:
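# save the extracted data to a folder (the folder name is illustrative)
savePlpData(plpData, "angio_in_ace_data")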
We can use the loadPlpData() function to load the data in a future session:
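# load the previously saved data (same illustrative folder name as above)
plpData <- loadPlpData("angio_in_ace_data")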
To completely define the prediction problem, the final study population is obtained by applying additional constraints on the two earlier defined cohorts. For example, a minimum time at risk can be enforced (requireTimeAtRisk, minTimeAtRisk) and we can specify whether this also applies to patients with the outcome (includeAllOutcomes). Here we also specify the start and end of the risk window relative to the target cohort start. For example, if we want the risk window to start 30 days after the at-risk cohort start and end a year later, we can set riskWindowStart = 30 and riskWindowEnd = 365. In some cases the risk window needs to start at the cohort end date. This can be achieved by setting addExposureToStart = TRUE, which adds the cohort (exposure) time to the start date.
In Appendix 1, we demonstrate the effect of these settings on the subset of the persons in the target cohort that end up in the final study population.
In the example below all the settings we defined for our study are imposed:
populationSettings <- createStudyPopulationSettings(
washoutPeriod = 364,
firstExposureOnly = FALSE,
removeSubjectsWithPriorOutcome = TRUE,
priorOutcomeLookback = 9999,
riskWindowStart = 1,
riskWindowEnd = 365,
minTimeAtRisk = 364,
startAnchor = "cohort start",
endAnchor = "cohort start",
requireTimeAtRisk = TRUE,
includeAllOutcomes = TRUE
)
Splitting the data into training/validation/testing datasets
When developing a prediction model using supervised learning (when you have features paired with labels for a set of patients), the first step is to design the development/internal validation process. This requires specifying how to select the model hyper-parameters, how to learn the model parameters and how to fairly evaluate the model. In general, the validation set is used to pick hyper-parameters, the training set is used to learn the model parameters and the test set is used to perform fair internal validation. However, cross-validation can be implemented to pick the hyper-parameters on the training data (so a validation data set is not required). Cross validation can also be used to estimate internal validation (so a testing data set is not required).
In small data, the best approach for internal validation has been shown to be bootstrapping. However, in big data (many patients and many features) bootstrapping is generally not feasible. In big data, our research has shown that the key is to have some form of fair evaluation (a test set or cross-validation). For full details see our BMJ Open paper.
In the PatientLevelPrediction package, the splitSettings define how the plpData are partitioned into training/validation/testing data. Cross validation is always done, but using a test set is optional (when the data are small, it may be optimal to not use a test set). For the splitSettings we can use the type (stratified/time/subject) and testFraction parameters to split the data in a 75%-25% split and run the patient-level prediction pipeline:
splitSettings <- createDefaultSplitSetting(
trainFraction = 0.75,
testFraction = 0.25,
type = "stratified",
nfold = 2,
splitSeed = 1234
)
Note: it is possible to add a custom method to specify how the plpData are partitioned into training/validation/testing data, see vignette('AddingCustomSplitting').
There are numerous data processing settings that a user must specify when developing a prediction model. These are:

* whether to under-sample or over-sample the training data (this may be useful when there is class imbalance, e.g., the outcome is very rare or very common)
* whether to perform feature engineering or feature selection (e.g., create latent variables that are not observed in the data or reduce the dimensionality of the data)
* whether to remove redundant features and normalize the data (this is required for some models)
The default sample settings do nothing; they simply return the trainData as input, see below:
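# default sample settings: the trainData is returned unchanged
sampleSettings <- createSampleSettings()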
However, the current package contains methods for under-sampling the non-outcome patients. To perform under-sampling, the type input should be 'underSample' and numberOutcomestoNonOutcomes must be specified (an integer specifying the number of non-outcomes per outcome). It is possible to add any custom function for over/under sampling, see vignette('AddingCustomSamples'). For example (the object name and ratio below are illustrative):
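# illustrative alternative to the default settings above:
# under-sample so that there is one non-outcome per outcome
underSampleSettings <- createSampleSettings(
  type = "underSample",
  numberOutcomestoNonOutcomes = 1
)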
It is possible to specify a combination of feature engineering functions that take as input the trainData and output a new trainData with different features. The default feature engineering setting does nothing:
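# default feature engineering settings: the features are left as they are
featureEngineeringSettings <- createFeatureEngineeringSettings()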
However, it is possible to add custom feature engineering functions into the pipeline, see vignette('AddingCustomFeatureEngineering').
Finally, the preprocessing settings are required. Here the user can define minFraction: any feature that is observed in the training data for less than this fraction of the patients is removed. So, if minFraction = 0.01, any feature that is seen in less than 1 percent of the target population is removed. The input normalize specifies whether the features are scaled between 0 and 1; this is required for certain models (e.g., LASSO logistic regression). The input removeRedundancy specifies whether features that are observed in all of the target population are removed. For example, using the values discussed above:
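# drop features seen in <1% of patients, scale features to [0, 1],
# and remove features observed for everyone
preprocessSettings <- createPreprocessSettings(
  minFraction = 0.01,
  normalize = TRUE,
  removeRedundancy = TRUE
)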
In the set function of an algorithm the user can specify a list of eligible values for each hyper-parameter. All possible combinations of the hyper-parameters are included in a so-called grid search using cross-validation on the training set. If a user does not specify any value then the default value is used instead.
For example, if we use the following settings for the gradientBoostingMachine: ntrees = c(100, 200), maxDepth = 4, the grid search will apply the gradient boosting machine algorithm with ntrees = 100 and maxDepth = 4 plus the default settings for the other hyper-parameters, and with ntrees = 200 and maxDepth = 4 plus the default settings for the other hyper-parameters. The hyper-parameters that lead to the best cross-validation performance will then be chosen for the final model. For our problem we choose to build a gradient boosting machine model, as sketched below.
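A minimal sketch of the model settings, using the setGradientBoostingMachine function with the ntrees and maxDepth values from the example above (the grid is illustrative, not a recommendation):

# define the model settings passed to runPlp as modelSettings below
gbmModel <- setGradientBoostingMachine(
  ntrees = c(100, 200), # candidate values included in the grid search
  maxDepth = 4 # fixed tree depth
)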
The runPlp function requires the plpData, the outcomeId specifying the outcome being predicted, and the settings populationSettings, splitSettings, sampleSettings, featureEngineeringSettings, preprocessSettings and modelSettings to train and evaluate the model.
gbmResults <- runPlp(
plpData = plpData,
outcomeId = 2,
analysisId = "singleDemo2",
analysisName = "Demonstration of runPlp for training single PLP models",
populationSettings = populationSettings,
splitSettings = splitSettings,
sampleSettings = sampleSettings,
featureEngineeringSettings = featureEngineeringSettings,
preprocessSettings = preprocessSettings,
modelSettings = gbmModel,
logSettings = createLogSettings(),
executeSettings = createExecuteSettings(
runSplitData = TRUE,
runSampleData = TRUE,
runFeatureEngineering = TRUE,
runPreprocessData = TRUE,
runModelDevelopment = TRUE,
runCovariateSummary = TRUE
),
saveDirectory = file.path(tempdir(), "singlePlpExample2")
)
Under the hood the package will now use the R xgboost package to fit a gradient boosting machine model using 75% of the data and will evaluate the model on the remaining 25%. A results data structure is returned containing information about the model, its performance, etc.
You can save the model using:
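# sketch: persist the trained model (the folder path is illustrative)
savePlpModel(gbmResults$model, dirPath = file.path(tempdir(), "model"))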
You can load the model using:
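# reload the model from the same folder
plpModel <- loadPlpModel(file.path(tempdir(), "model"))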
You can also save the full results structure using:
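# save the full results structure to the folder used by the load call below
savePlpResult(gbmResults, file.path(tempdir(), "gbm"))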
To load the full results structure use:
gbmResults <- loadPlpResult(file.path(tempdir(), "gbm"))