
Patient Prediction Library

Introductory Tutorial

Patient Prediction Library (PPL) allows OMOP-standard (CDM v6) medical data, such as claims and EHR information, to be processed efficiently for predictive tasks. The library allows users to precisely define cohorts of interest, patient-level time series features, and target variables. Relevant data is automatically extracted and surfaced in formats suitable for most machine learning algorithms, and the (often extreme) sparsity of patient-level data is fully taken into account to provide maximum performance.

The library provides several benefits for modelling, both in terms of ease of use and performance:

PPL serves as a modern Python alternative to the PatientLevelPrediction R library. We allow seamless integration of many Python-based machine learning and data science libraries by supporting generic sklearn-style classifiers. Our new data storage paradigm also allows for more on-the-fly feature engineering than previous libraries.

In this tutorial, we walk through the process of using PPL for an end-of-life prediction task for Medicare patients with clear applications to improving palliative care. The code used can also be found in the example notebook, and can be run on your own data as you explore PPL. The control flow diagram below also links to relevant sections of the library documentation.

1. Defining a Predictive Task

To formally specify our task, we require a set of rules to decide who is included in a group representing the population of interest, patient-level features for each of the members of this group, and an outcome or result per patient. Furthermore, each of these parameters must be specified with respect to a timeframe.

[Diagram of a Predictive Task Specification]

We define our end-of-life task as follows:

For each patient who is on Medicare, and is enrolled in an insurance plan for which we have claims data available for 95% of the days of calendar year 2016, and is alive as of March 31, 2017: predict if the patient will die during the interval of time between April 1, 2017 and September 30, 2017, using data including the drugs prescribed, procedures performed, conditions diagnosed, and the medical specialties of the clinicians who cared for the patient during 2016.

PPL splits the conversion of this natural language specification of a task to code into two natural steps. First, we define a cohort of patients, each of which has an outcome. Second, we generate features for each of these patients – these two steps are kept independent of each other in PPL, allowing different cohorts or feature sets to very quickly be tested and evaluated. We explain how cohorts and features are initialized through the example of the end-of-life problem.

1.1 Cohort Initialization

OMOP’s PERSON table is the starting point for cohort creation, and is filtered via SQL query. Note that these SQL queries can be written with variable parameters which can be adjusted for different analyses. These parameters are implemented as Python templates. In this example, we leave dates as parameters to show how cohort creation can be flexible.
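As an illustration of the mechanics, here is a minimal sketch of how such { parameter } placeholders could be substituted into a query string before execution; the fill_params helper below is hypothetical and stands in for the library's own templating:

import re

def fill_params(sql_template, params):
    # Replace each "{ name }" placeholder (whitespace inside the
    # braces optional) with the corresponding parameter value.
    return re.sub(r'\{\s*(\w+)\s*\}',
                  lambda m: str(params[m.group(1)]),
                  sql_template)

template = "select person_id from cdm.observation_period " \
           "where observation_period_start_date >= date '{ training_start_date }'"
print(fill_params(template, {'training_start_date': '2016-01-01'}))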

We first want to establish when patients were enrolled in insurance plans which we have access to. We do so using OMOP’s OBSERVATION_PERIOD table. Our SQL logic finds the number of days within our data collection period (all of 2016, in this case) that a patient was enrolled in a particular plan:

death_training_elig_counts as (
        select
            person_id,
            observation_period_start_date as start,
            observation_period_end_date as finish,
            greatest(
                least (
                    observation_period_end_date,
                    date '{ training_end_date }'
                ) - greatest(
                    observation_period_start_date,
                    date '{ training_start_date }'
                ), 0
            ) as num_days
        from cdm.observation_period
    )

Note that the dates are left as template strings that can be filled in later. Next, we want to filter for patients who are enrolled for 95% of the days in our data collection period. We must be careful to include patients who used multiple different insurance plans over the course of the year, so we aggregate the intermediate table death_training_elig_counts defined above and then collect the person_id field for patients with sufficient coverage over the data collection period:

death_trainingwindow_elig_perc as (
        select
            person_id
        from
            death_training_elig_counts
        group by
            person_id
        having
            sum(num_days) >= 0.95 * (date '{ training_end_date }' - date '{ training_start_date }')
    )

The next step is to find outcomes. We do so only for Medicare patients, using a proprietary non-OMOP table cdm_aux.medicare_ppl that lists the person_id of every Medicare patient in our data, demonstrating the use of an auxiliary schema:

death_dates as (
        select
            p.person_id,
            p.death_datetime
        from
            cdm.person p
        inner join
            cdm_aux.medicare_ppl m
        on
            p.person_id = m.person_id
    )

Finally, we can create the cohort:

    select
        row_number() over (order by p.person_id) - 1 as example_id,
        p.person_id,
        date '{ training_start_date }' as start_date,
        date '{ training_end_date }' as end_date,
        d.death_datetime as outcome_date,
        coalesce(
            (d.death_datetime between
                date '{ training_end_date }'
                 + interval '{ gap }'
                and
                date '{ training_end_date }'
                 + interval '{ gap }'
                 + interval '{ outcome_window }'
            ), false
        )::int as y
    from
        cdm_aux.medicare_ppl p
        inner join death_trainingwindow_elig_perc te on te.person_id = p.person_id
        left join death_dates d on d.person_id = p.person_id
    where
        (
            d.death_datetime is null
            or d.death_datetime >= (date '{ training_end_date }' + interval '{ gap }')
        )

The full cohort creation SQL query can be found here.

Note the following key fields in the resulting table:

| Field | Meaning |
| --- | --- |
| example_id | A unique identifier for each example in the dataset. While in the case of end-of-life each patient will occur as a positive example at most once, this is not the case for all possible prediction tasks, so this field offers more flexibility than using the patient ID alone. |
| y | A column indicating the outcome of interest. Currently, PPL supports binary 0/1 outcomes. |
| person_id | A column indicating the ID of the patient. |
| start_date and end_date | Columns indicating the beginning and end of the time period to be used for data collection for this patient. These are used downstream for feature generation. |

We are now ready to build a cohort. We use the CohortGenerator class to pass in a cohort name, a path to a SQL script, and relevant parameters in Python:

cohort_name = '__eol_cohort'
cohort_script_path = config.SQL_PATH_COHORTS + '/gen_EOL_cohort.sql'
params = {'schema_name'           : schema_name,
          'aux_data_schema'       : config.CDM_AUX_SCHEMA,
          'training_start_date'   : '2016-01-01',
          'training_end_date'     : '2017-01-01',
          'gap'                   : '3 months',
          'outcome_window'        : '6 months'
         }

cohort = CohortGenerator.Cohort(
    schema_name=schema_name,
    cohort_table_name=cohort_name,
    cohort_generation_script=cohort_script_path,
    cohort_generation_kwargs=params
)

Note that this does not run the SQL queries – the CohortGenerator object currently just stores how to set up the cohort in any system, allowing for more portability. Thus, our next step is to materialize the actual cohort to a table in a specific database db by calling cohort.build(db).
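A minimal sketch of that step, assuming a Database object from dbutils (described under Utils below) with a placeholder connection string:

db = dbutils.Database('postgresql://user:password@localhost:5432/omop')
cohort.build(db)

# The materialized cohort table can then be inspected directly:
print(db.query('select * from {}.{} limit 5'.format(schema_name, cohort_name)))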

1.2 Feature Initialization

With a cohort now fully in place, we are ready to associate features with each patient in the cohort. These features will be used downstream to predict outcomes.

The OMOP Standardized Clinical Data tables offer several natural features for a patient, including histories of condition occurrences, procedures, etc. PPL includes SQL scripts to collect time series of these common features automatically for any cohort, allowing a user to set up a feature set very quickly. To do so, we first initialize a FeatureSet object with a database indicating where feature data is to be found. As with the CohortGenerator, this does not actually create a feature set – that is only done once all parameters are specified. We next select the pre-defined features of choice, and finally build the features for the cohort of interest:

featureSet = FeatureGenerator.FeatureSet(db)
featureSet.add_default_features(
    ['drugs','conditions','procedures'],
    schema_name,
    cohort_name
)
featureSet.build(cohort, cache_file='eol_feature_matrix', from_cached=False)

Since collecting the data is often the most time-consuming part of the setup process, we cache intermediate results in the file named by cache_file (if from_cached is False), and can later use this cached data instead of executing the relevant queries (if from_cached is True).
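For example, a later run can rebuild the same feature set directly from the cache:

featureSet.build(cohort, cache_file='eol_feature_matrix', from_cached=True)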

Additional customized features can also be created by advanced users, by adding files to the feature SQL directory. The added queries should output rows in the same format as the existing SQL scripts.

2. Ingesting Feature Data

Once we have called build on a FeatureSet object, PPL will begin collecting all the relevant data from the OMOP database. To efficiently store and manipulate this information, we use sparse tensors in COO format, with indices accessed via bi-directional hash maps.
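As a toy illustration of this storage scheme (not PPL internals), the snippet below builds a 3-axis COO tensor with the pydata sparse package, alongside a pair of plain dictionaries serving as a bi-directional map for the patient axis:

import numpy as np
import sparse

# Three non-zero entries at (patient, time, concept) coordinates.
coords = np.array([[0, 0, 1],    # patient indices
                   [0, 1, 2],    # time indices
                   [2, 2, 0]])   # concept indices
data = np.ones(3)
tensor = sparse.COO(coords, data, shape=(2, 3, 3))

id_map = {0: 1001, 1: 1002}                      # index -> patient id
id_map_rev = {v: k for k, v in id_map.items()}   # patient id -> index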

The tensor itself can be accessed by calling featureSet.get_sparr_rep(). This object has three axes, corresponding to patients, timestamps, and OMOP concepts respectively. Each axis can be manipulated via a pair of index maps, as outlined below:

Patient axis (featureSet.id_map and featureSet.id_map_rev): each index corresponds to a patient in the cohort. The data for the patient with ID a is at index featureSet.id_map_rev[a], and likewise index b corresponds to the patient with ID featureSet.id_map[b]. For example, to get data only for patients whose IDs are in the list filtered_ids, we would find the relevant indices by running filtered_indices = [featureSet.id_map_rev[id] for id in filtered_ids], then index into the patient axis of the sparse tensor with filtered_tensor = featureSet.get_sparr_rep()[filtered_indices, :, :].

Time axis (featureSet.time_map and featureSet.time_map_rev): each index corresponds to a unique timestamp. At present, the data we use comes in at daily ticks, so each index corresponds to a day on which an OMOP code was assigned to a patient. The index corresponding to timestamp t is featureSet.time_map_rev[t], and the timestamp corresponding to index i is featureSet.time_map[i]. For example, to get data from April 2016 onwards only, we can filter the indices of the time axis by running time_indices_filtered = [i for i in featureSet.time_map if featureSet.time_map[i] >= pd.to_datetime('2016-04-01')], then index into the sparse tensor along the time axis: filtered_tensor = featureSet.get_sparr_rep()[:, time_indices_filtered, :].

OMOP Concept axis (featureSet.concept_map and featureSet.concept_map_rev): each index corresponds to a unique OMOP concept. The index corresponding to concept c is featureSet.concept_map_rev[c], and the concept corresponding to index i is featureSet.concept_map[i]. For example, to keep only codes that match a real OMOP concept, we want to exclude codes that map to “no matching concept”. We can get the indices of the remaining codes with feature_indices_filtered = [i for i in featureSet.concept_map if '- No matching concept' not in featureSet.concept_map[i]], then index in with filtered_tensor = featureSet.get_sparr_rep()[:, :, feature_indices_filtered].
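These filters can be chained. A short sketch combining the time and concept filters described above:

import pandas as pd

tensor = featureSet.get_sparr_rep()
time_idx = [i for i in featureSet.time_map
            if featureSet.time_map[i] >= pd.to_datetime('2016-04-01')]
concept_idx = [i for i in featureSet.concept_map
               if '- No matching concept' not in featureSet.concept_map[i]]
# Apply one axis filter at a time to keep the indexing unambiguous.
filtered_tensor = tensor[:, time_idx, :][:, :, concept_idx]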

In our EOL example, we will filter on both the time and concept axes. We filter on the concept axis exactly as above, removing OMOP’s catch-all “no matching concept” buckets since they don’t correspond to any real medical feature. We create features by collecting counts of how many times each OMOP code has been applied to a patient over the last T days, for several values of T, and then concatenating these variables together into a feature vector. Thus, for each backwards-looking window T we must create a separate time filter; this advanced filtering is already pre-coded into PPL and can be called as follows:

feature_matrix_counts, feature_names = data_utils.window_data(
    window_lengths = [30, 180, 365, 730],
    feature_matrix = feature_matrix_3d,
    all_feature_names = good_feature_names,
    cohort = cohort,
    featureSet = featureSet
)

This function takes in the raw sparse tensor of features, filters several times to collect data from the past d days for each d in window_lengths, then sums along the time axis to find the total count of the number of times each code was assigned to a patient over the last d days. These count matrices are then concatenated to each other to build a final feature set of windowed count features. Note that unlike a pure SQL implementation of this kind of feature, PPL can quickly rerun the analysis for a different set of windows – this ability to tune the parameters allows us to use a validation set to determine optimal values and thus significantly increase model performance.
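Conceptually, each window reduces to a time filter followed by a sum; the sketch below shows the idea (window_data itself handles the details), assuming the cohort's shared end_date and the pydata sparse package:

import pandas as pd
import sparse

end_date = pd.to_datetime('2017-01-01')   # the cohort's end_date
windowed = []
for d in [30, 180, 365, 730]:
    # Time indices that fall within the last d days before end_date.
    recent = [i for i in featureSet.time_map
              if featureSet.time_map[i] > end_date - pd.Timedelta(days=d)]
    # Sum over the time axis: per-patient counts of each code in the window.
    windowed.append(feature_matrix_3d[:, recent, :].sum(axis=1))

# Concatenate the per-window count matrices along the concept axis.
feature_matrix_counts = sparse.concatenate(windowed, axis=1)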

This feature matrix can then be used with any sklearn modelling pipeline – see the example notebook for an example pipeline involving some pre-processing followed by a heavily regularized logistic regression.
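A minimal sketch of such a pipeline, assuming feature_matrix_counts converts to a 2-D scipy sparse matrix with one row per patient, and that y holds the cohort's outcome column in the same row order:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X = feature_matrix_counts.tocsr()   # pydata sparse (2-D) -> scipy CSR
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Heavily regularized logistic regression (small C = strong penalty).
clf = LogisticRegression(penalty='l1', C=0.01, solver='liblinear')
clf.fit(X_train, y_train)
print(clf.predict_proba(X_test)[:, 1])   # predicted end-of-life risk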

Code Documentation

Code documentation can be accessed here

Files

config.py

This file contains global constants and the parameters needed to connect to a postgres database in which OMOP data is stored. The password field has been reset and must be entered to run the code.

Utils

dbutils.py

dbutils.py provides tools for interacting with a postgres database into which a set of OMOP compliant tables have been loaded. The Database object can be instantiated using a standard postgres connection string, and can then be used (via ‘query’, ‘execute’ and ‘fast_query’) to run arbitrary SQL code and return results in Pandas dataframes.
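A sketch of typical usage, assuming the module is importable from the Utils directory and using a placeholder connection string:

from Utils import dbutils

db = dbutils.Database('postgresql://user:password@localhost:5432/omop')
counts = db.query('select count(*) as n from cdm.person')   # returns a DataFrame
print(counts['n'].iloc[0])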

PopulateAux.py

PopulateAux.py allows for the definition of custom tables that do not exist in the OMOP framework but that the user needs across multiple models. These can be instantiated once, kept in an auxiliary schema, and used persistently as needed.

Generators

This directory contains the implementation of classes to store and instantiate Cohorts of patients and sets of Features that can be used for prediction tasks.

CohortGenerator.py

Cohorts are defined by giving the schema in which the cohort table will be materialized, a unique cohort name, and a SQL script that uses OMOP standard tables (and/or user defined auxiliary tables) to generate the cohort itself.

An example script can be found in /sql/Cohorts. As in that script, cohort definitions should give at minimum a unique example ID, a person ID corresponding to the patient’s unique identifier in the rest of the OMOP database, and an outcome column (here denoted by ‘y’) indicating the outcome of interest for this particular patient.

FeatureGenerator.py

The FeatureGenerator file defines two objects: Features and FeatureSets. Features are defined by a SQL script and a set of keyword arguments that can be used to modify the SQL script just before it is run, via Python’s ‘format’ functionality. Several SQL scripts are already pre-implemented and can be seen in /sql/Features. At present, PPL supports time series of binary features; feature SQL scripts should therefore generate tables in which each row identifies a patient, an OMOP concept, and the date on which that concept was recorded, matching the three axes of the sparse tensor described below.

FeatureSet objects simply collect a list of Feature objects. When the ‘build’ function is called, the FeatureSet runs all SQL associated with each Feature and inserts the resulting rows into a highly data-efficient three-dimensional sparse tensor representation, with the three axes of this tensor representing distinct patients, distinct timestamps, and distinct features respectively. The tensor can then be accessed directly and manipulated as needed for any chosen modelling approach.

PL2 Test Driver.ipynb

This notebook walks through all the present functionality of the library via the example of building a relatively simple yet performant end-of-life prediction model from OMOP data loaded from IBC. Use this file as a tutorial and as a reference for the correct way to call the functions in the library.