omop-learn
Introductory Tutorial
omop-learn allows OMOP-standard (CDM v5.3 and v6) medical data, such as claims and EHR records, to be processed efficiently for predictive tasks. The library allows users to precisely define cohorts of interest, patient-level time-series features, and target variables of interest. Relevant data is automatically extracted and surfaced in formats suitable for most machine learning algorithms, and the (often extreme) sparsity of patient-level data is fully taken into account to provide maximum performance.
The library provides several benefits for modeling, both in terms of ease of use and performance:
- All that needs to be specified are cohort and outcome definitions, which can often be done using simple SQL queries.
- Our fast data ingestion and transformation pipelines allow for easy and efficient tuning of algorithms. We have seen significant improvements in the out-of-sample performance of predictors after hyperparameter tuning that would take days with SQL queries alone but only minutes with omop-learn.
- We modularize the data extraction and modeling processes, allowing users to adopt new models as they become available with very little modification to the code. Tools ranging from simple regression to deep neural network models can easily be substituted in a plug-and-play manner.
omop-learn serves as a modern Python alternative to the PatientLevelPrediction R library. We allow seamless integration of many Python-based machine learning and data science libraries by supporting generic sklearn-style classifiers. Our new data storage paradigm also allows for more on-the-fly feature engineering than previous libraries.
In this tutorial, we walk through the process of using omop-learn for an end-of-life prediction task on synthetic Medicare patients, with clear applications to improving palliative care. The code used here can also be found in the example notebook and can be run on your own data as you explore omop-learn. The control flow diagram below also links to relevant sections of the library documentation.
1. Defining a Predictive Task
To formally specify our task, we require a set of rules to decide who is included in a group representing the population of interest, patient-level features for each of the members of this group, and an outcome or result per patient. Furthermore, each of these parameters must be specified with respect to a timeframe.
We define our end-of-life task as follows:
For each patient who is over the age of 70 at prediction time, is enrolled in an insurance plan for which we have claims data available for at least 95% of the days of calendar year 2009, and is alive as of March 31, 2010: predict whether the patient will die during the interval between April 1, 2010 and September 30, 2010, using the drugs prescribed, procedures performed, and conditions diagnosed during the year 2009.
omop-learn splits the conversion of this natural-language specification into code into two natural steps. First, we define a cohort of patients, each of whom has an outcome. Second, we generate features for each of these patients. These two steps are kept independent of each other, allowing different cohorts or feature sets to be tested and evaluated very quickly. We explain how cohorts and features are initialized through the example of the end-of-life problem.
1.1 Data Backend Initialization
omop-learn supports a collection of data backend engines depending on where the source OMOP tables are stored: PostgreSQL, Google BigQuery, and Apache Spark. The PostgresBackend, BigQueryBackend, and SparkBackend classes inherit from OMOPDatasetBackend, which defines the set of methods used to interface with the data storage as well as to run feature creation.
Configuration parameters used to initialize the backend are surfaced through Python .env files, for example bigquery.env. For this example, the .env file stores the name of the Google BigQuery project, the schemas to read data from and write the cohort to, and the local directories in which to store feature data and trained models. The backend can then simply be created as:
import os

from dotenv import load_dotenv
# omop-learn imports (paths here follow the package layout described in the Files section)
from omop_learn.backends.bigquery import BigQueryBackend
from omop_learn.utils.config import Config

# Read backend settings from the .env file
load_dotenv("bigquery.env")
config = Config({
    "project_name": os.getenv("PROJECT_NAME"),
    "cdm_schema": os.getenv("CDM_SCHEMA"),
    "aux_cdm_schema": os.getenv("AUX_CDM_SCHEMA"),
    "prefix_schema": os.getenv("PREFIX_SCHEMA"),
    "datasets_dir": os.getenv("OMOP_DATASETS_DIR"),
    "models_dir": os.getenv("OMOP_MODELS_DIR")
})
# Set up the database backend
backend = BigQueryBackend(config)
1.2 Cohort Initialization
OMOP’s PERSON table is the starting point for cohort creation, and is filtered via SQL query. Note that these SQL queries can be written with variable parameters that can be adjusted for different analyses; these parameters are implemented as Python template strings. In this example, we leave the dates as parameters to show how flexible cohort creation can be.
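As a concrete illustration of how such templates behave (a minimal sketch; omop-learn fills in these parameters itself when a cohort is created from a SQL file, as shown later), a parameterized query can be completed with Python’s str.format:
# Minimal sketch of filling named placeholders in a SQL template; the schema
# name here is illustrative, and the date matches the cohort parameters used later.
sql_template = """
select p.person_id
from {cdm_schema}.person p
where extract(year from date '{training_end_date}') - p.year_of_birth > 70
"""
filled_sql = sql_template.format(
    cdm_schema="cdm",
    training_end_date="2009-12-31",
)
print(filled_sql)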
We first want to establish when patients were enrolled in insurance plans to which we have access. We do so using OMOP’s OBSERVATION_PERIOD table. Our SQL logic finds the number of days within our data collection period (all of 2009, in this case) that a patient was enrolled in a particular plan:
death_training_elig_counts as (
select
o.person_id,
o.observation_period_start_date as start,
o.observation_period_end_date as finish,
greatest(
date_diff(
least(o.observation_period_end_date, date '{training_end_date}'),
greatest(o.observation_period_start_date, date '{training_start_date}'),
day
), 0
) as num_days
from {cdm_schema}.observation_period o
inner join eligible_people p
on o.person_id = p.person_id
)
Note that the dates are left as template strings that can be filled in later. Next, we want to filter for patients who are enrolled for at least 95% of the days in our data collection period. We must be careful to include patients who used multiple insurance plans over the course of the year, which we handle by aggregating the intermediate table death_training_elig_counts specified above. Thus, we first aggregate and then collect the person_id field for patients with sufficient coverage over the data collection period:
death_trainingwindow_elig_perc as (
select
person_id
from
death_training_elig_counts
group by
person_id
having
sum(num_days) >= 0.95 * extract(day from (date '{training_end_date}' - date '{training_start_date}'))
)
The next step is to find outcomes, which we do by joining to OMOP’s DEATH table:
death_dates as (
select
p.person_id,
a.death_date
from
{cdm_schema}.person p
inner join
{cdm_schema}.death a
on
p.person_id = a.person_id
)
Then, we select for patients over the age of 70 at prediction time:
eligible_people as (
select p.person_id
from {cdm_schema}.person p
where extract(
year from date '{training_end_date}'
) - p.year_of_birth > 70
)
Finally, we can create the cohort:
select
row_number() over (order by te.person_id) - 1 as example_id,
te.person_id,
date '{training_start_date}' as start_date,
date '{training_end_date}' as end_date,
d.death_date as outcome_date,
cast(coalesce(
(d.death_date between
date '{training_end_date}'
+ interval {gap}
and
date '{training_end_date}'
+ interval {gap}
+ interval {outcome_window}
), false
) as int) as y
from
death_trainingwindow_elig_perc te
left join death_dates d on d.person_id = te.person_id
where
(
d.death_date is null
or d.death_date >= (date '{training_end_date}' + interval {gap})
)
The full cohort creation SQL query can be found here.
Note the following key fields in the resulting table:
| Field | Meaning |
| --- | --- |
| example_id | A unique identifier for each example in the dataset. While in the end-of-life case each patient occurs as a positive example at most once, this is not true for all possible prediction tasks, so this field offers more flexibility than the patient ID alone. |
| y | A column indicating the outcome of interest. Currently, omop-learn supports binary outcomes. |
| person_id | A column indicating the ID of the patient. |
| start_date and end_date | Columns indicating the beginning and end of the time period used for data collection for this patient. These are used downstream for feature generation. |
We are now ready to build a cohort. We construct a Cohort object from the defining SQL script, the relevant data backend, and the set of cohort parameters:
# Template parameters substituted into the cohort SQL
cohort_params = {
    "cohort_table_name": "eol_cohort",
    "schema_name": config.prefix_schema,
    "cdm_schema": config.cdm_schema,
    "aux_data_schema": config.aux_cdm_schema,
    "training_start_date": "2009-01-01",
    "training_end_date": "2009-12-31",
    "gap": "3 month",
    "outcome_window": "6 month",
}

# Build the cohort from the SQL definition
sql_dir = "examples/eol/bigquery_sql"
sql_file = open(f"{sql_dir}/gen_EOL_cohort.sql", 'r')
cohort = Cohort.from_sql_file(sql_file, backend, params=cohort_params)
1.3 Feature Initialization
With a cohort now fully in place, we are ready to associate features with each patient in the cohort. These features will be used downstream to predict outcomes.
The OMOP Standardized Clinical Data tables offer several natural features for a patient, including histories of condition occurrences, procedures, and drugs administered. omop-learn includes SQL scripts that collect time series of these common features automatically for any cohort, allowing a user to quickly set up a feature set. We supply the paths to the feature SQL scripts, as well as a name for each feature, when constructing Feature objects:
sql_dir = "examples/eol/bigquery_sql"

# Temporal features: time series of drug exposures
feature_paths = [f"{sql_dir}/drugs.sql"]
feature_names = ["drugs"]
features = [Feature(n, p) for n, p in zip(feature_names, feature_paths)]

# Nontemporal (static) features: age and gender
ntmp_feature_paths = [f"{sql_dir}/age.sql", f"{sql_dir}/gender.sql"]
ntmp_feature_names = ["age", "gender"]
features.extend([Feature(n, p, temporal=False) for n, p in zip(ntmp_feature_names, ntmp_feature_paths)])
By default, the package assumes that added features are temporal in nature, i.e. that observations are collected over time for a patient. omop-learn also supports nontemporal features, which are assumed to be static for a given time period, such as age and gender. This is specified by setting the flag temporal=False when constructing the Feature object.
Finally, we create an OMOPDataset object to trigger creation of the features via the backend. The initialization arguments include the Config object used to specify backend parameters, the backend itself (e.g. BigQueryBackend), the previously created Cohort object, and the list of Feature objects:
init_args = {
"config" : config,
"name" : "bigquery_eol_cohort",
"cohort" : cohort,
"features": features,
"backend": backend,
"is_visit_dataset": False,
"num_workers": 10
}
dataset = OMOPDataset(**init_args)
Note that feature extraction writes the feature set to local disk in the directory specified by the data_dir initialization argument (if left blank, this defaults to the directory supplied in the .env file). Features are written to a data.json file, which by default stores each patient’s features as a single JSON line. The temporal features are stored as a list of lists under the JSON key visits, in which the outer list indexes dates and each inner list holds the concepts that appeared on that date. The corresponding dates can be extracted from the patient line using the JSON key dates. The person_id and static features such as age and gender are also saved in the JSON. The argument is_visit_dataset=True configures an alternative feature representation in which a single line of data.json represents a visit rather than a patient.
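As a rough illustration of this layout (the keys follow the description above, but the concept names and dates are purely illustrative), a patient line of data.json can be read as follows:
import json

# Illustrative patient line; real files contain the OMOP concepts and dates
# produced by the feature SQL scripts.
line = (
    '{"person_id": 12345, "age": 74, "gender": "F", '
    '"dates": ["2009-02-01", "2009-07-15"], '
    '"visits": [["drug concept A"], ["condition concept B", "procedure concept C"]]}'
)
patient = json.loads(line)
for date, concepts in zip(patient["dates"], patient["visits"]):
    print(date, concepts)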
Feature extraction is written with Python’s multiprocessing library for enhanced performance; the num_workers argument can be used to configure the number of parallel processes. Additional customized features can be created by adding files to the feature SQL directory; the added queries should output rows in the same format as the existing SQL scripts.
2. Ingesting Feature Data
Once features are created using the OMOPDataset object, omop-learn uses sparse tensors in COO format to aggregate features for use in models, with indices accessed via bidirectional hash maps. These are interfaced through the OMOPDatasetSparse class.
For temporal features, this tensor can be accessed through the attribute OMOPDatasetSparse.feature_tensor. The object has three axes corresponding to patients, timestamps, and OMOP concepts, respectively.
In our EOL example, we filter on both the time and concept axes. On the concept axis, we remove OMOP’s catch-all “no matching concept” buckets, since they do not correspond to any real medical feature. On the time axis, we create features by counting how many times each OMOP code has been applied to a patient over the last d days, for several values of d, and then concatenating these counts into a feature vector. Thus, for each backwards-looking window d we must create a separate time filter. This filtering is executed by the OMOPDatasetWindowed class, obtained by calling to_windowed().
# Re-load a pre-built dataset
dataset = OMOPDataset.from_prebuilt(config.datasets_dir)
# Window the omop dataset and split it
window_days = [30, 180, 365, 730, 1500, 5000, 10000]
windowed_dataset = dataset.to_windowed(window_days)
windowed_dataset.split()
The to_windowed() function takes the raw sparse tensor of features, filters it once per window to collect data from the past d days for each d in window_days, and then sums along the time axis to find the total number of times each code was assigned to a patient over those d days. The resulting count matrices are concatenated to build the final feature set of windowed count features. Note that unlike a pure SQL implementation of this kind of feature, omop-learn can quickly rerun the analysis for a different set of windows; this ability to tune parameters allows a validation set to be used to determine optimal values and thus significantly increase model performance. Note also that the windowed data can easily be split into train, validation, and test sets for evaluating model performance by calling the method split() on the windowed dataset.
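From here, any sklearn-style classifier can be trained on the windowed count features. The sketch below uses scikit-learn’s LogisticRegression directly with placeholder attribute names for the split matrices and labels (these names are illustrative, not the library’s actual API; omop-learn’s sparse module, described below, provides its own LogisticRegression wrapper for this purpose):
from sklearn.linear_model import LogisticRegression

# Illustrative only: the attribute names for the split feature matrices and
# labels are placeholders, not omop-learn's actual interface.
X_train, y_train = windowed_dataset.train_features, windowed_dataset.train_labels
X_test, y_test = windowed_dataset.test_features, windowed_dataset.test_labels

# L1-regularized logistic regression handles high-dimensional sparse counts well
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
clf.fit(X_train, y_train)
print("Held-out accuracy:", clf.score(X_test, y_test))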
Files
We review the subdirectories of the omop-learn source package.
backends
The backends interface with the data storage and the compute engine to run feature extraction. We support PostgreSQL, Google BigQuery, and Apache Spark. The set of defining methods is inherited from OMOPDatasetBackend. Note that backend feature creation leverages Python’s multiprocessing library to extract features, parallelized by OMOP person_id.
data
Data methods include the Cohort, Feature, and ConceptTokenizer classes. Cohorts and features can be initialized using the code snippets reviewed above.
The ConceptTokenizer class offers a compact representation of the set of relevant OMOP concepts by providing a mapping between concept indices and names. It also includes a set of special tokens (beginning of sequence, end of sequence, separator, pad, and unknown) for use in language modeling applications.
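The general idea can be sketched with a pair of dictionaries (this only illustrates the index-to-concept mapping and special tokens described above; it is not ConceptTokenizer’s actual interface, and the concept names are made up):
# Illustrative sketch of a bidirectional concept/index mapping with special tokens
special_tokens = ["[BOS]", "[EOS]", "[SEP]", "[PAD]", "[UNK]"]
concepts = ["drug concept A", "condition concept B", "procedure concept C"]

token_to_index = {tok: i for i, tok in enumerate(special_tokens + concepts)}
index_to_token = {i: tok for tok, i in token_to_index.items()}

def encode(concept_sequence):
    # Unknown concepts fall back to the [UNK] token
    unk = token_to_index["[UNK]"]
    return [token_to_index.get(c, unk) for c in concept_sequence]

print(encode(["drug concept A", "some unseen concept"]))  # second item maps to [UNK]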
hf
Utilities for interfacing with Hugging Face libraries are provided, including a mapping from the OMOPDataset object to dataset objects ingestible by Hugging Face models.
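As a rough sketch of what such a mapping enables (illustrative only: the records mirror the data.json layout described earlier, and this is not omop-learn’s own conversion code), patient records can be wrapped in a Hugging Face datasets.Dataset:
from datasets import Dataset

# Illustrative records following the data.json layout described earlier
records = [
    {"person_id": 1, "age": 74, "gender": "F",
     "dates": ["2009-02-01"], "visits": [["drug concept A"]]},
    {"person_id": 2, "age": 81, "gender": "M",
     "dates": ["2009-05-20"], "visits": [["condition concept B"]]},
]
hf_dataset = Dataset.from_list(records)
print(hf_dataset[0]["visits"])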
models
The files transformer.py and visit_transformer.py provide the modeling methods used to create the SARD architecture [Kodialam et al. 2021]. The methods in transformer.py define transformer blocks and multi-head attention in the standard way. The methods in visit_transformer.py define a transformer-based architecture over visits, where each visit consists of OMOP concepts.
sparse
The classes in sparse allow for end-to-end modeling over the created feature representation using sparse tensors in COO format. data.py defines the previously reviewed OMOPDatasetSparse and OMOPDatasetWindowed classes, which aggregate features over multiple time windows. models.py defines a wrapper over sklearn’s LogisticRegression object that integrates tightly with the OMOPDatasetWindowed class to define an end-to-end modeling pipeline.
torch
The classes in data.py define a wrapper around the OMOPDataset object for use with pytorch tensors. Similar to the classes in hf, this allows for quick modeling with torch code. models.py gives some example models that can ingest OMOPDatasetTorch objects, including an alternate implementation of the VisitTransformer.
utils
A variety of utils are provided that support both data ingestion and modeling. config.py defines a simple configuration object used to construct the backend, while the methods in date_utils.py convert between Unix timestamps and datetime objects.
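As a quick illustration of the kind of conversion date_utils.py handles (shown here with the standard library only, not the module’s own functions):
from datetime import datetime, timezone

# Illustrative only: converting a Unix timestamp to a datetime and back
ts = 1262304000                               # 2010-01-01 00:00:00 UTC
dt = datetime.fromtimestamp(ts, tz=timezone.utc)
print(dt.isoformat())                         # 2010-01-01T00:00:00+00:00
print(int(dt.timestamp()))                    # back to 1262304000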
embedding_utils.py defines a gensim word-embedding model used in the end-of-life example notebook.