1. What is i2b2? What does it stand for?

i2b2 (Informatics for Integrating Biology and the Bedside) was an NIH-funded National Center for Biomedical Computing based at Partners HealthCare System. The i2b2 NCBC developed a scalable informatics framework that is designed to bridge clinical research data and the vast data banks arising from basic science research in order to better understand the genetic bases of complex diseases. The architecture consists of two major pieces. The first is the back-end infrastructure (the "Hive") that takes care of things like security, access rights and managing the underlying data repository. The second piece is an application suite of query and mining tools that allows users to ask questions about the data (the workbench). Visit http://www.i2b2.org for more information.

2. Where was it first developed?

The system was first developed within the Partners HealthCare System in Boston, at Massachusetts General Hospital (MGH). It served as the architecture for their Research Patient Data Registry (RPDR). Investigators at Harvard Medical School (Isaac Kohane, PI), in conjunction with investigators at MGH, applied to the NIH for funding to open-source the code and release it to the general research community. The grant was approved and the first version of the code was released in November 2007.

3. Is anyone else using it?

There is an active i2b2 Academic Users Group composed of several dozen institutions. More information can be found here.

4. What is a research data warehouse?

The term means different things to different people, but in our view, a research data warehouse is a repository that integrates information on patients from multiple sources. These can include electronic health records, lab results, genetic and research data, as well as other public sources such as birth registries and government data such as Medicaid. This information is aggregated, cleaned and de-identified. Once this process is complete, the data are presented to users, who can then query them.

5. Doesn't Cincinnati Children's already have at least one data warehouse?

Yes, but the systems have very different functions and target user groups. There is an Enterprise Data Warehouse that is designed to provide decision support for queries of a financial nature. The hospital is piloting the Epic data warehouse to answer questions related to clinical operations.  The research data warehouse is focused on tasks like cohort identification and hypothesis generation.

6. How does all this relate to Epic? And what is Clarity?

Epic is the electronic health record (EHR) system at Cincinnati Children's. It is designed for patient care and is therefore focused on providing the information and functionality necessary to support the patient visit. It has an associated database, but it is transactional in nature, and architected to support the operation of the EHR.
Clarity is Epic's main reporting environment. It receives nightly updates from the production Epic system (Chronicles) and serves as an operational data store. Reports generated from this system provide information on patient outcomes, clinical effectiveness, etc.

7. Why is i2b2 being extended to serve as a platform for research registries?

There are several reasons.  i2b2 was designed for cohort identification.  Many of the queries asked of a registry are essentially forms of cohort identification: how many patients are on medication X, how many are adhering to evidence-based guideline Y.  In addition, since all of the hospital's electronic medical record data is already being loaded into i2b2, building registries on top of i2b2 removes the need to either: a) load the data into multiple database systems or b) have users manually re-enter the relevant EMR data.  By building research registries using i2b2, users can add data that is not collected in the EMR and they are free to collaborate with outside institutions.  There are a number of extensions that are needed to add the required functionality.  They are described below.  


1. What types of information are in the warehouse?

The primary source of data for the research data warehouse is the Epic EHR.  We have loaded much of the discrete information that is stored in the EHR including demographics (age, gender, race, etc.), diagnoses (ICD-9), allergies, procedures, medication orders, lab results and vitals.  We are in the process of adding additional data including history (surgical, medical, social, family) and other condition-specific variables (stored in flowsheets).  Once this data load is complete, we will work on adding billing information to the warehouse. 
We have also worked with the CCHMC Biobank to include the existence of research samples. Investigators can run a query to determine whether there are patients who match their inclusion/exclusion criteria and also have a sample. If there is a large enough population, they can work with the Biobank to obtain the samples by submitting a request to the Tissue Use Committee. We have also added the existence of genome-wide association studies from the gwadb.

2. What is the i2b2 data model?

The i2b2 framework employs a simple, yet powerful data model. It consists of facts and dimensions. A fact is the piece of information being queried, and the dimensions are groups of hierarchies and descriptors that describe the facts. The i2b2 database utilizes a star schema that consists of one fact table surrounded by numerous dimension tables (see figure below). Facts in i2b2 are observations about a patient, including things like diagnoses, demographics, laboratory results, etc.
Figure 1: i2b2 star schema

Figure 2: Information stored about a laboratory result "fact."

One of the benefits to using this kind of data model is that one can easily add and integrate data from multiple sources without having to redesign the system or change the underlying architecture. All new observations are simply added to the fact table. In order to accurately describe and navigate through these facts, the use of metadata becomes crucial. Metadata is employed by i2b2 to allow the user to create a hierarchical categorization of the different concepts within the database. Query terms are selected from these categorizations. Example categories include diagnoses, procedures, demographics and laboratory tests. In instances where standardized terminologies are used to capture the data, such as ICD-9 for diagnoses and CPT for procedures, those hierarchies can be directly ported into the metadata table.
[Information used in this answer was taken from a presentation at the 2008 i2b2 Academic Users Group Workshop - slides posted on the i2b2 website: https://www.i2b2.org/events/aug08.html.]
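To make the fact/dimension idea concrete, here is a minimal sketch of a star schema using Python's built-in sqlite3 module. The table and column names are modeled loosely on i2b2's (observation_fact, patient_dimension, concept_dimension) but are heavily trimmed, and the sample rows, the slash-separated hierarchy paths and the helper functions build_demo_db and count_patients are all made up for illustration. It is not the schema or the query engine used in our warehouse.

```python
import sqlite3

# Simplified star schema: one fact table plus two dimension tables.
# Names are modeled on i2b2's tables but trimmed for illustration.
SCHEMA = """
CREATE TABLE patient_dimension (
    patient_num INTEGER PRIMARY KEY,
    sex_cd      TEXT,
    birth_date  TEXT
);
CREATE TABLE concept_dimension (
    concept_cd   TEXT PRIMARY KEY,
    concept_path TEXT,   -- position of the concept in the metadata hierarchy
    name_char    TEXT
);
CREATE TABLE observation_fact (
    patient_num INTEGER REFERENCES patient_dimension(patient_num),
    concept_cd  TEXT    REFERENCES concept_dimension(concept_cd),
    start_date  TEXT,
    nval_num    REAL,    -- numeric value, e.g. a lab result
    units_cd    TEXT
);
"""


def build_demo_db():
    """Create an in-memory database with a few invented rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO patient_dimension VALUES (?, ?, ?)",
        [(1, "F", "2001-04-02"), (2, "M", "1999-11-30"), (3, "F", "2005-07-15")],
    )
    conn.executemany(
        "INSERT INTO concept_dimension VALUES (?, ?, ?)",
        [
            ("ICD9:493.90", "/Diagnoses/Respiratory/Asthma/", "Asthma, unspecified"),
            ("ICD9:250.01", "/Diagnoses/Endocrine/Diabetes/", "Type 1 diabetes"),
            ("LAB:GLU", "/Labs/Chemistry/Glucose/", "Glucose"),
        ],
    )
    conn.executemany(
        "INSERT INTO observation_fact VALUES (?, ?, ?, ?, ?)",
        [
            (1, "ICD9:493.90", "2010-01-05", None, None),
            (2, "ICD9:250.01", "2010-02-11", None, None),
            (2, "LAB:GLU", "2010-02-11", 182.0, "mg/dL"),
            (3, "ICD9:493.90", "2010-03-20", None, None),
            (3, "LAB:GLU", "2010-03-20", 91.0, "mg/dL"),
        ],
    )
    return conn


def count_patients(conn, path_prefix):
    """Count distinct patients with at least one fact under a hierarchy branch."""
    row = conn.execute(
        """SELECT COUNT(DISTINCT f.patient_num)
           FROM observation_fact f
           JOIN concept_dimension c ON c.concept_cd = f.concept_cd
           WHERE c.concept_path LIKE ?""",
        (path_prefix + "%",),
    ).fetchone()
    return row[0]
```

In this sketch, adding a new kind of observation means inserting rows into observation_fact (and, if needed, a row into concept_dimension); no table has to be redesigned. Selecting a folder in the hierarchy corresponds to a prefix match on concept_path, so count_patients(conn, "/Diagnoses/") covers every diagnosis beneath that branch.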

3. How does the warehouse team merge and clean the data?

The procedures we use to load and clean the data, particularly from Epic/Clarity, are available upon request. Most of these procedures are specific to CCHMC's Epic implementation and should only be viewed as a guide.
Data from different sources are merged based on patient MRN. For overlapping fields like race, date of birth and sex, when there are discrepancies, we look at the number of times a certain value appears and the system where the value originated before making a decision (some systems are weighted more than others). Once all of the data are cleaned, we replace the patient MRN with a random value. This anonymized version of the database is the one that can be queried by the end user.
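The reconciliation and de-identification steps described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the source names, the weights in SOURCE_WEIGHTS, the tie-breaking rule and the functions resolve_field and pseudonymize are hypothetical, and our actual procedures are more involved.

```python
import secrets
from collections import defaultdict

# Hypothetical source weights: a higher weight means a more trusted system.
SOURCE_WEIGHTS = {"epic": 3.0, "registry": 2.0, "legacy": 1.0}


def resolve_field(observations, weights=SOURCE_WEIGHTS):
    """Pick one value from (source, value) pairs by weighted vote.

    Every occurrence adds its source's weight to the score of that value,
    so both how often a value appears and where it came from matter.
    Unknown sources get a default weight of 1.0 (an assumption).
    """
    scores = defaultdict(float)
    for source, value in observations:
        scores[value] += weights.get(source, 1.0)
    # Iterate over sorted values so that ties are broken deterministically.
    return max(sorted(scores), key=lambda value: scores[value])


def pseudonymize(mrns):
    """Map each distinct MRN to a random, unique surrogate identifier."""
    mapping = {}
    used = set()
    for mrn in mrns:
        if mrn in mapping:
            continue
        token = secrets.token_hex(8)
        while token in used:  # regenerate on the (unlikely) collision
            token = secrets.token_hex(8)
        used.add(token)
        mapping[mrn] = token
    return mapping
```

For example, resolve_field([("legacy", "M"), ("legacy", "M"), ("epic", "F")]) picks "F" with these weights, because a single value from the more heavily weighted source outscores two values from the less trusted one. The mapping returned by pseudonymize would have to be stored securely and kept out of the version of the database that end users query.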

4. How often will the warehouse be updated?

The warehouse is not intended to be real-time. It will always lag the other clinical and research repositories. The process to clean and reconcile all the different data sources is rather lengthy, so we will update the warehouse every 2-4 weeks (the frequency may increase to a weekly refresh in the near future). When this occurs, all of the existing data will be refreshed and replaced, though changes are saved in an audit log.


1. What can I do with the native version of i2b2?

The warehouse is best suited for tasks like cohort identification, hypothesis generation and retrospective data analysis. Automated software tools will facilitate some of these functions, while others will require more of a manual process. The initial software tools will be focused around cohort identification.
We have developed a set of web-based tools that allow the user to query the warehouse after logging in with their unique username and password. Using the workbench (see picture below), a user can drag-and-drop search terms (1) into a Venn diagram-like interface (2). Once executed (3), the query will return the aggregate number of patients meeting the specified criteria (4). We also present a data grid that breaks down the patient count by age, race and gender (5). If there are more than 5 patients in any age, race and gender grouping, that value is presented to the user. If the value is less than 5 (but greater than zero), we denote it with a '*'. In addition, if the total patient count is less than ten, the grid is not displayed and the value returned by the query is replaced with "<10." The user is able to view their previous queries (6), and can regenerate the results in grid format by dragging the previous query to the appropriate location (7).
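The small-count masking rules described above can be sketched as follows. This is a simplified illustration, not the workbench code: the function names mask_cell and present_result are hypothetical, and because the rules above do not say what happens to a grouping of exactly 5 patients, the sketch assumes such a count is displayed (pass min_cell=6 to mask it instead).

```python
def mask_cell(count, min_cell=5):
    """Return the display value for one age/race/gender grouping.

    Assumption: counts of 1 to min_cell - 1 are masked with '*';
    zero and counts of min_cell or more are shown as-is.
    """
    if count == 0:
        return "0"
    if count < min_cell:
        return "*"
    return str(count)


def present_result(total, grid, min_total=10, min_cell=5):
    """Apply the masking rules to a query result.

    total: aggregate patient count returned by the query
    grid:  dict mapping a grouping key, e.g. (age_band, gender), to a count
    """
    if total < min_total:
        # Small totals are reported only as "<10" and the grid is withheld.
        return {"total": f"<{min_total}", "grid": None}
    masked = {key: mask_cell(count, min_cell) for key, count in grid.items()}
    return {"total": str(total), "grid": masked}
```

For example, present_result(40, {("0-9", "F"): 3, ("0-9", "M"): 37}) reports the total as "40", masks the first grouping as "*" and shows "37" for the second.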

We will soon be deploying functionality that will allow i2b2 to serve as a "de-identified registry," allowing investigators to see a longitudinal view of the patient's chart, minus any identifying information.  Access to this version of de-identified chart review will be granted for a specific length of time to individuals who submit a research plan, which will be periodically audited and reviewed by the i2b2 team and the IRB.  

2. What kind of functionality is being added to i2b2?

i2b2 is being extended to serve as a platform for research registries. It will allow for the creation of EMR-based registries, where data is collected in the EMR and then fed into i2b2 for further analysis. We are adding capabilities for data collection, to capture information that is not collected in clinic. We have also developed mechanisms for reporting and visualization.

3. What can't I do?

The warehouse is not really suited for tasks like clinical trials, sample tracking, study administration or providing real-time alerts. Nor is it designed to serve as a clinical repository. While it is true that the warehouse is essentially a massive database, BMI has more appropriate solutions for investigators interested in study and data management. See our Research IT Services web site, or email the BMI Help Desk for more information.

4. What is the process for getting IRB approval to view a fully identified extract?

Contact us.  We work with investigators on a case-by-case basis.

5. What if I have suggestions about additional functionality?

We'd love to hear them.

Integration with Research

1. If I include my research data, will everyone be able to access it?

The only people who will be able to see your data are those to whom you grant authorization. If the information can be provided to the general research community, we will add it to the warehouse. If it cannot, we will mark it so that only you (or others in your group with proper approval) can access it.

2. What benefits do I gain from adding my research data?

At a minimum, when reconciling data from different sources, we will be able to notify the source holders of discrepancies that may arise in overlapping fields (conflicting sex, date of birth, etc.).
Also, when adding research data to information that has already been collected clinically, we can provide investigators with a more comprehensive view of their patients' conditions than might otherwise be possible. Along the same lines, adding specialized research data to the general repository can improve the quality of information for the entire research community.

3. How can I add my research database to the warehouse?

We are working to include sources as quickly as we can. Our initial focus will be on loading the largest sources and those that provide the biggest bang for the buck. We will be contacting researchers to gauge their interest over the coming weeks and months.

4. What do I need to do to ask questions about my data?

In order to ask questions about data in the warehouse, several issues must be addressed:

  • Collection: Is the data needed to answer the question being collected?
  • Modeling: Does an appropriate model of the data exist? Creating a representation for data types like numbers or text from a pick-list is relatively straightforward. For more complex sources like free-text clinical notes or images, one must decide on an appropriate transformation.
  • Integration: How should we handle discrepancies within your data? If a field within your database overlaps with a field in Epic or DocSite, how do we relate the two? How do we handle data conflicts? Is there a source we should treat as the "gold standard?"
  • Analysis: Given the sources available, what types of questions would you like to ask?

Ontologies and Terminologies

1. How are the i2b2 query ontologies constructed?

The query ontologies used in the workbench are largely based on the data available. Our diagnosis data are coded in ICD-9 (with IMO extensions), so that is the hierarchy we use in the workbench. The same holds true for procedures. Our medication and laboratory data are not coded in a terminology with a standard hierarchy, so we had to organize them using other methods.

2. Which terminologies are currently used?

The following table provides a list of the current terminologies/standards used in our warehouse:

Data Type          | Terminology
Diagnoses          | ICD-9 and IMO (ICD-10 coming in 2014)
Procedures         | CPT
Medications        | Medications are classified based on the Medispan hierarchy used in the CCHMC Epic build. (RxNorm coming soon)
Laboratory Results | Laboratory tests are identified by a mixture of LOINC codes and internal Cerner numbers. The tests are listed under the same hierarchy used in Epic.

3. What if I'm interested in a term that's not included?

If the term is a synonym for an individual code or if it is based on a set of codes (for instance, a combination of CPT and ICD-9), it can be added to the hierarchy. If the term is based on data that is being collected but is not yet included in the warehouse, it can also be added once the source data is loaded into the warehouse. If the underlying data is not being collected, however, there is no reason to include the term in the query ontology.
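As a rough illustration of a term defined by a set of codes, the sketch below treats a custom term as a named collection of codes and matches any patient who has at least one of them. The term name, the particular codes, the CUSTOM_TERMS dictionary and the patients_matching function are all hypothetical examples, not entries in our query ontology.

```python
# Hypothetical custom term that combines an ICD-9 and a CPT code.
CUSTOM_TERMS = {
    "Asthma workup (example)": {"ICD9:493.90", "CPT:94010"},
}


def patients_matching(term, facts, terms=CUSTOM_TERMS):
    """Return the set of patients with at least one code belonging to the term.

    facts: iterable of (patient_id, code) pairs, e.g. rows from a fact table
    """
    codes = terms[term]
    return {patient for patient, code in facts if code in codes}
```

Given facts such as [(1, "ICD9:493.90"), (2, "CPT:94010"), (3, "ICD9:250.01")], the example term matches patients 1 and 2. A term like this only returns results if the underlying codes are actually present in the loaded data, which is why terms with no supporting data are not added.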