
i2b2

Research Data Warehouse

FAQs

Frequently asked questions about the research data warehouse.

Background

1. What is i2b2? What does it stand for?

i2b2 (Informatics for Integrating Biology and the Bedside) is an NIH-funded National Center for Biomedical Computing based at Partners HealthCare System. The i2b2 NCBC is developing a scalable informatics framework that will bridge clinical research data and the vast data banks arising from basic science research in order to better understand the genetic bases of complex diseases. The architecture consists of two major pieces. The first is the back-end infrastructure (the "Hive") that takes care of things like security, access rights and managing the underlying data repository. The second piece is an application suite of query and mining tools that allows users to ask questions about the data (the workbench). Visit http://www.i2b2.org for more information.

2. Where was it first developed?

The system was first developed within the Partners HealthCare System in Boston, at Massachusetts General Hospital (MGH), where it served as the architecture for the Research Patient Data Registry (RPDR). Investigators at Harvard Medical School (Isaac Kohane, PI), in conjunction with investigators at MGH, applied to the NIH for funding to open-source the code and release it to the general research community. The grant was approved and the first version of the code was released in November 2007. The Harvard-Partners group is using i2b2 as part of their CTSA, so it will be implemented at Brigham and Women's, Beth Israel and Children's Hospital Boston, in addition to MGH.

3. Is anyone else using it?

Outside the Harvard-Partners group, several institutions are in various stages of implementation. An incomplete list includes Denver Children's, UC-SF, Morehouse, UMass, UC-Davis, and OHSU.

4. What is a research data warehouse?

The term means different things to different people, but in our view, a research data warehouse is a repository that integrates information on patients from multiple sources. These include electronic health records, lab results, genetic and research data, as well as other public sources such as birth registries and government data such as Medicaid. This information will be aggregated, cleaned and de-identified. Once this process is complete, it will be presented to the user, who will then be able to query the data.

5. Doesn't Cincinnati Children's already have a data warehouse?

Yes, but the two systems have very different functions and target user groups. The existing Data Warehouse is designed to provide decision support for queries of a financial nature. The research data warehouse is focused on tasks like cohort identification and hypothesis generation.

6. How does all this relate to Epic? And what is Clarity?

Epic is the electronic medical record (EMR) system at Cincinnati Children's. It is designed for patient care and is therefore focused on providing the information and functionality necessary to support the patient visit. It has an associated database, but it is transactional in nature, architected to support the operation of the EMR.

Clarity is Epic's main reporting environment. It receives nightly updates from the production Epic system (Chronicles) and serves as an operational data store. Reports generated from this system provide information on patient outcomes, clinical effectiveness, etc.

Data

1. What types of information are in the warehouse?

During our initial roll-out, the warehouse will be loaded with a "historical extract" of our legacy systems, consisting of patient information from all visits to Cincinnati Children's between 2003 and 2007. This information includes demographics (age, gender, race), diagnoses (ICD-9), procedures, medications and lab results.

We eventually plan to include extracts from Epic, DocSite, and the new Cerner laboratory system. Once the "main" systems are in place, we will begin to load public data sources, data from the different divisions or research cores (such as images or genetic data), as well as the research databases from individual groups or investigators. A partial list is provided in the table below.

We can load almost anything into the warehouse, but in order for it to be useful, we must know how to query or describe it.

Data Source | Contents
INVISION (KIDS/ICIS) | Patient demographics, diagnoses, procedures and medication orders for all inpatient and ambulatory encounters.
Cerner Classic (to be replaced by Cerner Millennium) | Laboratory Results
Clarity (Epic) | Patient demographics, diagnoses, procedures, medication orders, laboratory results and other patient data for ambulatory and (eventually) inpatient encounters.
DocSite | Demographics, diagnoses, procedures, medication orders, laboratory results and other clinical/research data for patients contained within each registry.
Cerner Millennium Pathology | Pathology Reports
CardioIMS | Cardiology Reports
GE Radiology | Radiology Reports
Discharge Summaries | Discharge Summaries

The data sources listed in this table will be used to populate the initial load of the i2b2 repository. As time progresses, other sources will be identified and added to the warehouse. Examples include genetic or microarray data, reports from neonatology, public sources, data from other institutions (UC, Good Samaritan, HealthBridge), etc.

2. What is the i2b2 data model?

The i2b2 framework employs a simple, yet powerful data model. It consists of facts and dimensions. A fact is the piece of information being queried, and the dimensions are groups of hierarchies and descriptors that describe the facts. The i2b2 database utilizes a star schema that consists of one fact table surrounded by numerous dimension tables (see figure below). Facts in i2b2 are observations about a patient, including things like diagnoses, demographics, laboratory results, etc.

Figure 1: i2b2 star schema

Figure 2: Information stored about a laboratory result "fact."

One of the benefits of using this kind of data model is that one can easily add and integrate data from multiple sources without having to redesign the system or change the underlying architecture. All new observations are simply added to the fact table. To describe and navigate through these facts accurately, metadata becomes crucial. i2b2 uses metadata to let the user create a hierarchical categorization of the different concepts within the database. Query terms are selected from these categorizations. Example categories include diagnoses, procedures, demographics and laboratory tests. Where standardized terminologies are used to capture the data, such as ICD-9 for diagnoses and CPT for procedures, those hierarchies can be ported directly into the metadata table.
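To make the fact/dimension idea concrete, here is a minimal sketch of a star schema using Python's built-in sqlite3 module. The table names, column names, concept paths and sample rows are all simplified assumptions for illustration; they are not the actual i2b2 schema or real patient data. The sketch shows one fact table joined to a concept dimension whose hierarchical path lets a query term select every concept beneath it.

```python
import sqlite3

# Hypothetical, simplified star schema: one fact table, two dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patient_dim (patient_id INTEGER PRIMARY KEY, sex TEXT, birth_year INTEGER);
CREATE TABLE concept_dim (concept_code TEXT PRIMARY KEY, concept_path TEXT, name TEXT);
CREATE TABLE observation_fact (
    patient_id   INTEGER REFERENCES patient_dim(patient_id),
    concept_code TEXT REFERENCES concept_dim(concept_code),
    obs_date     TEXT,
    num_value    REAL
);
""")

# Made-up sample rows for illustration only.
conn.executemany("INSERT INTO patient_dim VALUES (?, ?, ?)",
                 [(1, "F", 2001), (2, "M", 1999), (3, "F", 2004)])
conn.executemany("INSERT INTO concept_dim VALUES (?, ?, ?)", [
    ("DX:A1", "/Diagnoses/Respiratory/Asthma/A1/", "Asthma subtype A1"),
    ("DX:A2", "/Diagnoses/Respiratory/Asthma/A2/", "Asthma subtype A2"),
    ("LAB:GLU", "/Labs/Chemistry/Glucose/", "Glucose"),
])
conn.executemany("INSERT INTO observation_fact VALUES (?, ?, ?, ?)", [
    (1, "DX:A1", "2005-03-01", None),
    (2, "DX:A2", "2006-07-15", None),
    (2, "LAB:GLU", "2006-07-15", 92.0),
    (3, "LAB:GLU", "2007-01-09", 101.0),
])


def count_patients(conn, path_prefix):
    """Count distinct patients with any fact under a branch of the hierarchy."""
    row = conn.execute(
        """SELECT COUNT(DISTINCT f.patient_id)
           FROM observation_fact f
           JOIN concept_dim c ON c.concept_code = f.concept_code
           WHERE c.concept_path LIKE ? || '%'""",
        (path_prefix,),
    ).fetchone()
    return row[0]


asthma_count = count_patients(conn, "/Diagnoses/Respiratory/Asthma/")
lab_count = count_patients(conn, "/Labs/")
```

In this sketch, loading a new source means adding rows to the concept table and the fact table; no table definitions change, which is the property the answer above describes.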

[Information used in this answer was taken from a presentation at the 2008 i2b2 Academic Users Group Workshop - slides posted on the i2b2 website: https://www.i2b2.org/events/aug08.html.]

3. How does the warehouse team merge and clean the data?

We plan to place all of our procedures online shortly after we put the warehouse into production. The short version is that data are merged on the patient's medical record number (MRN). For overlapping fields like race, date of birth and sex, when there are discrepancies, we look at the number of times a certain value appears and the system where the value originated before making a decision (some systems are weighted more than others). Once all of the data are cleaned, we replace the patient MRN with a random value. This anonymized version of the database is the one that can be queried by the end user.
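The reconciliation and anonymization steps described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the warehouse team's actual procedure: the system weights, the scoring rule (each occurrence adds its system's weight) and the token format are all hypothetical.

```python
import secrets
from collections import defaultdict

# Hypothetical weights. The text says only that some systems are weighted
# more than others; it does not give the values.
SYSTEM_WEIGHTS = {"Epic": 3.0, "INVISION": 2.0, "DocSite": 1.0}


def reconcile(observations, weights=SYSTEM_WEIGHTS):
    """Pick one value for a field from (source_system, value) pairs.

    Each occurrence adds its system's weight to that value's score, so both
    how often a value appears and where it came from influence the result.
    """
    scores = defaultdict(float)
    for system, value in observations:
        scores[value] += weights.get(system, 1.0)
    # Highest score wins; iterating in sorted order makes ties deterministic.
    return max(sorted(scores), key=lambda v: scores[v])


def pseudonymize(mrns):
    """Map each distinct MRN to a random token that is not derived from it."""
    mapping = {}
    used = set()
    for mrn in mrns:
        if mrn in mapping:
            continue
        token = secrets.token_hex(8)
        while token in used:  # regenerate on the (unlikely) collision
            token = secrets.token_hex(8)
        used.add(token)
        mapping[mrn] = token
    return mapping


# One heavily weighted source outvotes two occurrences from a lighter one.
sex_value = reconcile([("DocSite", "M"), ("DocSite", "M"), ("Epic", "F")])
tokens = pseudonymize(["MRN-001", "MRN-002", "MRN-001"])
```

Because the tokens are random rather than computed from the MRN, they cannot be reversed without the mapping table, which would have to be kept separately from the queryable database.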

4. How often will the warehouse be updated?

The warehouse is not intended to be real-time. It will always lag the other clinical and research repositories. We do, however, plan to update the warehouse periodically. The process to clean and reconcile all the different data sources can be rather lengthy, so we envision updating the warehouse every 2-4 weeks. When this occurs, all of the existing data will be refreshed and replaced. In time, the update frequency may increase.

Functionality

1. What can I do with i2b2?

The warehouse is best suited for tasks like cohort identification, hypothesis generation and retrospective data analysis. Automated software tools will facilitate some of these functions, while others will require more of a manual process. The initial software tools will be focused around cohort identification.

We have developed a set of web-based tools that allow the user to query the warehouse after logging in with their unique username and password. Using the workbench (see picture below), a user can drag-and-drop search terms (1) into a Venn diagram-like interface (2). Once executed (3), the query will return the aggregate number of patients meeting the specified criteria (4). We also present a data grid that breaks down the patient count by age, race and gender (5). If there are more than 5 patients in any age, race and gender grouping, that value is presented to the user. If the value is less than 5 (but greater than zero), we denote it with a '*'. In addition, if the total patient count is less than ten, the grid is not displayed and the value returned by the query is replaced with "<10." The user is able to view their previous queries (6), and can regenerate the results in grid format by dragging the previous query to the appropriate location (7).
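The display rules above amount to a small-cell suppression policy. A minimal sketch follows; the function names and the grid representation are assumptions, and because the text specifies only "more than 5" and "less than 5", the handling of a count of exactly 5 is also an assumption (it is shown here).

```python
def display_total(total):
    """Return the total patient count as it would be shown to the user."""
    return "<10" if total < 10 else str(total)


def display_grid(total, grid):
    """Return the demographic grid as shown, or None when it is hidden.

    `grid` maps an (age_group, race, gender) tuple to a patient count.
    """
    if total < 10:
        return None  # the grid is not displayed for very small cohorts
    shown = {}
    for cell, count in grid.items():
        if count == 0:
            shown[cell] = "0"
        elif count < 5:
            shown[cell] = "*"  # small non-zero cells are masked
        else:
            # Assumption: exactly 5 is displayed; the text does not say.
            shown[cell] = str(count)
    return shown


example_grid = {
    ("0-9", "White", "F"): 12,
    ("0-9", "Black", "M"): 3,
    ("10-17", "Asian", "F"): 0,
}
masked = display_grid(15, example_grid)
```

Masking small cells keeps a user from narrowing a query until the count identifies an individual patient.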

Assuming there are no IRB issues, we are also planning to allow the user to generate a limited, de-identified export of the patient cohort. We are still working on implementing this functionality, but if the patient cohort is sufficiently large (greater than 10) but not overwhelmingly so (less than a few thousand), we would allow the user to download an extract containing data on demographics, diagnoses, procedures, medications and laboratory tests. For larger extracts, or for extracts requiring other information, the user would have to submit a request to the warehouse team. Before receiving any extract, either through the workbench or directly from the warehouse team, it is likely that all users would be required to sign a Limited Data Use Agreement form.

2. What can't I do?

The warehouse is not really suited for tasks like electronic data collection, sample tracking, study administration or providing real-time alerts. Nor is it designed to serve as a clinical repository. While it is true that the warehouse is essentially a massive database, BMI has more appropriate solutions for investigators interested in study and data management. See our Research IT Services web site, or email the BMI Help Desk for more information.

3. What is the process for getting IRB approval to view a fully identified extract?

That process is still undetermined, but we are working on it. Until then, we can work with investigators on a case-by-case basis.

4. What if I have suggestions about additional functionality?

We'd love to hear them.

Integration with Research

1. Will i2b2 replace my research database?

No. While we at BMI are perfectly happy to host your research database, we do not intend for the research data warehouse to serve as a replacement. When we update the database, we will perform a complete refresh, wiping away all of the present data. We will reload everything from the different data sources, including any data that was added or changed since the last refresh. Therefore, we would prefer that the different research databases be kept separate, in order to simplify the data reconciliation process and prevent overlapping fields from being accidentally overwritten.

2. If I include my research data, will everyone be able to access it?

The only people who will be able to see your data are those to whom you grant authorization. If the information can be provided to the general research community, we will add it to the warehouse. If it cannot, we will mark it so that only you (or others in your group with proper approval) can access it.

3. What benefits do I gain from adding my research data?

At a minimum, when reconciling data from different sources, we will be able to notify the source holders of discrepancies that may arise in overlapping fields (conflicting sex, date of birth, etc.).

Also, when adding research data to information that has already been collected clinically, we can provide investigators with a more comprehensive view of their patient's condition than might otherwise be possible. Along the same lines, adding specialized research data to the general repository can improve the quality of information for the entire research community.

4. How can I add my research database to the warehouse?

We are working to include sources as quickly as we can. Our initial focus will be on loading the largest sources and those that provide the biggest bang for the buck. We will be contacting researchers to gauge their interest over the coming weeks and months.

5. What do I need to do to ask questions about my data?

In order to ask questions about data in the warehouse, several issues must be addressed:

  • Collection: Is the data needed to answer the question being collected?
  • Modeling: Does an appropriate model of the data exist? Creating a representation for data types like numbers or text from a pick-list is relatively straightforward. For more complex sources like free-text clinical notes or images, one must decide on an appropriate transformation.
  • Integration: How should we handle discrepancies within your data? If a field within your database overlaps with a field in Epic or DocSite, how do we relate the two? How do we handle data conflicts? Is there a source we should treat as the "gold standard?"
  • Analysis: Given the sources available, what types of questions would you like to ask?

Other

1. What if I'm having trouble getting my research data out of Epic?

We anticipate this being a short-term problem that will gradually be resolved after the main Epic go-live. That being said, no one should be forced to go weeks or months without access to their data. Please contact the warehouse team and we will work with you to provide the necessary extracts.

2. Why do I have to enter my DocSite data into Epic (or vice versa)?

While technically not related to the research data warehouse project, we hear enough complaints about this issue to address it here. Getting Epic and DocSite to talk to one another is not a trivial process. Biomedical Informatics and Cincinnati Children's Information Services are committed to finding a solution and hope to have something in place soon.

3. Will the warehouse be able to change the data in Epic?

This is highly unlikely. There are strict legal restrictions on the modification of a patient's health care record. Epic serves as the patient's health record. Any data that is put into Epic and "approved" is considered correct. If information is later added or a note is made to ignore previous information (hypothetically), we can load the most current information, but we do not plan to send any information from the warehouse to Epic.

4. What is the timeline for the project? When can I start using it?

We have started initial user testing and plan a slow roll-out now that we have received approval from the IRB. The purpose of our initial testing is to gauge the reaction of the research community to the existing query interface/ontology and begin work to add more pediatric query terms.

5. What if my question isn't listed? Who should I contact for further information?

Please contact the project lead, Keith Marsolo, PhD.

User Group

1. What is the i2b2 working group?

The working group is an INFORMAL network of researchers and investigators who are being asked to collect a set of clinical research queries designed around their particular specialty. The members of this group are intended to serve as the points of contact for their division, aggregating and filtering the queries collected into a reasonable, representative set. These sets will be crucial to the warehouse team as it works to make i2b2 more useful to our investigators.

2. How can I contribute?

First, by working with colleagues to formulate a set of clinical research queries. Second, by identifying a set of query "terms" that could be used for tasks like cohort identification. All information should be directed to Keith Marsolo.

3. Why are you asking for these queries?

We are modifying the number and identity of terms listed in the query interface of the i2b2 workbench. We aim to make the interface "pediatric-friendly" and to include the diseases and inclusion/exclusion criteria that are of interest to Cincinnati Children's investigators (in collaboration with colleagues from Denver Children's). We also plan to eliminate terms that are not relevant to pediatrics or for which we have no data.

At the same time, we are asking for more powerful or more specific clinical research queries so we can create a standard set of "views" that would allow investigators to submit direct queries to the warehouse. This view would simply transform the data in the warehouse from a format suitable for cohort identification into one that is geared more toward data analysis.

Finally, if there are enough commonalities between the queries, we intend to develop a set of software tools that will allow a user to analyze or query the database instead of having to perform the task themselves.

Note: YOU CAN ONLY QUERY ON INFORMATION THAT EXISTS IN THE WAREHOUSE. IF IT'S NOT COLLECTED, OR IS COLLECTED IN AN UNQUERYABLE FORMAT, FOR ALL PRACTICAL PURPOSES, THE INFORMATION DOES NOT EXIST.

Ontologies and Terminologies

1. How are the i2b2 query ontologies constructed?

The query ontologies used in the workbench are largely based on the data available. Our diagnosis data are coded in ICD-9, so that is the hierarchy we use in the workbench. The same holds true for procedures. Our medication and laboratory data are not coded in a terminology with a standard hierarchy, so we had to organize them using other methods.

We are working with colleagues in Health Information Management (HIM) to create a set of terms based on the billing information found in the Hospital's coding and abstraction guidelines. In addition, when we receive data that employs other hierarchies, for instance SNOMED or IMO, we will add those terms to our query tool.

As we integrate individual research databases, or registries such as DocSite, we plan to add a series of self-contained branches to the ontology that will include all the terms that might be used by the investigator to query that set of patients. In the case of DocSite, that would include all of the variables unique to the registry as well as the subset of the diagnosis, procedure and laboratory hierarchies for which the DocSite patients have data.

2. Which terminologies are currently used?

The following table provides a list of the current terminologies/standards used in our warehouse:

Data Type | Terminology
Diagnoses | ICD-9
Procedures | ICD-9 & CPT
Medications | Medications identified by NDC value listed on each order and organized by the hierarchy employed by Epic.
Laboratory Results | Laboratory tests are identified by a mixture of LOINC codes and internal Cerner numbers. The tests are listed under the same hierarchy used in ICIS.

3. What if I'm interested in a term that's not included?

If the term is a synonym for an individual code or if it is based on a set of codes (for instance, a combination of CPT and ICD-9), it can be added to the hierarchy. If the term is based on data that is being collected but is not yet included in the warehouse, it can also be added once the source data is loaded into the warehouse. If the underlying data is not being collected, however, there is no reason to include the term in the query ontology.
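As an illustration of a term defined by a set of codes, the sketch below expands a custom term into its underlying codes, checks whether any of those codes has data, and finds the matching patients. The term name, the placeholder codes and the helper functions are all hypothetical, and treating a term as "any of these codes" is an assumption about how such a combination would be interpreted.

```python
# Hypothetical term definitions. The codes are placeholders, not real
# ICD-9 or CPT values.
TERM_DEFINITIONS = {
    "Example condition (any code)": {"ICD9:AAA.1", "ICD9:AAA.2", "CPT:XXXX1"},
    "Uncollected measure": {"LOCAL:NOT-LOADED"},
}

# Facts as (patient_id, code) pairs, standing in for the fact table.
FACTS = [(1, "ICD9:AAA.1"), (2, "CPT:XXXX1"), (3, "ICD9:BBB.9")]


def expand_term(term, definitions=TERM_DEFINITIONS):
    """Return the set of codes that a query term stands for."""
    return set(definitions[term])


def has_data(term, facts=FACTS, definitions=TERM_DEFINITIONS):
    """A term is worth listing only if at least one of its codes has data."""
    loaded_codes = {code for _, code in facts}
    return bool(expand_term(term, definitions) & loaded_codes)


def patients_for_term(term, facts=FACTS, definitions=TERM_DEFINITIONS):
    """Return the patients with at least one fact coded under the term."""
    codes = expand_term(term, definitions)
    return {patient for patient, code in facts if code in codes}


cohort = patients_for_term("Example condition (any code)")
```

The `has_data` check mirrors the last point above: a term whose codes match nothing in the warehouse can only return an empty cohort.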

4. Why don't you include terminology X?

If the terminology cannot be mapped to data within the warehouse, then there is no need to include it as part of the query ontology, since a search using that terminology would return an empty set of patients.

If the terminology can be mapped to the data (e.g. SNOMED to ICD-9), it might make sense to include it. In some cases, the mapping is not one-to-one, so there could be some confusion about which terminology is "correct." Rather than include every terminology in the search ontology, it is more likely that we will utilize search tools that allow a user to search a number of terminologies for a particular term or phrase, using tools based on applications like the UMLS, Apelon or LexGrid. We plan to let the user conduct a search and then, based on the results, highlight any terms that match or are similar to those that exist within the warehouse. Initially, this functionality will be housed within our Ontology Browser (see below).
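The search-then-highlight idea can be sketched with plain dictionaries. This does not use the UMLS, Apelon or LexGrid interfaces; the terminology contents, codes and mappings below are made-up placeholders, and substring matching stands in for whatever matching a real terminology server would provide.

```python
# Hypothetical terminologies: name -> {code: label}.
TERMINOLOGIES = {
    "SNOMED": {"S-100": "Asthma", "S-200": "Exercise-induced asthma"},
    "LOCAL": {"L-9": "Asthma action plan on file"},
}

# Hypothetical mappings to warehouse codes. A term may map to several
# codes, or to none, so the mapping is not one-to-one.
MAPPINGS = {
    ("SNOMED", "S-100"): {"ICD9:AAA.1", "ICD9:AAA.2"},
    ("SNOMED", "S-200"): {"ICD9:AAA.3"},
}

# Codes that actually have data loaded in the warehouse.
WAREHOUSE_CODES = {"ICD9:AAA.1", "ICD9:AAA.2"}


def search_terms(phrase, terminologies=TERMINOLOGIES, mappings=MAPPINGS,
                 warehouse_codes=WAREHOUSE_CODES):
    """Search every terminology and flag hits that map to warehouse data."""
    phrase = phrase.lower()
    hits = []
    for name, terms in terminologies.items():
        for code, label in terms.items():
            if phrase in label.lower():
                mapped = mappings.get((name, code), set())
                hits.append({
                    "terminology": name,
                    "code": code,
                    "label": label,
                    # An empty list means nothing to highlight for this hit.
                    "in_warehouse": sorted(mapped & warehouse_codes),
                })
    return hits


results = search_terms("asthma")
```

Each hit carries the list of warehouse codes it maps to, so an interface could highlight only the hits where that list is non-empty.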

5. What is the Ontology Browser?

The Ontology Browser is a stand-alone tool developed here at Cincinnati Children's that allows the user to view basic statistics about the query terms included in the warehouse. We are already working on a new version that will allow the user to gain more insight into the data. For instance, for a given laboratory test, the user could see a histogram breakdown for each of the reference ranges associated with the test, and from there, drill down to get additional information on the demographics (age, race, gender) of the underlying data. We are working to incorporate functionality that will allow the user to search multiple ontologies through the Apelon Distributed Terminology Server, and then use the results to determine which data, if any, are included in the warehouse.
