Data Sourcing | 11 | Data Warehousing for Biomedical Informatics

ABSTRACT

Sourcing data into the warehouse is the critical touchpoint between the vast array of systems and data structures that you want in your warehouse and the generic, design-patterned structure of the warehouse you are implementing. The secret to getting to production in less than a year is getting this step right. You want to source your data so that they look exactly the same, regardless of where they came from, so that they can be processed by the single set of generic ETL workflows that you’ll see in the next chapter. Because all sourced data look the same, the work of sourcing more data can proceed in parallel with the development of the subsequent ETL jobs; sourcing a single dataset is enough to enable that further development. Breaking the dependency between sourcing and loading is the reason the warehouse implementation can be accomplished so quickly. More data can be added at any time, and sequence-dependency testing can be carried out with the reinitialization tools developed in the previous chapter. It all comes down to removing any dependency on the structure or semantics of the source system data.
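As a rough illustration of the idea, the sketch below shows two differently structured sources being reduced to one uniform row shape that a single generic routine can consume. The abstract does not show the book's actual sourced-data structure, so everything here is an assumption made for illustration: the `SourcedRow` fields, the two adapter functions, the sample source layouts, and the `generic_load` stand-in are all hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, Tuple


@dataclass(frozen=True)
class SourcedRow:
    """Hypothetical uniform shape shared by all sourced data."""
    source: str      # name of the originating system
    record_id: str   # identifier of the record within that system
    attribute: str   # name of the data element
    value: str       # value kept as text; typing is left to later ETL


def source_lab(rows: Iterable[Dict[str, object]]) -> Iterator[SourcedRow]:
    """Adapter for an assumed lab system that delivers dictionaries."""
    for row in rows:
        record_id = str(row["accession"])
        for key, val in row.items():
            if key != "accession":
                yield SourcedRow("lab", record_id, key, str(val))


def source_billing(rows: Iterable[Tuple[int, str, float]]) -> Iterator[SourcedRow]:
    """Adapter for an assumed billing system that delivers positional tuples."""
    for invoice_no, code, amount in rows:
        yield SourcedRow("billing", str(invoice_no), "code", str(code))
        yield SourcedRow("billing", str(invoice_no), "amount", str(amount))


def generic_load(rows: Iterable[SourcedRow]) -> Dict[str, int]:
    """Stand-in for the downstream generic ETL: counts rows per source.

    It reads only SourcedRow fields, so it never needs to know how the
    lab or billing systems structure their data.
    """
    counts: Dict[str, int] = {}
    for row in rows:
        counts[row.source] = counts.get(row.source, 0) + 1
    return counts
```

In this sketch, adding a third source means writing one more adapter that yields `SourcedRow` objects; `generic_load` is untouched, which is the decoupling between sourcing and loading that the paragraph above describes.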