Melissa and Doug Customer Success Story

Melissa and Doug, Wilton, Connecticut, www.melissaanddoug.com, is a privately held American manufacturer of children’s toys with about 1,000 employees worldwide. Melissa and Doug has experienced steady growth since the 1980s and was running the business on a homegrown ERP and Warehouse Management system with ad hoc reporting in Tableau. In 2020, changing business requirements proved difficult to accommodate because of an inflexible Data Management architecture.

Myers-Holum, Inc. (MHI) made the strategic recommendation to implement an improved Data Management platform on Google Cloud with Looker for the company’s data modernization efforts. We worked with the business to document future-state KPI requirements, facilitated a data governance program, and defined the future-state BigQuery EDW architecture using Cloud Composer, Cloud Storage, and BigQuery SQL.

Near the end of the implementation project, new requirements emerged to report across newly created channel, sales, and region hierarchies. The newly built Data Management solution allowed these requirements to be implemented quickly, helping the business respond to the changing landscape of 2020.

“The newly built data management solution allowed for a quick implementation of new requirements that satisfied the business need to respond to the fast-changing landscape in 2020.” – Chris Conway, CTO, Melissa and Doug

Aristocrat/VGT Customer Success Story

Aristocrat/VGT, Franklin, Tennessee, www.aristocrat.com, is a leading developer, manufacturer, and distributor of Class II casino games in North America. Aristocrat handles enormous amounts of data and had an existing five-year-old Hadoop Big Data platform that was unable to scale to its tens of terabytes of data. The company faced changing functional requirements, had concerns around stability, high availability, and uptime, and maintained only a single environment with no separation between dev, test, and prod. Furthermore, serious capacity concerns required a separate Vertica-based BI layer, which meant increased costs, more maintenance, and more data pipelines to manage.

Myers-Holum, Inc. (MHI) engaged with Aristocrat to provide expertise on their migration to Google BigQuery. We used a Google Transfer Appliance to move their tens of terabytes of data off Hadoop and into Google BigQuery, and rebuilt the existing data processing pipelines using Google Cloud Composer and BigQuery SQL. MHI also completed the data modeling and established best practices for data warehousing on BigQuery. Finally, we enabled analytics using Tableau.

The initiative delivered faster, more precise data insights and left Aristocrat well positioned for future data growth.

“We successfully addressed serious capacity concerns that previously required separate Hadoop and Vertica instances with a single unified GCP Data Management platform, which reduced costs, reduced complexity, and increased capacity to onboard new data sources.” – Rick Watts, Sr. Architect

Myers-Holum, Inc. Achieves the Data Analytics Partner Specialization in the Google Cloud Partner Program

Google Cloud Recognizes Myers-Holum, Inc.’s Technical Proficiency and Proven Success In Data Analytics

New York, April 17, 2019 — Myers-Holum, Inc. today announced that it has achieved the Data Analytics Partner Specialization in the Google Cloud Partner Program. By earning the Partner Specialization, Myers-Holum, Inc. has proven its expertise and success in building customer solutions in the data analytics field using Google Cloud Platform technology.

 

Specializations in the Google Cloud Partner Program are designed to provide Google Cloud customers with qualified partners that have demonstrated technical proficiency and proven success in specialized solution and service areas.

 

Partners with this specialization have proven success from ingestion to data preparation, storage, and analysis.

 

“We are thrilled that our experience and partnership with Google Cloud have come together in this opportunity to become a specialized partner, and we look forward to growing the Google Cloud Platform footprint through the use of Google BigQuery, Dataflow, Dataproc, Data Fusion, Dataprep, and other Google analytics products,” said Darius Kemeklis, EVP of Cloud Practice, Myers-Holum.

 

Myers-Holum, Inc. (MHI) is a privately held enterprise systems and data integration consulting firm founded in 1981 in New York, New York. Having consulted for more than 800 companies ranging from the Fortune 500 to the lower Mid-Market, our staff represents a diverse and ambitious group of consulting and development professionals with in-depth industry, systems, and data management expertise. Today, MHI has regional presence in core markets across North America.

 

Boll and Branch Builds Enterprise Data Warehouse on Google BigQuery

Boll and Branch, New Jersey, USA, https://www.bollandbranch.com, is a premium bedding and sheet e-commerce retailer whose fast-growing online business needed a more data-driven approach to business decision-making. The existing BigQuery-based data mart had limited scope and capabilities, and the need for a comprehensive Enterprise Data Warehousing solution was identified.

Myers-Holum, Inc. led the Enterprise Data Warehousing implementation for Boll and Branch using Google Cloud Platform serverless technologies. Data was landed daily directly into BigQuery from multiple SaaS operational systems such as NetSuite, Shopify, Zendesk, and Iterable using the Stitch (stitch.com) serverless platform, as well as directly from the Segment audience tracking system. Monthly data files from third-party providers were landed in Google Cloud Storage. Cloud Scheduler, Cloud Functions, and BigQuery SQL were used to serverlessly process landed data into staging, consolidated, and prepared layers. Data Studio dashboards were implemented for analytical reporting. Dimensional data modeling techniques were applied to build the consolidated layer with multiple fact tables and type 1 and type 2 slowly changing dimension tables.
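As a rough illustration of the serverless processing pattern described above, the following is a minimal sketch of a type 2 slowly changing dimension load in BigQuery SQL. The table and column names (staging.stg_customer, analytics.dim_customer, and so on) are hypothetical and not taken from the actual implementation.

```sql
-- Hypothetical sketch: two-step type 2 slowly changing dimension load.
-- Step 1: expire current dimension rows whose tracked attributes have changed.
UPDATE `analytics.dim_customer` dim
SET is_current = FALSE,
    effective_end_date = CURRENT_DATE()
FROM `staging.stg_customer` stg
WHERE dim.customer_id = stg.customer_id
  AND dim.is_current = TRUE
  AND (dim.email != stg.email OR dim.loyalty_tier != stg.loyalty_tier);

-- Step 2: insert new versions for changed customers and first rows for new customers.
INSERT INTO `analytics.dim_customer`
  (customer_id, email, loyalty_tier, effective_start_date, effective_end_date, is_current)
SELECT stg.customer_id, stg.email, stg.loyalty_tier, CURRENT_DATE(), NULL, TRUE
FROM `staging.stg_customer` stg
LEFT JOIN `analytics.dim_customer` dim
  ON dim.customer_id = stg.customer_id AND dim.is_current = TRUE
WHERE dim.customer_id IS NULL;  -- no current row exists (new customer or just expired)
```

Statements like these could be run on the daily cadence driven by Cloud Scheduler and Cloud Functions described above.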

With the new Enterprise Data Warehouse in place, Boll and Branch was able to gain deeper insights into business performance and plan for continued growth.

PepkorIT Migrates Oracle Enterprise Data Warehouse to Google BigQuery

Steinhoff International Holdings, Cape Town, South Africa, www.steinhoffinternational.com, is a global retailer with stores across multiple regions. Steinhoff’s IT division PepkorIT was responsible for maintaining an existing on-premises, multi-tenant, Oracle-based Enterprise Data Warehouse (EDW) with custom SQL-script ETL pipelines, daily store transaction batch loads, various monthly consumer information data feeds, and analytical BI dashboards.

The legacy Oracle EDW was running out of capacity, and the business needed to improve time to insight from days to hours and minutes.

The decision was made to migrate the EDW to Google Cloud Platform, which required planning, design, and implementation assistance. Myers-Holum, Inc. (MHI) led the project to define the future Google Cloud-based solution architecture and implement a data ingestion framework using Dataflow that reused the same pipeline for both batch ingestion from Google Cloud Storage and real-time ingestion of OLTP database binary logs streamed through Cloud Pub/Sub into BigQuery.

The MHI solution centered on self-healing Dataflow pipelines that allowed for schema changes over time with minimal operational intervention and automatic data reprocessing. The solution included sensitive data masking, balance and control system tables, full data lineage for data landed into GCP, data quality rules, a BigQuery schema design based on Myers-Holum industry best practices, downstream data processing for BI and analytics use cases, and job monitoring using Stackdriver and Data Studio dashboards.
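As an illustration of the balance and control pattern mentioned above, a completeness check can be expressed directly in BigQuery SQL. The table and column names below are hypothetical placeholders, not the actual Pepkor schema.

```sql
-- Hypothetical sketch: compare control totals captured at extract time
-- against what actually landed in BigQuery for today's load.
SELECT
  bc.source_system,
  bc.load_date,
  bc.expected_row_count,
  COUNT(t.transaction_id)   AS landed_row_count,
  bc.expected_amount_total,
  SUM(t.transaction_amount) AS landed_amount_total
FROM `edw_control.balance_and_control` AS bc
LEFT JOIN `edw_landing.store_transactions` AS t
  ON t.source_system = bc.source_system
 AND t.load_date = bc.load_date
WHERE bc.load_date = CURRENT_DATE()
GROUP BY bc.source_system, bc.load_date, bc.expected_row_count, bc.expected_amount_total
HAVING landed_row_count != bc.expected_row_count
    OR landed_amount_total != bc.expected_amount_total;
```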

The Google Cloud BigQuery-based EDW allowed Steinhoff to reduce time to insight from days to minutes. Google Cloud serverless technologies such as Dataflow provided scalable infrastructure to ingest batch and real-time data quickly and reliably while reducing CapEx and OpEx.

Article 3: Designing a Data Warehouse for Google Cloud Platform

This is the third in a series of articles, collaboratively written by data and solution architects at Myers-Holum, Inc., PRA Health Sciences, and Google, describing an architectural framework for conversions of data warehouses from Teradata to the Google Cloud Platform.  The series will explore common architectural patterns for Teradata data warehouses and outline best-practice guidelines for porting these patterns to the GCP toolset.

David Chu, Knute Holum and Darius Kemeklis, Myers-Holum, Inc.

Michael Trolier, Ph.D., PRA Health Sciences

August 2018

Table of Contents

⇒ What does a data warehouse look like on Google Cloud Platform?

⇒ Dust-off Your Semantic Model

⇒ Creating the Physical Schema

⇒ Example using our Travel Reservation Model

⇒ Semantic Model Example using our Travel Reservation Star Schema

⇒ Physical Model Example for Reservation_Payment_Fact

⇒ Physical Model Example for Room_Reservation_Fact

⇒ Additional Considerations

⇒ Wrap-up, what’s next

Recap:  In the first article of this series, Article 1: Understanding Your Current Data Warehouse, we discussed source data capture as the first architectural layer of most data warehouses.  We categorized the three types of source capture processes (pull, push or stream) and the typical Teradata implementation strategies for landing the data into the data warehouse for each type of process.

In the second article of this series, Article 2: Source Data Capture, we recommended the creation of a unified landing model, as an architectural component we called the “Landing Zone”.  All source data capture activities target the Landing Zone with associated control processes managing the accessibility and lifespan of those slices of captured data.   We discussed the detailed requirements you should consider in your implementation of the Landing Zone to facilitate your decisions on whether existing Teradata source data capture processes can truly be converted.

Conversion versus Re-implementation:  As you read this series of articles, we are going to differentiate between approaches for converting your data warehouse from Teradata to the Google Cloud Platform versus re-implementing your data warehouse in Google Cloud Platform.  Arbitrarily, we are going to assume certain differences between the two approaches.   The conversion project will be assumed as justified based on an expense reduction/capital spend avoidance ROI model, and therefore the cost of the conversion must be constrained to fit within this cost reduction model.   We would also expect the implementation timeline to be oriented toward a single deliverable, and timed such that the next Teradata infrastructure upgrade can be avoided.

On the other hand, the reimplementation project will be assumed as justified based on a business benefit model, with implementation (capital) costs factored into the internal rate of return for the project.  We would also expect the implementation timeline to be phased into multiple deliverables based on the order of business initiatives you want to undertake.

What does a data warehouse look like on Google Cloud Platform?

To understand the scope of your Teradata conversion effort, you need to understand both the source and target requirements.  Only with this knowledge, can you understand the data integration processes in between that will make up the bulk of your conversion effort.

Cloud BigQuery is Google’s recommended technology for implementing your data warehouse.   BigQuery is a fully managed, petabyte-scale, low-cost enterprise data warehouse for business intelligence.  It is probably one of the principal reasons you are considering a data warehouse conversion.   

Migrating your Teradata data warehouse means that you will be instantiating your semantic logical data model into a new physical data model optimized for BigQuery.  As you recall, in Article 1 we discussed the importance of understanding how the Teradata semantic layer has been implemented and the extent to which it is actually being used.

The migration from Teradata will require a bottom-up approach: you will implement a series of BigQuery tables representing the semantic requirements of a given subject area.  Subject areas will be integrated using common dimensional data (such as customer, product, supplier, organization, or employee).

The bottom-up approach allows for an iterative conversion process.  Subject areas can be implemented incrementally, prioritizing the subject areas that provide the highest conversion benefit.  For example, at our client “Company X”, the customer marketing subject area consumes 60 percent of the existing Teradata storage capacity.  In addition, 14 of the top 25 most frequently used queries access this data.  Migrating this subject area off Teradata therefore benefits not only the users of customer marketing, but also the remaining Teradata users working in other subject areas, who gain from the freed-up Teradata capacity.

Dust-off Your Semantic Model

You will need a semantic logical data model that represents the data presentation requirements for a given subject area before you can begin any BigQuery physical design.  We recommend using a star schema (following Kimball standards) as the basis for your semantic model.  We also recommend the use of a data modeling tool (ER/Studio, CA-ERWin, InfoSphere, etc.) to maintain it.   The metadata created and maintained in the data modeling tool will become an important component of your overall data warehouse metadata strategy, which we will cover in a future article.     

If you do not have an existing semantic logical data model, creating one in a conversion scenario is not that difficult.  Remember you have an existing Teradata implementation you can analyze to create it.   Here are some approaches for doing that:

  • Use your Teradata database views:
    • If you have used views to standardize end user access to your data warehouse, and those views are more than simple wrappers of your data warehouse tables, then these views may be an effective approach for creating a semantic logical data model.
  • Use the semantic model from your BI tool:
    • Many BI products like Business Objects, MicroStrategy, and Cognos incorporate semantic layers.
    • If you have predominantly standardized end user access via the BI tool, this may be an effective source for creating a semantic logical data model.
  • Analyze your current reports and dashboards:
    • Within a representative sample of reports:
      • What is the underlying granularity of the report (fact level)?
      • To what level is the report aggregated (by region, by fiscal period, by product)?
      • What other dimensional data is reported?
    • We create an Excel spreadsheet to contain this analysis with the rows reflecting the underlying granularity and the columns reflecting the aggregate levels and other dimensions.

Creating the Star Schema

As you create the star schema representing your semantic model, it is important to focus on the following objectives:

  • Identifying fact tables and declaring their associated grain:
    • Fact tables tend to mirror business processes and the associated measures for that business process.  It is important that you use this information to declare the grain of the fact table independently of the levels of granularity contained within the source data.  For example, our sample reservation source model contains levels of granularity reflecting the reservation (order level), reservation item (order item level), booking level (inventory level) and booking detail (pricing level).   However, our reservation star schema contains fact tables representing the business processes at the grain we want to analyze: reservation, reservation item, and reservation payment.  What you name your fact tables should clearly indicate the grain.  
    • Fact tables always have a date (or date/time) dimension, typically representing the transaction or event date.  And that date dimension is typically qualified each time the fact table is accessed by an end user.  In other words, we are looking in advance to identify how we intend to date-partition each fact table.
    • Fact tables are the only tables in a star schema with a composite primary key.  That primary key is always composed of the date dimension and enough other referenced dimensions to guarantee uniqueness.     
  • Identifying the dimension types (type 1, 2 or 3):
    • As you identify the dimensions that apply to each fact table, you need to understand the strategy for handling any change in the associated dimensional values.  Each dimension should therefore be clearly declared as to its type, where type 1 means overwrite the value, type 2 means add a dimensional row, and type 3 means add a dimensional column.
    • In effect, the type of dimension consistently declares the point-in-time relationship between the dimensional values and the associated fact measures.  Having a consistent definition of this point-in-time relationship will be immensely helpful when we create the physical schema and further de-normalize it for performance.
    • For type 2 and type 3 dimensions, don’t automatically assume an effective end date (or current indicator) is needed.  It has no relationship to the associated fact table and typically is not mandatory as a dimension attribute.  In the next article, we will identify the significant data integration benefit of not having to maintain it.
    • Remember, it is OK to have the same dimension as both type 1 and type 2.  This gives you both the historical and the current point-in-time view in relation to the fact.  You would simply define the same dimension as different entities in your logical model.
  • Identifying aggregate needs:
    • Make sure you have the dimensions available to support the common aggregations you captured during your analysis of your current reports and dashboards.   If you have implemented aggregate awareness via either Teradata join indexes or within your BI tool, this will be another method of identifying these dimensions you summarize by.
    • Since BigQuery is a column-oriented store and can aggregate column values very efficiently, we typically do not model pre-computed aggregates (aka summary tables); a short sketch of this on-the-fly aggregation follows this list.
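For illustration, the kind of query that replaces a pre-computed summary table might look like the hypothetical sketch below; the table and column names loosely follow the reservation example used later in this article, but the specific columns are assumptions.

```sql
-- Hypothetical sketch: aggregate directly from the fact table at query time
-- instead of maintaining a pre-computed summary table. Names are illustrative.
SELECT
  d.Fiscal_Period,
  COUNT(*)              AS payment_count,
  SUM(f.Payment_Amount) AS total_payment_amount
FROM `dw.Reservation_Payment_Fact` AS f
JOIN `dw.Date_Dim` AS d
  ON d.Date_Dim_ID = f.Payment_Dim_ID
WHERE d.Fiscal_Year = 2018
GROUP BY d.Fiscal_Period;
```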

Creating the Physical Schema

The physical schema represents how your end users and associated end user oriented tool sets will see the data warehouse in BigQuery.  Implementing this schema requires the same compromise between ease of use (with potentially inefficient access) and efficient data access (with potential difficulty of use) that we have all made many times before.  Our physical design approach seeks to optimize ease of use for as many business use cases as possible, while recognizing that physical schema optimization for query performance and cost may be needed in certain situations.

BigQuery’s ability to support a huge data set is a clear advantage of the Google product and most of our clients have at least future plans to utilize this capability.  However, it is important to realize that in a conversion scenario, typical data set sizes coming from Teradata will allow a degree of sub-optimal design while still meeting expected end user SLAs and cost constraints.

BigQuery physical schema optimization revolves around two key capabilities:

  • Implementing date based partitioning
  • Using nested and repeating fields to flatten multiple tables into one table to co-locate the data and eliminate required joins (both capabilities are sketched just after this list)
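As a concrete illustration of these two capabilities, here is a minimal, hypothetical DDL sketch of a date-partitioned BigQuery table that also carries a nested, repeated structure; the table and column names are illustrative and not part of the reference model.

```sql
-- Hypothetical sketch: a date-partitioned fact table that also uses a
-- nested, repeated field to co-locate child rows and avoid a join.
CREATE TABLE `dw.sales_order_fact`
(
  order_id         STRING NOT NULL,
  order_date       DATE   NOT NULL,
  customer_dim_id  INT64,
  order_amount     NUMERIC,
  -- Child order lines flattened into the parent row as a repeated struct:
  order_items ARRAY<STRUCT<
    item_number    INT64,
    product_dim_id INT64,
    quantity       INT64,
    item_amount    NUMERIC
  >>
)
PARTITION BY order_date;
```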

Our design approach seeks to apply a consistent set of physical modeling rules to the star schema logical model, then perform testing using actual data, and then flatten tables further if justified.   Here are the physical modeling rules we use:

  • Start with the fact tables and always keep the declared grain intact.  While we may move dimensional data into the physical fact table to flatten it, we do not move in any data that is at a lower level of granularity than the declared grain.  
  • Fact tables should be date partitioned based on the date dimension you identified in the logical model as most suitable.  We are expecting almost every end user query to specify _PARTITIONTIME to qualify the range of fact table rows needed.  
  • Type 2 dimension tables that are at the same grain as the fact table should be moved into the physical table to flatten it.  For example, in the logical model you have a sales order fact and a sales order dimension because you wanted to separate the additive versus the descriptive attributes.
  • Since type 1 dimensions are often conforming dimensions or contain a limited number of values, we make every effort to preserve them as separate tables.  Know the compressed data sizes of your type 1 dimensions and whether they fit within the 100MB limit that allows for the best performing BigQuery joins.
  • Other type 2 dimensions are candidates to be moved into the physical table to flatten it, but we wait until performance concerns dictate it.
  • We create any summary tables that are needed for exact numeric aggregations.

Beyond physical modeling rules, it is important to consider the data integration requirements that will correspond to the physical schema being created.   A single-source-to-single-target model is the ideal (and easiest to implement) data integration model.   You will find that the more you create a big flat table, the more data sources you will need at data integration time to load it.  And the complexity of ensuring all those source tables are in sync will often require sophisticated data staging approaches that will need to be coded and that will impact your end-to-end latency.

⇒ Example using our Travel Reservation Model

Let’s work through the logical and physical modeling process using a subset of our travel industry reservation model.   Here is the entity relationship diagram for the model:

A travel reservation is a variant of a standard sales order.  In the above model, colors are used to reflect the various levels of granularity within this model.  

  • Grey – Reflects the itinerary or sales order level.  Your travel agency reservation confirmation number would be at this level.  
  • Yellow – A reservation can be made up of multiple items.  For example, you may have booked a flight, hotel and car rental under the same travel agency reservation.   Like a sales order item, we carry the common attributes for each reservation item at this level of granularity.
  • Red – Reservation items are bookings with different travel providers.  You booked your flight with United, your hotel with Marriott and your car rental with Avis.  Here we carry the attributes specific to the type of travel provider.
  • Green – Each booking (air, hotel or car) has fulfillment details associated with it.  In addition, there are attributes reflecting pricing details like rate, taxes, and fees.  

⇒ Semantic Model Example using our Travel Reservation Star Schema

As we design our semantic layer, we identify the levels of granularity we want to present to our business users and the associated measures we want to incorporate at that grain.  We represent these requirements using fact tables at the associated grain.  In our example, we are declaring fact tables at the reservation, reservation payment, and reservation item levels.   We also identify the supporting dimensions, their dimensional type and whether they are conformed dimensions in this semantic model.      

Let’s look at a subset of the star schema model we designed:

In the above model, colors are used to reflect the type of entity.   Let’s declare the grain of the two fact tables (shown in light blue):  

  • Reservation_Payment_Fact – Payments are specific to a reservation as a whole, so the overall grain is reservation level.   The primary key is the reservation, payment date and payment time.  
  • Room_Reservation_Fact – A room reservation is a specific booking type within a reservation item, so the overall grain is reservation item level.  The primary key is reservation and reservation item.

⇒ Physical Model Example for Reservation_Payment_Fact

Now let’s go through the physical modeling rules we discussed in “Creating the Physical Schema” above.

  • What is the fact table grain? - The Reservation_Payment_Fact table is at reservation level.  
  • How will the fact table be date partitioned? - It will be partitioned based on payment date (Payment_Dim_ID).  
  • Are there type 2 dimension tables that are at the same grain as the fact table? – Yes, Reservation_Dim should be moved into the fact table.
  • Are there large type 1 dimensions that might not efficiently join to the fact table? – No, Date_Dim and Time_Dim are small dimensions of static values.  While the fact that they contain static values would make them easy to move into the fact table, they are also conforming dimensions, so our preference is to leave them as separate tables until performance issues dictate a different approach.

We end up with the following physical BigQuery table:
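The original figure is not reproduced here; based on the rules above, a hypothetical DDL sketch of the resulting table might look like the following (the measures and flattened attributes are illustrative assumptions, not the actual design).

```sql
-- Hypothetical sketch of the physical Reservation_Payment_Fact table.
-- Reservation_Dim attributes are flattened into the fact table, while
-- Date_Dim and Time_Dim remain separate conforming dimensions referenced by key.
CREATE TABLE `dw.Reservation_Payment_Fact`
(
  Reservation_ID       STRING NOT NULL,
  Payment_Dim_ID       INT64  NOT NULL,   -- payment date key, drives partitioning
  Payment_Time_Dim_ID  INT64  NOT NULL,
  Payment_Amount       NUMERIC,
  Payment_Method_Code  STRING,
  -- Attributes flattened in from Reservation_Dim (same grain as the fact):
  Reservation_Number   STRING,
  Reservation_Status   STRING,
  Agency_Name          STRING
)
PARTITION BY _PARTITIONDATE;  -- ingestion-time partitioning; queries qualify _PARTITIONTIME
```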

⇒ Physical Model Example for Room_Reservation_Fact

Again, let’s apply the physical modeling rules we discussed in “Creating the Physical Schema” above.

  • What is the fact table grain? - The Room_Reservation_Fact table is at the reservation item level.  
  • How will the fact table be date partitioned? - It will be partitioned based on reservation date (Reservation_Date_Dim_ID).  
  • Are there type 2 dimension tables that are at the same grain as the fact table? – Yes, Reservation_Item_Dim should be moved into the fact table.     
  • Are there large type 1 dimensions that might not efficiently join to the fact table? – No, Business_Unit_Dim and Channel_Dim are very small dimensions of static values.  Resort_Dim is a larger dimension, but it is a conforming dimension that we would leave as a separate table until performance issues dictate a different approach.   The two bridge tables, Reservation_Status_Bridge and Reservation_Class_Bridge, each reference a single row in their associated dimensions.  We need these to stay as separate tables so that any change in the status or classification becomes immediately effective (type 1 point in time).
  • Are there large type 2 dimensions that might not efficiently join to the fact table? – Yes, Reservation_Dim is at the reservation level and therefore can be moved into the fact table without any requirement for nested/repeating attributes.  Product_Dim is another conforming dimension that we would leave as a separate table until performance issues dictate a change.  Again, due to the star schema design, no nested/repeating attributes are needed.

We end up with the following physical table:
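As with the previous example, the original figure is not reproduced here; a hypothetical DDL sketch consistent with the rules above might look like the following (the measures and flattened attributes are illustrative assumptions).

```sql
-- Hypothetical sketch of the physical Room_Reservation_Fact table.
-- Reservation_Item_Dim and Reservation_Dim attributes are flattened in, while
-- Business_Unit_Dim, Channel_Dim, Resort_Dim, Product_Dim and the two bridge
-- tables remain separate and are referenced by key.
CREATE TABLE `dw.Room_Reservation_Fact`
(
  Reservation_ID               STRING NOT NULL,
  Reservation_Item_Number      INT64  NOT NULL,
  Reservation_Date_Dim_ID      INT64  NOT NULL,   -- drives the date partitioning
  Business_Unit_Dim_ID         INT64,
  Channel_Dim_ID               INT64,
  Resort_Dim_ID                INT64,
  Product_Dim_ID               INT64,
  Reservation_Status_Bridge_ID INT64,
  Reservation_Class_Bridge_ID  INT64,
  Room_Nights                  INT64,
  Room_Revenue_Amount          NUMERIC,
  -- Attributes flattened in from Reservation_Item_Dim and Reservation_Dim:
  Reservation_Number           STRING,
  Item_Type_Code               STRING,
  Guest_Count                  INT64
)
PARTITION BY _PARTITIONDATE;  -- ingestion-time partitioning; queries qualify _PARTITIONTIME
```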

⇒ Additional Considerations

We try to keep conforming dimensions as separate tables as much as possible.   They are enterprise dimensions standardized across many subject areas.  They are typically enhanced (more attributes) over time as more and more subject areas use them.  As they are enhanced, every subject area using them should benefit, which will happen automatically if they remain as separate tables.

With BigQuery’s support for nested and repeating structures, we could have physically modeled this as a single nested table at the reservation grain, with nested structures for reservation item, room booking, booking detail and booking amount.  Essentially, you would just be implementing your operational model as a single nested table, and you would end up with nesting within nesting due to the relational nature of the original data model.   For this type of data, we find this nested/repeating structure more difficult for business end users to work with, and less compatible with many legacy BI tools implemented in your enterprise.  While we certainly implement BigQuery’s support for nested and repeating structures for other types of data structures (for example unstructured or XML-based data), in a Teradata conversion scenario the dominant source data structures will be relational.   Therefore, we prefer the star schema fact table-based approach.
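For contrast, the heavily nested alternative we generally avoid for this type of relational source data might look roughly like the hypothetical sketch below; the names and nesting levels are illustrative only.

```sql
-- Hypothetical sketch: the entire reservation modeled as one nested table.
-- Note the nesting within nesting that mirrors the original relational model.
CREATE TABLE `dw.Reservation_Nested`
(
  Reservation_ID   STRING NOT NULL,
  Reservation_Date DATE,
  Agency_Name      STRING,
  Reservation_Items ARRAY<STRUCT<
    Item_Number  INT64,
    Item_Type    STRING,            -- air, hotel, or car
    Bookings ARRAY<STRUCT<
      Provider_Name   STRING,
      Booking_Details ARRAY<STRUCT<
        Detail_Type STRING,         -- rate, tax, fee, fulfillment
        Amount      NUMERIC
      >>
    >>
  >>
)
PARTITION BY Reservation_Date;
```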

⇒ Wrap-up, what’s next

In this third article of the series, we explored the details of how to design your data warehouse for Google Cloud BigQuery.  Starting with a semantic logical model, we discussed logical data modeling techniques using a star schema.  We then discussed how to apply a consistent set of physical modeling rules to the star schema in order to create a well performing physical schema for BigQuery.   Finally, we used a specific subject area example (travel reservations) to further explain the techniques we use.   

In the next (fourth) article of the series, we will start to discuss the data integration approaches that will be necessary to load the actual data warehouse.  We will introduce the staging area and the typical data integration use cases that require it.


Article 2: Source Data Capture

This is the second in a series of articles, collaboratively written by data and solution architects at Myers-Holum, Inc., PRA Health Sciences, and Google, describing an architectural framework for conversions of data warehouses from Teradata to the Google Cloud Platform.  The series will explore common architectural patterns for Teradata data warehouses and outline best-practice guidelines for porting these patterns to the GCP toolset.

David Chu, Knute Holum and Darius Kemeklis, Myers-Holum, Inc.

Michael Trolier, Ph.D., PRA Health Sciences

July 2018

Table of Contents

Capture versus land

Capture is a formal contract with the source system

Nuances of landing data

Checklist for evaluating your source data capture processes

ETL is still the most robust and platform portable tool for implementing your source data capture processes

Implementing the Landing Zone on Google Cloud Platform

Sample scenarios for Landing Zone implementations

⇒ Wrap-up, what’s next

Recap:  In the first article of this series, Article 1: Understanding Your Current Data Warehouse, we discussed source data capture as the first architectural layer of most data warehouses.  We categorized the three types of source capture processes (pull, push or stream) and the typical Teradata implementation strategies for landing the data into the data warehouse for each type of process.

We recommended conversion rather than re-implementation of your current source data capture processes as a very important aid to facilitate side-by-side validation of your converted data warehouse.   We also recommended the creation of a unified landing model, as an architectural component we called the “Landing Zone”.  All source data capture activities target the Landing Zone with associated control processes managing the accessibility and lifespan of those slices of captured data.

In this article, we will discuss detailed requirements you should consider in your implementation of the Landing Zone.  An understanding of these requirements will facilitate your decisions on whether existing Teradata source data capture processes can truly be converted.

Conversion versus Re-implementation:  As you read this article, we are going to differentiate between approaches for converting your data warehouse from Teradata to the Google Cloud Platform versus re-implementing your data warehouse in Google Cloud Platform.  Arbitrarily, we are going to assume certain differences between the two approaches.   The conversion project will be assumed as justified based on an expense reduction/capital spend avoidance ROI model, and therefore the cost of the conversion must be constrained to fit within this cost reduction model.   We would also expect the implementation timeline to be oriented toward a single deliverable, and timed such that the next Teradata infrastructure upgrade can be avoided.  

On the other hand, the re-implementation project will be assumed as justified based on a business benefit model, with implementation (capital) costs factored into the internal rate of return for the project.  We would also expect the implementation timeline to be phased into multiple deliverables based on the order of business initiatives you want to undertake.    

Capture versus land

Your existing Teradata implementation first captures data from various sources of record, then lands the data either in some form of staging (flat files, relational tables or XML formats) or loads it directly to your data warehouse.   Let’s discuss the capture and land processes separately.    

Capture is a formal contract with the source system

This is important to remember.  We doubt your original data warehouse implementers had rights to access any source data of their choosing without regard to operational system impact, data security constraints, or IT infrastructure considerations.  They worked out a “contract” with the stakeholders of that source data repository, within which the source capture process operates today.   

So in the analysis to determine the extent to which you can convert your existing source capture processes, it is important to separate the analysis of how well the contract meets the business need from how well the actual source capture process is implemented.  Over time you may have created a variety of methods for implementing the actual source capture (BTEQ to TPT to ETL).  Clearly, you can define scope within your conversion process to standardize the implementation method without impacting the underlying contract it operates within.   Said another way, you may consider modernizing your source capture processes while striving to deliver the same dataset as before.

It is when you decide the contract no longer meets the business need that you cross from conversion to re-implementation.  If the majority of source data contracts no longer meet the business need, then you should be able to quantify those business requirements to justify the additional cost of a re-implementation, especially considering that source data capture changes will make automating side by side testing against the existing Teradata data warehouse much more difficult.

Nuances of landing data

All relational databases, Teradata included, enforce data integrity through the use of pre-defined schemas and data definitions that include type, length, and nullability.  Therefore, there is always a data quality component to landing captured source data into a relational database.   Specific functionality may be added to your source data capture processes to handle source data that does not meet these minimal requirements.  Below are some examples of how you may be handling this:

  • ETL (Informatica PowerCenter) – Generates a flat file store (BAD files) of any source records rejected by the target.  
  • Teradata Utilities – Populate error tables if any source records are rejected because of source format, data conversion, or constraint violations.

So in the analysis to determine the extent you can convert your existing source capture processes, it is important to determine which of those processes are depending on BAD files or error tables, either to stop the capture process entirely, or provide a mechanism to analyze and potentially correct and recycle the bad source data.   Knowing which source capture processes rely on the data integrity provided by the relational database is needed to best understand which Google Cloud Platform product can supply similar capabilities.

Checklist for evaluating your source data capture processes

You can use the following checklist to help evaluate the extent that conversion rather than re-implementation of your current source data capture processes will be possible:

  • Are the majority of your source data contracts meeting business needs?  If not, are the new business needs quantifiable and able to justify a re-implementation approach?
  • How have the various source data capture processes been implemented?   For example, via ETL or BTEQ scripts or via an external scripting tool (shell, perl)
    • Do you have a few standardized implementation patterns or are they all source specific implementations?
    • How portable are each of the implementations assuming Teradata was removed?
      • ETL would be very portable, especially if the ETL tool (PowerCenter, Talend, etc.) directly integrates with Google Cloud Platform.
      • External scripting that implements all the control flow and merely calls out to a Teradata utility to execute specific SQL commands would be fairly portable, assuming you implement the equivalent utility in the specific Google Cloud Platform product.  
      • Teradata specific scripting especially when the control flow is embedded within that script or when Teradata specific features are used (INMOD) would not be very portable.
      • Note that there are several metadata analysis tools that will greatly help with this assessment (Manta Tools).
  • Which of the various source data capture processes have downstream data quality requirements?
    • Requirement to meet data type, format, and nullability standards.
    • Requirement to standardize data (“CA”, “CALIF”, “California” all become “CA” )
    • Requirement to cleanse data (“102 Main” becomes “102 E. Main St.”)
    • Requirement for a less formal schema (hyper-changing source formats)
  • Which of the various source data capture processes have integrated balance and control processes?
    • Source capture process also captures control totals (counts, sum checks) about the capture itself.
    • These control totals are then used by independent processes to validate completeness of all data movement.
  • To what extent have your source data capture processes captured lineage metadata?
    • How pervasive is the use of this metadata within your organization?

ETL is still the most robust and platform portable tool for implementing your source data capture processes

The more you have used ETL to implement your existing source data capture processes for loading your Teradata data warehouse, the easier your conversion will be.   If you have Teradata specific source data load processes you need to modernize, you should consider re-implementing them in ETL.

  • ETL can efficiently access both your on-premises source systems and cloud sources (Salesforce, Google Analytics).
  • ETL can source data using a variety of access methods (SQL query, database log, message queue, or application API) in both batch and real-time.
  • ETL provides source-to-target lineage metadata.
  • ETL vendors now have both on-premises and cloud versions of their toolsets.

It is important to note the above comment applies to the source data capture processes only in a conversion type project.  In a future article, we will discuss ETL in comparison to other approaches like Google Cloud Dataproc or Cloud Dataflow.

Implementing the Landing Zone on Google Cloud Platform

Logical Design

First, let’s show the logical design for the Landing Zone.  This is the portion that reflects the final target (persisted data) of the source data capture process.  This is the component that will be implemented in Google Cloud Platform.

Within the above diagram:

  • Landing Zone – Represents instantiated data located in one or more Google Cloud products.  
  • Snapshots – Represents a specific set of captured source data (finite window).  
  • Run Control – An entity of the Balance and Control schema that records the start of a run to capture data from one or more source systems.
  • Process History – An entity of the Balance and Control schema that records the status of the attempt to capture source data.
  • Balance and Control – An entity of the Balance and Control schema that records the control totals (counts, sum checks) applicable to a specific capture of source data, plus the state of its independent validation.  A hypothetical schema sketch for these control entities follows.
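As a rough illustration only (the actual schema will vary by implementation and by the landing product chosen), the three control entities above might be implemented as BigQuery tables along these lines:

```sql
-- Hypothetical sketch of the Balance and Control schema entities.
CREATE TABLE `control.run_control` (
  run_id         STRING    NOT NULL,  -- one row per capture run
  run_start_ts   TIMESTAMP NOT NULL,
  source_systems ARRAY<STRING>        -- sources captured in this run
);

CREATE TABLE `control.process_history` (
  run_id         STRING NOT NULL,
  source_system  STRING NOT NULL,
  process_status STRING,              -- e.g. STARTED, SUCCEEDED, FAILED
  status_ts      TIMESTAMP
);

CREATE TABLE `control.balance_and_control` (
  run_id             STRING NOT NULL,
  source_system      STRING NOT NULL,
  snapshot_id        STRING NOT NULL, -- identifies the captured slice
  expected_row_count INT64,
  expected_sum_check NUMERIC,
  validation_status  STRING           -- result of the independent validation
);
```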

Implementation Design

Now let’s discuss the key rules for implementing the Landing Zone as the final target in your source data capture processes (a hypothetical landing-table sketch follows this list):

  • What the snapshot represents:
    • For additive-only sources, the snapshot reflects a specific timestamp or aggregate within a time window.
    • For delta sources, the snapshot reflects all changes (CRUD) for a specific timeframe.  Note that the merge of changed data coming from sources with the old state of data does not occur during the landing process.
    • For full replace sources, the snapshot reflects a full set of source data as of the specific extract timestamp.  Like with delta sources, comparison of the latest source data with the previous version does not occur during the landing process.
  • Use minimal transformations:
    • The Landing Zone should represent raw source data as much as possible, as it is the recovery point for when transformation errors occur.
    • Only the transformations needed to support data type, format and nullability constraints are applied.
    • Data quality standardization and cleansing is not performed here; it is performed after the original source data is captured (in a later architectural layer).
  • Only an insert path is implemented; data in the Landing Zone is never updated or deleted except in special circumstances (like a data repair scenario)
  • The Landing Zone is the final target, but you can still utilize local intermediate targets.  For example, you may have IT security constraints that dictate how PII data has to be specifically encrypted and copied to Cloud Storage.  
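Here is the landing-table sketch referenced above: a minimal, hypothetical example of an insert-only landing table, assuming BigQuery is the chosen landing product; all names are illustrative.

```sql
-- Hypothetical sketch: an insert-only landing table in BigQuery.
-- Source columns are kept as raw as possible; only the type/format/nullability
-- constraints required for landing are applied.
CREATE TABLE `landing.src_order_header`
(
  run_id       STRING    NOT NULL,  -- ties the slice back to Run Control
  snapshot_ts  TIMESTAMP NOT NULL,  -- when this slice was captured
  order_id     STRING,
  order_status STRING,
  order_amount STRING,              -- left as STRING; typing is enforced downstream
  raw_record   STRING               -- optional: the untouched source record
)
PARTITION BY DATE(snapshot_ts);

-- Loads only ever append; data is never updated or deleted here.
INSERT INTO `landing.src_order_header`
  (run_id, snapshot_ts, order_id, order_status, order_amount, raw_record)
SELECT 'run-2018-07-15-001', CURRENT_TIMESTAMP(),
       order_id, order_status, order_amount, raw_record
FROM `landing_ext.src_order_header_stage`;  -- e.g. an external or staging table
```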

Other Implementation Considerations

  • As you plan for the implementation of your Landing Zone, you should consider if you have the following requirements:
    • Partitioning the Landing Zone – Snapshots of source data contained in the Landing Zone typically reflect an interval of time, with most subsequent access to Landing Zone data qualified to a specific range of time.  Therefore, it is very common to implement a date based range partitioning scheme (like a Teradata PPI).
    • Archiving and removing snapshots of data – Along with partitioning, consider your requirements for archiving and purging data in the Landing Zone.  Typically, you want to define an SLA for how long snapshots are retained.  The primary reason for retaining data in the Landing Zone, once it is processed into the data warehouse, is to ensure the raw data is still available if any downstream processing errors occurred.  As end users actually use the data warehouse, the risk of discovering undetected transformation errors diminishes over time.  It is a common practice to retain the original snapshots for around 90 days, after which they are archived or purged.  Consider Google Cloud Platform technologies that support partition-level operations (like Teradata DROP RANGE) to avoid the cost of purging data row by row.
    • Join Support – Google Cloud BigQuery is the Google recommended technology for implementing your data warehouse.   Depending on how you design your data model, you may find multiple source tables need to be joined together and loaded in BigQuery as a single data stream (multi-source/single target).  For example, in your logical data model you have ORDER_HEADER, ORDER_ITEM and ORDER_DETAIL entities, but you de-normalize it in BigQuery to a single ORDER entity using nested fields.  In this case, you likely need to join multiple source tables to load the ORDER entity.  In a future article we will discuss BigQuery de-normalization trade-offs, but be aware of the Landing Zone impact for now.      
    • Data Lake Support – As a repository for raw source data, the Landing Zone can be a component of your Data Lake architecture.   As you look at which specific Google Cloud products you want to use to implement the Landing Zone, consider each product’s ability to support your data mining needs.

Physical Design

Now that you have captured the source data and know how to move it to Google Cloud (ETL is still the most robust and platform-portable tool), the next question is where exactly to store this data.  The Landing Zone can be instantiated in one or more Google Cloud products.  Typical requirements that help identify which Google Cloud product to use are:

  • Is the data structured or unstructured?
  • Is there a need to support hyper-changing source data formats / schema?
  • Is there a need for strong data typing, data formatting, and nullability constraints?
  • Is there support for SQL joins or does the API support efficient child record lookups?
  • Can the Google Cloud product easily archive or remove captured source data after a period of time?
  • Is the Google Cloud product compatible with your desired ETL/ELT toolsets?
  • Is the Google Cloud product compatible with your desired data mining tools?
  • Is the Google Cloud product serverless no-ops or does it require you to manage nodes?

Google provides a high-level overview of the Google Cloud storage options on this page.  Let’s review in more detail various requirements and how they map to Google Cloud Storage and Database technologies as they apply to the Landing Zone.  

| Product | Storage Type | Unstructured and Structured Data | Hyper-changing Source Formats / Schema | Strong Typing, Formatting, Nullability | SQL Joins | Archival or Purging | Supported as a Data Mining Source (Data Lake) | Infrastructure Maintenance Effort |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Cloud Storage | Just files | Structured / Unstructured | Yes (simple files) | No | Not natively (1) | Yes | Yes, from BQ (not as fast as BQ) | None |
| Cloud SQL | Relational store (hosted MySQL and Postgres) | Structured | Manual (alter table) | Yes | Yes | Manual | Yes | Medium |
| Spanner | Relational store with interleaved child tables | Structured | Manual (alter table) | Yes | Yes | Manual | Limited | Small (2) |
| BigQuery | Columnar store with repeated nested structs | Structured / Unstructured | Yes (autodetect schema) | Yes | Yes | Yes (time-partitioned tables) | Yes | None |
| BigTable | NoSQL wide table (many columns) | Semistructured (name/value pairs) | Yes (add more columns) | No | Not natively (1) | Yes | Yes, from BQ (not as fast as BQ) | Small (2) |
| Datastore | Object store with name/value pair attributes | Structured | Yes (add more attributes) | No | No | Manual | Limited | None |

  1. “Not Natively” indicates that it can support SQL joins through federated BigQuery capabilities.
  2. “Small” indicates the need to scale up virtual nodes when capacity is exceeded.

Structured and unstructured data

Unstructured data formats that might be loaded into the Landing Zone are JSON, XML, log/text files, or binary formats (images, videos).  Cloud Storage is just a file store, so any type of data can be stored, and it is likely the best option for storing truly unstructured data.  Cloud Spanner also has capabilities to load binary data into columns.  Cloud BigQuery can load raw JSON (or other text) files into a column to be processed later using regular expressions or extraction functions.  It is important to realize that you want the unstructured data to be landed as raw data, not parsed or interpreted into a different format.
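For example, a raw JSON payload can be landed untouched in a STRING column and parsed later; the following hypothetical sketch uses BigQuery’s JSON functions, with illustrative table and field names.

```sql
-- Hypothetical sketch: land raw JSON as-is, then extract fields downstream.
CREATE TABLE `landing.web_events_raw`
(
  snapshot_ts TIMESTAMP NOT NULL,
  payload     STRING              -- the raw JSON document, not parsed at landing time
)
PARTITION BY DATE(snapshot_ts);

-- Later, in a downstream layer, parse the raw payload:
SELECT
  JSON_EXTRACT_SCALAR(payload, '$.event_type') AS event_type,
  JSON_EXTRACT_SCALAR(payload, '$.user_id')    AS user_id,
  snapshot_ts
FROM `landing.web_events_raw`
WHERE DATE(snapshot_ts) = '2018-07-15';
```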

Hyper-changing source data formats

Sometimes source data systems change their data format very frequently.  As an example, consider a web application log file where new data elements are frequently added (or removed) as application functionality changes.  In this situation, it is a maintenance burden to have to make a schema change every time the underlying source data format changes.  Cloud Storage is just a file store; it is agnostic to the file format being stored. Cloud BigQuery has the capability to adjust table format on the fly based on new fields appearing in the data extracts.  Cloud BigTable can accommodate a virtually unlimited number of columns, does not require a predefined schema before writing, and can accommodate changing data formats easily.   However, Cloud Spanner and Cloud SQL require explicit DDL execution on existing tables to accommodate new data formats.

Strong data typing, data formatting, and nullability constraints

Cloud Spanner and Cloud SQL have pre-defined schemas with strong data types.  Cloud BigTable uses key/value pairs with no support for data typing.  Cloud BigQuery supports either explicit data typing where table schema is predefined up front, or implicit data typing where it relies on the source format for supplying the data type.  

Join support (multi-source/single target)

Cloud Spanner, Cloud SQL and Cloud BigQuery support all the typical ANSI SQL join types (inner, outer).  Cloud BigTable does not support joins; you would have to combine multiple sources using the API or ETL look-ups.   Cloud Storage is just a file store, so no joins are supported unless data retrieval is federated through Cloud BigQuery.  Note that there are certain performance implications of performing join operations on massive tables that can be improved through schema design.
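As a sketch of the multi-source/single-target case (the ORDER example mentioned earlier), two landing tables can be joined and nested in a single BigQuery SQL statement; all names are illustrative.

```sql
-- Hypothetical sketch: join ORDER_HEADER and ORDER_ITEM landing tables and
-- nest the items under each order as a repeated struct.
SELECT
  h.order_id,
  h.order_date,
  h.customer_id,
  ARRAY_AGG(STRUCT(i.item_number, i.product_id, i.quantity, i.item_amount)) AS order_items
FROM `landing.order_header` AS h
JOIN `landing.order_item`   AS i
  ON i.order_id = h.order_id
GROUP BY h.order_id, h.order_date, h.customer_id;
```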

Archiving or removing captured source data after a period of time

Cloud Storage has lifecycle configuration capabilities to move data from Nearline to Coldline storage to reduce storage fees, or files can be deleted altogether by a custom process.  Cloud BigQuery table partitions that have not been modified for 90 days are billed at reduced long-term storage rates, or can be dropped altogether by a custom process.  Data archiving and purging can be easily scripted using command line tools to delete either Cloud Storage folders or Cloud BigQuery table partitions for a specific date range.  Cloud BigTable tables and column families can have TTL (time to live) semantics configured so that values are stored only for a specific amount of time, after which they are automatically purged opportunistically by a background process.  Cloud Spanner and Cloud SQL support row-level DELETE statements that can be executed for a subset of rows.  Remember to favor the technologies that support archival or removal at the most effective Google Cloud pricing model.
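In BigQuery, for example, the retention and purge policies described above can be expressed along the lines of the hypothetical sketch below (the 90-day figure matches the retention SLA discussed earlier; table and column names are illustrative).

```sql
-- Hypothetical sketch: enforce a 90-day retention SLA on a partitioned
-- landing table, and purge an older date range explicitly when needed.

-- Option 1: let BigQuery expire old partitions automatically.
ALTER TABLE `landing.src_order_header`
SET OPTIONS (partition_expiration_days = 90);

-- Option 2: purge a date range explicitly with a partition-qualified DELETE.
DELETE FROM `landing.src_order_header`
WHERE DATE(snapshot_ts) < DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY);
```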

Compatibility with desired ETL/ELT toolsets

You need to consider which Google Cloud Storage technologies are supported by the toolsets you plan to utilize to implement your data load processes, initially as a target and subsequently as a source to transform and load your data into your Cloud BigQuery data warehouse.   

In the Teradata conversion scenario, we have already recommended ETL as the most robust approach for implementing source data capture.  But as we move on to loading your Cloud BigQuery data warehouse, the good news is that there is a wide variety of both Google and third-party toolsets, and both ETL and ELT style processing can be used.  Make sure your desired ETL and/or ELT toolset, such as Informatica, Talend, SnapLogic, or Cloud Dataflow, supports your Landing Zone as a source and Cloud BigQuery as a target.  In a future article we will discuss ETL/ELT in comparison to other approaches like Cloud Dataproc or Cloud Dataflow.

Compatibility with desired data mining tools

If you plan on using the Landing Zone as a component of your data lake, consider the ability of the underlying Google Cloud storage product to support your data mining toolsets.  

Google Cloud Dataprep, Google Data Studio, and Google Cloud Datalab provide robust mining capabilities and can connect to all Google Cloud storage products.

Cloud BigQuery can directly access Cloud BigTable and Cloud Storage to allow a federated query of data in the Landing Zone with your data warehouse in Cloud BigQuery.  This is a frequent data mining need.

Numerous third-party products such as Looker, iCharts, and Tableau provide various degrees of connectivity to the various Google Cloud storage products and are evolving rapidly through 2017.

Requirement to manage nodes

In a Teradata conversion scenario, the ability to migrate from a managed MPP environment to a serverless one may be a significant factor in your conversion ROI, at least for the data warehouse.   If the requirement also applies to the Landing Zone, then consider that Cloud Storage and Cloud BigQuery are completely serverless, no-ops environments.  There are no compute instances or nodes to manage and they are “hands-off” scalable.  Cloud BigTable and Cloud Spanner require you to manage the number of virtual nodes (note: these are NOT virtual machines/compute instances).  Node scaling can be automated, but still requires an explicit effort to monitor and execute resizing when needed.  Cloud SQL has higher infrastructure management needs, as you are managing an instance of either MySQL or PostgreSQL, just in the Google Cloud.

Sample scenarios for Landing Zone implementations

Now that we have explored the various conversion-related requirements for the Cloud storage products and reviewed the relevant Google Cloud product capabilities, let’s look at two specific examples.

| Examples | Existing Landing Zone Sourcing Technology | Landing Zone Storage Recommendations |
| --- | --- | --- |
| Relational databases and flat files are first landed into Teradata landing tables through ETL/ELT. Unstructured data (web logs) is landed first to HDFS, then into Teradata landing tables through ETL/ELT. | ETL | Cloud Storage, Cloud BigQuery, Cloud Spanner, Cloud SQL, Cloud BigTable |
| Relational databases and flat files: shell scripts connect to the source database and issue SQL commands to extract the required data and generate flat files. All source data is ultimately written to flat files in a variety of formats. The shell scripts then call either the MultiLoad or TPT utility to load the data into landing tables. Unstructured data (web logs) is landed first to HDFS, then into Teradata landing tables through Teradata data load utilities. | Teradata Utilities, Shell Scripting and Scheduler | Cloud Storage |

⇒ Wrap-up, what’s next

In this second article of the series we explored the details of implementing the Landing Zone:

  • How to evaluate your existing source data capture processes
    • Can they really be converted?
  • What is the logical design of the Landing Zone
  • What are the considerations for deciding how to implement the Landing Zone in Google Cloud

In the next (third) article of the series, we will start to discuss the data warehouse layer in the reference architecture.  We will focus on some of the early data modeling decisions you will need to make, as well as use cases that will require a staging area prior to loading the actual data warehouse.


Article 1: Understanding Your Current Data Warehouse

This is the first of a series of articles, collaboratively written by data and solution architects at Myers-Holum, Inc., PRA Health Sciences, and Google, describing an architectural framework for conversions of data warehouses from Teradata to the Google Cloud Platform.  The series will explore common architectural patterns for Teradata data warehouses and outline best-practice guidelines for porting these patterns to the Google Cloud Platform toolset.

 

David Chu and Knute Holum, Myers-Holum, Inc.

Michael Trolier, Ph.D., PRA Health Sciences

June 2018

 

The Opportunity:  Modern cloud infrastructure offers a compelling opportunity to reimagine demanding data applications.  Large data warehouses, long constrained by prior technology, can be good candidates for migration to the cloud.  Google Cloud Platform offers a suite of fully-managed, elastically-scalable database services suitable for re-platforming large data warehouses for enhanced performance at reduced cost and complexity.

 

Begin with this: understand what you have

If what you have is an enterprise data warehouse implemented on a Teradata platform, quite likely it’s big, and it’s complex.  It may also be expensive.  There may be multiple architectural layers with hundreds or thousands of tables, relationships, and data transformations, representing a very significant investment of intellectual effort.  If you are fortunate, the architectural patterns are evident and consistent, and the business and technical requirements that drove the implementation are well-known and clearly documented.   You know whether those requirements still apply, or whether they have been superseded.  Understand and respect the scale of what you have.

Consider where your data warehouse might come to reside.  Google Cloud Platform supports several SQL-capable data services: Cloud SQL, Cloud Spanner, and BigQuery, each of which offers capabilities relevant to a data warehousing environment.  BigQuery in particular can be very fast at remarkable scale, though it has architectural features that differ significantly from a conventional RDBMS platform.  

In this series of articles, we’ll walk through the details of getting from here to there.


Know what you’re in for

There’s no getting around it: you’re in for a pretty big job.   Whether it’s pretty big and manageable or daunting depends mostly on how well you understand your current data warehouse.  If you’re fortunate, your current system is well-documented, you have mature data management practices, and you have strong institutional knowledge of the requirements, structure, and evolution of the current system.  Plan to spend time on architecture and planning up front.  Plan an incremental conversion, testing carefully at each stage.  Plan to run the current and migrated systems in parallel for a period of time commensurate with your need for cross-validation.


Dust off your reference architecture before migrating to the cloud.

If you are fortunate, your data warehouse has a foundation – an intentional alignment between business requirements and information management best practices.  This bedrock design, expressed independently of technology or vendor, is your reference data architecture.  It informed the selection of your technology stack – including the Teradata platform, and your ETL/ELT, business intelligence, and data analysis toolsets – and provided the roadmap for implementing the data warehouse using those tools.  

If you are fortunate, uncovering this architecture is not an archeological exercise.   Examples:

There may have been a business requirement to retain and access all historical data as they appeared at the time they were originally recorded.  Following best practices:

  • you implemented effective dates on all data warehouse tables to support point-in-time access.

There may have been a business requirement to track all financial metrics back to the source transaction.  Following best practices:

  • you implemented data lineage tracking columns on all data warehouse tables so any reported metric could be traced back to the source transaction(s) it was computed from.
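
To make the two examples above concrete, here is a minimal sketch in BigQuery Standard SQL.  The table and column names are hypothetical (not taken from any particular implementation); the point is simply what effective-dating and lineage columns look like, and how a point-in-time query uses them.

  -- Hypothetical warehouse table with effective dating and lineage tracking.
  CREATE TABLE IF NOT EXISTS dw.customer_dim (
    customer_key      INT64,
    customer_name     STRING,
    -- effective dating for point-in-time access
    effective_from_ts TIMESTAMP,
    effective_to_ts   TIMESTAMP,   -- current rows carry a far-future timestamp
    is_current        BOOL,
    -- lineage back to the source transaction
    source_system     STRING,
    source_record_id  STRING,
    etl_workflow_id   STRING,
    etl_session_id    STRING,
    etl_load_ts       TIMESTAMP
  );

  -- Point-in-time query: the customer dimension as it looked on 2018-01-31.
  SELECT customer_key, customer_name
  FROM dw.customer_dim
  WHERE TIMESTAMP('2018-01-31') >= effective_from_ts
    AND TIMESTAMP('2018-01-31') <  effective_to_ts;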

Once you understand the reference architecture that underlies your current implementation, ask whether it still applies.  Make no mistake: migrating a large-scale data warehouse to the cloud will require some effort, and you want to end up where you need to be now (or as far into the future as your crystal ball reveals), not where you needed to be a decade or two ago.  By returning to your reference data architecture and the business requirements driving it, you can:

  • weed out business requirements – and any related architectural and implementation components – that are no longer needed;
  • recognize business requirements that are not well met by the Teradata implementation;
  • take advantage of Google Cloud Platform technology features that might be easier to implement than the current Teradata implementation.

There’s more to migrating to the cloud than mapping technology features; you need more than a forklift.  A conversion approach that fails to recognize that key architectural requirements may have changed over the years, and therefore does not bother to define a future-state reference data architecture, will likely not be successful.  

 

Understand and validate the key reference data architecture components.

Like a good birthday cake, most data warehouses – implemented on Teradata or otherwise – have three architectural layers.  Unlike most cakes, these layers are logical in nature and distinct by design, with each serving a specific role within the warehouse.  The three layers are:

  • a source data capture layer;
  • a central (or “active”) data warehouse layer; and
  • an end-user consumption (or semantic) layer.

In some Teradata data warehouse implementations, only one of these layers (the active data warehouse) exists as a physical datastore.

As you consider migrating your data warehouse to the cloud, you must ask the same questions for each layer independently: what business and technical requirements influenced the existing implementation; how (and how well) does the implementation satisfy the requirements; whether the requirements still apply; and how you should implement that layer going forward in Google Cloud.  The extent to which you created clear architectural guidelines and associated development standards around each layer will be a large factor in the ease of converting your Teradata implementation.   Even so, some implementation patterns will port far more readily than others; there are some cases where significant redesign may be required.

Let’s consider each architectural layer in turn.

Source data capture layer

This is the first hop; your existing Teradata implementation first captures data from various sources of record, then either stages it or loads it directly to your data warehouse.  There are a variety of approaches for each process; your implementation may use a consistent style or a mix of techniques.  Data capture might occur via:

  • Pull – you extract data directly from the source of record by accessing its database (or mirror, replica, or database log) on some periodic basis;
  • Push – you get a periodic set of data from the source of record (e.g. via flat files) and do not access its database; or
  • Stream – you get an episodic or real-time (not periodic) feed of data via a streaming interface (messages, queues, etc.).

Your Teradata implementation will then land the data somewhere, typically via one or more of these approaches:

  • You land it in a file system for later use by a Teradata utility (FASTLOAD, TPT);
  • You stage it in work or temporary tables in a relational database (or NoSQL or Hadoop);
  • You land it in permanent database tables so that source data is retained close to its original format; or
  • You use ETL and don’t land it at all, but rather transform it in-flight to its final data warehouse target tables.

Business requirements influence the implementation style of the source data capture layer.  Typical considerations are the required latency – how quickly changes in sources must be reflected in the data warehouse – and requirements to trace lineage and retain source data.  Technical constraints are also important.  On the data capture side, the capture strategy (pull, push, stream) is typically dictated at least in part by the capabilities of the source application.   The choice of landing strategy is typically influenced by the capabilities of the tool or tools that will consume the landed source data.

One of the best ways to avoid unnecessary complication in your data warehouse conversion is to keep most of your current source data capture processes in place.  While a capture strategy (e.g. push) for a specific source of record may not meet the current business need (for example, the business now wants streaming), a change to this strategy should be avoided unless it is implemented on the Teradata side first.  How would you perform side-by-side conversion validation if the captured source data are not the same?

After you’ve considered data capture, you’ll want to characterize how the captured data are landed, and whether your implementation includes reference data architecture components that support a unified landing model regardless of the capture method and source of record.  For example, MHI uses a unified landing model we typically call “landing zone”, and we persist that model in a relational database and distributed storage.   All data capture activities target the “landing zone”, and control processes manage the accessibility and lifespan of slices of captured data.  

It is important to realize that this initial landing of data is the point where you want the data copied to Google Cloud.  A standardized approach will provide operational benefit.  Note that this standardized approach is essentially a data lake, and thus would also provide analytical benefit, particularly if you have an experienced data mining / data analysis / data science team accustomed to accessing a data lake.
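
A hedged sketch of this idea in BigQuery Standard SQL (bucket, dataset, and table names are hypothetical): files landed in Cloud Storage can be exposed to BigQuery as an external table, so the landing zone is immediately queryable in data-lake fashion.

  -- Hypothetical external table over customer files landed in Cloud Storage.
  CREATE EXTERNAL TABLE landing.customers_raw (
    customer_id   STRING,
    customer_name STRING
  )
  OPTIONS (
    format = 'CSV',
    uris = ['gs://example-landing-zone/customers/2018-06-30/*.csv'],
    skip_leading_rows = 1
  );

  -- The landed slice is queryable as soon as it arrives.
  SELECT COUNT(*) AS landed_rows FROM landing.customers_raw;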

The landed data typically anchor the beginning of data lineage – the tracing of data elements from the source through the warehouse’s architectural layers and transformations.  This is frequently a key business requirement and is foundational for effectively validating warehouse data.  A typical MHI “landing zone” table for a batch (ETL) process would satisfy this requirement by appending a number of “ETL” attributes that support process and lineage tracking.  Representative attributes might carry identifiers to track individual rows, session execution, and workflow execution; these values would then propagate to downstream architectural layers.  This allows each data element to be tracked unambiguously through the warehouse.
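
Continuing the hypothetical example above, a landing-zone table with appended ETL tracking attributes might look like the sketch below.  The attribute names are illustrative only, and the session and workflow identifiers would normally be supplied by the orchestration tool rather than hard-coded.

  -- Hypothetical landing-zone table: source attributes plus appended "ETL" attributes.
  CREATE TABLE IF NOT EXISTS landing.customers (
    customer_id     STRING,     -- source attributes, close to their original format
    customer_name   STRING,
    etl_row_id      STRING,     -- identifier for the individual landed row
    etl_session_id  STRING,     -- identifier of the load session execution
    etl_workflow_id STRING,     -- identifier of the workflow execution
    etl_batch_date  DATE,       -- the slice of captured data this row belongs to
    etl_load_ts     TIMESTAMP   -- when the row was landed
  );

  -- Populate the landing table from the external table, stamping lineage values.
  INSERT INTO landing.customers
  SELECT
    r.customer_id,
    r.customer_name,
    GENERATE_UUID(),            -- row identifier
    'sess-2018-06-30-001',      -- in practice, injected by the orchestration tool
    'wf-customer-load',         -- in practice, injected by the orchestration tool
    DATE '2018-06-30',
    CURRENT_TIMESTAMP()
  FROM landing.customers_raw AS r;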

If you are using a variety of landing methods (work tables, flat files, etc.) without regard to any architectural standardization, your conversion becomes more complicated, and the difficulty of tracing data lineage increases significantly.

At the end of this analysis of your source data capture layer, you will create a catalog for the layer, enumerating source systems, data volumes, latency requirements, architectural patterns, and source capture and landing approaches.  Each variation will require a specific conversion approach.

Central Data Warehouse Layer

The primary business requirement for this architectural layer is integration – preparing and reconciling information from disparate sources in an internally-consistent way to support global access and analysis.  There has been an active conversation in the warehousing community for many years about the “best” way to achieve this goal, with convictions of near-religious fervor evident from prominent voices.  Fortunately, we don’t need to identify a “best” approach; we simply need to understand the common design idioms, and know how to use the Google Cloud Platform toolset to support each.

Among Teradata implementations, two architectural styles are most common.  In the first, commonly referred to as an “Operational Data Store” (ODS), the data warehouse is modeled after the system of record.  Warehouse tables look like the source system tables – and there are probably as many variations of the representation of each data entity (e.g. a reservation or a travel itinerary) as there are source systems.  The layer may support versioning (the time series of changes) through the addition of effective date attributes, a sequence number, and a current-record indicator.  The layer may further support architectural controls through the addition of lineage- and/or process-tracking attributes.  This architectural style typically relies on a robust semantic layer to achieve significant integration of data from separate sources.

In the second common Teradata architectural pattern, the data warehouse is modeled after a canonical industry model, whether purchased (e.g. a Teradata industry data model), based on a design pattern (e.g. Silverston) or created in-house.  This style is often referred to as a Teradata “Active Data Warehouse”.  In this case, the data warehouse tables follow a canonical model designed to support current- and future-state business requirements to the greatest extent possible.  Versioning of data is typically supported robustly.  The schema is typically highly normalized (e.g. 3rd-normal-form), and the consumption model relies on the power of the Teradata platform to allow downstream applications to directly and efficiently access the data warehouse while preserving a single copy of the data.  Those downstream applications are typically SQL access (SQL Assistant), BI applications or ELT/ETL processes that feed downstream applications.  This is the style of the Inmon “Corporate Information Factory”.

Two additional architectural patterns are encountered rather rarely on Teradata; we include them here for completeness.  Both would be robustly supported by the Teradata architecture.  One is the Kimball “Bus Architecture”, in which the data warehouse is modeled as a series of data marts (star schemas) integrated via conformed dimensions.  The other is the “Data Vault”, in which data and relationships are cast into a design pattern based on “hubs”, “satellites” and “links”.  

For all their differences, a common feature of these various architectural patterns is the need to handle CRUD (create-read-update-delete) data and manage versioning of information, to satisfy the common business requirement to faithfully represent historical data.  This will be a key consideration for any conversion of a data warehouse to Google Cloud Platform.  
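
As one hedged illustration (reusing the hypothetical dw.customer_dim and landing.customers tables sketched earlier), versioning can be handled in BigQuery with plain DML: close out the current version of any changed row, then insert the new version.  This is only one of several possible patterns.

  -- Step 1: close out the current version of customers whose attributes changed
  -- in the latest landed slice.
  UPDATE dw.customer_dim t
  SET effective_to_ts = CURRENT_TIMESTAMP(),
      is_current = FALSE
  WHERE t.is_current = TRUE
    AND EXISTS (
      SELECT 1
      FROM landing.customers s
      WHERE s.customer_id = t.source_record_id
        AND s.customer_name != t.customer_name
    );

  -- Step 2: insert a new current version for brand-new customers and for those
  -- whose previous version was just closed in step 1.
  INSERT INTO dw.customer_dim (
    customer_key, customer_name,
    effective_from_ts, effective_to_ts, is_current,
    source_system, source_record_id, etl_workflow_id, etl_session_id, etl_load_ts
  )
  SELECT
    FARM_FINGERPRINT(s.customer_id),          -- illustrative surrogate key choice
    s.customer_name,
    CURRENT_TIMESTAMP(), TIMESTAMP('9999-12-31'), TRUE,
    'crm',                                    -- hypothetical source system code
    s.customer_id,
    s.etl_workflow_id, s.etl_session_id, CURRENT_TIMESTAMP()
  FROM landing.customers s
  LEFT JOIN dw.customer_dim t
    ON t.source_record_id = s.customer_id AND t.is_current = TRUE
  WHERE t.source_record_id IS NULL;           -- no current version exists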

We will cover conversion approaches for each of these architectural patterns in future articles.  

Right now, though, we’ll return to business requirements. The data-architecture style you choose for your Google Cloud data warehouse implementation will be a key decision, and it is important to look at business requirements that might dictate one style over another.

Key reasons businesses model their data warehouse like the source of record:

  • It is a more agile (bottom-up) approach, easiest to model and develop.
  • The organization is frequently changing operational systems (hyper-growth mode).
  • The organization has compartmentalized and/or reactive focus on data with little focus on uniform data governance.  End-user departments just want the source data.

Key reasons businesses model their data warehouse using an industry (canonical) model:

  • Perhaps because of acquisitions, there are multiple sources of record (operational systems) for the same data entities, for example, data must be integrated from multiple active ERP systems.
  • The principal source of record is very de-normalized but supplies many data entities (e.g. mainframe files like VSAM or IMS).
  • In both cases above, a canonical model is needed to provide a consistent interpretation of the source data for consumption.
  • The organization has more of a data governance focus (maturity), and is using the canonical model to implement standardized data definitions.

There is an intrinsic design relationship between your central data warehouse and your semantic layer; you need to understand this relationship.  If you want integrated data, data discipline has to be imposed at some stage – typically either entering or exiting the central layer.

Thus, the more your central data warehouse is modeled after the system of record, the more sophisticated your semantic layer should be.  Since there has been little transformation of source data up front, the burden of deciphering, cleansing and standardizing source data values occurs as it is consumed from the data warehouse.  Hopefully, that transformation has been standardized in either a series of semantic database views implemented on top of the data warehouse tables or in the BI tool’s semantic layer.  Otherwise, each end user of the data warehouse probably has specific (and probably different) methods for accomplishing that transformation, using various end-user oriented tool sets, which will all need to be evaluated for the conversion.

Conversely, the more your central data warehouse is modeled using a canonical model, the less sophisticated your semantic layer needs to be.  While there may be database views that are used for access control, row-level security or multi-tenancy requirements, it would be expected that the database tables in a canonical model essentially meet the consumption requirements.  The transformation of source data – deciphering, cleansing and standardizing – should occur as data is loaded into the canonical model.  You may still have a semantic layer that focuses on ease and consistency of use, perhaps projecting the canonical model as a star or snowflake schema and resolving typical effective-date filters.  Look for this layer to be implemented either as database views or within your BI tool.

At the end of this analysis of your central DW layer, you should clearly understand the strengths and weaknesses of your current implementation style, and understand whether that style is likely to meet your current and foreseeable-horizon business requirements.  

Semantic layer

A semantic data model (SDM) captures the business view of information for a specific knowledge-worker community or analytic application.  It provides a consistent, governed, intentional view of the data for a group of consumers, typically masking the complexity of the physical representation of the data to make it easier and less error-prone to consume.  There may be a different semantic data model for each department/application that uses the data warehouse.  One way to look at your semantic layer is as a formal representation of the metadata that gives defined, consistent meaning to the data elements that populate your warehouse.

Semantic data modeling is a logical data modeling technique.  The prevalence and intuitive nature of dimensional modeling makes it particularly well-suited to, and commonly encountered for, the semantic data model for an analytic application.  In Teradata implementations, the semantic data model will typically be physicalized using either database views or the semantic capabilities of the BI tool; the platform’s data-retrieval engine typically supports efficient access to the central layer without physicalizing an intermediate star schema.  In contrast, in a full semantic layer implementation, the data warehouse itself is never directly accessed by downstream applications.

The semantic layer can provide data cleansing, and apply business rules, de-normalizations, transformations, and/or calculations before exposing data for downstream applications.
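
A minimal sketch of such a semantic object in BigQuery (view and table names are hypothetical): a view over normalized warehouse tables that resolves current-record filtering, flattens the join, and adds a simple derived measure.

  -- Hypothetical semantic view presenting a consumer-friendly shape.
  CREATE OR REPLACE VIEW semantic.sales_current AS
  SELECT
    f.order_id,
    f.order_ts,
    c.customer_name,
    p.product_name,
    f.quantity,
    f.quantity * f.unit_price AS gross_amount   -- derived measure / business rule
  FROM dw.order_fact f
  JOIN dw.customer_dim c
    ON c.customer_key = f.customer_key AND c.is_current = TRUE
  JOIN dw.product_dim p
    ON p.product_key = f.product_key AND p.is_current = TRUE;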

When considering conversion to Google Cloud, it is important to understand how the semantic layer has been implemented, and the extent to which it is used:

  • How much end user access is performed without the semantic layer?
  • Has the semantic layer data model been maintained?

At the end of this analysis of your semantic layer, you should know whether the semantic layer or the central data warehouse layer is carrying the bulk of the burden of integrating and standardizing your data.  You should also have a good idea how much of the consumption of data from the warehouse goes through governed channels, and how much makes an end-run around your best efforts to standardize.


Start looking ahead

These are the highlights of the typical implementation patterns we see for Teradata data warehouses.  This overview should give you a good start in understanding where your own data warehouse fits into the grand scheme – what patterns it follows, how consistently it has been implemented, and how effective it is in supporting consumers, integrating data from disparate sources, and standardizing data to a consistent “enterprise view”.  You are probably starting to get a sense for how much work a conversion to the cloud is likely to be.


What’s next?

In our next article, we’ll discuss the process of articulating a reference architecture that enables your desired future state environment.  That architecture will:

  • be justified back to the original implementation and new business requirements;
  • be logical only, without regard to specific Google Cloud Platform or other vendor products;
  • establish a basis for a single version of the truth;
  • establish a basis for data quality and data auditability;
  • be layered, with each layer having a clear purpose, contract, and performance expectations.

Consider whether new business requirements should be addressed, e.g.:

  • Need to support rapid data analysis or agile BI;
  • Need to intermingle structured and unstructured data.

Again, we’ll focus on the three architectural layers of the warehouse in turn.  Highlights:

Source data capture: You want a consistent architectural design for source data capture and retention. Keep most of your current source data capture processes in place.  Reconcile your source data capture layer with any data lake initiatives you may have implemented or been considering. (See Article 2: Source Data Capture.)

Central data warehouse: Identify any specific business requirements that would require a central data warehouse implementation in Google Cloud.  Understand any ACID requirements for updates.  Reconcile your data warehouse with any MDM initiatives or implementations.

Semantic layer: You want a formal (logical) semantic layer in your architectural design that’s linkable to your business glossary and other data governance artifacts.  This will be the basis for the physical implementation of your consumption layer on Google Cloud.

This analysis is the basis on which you’ll determine the cost, timeline and project approval model (including ROI) for a migration to the cloud.

After that, we’ll dive into specific migration patterns.

Teradata Migration to Google BigQuery

As a Teradata Enterprise Data Warehouse owner, are you tired of:

  • buying and hosting expensive proprietary hardware,
  • patching operating systems,
  • installing specialized database software,
  • managing database servers,
  • tuning database parameters,
  • planning upgrades and downtime,
  • worrying about increased data load times and ever-increasing data consumption needs,
  • dreading that multi-million-dollar yearly license renewal?

4 Facts about Google BigQuery

  • Did you know that a Google BigQuery project comes with a default of 2,000 query execution slots that can be extended upon request?  Does your on-premise Teradata Data Warehouse infrastructure have 2,000-slot elasticity to run analytical queries?
  • Did you know that Google Cloud Platform bills separately for storage and query data processing?  Are you overpaying Teradata for either compute or storage capacity due to static hardware configuration and pricing models?
  • Did you know that Google Cloud Platform utilizes a petabit-scale network to distribute your data across multiple regions for redundancy and high availability?  Are you worried about your Teradata cross-data-center Data Warehouse failover and disaster recovery?
  • Do you know how much time, effort and resources you are spending on managing Teradata Data Warehousing on-premise infrastructure complexity instead of focusing on data, insights and your customer?

Advantages of Google BigQuery

Migration of very large Data Warehouses from a Teradata platform to Google BigQuery offers significant potential advantages:

  • The elastic scalability of the cloud infrastructure eases cost/performance tradeoffs
  • Data ingestion patterns can be simplified
  • Integration with sophisticated cloud-based analytical toolsets is readily supported
  • A serverless NoOps environment removes the infrastructure maintenance burden, allowing you to refocus resources on data and business insights

Deep Dive

Read our blog here, where a series of articles, collaboratively written by data and solution architects at Myers-Holum, Inc. and Google, describes an architectural framework for conversions of data warehouses from Teradata to the Google Cloud Platform.  The series explores common architectural patterns for Teradata data warehouses and outlines best-practice guidelines for porting these patterns to the Google Cloud Platform toolset.

Myers-Holum is here to help

Myers-Holum can assist with navigating considerations of performance, cost, reliability, and security for including cloud platforms in your mix for data warehouse deployments.

MHI uses a model-based approach and metadata-driven tools to efficiently migrate data warehouse components from traditional to cloud platforms, translating schemas and ingestion and consumption processes for optimal performance in the new architecture. We maintain high standards for metadata integrity and governance, data lineage, and code discipline.

Assessment

We evaluate where it makes sense to include cloud platforms in your Data Warehouse environment, and assess the complexity of making the migration.  The Assessment focuses on core business requirements, the three different Teradata data warehouse implementation architectural styles, any canonical data models in use, the layer architecture, and the semantic layer implementation.  We then review existing batch and streaming source data capture in order to preserve your existing investment.  We analyze data consumption patterns, including frequency, resources, and volume.  Finally, we review the Data Governance programs you have in place.

Future State

We propose the Google Cloud Platform products to be used and the best practices to be applied based on the assessment.  We define data modeling patterns and detail examples for converting the Teradata semantic layer and star schemas into BigQuery repeated nested structures.  We suggest a source data capture approach (ETL vs. ELT vs. Dataflow's unified programming model) and tooling within Google Cloud Platform, utilizing native Cloud Dataflow capabilities and/or 3rd-party integration tools, with lift-and-carry as much as possible.  For data ingestion into Google BigQuery, we define connectivity to on-premise and cloud data sources.  For data consumption, we recommend a best-of-breed approach, using either your existing analytics and reporting tools or newly available analytics tooling to democratize data analytics.  We define the data security and access models and the auditing approach for enterprise data in the cloud, and suggest adjustments to your Data Governance programs for the cloud.  Finally, we recommend aspirational machine learning opportunities for data insights utilizing Cloud ML, TensorFlow, and the Google Cloud AI APIs.
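
To illustrate the repeated nested structures mentioned above (a hedged sketch; the table and field names are hypothetical), a star-schema order header and order line pair can be collapsed into a single BigQuery table using an ARRAY of STRUCTs, and flattened at query time with UNNEST:

  -- Hypothetical nested/repeated schema: order lines embedded in the order row.
  CREATE TABLE IF NOT EXISTS dw.orders_nested (
    order_id    STRING,
    order_ts    TIMESTAMP,
    customer_id STRING,
    line_items  ARRAY<STRUCT<
      product_id STRING,
      quantity   INT64,
      unit_price NUMERIC
    >>
  );

  -- Query nested data by flattening the line items with UNNEST.
  SELECT
    o.order_id,
    li.product_id,
    li.quantity * li.unit_price AS line_amount
  FROM dw.orders_nested AS o,
       UNNEST(o.line_items) AS li;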

Roadmap

We work with your expert staff to create a business, financial, architectural, and technical roadmap for migrating your data warehouse to the cloud.  Special attention is paid to ROI and to iterative delivery, to show progress early and often.

Implementation

We carry out the migration following a carefully planned, staged implementation strategy, delivering real business benefit at each stage.

Contact Us

Contact us at cloudinfo@myersholum.com or 646.844.4493 to learn more about Teradata to BigQuery migrations!