I dedicate this article to the persons who requested an article on planning a data warehouse in response to my article "Accessing OLAP using ASP.NET."This article starts with an introduction to data warehouse and further proceeds to the design-process of a data warehouse and concludes by giving some best practices.
Click this link to access the article "Accessing OLAP using ASP.NET."
Introduction
Databases store information about business transactions, plus other data such as employee records. Those types of systems are called online transaction processing (OLTP) databases. OLTP data contains a wealth of information that can help you make informed decisions about your business. The process of analyzing your data for that type of information, and the data that results, are collectively called Business Intelligence (BI).
A Data Warehouse is an enterprise-wide solution for data collection and analysis to meet the requirements of Decision Support Systems. A Data Warehouse is a complete architecture, and as such, requires a rigorous, yet iterative design approach and development methodology to ensure successful deployment. A data warehouse stores current and historical data from disparate operational systems (i.e., transactional databases) into one consolidated system where data is cleansed and restructured to support data analysis.
The first step in building a successful data warehousing application is to identify the specific information-based problems that are causing the organization the most amount of pain. These typically include:
The inability to extract data from multiple disparate data sources and resolve differences in data definitions
A lack of high quality data on which to base critical business decisions
No "single version of the truth" for business rules and data definitions
Inability to share consistent information across business divisions
A proliferation of non-integrated, "stovepipe" applications
The inability to generate consolidated and reconciled financial and other business reports
Data warehousing technology is ideally suited to solving all of these problems and more. Data integration platforms can be used to access a wide range of heterogeneous data sources, resolve the inconsistencies between these sources of data, and populate target databases.
Business intelligence (BI) tools and analytic applications may be used to access the target databases to support query, reporting, and analysis of the data. Metadata, generated and maintained by the data integration platform, may be used to create a centrally managed definition of business rules and entity definitions. And it is now possible to integrate data warehousing solutions with real-time systems-including real-time click stream analysis, real-time analytic applications, and EAI infrastructures-to drive real-time business responsiveness.
Typical relational databases which were designed for on-line transactional processing (OLTP) do not meet the requirements for effective on-line analytical processing (OLAP). As a result, data warehouses are designed differently than traditional relational databases. As the value of data warehouses and their associated OLAP capabilities have increased, comprehensive Meta data management has proven to be a vital element to the success or failure of data warehouses. Without accurate Meta data, a data warehouse rapidly becomes unmanageable and ineffective.
Meta data is literally "data about data". It describes the kind of information in the warehouse, where it stored, how it relates to other information, where it comes from, and how it is related to the business. The topic of standardizing meta data across various products and applying a systems engineering approach to this process in order to facilitate data warehouse design is what this project intends to address.
We intend to create a design process for creating a data warehouse through the development of a metadata system. The Meta data system we intend to create will provide a framework to organize and design the data warehouse. Using a systems engineering approach, this process will be used initially to help define the data warehouse requirements and it will then be used iteratively during the life of the data warehouse to update and integrate new dimensions of the warehouse. We will illustrate our process with a case study using an advertising/product-based business as the data warehouse application.
In order to be effective, the user of the data warehouse must have access to Meta data that is accurate and up to date. Without a good source of Meta data to operate from, the job of the analyst is much more difficult, and the work required for analysis is compounded significantly.
Understanding the role of the enterprise data model relative to the data warehouse is critical to a successful design and implementation of a data warehouse and its Meta data management. Some specific challenges which involve the physical design trade-offs of a data warehouse include:
Granularity of data - refers to the level of detail held in the unit of data
Partitioning of data - refers to the break-up of data into separate physical units that can be handled independently
Performance issues
Data structures inside the warehouse
Migration
Meta data itself is arguably the most critical element in effective data management. Effective tools must be used to properly store and use the Meta data generated by the various systems. This paper outlines a Systems Engineering approach to designing and maintaining a data warehouse, emphasizing the key element of metadata. This approach is then used as a template for a case study of the development of an actual data warehouse in a product based business. Finally, a metadata system is created and explained as a tool for designing and updating a data warehouse.
The Systems Planning Phase of the life cycle communicates an overall vision for the data warehouse activity and its role in the organization's daily and weekly life. Decisions made during this phase have significant impact on the implementation, scope, and size of the effort. It begins with the identification of a need and then defines timeliness, tasks and deliverables. Then it proceeds through the following key planning decisions:
Select an Implementation Strategy
Generally, a Top Down approach is useful for projects where the technology is mature and well understood, as well as where the business problems that must be solved are clear and well-understood. With this approach, the business requirements to be met by the proposed data warehouse solution are identified first. These are the primary drivers for the implementation of the data warehouse.
A Bottom Up approach, on the other hand, is useful in making technology assessments and is a good technique for organizations that are not leading-edge technology implementers. This approach is used when the business objectives that are to be met by the data warehouse are unclear, or when the current or proposed business process will be affected by the data warehouse.
Select a Development Methodology
A Development Methodology describes the expected evolution and management of the engineering system. One of the most important principles of Systems Engineering is evaluating a system from a Life-Cycle perspective. Establishing a methodology will also provide a strategy for the project manager and the project team as they execute the data warehouse project throughout all phases of development.
Waterfall Model
The waterfall model is a linear sequence comprised of the following basic stages:
* Requirements Definition * System Design * Detailed Design * Integration and Testing * Operations and Maintenance
This model is used when the system requirements and objectives are known and clearly specified.
Spiral Model
The Spiral model is a sequence of waterfall models which corresponds to a risk oriented iterative enhancement, and it recognizes that requirements are not always available and clear when the system is first implemented.
Since designing and building a data warehouse is an iterative process, the spiral method is the best development methodology.
Develop Business Objectives
Develop a list of business objectives that the system must fulfill using the following questions as a checklist:
Who is the potential audience?
What are the immediate uses of planned platforms?
What are the planned capabilities in terms of features and functions?
What data sources can and/or must be integrated into the data warehouse?
When is the system needed?
What is the expected life-span of the data warehouse?
This step is often the most difficult in the planning phase because the potential users of the data warehouse cannot specifically describe the type of information they will want out of the warehouse. A good approach in this circumstance is to determine what kind of ad hoc analytical processing they do now and what data sources they use to generate such analytical reports. Using that information as a starting point, question the decision-makers about what type of additional information they would find useful. With these "wish-lists" from the users, the physical design of the warehouse becomes clearer as the certain elements of data are requested more than others.
Collect Metadata
The final part of the planning phase is the need to initially capture the various items of design relating to metadata. Metadata will actually serve as a blueprint for constructing the data warehouse. The Metadata is collected during the planning phase from the following sources:
Enterprise Models based on E-R (Entity Relationship) Diagrams
Repositories and data dictionaries of the data sources
Syndicated data - i.e. Dow Jones/third-party information sources
Collecting metadata early in the planning phase is important for building a data warehouse. Unlike typical databases which usually have one coherent homogeneous data source which is structured with input rules and integrity constraints, a data warehouse is combining many different data sources which each have their own associated set of "business rules" which govern the database. It is at this stage where you begin to establish the patterns which trace data from the Source, to the Warehouse, to the Applications. Capturing and documenting this metadata is the only way to logically track this pattern which becomes critical later in the life-cycle when data elements at the sources change over time. When such changes occur, these links from Source to Warehouse to Application will allow the warehouse to be more readily maintained and updated, thereby ensuring its viability as a reliable OLAP resource.
This phase identifies the requirements/functions that the data warehouse will deliver. In addition to the features and functions needed, the requirements will clearly describe the operating environment in which the data warehouse will be delivered.
The amount of requirements gathering depends on the implementation approach that you take. The goal is to get an understanding of the core use of initial data and to identify other users who may need to leverage access to the data.
The primary purpose of collecting end user requirements for a data warehouse is to understand how users conduct their business, what data they currently use, and what they would like to do in the future. You should be able to further break down this information into Business entities and their attributes, relationships between the entities, and hierarchies.
The requirements can be gathered through a series of interviews with the different users. Answers to the following questions from the different users will generate the requirements needed for further development of the data warehouse.
Executive's/Owner's Requirements:
Why are we building a data warehouse? What business problem will it address?
How much will it cost?
When will it be ready?
What is the impact on people? Skills? Organization?
What does it do to our current computer investment?
Do we have the skills to do it?
What are the risks?
Architect's Requirements:
The architect views the data warehouse in the abstract and lays out its various components.
What analyses do you wish you could get more frequently?
What functions and features will be offered?
What platforms are needed for implementation?
How will standards and open interfaces be used?
How much flexibility is there for adding enhancements?
Developer's Requirements:
The developer views the warehouse in more specific detailed terms.
Request a further break down of the Architect's requirements into specific applications, interfaces, computers, databases, communication protocols, and user-interface screens.
What are the Deployment Requirements?
Access and delivery methods?
Access tools?
Connectivity requirements?
Client Platform requirements?
End User Requirements:
How does the data warehouse's functionality fit the end user's daily workflow?
What are your Query Requirements?
What types of ad hoc analysis do you do?
What level of detail is included?
What are your reporting requirements?
What reports do you create?
How often do you perform this analysis?
Who gets the information?
How is it used?
Where do you get the information from?
At the end of each interview, ask each candidate to develop a wish list which would include things that you would want to be able to do if there were no financial or time or technical constraints in your way. Next we develop a cost requirement.
Cost Requirements:
Traditional cost assessment is not applicable in developing a data warehouse because it uses an iterative development process, therefore making it impossible to predict end-user's summaries and changes.
However, project managers can determine percentages and priorities of work involved in the development. A percentage of at least 10% should be allocated to metadata management throughout the life-cycle of the warehouse.
The systems design phase is important because it determines how the requirements will be met. The main focus of this phase is to convert the systems requirements into a set of system-level specifications through deriving logical and physical data models for the data warehouse (i.e., E-R Diagrams and MetaModels).
Dimensional modeling is the design concept used by many data warehouse designers to build their data warehouse. Dimensional model is the underlying data model used by many of the commercial OLAP products available today in the market. In this model, all data is contained in two types of tables called Fact Table and Dimension Table.
Fact Table
Fact table contains the measurements or metrics or facts of business processes. If your business process is Sales, then a measurement of this business process such as "monthly sales number" is captured in the fact table. In addition to the measurements, the only other things a fact table contains are foreign keys for the dimension tables.
Context of the measurements is represented in dimension tables. You can also think of the context of a measurement as the characteristics such as who, what, where, when, how of a measurement (subject). In your business process Sales, the characteristics of the 'monthly sales number' measurement can be a Location (Where), Time (When), Product Sold (What).
The Dimension Attributes are the various columns in a dimension table. In the Location dimension, the attributes can be Location Code, State, Country, Zip code. Generally the Dimension Attributes are used in report labels, and query constraints such as where Country='USA'. The dimension attributes also contain one or more hierarchical relationships.
Before designing your data warehouse, you need to decide what this data warehouse contains. Say if you want to build a data warehouse containing monthly sales numbers across multiple store locations, across time and across products then your dimensions are:
Location Time Product
Each dimension table contains data for one dimension. In the above example you get all your store location information and put that into one single table called Location. Your store location data may be spanned across multiple tables in your OLTP system (unlike OLAP), but you need to de-normalize all that data into one single table.
The specifications are then used to generate the data warehouse extractors and transformation, integration, summarization, and aggregation software. Next, a "Build versus Buy" trade-off analysis is performed. Decisions are complicated by technical, economic, and political considerations. A poor decision can ruin an otherwise successful analysis and design. On the other hand, an appropriate selection of vendor-supplied component or incorporation of existing investment can make it possible to build a quick and effective data warehouse.
Finally, processes are identified to connect the data sources, the data warehouse, and the end user access tools together.
The System Integration phase is started in conjunction with the design phase. This phase encompasses locating the source of the data in the operational systems doing analysis to understand what types of data transformations may be needed, and mapping the source data to the target data within the warehouse design. The challenge during this phase is to understand how to incorporate existing investments in platforms, technology, and know-how. Systems integration capabilities tie vendor systems together with existing data sources and existing and proposed access tools. Proper selection and evaluation of vendor supplied components and Metadata Management will prove invaluable in successfully integrating the disparate sources of data into the warehouse.
The main focus of this phase is developing procedures to extract and move data in a form that can then be loaded into the warehouse. Programs and transformation tools should be used to reduce the customized programming required to transform and integrate the data. Finally, the data must be analyzed to determine whether or not certain elements should be cleansed prior to putting it into the warehouse. Business rules should be established to standardize the data formats within the warehouse if possible.
Coming to the Best Practices, there are two things about data warehousing: One, data warehousing environments are not becoming simpler. They are complex undertakings and are likely to remain so. Two, data warehouse environments are becoming steadily more strategic to the success of the enterprise. They are not only being used for business intelligence but are also starting to take a key operational role as real-time integration technologies and real-time analytics are integrated into the mix to drive real-time decisions and actions. Consequently, it is more important than ever to implement proven best practices so as to avoid delays, excessive costs, and business disappointments as a project goes forward. Based on long experience, the following best practices are recommended to build data warehousing applications:
Ensure that the data warehouse is business-driven, not technology-driven
Define the long-term vision for the data warehouse in the form of an enterprise data warehousing architecture
Avoid "stovepipe" data marts that do not integrate at the metadata level with a central meta-data repository, generated and maintained by a data integration platform
Do not build "virtual" data warehouses that access data directly from source environments and have no target database
Buy, don't build data warehousing components
Create a hub-and-spoke architecture using a data integration platform to access data sources and populate the central data warehouse, data marts, operational data store, and analytic applications
And finally you can develop an application to access the information in the data warehouse using a selected platform of choice. You can develop a web application using .NET framework and SQL Server 2000 Analysis Services as demonstrated at http://www.aspfree.com/c/a/MS-SQL-Server/Accessing-OLAP-using-ASP-dot-NET/