Sunday, June 2, 2019
Literature review about data warehouse
CHAPTER 2
LITERATURE REVIEW

2.1 INTRODUCTION

Chapter 2 presents a literature review on data warehousing, OLAP, multidimensional databases (MDDB) and data mining. We review the concept, characteristics, design and implementation issues of each of the above-mentioned technologies to identify a suitable data warehouse framework. This framework will bind the integration of OLAP, MDDB and a data mining model.

Section 2.2 discusses the fundamentals of data warehousing, which include data warehouse models and data processing techniques such as the extract, transform and loading (ETL) processes. A comparative study was done on the data warehouse models introduced by William Inmon (Inmon, 1999), Ralph Kimball (Kimball, 1996) and Matthias Nicola (Nicola, 2000) to identify a suitable model, design and set of characteristics. Section 2.3 introduces the OLAP model and architecture. We also discuss the concept of processing in OLAP-based MDDB, MDDB schema design and implementation. Section 2.4 introduces data mining techniques, methods and processes for OLAP mining (OLAM), which is applied to mine MDDB. Section 2.5 concludes the literature review, in particular the pointers behind our decision to propose a new data warehouse model. Since we propose to use Microsoft products to implement the proposed model, we also discuss a product comparison to justify why Microsoft products were selected.

2.2 DATA WAREHOUSE

According to William Inmon, a data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data in support of the management's decision-making process (Inmon, 1999). A data warehouse is a database containing data that usually represents the business history of an organization.
This historical data is used for analysis that supports business decisions at many levels, from strategic planning to performance evaluation of a discrete organizational unit. It provides an effective integration of operational databases into an environment that enables strategic use of data (Zhou, Hull, King and Franchitti, 1995). These technologies include relational and MDDB management systems, client/server architecture, meta-data modelling and repositories, graphical user interfaces and much more (Hammer, Garcia-Molina, Labio, Widom, and Zhuge, 1995; Harinarayan, Rajaraman, and Ullman, 1996).

The emergence of cross-discipline domains such as knowledge management in finance, health and e-commerce has proved that vast amounts of data need to be analysed. The evolution of data in a data warehouse can provide multiple dataset dimensions to solve various problems. Thus, critical decision-making over such datasets needs a suitable data warehouse model (Barquin and Edelstein, 1996).

The main proponents of the data warehouse are William Inmon (Inmon, 1999) and Ralph Kimball (Kimball, 1996), but they have different perspectives on the data warehouse in terms of design and architecture. Inmon (Inmon, 1999) defined the data warehouse as a dependent data mart structure, while Kimball (Kimball, 1996) defined the data warehouse as a bus-based data mart structure. Table 2.1 discusses the differences in data warehouse structure between William Inmon and Ralph Kimball.

A data warehouse is a read-only data source where end-users are not allowed to change the values or data elements. Inmon's (Inmon, 1999) data warehouse architecture strategy is different from Kimball's (Kimball, 1996). Inmon's data warehouse model splits data marts off as copies, distributed as an interface between the data warehouse and end users, whereas Kimball views the data warehouse as a union of data marts.
In Kimball's view, the data warehouse is the collection of data marts combined into one underlying repository. Figure 2.1 illustrates the differences between Inmon's and Kimball's data warehouse architectures, adopted from (Mailvaganam, 2007).

Although Inmon and Kimball have different design views of the data warehouse, they do agree that successful implementation of a data warehouse depends on an effective collection of operational data and validation of data marts. The roles of database staging and ETL processes on the data are inevitable components in both researchers' data warehouse designs. Both believed that a dependent data warehouse architecture is necessary to fulfil the requirements of enterprise end users in terms of precision, timing and data relevancy.

2.2.1 DATA WAREHOUSE ARCHITECTURE

Data warehouse architecture has a wide research scope and can be viewed from many perspectives. (Thilini and Hugh, 2005) and (Eckerson, 2003) provide some meaningful ways to view and analyse data warehouse architecture. Eckerson states that a successful data warehouse system depends on the database staging process, which derives data from different integrated Online Transaction Processing (OLTP) systems. In this case, the ETL process plays a crucial role in making the database staging process workable. A survey on the factors that influence the selection of data warehouse architecture by (Thilini, 2005) identifies five data warehouse architectures that are in common use, as shown in Table 2.2.

Independent Data Marts

Independent data marts, also known as localized or small-scale data warehouses, are mainly used by departments or divisions of a company to provide individual operational databases. This type of data mart is simple, yet consists of different forms derived from multiple design structures of various inconsistent database designs. Thus, it complicates cross data mart analysis.
Since every organizational unit tends to build its own database, which operates as an independent data mart ((Thilini and Hugh, 2005), citing the work of (Winsberg, 1996) and (Hoss, 2002)), it is best used as an ad-hoc data warehouse and as a prototype before building a real data warehouse.

Data Mart Bus Architecture

(Kimball, 1996) pioneered the design and architecture of the data warehouse as a union of data marts, known as the bus architecture or virtual data warehouse. The bus architecture allows data marts to be located not only on one server but also across different servers. This allows the data warehouse to function more in a virtual mode, combining all data marts so that they are processed as one data warehouse.

Hub-and-Spoke Architecture

(Inmon, 1999) developed the hub-and-spoke architecture. The hub is the central server taking care of information exchange, and the spokes handle data transformation for all regional operational data stores. Hub-and-spoke mainly focuses on building a scalable and maintainable infrastructure for the data warehouse.

Centralized Data Warehouse Architecture

The centralized data warehouse architecture is built on the hub-and-spoke architecture but without the dependent data mart component. This architecture copies and stores heterogeneous operational and external data in a single, consistent data warehouse. This architecture has only one data model, which is consistent and complete across all data reference points. According to (Inmon, 1999) and (Kimball, 1996), a central data warehouse should include a database staging area, also known as an operational data store, as an intermediate stage where data integration is processed operationally before the data are transformed into the data warehouse.

Federated Architecture

According to (Hackney, 2000), a federated data warehouse is an integration of multiple heterogeneous data marts, database staging or operational data stores, and a combination of analytical applications and reporting systems.
The federated concept focuses on an integrated framework that makes the data warehouse more reliable. (Jindal, 2004) concludes that the federated data warehouse is a practical approach, as it focuses on higher dependability and provides excellent value.

(Thilini and Hugh, 2005) conclude that the hub-and-spoke and centralized data warehouse architectures are similar. The centralized architecture is faster and easier to implement because no dependent data marts are required, and it scored higher than hub-and-spoke where there is an urgent need for a relatively fast implementation approach.

In this work, it is very important to identify which data warehouse architecture is robust and scalable in terms of building and deploying enterprise-wide systems. (Laney, 2000) states that the selection of an appropriate data warehouse architecture must incorporate the successful characteristics of various data warehouse models. It is evident that two data warehouse architectures have proved to be popular, as shown by (Thilini and Hugh, 2005), (Eckerson, 2003) and (Mailvaganam, 2007): first, the hub-and-spoke architecture proposed by (Inmon, 1999), a data warehouse with dependent data marts, and second, the data mart bus architecture with dimensional data marts proposed by (Kimball, 1996). The new proposed model will use the hub-and-spoke data warehouse architecture, which can be used for MDDB modelling.

2.2.2 DATA WAREHOUSE EXTRACT, TRANSFORM, LOADING

The data warehouse architecture process begins with the ETL process, which ensures the data passes the quality threshold. According to Evin (2001), it is essential to have the right dataset. ETL is an important component of the data warehouse environment, ensuring that datasets in the data warehouse are cleansed of errors from the various OLTP systems. ETL processes are also responsible for running scheduled tasks that extract data from OLTP systems.
Typically, a data warehouse is populated with historical information from within a particular organization (Bunger, Colby, Cole, McKenna, Mulagund, and Wilhite, 2001). The complete process descriptions of ETL are discussed in Table 2.3.

A data warehouse database can be populated from a wide variety of data sources in different locations; thus collecting all the different datasets and storing them in one central location is an extremely challenging task (Calvanese, Giacomo, Lenzerini, Nardi, and Rosati, 2001). However, ETL processes eliminate the complexity of data population via a simplified process, as depicted in Figure 2.2. The ETL process begins with data extraction from operational databases, where data cleansing and scrubbing are done to ensure that all data are validated. The data are then transformed to meet the data warehouse standards before being loaded into the data warehouse.

(Zhou et al, 1995) state that during the data integration process in the data warehouse, ETL can assist in the import and export of operational data between heterogeneous data sources using an Object Linking and Embedding Database (OLE-DB) based architecture, where the data are transformed so that all validated data populate the data warehouse.

Kimball's (Kimball, 1996) data warehouse architecture, as depicted in Figure 2.3, focuses on three important modules: the back room, the presentation server and the front room. ETL processes are implemented in the back room, where the data staging services gather all the source systems' operational databases and extract data from source systems in different file formats on different systems and platforms. The second step is to run the transformation process to remove all inconsistencies and ensure data correctness. Finally, the data are loaded into data marts. ETL processes are commonly executed from a job control via a scheduled task. The presentation server is the data warehouse where data marts are stored and processed.
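The extract, cleanse, transform and load sequence described above can be sketched in a few lines of Python. This is a minimal illustration only; the source records, field names and cleansing rules are invented for the example and are not taken from any system discussed in this chapter.

```python
# Minimal ETL sketch: extract rows from a mock OLTP source, cleanse and
# transform them, then load them into an in-memory "warehouse" table.
# All field names and validation rules here are illustrative assumptions.

def extract(oltp_rows):
    # Extraction: in practice this would read from operational databases;
    # here the source is simply an in-memory list of records.
    return list(oltp_rows)

def cleanse(rows):
    # Cleansing/scrubbing: reject records that fail validation
    # (missing customer or non-numeric amount).
    return [r for r in rows
            if r.get("customer")
            and str(r.get("amount", "")).replace(".", "", 1).isdigit()]

def transform(rows):
    # Transformation: conform the data to warehouse standards
    # (consistent case, numeric types).
    return [{"customer": r["customer"].strip().upper(),
             "amount": float(r["amount"])} for r in rows]

def load(warehouse, rows):
    # Loading: append validated records to the warehouse table.
    warehouse.extend(rows)
    return warehouse

oltp = [{"customer": " alice ", "amount": "120.50"},
        {"customer": "", "amount": "99"},       # fails cleansing: no customer
        {"customer": "bob", "amount": "abc"}]   # fails cleansing: bad amount

warehouse = load([], transform(cleanse(extract(oltp))))
print(warehouse)  # [{'customer': 'ALICE', 'amount': 120.5}]
```

Only the record that passes validation reaches the warehouse, which mirrors the role of the staging area as a quality gate before loading.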
Data are stored in a star schema consisting of dimension and fact tables. The data are then processed in the front room, where they are accessed by query services such as reporting tools, desktop tools, OLAP and data mining tools.

Although ETL processes have proved to be an essential component in ensuring data integrity in the data warehouse, the issues of complexity and scalability play an important role in deciding the type of data warehouse architecture. One way to achieve a scalable, non-complex solution is to adopt a hub-and-spoke architecture for the ETL process. According to Evin (2001), ETL operates best in a hub-and-spoke architecture because of its flexibility and efficiency, and a centralized data warehouse design can support the maintenance of full access control over ETL processes.

ETL processing in a hub-and-spoke data warehouse architecture is recommended by (Inmon, 1999) and (Kimball, 1996). The hub is the data warehouse, after data from the operational databases have been processed into the staging database, and the spokes are the data marts that distribute the data. Sherman, R. (2005) states that the hub-and-spoke approach uses one-to-many interfaces from the data warehouse to many data marts. One-to-many interfaces are simpler to implement, cost-effective in the long run and ensure consistent dimensions, whereas the many-to-many approach is more complicated and costly.

2.2.3 DATA WAREHOUSE FAILURE AND SUCCESS FACTORS

Building a data warehouse is indeed a challenging task, as a data warehouse project inherits unique characteristics that may influence the overall reliability and robustness of the data warehouse. These factors can be applied during the analysis, design and implementation phases to ensure a successful data warehouse system. Section 2.2.3.1 focuses on the factors that influence data warehouse project failure.
Section 2.2.3.2 discusses the success factors, i.e. implementing the correct model to support a successful data warehouse project.

2.2.3.1 DATA WAREHOUSE FAILURE FACTORS

Studies by (Hayen, Rutashobya, and Vetter, 2007) show that implementing a data warehouse project is costly and risky, as a data warehouse project can cost over $1 million in the first year. It is estimated that two-thirds of data warehouse project efforts will eventually fail. (Hayen et al, 2007), citing the work of (Briggs, 2002) and (Vassiliadis, 2004), noted three groups of factors in the failure of data warehouse projects (environment, project and technical factors), as shown in Table 2.4.

Environmental factors lead to organizational changes in terms of business, politics, mergers, takeovers and lack of top management support. These include human error, corporate culture, the decision-making process and poor change management (Watson, 2004; Hayen et al, 2007). Poor technical knowledge of the requirements for data definitions and data quality across different organizational units may cause data warehouse failure. Incompetent and insufficient knowledge of data integration and poor selection of the data warehouse model and data warehouse analysis applications may cause huge failures.

In spite of heavy investment in hardware, software and people, poor project management may lead to data warehouse project failure. For example, assigning a project manager who lacks knowledge and project experience in data warehousing may cause difficulty in quantifying the return on investment (ROI) and in achieving the project's triple constraint (cost, scope, time).

Data ownership and accessibility is another potential factor that may cause data warehouse project failure. This is considered a sensitive issue within the organization: one must not share or acquire someone else's data, as this is considered losing authority over the data (Vassiliadis, 2004).
Thus, allowing any single department to declare total ownership of pure, clean and error-free data may cause potential problems over data ownership rights.

2.2.3.2 DATA WAREHOUSE SUCCESS FACTORS

(Hwang M.I., 2007) stresses that data warehouse implementation is an important area of research and industrial practice, but only a few studies have assessed the critical success factors for data warehouse implementations. He conducted a survey of six data warehouse studies (Watson & Haley, 1997; Chen et al., 2000; Wixom & Watson, 2001; Watson et al., 2001; Hwang & Cappel, 2002; Shin, 2003) on the success factors in a data warehouse project. He concluded his survey with a list of success factors that influence data warehouse implementation, as depicted in Figure 2.8, showing eight implementation factors that directly affect the six selected success variables.

The above-mentioned data warehouse success factors provide an important guideline for implementing successful data warehouse projects. (Hwang M.I., 2007) shows that an integrated selection of various factors, such as end user participation, top management support, and acquisition of quality source data with profound and well-defined business needs, plays a crucial role in data warehouse implementation. Besides that, other factors highlighted by Hayen R.L. (2007), citing the work of Briggs (2002), Vassiliadis (2004) and Watson (2004), such as project, environment and technical knowledge, also influence data warehouse implementation.

Summary

In the new proposed model of this work, the hub-and-spoke architecture is used as the central repository service, as many scholars, including Inmon, Kimball, Evin, Sherman and Nicola, adopt this data warehouse architecture. This approach allows the hub (data warehouse) and spokes (data marts) to be located centrally and distributed across local or wide area networks depending on business requirements.
In designing the new proposed model, the hub-and-spoke architecture clearly identifies six important components that a data warehouse should have: ETL, a staging database or operational data store, data marts, MDDB, OLAP, and data mining end-user applications such as data query, reporting, analysis and statistical tools. However, this process may differ from organization to organization. Depending on the ETL setup, some data warehouses may overwrite old data with new data, while others maintain a history and audit trail of all changes to the data.

2.3 ONLINE ANALYTICAL PROCESSING

The OLAP Council (1997) defines OLAP as a group of decision support systems that facilitate fast, consistent and interactive access to information that has been reformulated, transformed and summarized from relational datasets, mainly from the data warehouse, into MDDB, which allows optimal data retrieval and trend analysis.

According to Chaudhuri (1997), Burdick, D. et al. (2006) and Vassiliadis, P. (1999), OLAP is an important concept for strategic database analysis. OLAP has the ability to analyze large amounts of data to extract valuable information, whether in the business, education or medical sector. The technologies of the data warehouse, OLAP and analysis tools support that ability. OLAP enables the discovery of patterns and relationships contained in business activity by querying large volumes of data from multiple database source systems at one time (Nigel. P., 2008). Processing database information using OLAP requires an OLAP server to organize and transform the data and build the MDDB. The MDDB is then separated into cubes for client OLAP tools to perform data analysis, which aims to discover new relationship patterns between the cubes. Some popular OLAP server software programs include those of Oracle, IBM and Microsoft.

Madeira (2003) supports the fact that OLAP and the data warehouse are complementary technologies that blend together.
The data warehouse stores and manages data, while OLAP transforms data warehouse datasets into strategic information. OLAP functionality ranges from basic navigation and browsing (often known as slice and dice) to calculations and more serious analyses such as time series and complex modelling. As decision-makers exploit more advanced OLAP capabilities, they move from basic data access to the creation of information and to the discovery of new knowledge.

2.3.4 OLAP ARCHITECTURE

In comparison to the data warehouse, which is usually based on relational technology, OLAP uses a multidimensional view to aggregate data, providing rapid access to strategic information for analysis. There are three types of OLAP architecture, categorized by the method in which they store multidimensional data and perform analysis operations on that data (Nigel, P., 2008): multidimensional OLAP (MOLAP), relational OLAP (ROLAP) and hybrid OLAP (HOLAP).

In MOLAP, as depicted in Diagram 2.11, datasets are stored and summarized in a multidimensional cube. The MOLAP architecture can perform faster than ROLAP and HOLAP. MOLAP cubes are designed and built for rapid data retrieval, enabling efficient slicing and dicing operations. MOLAP can perform complex calculations, which are pre-generated after cube creation. MOLAP processing is restricted to the initial cube that was created and is not bound to any additional replication of the cube.

In ROLAP, as depicted in Diagram 2.12, data and aggregations are stored in relational database tables to provide OLAP slicing and dicing functionality. ROLAP is the slowest among the OLAP flavours. ROLAP relies on manipulating the data directly in the relational database to give the appearance of traditional OLAP slicing and dicing functionality: essentially, each slicing and dicing action is equivalent to adding a WHERE clause to the SQL statement. ROLAP can manage large amounts of data and has no limitations on data size.
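The point that each ROLAP slice or dice action reduces to a WHERE clause can be illustrated with a toy query generator. The table and column names below are hypothetical, and the generator is a drastic simplification of what a real ROLAP engine does.

```python
# Toy ROLAP query generator: each slice/dice selection on a dimension
# becomes one predicate in the WHERE clause of the generated SQL.
# Table and column names are illustrative assumptions.

def rolap_query(fact_table, measure, selections):
    """Build a SQL aggregate query from dimension selections.

    `selections` maps a dimension column to the member chosen by the
    user's slice/dice action, e.g. {"region": "Asia", "year": 2019}.
    """
    sql = f"SELECT SUM({measure}) FROM {fact_table}"
    if selections:
        predicates = [f"{col} = '{val}'" for col, val in selections.items()]
        sql += " WHERE " + " AND ".join(predicates)
    return sql

# No selection: aggregate over the whole dataset.
print(rolap_query("sales_fact", "amount", {}))
# Slicing on one dimension adds one predicate; dicing on two adds two.
print(rolap_query("sales_fact", "amount", {"region": "Asia", "year": 2019}))
```

This also makes the performance argument visible: every user gesture triggers fresh SQL against the relational database, which is why ROLAP response times degrade as the underlying dataset grows.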
ROLAP can leverage the intrinsic functionality of the relational database. ROLAP is slow in performance because each ROLAP activity is essentially a SQL query, or multiple SQL queries, against the relational database. The query time and the number of SQL statements executed are determined by the complexity of those statements, and can become a bottleneck if the underlying dataset is large. ROLAP essentially depends on generated SQL statements to query the relational database, and SQL does not cater for all needs, which makes ROLAP technology conventionally limited by what SQL functionality can offer.

HOLAP, as depicted in Diagram 2.13, combines the technologies of MOLAP and ROLAP. Data are stored in ROLAP relational database tables and the aggregations are stored in a MOLAP cube. HOLAP can drill down from the multidimensional cube into the underlying relational database data. To acquire summary-type information, HOLAP leverages cube technology for faster performance, whereas to retrieve detail-type information, HOLAP can drill down from the cube into the underlying relational data.

In all the OLAP architectures (MOLAP, ROLAP and HOLAP), the datasets are stored in a multidimensional format, involving the creation of multidimensional blocks called data cubes (Harinarayan, 1996). The cube in an OLAP architecture may have three axes (dimensions) or more. Each axis (dimension) represents a logical category of data. One axis may, for example, represent the geographic location of the data, while others may indicate a state of time or a specific school. Each of the categories, which will be described in the following section, can be broken down into successive levels, and it is possible to drill up or down between the levels.

Cabibo (1997) states that OLAP partitions are normally stored on an OLAP server, with the relational database frequently stored on a separate server from the OLAP server. The OLAP server must then query across the network whenever it needs to access the relational tables to resolve a query.
The impact of querying across the network depends on the performance characteristics of the network itself. Even when the relational database is placed on the same server as the OLAP server, inter-process calls and the associated context switching are required to retrieve relational data. With an OLAP partition, calls to the relational database, whether local or over the network, do not occur during querying.

2.3.3 OLAP FUNCTIONALITY

OLAP functionality offers dynamic multidimensional analysis, supporting end users with analytical activities that include calculations and modelling applied across dimensions, trend analysis over time periods, slicing subsets for on-screen viewing and drilling to deeper levels of records (OLAP Council, 1997). OLAP is implemented in a multi-user client/server environment and provides reliably fast responses to queries, regardless of database size and complexity. OLAP helps the end user integrate enterprise information through comparative, customized viewing and analysis of historical and present data in various what-if data model scenarios. This is achieved through the use of an OLAP server, as depicted in Diagram 2.9.

OLAP functionality is provided by an OLAP server whose design and data structures are optimized for fast information retrieval in any orientation, as well as for flexible calculation and transformation of raw data. The OLAP server may physically stage the processed multidimensional information to deliver consistent and rapid response times to end users, it may populate its data structures in real time from relational databases, or it may offer a choice of both.

Essentially, OLAP creates information in cube form, which allows more complex analysis than a relational database. OLAP analysis techniques employ slice and dice and drilling methods to segregate the data into manageable sets of information depending on the given parameters. A slice identifies a single value for one or more variables, producing a subset of the multidimensional array.
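The slice operation just described can be made concrete on a small in-memory cube. The three dimensions (region, product, year) and the figures below are invented purely for illustration.

```python
# Slice sketch: fixing one dimension member of a small 3-D cube
# (region x product x year) yields a 2-D subset. Dimension names
# and values are illustrative assumptions.

# The "cube" maps (region, product, year) coordinates to a measure.
cube = {
    ("Asia", "Laptop", 2018): 40, ("Asia", "Laptop", 2019): 55,
    ("Asia", "Phone", 2018): 70, ("Asia", "Phone", 2019): 90,
    ("Europe", "Laptop", 2018): 30, ("Europe", "Laptop", 2019): 35,
    ("Europe", "Phone", 2018): 60, ("Europe", "Phone", 2019): 65,
}

def slice_cube(cube, axis, member):
    # Keep only the cells whose coordinate on `axis` equals `member`,
    # then drop that coordinate: the result has one fewer dimension.
    return {tuple(c for i, c in enumerate(coord) if i != axis): value
            for coord, value in cube.items() if coord[axis] == member}

# Slice on year = 2019: a 2-D region x product view remains.
print(slice_cube(cube, 2, 2019))
```

Applying the same operation on further dimensions narrows the view again, which is exactly the dice operation discussed next.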
The dice operation, in turn, applies the slice operation on more than two dimensions of the multidimensional cube. The drilling function allows the end user to traverse from summarized data to more detailed data, as depicted in Diagram 2.10.

2.3.5 MULTIDIMENSIONAL DATABASE SCHEMA

The basis of every data warehouse system is a relational database built using a dimensional model. A dimensional model consists of fact and dimension tables, described as a star schema or a snowflake schema (Kimball, 1999). A schema is a collection of database objects: tables, views and indexes (Inmon, 1996). To understand dimensional data modelling, Table 2.10 defines some of the terms commonly used in this type of modelling.

In designing data models for the data warehouse, the most commonly used schema types are the star schema and the snowflake schema. In the star schema design, the fact table sits in the middle and is connected to the surrounding dimension tables like a star. A star schema can be simple or complex: a simple star consists of one fact table, while a complex star can have more than one fact table.

Most data warehouses use a star schema to represent the multidimensional data model. The database consists of a single fact table and a single table for each dimension. Each tuple in the fact table consists of a pointer, or foreign key, to each of the dimensions that provide its multidimensional coordinates, and stores the numeric measures for those coordinates. A tuple consists of a unit of data extracted from a cube in a range of members from one or more dimension tables (http://msdn.microsoft.com/en-us/library/aa216769%28SQL.80%29.aspx). Each dimension table consists of columns that correspond to attributes of the dimension.
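A star schema of this kind can be sketched with a fact table and two dimension tables. The following sqlite3 snippet uses invented table names, column names and figures purely for illustration.

```python
# Star schema sketch: one fact table whose foreign keys point at the
# dimension tables supplying its multidimensional coordinates.
# Schema and data are illustrative assumptions.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dim_time (time_id INTEGER PRIMARY KEY, year INTEGER);
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
    -- Fact table: foreign keys (coordinates) plus a numeric measure.
    CREATE TABLE fact_sales (
        time_id INTEGER REFERENCES dim_time(time_id),
        product_id INTEGER REFERENCES dim_product(product_id),
        amount REAL);
    INSERT INTO dim_time VALUES (1, 2018), (2, 2019);
    INSERT INTO dim_product VALUES (1, 'Laptop'), (2, 'Phone');
    INSERT INTO fact_sales VALUES (1, 1, 40), (2, 1, 55), (1, 2, 70), (2, 2, 90);
""")

# A typical star join: aggregate the measure along one dimension.
rows = con.execute("""
    SELECT t.year, SUM(f.amount)
    FROM fact_sales f JOIN dim_time t ON f.time_id = t.time_id
    GROUP BY t.year ORDER BY t.year
""").fetchall()
print(rows)  # [(2018, 110.0), (2019, 145.0)]
```

The join from the central fact table out to each dimension table is the characteristic query shape of the star schema.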
Diagram 2.14 shows an example of a star schema for a medical informatics system.

Star schemas do not explicitly provide support for attribute hierarchies, which makes them less suitable for architectures such as MOLAP that require many hierarchies of dimension tables for efficient drilling of datasets. Snowflake schemas provide a refinement of star schemas in which the dimensional hierarchy is explicitly represented by normalizing the dimension tables, as shown in Diagram 2.15. The main advantage of the snowflake schema is the improvement in query performance due to minimized disk storage requirements and joins over smaller lookup tables. The main disadvantage of the snowflake schema is the additional maintenance effort needed due to the increased number of lookup tables.

Levene, M. (2003) stresses that in addition to the fact and dimension tables, data warehouses store selected summary tables containing pre-aggregated data. In the simplest cases, the pre-aggregated data corresponds to aggregating the fact table on one or more selected dimensions. Such pre-aggregated summary data can be represented in the database in at least two ways. Whether to use a star or a snowflake schema mainly depends on business needs.

2.3.2 OLAP EVALUATION

As OLAP technology takes a prominent place in the data warehouse industry, there should be a suitable assessment tool to evaluate it. E.F. Codd not only invented OLAP but also provided a set of procedures, known as the Twelve Rules, for OLAP product capability assessment; these include data handling, unlimited dimensions and aggregation levels, and flexible reporting, as shown in Table 2.8 (Codd, 1993).

Codd's twelve rules of OLAP provide an essential tool to verify that the OLAP functions and OLAP models used are able to produce the desired results. Berson, A.
(2001) emphasizes that a good OLAP system should also support complete database management tools, as an integrated, centralized utility for managing the distribution of databases within the enterprise. OLAP's ability to perform drilling within the MDDB allows drilling down right to the source, or root, of the detail record level. This implies that the OLAP tool permits a smooth transition from the MDDB to the detail record level of the source relational database. OLAP systems must also support incremental database refreshes. This is an important feature, preventing stability and usability problems in operation as the size of the database increases.

2.3.1 OLTP AND OLAP

The design of OLAP for the multidimensional cube is entirely different from that of OLTP for the database. OLTP is implemented in relational databases to support daily processing in an organization. The main function of an OLTP system is to capture data into computers. OLTP allows effective manipulation and storage of data for daily operations, resulting in huge quantities of transactional data. Organisations build multiple OLTP systems to handle the huge quantities of daily operational transactional data generated in short periods of time.

OLAP is designed for data access and analysis to support the strategic decision-making process of managerial users. OLAP technology focuses on aggregating datasets into a multidimensional view without hindering system performance. Han, J. (2001) characterizes OLTP systems as customer-oriented and OLAP as market-oriented, and summarizes the major differences between OLTP and OLAP systems based on 17 key criteria, as shown in Table 2.7.

It is complicated to merge OLAP and OLTP into one centralized database system. The dimensional data design model used in OLAP is much more effective for querying than the relational model used in an OLTP system.
In addition, OLAP may use one central database as its data source, while OLTP uses different data sources from different database sites. The dimensional design of OLAP is not suitable for an OLTP system, mainly due to redundancy and the loss of referential integrity of the data. Organizations therefore choose to have two separate information systems, one OLTP and one OLAP system (Poe, V., 1997). We can conclude that the purpose of OLTP systems is to get data into computers, whereas the purpose of OLAP is to get data or information out of computers.

2.4 DATA MINING

Many data mining scholars (Fayyad, 1998; Freitas, 2002; Han, J. et. al., 1996; Frawley, 1992) have defined data mining as discovering hidden patterns from historical datasets using pattern recognition, as it involves searching for specific, unknown information in a database. Chung, H. (1999) and Fayyad et al (1996) referred to data mining as a step in knowledge discovery in databases: the process of analyzing data, extracting knowledge from a large database, also known as the data warehouse (Han, J., 2000), and turning it into useful information.

Freitas (2002) and Fayyad (1996) have recognized data mining as an advantageous tool for extracting knowledge from a da
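The idea of discovering hidden patterns in historical data can be illustrated with a very small co-occurrence count over transaction records. The transactions below are invented, and real data mining algorithms, such as association rule mining, are considerably more sophisticated than this sketch.

```python
# Pattern-discovery sketch: count how often pairs of items co-occur in
# historical transactions, a simplified seed of association rule mining.
# The transactions are invented for illustration.
from itertools import combinations
from collections import Counter

transactions = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"milk", "eggs"},
    {"bread", "milk"},
]

pair_counts = Counter()
for basket in transactions:
    # Every unordered pair in the basket counts as one co-occurrence.
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# The most frequent pair is a candidate "hidden pattern".
print(pair_counts.most_common(1))  # [(('bread', 'milk'), 3)]
```

Scaling this counting idea to millions of warehouse records, and to larger itemsets, is what dedicated data mining algorithms are designed to do efficiently.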