Adv DB: Data Warehouse & Data Mining

Data warehouses allow for people with decision power to locate the adequate data quickly from one location that spans across multiple functional departments and is very well integrated to produce reports and in-depth analysis to make effective decisions (MUSE, 2015a). The data warehouse doesn’t solve the: Who, What, Where, When, Why and How, but that is where data mining can help.  Data warehouse, when combined with data mining tools, can create a decision support system (DSS), which can be used to uncover/discover hidden relationships within the data (MUSE, 2015b). DSS needs both a place to store data and a way to sort meaningful data in order to make sense of the data and provide meaningful insights to the decision-maker.  Data that can be used for meaningful insights must be prepared/transformed (and checked for quality) while in the data warehouse, but must be completed before the data is used in a data mining tool.  Also, results from the data mining tool can be placed back into the data warehouse to allow its results to be seen by all end-users and to be reused by others.

Data Warehouse & Data Mining

A data warehouse is a centralized collection of data that is consistent, subject-oriented, integrated, special variant and/or temporally variant, nonvolatile data to enable decisions makers to make desirable business decisions based on their gathered insights and predictions from the data about the near future (Tryfona et al, 1999). Ballou & Tayi (1999) stated that a key feature of a data warehouse is its usage for decision making not for operational purposes.  Nevertheless, data warehouses don’t solve the questions: Who, What, When, Where, Why and How, it’s just a data depository (MUSE, 2015b). Hence, it validates what Tryfona et al (1999) stated, there is little distinction/differentiator on how data is modeled in a data warehouse as with a database. Databases though can and are used in operational situations, thus invalidating Tryfona et al (1999) argument, because as Ballou & Tayi (1999) pointed out operational data usually focuses heavily on current data whereas decision-makers look at historical data across time intervals to make temporal comparisons.

Databases and/or data warehouses cannot make a decision all on its own, but they are the platform to which data is stored centrally so that the right decision analysis techniques can be conducted on the data in order to provide meaning from them. The right decision analysis technique comes from data mining, which helps find meaningful once-hidden patterns from the data (in this case stored in the data warehouse).  Data mining can look into the past and current data to make predictions into the future (Silltow, 2006).   Though this is nothing new, statisticians have been using these techniques in a manual fashion for years to help discover knowledge from data. Thus, discovering knowledge through these centrally stored data, which can possibly come from multiple sources in a business or other data creation system that could be tied/linked together is what a warehouse does best (Connolly & Begg, 2015). What data warehouses also enable is using the same data in new ways to discover new insights about a subject than what the original purpose was (reuse) for collecting that data (Ballou & Tayi, 1999).  Data warehouses can support several low-level organizational decisions as well as high-level organizational (enterprise-wide) decisions.  Suitable applications to feed data into a data warehouse to aid in decision making can come from: mainframes, proprietary file systems, servers, internal workstations, external website data, etc.  Storing some data offline or online helps mainly to improve querying speeds. Summarized data, which is updated automatically as new data enters the warehouse, can help improve query speeds, while detailed data can be stored online if it can help support/supplement summarized data (Connolly & Begg, 2015).

Failure in the implementation of a data warehouse can be generated from poor data quality. Data quality should be built into the data warehouse: planning, implementation, and maintenance phases.  Ballou & Tayi (1999) warned that even though this feature of data stored in a data warehouse is a key driver for companies to adopt a warehouse is that data quality must be preserved.  Data quality encompasses the following attributes: accuracy, completeness, consistency, timeliness, interpretability, believability, value-added, and accessibility.  Most people generating data are familiar with its error rates, margins of error, its deficiencies, and idiosyncrasies, but when rolled up in a data warehouse (and it is not communicated properly), people outside of that data-generating organization will not know this and their final decisions could be prone to errors.  One must consider the different needs for data quality within a data warehouse, as the levels of quality needed for relevant decision making, project design, future needs, etc.  We must ask from our data providers what is unsatisfactory and to what quantifiable level is the current data that they are providing into the data warehouse (Ballou & Tayi, 1999).  As the old adage goes “Garbage In – Garbage Out”.

So, what can cause data quality issues?  Let’s take a mortgage company, REMAX, which has a data warehouse, however, the data for sales isn’t consistent, because there are different definitions of what a sale/price could be based on differing stakeholders.  The mortgage company can say that a sale is the closing price of the house, whereas REMAX may say the negotiated list price of house, the broker may say the final settlement price of the house after the home inspection, the insurance company is the price of the building materials in the house plus 65-70 thousand dollars for internal possessions.  This may be all the data that REMAX wants to have to provide the best service to their customer and to provide a realistic view of what goes on in purchasing a house, monetarily, but REMAX must know this information ahead of time as they input this data into their data warehouse.  This could be valuable information for the home buyer when they are deciding which one of two to three properties that they would like own.  There could be syntactic inconsistencies between all these sources of data like $60K, $60,000, $60,000.00, 60K, $60000, etc.

Another way the implementation of a data warehouse could fail, according to Ballou & Tayi (1999), can come from not including appropriate data (in other words: data availability).   Even though critical data can exist among: soft data (uncertain data), text-based data, external sources of data, this set of data could altogether be ignored.  They continue to add that this type of data, so long as it can support the organization in any “meaningful way” should be added into the centralized data warehouse.  Though one must weigh the high cost of acquiring the data that may be useless because it is relatively easy (inexpensive) to delete data that is rarely used once in the system.  But, then there is an opportunity cost to adding irrelevant data, we could have used our resources to improve the timeliness of the current data (or provide real-time data) or eliminating null values in a different data set that is already in the system.

To solve the issue of data quality, decision-makers and data warehouse managers must think systematically about what data is required, why it is required, and how should it be collected and used (Ballou & Tayi, 1999).  This could be done when a data warehouse manager asks the end-users what decisions this data warehouse will support.  From that information one can decipher what is required from these stakeholders through the MoSCoW: What is a “Must have”?; What is a “Should have”?; What is a “Could have?”; and What is a “Wish to have”? In the REMAX case, they should have the final asking price before the inspection listed (as they do) as a “Must have”, typical closing costs for a house in that price range that is provided by the mortgage company as a “Should have”, the average house insurance costs as a “Could Have”, etc. Balou & Tayi (1999) said that other factors can affect data quality enhancement projects, like the: Current quality, required quality, anticipated quality, priority of organizational activity (as aforementioned with MoSCoW), Cost of data quality enhancements (and their aforementioned tradeoffs/opportunity costs), and their value-added to the data warehouse.  Data quality is needed in order to use data mining tools, and many papers using data mining or text mining always talk about a preprocessing step that must occur before full analysis can begin: Nassirtoussi et al (2015),  Kim et al (2014), Barak & Modarres (2015), etc.

According to Silltow (2006), data mining tools can be group into three types: Traditional (have complex algorithms and techniques to find hidden patterns in the data and highlight trends), dashboard (data changes are shown on a screen which is mostly used to monitor information), and text-mining (using complex algorithms and techniques to find hidden patterns in text data, even to a point of figuring out the sentiment in a string of words and can include video and audio data).  These data mining techniques range from artificial neural networks (prediction models that use training data to learn and then make forecasts) like in Nassirtoussi et al (2015) and Kim et al (2014); decision trees (uses a bunch of defined if-then statements, also known as rules, and are easier to understand the results of the data) like in Barak & Modarres (2015); nearest neighbor (uses similar past data to make predictions into the future), etc.

Finally, another aspect of data quality is the output of the data from data mining tools, especially since we can then plug the output back into the data warehouse for future reuse.  Data mining tools are just that, automatic algorithms used to discover knowledge.  These tools lack the intuitive nature presented in humans to decipher between a relevant and irrelevant correlation.  For instance, data stored in a hospital data warehouse may link data collected in the summer of insane amount of increased ice cream consumption which could lead to obesity and the number of pool/beach drownings and say that ice cream consumption leads to them, rather than looking at the fact that they both occur in the summer but are not necessarily causing one or the other.  This is why Silltow (2006) suggest that all results provided by these tools be quality checked after utilized to not give out false, irrelevant insights that are preposterous when analyzed by a human.

Conclusion

Data warehouses allow for people with decision power to locate adequate data quickly to make effective decisions. The data that is planned, entered, maintained should be of acceptable quality.  Poor quality in the data may drive poor quality decisions.  The best way to improve data quality is by looking at the eight factors of data quality as aforementioned when asking stakeholders what data from a systemic point of view would be useful in a data warehouse.  Sometimes asking what data should be included is very hard for decision-makers to make at that moment, though they could have a general idea of what decisions they need to make soon.  Data collection and quality must be weighed against all of their cost and their significance.

References

  • Ballou, D. P., & Tayi, G. K. (1999). Enhancing data quality in data warehouse environments. Communications of the ACM, 42(1), 73-78.
  • Barak, S., & Modarres, M. (2015). Developing an approach to evaluate stocks by forecasting effective features with data mining methods. Expert Systems with Applications, 42(3), 1325–1339. http://doi.org/10.1016/j.eswa.2014.09.026
  • Connolly, T. & Begg, C. (2015).  Database Systems:  A Practical Approach to Design, Implementation, and Management, Sixth Edition.  Boston:  Pearson.
  • Kim, Y., Jeong, S. R., & Ghani, I. (2014). Text opinion mining to analyze news for stock market prediction. Int. J. Advance. Soft Comput. Appl, 6(1).
  • My Unique Student Experience (2015a). Data Warehousing Concepts and Design. Retrieved from: https://class.ctuonline.edu/_layouts/MUSEViewer/ Asset.aspx?MID=1819502&aid=1819506
  • My Unique Student Experience (2015b). Online Analytical Processing. Retrieved from: https://class.ctuonline.edu/_layouts/MUSEViewer/Asset.aspx?MID=1819502&aid=1819509
  • Nassirtoussi, A. K., Aghabozorgi, S., Wah, T. Y., & Ngo, D. C. L. (2015). Text mining of news-headlines for FOREX market prediction: A Multi-layer Dimension Reduction Algorithm with semantics and sentiment. Expert Systems with Applications, 42(1), 306-324.
  • Silltow, J. (2006) Data mining 101: Tools and techniques.  Retrieved from: https://iaonline.theiia.org/data-mining-101-tools-and-techniques
  • Tryfona, N., Busborg, F., & Borch Christiansen, J. G. (1999, November). starER: a conceptual model for data warehouse design. In Proceedings of the 2nd ACM international workshop on Data warehousing and OLAP (pp. 3-8). ACM.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: