Successful data mining requires following a defined data mining process. Many such processes exist; here are two:
- From Fayyad, Piatetsky-Shapiro, and Smyth (1996)
- Data -> Selection -> Target Data -> Preprocessing -> Preprocessed Data -> Transformation -> Transformed Data -> Data Mining -> Patterns -> Interpretation/Evaluation -> Knowledge
- From Padhy, Mishra, and Panigrahi (2012)
- Business understanding -> Data understanding -> Data Preparation -> Modeling -> Evaluation -> Deployment
Success depends as much on the events leading up to the main event as on the main event itself; thus, what is done to the data beforehand determines whether data mining can proceed successfully. Fayyad et al. (1996) point out that data mining is just one subset of the knowledge discovery process: the part that supplies the algorithms and mathematics that help reach the final goal. Comparing the two processes, we can see that they are slightly different, yet fundamentally the same. Another key point is that we can move back and forth (i.e., iterate) between the steps in these processes. Both processes assume the data is being pulled from a knowledge base or data warehouse, where it should be cleaned (uniformly represented, with missing data, noise, and errors handled) and accessible (with access paths to the data provided).
If we remove the pre-processing stage or data preparation phase, we will never be able to reduce the high dimensionality in the data sets (Fayyad et al., 1996). High dimensionality increases the size of the data and thus the processing time required, which is a disadvantage when the model derived from data mining must run on a real-time data feed. With all this data, there is also a greater chance that the derived model will pick up spurious patterns, which will not generalize well and may not even be understandable for descriptive purposes (Fayyad et al., 1996). Descriptive data mining aims at understanding the data, whereas predictive data mining aims at predicting the next result from a set of input variables (Fayyad et al., 1996; Padhy et al., 2012). Thus, to avoid the high-dimensionality problem, we must understand the problem, understand why we have the data we have and what data is actually needed, and reduce the dimensions to the bare essentials.
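As a minimal sketch of what "reducing the dimensions to the bare essentials" can look like in pre-processing, the snippet below keeps only the k features with the highest variance and drops near-constant columns before the data ever reaches a mining algorithm. The data set and the choice of variance as the filter are hypothetical, illustrative assumptions, not a method prescribed by the cited papers.

```python
import statistics

def top_variance_features(rows, k):
    """Return column indices of the k highest-variance features."""
    n_cols = len(rows[0])
    variances = [statistics.pvariance([row[j] for row in rows])
                 for j in range(n_cols)]
    # Rank columns by variance, highest first, and keep the top k.
    return sorted(range(n_cols), key=lambda j: variances[j], reverse=True)[:k]

data = [
    [1.0, 5.0, 0.1],
    [2.0, 5.0, 0.1],
    [3.0, 5.0, 0.2],
    [4.0, 5.0, 0.1],
]
keep = top_variance_features(data, k=1)
print(keep)  # column 0 varies the most; columns 1 and 2 are nearly constant
```

More principled reductions (e.g., principal component analysis) follow the same idea: discard dimensions that carry little information so the mining step runs faster and finds fewer spurious patterns.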
Another challenge arises if we skip the selection, data understanding, or data mining algorithm selection steps: overfitting. Fayyad et al. (1996) define selection as choosing both the key data to feed into the model and the right data mining algorithm, each of which influences the results. Understanding the problem allows you to select the right data dimensions, as mentioned above, as well as the right algorithm (Padhy et al., 2012). Overfitting occurs when a data mining algorithm describes not only the general patterns in the data but also its noise (Fayyad et al., 1996). Through the selection process, you can pick data with reduced noise and so lower the risk of overfitting. Fayyad et al. (1996) also suggest remedies such as cross-validation, regularization, and other statistical analyses. Finally, understanding what you are looking for before mining the data aids the evaluation/interpretation step in catching overfitting (Padhy et al., 2012).
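A minimal sketch of the cross-validation remedy mentioned above: the data is split into k folds, the "model" (here a deliberately trivial mean-predictor, a hypothetical stand-in for a real mining algorithm) is fit on the other folds, and the error is measured only on the held-out fold, so a model that memorized noise would score poorly.

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k roughly equal folds."""
    folds = [[] for _ in range(k)]
    for i in range(n):
        folds[i % k].append(i)
    return folds

def cross_validate(values, k=3):
    """Average squared error of a mean-predictor across k held-out folds."""
    folds = k_fold_indices(len(values), k)
    errors = []
    for held_out in folds:
        # "Fit" on everything except the held-out fold.
        train = [values[i] for i in range(len(values)) if i not in held_out]
        prediction = sum(train) / len(train)
        # Evaluate only on data the model never saw.
        errors += [(values[i] - prediction) ** 2 for i in held_out]
    return sum(errors) / len(errors)

score = cross_validate([2.0, 4.0, 6.0, 8.0, 10.0, 12.0], k=3)
```

Because every error is computed on unseen data, a lower cross-validated score indicates patterns that generalize rather than patterns fit to noise.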
When big data varies with time while the same mined model keeps being applied, the model will at some point become either outdated (no longer relevant) or invalid. This is the case in social media: if we try to read posts without focusing on one type of post, it is hard to claim that any single pattern model derived from data mining remains valid. Previously defined patterns thus lose their validity as the data changes rapidly over time (Fayyad et al., 1996). We would have to solve this by incrementally modifying, deleting, or augmenting the defined patterns in the data mining process, but because real-time data can change at the drop of a hat, this can be quite hard to do (Fayyad et al., 1996).
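One simple way to picture incrementally modifying a pattern so it tracks drifting data: blend each new observation into the current estimate instead of keeping a pattern fixed. The running-average "pattern," the data stream, and the decay factor alpha below are all hypothetical choices for illustration, not a technique from the cited papers.

```python
def update(estimate, new_value, alpha=0.2):
    """Blend a new observation into the current estimate.

    Larger alpha adapts faster to drift but is noisier.
    """
    return (1 - alpha) * estimate + alpha * new_value

estimate = 10.0
stream = [10.2, 9.8, 10.1, 25.0, 24.6, 25.3]  # the data shifts mid-stream
for x in stream:
    estimate = update(estimate, x)
# The estimate drifts toward the new level instead of staying stale at 10.
```

The hard part noted above remains: choosing how quickly to adapt (alpha here) when the data can change at any moment.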
Missing data and noisy data are very prevalent in meteorology; we cannot sample the entire atmosphere at every point and at every time. In the United States, weather balloons are launched only two to four times a day from roughly two sites per state, and those observations are then fed into a model for predictive purposes, leaving many gaps in the data. What happens if a weather balloon is a dud and returns no data? We have missing data, a problem with the data itself. How are we supposed to rely on a solution derived through data mining if the data is either missing or noisy? Fayyad et al. (1996) observe that such data sets are "not designed with discovery in mind," so we must apply statistical strategies to estimate what the missing values should be. One strategy meteorologists use is data interpolation, of which there are many types, ranging from simple nearest-neighbor schemes to complex Gaussian ones.
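A minimal sketch of one such statistical strategy: filling gaps in a one-dimensional observation series by linear interpolation between the nearest valid neighbors. The temperature readings are hypothetical, and `None` stands in for a missing observation (e.g., a dud balloon); real meteorological interpolation is far more sophisticated.

```python
def interpolate_gaps(series):
    """Fill interior None values by linear interpolation between neighbors."""
    filled = list(series)
    for i, v in enumerate(filled):
        if v is None:
            # Nearest valid observation on each side of the gap.
            left = max(j for j in range(i) if filled[j] is not None)
            right = min(j for j in range(i + 1, len(series))
                        if series[j] is not None)
            frac = (i - left) / (right - left)
            filled[i] = filled[left] + frac * (series[right] - filled[left])
    return filled

readings = [15.0, None, None, 21.0, 23.0]
print(interpolate_gaps(readings))  # roughly [15.0, 17.0, 19.0, 21.0, 23.0]
```

A nearest-neighbor variant would instead copy the closest valid reading; Gaussian-process methods go further and weight all nearby observations while also estimating the uncertainty of each filled value.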
- Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). From data mining to knowledge discovery in databases. AI Magazine, 17(3), 37. Retrieved from http://www.aaai.org/ojs/index.php/aimagazine/article/download/1230/1131/
- Padhy, N., Mishra, D., & Panigrahi, R. (2012). The survey of data mining applications and feature scope. arXiv preprint arXiv:1211.5723. Retrieved from https://arxiv.org/ftp/arxiv/papers/1211/1211.5723.pdf