Like prediction and classification, understanding the way the data is organized can help us with analysis of data. One way to tease out the structure of the data is by examining clustering. Based on the patterns shown in the data, we can group individual observational units into distinct clusters. Clusters are defined so that observations within each cluster will have similar characteristics. We can do further analysis of each group as well as comparing between groups. For example, marketers may want to know the customer segments to develop targeted marketing strategies. A cluster analysis will group customers so that people in the same customer segments tend to have similar needs but are different from those in other customer segments. Some popular clustering methods are multi-dimensional scaling and latent class analysis.
The first step is constructing a classification system. The categories can be created based on either theories or observed statistical patterns such as those detected using clustering techniques. The next step is to identify the category or group to which a new observation belongs. For example, a new email can be put in the spam bin or non-spam bin based on the contents of the email. In statistics and machine learning, logistic regression, linear classifier, support vector machine and linear discriminant analysis are popular techniques used for classification problem.
Predictive models can be built with the available data to tell you what is likely to happen. Predictive models assume either that a knowledge of past statistical patterning can be used to predict the future or the validity of some type of theoretical model. For example, Netflix recommends movies to users based on the movies and shows which users have watched in the past.
Can we predict who will be the next president, Clinton or Trump? Yes, we can. Based on the polling data or candidates’ speeches, you can build a predictive model for the 2016 presidential election. Nate Silver is well-known for the accuracy of his predictions of both political and sporting events. Here is his prediction model on the 2016 presidential election:
Predictive modeling utilizes regression analysis, including linear regression, multiple regression and generalized linear models, as well as some machine learning algorithms, such as random forest tree and factor analysis. Time series analysis can be used to forecast weather and the sales of a product of next season.
Anomaly detection identifies unexpected or abnormal events. In the other words, we seek to find deviations from expected patterns. Detecting credit card fraud provides an example. Credit card companies can analyze customers’ purchase behavior and history, so they can alert customers of possible fraud. Here are examples of popular anomaly detection techniques: k-nearest neighbor, neural network, support vector machine and cluster analysis.
One of the most common motivations for analyzing data is to drive better decision making. When a company needs to promote a new product, it can employ data analysis to set the price to maximize profit and avoid price wars with other competitors. Data analysis is so central to decision making that almost all analytic techniques – including not only the ones mentioned above but also geographical information systems, social network analysis, and qualitative analysis – can be applied.