Starting the data conversation

The world is in love with data.  The idea of collecting information about the way we live and then transforming the characteristics and interactions into meaningful insights is fascinating. We collect data so we can extract actionable business insights and make successful, game-changing decisions. Data is an investment.

In our last few articles we briefly introduced the prediction of time-to-event and its effect on managing account transition. We also discussed wallet share maximization and customer lifetime value (CLV). As we move forward in our exploration of SMART BANKING, we will provide examples to help you start working with your data and exploring some of the many benefits it provides. We will begin the series by digging into what we call customer lifetime stages and show you how to identify them to extend your product offerings. We will end each article with an easy-to-follow recipe that gives you tangible results you can implement along the way.

Getting started

Before we begin investigating the many aspects of SMART BANKING, I want to take a moment and prepare us for the journey.  Data exploration starts with a specific question we want to answer.  Getting reliable answers from data depends on the specificity of the question asked. The more specific the better.  We will refer to the question as the response.

Statistics follows the convention of innocent until proven guilty. Imagine yourself in a courtroom. The judge is the analyst and the data is the evidence against the accused (the response). A conviction stands only when the data provides enough evidence to convict. It can be tempting to abuse this convention by settling on a verdict in advance, based on a hunch, and then searching the data for confirmation. Even if you ignore the ethics of this practice, it is statistically invalid because, as we explain later in this article, correlation is not causation. As in the legal context, evidence should be deemed valid only if it was acquired by legitimate and objective means. Simply hunting for correlations can drastically alter or even reverse the results of an analysis, because the evidence is not linked to the logic behind the conclusion. For more information, see our previous article “Is your Data Lying to You”.

Data is a representation of the world around us.  How data is interpreted, and the resulting decisions derived from it, depend heavily on the researcher’s intuition, cleverness and skill.

Now, as we begin to explore the aspects of SMART BANKING, and customer stages specifically, it will help to have a sample of data on hand. To move ahead more seamlessly, I suggest assembling a small data set of 100 or so customer accounts with transactions spanning a few years. If you have questions on how to assemble such a data set, please feel free to contact us for assistance.

The difference between causation and correlation

Statistics gives us correlation. Correlation quantifies how two or more things move together. The relationship is not necessarily deterministic. Statistics, along with subject matter expertise and repeated realization, gives us probable cause. We use the term probable cause intentionally: data allows us to show probable cause, as in the courtroom. Predictive analytics can give us probable cause for an event, but it can never predict the individual event. Banks profit by using predictive analytics holistically.

When we say that statistics gives us correlation and not causation, we mean that, for two events ‘A’ and ‘B’, correlation does not tell us that ‘A causes B’ or that ‘B causes A’.

‘A’ might be the number of car accidents in a major city and ‘B’ might be the price of ice cream. Neither one directly causes the other, but the evidence in this case shows that they are highly correlated. When two things are highly correlated, it does not mean that one is a direct cause of the other. High correlation is likely the result of a variable that we have not measured; in this example, that variable is the outside temperature.
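To make the role of the unmeasured variable concrete, here is a minimal sketch in Python (standard library only, used here purely for illustration). The figures are made up and deliberately constructed so that temperature accounts for everything the two series share; the function names are our own. It computes the ordinary correlation between accidents and ice cream price, and then the partial correlation once temperature is taken into account.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def partial_corr(x, y, z):
    """Correlation between x and y after removing the linear effect of z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Made-up figures for illustration only (not real data).
temperature = [0, 5, 10, 15, 20, 25, 30, 35]                     # degrees Celsius
accidents = [103, 107, 117, 133, 143, 147, 157, 173]             # accidents per month
ice_cream_price_cents = [204, 214, 216, 226, 236, 246, 264, 274]

raw = pearson(accidents, ice_cream_price_cents)
controlled = partial_corr(accidents, ice_cream_price_cents, temperature)

print(f"correlation, ignoring temperature:      {raw:.2f}")
print(f"correlation, controlling for temperature: {controlled:.2f}")
```

With these constructed numbers the raw correlation is about 0.98, while the partial correlation is essentially zero: once temperature is accounted for, nothing links the two series. Real data will rarely separate this cleanly.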

The point we are attempting to make is that even with the most advanced statistical techniques and the most elaborate data warehouse, data alone cannot give us the cause of the events that happen around us.

That doesn’t mean that there is nothing left to talk about and we need to shut the lights off and go home. What we need to know is what part of the data is important (the signal) and what part of the data is random noise.

Understand the environment your data is trying to describe

In order to allow predictive analytics to be reliable, we need to not only know what data to collect but also how the characteristics of the data that we collect work together to explain a phenomenon.  Deciding what data to collect is an iterative learning process.  To facilitate the process we introduce an exploratory statistical method called ‘Factor Analysis’.

Factor analysis is a data analysis technique that uncovers hidden factors in data. It is widely used in the fields of biology and psychology, in which data is often very complex and the subjects under study are unable to communicate. In banking we often find ourselves in the same situation. People are banking more and more online. If customers were talking to us about their needs, there would be less of a need for predictive analytics.

Factor analysis is most easily explained as a means of uncovering hidden factors in the data that drive your business. The generation of profits usually does not depend on a single attribute or a single variable, but rather on a complex synergy of the many decisions and actions taken on a daily basis. Factor analysis is a means of uncovering those factors, giving a roadmap to success by highlighting the interaction of variables that drive the solution.

A step-by-step guide to performing factor analysis on your data

A factor analysis is performed in four (4) steps. Since a statistical platform is necessary to perform the calculations, we choose the platform R for its simplicity of use, open source availability (free of cost) and robustness of available procedures.

Step 1:

The first step in any exploratory analysis is to assemble a data set that includes all of the variables or attributes you wish to consider as candidates in your environment. The environment here is the problem statement. For instance, it could be the definition of which variables to consider as possible candidates to be correlated with the response. Include as many variables as you wish in this initial stage, but be sure not to include more variables than observations. If you have more variables than observations, you will need to increase the number of customers in your data set. In short, there should always be more rows than columns in your data.
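The rows-versus-columns rule is easy to check before going any further. The sketch below is in Python (standard library only, for illustration); the customer records and field names are hypothetical, and each record is assumed to carry the same set of variables.

```python
def has_enough_observations(rows):
    """Return True when there are more observations (rows) than variables (columns)."""
    n_observations = len(rows)
    n_variables = len(rows[0]) if rows else 0
    return n_observations > n_variables

# Hypothetical customer records: one dict per customer, one key per candidate variable.
customers = [
    {"age": 34, "credit_score": 710, "balance": 5200.0},
    {"age": 51, "credit_score": 655, "balance": 12850.0},
    {"age": 27, "credit_score": 690, "balance": 930.0},
    {"age": 45, "credit_score": 780, "balance": 30410.0},
]

if has_enough_observations(customers):
    print("OK: more customers than variables")
else:
    print("Add more customers or drop some variables")
```

This checks only the bare minimum stated above. In practice you will want considerably more observations than variables, which is why we suggested a sample of 100 or so accounts.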

Step 2:

Perform the factor analysis

  1. Using the platform R
    1. If the data contains non-numeric variables such as gender or occupation, use the function fa.parallel.poly() from the psych package to perform the factor analysis, since this function will accept categorical variables.
    2. If the data contains only numeric variables such as age or credit score, use the function factanal(). Documentation of the psych package can be found at the link provided below.

  • Consider different rotations of the data (varimax, promax, simplex, oblique, etc.) and keep the one that gives the largest standard deviation in differences between factor loadings.
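To give a feel for what a factor loading is, here is a simplified sketch in Python (standard library only, for illustration). It is not what factanal() does: factanal() fits the model by maximum likelihood and applies a rotation, whereas this sketch extracts a single, unrotated factor from a correlation matrix using the principal-component method and power iteration. The correlation matrix and variable names are hypothetical.

```python
from math import sqrt

def first_factor_loadings(corr, iterations=200):
    """Loadings of the first principal factor of a correlation matrix.

    Power iteration finds the dominant eigenvector v and eigenvalue lam;
    the loading of variable i on the factor is sqrt(lam) * v[i].
    """
    p = len(corr)
    v = [1.0 / sqrt(p)] * p          # unit-length starting vector
    eigenvalue = 0.0
    for _ in range(iterations):
        w = [sum(corr[i][j] * v[j] for j in range(p)) for i in range(p)]
        eigenvalue = sqrt(sum(x * x for x in w))   # length of corr @ v
        v = [x / eigenvalue for x in w]            # renormalize to unit length
    return [sqrt(eigenvalue) * x for x in v], eigenvalue

# Hypothetical correlations between three numeric customer variables.
variables = ["balance", "monthly_deposits", "card_spend"]
corr = [
    [1.0, 0.8, 0.7],
    [0.8, 1.0, 0.6],
    [0.7, 0.6, 1.0],
]

loadings, eigenvalue = first_factor_loadings(corr)
for name, loading in zip(variables, loadings):
    print(f"{name:18s} {loading:.2f}")
print(f"variance explained: {eigenvalue / len(variables):.0%}")
```

All three variables load strongly on the single factor here, which is what we would expect when every pair is highly correlated. With real data and several factors, use a statistical platform such as R rather than a hand-rolled routine.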

Step 3:

After you have acquired your factors, consider each factor and determine which variables dominate its composition. These dominant variables will give you a clear picture of the relationships that exist within your data. Try to give a quick definition of each factor as it relates to the variables.

The definition is an essential step to understanding your data and what questions your particular data might be able to answer.  Performing this crucial step before performing an analysis will give insight into the validity of your results.
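One way to pick out the dominant variables is to keep, for each factor, the variables whose loading is large in absolute value. The sketch below is in Python (standard library only, for illustration). The loadings, the variable names and the 0.5 cut-off are all assumptions chosen for the example, not values from a real analysis.

```python
def dominant_variables(loadings, threshold=0.5):
    """Map each factor to the variables whose absolute loading meets the threshold, strongest first."""
    result = {}
    for factor, by_variable in loadings.items():
        strong = [(name, value) for name, value in by_variable.items()
                  if abs(value) >= threshold]
        strong.sort(key=lambda item: abs(item[1]), reverse=True)
        result[factor] = [name for name, _ in strong]
    return result

# Hypothetical loadings, as they might be copied from the output of a factor analysis.
loadings = {
    "Factor1": {"balance": 0.82, "monthly_deposits": 0.76,
                "card_spend": 0.12, "online_logins": 0.05},
    "Factor2": {"balance": 0.10, "monthly_deposits": 0.21,
                "card_spend": 0.88, "online_logins": -0.67},
}

dominant = dominant_variables(loadings)
for factor, names in dominant.items():
    print(factor, "is dominated by", ", ".join(names))
```

In this made-up example Factor1 is dominated by balance and monthly deposits, so we might define it as a "saving" factor, while Factor2 contrasts card spend with online logins. Naming the factors remains a judgment call that depends on your knowledge of the business.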

Step 4: 

Calculate factor scores for each observation in order to rank the factors that are most influential for a particular customer. This ranking is a precursor to any type of clustering exercise and gives invaluable insight into the analyses performed. If at any point you wish to seek additional help in interpreting your results or performing the calculations, please contact us for assistance.
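Once factor scores are available, the ranking itself is simple. The sketch below is in Python (standard library only, for illustration). The customer identifiers, factor names and scores are hypothetical, and ranking by absolute score is our assumption: a strongly negative score is treated as just as influential as a strongly positive one.

```python
def rank_factors(scores):
    """Order factor names from most to least influential for one customer, by absolute score."""
    return sorted(scores, key=lambda name: abs(scores[name]), reverse=True)

# Hypothetical factor scores, one dict per customer.
factor_scores = {
    "cust_001": {"saving": 1.4, "spending": -0.3, "digital": 0.6},
    "cust_002": {"saving": -0.2, "spending": 2.1, "digital": -1.0},
}

rankings = {customer: rank_factors(scores)
            for customer, scores in factor_scores.items()}

for customer, ranked in rankings.items():
    print(customer, "->", ", ".join(ranked))
```

Here the first customer is driven mainly by the saving factor and the second by the spending factor, which is the kind of distinction a later clustering exercise can build on.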


Getting started with predictive analytics begins with the formulation of specific questions into which you wish to gain insight. Often, the process becomes iterative and each question leads to the next as you dig deeper and deeper into the events that drive your business. Factor analysis is a very powerful exploratory procedure that allows you and your data to begin the conversation. There are many different tools available to perform a factor analysis, depending on the types of data you encounter. Factor analysis gives you a broad understanding of the relationships that exist within your data before you apply more rigorous statistical techniques. A carefully executed factor analysis helps to ensure that the data can provide the insights required.

In our next article we will begin digging deeper, providing examples as well as tutorials on the many benefits of SMART BANKING. Follow along as we show you how to begin profiting from the data you already have.

