Implementing an Analytical Layer

Implementing an analytical layer with MongoDB and R to optimize department resources.

Our ability to collect and store data continues to increase year after year, and with it the impact these repositories have on our lives and our organizations. An organization's competitive advantage is increasingly determined by its ability to analyze its troves of collected data and make informed decisions from them. Unfortunately, this is not a simple task. Established organizations, the ones with the most to gain from their information, have often followed a long path of employing a number of data solutions to meet their evolving requirements:

  • Mainframes
  • Transactional databases
  • Data Warehouses
  • Cloud-Based Solutions – Salesforce, AWS
  • Single-User Databases – Access

Each solution may have been employed along the path to align with a data strategy initiative, fill gaps left by previously implemented solutions, or meet an immediate business need.

In environments where a combination of such sources exists, how can analytic solutions be derived efficiently? Realistically, such solutions are fragmented due to the maturing process of internal analytical initiatives: what started as a dream in individual departments slowly merged over time into an aligned vision accompanied by a centralized analytics department. Such organizations can find themselves struggling with conflicting sources of data and a wide range of contributing skill sets.

Figure 1: Inconsistent Analytics (before the Analytical Layer)

 

Different skill sets align to the enterprise data sources: Marketing wants the well-formatted solutions provided by Tableau and Power BI. Risk needs models built in R, SAS, or Python. Vendors need CSV extracts. Sales wants an Excel spreadsheet (just hope they don’t ‘tweak’ the numbers). Each custom solution uses a custom access mechanism (query, temp table, in-memory or on-disk dataset). How sustainable and rigorous is this process? This is a structure that developed out of necessity but struggles to scale, innovate, and apply rigor.

 

Broken Process

Sometimes it becomes necessary to focus on what one can positively impact with the resources available. For instance, given Figure 1 above, I cannot realistically redevelop an enterprise data strategy in a timeframe that aligns with the immediate analytical requirements of the organization; it probably took decades to get this infrastructure in place. I can, however, identify the roadblocks and pain points that are impeding progress and then insert a solution into the process.

Empirical Evidence

After a period of observation, I found the following common hindrances that prevented teams from solidifying an insight-driven analytics foundation:

  • Collected, cleaned, and validated data sets are often used only once
    • Collecting and preparing data accounts for ~40%-60% of the time allowed per deliverable
  • Multiple data access mechanisms prevent consistency – ‘reinventing the wheel’
  • Multiple data sources, some deprecated, prevent reliability because the same data is accessed from different sources
  • Testing often occurs using the final output (report, CSV, spreadsheet, etc.)

 

Changing Behaviors – Analytical Layer

After identifying the list of success detractors, I constructed another list that I believe forms the core of the Analytical Layer. I also believe these simple items promote a rigorous and robust analysis environment.

Can we provide solutions that ensure:

  • Reusability
    • Solution is used across multiple projects and by multiple team members
  • Testability
    • Shared data sets that have been validated closer to the source systems will save time and reduce errors
  • Separation of Concerns
    • Analytics should be separated from data structure concerns – data structures can and will change; decoupling the analytics/reporting from the data structures will prevent future rework and errors.

 

Changing set behaviors is hard, if not impossible. To get people to want to change, the ‘change’ needs to be seamless (fit in with their current individual processes) and provide immediate benefits (save time and/or offer expanded feature sets).

Early on I found faults with the following:

  • Stored Procedures – Failed to encapsulate all sources, offered little performance benefit compared to queries (despite being compiled), and created tight coupling to source systems
  • RDBMS Tables – Managing the database structure placed an additional burden on the team, and rigid structures may impede pace
  • OLAP Cubes – Long build times, too rigid to keep pace with fast-moving analytics

 

Then I started using NoSQL, more specifically MongoDB, to construct an ‘Analytical Layer’. This gained traction very quickly because it was simple to implement, easy to integrate into the existing process, and stored the data in a fantastic structure. The document storage structure accommodates analytical aggregations and hierarchical structures that would otherwise require excess SQL to pivot, rank, and partition out of a flat table structure. Figure 2 below depicts the analytics process incorporating a NoSQL Analytical Layer.

 

Figure 2: Analytical Layer
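
To make the document structure concrete, here is a minimal sketch using the mongolite package of the kind of nested, analysis-ready document the layer can hold. The collection, database, and field names are made up for illustration.

```r
library(mongolite)

# One document at the grain analysts actually ask about: a region-month summary
# with a nested ranking that would take pivot/rank/partition SQL to build from flat rows.
doc <- list(
  region       = "Northeast",                              # illustrative values
  report_month = "2021-06",
  totals       = list(orders = 1842L, amount = 97314.25),
  top_products = list(
    list(rank = 1L, sku = "SKU-104", amount = 18210.50),
    list(rank = 2L, sku = "SKU-211", amount = 15764.00)
  )
)

layer <- mongo(collection = "region_summary", db = "analytics",
               url = "mongodb://localhost:27017")
layer$insert(doc)   # stored as a single nested document
```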

Loading Data

The guiding principle for the Analytical Layer is ‘fluidity’ – ensuring the analytics team has access to the data they need, in an analysis-ready view, as soon as they need it. If analytics is waiting weeks or months to get the data, then the process is a failure. For this layer, ETL can be done effectively using R or Python; blow away the collection on every load and replace it with a fresh load of the data. I love the ease with which I can construct an analytical view of complex data structures using MongoDB and R. I am, however, waiting and preparing for the eventual conflict caused by such flexibility.
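
A minimal load sketch in R, assuming a hypothetical relational source reachable via ODBC and a local MongoDB instance; every connection detail, table, and column name below is a placeholder.

```r
library(DBI)
library(odbc)
library(mongolite)

# Pull raw rows from a (hypothetical) transactional source
src    <- dbConnect(odbc(), dsn = "warehouse_dsn")   # placeholder DSN
orders <- dbGetQuery(src, "SELECT order_id, customer_id, region,
                                  order_date, amount
                             FROM sales.orders")
dbDisconnect(src)

# Light transformation into an analysis-ready shape
orders$order_month <- format(as.Date(orders$order_date), "%Y-%m")

# Blow away the collection and reload it with the fresh data
layer <- mongo(collection = "orders_analytical", db = "analytics",
               url = "mongodb://localhost:27017")
layer$drop()
layer$insert(orders)
```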

Adding Additional Attributes

Using R or Python offers the added benefit of enriching the data before storing it in the Analytical Layer – running regression models, forecasts, cleansing functions, and so on.
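
As a sketch, reusing the hypothetical orders data frame from the load example above, enrichment can happen just before the insert step; the cleansing rule and model are illustrative only.

```r
# Added before layer$insert() in the load sketch
orders$amount_clean <- ifelse(is.na(orders$amount), 0, orders$amount)   # simple cleansing rule
orders$month_index  <- as.integer(factor(orders$order_month))           # numeric time index

fit <- lm(amount_clean ~ month_index, data = orders)                    # simple regression
orders$expected_amount <- predict(fit, newdata = orders)                # enrichment attribute
```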

Predictors

The Analytical Layer is a great place to store model results: regression, machine learning, etc. Having the ability to track predictors (and other model variables) throughout time can benefit model training, validation, and analysis.
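
A hedged sketch of persisting model output with a timestamp so predictor estimates can be tracked across runs; the model, collection, and field names are illustrative and build on the hypothetical orders frame above.

```r
library(mongolite)

fit <- lm(amount_clean ~ month_index + region, data = orders)   # from the earlier sketch

model_run <- data.frame(
  run_date  = Sys.time(),                  # when this training run happened
  model     = "monthly_amount_lm",         # illustrative model name
  term      = names(coef(fit)),            # one row per predictor
  estimate  = unname(coef(fit)),
  r_squared = summary(fit)$r.squared
)

models <- mongo(collection = "model_runs", db = "analytics",
                url = "mongodb://localhost:27017")
models$insert(model_run)   # append only -- the run history is the point
```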

Geocoding

I continue to find new and enjoyable ways to use the Google Places API – Using R and Google Places API Geocode Locations. Often my source data does not have latitude and longitude on it, so my ETL will get the coordinates via an API call using the address or business name, city, and state.
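
Here is a hedged sketch of that lookup step using Google’s Geocoding endpoint (a stand-in here for the Places call in the linked post) via httr and jsonlite; the API key and the example address are placeholders, and error handling is kept minimal.

```r
library(httr)
library(jsonlite)

geocode_address <- function(address, api_key) {
  resp   <- GET("https://maps.googleapis.com/maps/api/geocode/json",
                query = list(address = address, key = api_key))
  parsed <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
  if (parsed$status != "OK") return(c(lat = NA_real_, lng = NA_real_))
  loc <- parsed$results$geometry$location[1, ]   # first (best) match
  c(lat = loc$lat, lng = loc$lng)
}

# Example: build the lookup string from business name, city, and state
coords <- geocode_address("Acme Supply, Columbus, OH", api_key = "YOUR_KEY")
```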

Drilling Down

Since this layer is for analytics, collections are typically an aggregated subset of the original data, with reference keys back to the source system if drill-down is needed.
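
A sketch of storing one collection at an aggregated grain while keeping keys back to the source rows, again reusing the hypothetical orders frame and a local MongoDB; mongolite serializes the list column of ids as an array within each document.

```r
library(dplyr)
library(mongolite)

region_month <- orders %>%
  group_by(region, order_month) %>%
  summarise(total_amount = sum(amount_clean),
            order_count  = n(),
            order_ids    = list(order_id),   # reference keys back to the source grain
            .groups      = "drop")

summary_layer <- mongo(collection = "region_month_summary", db = "analytics",
                       url = "mongodb://localhost:27017")
summary_layer$drop()
summary_layer$insert(region_month)
```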

Accessing Data

The Analytical Layer is primarily accessed directly from the NoSQL database; however, I am interested in (and working on) expanding the layer out to a service. Complex processes that should be shared among an analytics department should be service methods. These could include regression models, geospatial distance functions, pricing models, and more.
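
As a sketch of what that service direction could look like, here is a minimal plumber endpoint wrapping a shared geospatial distance function; the route and function are illustrative, not an existing API.

```r
# plumber.R -- exposes a shared analytics function over HTTP
library(plumber)

#* Great-circle (haversine) distance in kilometers between two points
#* @param lat1 Latitude of the first point
#* @param lon1 Longitude of the first point
#* @param lat2 Latitude of the second point
#* @param lon2 Longitude of the second point
#* @get /distance
function(lat1, lon1, lat2, lon2) {
  to_rad <- function(d) as.numeric(d) * pi / 180
  dlat <- to_rad(lat2) - to_rad(lat1)
  dlon <- to_rad(lon2) - to_rad(lon1)
  a <- sin(dlat / 2)^2 +
       cos(to_rad(lat1)) * cos(to_rad(lat2)) * sin(dlon / 2)^2
  6371 * 2 * asin(sqrt(a))   # Earth radius in km times central angle
}

# Launch with: plumber::plumb("plumber.R")$run(port = 8000)
```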

 

When to use what:

  1. Does the shared function exist in the service? Use the service.
  2. Does the data exist in NoSQL? Use the Analytical Layer.

When the answer to both of the above is no, the solution will have to come from the source data. If the data pull is not a one-time event, then build it into the Analytical Layer.

 

Regards,

Jonathan

 
