Our ability to collect and store data continues to increase year after year. With this, the impact such repositories have on our lives and for an organization increases as well. A competitive advantage for organizations is determined by their ability to analyze and make informed decisions from their troves of collected data. Unfortunately, this is not a simple task. Established organizations with the most to gain from their information conquests have possibly followed a long path of employing a number of data solutions to meet their evolving requirements.
- Transactions databases
- Data Warehouses
- Cloud Based Solution – SalesForce, AWS
- Single User Databases – Access
Each solution may have been employed along the path to align with a data strategy initiative, gap fill missing features of past implemented solutions or meet an immediate business need.
In environments where a combination of such sources exists, how can analytic solutions be derived efficiently? Such solutions are, realistically, fragmented due to the maturing process of internal analytical initiatives. Starting as a dream once formed in individual departments that slowly merged over time into an aligned vision accompanied by a centralized analytics department. Such instances can find themselves struggling with conflicting sources of data and ranges of contributing skill sets.
Different skillsets aligning to the enterprise data sources; Marketing wants the well formated solutions provided by Tableau, Power BI. Risk needs models build in R, SAS, Python. Vendors need CSV extracts. Sales wants an Excel spreadsheet (just hope they don’t ‘tweak’ the numbers). Each custom solution uses a custom access mechanism (query, temp table, in-memory or on-disk dataset). How sustainable and rigorous is this process? Here is a structure that developed out of necessity but struggles to scale, innovate and apply rigor.
Sometimes it becomes necessary to focus on what one can positively impact with the resources available to them. For-instance, given Figure 1 above, I can not realistically redevelop an enterprise data strategy in a timeframe that will align with the immediate analytical requirements of an organization; it probably took decades to get this infrastructure in place. I can, however, identify the roadblocks and pain points that are impeding progress and then insert a solution into the process.
After a period of observation I found the following as the common hindrances that prevented teams from being able to solidify an insight driven analytics foundation:
- Collected, cleaned and validated data sets are often used one time
- Collecting and preparing data accounts for ~40%-60% of time allowed per deliverable
- Multiple Data Access mechanisms prevents consistency – ‘reinventing the wheel’
- Multiple Data Sources, some deprecated, prevents reliability as the same data is access from different sources.
- Testing often occurs using the final output (report, csv, spreadsheet, etc)
Changing Behaviors – Analytical Layer
After identifying the list of success detractors, I constructed another list that I believe is the core of the Analytical Layer. I also believe these simple items promote a rigorous and robust analysis environment.
Can we provide solutions that ensure:
- Solution is used across multiple projects and by multiple team members
- Shared data sets that have been validated closer to the source systems will save time and reduce errors/
- Separation of Concerns
- Analytics should be separated from data structure concerns – data structures can and will changes, decoupling the analytics/reporting from the data structures will prevent future rework and errors.
Changing set behaviors is hard, if not impossible. To get people to want to change, the ‘change’ needs to be seamless (fit in with their current individual processes) and provide immediate benefits (save time and/or offer expanded feature sets).
Early on I found faults with the following :
- Stored Procedures – Failed to encapsulate all sources, little performance benefits when compared to queries (despite being compiled), tight coupling to source sytems
- RDBMS Tables – Managing database structure placed an additional burden on team, rigid structures may impede pace
- OLAP Cubes – Long build-up time, too rigid to keep pace with fast moving analytics.
Then I started using NoSQL, more specifically MongoDB to construct an ‘Analytical Layer’. This gained traction very quickly as it was simple to implement, integrate into the existing process and stored the data in a fantastic structure. The document storage structure accommodates for analytical aggregations and hierarchical structures that would have required excess SQL to pivot, rank, and partition out of a flat table structure. Figure 2 below depicts the analytics process incorporating a NoSQL Analytical Layer.
The principle for the Analytical Layer is ‘fluidity’ – ensuring the analytics team has access to the data they need, in an analytical ready view, as soon as they need it. If analytics is waiting weeks-months to get the data then the process is a failure. For this layer, ETL can easily be done effectively using R or Python; blow away the collection every load with a new load of the data. I love the ease at which I can construct an analytical view of complex data structures using MongoDB and R. I am, however, waiting and preparing for the eventual conflict caused by such flexibility .
Adding Additional Attributes
Using R or Python offers the added benefits of enriching the data before storing it in the Analytical Layer. Running regression models, forecasts, cleansing functions, etc.
The Analytical Layer is a great place to store model results: regression, machine learning,etc. Having the ability to track predictors (other model variables ) throughout time can benefit model training, validation and analysis.
I continue to find new and enjoyable ways to to use the Google Places API – Using R and Google Places API Geocode Locations . Often my source data does not have latitude and longitude on it so my ETL will get the coordinates via an API call using the address or business name, city, state.
Since this is for analytics then collections are typically an aggregated subset of the original data, with a referencing system keys if needed.
The Analytical Layer is primarily accessed directly from the NoSQL database; however, I am interested (and working on) expanding the layer out to a service. Complex processes that should be shared amount an analytics department should be service methods. This could include: regression models, geospatial distance functions, pricing models, and more.
When to use what:
- Does shared function exists in service – use service
- Does the data exist in NoSQL – use Analytical Layer
When NO is the answer to both of the above then the solution will have to come from the source data. If this data pull is not a one time event, then build it into the Analytical Layer.