Feb 09, 2016

Using data to answer interesting questions is what researchers are busy doing in today's data-driven world. Given huge volumes of data, the challenge of processing and analyzing it is a big one, particularly for statisticians or data analysts who do not have the time to invest in learning business intelligence platforms, or technologies from the Hadoop ecosystem, Spark, or NoSQL databases, that would help them analyze terabytes of data in minutes.

The norm today is for researchers or statisticians to build their models on subsets of data in analytics packages like R, MATLAB, or Octave, and then hand the formulas and data processing steps to IT teams who build production analytics solutions.

One problem with this approach is that if the researcher realizes something new after running his model on all of the data in production, the process has to be repeated all over again.

What if the researcher could work with a MongoDB developer and run his analysis on all of the production data and use it as his exploratory dataset, without having to learn any new technology or complex programming languages, or even SQL?

[Image: MongoDB and business intelligence]

If we use MongoDB's Aggregation Pipeline and the MEAN stack effectively, we can achieve this in a reasonably short time. Through this article and the code available in this GitHub repository, we would like to show how easy that is.

Most of the business intelligence tools on the market let researchers import datasets from NoSQL and other big data technologies into the tool, where the transformations and analysis are then done. In this business intelligence tutorial, however, we use the power of the MongoDB Aggregation Pipeline without pulling the data out of MongoDB: the researcher uses a simple interface to perform all kinds of transformations directly on a production big data system.

MongoDB Aggregation Pipeline for Business Intelligence

Simply put, MongoDB’s aggregation pipeline is a framework to perform a series of data transformations on a dataset. The first stage takes the entire collection of documents as input, and from then on each subsequent stage takes the previous transformation’s result set as input and produces some transformed output.

There are 10 types of transformations that can be used in an aggregation pipeline:

  • $geoNear: outputs documents in order of nearest to farthest from a specified point
  • $match: filters the input document set by the given expressions
  • $project: creates a result set with a subset of the input fields or with computed fields
  • $redact: restricts the contents of the documents based on information stored in the documents themselves
  • $unwind: takes an array field with n elements from a document and outputs n documents, each with one of the array's elements in place of that array
  • $group: groups documents by one or more fields and performs aggregations on other fields
  • $limit: picks the first n documents from the input set (useful for percentile calculations, etc.)
  • $skip: ignores the first n documents of the input set
  • $sort: sorts all input documents according to the given sort specification
  • $out: takes all the documents returned by the previous stage and writes them to a collection

Except for the first and last in the list above, there are no rules about the order in which these transformations may be applied. $out should be used only once, and at the end, if we want to write the result of the aggregation pipeline to a new or existing collection. $geoNear can be used only as the first stage of a pipeline.
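
To make the chaining concrete, here is a minimal sketch (for the mongo shell) of how a few of these stages can be strung together. The collection name faculties and the literal values are assumptions used purely for illustration; the fields come from the schema introduced in the next section.

// Minimal sketch; "faculties" is a hypothetical collection name.
// Each stage receives the previous stage's output as its input.
db.faculties.aggregate([
    {$match: {State: "CA"}},                                      // keep only one state's institutions
    {$group: {_id: "$City", AvgSalary: {$avg: "$AvgSalaryAll"}}}, // average the overall salary per city
    {$sort: {AvgSalary: -1}},                                     // highest averages first
    {$limit: 10}                                                  // keep the top 10 cities
])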

In order to make things easier to understand, let us walk through two datasets and two questions relevant to these datasets.

Difference in Salaries by Designation

In order to explain the power of MongoDB’s aggregation pipeline, we have downloaded a dataset which has salary information of university instructional staff for the entire US. This data is available at nces.ed.gov. We have data from 7598 institutions with the following fields:

var mongoose = require('mongoose');

var FacultySchema = mongoose.Schema({
	InstitutionName : String,
	AvgSalaryAll : Number,
	AVGSalaryProfessors : Number,
	AVGSalaryAssociateProfessors : Number,
	AVGSalaryAssistantProfessors : Number,
	AVGSalaryLecturers : Number,
	AVGSalaryInstructors : Number,
	StreetAddress : String,
	City : String,
	State : String,
	ZIPCode : String,
	MenStaffCount : Number,
	WomenStaffCount : Number
});

With this data, we want to find out the average difference between the salaries of associate professors and professors in each state. An associate professor can then see in which states their salary comes closest to a professor's.

To answer this question, a researcher first needs to weed out bad data from the collection, because there are a few rows/documents in our dataset where the average salary is a null or empty string. To accomplish this cleaning of the dataset we will add the following stage:

{$match: {AVGSalaryProfessors: {$not: {$type: 2}}, AVGSalaryAssociateProfessors:  {$not: {$type: 2}}}}

This will filter out all the entities which have string values in those two fields. In MongoDB, each type is represented with a unique number – for strings, the type number is 2.

This dataset is a good example because in real world data analytics, engineers often have to deal with data cleanups as well.
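
As a quick sanity check before building out the rest of the pipeline, the same type test can be used in a plain query to count how many documents this $match stage will discard. This is only a sketch, and the collection name faculties is an assumption:

// Count documents where either salary field was imported as a string (BSON type 2).
// "faculties" is a hypothetical collection name.
db.faculties.count({
    $or: [
        {AVGSalaryProfessors: {$type: 2}},
        {AVGSalaryAssociateProfessors: {$type: 2}}
    ]
})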

Now that we have some stable data, we can continue to the next stage where we will average the salaries by state:

{$group: {_id: "$State", StateAVGSalaryProfessors: {$avg: "$AVGSalaryProfessors"}, StateAVGSalaryAssociateProfessors: {$avg: "$AVGSalaryAssociateProfessors"}}}

We just need to run a projection of the above result set and get the difference in state average salaries, as shown below in Stage 3 of our pipeline:

{$project: {_id: 1, SalaryDifference: {$subtract: ["$StateAVGSalaryProfessors", "$StateAVGSalaryAssociateProfessors"]}}}

This should give us the state-level average salary difference between professors and associate professors from a dataset of 7519 educational institutions all over the US. To make this information even easier to interpret, let us do a simple sort so we know which state has the smallest difference by adding a $sort stage:

{$sort: { SalaryDifference: 1}}

From this dataset, it is apparent that Idaho, Kansas, and West Virginia are the three states where the difference between the salaries of associate professors and professors is the smallest.

The full aggregation pipeline generated for this is shown below:

[
{$match: {AVGSalaryProfessors: {$not: {$type: 2}}, AVGSalaryAssociateProfessors:  {$not: {$type: 2}}}},
{$group: {_id: "$State", StateAVGSalaryProfessors: {$avg: "$AVGSalaryProfessors"}, StateAVGSalaryAssociateProfessors: {$avg: "$AVGSalaryAssociateProfessors"}}},
{$project: {_id: 1, SalaryDifference: {$subtract: ["$StateAVGSalaryProfessors", "$StateAVGSalaryAssociateProfessors"]}}},
{$sort: { SalaryDifference: 1}}
]
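
Inside the tool, a pipeline like this is ultimately executed from the Node.js layer. The sketch below assumes a mongoose model compiled from the FacultySchema shown earlier; the model name and the callback handling are assumptions for illustration, not the application's actual code.

// Minimal sketch: running the aggregation from Node.js through mongoose.
// The model name "Faculty" is hypothetical.
var Faculty = mongoose.model('Faculty', FacultySchema);

Faculty.aggregate([
    {$match: {AVGSalaryProfessors: {$not: {$type: 2}}, AVGSalaryAssociateProfessors: {$not: {$type: 2}}}},
    {$group: {_id: "$State", StateAVGSalaryProfessors: {$avg: "$AVGSalaryProfessors"}, StateAVGSalaryAssociateProfessors: {$avg: "$AVGSalaryAssociateProfessors"}}},
    {$project: {_id: 1, SalaryDifference: {$subtract: ["$StateAVGSalaryProfessors", "$StateAVGSalaryAssociateProfessors"]}}},
    {$sort: {SalaryDifference: 1}}
], function (err, results) {
    if (err) return console.error(err);
    console.log(results); // one document per state with its SalaryDifference
});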

The resulting dataset looks like this. Researchers can also export these results to CSV in order to report on them using visualization packages like Tableau, or through simple Microsoft Excel charts; one way to do that is sketched after the screenshot below.

[Image: MongoDB dataset example]
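
For example, appending an $out stage such as {$out: "salary_difference_by_state"} would persist the result set to a collection, which the mongoexport utility can then dump to CSV. The database name, collection name, and output file below are assumptions for illustration:

# Hypothetical export of the persisted results to CSV.
mongoexport --db faculty --collection salary_difference_by_state --type=csv --fields _id,SalaryDifference --out salary_difference_by_state.csv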

Average Pay by Employment Type

Another example that we will explore in this article involves a dataset obtained from www.data.gov. Given the payroll information of all state and local government organizations in the United States of America, we would like to figure out the average pay of full-time and part-time “Financial Administration” employees in each state.

The dataset has been imported, resulting in 1975 documents where each document follows this schema:

mongoose.Schema({
	State : String,
	GovernmentFunction : String,
	FullTimeEmployees : Number,
	VariationPCT : Number,
	FullTimePay : Number,
	PartTimeEmployees : Number,
	PartTimePay : Number,
	PartTimeHours : Number,
	FullTimeEquivalentEmployment : Number,
	TotalEmployees : Number,
	TotalMarchPay : Number
}, {collection: 'payroll'});

The answer to this question may help a Financial Administration employee choose the best state to move to. With our MongoDB aggregation pipeline based tool, this can be done quite easily:

In the first stage, we filter on the GovernmentFunction field to discard all non-"Financial Administration" entities:

{$match:{GovernmentFunction:'Financial Administration'}}

In the next stage, we group the entities by state and calculate the average full-time and part-time salaries in each state:

{$group: {_id: '$State', FTP_AVG: {$avg: '$FullTimePay'}, PTM_AVG: {$avg: '$PartTimePay'}}}

Finally, we sort the results from higher-paying states to lower-paying states:

{$sort: {FTP_AVG: -1, PTM_AVG: -1}}

This should allow the tool to generate the following aggregation pipeline:

[
    {$match:{GovernmentFunction:'Financial Administration'}},
    {$group: {_id: '$State', FTP_AVG: {$avg: '$FullTimePay'}, PTM_AVG: {$avg: '$PartTimePay'}}},
    {$sort: {FTP_AVG: -1, PTM_AVG: -1}}
]

Running the aggregation pipeline should produce some results like this:

[Image: MongoDB aggregation pipeline results]
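
For reference, the generated pipeline can also be run directly in the mongo shell against the payroll collection named in the schema options above. In the sketch below, a $limit stage is appended as an optional extra (an assumption, not part of the generated pipeline) to keep only the ten highest-paying states:

// Running the pipeline in the mongo shell; "payroll" comes from the schema options above.
// The trailing $limit stage is optional and keeps only the top ten states.
db.payroll.aggregate([
    {$match: {GovernmentFunction: 'Financial Administration'}},
    {$group: {_id: '$State', FTP_AVG: {$avg: '$FullTimePay'}, PTM_AVG: {$avg: '$PartTimePay'}}},
    {$sort: {FTP_AVG: -1, PTM_AVG: -1}},
    {$limit: 10}
])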

Building Blocks

To build this business intelligence application we used MEAN, which is a combination of MongoDB, Express.js, AngularJS, and Node.js.

[Image: The MEAN stack for business intelligence]

As you may already know, MongoDB is a schemaless document database. Even though each document it stores is limited to 16MB in size, its flexibility and performance, along with the aggregation pipeline framework it provides, make MongoDB a perfect fit for this tool. Getting started with MongoDB is very easy, thanks to its comprehensive documentation.
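
As a small illustration of how little ceremony is involved, connecting to MongoDB from the Node.js layer with mongoose takes only a couple of lines; the connection string below is an assumption:

// Minimal sketch: connecting to a local MongoDB instance (the URI is a placeholder).
var mongoose = require('mongoose');
mongoose.connect('mongodb://localhost:27017/faculty');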

Node.js, another integral component of the MEAN stack, provides the event-driven, server-side JavaScript environment. Node.js runs JavaScript using Google Chrome's V8 engine. The scalability promise of Node.js is what is driving many organizations towards it.

Express.js is the most popular web application framework for Node.js. It makes it easy to build APIs or any other kind of server-side business layer for web applications. It is very fast because of its minimalist nature, but is also quite flexible.
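
To sketch how these layers fit together in a tool like ours, an Express route can accept the pipeline assembled in the front end as JSON and hand it straight to MongoDB. The route path, model name, and error handling below are assumptions for illustration, not the actual application code:

// Minimal sketch of the business layer: the front end POSTs an array of pipeline
// stages and receives the transformed documents back as JSON.
// The route path and the "Faculty" model are hypothetical.
var express = require('express');
var bodyParser = require('body-parser');

var app = express();
app.use(bodyParser.json());

app.post('/api/aggregate/faculty', function (req, res) {
    var pipeline = req.body.pipeline; // e.g. [{$match: ...}, {$group: ...}, {$sort: ...}]
    Faculty.aggregate(pipeline, function (err, results) {
        if (err) return res.status(500).json({error: err.message});
        res.json(results);
    });
});

app.listen(3000);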

AngularJS, created and maintained by a number of Google engineers, is rapidly becoming one of the most popular front-end JavaScript frameworks at our disposal.

There are two reasons why MEAN is so popular and our choice for application development at techXplorers:

  • The skillset is simple. An engineer who understands JavaScript is good to go on all layers.
  • Communication between the front-end, business, and database layers happens entirely through JSON objects, which saves us significant time in design and development across the layers.

Conclusion

In this MongoDB aggregation pipeline tutorial, we have demonstrated a cost-effective way to give researchers a tool with which they can use production data as exploratory datasets and run different sets of transformations to analyze it and construct models.

We were able to develop and deploy this application end-to-end in just 3 days. The application was developed by a team of 4 experienced engineers (2 in the US and 2 in India), with a designer and freelance UX expert helping us with some thoughts on interface design. At some point in the future, I will take the time to explain how this level of collaboration works to build awesome products in unbelievably short time.

We hope you take advantage of MongoDB’s Aggregation Pipeline, and put power in the hands of your researchers who can change the world with their clever analysis and insights.

This application is live to be played with here.

This post originally appeared on the Toptal Engineering blog

Sep 29, 2015

Hi. With the advent of the new generation of the SSIS deployment model (via the Integration Services Catalog) on SQL Server, if you pay keen attention to the list of databases in your SSMS you will also notice a new addition, "SSISDB", and believe me, this new database has tons of features that give you great insight into your deployed packages.

[Screenshots]

Firstly, I would like to mention that if you use this new SSIS deployment model, out of the box you will be able to utilize good standard reports built into SSMS for tracking SSIS errors, performance, and other component-level details.

[Screenshots]

On top of that, if you would like to distribute this information across departments without troubling your DBA team, SSRS comes in handy, and that is where I step in to build the basic mock-ups that provide the vital information relating to SSIS execution runs, user details, and more.

[Screenshot]

Feel free to upgrade and redesign the reports to your heart's content with whatever visualizations your needs require, and I hope it helps you plan your IT set-up better.

Hope it helps.

Downloads

Sep 29, 2015

At times in some organizations I have found certain DBAs struggling with the SQL job history window, which to me is a little inconvenient and too tiny a window to look into for the major bugs causing job failures.

[Screenshot]

And to access this window, one needs SSMS access to the server, thus singling out the DBA and their team to deal with issues that could just as well be handled by the support desk or by business users themselves when the failure is caused by input data issues and is visible in the history.

So I thought I would build a better reporting mechanism for this: an easy-to-read report on the operational status of the SQL jobs, which can even be delivered via the subscription feature of SSRS.

[Screenshot]

The above is a simple report that can be easily deployed to one of your servers and tweaked to your heart's content into a more complex and better-visualized report than just a table.

Hope it helps.

Downloads

Sep 28, 2015

Hi, have you ever been caught up in a dilemma where you try to map a query parameter to pass into your OLE DB source and it seems to go into the abyss, meaning nothing happens or no rows get returned, with no clue in the log messages?

Hmm… just that recently happened to me when I was trying to pass one of the query parameters to my OLE DB source as follows:

SELECT * FROM [SomeTable] --This is a query for sample demo
WHERE [SomeKey] > <<Parameter>>

[Screenshot]

Eventually, after loads of debugging with multiple drivers and login permissions, we ended up discovering that the issue arose from the use of single-line comments in the SQL, which I had simply copied and dropped into the OLE DB source component.

This comment in the SQL query was causing the issue when I queried SQL Server 2008 R2 from the SSIS ETL package.

So finally, the solution I found (with the help of other references) was to use block comments in the SQL whenever dropping a query into SSIS.

SELECT * FROM [SomeTable] /*This is a query for sample demo*/
WHERE [SomeKey] > <<Parameter>>

[Screenshot]

The following article proved quite helpful here, and I thought this experience of mine might help others:

http://blog.hoegaerden.be/2012/12/05/ssis-ole-db-source-parameters-and-comments-a-dangerous-mix/

Hope it helps.

Sep 27, 2015

With the advent of the new age and its technological advances, reporting and BI are taking on a new shape in multiple dimensions.

One arena is expanding in terms of data volumes (termed Big Data), another into more complex calculations, analysis, and ease of dimensional modelling (products such as SSAS, Jedox, R, Python, and more), and some into fancy reporting and interactive dashboards (like Power View/Maps/BI and SSRS).

But something that caught my attention recently was the usage of HTML5 WebSockets and their implementation in the ASP.NET MVC framework via the SignalR framework. The combination of the two opened up a whole new world of ideas and solutions that I had been holding back on for a fair amount of time.

One of the ideas I would like to put forth in this blog is as follows:

Real-Time Monitoring Solutions for Corporate Network Servers

[Screenshot]

Many times in my previous experience across organizations, servers went down or applications crashed because a disk ran out of space, a thread pool recycled, or process/memory starvation occurred during peak hours, all due to having no real-time insight into the status of the servers across the network.

Though periodic jobs can help extract the vital statistics (via Performance Counters) into a common database from which reports can be run, the lack of real-time insight into the servers was still a real road-block.

Commercial solutions, though available, come as usual with all the cost and managerial approvals, and are either unaffordable or just not right in one way or another, as off-the-shelf products never fit the bill in all aspects of the business requirements. So to overcome this, I mocked up a quick solution to showcase the power of the above frameworks in a simple POC that monitors any machine's vital statistics (via Performance Counters) given just the machine name.

The source code is attached to this blog post, and its details can be understood from the code itself, as it is simple enough, together with the relevant documentation for the frameworks.

Just to add some pointers for my code references:

http://haishibai.blogspot.com.au/2010/02/tiy-collect-remote-performance-counters.html

http://www.asp.net/signalr/overview/getting-started/tutorial-server-broadcast-with-signalr

Note: Once the machine name is passed in by the user from the UI for the requested statistics, the web app creates a server-side timer with a 1-second interval, on each tick of which it extracts the performance counters of the target machine and sends them across to the requesting user via the SignalR/WebSocket mechanism.

In order to run the code, only one change is required: the credentials of an account that has access to the requested machine's performance counter information need to be set in the user code (not in web.config, as this is a POC).

MonitorApplication\MonitorApplication\Hubs\ServerInfo.cs

[Screenshot]

Hope it helps.

Downloads