Client: IT MNC
Application: Big Data Analytics Platform
Scope: Data Ingestion, ETL and Reporting
Tools/Language(s): HDFS, Hive, Sqoop, Spark, Kafka, Java, Scala, Cognos Analytics


Monitoring millions of host machines and servers across geographies was never easy for this IT infrastructure MNC. Different IT controls were enforced through different tools for application updates, OS patch installation, antivirus updates, encryption and data loss prevention. However, the compliance data for these controls was stored in separate databases, so there was no single view for identifying non-compliant machines, areas of non-compliance or improvement trends in IT controls compliance. Here is how a Hadoop-based data lake solution was leveraged to achieve that single view.

Infrastructure managers were struggling with dozens of daily reports on IT control parameters, each arriving from a different source. Data in one report often contradicted data in another, and network logs were analysed in isolation from the other IT control parameters. It was finally decided to build a single data lake on a Hadoop cluster, ingesting, processing and combining structured data from 8 different databases and semi-structured data from network logs into a single Hive data warehouse.
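One of the pain points above — the same host reported as compliant in one source and non-compliant in another — is straightforward to surface once all records land in one place. The sketch below shows the kind of reconciliation logic that could run over the combined data; the record shape and names (`ControlRecord`, `contradictions`) are illustrative assumptions, not the project's actual schema.

```scala
// Hedged sketch: flagging contradictory compliance records after data from
// multiple source databases has been combined. All names here are
// hypothetical, not taken from the actual project.
case class ControlRecord(host: String, control: String, compliant: Boolean)

object Reconciler {
  // Group records by (host, control); a pair is "contradictory" when two
  // sources disagree about the same control on the same host.
  def contradictions(records: Seq[ControlRecord]): Set[(String, String)] =
    records
      .groupBy(r => (r.host, r.control))
      .collect { case (key, rs) if rs.map(_.compliant).distinct.size > 1 => key }
      .toSet
}
```

The same grouping expressed here with Scala collections maps directly onto a `groupBy`/aggregate over a Hive-backed DataFrame once the data sits in the warehouse.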

A single dashboard solution with drill-down and drill-through reports was developed to provide one view of millions of machines and servers, sliced by machine type, OS type, location, domain and compliant vs non-compliant status, along with periodic trends of compliance across the organization. Dashboards were also provided for monitoring network events such as SSH connections, deny/drop actions on specific hosts and ports, and failed login attempts. This helped the Infrastructure BU focus on areas of non-compliance and plug those loopholes.


  • Installation, Configuration, Tuning and Administration of a multi-node Hadoop Cluster.
  • Sqoop jobs for ingesting IT controls and compliance data from multiple SQL Server DB sources into HDFS and Hive tables, with the Hive tables partitioned on fields such as location and month-year for faster querying.
  • Kafka Cluster for real-time aggregation and streaming of network logs and Spark Streaming-Kafka integration jobs for processing of network logs.
  • Spark jobs for data processing and combining for final results.
  • Active Reports and Dashboards in Cognos Analytics for compliance and network monitoring and reporting.
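The Kafka/Spark Streaming bullet above hinges on classifying raw network-log lines into the event types shown on the dashboards (SSH connections, deny/drop, failed logins). A minimal sketch of that per-line classification follows; the log formats and event names are assumptions for illustration, since the project's actual log schema is not given.

```scala
// Hedged sketch of the per-line classification a Spark Streaming job might
// apply to Kafka-delivered network log lines. Log formats and event names
// are illustrative assumptions.
object LogClassifier {
  sealed trait NetworkEvent
  case object FailedLogin   extends NetworkEvent
  case object DenyDrop      extends NetworkEvent
  case object SshConnection extends NetworkEvent
  case object Other         extends NetworkEvent

  def classify(line: String): NetworkEvent = {
    val l = line.toLowerCase
    // Check failed logins before SSH: a failed-password line also mentions sshd.
    if (l.contains("failed password") || l.contains("authentication failure")) FailedLogin
    else if (l.contains("deny") || l.contains("drop")) DenyDrop
    else if (l.contains("sshd") || l.contains("ssh")) SshConnection
    else Other
  }
}
```

In the streaming job, a function like this would sit inside a `map` over each micro-batch, with counts per event type aggregated into Hive for the Cognos dashboards.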


  • Single data lake for all compliance data analytics and reporting.
  • Single dashboard with drill-through, drill-down and reporting options for gaining deeper insights on machines, controls and compliance trends.
  • Faster processing of huge volumes of data through parallel processing in Spark jobs.