Overview of Data Science using Spark on Azure HDInsight

Introduction

Spark on Azure HDInsight manipulates data in memory while processing it in a distributed way on an HDInsight (Hadoop) cluster, so the cluster combines speed with capacity. It also supports Jupyter notebooks that run on the Spark cluster and can issue interactive Spark SQL queries for transforming, filtering, and visualizing data stored in Azure Blobs (WASB). Spark is an open-source parallel processing framework that supports in-memory processing to boost the performance of big-data analytic applications. The Spark processing engine is built for speed, ease of use, and sophisticated analytics. Spark’s in-memory computation capabilities make it a good choice for iterative algorithms in machine learning and graph computations. MLlib is Spark’s scalable machine learning library. HDInsight Spark is the Azure-hosted offering of open-source Spark. The setup steps…
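To make the idea of querying blob-stored data with Spark SQL a little more concrete, here is a minimal Python sketch. It is not taken from the full article: the helper names `wasb_uri` and `top_rows`, the view name `records`, and the column `value` are all hypothetical, and the second function assumes PySpark is available and that the cluster is already configured to reach the storage account.

```python
def wasb_uri(container: str, account: str, path: str, secure: bool = False) -> str:
    """Build a WASB URI for a blob in an Azure Storage account.

    The container, account, and path values are placeholders supplied by the caller.
    """
    scheme = "wasbs" if secure else "wasb"
    return f"{scheme}://{container}@{account}.blob.core.windows.net/{path.lstrip('/')}"


def top_rows(uri: str, min_value: float, limit: int = 10):
    """Filter a CSV file in blob storage with Spark SQL.

    Assumes the file has a header row and a numeric column named `value`
    (a hypothetical column chosen only for illustration).
    """
    # Imported lazily so that wasb_uri() can be used without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("wasb-sketch").getOrCreate()

    # Read the CSV from blob storage and expose it to Spark SQL as a temporary view.
    df = spark.read.csv(uri, header=True, inferSchema=True)
    df.createOrReplaceTempView("records")

    # The numeric arguments are coerced before being placed in the query text.
    return spark.sql(
        "SELECT * FROM records "
        f"WHERE value > {float(min_value)} "
        f"ORDER BY value DESC LIMIT {int(limit)}"
    )
```

In a Jupyter notebook on the cluster, one might call `top_rows(wasb_uri("mycontainer", "myaccount", "data/sample.csv"), 100.0).show()`, where the container, account, and file names are again placeholders.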

