Apache Releases Spark 1.6

A new, faster version of Apache’s Spark open source data processing engine has been released. Apache Spark 1.6 adds a new Dataset API and improves the data science features. The performance gains start with changes to the scanning of Parquet data, one of the most commonly used data formats with Spark. Previously, Spark’s Parquet reader used parquet-mr to read and decode Parquet files. The developers profiled many Spark applications and found that many cycles tend to be spent in “record assembly”, a process that reconstructs records from Parquet columns. The new Parquet reader bypasses parquet-mr’s record assembly and uses a more optimized code path for flat schemas. According…
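
To make "record assembly" concrete, here is a toy Python sketch of what it means to rebuild records from columns in a flat schema, and why skipping that step can save work. It is only an illustration under simplifying assumptions, not Spark's or parquet-mr's actual code; the function names `assemble_records` and `sum_column` and the sample data are hypothetical.

```python
from typing import Any, Dict, List


def assemble_records(columns: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
    """Rebuild row records from a flat, column-oriented layout (toy model)."""
    names = list(columns)
    lengths = {len(values) for values in columns.values()}
    if len(lengths) > 1:
        raise ValueError("all columns must hold the same number of values")
    # One dict is built per row: this per-record work is the "assembly" cost.
    return [dict(zip(names, row)) for row in zip(*(columns[n] for n in names))]


def sum_column(columns: Dict[str, List[Any]], name: str) -> Any:
    """Operate on a single column directly, without assembling records."""
    return sum(columns[name])


# Hypothetical sample data stored column by column.
columns = {"id": [1, 2, 3], "score": [10, 20, 30]}

records = assemble_records(columns)
total_via_records = sum(r["score"] for r in records)
total_via_column = sum_column(columns, "score")
```

Both totals are the same, but the second path never builds the intermediate per-row objects. That is the general intuition behind reading flat schemas without a full record assembly step.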

