Pivoting Data in SparkSQL

One of the core values at Silicon Valley Data Science (SVDS) is contributing back to the community, and one way we do that is through open source contributions. One of the many new features in Spark 1.6.0 is the ability to pivot data in data frames. This was a feature requested by one of my colleagues that I decided to work on. Pivot tables are an essential part of data analysis and reporting. A pivot can be thought of as translating rows into columns while applying one or more aggregations. Many popular data manipulation tools (pandas, reshape2, and Excel) and databases (MS SQL and Oracle 11g) include the ability to pivot data. Below you’ll find some examples of how to use pivot in PySpark, using this dataset: https://vincentarelbundock.github.io/Rdatasets/csv/ggplot2/mpg.csv $ bin/pyspark…


Link to Full Article: Pivoting Data in SparkSQL