Preface

1. Introduction to High Performance Spark
    What Is Spark and Why Performance Matters
    What You Can Expect to Get from This Book
    Spark Versions
    Why Scala
    To Be a Spark Expert You Have to Learn a Little Scala Anyway
    The Spark Scala API Is Easier to Use Than the Java API
    Scala Is More Performant Than Python
    Why Not Scala
    Learning Scala
    Conclusion

2. How Spark Works
    How Spark Fits into the Big Data Ecosystem
    Spark Components
    Spark Model of Parallel Computing: RDDs
    Lazy Evaluation
    In-Memory Persistence and Memory Management
    Immutability and the RDD Interface
    Types of RDDs
    Functions on RDDs: Transformations Versus Actions
    Wide Versus Narrow Dependencies
    Spark Job Scheduling
    Resource Allocation Across Applications
    The Spark Application
    The Anatomy of a Spark Job
    The DAG
    Jobs
    Stages
    Tasks
    Conclusion

3. DataFrames, Datasets, and Spark SQL
    Getting Started with the SparkSession (or HiveContext or SQLContext)
    Spark SQL Dependencies
    Managing Spark Dependencies
    Avoiding Hive JARs
    Basics of Schemas
    DataFrame API
    Transformations
    Multi-DataFrame Transformations
    Plain Old SQL Queries and Interacting with Hive Data
    Data Representation in DataFrames and Datasets
    Tungsten
    Data Loading and Saving Functions
    DataFrameWriter and DataFrameReader
    Formats
    Save Modes
    Partitions (Discovery and Writing)
    Datasets
    Interoperability with RDDs, DataFrames, and Local Collections
    Compile-Time Strong Typing
    Easier Functional (RDD 'like') Transformations
    Relational Transformations
    Multi-Dataset Relational Transformations
    Grouped Operations on Datasets
    Extending with User-Defined Functions and Aggregate Functions (UDFs, UDAFs)
    Query Optimizer
    Logical and Physical Plans
    Code Generation
    Large Query Plans and Iterative Algorithms
    Debugging Spark SQL Queries
    JDBC/ODBC Server
    Conclusion

4. Joins (SQL and Core)
    Core Spark Joins
    Choosing a Join Type
    Choosing an Execution Plan
    Spark SQL Joins
    DataFrame Joins
    Dataset Joins
    Conclusion

5. Effective Transformations
    Narrow Versus Wide Transformations
    Implications for Performance
    Implications for Fault Tolerance
    The Special Case of coalesce
    What Type of RDD Does Your Transformation Return
    Minimizing Object Creation
    Reusing Existing Objects
    Using Smaller Data Structures
    Iterator-to-Iterator Transformations with mapPartitions
    What Is an Iterator-to-Iterator Transformation
    Space and Time Advantages
    An Example
    Set Operations
    Reducing Setup Overhead
    Shared Variables
    Broadcast Variables
    Accumulators
    Reusing RDDs
    Cases for Reuse
    Deciding if Recompute Is Inexpensive Enough
    Types of Reuse: Cache, Persist, Checkpoint, Shuffle Files
    Alluxio (nee Tachyon)
    LRU Caching
    Noisy Cluster Considerations
    Interaction with Accumulators
    Conclusion

6. Working with Key/Value Data
    The Goldilocks Example
    Goldilocks Version 0: Iterative Solution
    How to Use PairRDDFunctions and OrderedRDDFunctions
    Actions on Key/Value Pairs
    What's So Dangerous About the groupByKey Function
    Goldilocks Version 1: groupByKey Solution
    Choosing an Aggregation Operation
    Dictionary of Aggregation Operations with Performance Considerations
    Multiple RDD Operations
    Co-Grouping
    Partitioners and Key/Value Data
    Using the Spark Partitioner Object
    Hash Partitioning
    Range Partitioning
    Custom Partitioning
    Preserving Partitioning Information Across Transformations
    Leveraging Co-Located and Co-Partitioned RDDs
    Dictionary of Mapping and Partitioning Functions PairRDDFunctions
    Dictionary of OrderedRDDOperations
    Sorting by Two Keys with SortByKey
    Secondary Sort and repartitionAndSortWithinPartitions
    Leveraging repartitionAndSortWithinPartitions for a Group by Key and Sort Values Function
    How Not to Sort by Two Orderings
    Goldilocks Version 2: Secondary Sort
    A Different Approach to Goldilocks
    Goldilocks Version 3: Sort on Cell Values
    Straggler Detection and Unbalanced Data
    Back to Goldilocks (Again)
    Goldilocks Version 4: Reduce to Distinct on Each Partition
    Conclusion

7. Going Beyond Scala
    Beyond Scala within the JVM
    Beyond Scala, and Beyond the JVM
    How PySpark Works
    How SparkR Works
    Spark.jl (Julia Spark)
    How Eclair JS Works
    Spark on the Common Language Runtime (CLR): C# and Friends
    Calling Other Languages from Spark
    Using Pipe and Friends
    JNI
    Java Native Access (JNA)
    Underneath Everything Is FORTRAN
    Getting to the GPU
    The Future
    Conclusion

8. Testing and Validation
    Unit Testing
    General Spark Unit Testing
    Mocking RDDs
    Getting Test Data
    Generating Large Datasets
    Sampling
    Property Checking with ScalaCheck
    Computing RDD Difference
    Integration Testing
    Choosing Your Integration Testing Environment
    Verifying Performance
    Spark Counters for Verifying Performance
    Projects for Verifying Performance
    Job Validation
    Conclusion

9. Spark MLlib and ML
    Choosing Between Spark MLlib and Spark ML
    Working with MLlib
    Getting Started with MLlib (Organization and Imports)
    MLlib Feature Encoding and Data Preparation
    Feature Scaling and Selection
    MLlib Model Training
    Predicting
    Serving and Persistence
    Model Evaluation
    Working with Spark ML
    Spark ML Organization and Imports
    Pipeline Stages
    Explain Params
    Data Encoding
    Data Cleaning
    Spark ML Models
    Putting It All Together in a Pipeline
    Training a Pipeline
    Accessing Individual Stages
    Data Persistence and Spark ML
    Extending Spark ML Pipelines with Your Own Algorithms
    Model and Pipeline Persistence and Serving with Spark ML
    General Serving Considerations
    Conclusion

10. Spark Components and Packages
    Stream Processing with Spark
    Sources and Sinks
    Batch Intervals
    Data Checkpoint Intervals
    Considerations for DStreams
    Considerations for Structured Streaming
    High Availability Mode (or Handling Driver Failure or Checkpointing)
    GraphX
    Using Community Packages and Libraries
    Creating a Spark Package
    Conclusion

A. Tuning, Debugging, and Other Things Developers Like to Pretend Don't Exist

Index