Data Analysis with Python and PySpark helps you solve the daily challenges of data science with PySpark. You’ll learn how to scale your processing capabilities across multiple machines while ingesting data from any source, whether that’s Hadoop clusters, cloud data storage, or local data files. Once you’ve covered the fundamentals, you’ll explore the full versatility of PySpark by building machine learning pipelines and blending Python, pandas, and PySpark code.

Table of Contents

  1. inside front cover
  2. Data Analysis with Python and PySpark
  3. Copyright
  4. contents
  5. front matter
  6. 1 Introduction
  7. Part 1. Get acquainted: First steps in PySpark
  8. 2 Your first data program in PySpark
  9. 3 Submitting and scaling your first PySpark program
  10. 4 Analyzing tabular data with pyspark.sql
  11. 5 Data frame gymnastics: Joining and grouping
  12. Part 2. Get proficient: Translate your ideas into code
  13. 6 Multidimensional data frames: Using PySpark with JSON data
  14. 7 Bilingual PySpark: Blending Python and SQL code
  15. 8 Extending PySpark with Python: RDD and UDFs
  16. 9 Big data is just a lot of small data: Using pandas UDFs
  17. 10 Your data under a different lens: Window functions
  18. 11 Faster PySpark: Understanding Spark’s query planning
  19. Part 3. Get confident: Using machine learning with PySpark
  20. 12 Setting the stage: Preparing features for machine learning
  21. 13 Robust machine learning with ML Pipelines
  22. 14 Building custom ML transformers and estimators
  23. Appendix A. Solutions to the exercises
  24. Appendix B. Installing PySpark
  25. Appendix C. Some useful Python concepts
  26. index
  27. inside back cover