Table of Contents

Preface

Section 1: Overview of What Arrow Is, its Capabilities, Benefits, and Goals

Chapter 1: Getting Started with Apache Arrow

Technical requirements

Understanding the Arrow format and specifications

Why does Arrow use a columnar in-memory format?

Learning the terminology and physical memory layout

Quick summary of physical layouts, or TL;DR

How to speak Arrow

Arrow format versioning and stability

Would you download a library? Of course!

Setting up your shooting range

Using pyarrow for Python

C++ for the 1337 coders

Go Arrow go!

Summary

References

Chapter 2: Working with Key Arrow Specifications

Technical requirements

Playing with data, wherever it might be!

Working with Arrow tables

Accessing data files with pyarrow

Accessing data files with Arrow in C++

pandas firing Arrow

Putting pandas in your quiver

Making pandas run fast

Keeping pandas from running wild

Sharing is caring… especially when it's your memory

Diving into memory management

Managing buffers for performance

Crossing the boundaries

Summary

Chapter 3: Data Science with Apache Arrow

Technical requirements

ODBC takes an Arrow to the knee

Lost in translation

SPARKing new ideas on Jupyter

Understanding the integration

Everyone gets a containerized development environment!

SPARKing joy with Arrow and PySpark

Interactive charting powered by Arrow

Stretching workflows onto Elasticsearch

Indexing the data

Summary

Section 2: Interoperability with Arrow: pandas, Parquet, Flight, and Datasets

Chapter 4: Format and Memory Handling

Technical requirements

Storage versus runtime in-memory versus message-passing formats

Long-term storage formats

In-memory runtime formats

Message-passing formats

Summing up

Passing your Arrows around

What is this sorcery?!

Producing and consuming Arrows

Learning about memory cartography

The base case

Parquet versus CSV

Mapping data into memory

Too long; didn't read (TL;DR) – Computers are magic

Summary

Chapter 5: Crossing the Language Barrier with the Arrow C Data API

Technical requirements

Using the Arrow C data interface

The ArrowSchema structure

The ArrowArray structure

Example use cases

Using the C Data API to export Arrow-formatted data

Importing Arrow data with Python

Exporting Arrow data with the C Data API from Python to Go

Streaming across the C Data API

Streaming record batches from Python to Go

Other use cases

Some exercises

Summary

Chapter 6: Leveraging the Arrow Compute APIs

Technical requirements

Letting Arrow do the work for you

Input shaping

Value casting

Types of functions

Executing compute functions

Using the C++ compute library

Using the compute library in Python

Picking the right tools

Adding a constant value to an array

Summary

Chapter 7: Using the Arrow Datasets API

Technical requirements

Querying multifile datasets

Creating a sample dataset

Discovering dataset fragments

Filtering data programmatically

Expressing yourself – a quick detour

Using expressions for filtering data

Deriving and renaming columns (projecting)

Using the Datasets API in Python

Creating our sample dataset

Discovering the dataset

Using different file formats

Filtering and projecting columns with Python

Streaming results

Working with partitioned datasets

Summary

Chapter 8: Exploring Apache Arrow Flight RPC

Technical requirements

The basics and complications of gRPC

Building modern APIs for data

Efficiency and streaming are important

Arrow Flight's building blocks

Horizontal scalability with Arrow Flight

Adding your business logic to Flight

Other bells and whistles

Understanding the Flight Protocol Buffer definitions

Using Flight, choose your language!

Building a Python Flight server

Building a Go Flight server

What is Flight SQL?

Setting up a performance test

Running the performance test

Flight SQL, the new kid on the block

Summary

Section 3: Real-World Examples, Use Cases, and Future Development

Chapter 9: Powered by Apache Arrow

Swimming in data with Dremio Sonar

Clarifying Dremio Sonar's architecture

The library of the Gods…of data analysis

Spicing up your ML workflows

Bringing the AI engine to where the data lives

Arrow in the browser using JavaScript

Gaining a little perspective

Taking flight with Falcon

Summary

Chapter 10: How to Leave Your Mark on Arrow

Technical requirements

Contributing to open source projects

Communication is key

You don't necessarily have to contribute code

There are a lot of reasons why you should contribute!

Preparing your first pull request

Navigating JIRA

Setting up Git

Orienting yourself in the code base

Building the Arrow libraries

Creating the PR

Understanding the CI configuration

Development using Archery

Find your interest and expand on it

Getting that sweet, sweet approval

Finishing up with style!

C++ styling

Python code styling

Go code styling

Summary

Chapter 11: Future Development and Plans

Examining Flight SQL (redux)

Why Flight SQL?

Defining the Flight SQL protocol

Firing a Ballista using Data(Fusion)

What about Spark?

Looking at Ballista's development roadmap

Building a cross-language compute serialization

Why Substrait?

Working with Substrait serialization

Getting involved with Substrait development

Final words

Why subscribe?

Other Books You May Enjoy

Packt is searching for authors like you

Share Your Thoughts
