Joshua Cook

Docker for Data Science

Building Scalable and Extensible Data Infrastructure Around the Jupyter Notebook Server

Joshua Cook

Santa Monica, California, USA

Any source code or other supplementary material referenced by the author in this book is available to readers on GitHub via the book’s product page, located at www.apress.com/978-1-4842-3011-4. For more detailed information, please visit www.apress.com/source-code.

ISBN 978-1-4842-3011-4

e-ISBN 978-1-4842-3012-1

https://doi.org/10.1007/978-1-4842-3012-1

Library of Congress Control Number: 2017952396

© Joshua Cook 2017

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

Trademarked names, logos, and images may appear in this book. Rather than use a trademark symbol with every occurrence of a trademarked name, logo, or image we use the names, logos, and images only in an editorial fashion and to the benefit of the trademark owner, with no intention of infringement of the trademark. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.

Printed on acid-free paper

Distributed to the book trade worldwide by Springer Science+Business Media New York, 233 Spring Street, 6th Floor, New York, NY 10013. Phone 1-800-SPRINGER, fax (201) 348-4505, e-mail [email protected], or visit www.springeronline.com. Apress Media, LLC is a California LLC and the sole member (owner) is Springer Science + Business Media Finance Inc (SSBM Finance Inc). SSBM Finance Inc is a Delaware corporation.

To my wife, Aylin.

Introduction

This text is designed to teach the concepts and techniques of Docker and its ecosystem as applied to the field of data science. Besides introducing the core Docker technologies (the container and image, the engine, the Dockerfile), this book contains a discussion on building larger integrated systems using the Jupyter Notebook Server and open source data stores MongoDB, PostgreSQL, and Redis.

The first chapter walks the reader through a recommended hardware configuration for working through the text using an AWS t2.micro. Chapters 2 and 3 introduce the core technologies used in the book, Docker and Jupyter, as well as the idea of interactive programming. Chapters 4, 5, 6, and 9 dig deeper into specific areas of the Docker ecosystem. Chapter 7 explores the official Jupyter Docker images developed and maintained by the Jupyter development team. Chapter 8 introduces the Docker images for three open source data stores. Chapters 9 and 10 tie everything together, connecting Jupyter to data stores using Docker Compose. After completing the book, readers are encouraged to reread Chapter 3 and Chapter 10 to begin developing their own interactive software development style.

The concepts presented herein can be challenging, especially in terms of the abstraction of computer resources and processes. That said, no prerequisite knowledge is assumed, and an attempt has been made to build the discussion from first principles. Even so, the reader should be comfortable working at the command line and have an adventurous and inquisitive spirit. We hope that readers with an intermediate to advanced understanding of Docker, Jupyter, or both will gain a deeper understanding of the concepts and learn novel approaches to solving computational problems using these tools.

Acknowledgments

Thanks to Mike Frantz, Gilad Gressel, Devon Muraoka, Bharat Ramanathan, Nash Taylor, Matt Zhou, and DSI Santa Monica Cohorts 3 and 4 for talking through some of the more abstract concepts herein with me. Thanks to Chad Arnett for keeping it weird. Thanks to Jim Kruidenier and Jussi Eloranta for teaching an old dog new tricks. Thanks to my father for continually inspiring me and my mother for giving me my infallible belief in goodness. Thanks to Momlo for paving the way and Dablo for his curiosity. Thanks to my wife, Aylin, for her belief in me and tolerance for the word “eigenvector.”

Contents

  1. Chapter 1: Introduction
    1. “Big Data”
    2. Recommended Practice for Learning
      1. Set Up a New AWS Account
      2. Configure a Key Pair
    3. Infrastructure Limitations on Data
      1. Pull the jupyter/scipy-notebook Image
      2. Run the jupyter/scipy-notebook Image
      3. Monitor Memory Usage
      4. What Size Data Set Will Cause a Memory Exception?
      5. What Size Dataset Is Too Large to Be Used to Fit Different Kinds of Simple Models?
    4. Summary
  2. Chapter 2: Docker
    1. Docker Is Not a Virtual Machine
    2. Containerization
    3. A Containerized Application
    4. The Docker Container Ecosystem
      1. The Docker Client
      2. The Host
      3. The Docker Engine
      4. The Docker Image and the Docker Container
      5. The Docker Registry
    5. Get Docker
      1. Docker for Linux
      2. Docker for Mac
      3. Docker for Windows
      4. Docker Toolbox
    6. Hello, Docker!
      1. Basic Networking in Docker
    7. Summary
  3. Chapter 3: Interactive Programming
    1. Jupyter as Persistent Interactive Computing
      1. How Not to Program Interactively
      2. Setting Up a Minimal Computational Project
      3. Writing the Source Code for the Evaluation of a Bessel Function
      4. Performing Your Calculation Using Docker
      5. Compile Your Source Code
      6. Execute Compiled Binary
    2. How to Program Interactively
      1. Launch IPython Using Docker
      2. Persistence
      3. Jupyter Notebooks
      4. Port Connections
      5. Data Persistence in Docker
      6. Attach a Volume
    3. Summary
  4. Chapter 4: The Docker Engine
    1. Examining the Docker Workstation
    2. Hello, World in a Container
    3. Run Echo as a Service
      1. Isolating the Bootstrap Time
    4. A Daemonized Hello World
    5. Summary
  5. Chapter 5: The Dockerfile
    1. Best Practices
      1. Stateless Containers
      2. Single-Concern Containers
    2. Project: A Repo of Docker Images
      1. Prepare for Local Development
      2. Configure GitHub
      3. Building Images Using Dockerfiles
      4. Dockerfile Syntax
      5. Designing the gsl Image
      6. The Docker Build Cache
      7. Anaconda
      8. Design the miniconda3 Image
      9. tini
      10. ENTRYPOINT
      11. Design the ipython Image
      12. Run the ipython Image as a New Container
    3. Summary
  6. Chapter 6: Docker Hub
    1. Docker Hub
      1. Alternatives to Docker Hub
    2. Docker ID and Namespaces
    3. Image Repositories
    4. Search for Existing Repositories
    5. Tagged Images
      1. Tags on the Python Image
      2. Official Repositories
    6. Pushing to Docker Hub
      1. Create a New Repository
      2. Push an Image
      3. Pull the Image from Docker Hub
      4. Tagged Image on Docker Hub
    7. Summary
  7. Chapter 7: The Opinionated Jupyter Stacks
    1. High-Level Overview
      1. jupyter/base-notebook
      2. Notebook Security
      3. The Default Environment
      4. Managing Python Versions
      5. Extending the Jupyter Image Using conda Environments
    2. Using jovyan to Install Libraries
    3. Ephemeral Container Extension
      1. Maintaining Semi-Persistent Changes to Images
    4. Summary
  8. Chapter 8: The Data Stores
    1. Serialization
      1. Serialization Formats and Methods
      2. Binary Encoding in Python
    2. Redis
      1. Pull the redis Image
    3. Docker Data Volumes and Persistence
      1. Create and View a New Data Volume
      2. Launch Redis as a Persistent Service
      3. Connecting Containers via Legacy Links
      4. Using Redis with Jupyter
      5. A Simple Redis Example
      6. Track an Iterative Process Across Notebooks
      7. Pass a Dictionary via a JSON Dump
      8. Pass a Numpy Array as a Bytestring
    4. MongoDB
      1. Set Up a New AWS t2.micro
      2. Configure the New AWS t2.micro for Docker
      3. Pull the mongo Image
      4. Create and View a New Data Volume
      5. Launch MongoDB as a Persistent Service
      6. Verify MongoDB Installation
      7. Using MongoDB with Jupyter
      8. MongoDB Structure
      9. pymongo
      10. Mongo and Twitter
      11. Obtain Twitter Credentials
      12. Collect Tweets by Geolocation
      13. Insert Tweets Into Mongo
    5. PostgreSQL
      1. Pull the postgres Image
      2. Create New Data Volume
      3. Launch PostgreSQL as a Persistent Service
      4. Verify PostgreSQL Installation
      5. Docker Container Networking
      6. Minimally Verify the Jupyter-PostgreSQL Connection
      7. Connecting Containers by Name
      8. Using PostgreSQL with Jupyter
      9. Jupyter, PostgreSQL, Pandas, and psycopg2
      10. Minimal Verification
      11. Loading Data into PostgreSQL
      12. PostgreSQL Binary Type and Numpy
    6. Summary
  9. Chapter 9: Docker Compose
    1. Install docker-compose
    2. What Is docker-compose?
      1. Docker Compose Versions
    3. Build a Simple Docker Compose Application
      1. Run Your Application with Compose
    4. Jupyter and Mongo with Persistence
      1. Specifying the Build Context
      2. Specify the Environment File
      3. Data Persistence
      4. Build Your Application with Compose
    5. Scaling an AWS Application via Instance Type
    6. Restart Docker Compose Application
    7. Complete the Computation
      1. Encode Tweets as Document Vectors
    8. Switch AWS Instance Type to t2.micro
      1. Retrieve Tweets from MongoDB and Compare
    9. Docker Compose Networks
    10. Jupyter and Postgres with Persistence
      1. Specifying the Build Context
      2. Build and Run Your Application with Compose
    11. Summary
  10. Chapter 10: Interactive Software Development
    1. A Quick Guide to Organizing Computational Biology Projects
    2. A Project Framework for Interactive Development
    3. Project Root Design Pattern
    4. Initialize Project
    5. Examine Database Requirements
      1. Managing the Project via Git
    6. Adding a Database to Your Application
    7. Interactive Development
      1. Create a Python Module Using Jupyter
    8. Add Delayed Processing to Your Application
    9. Extending the Postgres Module
      1. Updating Your Python Module
    10. Summary
  11. Index

About the Author and About the Technical Reviewer

About the Author

Joshua Cook is a mathematician. He writes code in Bash, C, and Python and has done pure and applied computational work in geospatial predictive modeling, quantum mechanics, semantic search, and artificial intelligence. He also has 10 years of experience teaching mathematics at the secondary and post-secondary level. His research interests lie in high-performance computing, interactive computing, feature extraction, and reinforcement learning. He is always willing to discuss orthogonality or to explain why Fortran is the language of the future over a warm or cold beverage.

About the Technical Reviewer

Jeeva S. Chelladhurai has been working as a DevOps specialist at the IBM GTS Labs for the last 9 years. He is the co-author of Learning Docker, published by PacktPub, UK. He has more than 20 years of IT industry experience. He has technically managed and mentored diverse teams across the globe in envisaging and building pioneering telecommunication products. He specializes in DevOps, automation, and cloud solution delivery, with a focus on data center optimization, software-defined environments (SDEs), and distributed application development, deployment, and delivery using the newest Docker technology. Jeeva is also a strong proponent of agile methodologies, DevOps, and IT automation. He holds a master’s degree in computer science from Manonmaniam Sundaranar University and a graduation certificate in project management from Boston University, Boston, Massachusetts, USA. Besides his official responsibilities, he writes book chapters and authors research papers. He has been instrumental in crafting reusable technical assets for IBM solution architects and consultants. He speaks in technical forums on DevOps technologies and tools. He hosts one of the largest open source communities in Bangalore (www.meetup.com/opensourceblr/). His LinkedIn profile can be found at www.linkedin.com/in/JeevaChelladhurai.