Docker for Data Science

Building Scalable and Extensible Data Infrastructure Around the Jupyter Notebook Server

Joshua Cook

Santa Monica, California, USA

Any source code or other supplementary material referenced by the author in this book is available to readers on GitHub via the book’s product page, located at www.apress.com/978-1-4842-3011-4. For more detailed information, please visit www.apress.com/source-code.

ISBN 978-1-4842-3011-4

e-ISBN 978-1-4842-3012-1

https://doi.org/10.1007/978-1-4842-3012-1

Library of Congress Control Number: 2017952396

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

Trademarked names, logos, and images may appear in this book. Rather than use a trademark symbol with every occurrence of a trademarked name, logo, or image we use the names, logos, and images only in an editorial fashion and to the benefit of the trademark owner, with no intention of infringement of the trademark. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.

Printed on acid-free paper

Distributed to the book trade worldwide by Springer Science+Business Media New York, 233 Spring Street, 6th Floor, New York, NY 10013. Phone 1-800-SPRINGER, fax (201) 348-4505, e-mail [email protected], or visit www.springeronline.com. Apress Media, LLC is a California LLC and the sole member (owner) is Springer Science + Business Media Finance Inc (SSBM Finance Inc). SSBM Finance Inc is a Delaware corporation.

To my wife, Aylin.

Introduction

This text is designed to teach the concepts and techniques of Docker and its ecosystem as applied to the field of data science. Besides introducing the core Docker technologies (the container and image, the engine, the Dockerfile), this book contains a discussion on building larger integrated systems using the Jupyter Notebook Server and open source data stores MongoDB, PostgreSQL, and Redis.

The first chapter walks the reader through a recommended hardware configuration for working through the text using an AWS t2.micro. Chapters 2 and 3 introduce the core technologies used in the book, Docker and Jupyter, as well as the idea of interactive programming. Chapters 4, 5, 6, and 9 dig deeper into specific areas of the Docker ecosystem. Chapter 7 explores the official Jupyter Docker images developed and maintained by the Jupyter development team. Chapter 8 introduces the Docker images for three open source data stores. Chapters 9 and 10 tie everything together, connecting Jupyter to data stores using Docker Compose. After having completed the book, readers are encouraged to reread Chapter 3 and Chapter 10 to begin to develop their own interactive software development style.

The concepts presented herein can be challenging, especially in terms of the abstraction of computer resources and processes. That said, no prior knowledge of Docker or Jupyter is assumed, and an attempt has been made to build the discussion from base principles. Even so, the reader should be comfortable working at the command line and have an adventurous and inquisitive spirit. We hope that readers with an intermediate to advanced understanding of Docker, Jupyter, or both will gain a deeper understanding of the concepts and learn novel approaches to solving computational problems using these tools.

Acknowledgments

Thanks to Mike Frantz, Gilad Gressel, Devon Muraoka, Bharat Ramanathan, Nash Taylor, Matt Zhou, and DSI Santa Monica Cohorts 3 and 4 for talking through some of the more abstract concepts herein with me. Thanks to Chad Arnett for keeping it weird. Thanks to Jim Kruidenier and Jussi Eloranta for teaching an old dog new tricks. Thanks to my father for continually inspiring me and my mother for giving me my infallible belief in goodness. Thanks to Momlo for paving the way and Dablo for his curiosity. Thanks to my wife, Aylin, for her belief in me and tolerance for the word “eigenvector.”

Contents at a Glance

About the Author
About the Technical Reviewer
Acknowledgments
Introduction
Chapter 1: Introduction
Chapter 2: Docker
Chapter 3: Interactive Programming
Chapter 4: The Docker Engine
Chapter 5: The Dockerfile
Chapter 6: Docker Hub
Chapter 7: The Opinionated Jupyter Stacks
Chapter 8: The Data Stores
Chapter 9: Docker Compose
Chapter 10: Interactive Software Development
Index

About the Author and About the Technical Reviewer

About the Author

Joshua Cook is a mathematician. He writes code in Bash, C, and Python and has done pure and applied computational work in geo-spatial predictive modeling, quantum mechanics, semantic search, and artificial intelligence. He also has 10 years of experience teaching mathematics at the secondary and post-secondary level. His research interests lie in high-performance computing, interactive computing, feature extraction, and reinforcement learning. He is always willing to discuss orthogonality or to explain why Fortran is the language of the future over a warm or cold beverage.

About the Technical Reviewer

Jeeva S. Chelladhurai has been working as a DevOps specialist at the IBM GTS Labs for the last 9 years. He is the co-author of Learning Docker, published by PacktPub, UK. He has more than 20 years of IT industry experience. He has technically managed and mentored diverse teams across the globe in envisaging and building pioneering telecommunication products. He specializes in DevOps, automation, and cloud solution delivery, with a focus on data center optimization, software-defined environments (SDEs), and distributed application development, deployment, and delivery using the newest Docker technology. Jeeva is also a strong proponent of agile methodologies, DevOps, and IT automation. He holds a master’s degree in computer science from Manonmaniam Sundaranar University and a graduation certificate in project management from Boston University, Boston, Massachusetts, USA. Besides his official responsibilities, he writes book chapters and authors research papers. He has been instrumental in crafting reusable technical assets for IBM solution architects and consultants. He speaks in technical forums on DevOps technologies and tools. He hosts one of the largest open source communities in Bangalore (www.meetup.com/opensourceblr/). His LinkedIn profile can be found at www.linkedin.com/in/JeevaChelladhurai.
