front matter

foreword

At the dawn of a new decade, developers are confronted with a myriad of database options when beginning a new project. The stalwart relational database still rules the roost, maintaining popularity in both legacy and greenfield projects. This is for good reason; flexibility and forty plus years of cumulative engineering history are hard to argue with. Despite the success of relational databases, the last decade saw an explosion of new commercial and open-source database systems that were designed around alternative models and query languages. Some tackle traditional RDBMS workloads with a new twist, perhaps focusing horizontal scale out or high performance via the embrace of in-memory optimization that have become available due to decreases in RAM prices. Many other systems diverged from the relational model altogether. Out of this set, we find a variety of focus areas and modeling paradigms. This book focuses on one of the more expressive and powerful developments, the graph model, and the property graph in particular.

Graph databases aren’t a new thing. Hierarchical and navigational databases have existed since the 60s, but these have recently experienced an increase in developer popularity. I think this is largely due to the intuitiveness of the property graph data model. People are already wired to think in graphs. If you draw a graph on a whiteboard, technical and non-technical folks get it. Consequently, after you overlay the graph model onto your software tasks at hand, everything starts to look like a graph problem.

With all that said, we’re still dealing with technology, and the available property graph databases are the newer technology at that, so there isn’t any magic. This is where Dave and Josh come in. I can’t imagine a better pair to help lay out the signposts and guide you on the journey to graph understanding. Both are accomplished graph architects and developers that have been involved in this junior space since before its recent uptick in popularity. Having worked in graph-based product development and consulting, they’ve racked up years of real-world experience.

This experience has influenced their pragmatic approach to the problems of graph application development, and though both proponents of graphs, they’re proponents with a healthy dose of skepticism and are not overly fascinated with the technology. After all, as mentioned, one of the first and most important questions new developers have is, “Is this a graph problem?” As you make your way through this book, you’ll hone an intuition for translating real world problems into graph data models and build up your Gremlin query chops, a popular and powerful property graph query language. The rubber meets the road in chapter 6 where you use this knowledge to build your first graph application. By the time you’ve finished, you’ll have the knowledge to evaluate if a graph database is a good fit for your next project, and if so, to execute on that vision having already built an example graph database application.

Ted Wilmes

Data Architect & JanusGraph Technical Steering Committee Member

Expero Inc.

preface

Two complementary trends started in the mid to late 2000s. First, companies began using and collecting more data on their customers, competition, and users than ever before. Second, the information companies wanted from this data became more complex, often containing hidden connections. These two trends drove the need for an easier exploration of expansive, yet highly connected data. Graph databases met that need.

Both the authors have gotten an up-close and personal view of this market as the technology, usage, and adoption of graph technology has matured. We both started using graph databases in the mid 2010s while working for a niche software consulting company. Independently, we each worked on projects that used graph databases to solve specific types of complex data problems. At that time, graph databases were new and very rough. Despite the challenges of working with new technologies, we both recognized the power of this tool and were hooked.

Since then, we have spent countless hours banging our heads against a proverbial wall to understand all the intricacies and nuances of building graph-backed applications. This book is the distillation of those countless hours of struggle. It is our hope that the hands-on nature of this book will provide a solid, foundational understanding of the skills needed to build graph-backed applications and, in the process, help you to avoid some of the pitfalls that we encountered.

acknowledgments

This book has been a labor of love, and sometimes frustration, so we first and foremost need to thank our wives (Melody and Meredith), and then acknowledge family and friends for their endless patience and for indulging us as we shared our latest esoteric discoveries while working with graph databases. Without their support we never could have made it through the countless hours it took to create this book.

A big thank you goes out to Dr. Denise Gosnell, Kelly Mondor, Ted Wilmes, and Daniel Farrell for all the specific insights, interviews, and support you provided, which helped us immensely in creating this book.

We would also like to thank the team at Manning Publications for allowing us the time and opportunity to publish this book. We would like to thank the entire Manning staff and specifically our publishers Marjan Bace and Michael Stephens, as well as our editors Frances Lefkowitz, Nick Watts, Alex Ott, Lori Weidert, and Frances Buran for all the amazing feedback and endless patience you have shown. Our appreciation also goes out to all the reviewers whose comments and reviews were invaluable in solidifying the organization and in clarifying the focus of this book: Scott Bartram, Andrew Blair, Alain Couniot, Douglas Duncan, Mike Erickson, John Guthrie, Mike Haller, Milorad Imbra, Ramaninder Singh Jhajj, Mike Jensen, Nicholas Robert Keers, Mladen Knežic´, Miguel Montalvo, Luis Moux, Nick Rakochy, Ron Sher, Deshuang Tang, Richard Vaughan, and Matthew Welke.

We would also like to thank the team at Expero Inc., without whom Josh and Dave would never have met, nor would have ever started their exploration of graph databases. Our many years of working side by side with the exceptionally talented Experonauts were a fruitful starting point that eventually led to writing this book.

about this book

This book is written for anyone building applications using graph databases. It is designed to provide a foundational understanding of graphs and graph databases, as well as to provide a framework for building applications using common graph database patterns. To teach this framework, this book follows the development lifecycle of a fictitious application called DiningByFriends. We use this application throughout the book to provide a realistic grounding of graph principles and examples of the concepts and content we teach. In many areas throughout this book, we compare and contrast the differences between building a graph-backed application and using the more traditional relational database model. By the end of this book, you will not only have the skills needed to build your own graph-backed application, but you will have built your first application, DiningByFriends.

Who should read this book

This book is for application developers, data engineers, and database developers who want to use graph databases as the backing data store for their applications. Throughout this book, we do not expect the reader to have any prior experience using graph databases, but you should be familiar with data modeling concepts, specifically with relational database development, as these are used heavily throughout as a common point of reference. Although all the application code is written in Java, any developer with object-oriented application development experience should be able to follow along with the concepts and content.

How this book is organized: A roadmap

This book is organized into 3 parts, comprising of 11 chapters. In part 1, “Getting started with graph databases,” we establish the foundation for our DiningByFriends application:

  • Chapter 1 begins with an introduction to graphs and graph terminology. We discuss how graph databases differ from relational databases and how you can use graph databases to solve highly connected data problems. We finish this chapter by discussing what makes a problem a good candidate for using a graph database.

  • Chapter 2 is where we hit the ground running by building an initial data model for our DiningByFriends application. We start with the types of information needed to begin the data modeling process. We then show how to turn this information into a conceptual data model. Finally, we walk through a framework for taking our business needs and our conceptual data model and turn that into our initial data model using the elements of a graph database: vertices, edges, and properties.

  • Chapter 3 begins a set of three chapters focused on learning the process of querying a graph database, known as traversing. We begin by teaching you how to retrieve and filter data from our graph. We follow this with learning how to navigate the structure of our graph and how that differs from working with a relational database. Then we finish up this chapter by demonstrating the ease with which you can recursively traverse through a graph to retrieve complex, interconnected data.

  • Chapter 4 continues our exploration of graph traversals with data mutation use cases. We then show how you can traverse the graph to find the entities and relationships that connect two items, known as the path. Finally, we look at how to leverage properties on relationships to filter the traversals and increase their performance.

  • Chapter 5 finishes our initial focus on graph traversals with a discussion of ways to format the results of our traversal into a desired output. Additionally, you learn how to perform common operations such as sorting, filtering, and limiting the results returned.

  • Chapter 6 begins the process of building our DiningByFriends application by taking the traversals we developed in chapters 3, 4, and 5 and walking through incorporating these into a Java application. Then we’ll process the results to complete this first part.

In part 2, “Building an application with graph databases,” we extend the concepts introduced in part 1:

  • Chapter 7 uses the foundations of data modeling from chapter 2, as well as what you learned about traversing a graph, to extend the data model for more complex use cases, such as recommendation engines and personalization.

  • Chapter 8 leverages a recommendation engine use case to demonstrate the power of using a known-walk pattern to create a robust recommendation application pattern.

  • Chapter 9 uses our personalization use case to demonstrate how to use a subgraph access pattern within a graph-backed application.

In part 3, “Beyond the basics,” we move past the DiningByFriends application to discuss our next steps in the application development process.

  • Chapter 10 discusses how to debug and troubleshoot common performance problems with traversals. We then investigate exactly what supernodes are and why they cause issues in graph-backed applications. We follow up these common performance problems with common application and traversal pitfalls and anti-patterns, as well as how to recognize and avoid them.

  • Chapter 11 takes a forward-looking view and discusses some of the next steps you might want to take with your graph-backed application. We also discuss some of the most common graph analytics algorithms and how you can apply these to solve a specific problem. Finally, we wrap up this chapter with a brief overview of how to leverage graphs in machine learning (ML) application.

About the code

This book contains many examples of source code, both in numbered listings and in line with normal text. In both cases, source code is formatted in a fixed-width font like this to separate it from ordinary text.

In many cases, the original source code has been reformatted; we’ve added line breaks and reworked indentation to accommodate the available page size in the book. In rare cases, even this was not enough and code listings include line-continuation markers (). Additionally, code annotations accompany many of the listings, highlighting important concepts.

The code for the examples in this book is available for download from the Manning website at https://www.manning.com/books/graph-databases-in-action, and from GitHub at https://github.com/bechbd/graph-databases-in-action.

About the technologies

Our goal throughout this book is to equip the reader with the conceptual knowledge needed to build graph-backed applications. However, in order to provide practical examples of these concepts, we had to make decisions regarding the technologies used for demonstration.

Our first decision was to pick the type of database. We decided to use a labeled property graph database, instead of, for example, an RDF store or triplestore database. Labeled property graph databases are the most common type we have seen in production use and seem to be the ones with the most momentum behind them. Additionally, these are the closest to the familiar concepts of relational databases, so labeled property graph databases are quite effective for comparisons.

This lead us to our next decision: the traversal language to use, openCypher or Gremlin.

While there’s a strong case for using openCypher, the goal of this book is to remain as vendor-agnostic as possible. It is important to us that these concepts and techniques are easily transferable to many popular databases when you start to build your applications. In the end, we decided to use the Apache TinkerPop version 3.4.x framework because it currently has the most database vendors with compatible implementations.

We have been questioned multiple times during the proposal and review processes as to why we chose this stack over a Neo4j/Cypher stack. Given the popularity of the Neo4j ecosystem this is a fair question which deserves fuller comment. There are three reasons we chose TinkerPop’s Gremlin for the illustrations throughout this book:

  • Gremlin is a better tool for teaching how a traversal works.

  • Gremlin is a common language of choice for enterprise applications.

  • Gremlin is the most portable language between property graph databases.

As for the first reason, we believe that the imperative design of Gremlin provides a better teaching tool for learning how a graph traversal works compared to the declarative approach of Cypher/openCypher. The syntax of Gremlin requires that we think about how we are moving through our graph in order to determine where we will move next. While we do appreciate the simplicity of Cypher/openCypher, it can also obfuscate critical technical matters, especially when dealing with issues of performance or scale. So while Cypher/openCypher is a great starting point for learning how to work with connected data, we feel that Gremlin is better suited for building high performing, scalable data applications.

Because Gremlin is the common language of choice for enterprise applications, many of these applications were built using TinkerPop-enabled databases. This means that Gremlin is the query language of choice. Some organizations have both Cypher/openCypher and Gremlin applications. But in our experience, the bigger, more complex enterprise-level projects seem to have chosen one of the many TinkerPop-enabled databases or cloud services.

As for our third choice, at this time, it is easy to say that Gremlin is the most widely available query language across graph database engines. Nearly all of the major cloud vendors (Amazon Web Services, Microsoft Azure, IBM, Huawei, and so forth) offer graph databases or services compatible with Gremlin. The lone exception is the Google Cloud Platform, which offers Neo4j as a service.

Our goal is not to advocate for one database or language over another. We seek to provide you with a solid foundation for how to use a graph database when building applications with highly connected data and to illustrate how graph databases work under the cover. We think that Gremlin provides the best path to accomplish this.

With the decision to use TinkerPop’s Gremlin made, we had to pick a specific TinkerPop-enabled database to use. In the spirit of remaining vendor agnostic, we’ve decided to use TinkerGraph for the examples. TinkerGraph is the graph implementation used in the Gremlin Server and Gremlin Console, the reference software provided as part of the Apache Software Foundation’s TinkerPop project.

Finally, we had to decide on an application programming language to build our example application, DiningByFriends. As Java is the most common language we have used with graph databases, we chose that as our application language. We should note that it is possible to build the same application with other languages such as C#, JavaScript and Python. Not only is it possible, we have done so ourselves. But all the traversals provided in this book are written in Gremlin and any application code is written in Java.

While almost all the concepts presented throughout this book are not specific to TinkerPop-enabled databases, there are a few we discuss that are unique to TinkerPop. When this is the case, we'll note where a TinkerPop-specific feature is used so that you’re aware that a particular feature might not be available in your graph database of choice. If no such note is given, it is safe to assume that the concept we discuss is applicable to other labeled property graph databases as well.

liveBook discussion forum

Purchase of Graph Databases in Action includes free access to a private web forum run by Manning Publications where you can make comments about the book, ask technical questions, and receive help from the authors and from other users. To access the forum, go to https://livebook.manning.com/#!/book/graph-databases-in-action/discussion. You can also learn more about Manning’s forums and the rules of conduct at https://livebook.manning.com/#!/discussion.

Manning’s commitment to our readers is to provide a venue where a meaningful dialogue between individual readers and between readers and the authors can take place. It is not a commitment to any specific amount of participation on the part of the authors, whose contribution to the forum remains voluntary (and unpaid). We suggest you try asking the authors some challenging questions lest their interest stray! The forum and the archives of previous discussions will be accessible from the publisher’s website as long as the book is in print.

about the authors

Dave Bechberger is a data architect and developer with over two decades of experience. He uses his extensive knowledge of graph and other big data technologies to build highly performant and scalable data platforms in complex data domains such as bioinformatics, oil and gas, and supply chain management. Since the mid-2010s, Dave has worked with graph databases as a consultant, consumer, and vendor. He is an active member of the graph community and has presented on a wide range of graph-related topics at national and international conferences.

Josh Perryman also has over two decades of experience building and maintaining complex systems. Since 2014, he has focused on graph databases, especially in distributed or big data environments, and he regularly blogs and speaks at conferences about graph databases. Josh has worked with a variety of industries, including enterprise software, financial services, consumer products, and government intelligence agencies. In addition to consulting and product work, he has designed Gremlin training courses that have been delivered all over the world.

about the cover illustration

The figure on the cover of Graph Databases in Action is captioned “Femme de la Foret Noire,” or a woman from the Black Forest, in Southwest Germany. The illustration is taken from a collection of dress costumes from various countries by Jacques Grasset de Saint-Sauveur (1757-1810), titled Costumes civils actuels de tous les peoples connus, published in France in 1788. Each illustration is finely drawn and colored by hand. The rich variety of Grasset de Saint-Sauveur’s collection reminds us vividly of how culturally apart the world’s towns and regions were just 200 years ago. Isolated from each other, people spoke different dialects and languages. In the streets or in the countryside, it was easy to identify where they lived and what their trade or station in life was just by their dress.

The way we dress has changed since then and the diversity by region, so rich at the time, has faded away. It is now hard to tell apart the inhabitants of different continents, let alone different towns, regions, or countries. Perhaps we have traded cultural diversity for a more varied personal life--certainly for a more varied and fast-paced technological life.

At a time when it is hard to tell one computer book from another, Manning celebrates the inventiveness and initiative of the computer business with book covers based on the rich diversity of regional life of two centuries ago, brought back to life by Grasset de Saint-Sauveur’s pictures.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.16.15.149