Advanced Algorithms and Data Structures introduces a collection of algorithms for complex programming challenges in data analysis, machine learning, and graph computing. You’ll discover cutting-edge approaches to a variety of tricky scenarios. You’ll even learn to design your own data structures for projects that require a custom solution.

Table of Contents

  1. inside front cover
  2. Advanced Algorithms and Data Structures
  3. Copyright
  4. dedication
  5. contents
  6. front matter
    1. foreword
    2. preface
    3. Welcome to Advanced Algorithms and Data Structures
    4. acknowledgments
    5. about this book
    6. Who should read this book?
    7. How this book is organized: a roadmap
    8. About the code
    9. liveBook discussion forum
    10. about the author
    11. about the cover illustration
  7. 1 Introducing data structures
    1. 1.1 Data structures
    2. 1.1.1 Defining a data structure
    3. 1.1.2 Describing a data structure
    4. 1.1.3 Algorithms and data structures: Is there a difference?
    5. 1.2 Setting goals: Your expectations after reading this book
    6. 1.3 Packing your knapsack: Data structures meet the real world
    7. 1.3.1 Abstracting the problem away
    8. 1.3.2 Looking for solutions
    9. 1.3.3 Algorithms to the rescue
    10. 1.3.4 Thinking (literally) outside of the box
    11. 1.3.5 Happy ending
    12. Summary
  8. Part 1. Improving over basic data structures
  9. 2 Improving priority queues: d-way heaps
    1. 2.1 Structure of this chapter
    2. 2.2 The problem: Handling priority
    3. 2.2.1 Priority in practice: Bug tracking
    4. 2.3 Solutions at hand: Keeping a sorted list
    5. 2.3.1 From sorted lists to priority queues
    6. 2.4 Describing the data structure API: Priority queues
    7. 2.4.1 Priority queue at work
    8. 2.4.2 Priority matters: Generalize FIFO
    9. 2.5 Concrete data structures
    10. 2.5.1 Comparing performance
    11. 2.5.2 What’s the right concrete data structure?
    12. 2.5.3 Heap
    13. 2.5.4 Priority, min-heap, and max-heap
    14. 2.5.5 Advanced variant: d-ary heap
    15. 2.6 How to implement a heap
    16. 2.6.1 BubbleUp
    17. 2.6.2 PushDown
    18. 2.6.3 Insert
    19. 2.6.4 Top
    20. 2.6.5 Update
    21. 2.6.6 Dealing with duplicates
    22. 2.6.7 Heapify
    23. 2.6.8 Beyond API methods: Contains
    24. 2.6.9 Performance recap
    25. 2.6.10 From pseudo-code to implementation
    26. 2.7 Use case: Find the k largest elements
    27. 2.7.1 The right data structure . . .
    28. 2.7.2 . . . and the right use
    29. 2.7.3 Coding it up
    30. 2.8 More use cases
    31. 2.8.1 Minimum distance in graphs: Dijkstra
    32. 2.8.2 More graphs: Prim’s algorithm
    33. 2.8.3 Data compression: Huffman codes
    34. 2.9 Analysis of branching factor
    35. 2.9.1 Do we need d-ary heaps?
    36. 2.9.2 Running time
    37. 2.9.3 Finding the optimal branching factor
    38. 2.9.4 Branching factor vs memory
    39. 2.10 Performance analysis: Finding the best branching factor
    40. 2.10.1 Please welcome profiling
    41. 2.10.2 Interpreting results
    42. 2.10.3 The mystery with heapify
    43. 2.10.4 Choosing the best branching factor
    44. Summary
  10. 3 Treaps: Using randomization to balance binary search trees
    1. 3.1 Problem: Multi-indexing
    2. 3.1.1 The gist of the solution
    3. 3.2 Solution: Description and API
    4. 3.3 Treap
    5. 3.3.1 Rotations
    6. 3.3.2 A few design questions
    7. 3.3.3 Implementing search
    8. 3.3.4 Insert
    9. 3.3.5 Delete
    10. 3.3.6 Top, peek, and update
    11. 3.3.7 Min, max
    12. 3.3.8 Performance recap
    13. 3.4 Applications: Randomized treaps
    14. 3.4.1 Balanced trees
    15. 3.4.2 Introducing randomization
    16. 3.4.3 Applications of Randomized treaps
    17. 3.5 Performance analysis and profiling
    18. 3.5.1 Theory: Expected height
    19. 3.5.2 Profiling height
    20. 3.5.3 Profiling running time
    21. 3.5.4 Profiling memory usage
    22. 3.5.5 Conclusions
    23. Summary
  11. 4 Bloom filters: Reducing the memory for tracking content
    1. 4.1 The dictionary problem: Keeping track of things
    2. 4.2 Alternatives to implementing a dictionary
    3. 4.3 Describing the data structure API: Associative array
    4. 4.4 Concrete data structures
    5. 4.4.1 Unsorted array: Fast insertion, slow search
    6. 4.4.2 Sorted arrays and binary search: Slow insertion, fast(-ish) search
    7. 4.4.3 Hash table: Constant-time on average, unless you need ordering
    8. 4.4.4 Binary search tree: Every operation is logarithmic
    9. 4.4.5 Bloom filter: As fast as hash tables, but saves memory (with a catch)
    10. 4.5 Under the hood: How do Bloom filters work?
    11. 4.6 Implementation
    12. 4.6.1 Using a Bloom filter
    13. 4.6.2 Reading and writing bits
    14. 4.6.3 Find where a key is stored
    15. 4.6.4 Generating hash functions
    16. 4.6.5 Constructor
    17. 4.6.6 Checking a key
    18. 4.6.7 Storing a key
    19. 4.6.8 Estimating accuracy
    20. 4.7 Applications
    21. 4.7.1 Cache
    22. 4.7.2 Routers
    23. 4.7.3 Crawler
    24. 4.7.4 I/O fetcher
    25. 4.7.5 Spell checker
    26. 4.7.6 Distributed databases and file systems
    27. 4.8 Why Bloom filters work
    28. 4.8.1 Why there are no false negatives . . .
    29. 4.8.2 . . . But there are false positives
    30. 4.8.3 Bloom filters as randomized algorithms
    31. 4.9 Performance analysis
    32. 4.9.1 Running time
    33. 4.9.2 Constructor
    34. 4.9.3 Storing an element
    35. 4.9.4 Looking up an element
    36. 4.10 Estimating Bloom filter precision
    37. 4.10.1 Explanation of the false-positive ratio formula
    38. 4.11 Improved variants
    39. 4.11.1 Bloomier filter
    40. 4.11.2 Combining Bloom filters
    41. 4.11.3 Layered Bloom filter
    42. 4.11.4 Compressed Bloom filter
    43. 4.11.5 Scalable Bloom filter
    44. Summary
  12. 5 Disjoint sets: Sub-linear time processing
    1. 5.1 The distinct subsets problem
    2. 5.2 Reasoning on solutions
    3. 5.3 Describing the data structure API: Disjoint set
    4. 5.4 Naïve solution
    5. 5.4.1 Implementing naïve solution
    6. 5.5 Using a tree-like structure
    7. 5.5.1 From list to trees
    8. 5.5.2 Implementing the tree version
    9. 5.6 Heuristics to improve the running time
    10. 5.6.1 Path compression
    11. 5.6.2 Implementing balancing and path compression
    12. 5.7 Applications
    13. 5.7.1 Graphs: Connected components
    14. 5.7.2 Graphs: Kruskal’s algorithm for minimum spanning tree
    15. 5.7.3 Clustering
    16. 5.7.4 Unification
    17. Summary
  13. 6 Trie, radix trie: Efficient string search
    1. 6.1 Spell-check
    2. 6.1.1 A prncess, a Damon, and an elf walkz into a bar
    3. 6.1.2 Compression is the key
    4. 6.1.3 Description and API
    5. 6.2 Trie
    6. 6.2.1 Why is it better again?
    7. 6.2.2 Search
    8. 6.2.3 Insert
    9. 6.2.4 Remove
    10. 6.2.5 Longest prefix
    11. 6.2.6 Keys matching a prefix
    12. 6.2.7 When should we use tries?
    13. 6.3 Radix tries
    14. 6.3.1 Nodes and edges
    15. 6.3.2 Search
    16. 6.3.3 Insert
    17. 6.3.4 Remove
    18. 6.3.5 Longest common prefix
    19. 6.3.6 Keys starting with a prefix
    20. 6.4 Applications
    21. 6.4.1 Spell-checker
    22. 6.4.2 String similarity
    23. 6.4.3 String sorting
    24. 6.4.4 T9
    25. 6.4.5 Autocomplete
    26. Summary
  14. 7 Use case: LRU cache
    1. 7.1 Don’t compute things twice
    2. 7.2 First attempt: Remembering values
    3. 7.2.1 Description and API
    4. 7.2.2 Fresh data, please
    5. 7.2.3 Handling asynchronous calls
    6. 7.2.4 Marking cache values as “Loading”
    7. 7.3 Memory is not enough (literally)
    8. 7.4 Getting rid of stale data: LRU cache
    9. 7.4.1 Sometimes you have to double down on problems
    10. 7.4.2 Temporal ordering
    11. 7.4.3 Performance
    12. 7.5 When fresher data is more valuable: LFU
    13. 7.5.1 So how do we choose?
    14. 7.5.2 What makes LFU different
    15. 7.5.3 Performance
    16. 7.5.4 Problems with LFU
    17. 7.6 How you use the cache is just as important
    18. 7.7 Introducing synchronization
    19. 7.7.1 Solving concurrency (in Java)
    20. 7.7.2 Introducing locks
    21. 7.7.3 Acquiring a lock
    22. 7.7.4 Reentrant locks
    23. 7.7.5 Read locks
    24. 7.7.6 Other approaches to concurrency
    25. 7.8 Cache applications
    26. Summary
  15. Part 2. Multidimensional queries
  16. 8 Nearest neighbors search
    1. 8.1 The nearest neighbors search problem
    2. 8.2 Solutions
    3. 8.2.1 First attempts
    4. 8.2.2 Sometimes caching is not the answer
    5. 8.2.3 Simplifying things to get a hint
    6. 8.2.4 Carefully choose a data structure
    7. 8.3 Description and API
    8. 8.4 Moving to k-dimensional spaces
    9. 8.4.1 Unidimensional binary search
    10. 8.4.2 Moving to higher dimensions
    11. 8.4.3 Modeling 2-D partitions with a data structure
    12. Summary
  17. 9 K-d trees: Multidimensional data indexing
    1. 9.1 Right where we left off
    2. 9.2 Moving to k-D spaces: Cycle through dimensions
    3. 9.2.1 Constructing the BST
    4. 9.2.2 Invariants
    5. 9.2.3 The importance of being balanced
    6. 9.3 Methods
    7. 9.3.1 Search
    8. 9.3.2 Insert
    9. 9.3.3 Balanced tree
    10. 9.3.4 Remove
    11. 9.3.5 Nearest neighbor
    12. 9.3.6 Region search
    13. 9.3.7 A recap of all methods
    14. 9.4 Limits and possible improvements
    15. Summary
  18. 10 Similarity Search Trees: Approximate nearest neighbors search for image retrieval
    1. 10.1 Right where we left off
    2. 10.1.1 A new (more complex) example
    3. 10.1.2 Overcoming k-d trees’ flaws
    4. 10.2 R-tree
    5. 10.2.1 A step back: Introducing B-trees
    6. 10.2.2 From B-tree to R-tree
    7. 10.2.3 Inserting points in an R-tree
    8. 10.2.4 Search
    9. 10.3 Similarity search tree
    10. 10.3.1 SS-tree search
    11. 10.3.2 Insert
    12. 10.3.3 Insertion: Variance, means, and projections
    13. 10.3.4 Insertion: Split nodes
    14. 10.3.5 Delete
    15. 10.4 Similarity search
    16. 10.4.1 Nearest neighbor search
    17. 10.4.2 Region search
    18. 10.4.3 Approximated similarity search
    19. 10.5 SS+-tree
    20. 10.5.1 Are SS-trees better?
    21. 10.5.2 Mitigating hyper-sphere limitations
    22. 10.5.3 Improved split heuristic
    23. 10.5.4 Reducing overlap
    24. Summary
  19. 11 Applications of nearest neighbor search
    1. 11.1 An application: Find nearest hub
    2. 11.1.1 Sketching a solution
    3. 11.1.2 Trouble in paradise
    4. 11.2 Centralized application
    5. 11.2.1 Filtering points
    6. 11.2.2 Complex decisions
    7. 11.3 Moving to a distributed application
    8. 11.3.1 Issues handling HTTP communication
    9. 11.3.2 Keeping the inventory in sync
    10. 11.3.3 Lessons learned
    11. 11.4 Other applications
    12. 11.4.1 Color reduction
    13. 11.4.2 Particle interaction
    14. 11.4.3 Multidimensional DB query optimization
    15. 11.4.4 Clustering
    16. Summary
  20. 12 Clustering
    1. 12.1 Intro to clustering
    2. 12.1.1 Types of learning
    3. 12.1.2 Types of clustering
    4. 12.2 K-means
    5. 12.2.1 Issues with k-means
    6. 12.2.2 The curse of dimensionality strikes again
    7. 12.2.3 K-means performance analysis
    8. 12.2.4 Boosting k-means with k-d trees
    9. 12.2.5 Final remarks on k-means
    10. 12.3 DBSCAN
    11. 12.3.1 Directly vs density-reachable
    12. 12.3.2 From definitions to an algorithm
    13. 12.3.3 And finally, an implementation
    14. 12.3.4 Pros and cons of DBSCAN
    15. 12.4 OPTICS
    16. 12.4.1 Definitions
    17. 12.4.2 OPTICS algorithm
    18. 12.4.3 From reachability distance to clustering
    19. 12.4.4 Hierarchical clustering
    20. 12.4.5 Performance analysis and final considerations
    21. 12.5 Evaluating clustering results: Evaluation metrics
    22. 12.5.1 Interpreting the results
    23. Summary
  21. 13 Parallel clustering: MapReduce and canopy clustering
    1. 13.1 Parallelization
    2. 13.1.1 Parallel vs distributed
    3. 13.1.2 Parallelizing k-means
    4. 13.1.3 Canopy clustering
    5. 13.1.4 Applying canopy clustering
    6. 13.2 MapReduce
    7. 13.2.1 Imagine you are Donald Duck . . .
    8. 13.2.2 First map, then reduce
    9. 13.2.3 There is more under the hood
    10. 13.3 MapReduce k-means
    11. 13.3.1 Parallelizing canopy clustering
    12. 13.3.2 Centroid initialization with canopy clustering
    13. 13.3.3 MapReduce canopy clustering
    14. 13.4 MapReduce DBSCAN
    15. Summary
  22. Part 3. Planar graphs and minimum crossing number
  23. 14 An introduction to graphs: Finding paths of minimum distance
    1. 14.1 Definitions
    2. 14.1.1 Implementing graphs
    3. 14.1.2 Graphs as algebraic types
    4. 14.1.3 Pseudo-code
    5. 14.2 Graph properties
    6. 14.2.1 Undirected
    7. 14.2.2 Connected
    8. 14.2.3 Acyclic
    9. 14.3 Graph traversal: BFS and DFS
    10. 14.3.1 Optimizing delivery routes
    11. 14.3.2 Breadth-first search
    12. 14.3.3 Reconstructing the path to target
    13. 14.3.4 Depth-first search
    14. 14.3.5 It’s queue vs stack again
    15. 14.3.6 Best route to deliver a parcel
    16. 14.4 Shortest path in weighted graphs: Dijkstra
    17. 14.4.1 Differences with BFS
    18. 14.4.2 Implementation
    19. 14.4.3 Analysis
    20. 14.4.4 Shortest route for deliveries
    21. 14.5 Beyond Dijkstra’s algorithm: A*
    22. 14.5.1 How good is A* search?
    23. 14.5.2 Heuristics as a way to balance real-time data
    24. Summary
  24. 15 Graph embeddings and planarity: Drawing graphs with minimal edge intersections
    1. 15.1 Graph embeddings
    2. 15.1.1 Some basic definitions
    3. 15.1.2 Complete and bipartite graphs
    4. 15.2 Planar graphs
    5. 15.2.1 Using Kuratowski’s theorem in practice
    6. 15.2.2 Planarity testing
    7. 15.2.3 A naïve algorithm for planarity testing
    8. 15.2.4 Improving performance
    9. 15.2.5 Efficient algorithms
    10. 15.3 Non-planar graphs
    11. 15.3.1 Finding the crossing number
    12. 15.3.2 Rectilinear crossing number
    13. 15.4 Edge intersections
    14. 15.4.1 Straight-line segments
    15. 15.4.2 Polylines
    16. 15.4.3 Bézier curves
    17. 15.4.4 Intersections between quadratic Bézier curves
    18. 15.4.5 Vertex-vertex and edge-vertex intersections
    19. Summary
  25. 16 Gradient descent: Optimization problems (not just) on graphs
    1. 16.1 Heuristics for the crossing number
    2. 16.1.1 Did you just say heuristics?
    3. 16.1.2 Extending to curve-line edges
    4. 16.2 How optimization works
    5. 16.2.1 Cost functions
    6. 16.2.2 Step functions and local minima
    7. 16.2.3 Optimizing random sampling
    8. 16.3 Gradient descent
    9. 16.3.1 The math of gradient descent
    10. 16.3.2 Geometrical interpretation
    11. 16.3.3 When is gradient descent applicable?
    12. 16.3.4 Problems with gradient descent
    13. 16.4 Applications of gradient descent
    14. 16.4.1 An example: Linear regression
    15. 16.5 Gradient descent for graph embedding
    16. 16.5.1 A different criterion
    17. 16.5.2 Implementation
    18. Summary
  26. 17 Simulated annealing: Optimization beyond local minima
    1. 17.1 Simulated annealing
    2. 17.1.1 Sometimes you need to climb up to get to the bottom
    3. 17.1.2 Implementation
    4. 17.1.3 Why simulated annealing works
    5. 17.1.4 Short-range vs long-range transitions
    6. 17.1.5 Variants
    7. 17.1.6 Simulated annealing vs gradient descent: Which one should I use?
    8. 17.2 Simulated annealing + traveling salesman
    9. 17.2.1 Exact vs approximated solutions
    10. 17.2.2 Visualizing cost
    11. 17.2.3 Pruning the domain
    12. 17.2.4 State transitions
    13. 17.2.5 Adjacent vs random swaps
    14. 17.2.6 Applications of TSP
    15. 17.3 Simulated annealing and graph embedding
    16. 17.3.1 Minimum edge crossing
    17. 17.3.2 Force-directed drawing
    18. Summary
  27. 18 Genetic algorithms: Biologically inspired, fast-converging optimization
    1. 18.1 Genetic algorithms
    2. 18.1.1 Inspired by nature
    3. 18.1.2 Chromosomes
    4. 18.1.3 Population
    5. 18.1.4 Fitness
    6. 18.1.5 Natural selection
    7. 18.1.6 Selecting individuals for mating
    8. 18.1.7 Crossover
    9. 18.1.8 Mutations
    10. 18.1.9 The genetic algorithm template
    11. 18.1.10 When does the genetic algorithm work best?
    12. 18.2 TSP
    13. 18.2.1 Fitness, chromosomes, and initialization
    14. 18.2.2 Mutations
    15. 18.2.3 Crossover
    16. 18.2.4 Results and parameter tuning
    17. 18.2.5 Beyond TSP: Optimizing the routes of the whole fleet
    18. 18.3 Minimum vertex cover
    19. 18.3.1 Applications of vertex cover
    20. 18.3.2 Implementing a genetic algorithm
    21. 18.4 Other applications of the genetic algorithm
    22. 18.4.1 Maximum flow
    23. 18.4.2 Protein folding
    24. 18.4.3 Beyond genetic algorithms
    25. 18.4.4 Algorithms, beyond this book
    26. Summary
  28. appendix A. A quick guide to pseudo-code
    1. A.1 Variables and basics
    2. A.2 Arrays
    3. A.3 Conditional instructions
    4. A.3.1 Else-if
    5. A.3.2 Switch
    6. A.4 Loops
    7. A.4.1 For loop
    8. A.4.2 While loop
    9. A.4.3 Break and continue
    10. A.5 Blocks and indent
    11. A.6 Functions
    12. A.6.1 Overloading and default arguments
    13. A.6.2 Tuples
    14. A.6.3 Tuples and destructuring objects
  29. appendix B. Big-O notation
    1. B.1 Algorithms and performance
    2. B.2 The RAM model
    3. B.3 Order of magnitude
    4. B.4 Notation
    5. B.5 Examples
  30. appendix C. Core data structures
    1. C.1 Core data structures
    2. C.2 Array
    3. C.3 Linked list
    4. C.4 Tree
    5. C.4.1 Binary search trees
    6. C.5 Hash table
    7. C.5.1 Storing key-value pairs
    8. C.5.2 Hashing
    9. C.5.3 Conflicts resolution in hashing
    10. C.5.4 Performance
    11. C.6 Comparative analysis of core data structures
  31. appendix D. Containers as priority queues
    1. D.1 Bag
    2. D.2 Stack
    3. D.3 Queue
    4. D.4 A comparative analysis of containers
  32. appendix E. Recursion
    1. E.1 Simple recursion
    2. E.1.1 Pitfalls
    3. E.1.2 Good recursion
    4. E.2 Tail recursion
    5. E.3 Mutual recursion
  33. appendix F. Classification problems and randomized algorithm metrics
    1. F.1 Decision problems
    2. F.2 Las Vegas algorithms
    3. F.3 Monte Carlo algorithms
    4. F.4 Classification metrics
    5. F.4.1 Accuracy
    6. F.4.2 Precision and recall
    7. F.4.3 Other metrics and recap
  34. index
  35. inside back cover