
Book Description

Understand the immense capabilities of Cassandra in managing large amounts of data and learn how to ensure that data is always available. This practical, hands-on guide takes you through every stage from installation to performance tuning.

  • Install and set up a multi-datacenter Cassandra cluster
  • Troubleshoot and tune Cassandra
  • Covers CAP trade-offs and physical/hardware limitations, and helps you understand the magic
  • Tune your kernel and JVM to maximize performance
  • Includes security, monitoring metrics, Hadoop configuration, and query tracing

In Detail

Apache Cassandra is a massively scalable, open source NoSQL database. It is well suited to managing large amounts of structured, semi-structured, and unstructured data across multiple data centers and the cloud, and it delivers linear scalability and performance across many commodity servers with no single point of failure.

This book starts by explaining how the design is derived, the basic concepts, and the CAP theorem with its trade-offs, and then moves step by step to setting up and configuring high-performance clusters. You will learn how to install and configure a Cassandra cluster and how to tune it for performance. The book also explains how to configure Hadoop, vnodes, and multi-DC clusters, how to enable tracing and various security features, and how to query data from Cassandra. You will learn about the kernel and JVM tuning parameters that can be adjusted to get the most out of system resources.

The book discusses use cases and problems, anti-patterns, and potential practical solutions rather than raw techniques. After reading it, you should understand why the system works the way it does, and you should be able to recognize patterns (use cases) and anti-patterns that could degrade performance. By explaining the components of Cassandra’s architecture, the book helps administrators understand the system better and operate the cluster more productively.

Table of Contents

  1. Learning Cassandra for Administrators
    1. Table of Contents
    2. Learning Cassandra for Administrators
    3. Credits
    4. About the Author
    5. About the Reviewers
    6. www.PacktPub.com
      1. Support files, eBooks, discount offers and more
        1. Why Subscribe?
        2. Free Access for Packt account holders
    7. Preface
      1. What this book covers
      2. What you need for this book
      3. Who this book is for
      4. Conventions
      5. Reader feedback
      6. Customer support
        1. Errata
        2. Piracy
        3. Questions
    8. 1. Basic Concepts and Architecture
      1. CAP theorem
      2. BigTable / Log-structured data model
        1. Column families
        2. Keyspace
        3. Sorted String Table (SSTable)
        4. Memtable
        5. Compaction
      3. Partitioning and replication Dynamo style
        1. Gossip protocol
        2. Distributed hash table
        3. Eventual consistency
      4. Summary
    9. 2. Installing Cassandra
      1. Memory, CPU, and network requirements
      2. Cassandra in-memory data structures
        1. Index summary
        2. Bloom filter
        3. Compression metadata
        4. SSDs versus spinning disks
        5. Key cache
        6. Row cache
      3. Downloading/choosing binaries to install
        1. Configuring cassandra-env.sh
        2. Configuring Cassandra.yaml
          1. cluster_name
          2. seed_provider
          3. Partitioner
          4. auto_bootstrap
          5. broadcast_address
          6. commitlog_directory
          7. data_file_directories
          8. disk_failure_policy
          9. initial_token
          10. listen_address/rpc_address
          11. Ports
          12. endpoint_snitch
          13. commitlog_sync
          14. commitlog_segment_size_in_mb
          15. commitlog_total_space_in_mb
          16. Key cache and row cache saved to disk
          17. compaction_preheat_key_cache
          18. row_cache_provider
          19. column_index_size_in_kb
          20. compaction_throughput_mb_per_sec
          21. in_memory_compaction_limit_in_mb
          22. concurrent_compactors
          23. populate_io_cache_on_flush
          24. concurrent_reads
          25. concurrent_writes
          26. flush_largest_memtables_at
          27. index_interval
          28. memtable_total_space_in_mb
          29. memtable_flush_queue_size
          30. memtable_flush_writers
          31. stream_throughput_outbound_megabits_per_sec
          32. request_scheduler
          33. request_scheduler_options
          34. rpc_keepalive
          35. rpc_server_type
          36. thrift_framed_transport_size_in_mb
          37. rpc_max_threads
          38. rpc_min_threads
          39. Timeouts
        3. Dynamic snitch
        4. Backup configurations
          1. incremental_backups
          2. auto_snapshot
      4. Cassandra on EC2 instance
        1. Snitch
      5. Create a keyspace
        1. Creating a column family
          1. GC grace period
          2. Compaction
          3. Minimum and maximum compaction threshold
        2. Secondary indexes
        3. Composite primary key type
          1. Options
        4. read_repair_chance and dclocal_read_repair_chance
      6. Summary
    10. 3. Inserting Data and Manipulating Data
      1. Querying data
        1. USE
        2. CREATE
        3. ALTER
        4. DESCRIBE
        5. SELECT
      2. Tracing
      3. Data modeling
        1. Types of columns
        2. Common Cassandra data models
          1. Denormalization
          2. Creating a counter column family
          3. Tweet data structure
          4. Secondary index examples
            1. Creating a secondary index table
            2. Internal data structure
            3. Indexed column family
            4. Creating an index
      4. Summary
    11. 4. Administration and Large Deployments
      1. Manual repair
      2. Bootstrapping
        1. Vnodes
          1. Node tool commands
          2. Cfhistograms
          3. Cleanup
          4. Decommission
          5. Drain
      3. Monitoring tools
        1. DataStax OpsCenter
        2. Basic JMX monitoring
      4. Summary
    12. 5. Performance Tuning
      1. vmstat
      2. iostat
      3. dstat
      4. Garbage collection
        1. Enabling GC logging
        2. Understanding GCLogs
          1. Stop-the-world GC
          2. The jstat tool
          3. The jmap tool
        3. The write surveillance mode
      5. Tuning memtables
        1. memtable_flush_writers
      6. Compaction tuning
        1. SizeTieredCompactionStrategy
        2. LeveledCompactionStrategy
      7. Compression
        1. NodeTool
        2. compactionstats
        3. netstats
        4. tpstats
        5. Cassandra's caches
          1. Filesystem caches
        6. Separate drive for commit logs
        7. Tuning the kernel for Cassandra
        8. noop scheduler
        9. NUMA
        10. Other tuning parameters
        11. Dynamic snitch
        12. Configuring a Cassandra multiregion cluster
      8. Summary
    13. 6. Analytics
      1. Hadoop integration
        1. Configuring Hadoop with Cassandra
          1. Virtual datacenter
            1. PropertyFileSnitch
            2. GossipingPropertyFileSnitch
            3. DSE Hadoop
        2. Acunu Analytics
        3. Reading data directly from Cassandra
        4. Analytics on backups
          1. File streaming
            1. Keyspace and column family settings
            2. Communication configuration using the Thrift interface with Cassandra
            3. HDFS location of the temporary files
      2. Summary
    14. 7. Security and Troubleshooting
      1. Encryption
        1. Creating a keystore
        2. Creating a truststore
        3. Transparent data encryption
          1. Keyspace authentication (simple authenticator)
          2. JMX authentication
      2. Audit
      3. Things to look out for
      4. Summary
    15. Index