AWS considerations

We've not mentioned AWS so far in this chapter because nothing in Sqoop either specifically supports or prevents its use there. We can run Sqoop on an EC2 host as easily as on a local one, and it can access either a manually created or an EMR-created Hadoop cluster, optionally running Hive. The only likely quirk on AWS is security group access, as many default EC2 configurations do not allow traffic on the ports used by most relational databases (3306 by default for MySQL). That is no more of an issue, however, than if our Hadoop cluster and MySQL database were located on different sides of a firewall or any other network security boundary.
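A quick way to tell a blocked port from a Sqoop misconfiguration is to test plain TCP reachability from the host that will run Sqoop. The following is a minimal sketch only; the function name port_is_open and the host name in the usage comment are assumptions made for illustration, not part of Sqoop or AWS.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host name and attempts a TCP connect.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


# Example usage (hypothetical host name; 3306 is the MySQL default port):
#   port_is_open("mysql.example.internal", 3306)
```

If this returns False from the EC2 host but True from a machine inside the database's own network, the security group (or an equivalent firewall rule) is the likely culprit rather than Sqoop itself.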

Considering RDS

There is another AWS service that we've not mentioned before that does deserve an introduction now. Amazon Relational Database Service (RDS) offers hosted relational databases in the cloud and provides MySQL, Oracle, and Microsoft SQL Server options. Instead of having to worry about the installation, configuration, and management of a database engine, RDS allows an instance to be started from either the console or command-line tools. You then just point your database client tool at the database and start creating tables and manipulating data.
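Since an RDS MySQL instance is reached through an ordinary host name and port, pointing Sqoop at it should look the same as pointing it at any other MySQL server. The sketch below simply assembles the argument list for a sqoop import invocation; the helper name sqoop_import_args and the endpoint, database, table, and user values are hypothetical placeholders, and only the standard --connect, --username, -P, and --table options are used.

```python
from typing import List


def sqoop_import_args(endpoint: str, database: str, table: str,
                      user: str, port: int = 3306) -> List[str]:
    """Build a 'sqoop import' argument list for a MySQL database."""
    # Standard MySQL JDBC URL; the endpoint is whatever host name RDS reports.
    jdbc_url = f"jdbc:mysql://{endpoint}:{port}/{database}"
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", user,
        "-P",               # prompt for the password instead of embedding it
        "--table", table,
    ]


# Example usage with placeholder values (all hypothetical):
#   args = sqoop_import_args("mydb.example.rds.amazonaws.com",
#                            "hadooptest", "employees", "hadoopuser")
#   print(" ".join(args))
```

The list form can be passed to subprocess.run on a host where Sqoop is installed, or joined into a single command line for pasting into a shell.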

RDS and EMR are a powerful combination: both are hosted services that take much of the pain out of running a database and a Hadoop cluster yourself. If you need a relational database but don't want to worry about its management, RDS may be for you.

The RDS and EMR combination can be particularly powerful if you use EC2 hosts to generate data or store data in S3. Amazon has a general policy that there is no cost for data transfer from one service to another within a single region. Consequently, it's possible to have a fleet of EC2 hosts generating large data volumes that get pushed into a relational database in RDS for query access and are stored in EMR for archival and long-term analytics. Getting data into the storage and processing systems is often technically challenging and can easily incur significant expense if the data needs to be moved across commercial network links. Architectures built atop collaborating AWS services such as EC2, RDS, and EMR can minimize both of these concerns.
