Interaction with Amazon S3

Simple Storage Service, commonly known as S3, is a scalable object storage platform provided by Amazon Web Services (AWS). It is widely used for storing data: users can store and retrieve any amount of data at any time over the internet. Being one of the most cost-effective storage services available, it is used extensively to store huge amounts of data, and because it is a managed AWS service, users do not have to worry about its availability or maintenance.

If you have not used S3 before, it is suggested that you refer to the AWS S3 documentation (http://docs.aws.amazon.com/AmazonS3/latest/dev/Welcome.html) for more information.

The following filesystem schemes can be used to access S3 data in Spark:

  • S3 (URI: s3://): This scheme is bound to the Hadoop File System. In other words, it is Hadoop's block-based filesystem implementation backed by S3: files are stored in blocks, and because of that, tools other than Hadoop cannot read them as regular S3 objects. This scheme is now deprecated, so we have not considered it for our code examples.
  • S3N (URI: s3n://): S3N, also known as S3 Native, is the scheme used to read/write files as regular S3 objects. With this scheme, files written by Hadoop are accessible by other tools, and vice versa. It requires the jets3t library to work.

With this scheme, there is an upper limit of 5 GB on object size, that is, files larger than 5 GB cannot be read using this scheme. Although this scheme is still widely used, it is no longer considered for new enhancements.

  • S3A (URI: s3a://): This is the latest filesystem scheme. It removes the 5 GB limit of S3N and provides various performance improvements. This scheme leverages Amazon's own SDK libraries to access S3 objects. S3A is also backward compatible with S3N.

As per the Hadoop documentation, this scheme is recommended for new development.

To read the data from S3, the following AWS credentials are required:

  • AWS Access key ID
  • AWS Secret access key

Users can configure AWS credentials using any of the following ways:

  • The credentials can be set inline in the Spark program, as follows:
    • Using S3N
jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "accessKeyIDValue"); 
jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "securityAccesKeyValue"); 
    • Using S3A
jsc.hadoopConfiguration().set("fs.s3a.awsAccessKeyId", "accessKeyIDValue"); 
jsc.hadoopConfiguration().set("fs.s3a.awsSecretAccessKey", "securityAccesKeyValue"); 

  • To avoid hard-coding credentials in the source code, they can be read from a properties file, as shown in the sketch after the environment variable example below.
  • The credentials can also be set using environment variables, as follows:

export AWS_ACCESS_KEY_ID=accessKeyIDValue
export AWS_SECRET_ACCESS_KEY=secretAccessKeyValue
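
For instance, the properties-file approach mentioned above could look like the following minimal sketch. The file name credentials.properties and its property keys aws.accessKeyId and aws.secretAccessKey are placeholders chosen for illustration:

import java.io.FileInputStream;
import java.io.IOException;
import java.util.Properties;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class S3CredentialsFromProperties {
    public static void main(String[] args) throws IOException {
        // Load the AWS credentials from an external file so that they are
        // not hard-coded in the program (file name and keys are placeholders).
        Properties props = new Properties();
        try (FileInputStream in = new FileInputStream("credentials.properties")) {
            props.load(in);
        }

        SparkConf conf = new SparkConf().setAppName("S3Example").setMaster("local[*]");
        JavaSparkContext jsc = new JavaSparkContext(conf);

        // S3A reads its credentials from these Hadoop configuration properties.
        jsc.hadoopConfiguration().set("fs.s3a.access.key", props.getProperty("aws.accessKeyId"));
        jsc.hadoopConfiguration().set("fs.s3a.secret.key", props.getProperty("aws.secretAccessKey"));

        // ... create RDDs against s3a:// paths here ...

        jsc.close();
    }
}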

Also, to use the S3A filesystem, the hadoop-aws library must be added to the classpath. The following is the Maven dependency for it:

<dependency> 
    <groupId>org.apache.hadoop</groupId> 
    <artifactId>hadoop-aws</artifactId> 
    <version>2.7.1</version> 
</dependency> 
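
Note that the hadoop-aws version should generally match the Hadoop version that your Spark distribution is built against; mismatched versions often lead to class compatibility errors at runtime.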

To read data from S3, the scheme can be specified in the path passed to Spark, as follows:

    • Using S3N:
JavaRDD<String> s3File = jsc.textFile("s3n://"+"bucket-name"+"/"+"file-path"); 
    • Using S3A:
JavaRDD<String> s3File = jsc.textFile("s3a://"+"bucket-name"+"/"+"file-path"); 

Once the RDD is created, further operations (transformations/actions) can be performed on the data to get the required results. For example, the following transformations can be executed on it to generate a word count:

JavaPairRDD<String, Integer> s3WordCount=s3File .flatMap(x -> Arrays.asList(x.split(" ")).iterator()) 
               .mapToPair(x -> new Tuple2<String, Integer>((String) x, 1)) 
               .reduceByKey((x, y) -> x + y); 
 

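As a quick usage sketch for the pair RDD created above, a small sample of the word counts can be brought back to the driver and printed (take is used here so that only a few results are collected):

s3WordCount.take(10)
           .forEach(tuple -> System.out.println(tuple._1() + " : " + tuple._2()));
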
The textFile method can also be used to read multiple files from S3, as follows:

  • To read more than one file:
    • Using S3N
jsc.textFile("s3n://"+"bucket-name"+"/"+"path-of-file1"+","+"s3n://"+"bucket-name"+"/"+"path-of-file2"); 
    • Using S3A
jsc.textFile("s3a://"+"bucket-name"+"/"+"path-of-file1"+","+"s3a://"+"bucket-name"+"/"+"path-of-file2"); 
  • To read all the files in a directory:
    • Using S3N
jsc.textFile("s3n://"+"bucket-name"+"/"+"path-of-dir/*"); 
    • Using S3A
jsc.textFile("s3a://"+"bucket-name"+"/"+"path-of-dir/*"); 
  • To read all the files in multiple directories
    • Using S3N
jsc.textFile("s3n://"+"bucket-name"+"/"+"path-of-dir1/*"+","+"s3n://"+"bucket-name"+"/"+"path-of-dir2/*"); 
    • Using S3A
jsc.textFile("s3a://"+"bucket-name"+"/"+"path-of-dir1/*"+","+"s3a://"+"bucket-name"+"/"+"path-of-dir2/*"); 
  • To save the output back to S3, the schemes can be used as follows:
    • Using S3N
s3WordCount.saveAsTextFile("s3n://"+"buxket-name"+"/"+"outputDirPath"); 
    • Using S3A
s3WordCount.saveAsTextFile("s3a://"+"buxket-name"+"/"+"outputDirPath") 