fileStream

Data files in a directory can be read as a stream using the fileStream() API of StreamingContext. The fileStream() API accepts four arguments: the directory to watch for files to be streamed, the key class, the value class, and the InputFormat class. For Hadoop-compatible files that contain text data, the following syntax can be used:

streamingContext.fileStream(inputDirectory, LongWritable.class, Text.class, TextInputFormat.class);

A simpler API hides the key class, value class, and input format details and comes in handy when reading plain text files:

streamingContext.textFileStream(inputDirectory);

One important difference from the earlier socket streaming examples is that streaming data from a filesystem does not need a receiver, so no extra thread is required to run a receiver process. Two key points to remember when processing file stream data: only files that are moved into the directory after the Spark job has been launched and is running are considered, and a file must not be changed once it has been moved into the streaming directory.
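The producer side of these two rules can be sketched in plain Java, without any Spark dependency. The idea is to write each file completely in a separate staging directory and only then move it into the watched directory in a single step, so the streaming job never sees a half-written file and the file is never modified after it arrives. The class name StagedFileDrop, the method publish, and the directory and file names are hypothetical and chosen only for illustration; the sketch also assumes the staging and watched directories are on the same filesystem, since an atomic move is generally not possible otherwise.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper: not part of Spark, shown only to illustrate the
// "write elsewhere, then move" pattern for feeding a watched directory.
public class StagedFileDrop {

    // Writes the whole file in stagingDir first, then moves it into watchedDir.
    // Assumes both directories are on the same filesystem; otherwise
    // ATOMIC_MOVE throws AtomicMoveNotSupportedException.
    public static Path publish(Path stagingDir, Path watchedDir,
                               String fileName, String contents) throws IOException {
        Files.createDirectories(stagingDir);
        Files.createDirectories(watchedDir);

        // Step 1: finish all writes outside the watched directory.
        Path staged = stagingDir.resolve(fileName);
        Files.write(staged, contents.getBytes(StandardCharsets.UTF_8));

        // Step 2: move the finished file into place in one step.
        // After this point the file must not be modified again.
        Path target = watchedDir.resolve(fileName);
        return Files.move(staged, target, StandardCopyOption.ATOMIC_MOVE);
    }

    public static void main(String[] args) throws IOException {
        Path base = Files.createTempDirectory("stream-demo");
        Path moved = publish(base.resolve("staging"), base.resolve("input"),
                             "events-1.txt", "first line\nsecond line\n");
        System.out.println("Published " + moved);
    }
}
```

In this sketch the input directory plays the role of the inputDirectory passed to fileStream() or textFileStream(), and the publishing code would run only after the streaming job has started.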
