Let's look at the numeric data, even though most of the columns in the dataset are either categorical or complex. The traditional way to summarize numeric data is the five-number summary: the minimum, first quartile, median, third quartile, and maximum (the first and third quartiles bound the interquartile range). I'll leave the computation of the median and the quartiles until the Spark DataFrame is introduced, as it makes these computations extremely easy; but we can compute the mean, min, and max in Scala by just applying the corresponding operators:
scala> import scala.sys.process._
import scala.sys.process._

scala> val nums = ( "gzcat chapter01/data/clickstream/clickstream_sample.tsv.gz" #| "cut -f 6" ).lineStream
nums: Stream[String] = Stream(0, ?)

scala> val m = nums.map(_.toDouble).min
m: Double = 0.0

scala> val m = nums.map(_.toDouble).sum/nums.size
m: Double = 3.6883642764024662

scala> val m = nums.map(_.toDouble).max
m: Double = 33.0
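Note that each of the three expressions above traverses the stream anew. As a side note, the same statistics can be gathered in a single pass with a fold; the following is a minimal sketch of that idea (mine, not part of the original listing), assuming the nums stream from the previous session is still in scope:

scala> // Accumulate (min, max, sum, count) in one traversal of the stream
scala> val (mn, mx, total, n) = nums.map(_.toDouble).foldLeft((Double.MaxValue, Double.MinValue, 0.0, 0L)) {
     |   case ((lo, hi, s, c), x) => (math.min(lo, x), math.max(hi, x), s + x, c + 1)
     | }

scala> val mean = total / n

The mean, min, and max come out the same as before; the fold merely avoids re-reading the data for each statistic, which matters once the input no longer fits comfortably in memory.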
Sometimes one needs to get an idea of how a certain value looks across multiple fields; the most common cases are IP/MAC addresses, dates, and formatted messages. For example, if I want to see all IP addresses mentioned throughout a file or a document, I need to replace the cut command in the previous example with grep -o -E '[1-9][0-9]{0,2}(\.[1-9][0-9]{0,2}){3}', where the -o option instructs grep to print only the matching parts. A more precise regex for an IP address would be grep -o -E '((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)', but it is about 50% slower on my laptop, and the original one works in most practical cases. I'll leave it as an exercise to run this command on the sample file provided with the book.
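The same substitution can be driven from the Scala REPL in the style of the earlier listing. This is a sketch under the same assumptions (the sample file path and the simpler regex from above); passing grep and its arguments as a Seq sidesteps shell quoting of the pattern, and .distinct keeps only the unique addresses:

scala> import scala.sys.process._

scala> val ips = (
     |   "gzcat chapter01/data/clickstream/clickstream_sample.tsv.gz" #|
     |   Seq("grep", "-o", "-E", "[1-9][0-9]{0,2}(\\.[1-9][0-9]{0,2}){3}")
     | ).lineStream.distinct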