Chapter 9. Files and Regular Expressions

Topics in This Chapter A1

This chapter interrupts the treatment of the Scala language to give you some tools from the Scala libraries. You will learn how to carry out common file processing tasks, such as reading all lines or words from a file, and to work with regular expressions.

This interlude has useful information for projects that you can embark on with your current knowledge of Scala. Of course, if you prefer, skip the chapter until you need it and move on to more information about the Scala language.

Chapter highlights:

  • Source.fromFile(...).getLines.toArray yields all lines of a file.

  • Source.fromFile(...).mkString yields the file contents as a string.

  • To convert a string into a number, use the toInt or toDouble method.

  • Use the Java PrintWriter to write text files.

  • "regex".r is a Regex object.

  • Use """...""" if your regular expression contains backslashes or quotes.

  • If a regex pattern has groups, you can extract their contents using the syntax for regex(var1, ..., varn) <- string.

9.1 Reading Lines

To read all lines from a file, call the getLines method on a scala.io.Source object:

import scala.io.Source
val filename = "/usr/share/dict/words"
var source = Source.fromFile(filename, "UTF-8")
  // You can omit the encoding if you know that the file uses
  // the default platform encoding
var lineIterator = source.getLines

The result is an iterator (see Chapter 13). You can use it to process the lines one at a time:

for l <- lineIterator do
  process(l)

Or you can put the lines into an array or array buffer by applying the toArray or toBuffer method to the iterator:

val lines = source.getLines.toArray

Sometimes, you just want to read an entire file into a string. That’s even simpler:

var contents = source.mkString

Caution

Call close when you are done using the Source object.
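One way to make sure the source gets closed, even when processing throws an exception, is the scala.util.Using utility from the standard library. The following is a minimal sketch; the helper name readLines is made up for illustration.

```scala
import scala.io.Source
import scala.util.Using

// Hypothetical helper: reads all lines of a file and closes the source
// afterwards, whether or not an exception occurs.
def readLines(filename: String): Array[String] =
  Using.resource(Source.fromFile(filename, "UTF-8")) { source =>
    source.getLines().toArray
  }
```

Using.resource rethrows any exception after closing. If you would rather get a Try value, Using(...) with the same arguments is an alternative.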

When the Source class was created, the Java API for file processing was very limited. Java has caught up, and you may want to use the java.nio.file.Files class instead:

import java.nio.file.{Files,Path}
import java.nio.charset.StandardCharsets.UTF_8
contents = Files.readString(Path.of(filename), UTF_8)

To read all lines with the Files.lines method, convert the Java stream to Scala:

import scala.collection.mutable.Buffer
import scala.jdk.StreamConverters.*
val lineBuffer = Files.lines(Path.of(filename), UTF_8).toScala(Buffer)
lineIterator = Files.lines(Path.of(filename), UTF_8).toScala(Iterator)

9.2 Reading Characters

To read individual characters from a file, you can use a Source object directly as an iterator since the Source class extends Iterator[Char]:

for c <- source do process(c)

If you want to peek at a character without consuming it (like istream::peek in C++ or a PushbackReader in Java), call the buffered method on the source object. The head method of the resulting iterator then yields the next input character without consuming it.

source = Source.fromFile("myfile.txt", "UTF-8")
val iter = source.buffered
while iter.hasNext do
  if isNice(iter.head) then
    process(iter)
  else
    iter.next
source.close()
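As a small self-contained illustration of peeking, the following sketch reads a run of digits and stops in front of the first character that is not a digit. The helper name readDigits and the sample input are made up for this example.

```scala
import scala.io.Source

// Hypothetical helper: consumes the leading digits from a buffered iterator
// and leaves the first non-digit character unread.
def readDigits(iter: BufferedIterator[Char]): String =
  val sb = new StringBuilder
  while iter.hasNext && iter.head.isDigit do
    sb += iter.next()
  sb.toString

val demoIter = Source.fromString("123abc").buffered
val digits = readDigits(demoIter)   // the digits at the start of the input
val following = demoIter.next()     // the character right after the digits
```

Because head does not advance the iterator, the character that ended the loop is still available to whatever code runs next.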

9.3 Reading Tokens and Numbers

Here is a quick-and-dirty way of reading all whitespace-separated tokens in a source:

val tokens = source.mkString.split("\\s+")

To convert a string into a number, use the toInt or toDouble method. For example, if you have a file containing floating-point numbers, you can read them all into an array by

val numbers = for w <- tokens yield w.toDouble
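The toDouble method throws a NumberFormatException when a token is not a number. If the input may contain stray text, one option is the toDoubleOption method, which returns None for such tokens. A minimal sketch with made-up sample data:

```scala
// Keep only the tokens that parse as floating-point numbers.
val rawTokens = "3.14 abc 42 2.5".split("\\s+")
val parsed = rawTokens.flatMap(_.toDoubleOption)
```

Here, parsed is an Array[Double] that holds 3.14, 42.0, and 2.5. The token abc is silently dropped, which may or may not be what you want.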

Tip

Remember—you can always use the java.util.Scanner class to process a file that contains a mixture of text and numbers.
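For instance, a line with an item name, a quantity, and a price could be processed roughly like this. The input string and the variable names are invented for the example, and the locale is set explicitly so that the decimal point is recognized regardless of the platform settings.

```scala
import java.util.{Locale, Scanner}

// Parse a line such as "widget 10 29.95" (item name, quantity, price).
val scanner = Scanner("widget 10 29.95").useLocale(Locale.US)
val itemName = scanner.next()
val count = scanner.nextInt()
val unitPrice = scanner.nextDouble()
```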

Finally, note that you can read numbers from scala.io.StdIn:

import scala.io.StdIn
print("How old are you? ")
val age = StdIn.readInt()
  // Or use readDouble or readLong

Caution

These methods assume that the next input line contains a single number, without leading or trailing whitespace. Otherwise, a NumberFormatException occurs.
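If you would rather not deal with the exception, you can read the line as a string and convert it yourself. The following sketch assumes that surrounding whitespace should be tolerated; the helper name parseAge is hypothetical.

```scala
import scala.io.StdIn

// Hypothetical helper: tolerates surrounding whitespace and returns None
// instead of throwing when the text is not an integer. Option(line) also
// covers the null that readLine returns at the end of input.
def parseAge(line: String): Option[Int] =
  Option(line).flatMap(_.trim.toIntOption)

// Typical use (reads one line from standard input):
// val age = parseAge(StdIn.readLine("How old are you? "))
```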

9.4 Reading from URLs and Other Sources

The Source object has methods to read from sources other than files:

val source1 = Source.fromURL("https://horstmann.com/index.html", "UTF-8")
val source2 = Source.fromString("Hello, World!")
  // Reads from the given string—useful for debugging
val source3 = Source.stdin
  // Reads from standard input

Caution

When you read from a URL, you need to know the character set in advance, from an HTTP header or the first 1024 bytes of the contents. See https://www.w3.org/International/questions/qa-html-encoding-declarations for more information.

Scala has no provision for reading binary files. You’ll need to use the Java library. Here is how you can read a file into a byte array:

val bytes = Files.readAllBytes(Path.of(filename)) // An Array[Byte]

9.5 Writing Files

Scala has no built-in support for writing files. To write a text file, use a java.io.PrintWriter, for example:

import java.io.PrintWriter
val out = PrintWriter(filename)
for i <- 1 to 100 do out.println(i)

You can also write formatted output:

val quantity = 10
val price = 29.95
out.printf("%6d %10.2f%n", quantity, price)

Remember to close the writer:

out.close()
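To avoid forgetting the call to close, you can let scala.util.Using take care of it. This is only a sketch; the helper name writeSquares and the choice of UTF-8 are assumptions made for the example.

```scala
import java.io.PrintWriter
import scala.util.Using

// Hypothetical helper: writes the squares of 1 to n, one per line,
// and closes the writer even if an exception occurs.
def writeSquares(filename: String, n: Int): Unit =
  Using.resource(PrintWriter(filename, "UTF-8")) { out =>
    for i <- 1 to n do out.println(i * i)
  }
```

Calling writeSquares("/tmp/squares.txt", 3) would produce a file with the lines 1, 4, and 9.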

9.6 Visiting Directories

There are no “official” Scala classes for visiting all files in a directory, or for recursively traversing directories.

The simplest approach is to use the Files.list and Files.walk methods of the java.nio.file package. The list method only visits the children of a directory, and the walk method visits all descendants. These methods yield Java streams of Path objects. You can visit them as follows:

import java.nio.file.*
import scala.jdk.StreamConverters.*
val dirname = "/home"
val entries = Files.list(Paths.get(dirname)) // or Files.walk
try
  for p <- entries.toScala(Iterator) do
    process(p)
finally
  entries.close()
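The try/finally pattern can also be expressed with scala.util.Using, since Java streams are AutoCloseable. The sketch below collects the names of all regular files below a directory; the helper name regularFileNames is made up for illustration.

```scala
import java.nio.file.{Files, Path}
import scala.jdk.StreamConverters.*
import scala.util.Using

// Hypothetical helper: the names of all regular files below dir, sorted.
// The stream returned by Files.walk is closed when the block exits.
def regularFileNames(dir: Path): Seq[String] =
  Using.resource(Files.walk(dir)) { entries =>
    entries.toScala(Seq)
      .filter(Files.isRegularFile(_))
      .map(_.getFileName.toString)
      .sorted
  }
```

The stream is converted to a Seq inside the block, so all entries are read before the stream is closed.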

9.7 Serialization

In Java, serialization is used to transmit objects to other virtual machines or for short-term storage. (For long-term storage, serialization can be awkward—it is tedious to deal with different object versions as classes evolve over time.)

Here is how you declare a serializable class in Java and Scala.

Java:

public class Person implements java.io.Serializable { // This is Java
  private static final long serialVersionUID = 42L;
  private String name;
  ...
}

Scala:

@SerialVersionUID(42L) class Person(val name: String) extends Serializable

The Serializable trait is defined in the scala package and does not require an import.

Note

You can omit the @SerialVersionUID annotation if you are OK with the default ID.

Serialize and deserialize objects in the usual way:

import java.io.*
val fred = Person("Fred")
val out = ObjectOutputStream(FileOutputStream("/tmp/test.ser"))
out.writeObject(fred)
out.close()
val in = ObjectInputStream(FileInputStream("/tmp/test.ser"))
val savedFred = in.readObject().asInstanceOf[Person]
in.close()
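For experiments, it can be convenient to serialize into a byte array instead of a file. The following sketch does a round trip in memory; the Point class and the roundTrip helper are invented for this example.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream,
  ObjectInputStream, ObjectOutputStream}
import scala.util.Using

// Point is a made-up class for this sketch.
@SerialVersionUID(1L) class Point(val x: Int, val y: Int) extends Serializable

// Hypothetical helper: serializes an object into a byte array
// and reads a copy back from those bytes.
def roundTrip[T <: Serializable](obj: T): T =
  val bytes = ByteArrayOutputStream()
  Using.resource(ObjectOutputStream(bytes))(_.writeObject(obj))
  Using.resource(ObjectInputStream(ByteArrayInputStream(bytes.toByteArray))) { in =>
    in.readObject().asInstanceOf[T]
  }

val copy = roundTrip(Point(3, 4))
```

The copy is a distinct object whose x and y values equal those of the original.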

The Scala collections are serializable, so you can have them as members of your serializable classes:

class Person extends Serializable :
  private val friends = ArrayBuffer[Person]() // OK—ArrayBuffer is serializable
  ...

9.8 Process Control A2

Traditionally, programmers use shell scripts to carry out mundane processing tasks, such as moving files from one place to another, or combining a set of files. The shell language makes it easy to specify subsets of files and to pipe the output of one program into the input of another. However, as programming languages, most shell languages leave much to be desired.

Scala was designed to scale from humble scripting tasks to massive programs. The scala.sys.process package provides utilities to interact with shell programs. You can write your shell scripts in Scala, with all the power that the Scala language puts at your disposal.

Here is a simple example:

import scala.sys.process.*
"ls -al ..".!

As a result, the ls -al .. command is executed, showing all files in the parent directory. The result is printed to standard output.

The scala.sys.process package contains an implicit conversion from strings to ProcessBuilder objects. The ! method executes the ProcessBuilder object.

The result of the ! method is the exit code of the executed program: 0 if the program was successful, or a nonzero failure indicator otherwise.

If you use !! instead of !, the output is returned as a string:

val result = "ls -al /".!!

Note

The ! and !! operators were originally intended to be used as postfix operators without the method invocation syntax:

"ls -al /" !!

However, as you will see in Chapter 11, the postfix syntax is being deprecated since it can lead to parsing errors.

You can pipe the output of one program into the input of another, using the #| method:

("ls -al /" #| "grep u").!

Note

As you can see, the process library uses the commands of the underlying operating system. Here, I use bash commands because bash is available on Linux, Mac OS X, and Windows.

To redirect the output to a file, use the #> method:

("ls -al /" #> File("/tmp/filelist.txt")).!

To append to a file, use #>> instead:

("ls -al /etc" #>> File("/tmp/filelist.txt")).!

To redirect input from a file, use #<:

("grep u" #< File("/tmp/filelist.txt")).!

You can also redirect input from a URL:

("grep Scala" #< URL("http://horstmann.com/index.html")).!

You can combine processes with p #&& q (execute q if p was successful) and p #|| q (execute q if p was unsuccessful). But frankly, Scala is better at control flow than the shell, so why not implement the control flow in Scala?
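As a sketch of what such control flow might look like, the following code branches on the exit code of a command. It assumes a Unix-like system where the ls command is available; the names quiet and describe are made up for the example.

```scala
import scala.sys.process.*

// Assumes a Unix-like system with an ls command.
// This ProcessLogger discards both the output and the error messages.
val quiet = ProcessLogger(_ => (), _ => ())

// Hypothetical helper: runs ls on dir and reports whether it succeeded.
def describe(dir: String): String =
  val exitCode = Seq("ls", dir).!(quiet)
  if exitCode == 0 then s"$dir can be listed"
  else s"$dir cannot be listed (exit code $exitCode)"
```

Passing the command as a Seq[String] instead of a single string avoids problems with arguments that contain spaces.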

Note

The process library uses the familiar shell operators | > >> < && ||, but it prefixes them with a # so that they all have the same precedence.

If you need to run a process in a different directory, or with different environment variables, construct a ProcessBuilder with the apply method of the Process object. Supply the command, the starting directory, and a sequence of (name, value) pairs for environment settings:

val p = Process(cmd, File(dirName), ("LC_ALL", myLocale))

Then execute it with the ! method:

("echo 42" #| p).!
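Here is a minimal sketch with concrete values. It assumes a Unix-like system with sh and a /tmp directory; GREETING is a made-up variable name.

```scala
import java.io.File
import scala.sys.process.*

// Assumes a Unix-like system with sh; GREETING is a made-up variable name.
// The shell, not Scala, expands $GREETING when the command runs.
val greeter = Process(Seq("sh", "-c", "echo $GREETING"),
  File("/tmp"), "GREETING" -> "hello")
val greeting = greeter.!!.trim
```

The process runs with /tmp as its working directory and sees GREETING in addition to the inherited environment variables.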

When executing a process command that generates a large amount of output, you can read the output lazily:

val result = "ls -al /".lazyLines // Yields a LazyList[String]

See Chapter 13 for how to process a lazy list.

Note

If you want to use Scala for shell scripts in a UNIX/Linux/MacOS environment, start your script files like this:

#!/bin/sh
exec scala "$0" "$@"
!#
Scala commands

Note

You can also run Scala scripts from Java programs with the scripting integration of the javax.script package. To get a script engine, call

ScriptEngine engine =
  new ScriptEngineManager().getEngineByName("scala"); // This is Java

You need the Scala compiler on the class path. If you use Coursier, you can get the class path as

coursier fetch -p org.scala-lang:scala3-compiler_3:3.2.0

9.9 Regular Expressions

When you process input, you often want to use regular expressions to analyze it. The scala.util.matching.Regex class makes this simple. To construct a Regex object, use the r method of the String class:

val numPattern = "[0-9]+".r

If the regular expression contains backslashes or quotation marks, then it is a good idea to use the “raw” string syntax, """...""". For example:

val wsnumwsPattern = """\s+[0-9]+\s+""".r
  // A bit easier to read than "\\s+[0-9]+\\s+".r

The matches method tests whether a regular expression matches a string:

if numPattern.matches(input) then
  val n = input.toInt
  ...

The entire input must match. To find out whether the string contains a match, turn the regular expression into unanchored mode:

if numPattern.unanchored.matches(input) then
  println("There is a number here somewhere")
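The difference between the two modes can be seen in a short experiment. The pattern name and the sample strings are chosen for this sketch.

```scala
val digitsPattern = "[0-9]+".r

val wholeMatch = digitsPattern.matches("99")                       // true
val partialOnly = digitsPattern.matches("99 bottles")              // false
val containsMatch = digitsPattern.unanchored.matches("99 bottles") // true
```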

The findAllIn method returns an Iterator[String] through all matches. Since you are unlikely to have many matches, you can simply collect the results:

input = "99 bottles, 98 bottles"
numPattern.findAllIn(input).toArray // Yields Array(99, 98)

Note that you don’t need to call unanchored.

To get more information about the matches, call findAllMatchIn to get an Iterator[Match]. Each Match object describes the current match. Use the following methods for the match details:

  • start, end: The starting and ending index of the matching substring

  • matched: The matched substring

  • before, after: The substrings before or after the match

For example:

for m <- numPattern.findAllMatchIn(input) do
  println(s"${m.start} ${m.end}")
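If you want to keep the details instead of printing them, you can map each Match to a tuple. A small sketch, with names chosen for this example:

```scala
val numberPattern = "[0-9]+".r
val text = "99 bottles, 98 bottles"

// One (start, end, matched) triple per match.
val details = numberPattern.findAllMatchIn(text)
  .map(m => (m.start, m.end, m.matched))
  .toList
  // List((0, 2, "99"), (12, 14, "98"))
```

As with substring, the end index is exclusive.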

To find the first match in a string, use findFirstIn or findFirstMatchIn. You get an Option[String] or Option[Match].

val firstMatch = wsnumwsPattern.findFirstIn("99 bottles, 98 bottles")
  // Some(" 98 ")

You can replace the first match, all matches, or some matches. In the last case, supply a function Match => Option[String]. If the function returns Some(str), the match is replaced with str. If it returns None, the match is left unchanged.

numPattern.replaceFirstIn("99 bottles, 98 bottles", "XX")
  // "XX bottles, 98 bottles"
numPattern.replaceAllIn("99 bottles, 98 bottles", "XX")
  // "XX bottles, XX bottles"
numPattern.replaceSomeIn("99 bottles, 98 bottles",
  m => if m.matched.toInt % 2 == 0 then Some("XX") else None)
  // "99 bottles, XX bottles"

Here is a more useful application of the replaceSomeIn method. We want to replace placeholders $0, $1, and so on, in a message string with values from an argument sequence. Make a pattern that matches a placeholder, and then map the digits after the $ to the corresponding sequence element.

val varPattern = """\$[0-9]+""".r
def format(message: String, vars: String*) =
  varPattern.replaceSomeIn(message, m => vars.lift(
    m.matched.tail.toInt))
format("At $1, there was $2 on $0.",
  "planet 7", "12:30 pm", "a disturbance of the force")
   // At 12:30 pm, there was a disturbance of the force on planet 7.

The lift method turns a Seq[String] into a function. The expression vars.lift(i) is Some(vars(i)) if i is a valid index or None if it is not.
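One thing to watch out for: the replacement strings are themselves scanned for $ and \ characters, which refer to groups or escape the next character. If the values can contain those characters, you can protect them with Regex.quoteReplacement. The following variant is a sketch; the names placeholder and safeFormat are made up.

```scala
import scala.util.matching.Regex

val placeholder = """\$[0-9]+""".r

// Like the format function above, but values may contain $ or \.
def safeFormat(message: String, vars: String*): String =
  placeholder.replaceSomeIn(message,
    m => vars.lift(m.matched.tail.toInt).map(Regex.quoteReplacement))

val priced = safeFormat("Total: $0 for $1", "$5.00", "2 items")
  // "Total: $5.00 for 2 items"
```

Without the call to quoteReplacement, the value $5.00 would be interpreted as a reference to group 5, and the replacement would fail.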

9.10 Regular Expression Groups

Groups are useful to get subexpressions of regular expressions. Add parentheses around the subexpressions that you want to extract, for example:

val numitemPattern = "([0-9]+) ([a-z]+)".r

You can get the group contents from a Match object. If m is a Match object, then m.group(i) is the ith group, with groups numbered from 1. The start and end positions of these substrings in the original string are m.start(i) and m.end(i).

for m <- numitemPattern.findAllMatchIn("99 bottles, 98 bottles") do
  println(m.group(1)) // Prints 99 and 98
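The positions of a group can be inspected in the same way. In this sketch, the pattern name and the input string are chosen for illustration.

```scala
val pairPattern = "([0-9]+) ([a-z]+)".r

// findFirstMatchIn yields an Option[Match]; this input is known to match.
val firstPair = pairPattern.findFirstMatchIn("take 99 bottles").get
val number = firstPair.group(1)      // "99"
val word = firstPair.group(2)        // "bottles"
val wordStart = firstPair.start(2)   // 8
val wordEnd = firstPair.end(2)       // 15
```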

Caution

The Match class has methods for retrieving groups by name. However, this does not work with group names inside regular expressions, such as "(?<num>[0-9]+) (?<item>[a-z]+)".r. Instead, one needs to supply names to the r method: "([0-9]+) ([a-z]+)".r("num", "item").

There is another convenient way of extracting group matches. Use a regular expression variable as an “extractor” (see Chapter 14), like this:

val numitemPattern(num, item) = "99 bottles"
  // Sets num to "99", item to "bottles"

When you use a pattern as an extractor, it must match the string from which you extract the matches, and there must be a group for each variable.

If you are not sure whether there is a match, use

str match
  case numitemPattern(num, item) => ...
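A complete function along these lines might look as follows. The helper name parseItem is hypothetical, and a second case handles strings that do not match.

```scala
val itemPattern = "([0-9]+) ([a-z]+)".r

// Hypothetical helper: the quantity and item name,
// provided that the whole string matches the pattern.
def parseItem(str: String): Option[(Int, String)] =
  str match
    case itemPattern(num, item) => Some((num.toInt, item))
    case _ => None
```

For example, parseItem("99 bottles") is Some((99, "bottles")), and parseItem("no bottles") is None.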

To extract groups from multiple matches, you can use a for statement like this:

for numitemPattern(num, item) <- numitemPattern.findAllIn("99 bottles, 98 bottles") do
  process(num, item)

Exercises

1. Write a Scala code snippet that reverses the lines in a file (making the last line the first one, and so on).

2. Write a Scala program that reads a file with tabs, replaces each tab with spaces so that tab stops are at n-column boundaries, and writes the result to the same file.

3. Write a Scala code snippet that reads a file and prints all words with more than 12 characters to the console. Extra credit if you can do this in a single line.

4. Write a Scala program that reads a text file containing only floating-point numbers. Print the sum, average, maximum, and minimum of the numbers in the file.

5. Write a Scala program that writes the powers of 2 and their reciprocals to a file, with the exponent ranging from 0 to 20. Line up the columns:

    1               1
    2               0.5
    4               0.25
  ...               ...

6. Make a regular expression searching for quoted strings "like this, maybe with \" or \\" in a source file. Write a Scala program that prints out all such strings.

7. Write a Scala program that reads a text file and prints all tokens in the file that are not floating-point numbers. Use a regular expression.

8. Write a Scala program that prints the src attributes of all img tags of a web page. Use regular expressions and groups.

9. Write a Scala program that counts how many files with .class extension are in a given directory and its subdirectories.

10. Expand the example in Section 9.7, “Serialization.” Construct a few Person objects, make some of them friends of others, and save an Array[Person] to a file. Read the array back in and verify that the friend relations are intact.
