Reading a text file word by word

The technique presented in this subsection will be demonstrated by the byWord.go file, which is shown in four parts. As you will see in the Go code, separating the words of a line can be tricky. The first part of this utility is as follows:

package main 
 
import ( 
    "bufio" 
    "flag" 
    "fmt" 
    "io" 
    "os" 
    "regexp" 
) 

The second code portion of byWord.go is shown in the following Go code:

func wordByWord(file string) error { 
    var err error 
    f, err := os.Open(file) 
    if err != nil { 
        return err 
    } 
    defer f.Close() 
 
    r := bufio.NewReader(f) 
    for { 
        line, err := r.ReadString('\n') 
        if err == io.EOF { 
            break 
        } else if err != nil { 
            fmt.Printf("error reading file %s\n", err) 
            return err 
        } 

This part of the wordByWord() function is the same as the lineByLine() function of the byLine.go utility.

The third part of byWord.go is as follows:

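        // "\\s" keeps the backslash in the pattern; a bare "\s" is not a
        // valid escape sequence in an interpreted Go string literal.
        // Note that this r shadows the bufio reader for the rest of the loop body.
        r := regexp.MustCompile("[^\\s]+") 
        words := r.FindAllString(line, -1) 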
        for i := 0; i < len(words); i++ { 
            fmt.Println(words[i]) 
        } 
    } 
    return nil 
} 

The remaining code of the wordByWord() function is new: it uses a regular expression to separate the words found in each line of the input. The regular expression defined in the regexp.MustCompile("[^\\s]+") statement matches runs of one or more non-whitespace characters, which means that whitespace characters (spaces, tabs, and newlines) separate one word from another.
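To see the regular expression at work in isolation, consider the following minimal sketch. The splitWords() helper and the sample line are hypothetical and are used for illustration only. The pattern is written as a raw string literal, which avoids doubling the backslash, and it is compiled once at package level instead of once per line:

```go
package main

import (
	"fmt"
	"regexp"
)

// wordRE matches runs of one or more non-whitespace characters.
// It is compiled once, when the package is initialized.
var wordRE = regexp.MustCompile(`[^\s]+`)

// splitWords is a hypothetical helper, not part of byWord.go.
// It returns the whitespace-separated words of line.
func splitWords(line string) []string {
	return wordRE.FindAllString(line, -1)
}

func main() {
	// Made-up sample line with repeated spaces and a trailing newline.
	line := "01/08/18 20:25:09:669 | [INFO]  \n"
	for _, w := range splitWords(line) {
		fmt.Println(w)
	}
}
```

The second argument of FindAllString() is the maximum number of matches to return; a value of -1 asks for all of them.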

The last code segment of byWord.go is as follows:

func main() { 
    flag.Parse() 
    if len(flag.Args()) == 0 { 
        fmt.Printf("usage: byWord <file1> [<file2> ...]\n") 
        return 
    } 
 
    for _, file := range flag.Args() { 
        err := wordByWord(file) 
        if err != nil { 
            fmt.Println(err) 
        } 
    } 
} 

Executing byWord.go will produce the following type of output:

$ go run byWord.go /tmp/adobegc.log
01/08/18
20:25:09:669
|
[INFO]  

You can verify the validity of byWord.go with the help of the wc(1) utility:

$ go run byWord.go /tmp/adobegc.log | wc
    91591   91591  559005
$ wc /tmp/adobegc.log
    4831   91591  583454 /tmp/adobegc.log

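As you can see, the number of words that wc(1) found in the input file (91591) is the same as both the number of lines and the number of words in the output of byWord.go, which is expected because byWord.go prints one word per line. The byte counts differ because the original spacing between the words is not preserved in the output.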
