6. Text Manipulation

Moshe Zadka¹

(1)

Belmont, CA, USA

Automation of UNIX-based systems often involves text manipulation. Many programs are configured with textual configuration files. Text is the output format, and the input format, of many systems. While tools like sed, grep, and awk have their place, Python is a powerful tool for sophisticated text manipulation.

6.1 Bytes, Strings, and Unicode

When manipulating text or text-like streams, it is easy to write code that fails in funny ways when encountering a foreign name or an emoji. These are no longer mere theoretical concerns: you will have users from the entire world, who insist on their usernames reflecting how they spell their names. You will have people who write git commits with emojis in them. In order to write robust code, which does not fail in ways that, to be fair, seem a lot less funny when they cause a 3 a.m. page, it is important to understand that “text” is a subtle thing.

You can understand the distinction, or you can wake up at 3 a.m. when someone tries to log in with an emoji username.

Python 3 has two distinct types that both represent the kind of things that are often in UNIX “text” files: bytes and strings. Bytes correspond to what RFCs usually refer to as an “octet-stream.” This is a sequence of values that fit into 8 bits, or in other words, a sequence of numbers that are in the range 0 to 256 (including 0 and not including 256). When all of these values are below 128, we call the sequence “ASCII” (American Standard Code for Information Interchange) and assign to the numbers the meaning ASCII has assigned them. When all of these values are between 32 and 127 (including 32 and not including 127), we call the sequence “printable ASCII,” or “ASCII text.” The first 32 characters are sometimes called “control characters” (127, the delete character, is also one). The “Ctrl” key on keyboards is a reference to that – its original purpose was to be able to input those characters.

ASCII only encompasses the English alphabet, used in “America.” In order to represent text in (almost) any language, we have Unicode. Unicode code points are (some of the) numbers between 0 and 0x110000 (including 0 and not including 0x110000). Each Unicode code point is assigned a meaning. Successive versions of the standard leave assigned meanings as is, but add meanings to more numbers. An example is the addition of more emojis. The International Organization for Standardization, ISO, ratifies versions of Unicode in its 10646 standard. For this reason, Unicode is sometimes called ISO-10646.

Unicode points that are also ASCII have the same meaning – if ASCII assigns a number “uppercase A,” then so does Unicode.
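As a quick check of this overlap, the sketch below compares the Unicode code point of a character with the value ASCII assigns it. The variable names are only for illustration.

```python
# Code point of "A" in Unicode, and the value ASCII assigns to it.
code_point = ord("A")
ascii_value = "A".encode("ascii")[0]
print(code_point, ascii_value)

# A code point far outside the ASCII range.
snowman_code_point = ord("\N{SNOWMAN}")
print(hex(snowman_code_point))
```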

Properly speaking, only Unicode is “text.” This is what Python strings represent. Converting bytes to strings, or vice versa, is done with an encoding. The most popular encoding these days is UTF-8. Confusingly, turning the bytes to text is “decoding.” Turning the text to bytes is “encoding.”

Remembering the difference between encoding and decoding is crucial in order to manipulate textual data. A way to remember it is that since UTF-8 is an encoding, moving from strings to UTF-8 encoded data is “encoding,” while moving from UTF-8 encoded data to strings is “decoding.”
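A small round trip may help fix the direction of each operation in memory. The sample text here is made up for the example; any string with non-ASCII characters would do.

```python
# A str holds Unicode code points; bytes holds raw 8-bit values.
name = "Zo\u00eb \N{SNOWMAN}"  # hypothetical sample text

# str -> bytes is "encoding"
encoded = name.encode("utf-8")

# bytes -> str is "decoding"
decoded = encoded.decode("utf-8")

# The byte length differs from the character length,
# because non-ASCII characters take more than one byte in UTF-8.
print(len(name), len(encoded))
print(decoded == name)
```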

UTF-8 has an interesting property: when given a Unicode string that happens to be ASCII, it will produce bytes with the values of the code points. This means that “visually,” the encoded and decoded form will look the same.

>>> "hello".encode("utf-8")
b'hello'
>>> "hello".encode("utf-16")
b'\xff\xfeh\x00e\x00l\x00l\x00o\x00'

We show the example with UTF-16 to show that this is not a trivial property of encodings. Another property of UTF-8 is that if the bytes are not ASCII, and UTF-8 decoding of the bytes succeeds, then it is unlikely that they were encoded with a different encoding. This is because UTF-8 was designed to be self-synchronizing: starting at a random byte, it is possible to synchronize with the string after checking a limited number of bytes. Self-synchronization was designed to allow recovery from truncation and corruption, but as a side benefit, it allows detecting invalid characters reliably, and thus detecting whether the string was UTF-8 to begin with.

This means “try decoding with UTF-8” is a safe operation; it will do the right thing for ASCII-only texts, and it will, of course, work for UTF-8 encoded texts and will fail cleanly for things that are neither ASCII nor UTF-8 encoded – either text in a different encoding or a binary format such as JPEG.

For Python, fails cleanly means “throws an exception.”

>>> snowman = '\N{snowman}'
>>> snowman.encode('utf-16').decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

For random data, this will also tend to fail:

>>> import random, struct
>>> struct.pack('B'*12,
...             *(random.randrange(0, 256)
...               for i in range(12))
... ).decode('utf-8')

The errors are random, since the inputs are random. Some example errors might be:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 4: invalid continuation byte

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x98 in position 2: invalid start byte

It is a good exercise to try and run this a few times; it will almost never succeed.
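The “try decoding with UTF-8” approach can be wrapped in a small helper. This is a minimal sketch; the function name decode_if_utf8 is an assumption for illustration, not something from a library.

```python
def decode_if_utf8(data):
    """Return the decoded text, or None if data is not valid UTF-8."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None


# ASCII-only bytes and UTF-8 encoded bytes both decode.
print(decode_if_utf8(b"plain ascii"))
print(decode_if_utf8("\N{SNOWMAN}".encode("utf-8")))

# Text in another encoding, or binary data, fails cleanly.
print(decode_if_utf8("\N{SNOWMAN}".encode("utf-16")))
print(decode_if_utf8(b"\xff\xd8\xff\xe0"))  # typical first bytes of a JPEG file
```

Returning None rather than raising is a design choice for the sketch; callers that need to know why decoding failed may prefer to let the exception propagate.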

6.2 Strings

The Python string object is subtle. From one perspective it appears to be a sequence of characters, where a character is a string of length 1.

>>> a="hello"
>>> for i, x in enumerate(a):
...     print(i, x, len(x))
...
0 h 1
1 e 1
2 l 1
3 l 1
4 o 1

The string “hello” has five elements, each of which is a string of length 1. Since the string is a sequence, the usual sequence operations work on it.

We can create a slice by specifying both endpoints:

>>> a[2:4]

'll'

or just the end:

>>> a[:2]

'he'

or just the beginning:

>>> a[3:]

'lo'

We can also use negative indices to count from the end:

>>> a[:-3]

'he'

And of course, we can reverse a string by specifying an extended slice with a negative step:

>>> a[::-1]

'olleh'

However, strings also have quite a few methods that are not part of the general sequence interface and are useful when analyzing text.

The startswith and endswith methods are useful, since text analysis is often around the ends.

>>> "hello world".endswith("world")

True

A little-known feature is that endswith allows a tuple of strings and will check if it ends with any of these strings:

>>> "hello world".endswith(("universe", "world"))

True

An example where it comes in useful is testing for a few common endings:

>>> filename.endswith((".tgz", ".tar.gz"))

We can easily test here whether a file has either of the common suffixes for a gzipped tarball: either the tgz or tar.gz suffix.

The strip and split methods are useful for parsing the kind of ad hoc formats that many UNIX files or utilities come in. For example, the file /etc/fstab contains static mounts.

with open("/etc/fstab") as fpin:
    for line in fpin:
        line = line.rstrip('\n')
        line = line.split('#', 1)[0]
        if not line:
            continue
        device, path, fstype, options, freq, passno = line.split()
        print(f"Mounting {device} on {path}")

This parses the file and prints a summary. The first line in the loop strips out the newline. The rstrip method strips from the right (the end) of the string.

Note that rstrip, as well as strip, accept a sequence of characters to remove. This means that passing a string to rstrip means “any of the characters in the string” and not “remove occurrences of this string.” This does not affect one-character arguments to rstrip, but it does mean that longer strings are almost always a mistaken use.

We then remove comments, if any, and skip empty lines. For any line that is not empty, we use split with no argument, which splits on any run of whitespace. Conveniently, this convention is common to several formats, and the correct handling is built into the specification of split.

Lastly, we use a format string to format the output for easy consumption.

This is a typical usage of string parsing, and it is the kind of code that replaces long pipelines in shell.
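Returning to the note about rstrip accepting a set of characters: the sketch below shows how a multi-character argument can remove more than intended. The file name and the helper remove_suffix are hypothetical, written only for this illustration.

```python
# rstrip treats ".tgz" as the set of characters {".", "t", "g", "z"},
# so it keeps stripping while the last character is any of them.
stripped = "config.tgz".rstrip(".tgz")
print(stripped)


def remove_suffix(text, suffix):
    """Remove suffix from the end of text only if it is really there."""
    if suffix and text.endswith(suffix):
        return text[:-len(suffix)]
    return text


print(remove_suffix("config.tgz", ".tgz"))
```

On Python 3.9 and above, the built-in str.removesuffix method does the same job as the helper.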

Finally, the join method on a string uses it as a “glue” and glues together an iterable of strings.

The simple example of ' '.join(["hello", "world"]) will return "hello world", but this is only scratching the surface of join. Since it accepts an iterable, we have the ability to pass it anything that supports iteration.

>>> names=dict(hello=1,world=2)
>>> ' '.join(names)
'hello world'

Since iterating over a dictionary yields its keys, passing it to join means that we get a string of the keys, joined together.

We can also pass in a generator:

>>> '-*-'.join(str(x) for x in range(3))
'0-*-1-*-2'

This allows calculating sequences on the fly and joining them, without the need to have intermediate storage for the sequence.

The usual question about join is why it is a method on the “glue” string rather than a method on sequences. The reason is exactly this: we can pass in any iterable, and the glue string will glue in the bits in it.

Note that join does nothing to single-element iterables:

>>> '-*-'.join(str(x) for x in range(1))
'0'

6.3 Regular Expressions

Regular expressions are a special DSL for specifying properties of strings, also called “patterns.” They are common in many utilities, although each implementation will have its own idiosyncrasies. In Python, regular expressions are implemented by the re module. It fundamentally allows two modes of interaction – one where regular expressions are auto-parsed at the time of text analysis, and one where they are parsed in advance.

In general, the latter style is preferred. Auto-parsing the regular expression is suited only to an interactive loop, where they will be used quickly and forgotten. For this reason, we will not really cover this usage here.

In order to compile a regular expression, we use re.compile. This function returns a regular expression object that will look for strings that match the expression. The object can be used to do several things: for example, find one match, find all matches, or even replace the matches.

The regular expression mini-language has a lot of subtlety. Here, we will cover only the basics that we need to illustrate how to use regular expressions effectively.

Most characters stand in for themselves. The regular expression hello, for example, matches exactly hello. The . stands in for any character. So hell. would match hello and hella, but not hell – since the latter does not have any character corresponding to the .. Square brackets delimit “character classes”: for example, wom[ae]n matches both women and woman. Character classes can also have ranges in them – [0-9] matches any digit, [a-z] matches any lowercase character, and [0-9a-fA-F] matches any hexadecimal digit (hexadecimal digits and numbers pop up a lot in many places, since two hexadecimal digits correspond exactly to a standard byte).

We also have the “repeat modifiers” that modify the expression that precedes them. For example, ba?b matches both bb and bab – the ? stands for “zero or one.” The * stands for any number: so ba*b matches bb, bab, baab, baaab, and so on. If we want “at least one,” ba+b will match everything that ba*b matches, except for bb. Finally, we have the exact counters: ba{3}b matches baaab while ba{1,2}b matches bab and baab and nothing else.

In order to make a special character (like . or *) match itself, we prefix it with a backslash. Since in Python strings, backslash has other meanings, Python supports “raw” strings. While we can use any string to denote a regular expression, often raw strings are easier.

For example, we want a DOS-like filename regular expression: r"[^.]{1,8}\.[^.]{0,3}". Matched against the whole string, this will accept, say, readme.txt but not archive.tar.gz. Note that to match a literal . we escaped it with a backslash. Also note that we used an interesting character class: [^.]. This means “anything except .”: the ^ means “exclude” inside of a character class.
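A short sketch of the DOS-like pattern in use. It assumes the whole file name must match, so it uses the fullmatch method of the compiled object; the helper name is_dos_name and the sample names are made up for the example.

```python
import re

# Up to eight non-dot characters, a literal dot, up to three non-dot characters.
DOS_NAME = re.compile(r"[^.]{1,8}\.[^.]{0,3}")


def is_dos_name(filename):
    """Return True if the entire file name fits the DOS-like pattern."""
    return DOS_NAME.fullmatch(filename) is not None


for candidate in ["readme.txt", "archive.tar.gz", "verylongname.txt"]:
    print(candidate, is_dos_name(candidate))
```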

Regular expressions also support grouping. Grouping does two things: it allows addressing parts of the expression, and it allows treating a part of the expression as a single object in order to apply one of the repeat operations to it. If only the latter is needed, this is a “non-capture” group, denoted by (?:...).

For example, (?:[a-z]{2,5}-){1,4}[0-9] will match hello-3 or hello-world-5 but not a-hello-2 (since the first part is shorter than two characters) or hello-world-this-is-too-long-7 (since it is made up of six repetitions of the inner pattern, and we specified a maximum of four).

This allows arbitrary nesting; for example (?:(?:[a-z]{2,5}-){1,4}[0-9];)+ matches any sequence of the previous pattern in which every item is terminated by a semicolon: for example az-2;hello-world-5; will match but this-is-3;not-good-match-6 will not, since it is missing the ; at the end.
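The grouping examples above can be checked with a few lines of Python. This sketch assumes the whole string is expected to match, and so uses fullmatch; the variable names are chosen for the illustration.

```python
import re

ITEM = r"(?:[a-z]{2,5}-){1,4}[0-9]"
single = re.compile(ITEM)
# Nest the item pattern inside another non-capture group.
sequence = re.compile(r"(?:" + ITEM + r";)+")

for text in ["hello-3", "hello-world-5", "a-hello-2",
             "hello-world-this-is-too-long-7"]:
    print(text, single.fullmatch(text) is not None)

for text in ["az-2;hello-world-5;", "this-is-3;not-good-match-6"]:
    print(text, sequence.fullmatch(text) is not None)
```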

This is a good example of how complex regular expressions can get. It is easy to use this dense mini-language inside Python to specify constraints on strings that are hard to understand.

Once we have a regular expression object, there are two main methods on it: match and search. The match method will look for matches at the beginning of the string, while search will look for the first match, wherever it may start. When they find a match, they return a match object.

>>> import re
>>> reobj = re.compile('ab+a')
>>> m = reobj.search('hello abba world')
>>> m
<re.Match object; span=(6, 10), match='abba'>
>>> m.group()
'abba'

The first method that is often used is .group(), which returns the part of the string matched. This method can get a part of the match, if the regular expression contained capturing groups. A capturing group is usually marked with ().

>>> reobj = re.compile('(a)(b+)(a)')
>>> m = reobj.search('hello abba world')
>>> m.group()
'abba'
>>> m.group(1)
'a'
>>> m.group(2)
'bb'
>>> m.group(3)
'a'

When the number of groups is significant, or when modifying the group, managing the indices to the group can prove to be a challenge. If analysis of the groups is needed, we can also name the groups.

>>> reobj = re.compile('(?P<prefix>a)(?P<body>b+)(?P<suffix>a)')
>>> m = reobj.search('hello abba world')
>>> m.group('prefix')
'a'
>>> m.group('body')
'bb'
>>> m.group('suffix')
'a'
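Returning to the difference between match and search described earlier, the following sketch runs both on the same text. It reuses the pattern from the examples above.

```python
import re

pattern = re.compile("ab+a")
text = "hello abba world"

# match only looks at the beginning of the string, so it finds nothing here.
at_start = pattern.match(text)
print(at_start)

# search scans forward and finds the first occurrence, wherever it starts.
anywhere = pattern.search(text)
print(anywhere.span())

# match succeeds when the string begins with the pattern.
print(pattern.match("abba world").group())
```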

Since regular expressions can get dense, there is a way to make them a bit easier to read: the verbose mode.

>>> reobj = re.compile(r"""
... (?P<prefix>a) # The beginning -- always an a
... (?P<body>b+)  # The middle -- any number of b, for emphasis
... (?P<suffix>a) # An a at the end to properly anchor
... """, re.VERBOSE)
>>> m = reobj.search("hello abba world")
>>> m.groups()
('a', 'bb', 'a')
>>> m.group('prefix'), m.group('body'), m.group('suffix')
('a', 'bb', 'a')

When compiling regular expressions with the flag re.VERBOSE, whitespace is ignored, and Python-like comments, from # to the end of the line, are also ignored. In order to match a space or #, they need to be backslash escaped.

This allows writing long regular expressions while still making them easier to understand with judicious line breaks, spaces, and comments.
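As a slightly more realistic sketch of verbose mode, the pattern below splits a log line into named parts. The log format (a date, a level, and a message separated by spaces) is an assumption made up for this example. Note the backslash-escaped space, needed because plain whitespace is ignored.

```python
import re

LOG_LINE = re.compile(r"""
    (?P<date>[0-9]{4}-[0-9]{2}-[0-9]{2})  # calendar date
    \ +                                   # one or more literal spaces
    (?P<level>[A-Z]+)                     # severity, in capital letters
    \ +                                   # one or more literal spaces
    (?P<message>.*)                       # the rest of the line
""", re.VERBOSE)

m = LOG_LINE.match("2024-01-05 ERROR disk is full")
fields = m.groupdict()
print(fields)
```

The groupdict method returns all named groups at once, which avoids repeated calls to group.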

Regular expressions are loosely based on the mathematical theory of finite automata. While they do go beyond the constraints of what finite automata can match, they are not fully general. Among other things, they are poorly suited for nested patterns: whether the task is matching parentheses or HTML elements, regular expressions are not a good fit.
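For nested structures, a few lines of ordinary code are often clearer than a pattern. The following is a minimal sketch, not a general parser: it only checks whether round parentheses are balanced, and the function name is hypothetical.

```python
def parens_balanced(text):
    """Return True if every "(" has a matching ")" in the right order."""
    depth = 0
    for char in text:
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
            if depth < 0:
                # A closing parenthesis appeared before its opening one.
                return False
    return depth == 0


print(parens_balanced("(a(b)c)"))
print(parens_balanced("(()"))
print(parens_balanced(")("))
```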

6.4 JSON

JSON is a hierarchical file format that has the advantage of being simple to parse, and reasonably easy to read and write by hand. It has its origins on the web: the name stands for “JavaScript Object Notation.” Indeed, it is still popular on the internet; one reason to care about JSON is that many web APIs use JSON as a transfer format.

It is also useful, however, in other places. For example, in JavaScript projects, package.json includes the dependencies of this project. Parsing this is often useful to determine third-party dependencies for security or compliance audits, for example.

In theory, JSON is a format defined in Unicode, not bytes. When serializing, it takes a data structure and transforms it into a Unicode string, and when deserializing, it takes a Unicode string and returns a data structure. Recently, however, the standard was amended to specify a preferred encoding: utf-8. With this addition, now the format is also defined as a byte stream.

However, note that in some use cases, the encoding is still separate from the format. In particular, when sending or receiving JSON over HTTP, the HTTP encoding is the ultimate truth. Even then, though, when no encoding is explicitly specified, UTF-8 should be assumed.

JSON is a simple serialization format, only supporting a few types:

Strings
Numbers
Booleans
A null type
Arrays of JSON values
“Objects”: dictionaries mapping strings to JSON values

Note that JSON does not fully specify numerical ranges or precision. If precise integers are required, usually the range -2**53 to 2**53 can be assumed to be representable precisely.
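The 2**53 limit comes from parsers that store every number as a double-precision float. The sketch below shows that Python itself keeps large integers exact, while converting to a float loses the distinction just past the limit.

```python
import json

limit = 2 ** 53

# Python's json module parses integer notation into an exact int.
parsed = json.loads(str(limit + 1))
print(parsed)

# As floats, limit and limit + 1 are indistinguishable.
print(float(limit) == float(limit + 1))

# Just below the limit, neighboring integers are still distinct floats.
print(float(limit - 1) == float(limit))
```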

Although the Python json library has the ability to read/write directly to files, in practice we almost always separate the tasks; we read as much data as we need and pass the string directly to JSON.

The two functions that are the most important in the json module are loads and dumps. The s at the end stands for “string,” which is what those functions accept and return.

>>> thing = [{"hello": 1, "world": 2}, None, True]
>>> json.dumps(thing)
'[{"hello": 1, "world": 2}, null, true]'
>>> json.loads(_)
[{'hello': 1, 'world': 2}, None, True]

The None object in Python maps to the JSON null object, booleans in Python map to booleans in JSON, and numbers and strings map to numbers and strings. Note that the Python JSON parsing library decides whether a number should map to an integer or a float based on the notation it uses:

>>> json.loads("1")
1
>>> json.loads("1.0")
1.0

It is important to remember not all JSON loading libraries make the same decision, and in some cases, this can lead to interoperability problems.
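Tying loads back to the package.json use case mentioned earlier, here is a minimal sketch that lists dependency names from the text of such a file. The function name and the sample content are made up for the example; it assumes the usual "dependencies" and "devDependencies" keys and ignores anything else.

```python
import json


def list_dependencies(package_json_text):
    """Return the sorted names of all declared dependencies."""
    data = json.loads(package_json_text)
    names = set(data.get("dependencies", {}))
    names.update(data.get("devDependencies", {}))
    return sorted(names)


# Hypothetical file content, already read into a string.
sample = """
{
    "name": "demo",
    "dependencies": {"left-pad": "^1.3.0"},
    "devDependencies": {"eslint": "^8.0.0"}
}
"""
print(list_dependencies(sample))
```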

For debugging reasons, it is often useful to be able to “pretty print” JSON. The dumps function can do that, with some extra arguments. The usual set of arguments for pretty printing is the following:

json.dumps(thing, sort_keys=True, indent=4)

If we want to round-trip into an equivalent, but pretty version, we can even do this:

json.dumps(json.loads(encoded_string), sort_keys=True, indent=4)

Finally, at the command line, the json.tool module will do this automatically:

$ python -m json.tool < somefile.json | less

This is an easy way to scan through dumped JSON and look for interesting information.

Note that with Python 3.7 and above, sort_keys should be used judiciously; since all dictionaries are ordered by insertion, not using sort_keys will keep the original order in the dictionary.
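The effect of sort_keys can be seen on a small dictionary whose insertion order is not alphabetical. The data here is made up for the illustration.

```python
import json

data = {"b": 1, "a": 2}

# Without sort_keys, the insertion order of the dictionary is kept.
in_order = json.dumps(data)
print(in_order)

# With sort_keys, keys are emitted in sorted order.
by_key = json.dumps(data, sort_keys=True)
print(by_key)
```

Sorted output is convenient when comparing two dumps with a diff tool; insertion order is convenient when the original order carries meaning.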

One type frequently missed in JSON is a date-time type. Usually date-times are represented as strings, and this is the most common reason to need a “schema” to parse JSON against: to know which strings to convert to a datetime object.
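A very small “schema” can be as simple as a set of key names known to hold date-times. The sketch below assumes ISO 8601 strings in a top-level JSON object; the key name created_at and the function load_event are hypothetical.

```python
import datetime
import json

# Hypothetical schema: keys whose values are ISO 8601 date-time strings.
DATETIME_KEYS = {"created_at"}


def load_event(text):
    """Parse a JSON object and convert known date-time strings."""
    data = json.loads(text)
    for key in DATETIME_KEYS:
        if key in data:
            data[key] = datetime.datetime.fromisoformat(data[key])
    return data


event = load_event('{"user": "alice", "created_at": "2018-10-10T07:00:00"}')
print(event["created_at"].year)
```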

6.5 CSV

The CSV format has a few advantages. It is constrained: it always represents scalar types in a two-dimensional array. For this reason, there are not a lot of surprises that can go in. In addition, it is a format that imports natively into spreadsheet applications like Microsoft Excel or Google Sheets. This comes in handy when preparing reports.

Examples of such reports are a breakdown of expenses for third-party services for the finance department, or a report for management on incidents handled and time to recovery. In all these cases, having a format that is easy to produce and import into spreadsheet applications allows for easy automation of the task.

Writing CSV files is done with csv.writer. A typical example involves serializing a homogeneous array, an array of things with the same type.

import attr

@attr.s(frozen=True, auto_attribs=True)
class LoginAttempt:
    username: str
    time_stamp: int
    success: bool

This class represents a login attempt by some user, at a given time, and with a record of the success of the attempt. For a security audit, we need to send the auditors an Excel file of the login attempts.

import csv

def write_attempts(attempts, fname):
    # newline='' lets the csv module control line endings itself
    with open(fname, 'w', newline='') as fpout:
        writer = csv.writer(fpout)
        writer.writerow(['Username', 'Timestamp', 'Success'])
        for attempt in attempts:
            writer.writerow([
                attempt.username,
                attempt.time_stamp,
                str(attempt.success),
            ])

Note that by convention, the first row should be a “title row.” Though the Python API does not enforce it, it is highly recommended to follow this convention. In this example, we first wrote a “title row” with the names of the fields.

Then we looped through the attempts. Note that CSV can only represent strings and numbers, so instead of relying on thinly documented standards on how a boolean will be written out, we have done so explicitly.

This way, if the auditor asks for that field to be “yes/no,” we can change our explicit serialization step to match.

When it comes to reading CSV files, there are two main approaches.

Using csv.reader will return an iterator that yields the parsed rows one by one, each as a list. However, assuming the convention about the first row being the names of fields has been followed, csv.DictReader will consume the first row as the field names, and yield a dictionary for every subsequent row, using the field names as keys. This enables more robust parsing in the face of end users adding fields or changing their order.

>>> reader = csv.DictReader(fileobj)
>>> list(reader)
[OrderedDict([('Username', 'alice'),
              ('Timestamp', '1514793600.0'),
              ('Success', 'False')]),
 OrderedDict([('Username', 'bob'),
              ('Timestamp', '1539154800.0'),
              ('Success', 'True')])]

Reading the same CSV that we have written in the previous example will yield reasonable results. The dictionary maps the field names to the values. It is important to note that the types have all been forgotten, and everything is returned as a string. Unfortunately, CSV does not keep type information.
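Since the types are lost, the reading side has to restore them explicitly. The following is a minimal sketch under the assumption that the file uses the column names and value formats shown above; the function name read_attempts is hypothetical, and an in-memory file stands in for a real one.

```python
import csv
import io


def read_attempts(fileobj):
    """Yield one dictionary per row, with values converted back to types."""
    for row in csv.DictReader(fileobj):
        yield {
            "username": row["Username"],
            # The timestamp may be written with a decimal point.
            "time_stamp": int(float(row["Timestamp"])),
            # The writer used str(bool), so compare against "True".
            "success": row["Success"] == "True",
        }


sample = io.StringIO(
    "Username,Timestamp,Success\r\n"
    "alice,1514793600.0,False\r\n"
    "bob,1539154800.0,True\r\n"
)
attempts = list(read_attempts(sample))
print(attempts)
```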

It is sometimes tempting to just “improvise” parsing CSV files with .split. However, CSV has quite a few corner cases that are not readily apparent.

For example,

1,"Miami, FL","he""llo"

is properly parsed as

('1', 'Miami, FL', 'he"llo')

For the same reason, it is a good idea to avoid writing CSV files using anything other than csv.writer.
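To see the corner case concretely, this sketch parses the example line above both ways: with a naive split and with csv.reader.

```python
import csv
import io

line = '1,"Miami, FL","he""llo"'

# Splitting on commas breaks the quoted field and keeps the quote characters.
naive = line.split(",")
print(naive)

# csv.reader understands quoting and doubled quotes.
parsed = next(csv.reader(io.StringIO(line)))
print(parsed)
```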

6.6 Summary

Much of the content that is needed for many DevOps tasks arrives as text: logs, JSON dumps of data structures, or a CSV file of paid licenses. Understanding what “text” is and how to manipulate it in Python allows much of the automation that is the cornerstone of DevOps, be it through build automation, monitoring result analysis, or just preparing summaries for easy consumption by others.
