The most important peculiarity of Word documents is that the data is structured in paragraphs, not in pages. The number of pages is not fixed: it can change with the font size, line spacing, and other layout settings.
Many paragraphs are also typically empty, or contain only newlines, tabs, or other whitespace characters. It is a good idea to check whether a paragraph is empty and skip it if so.
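The check for empty paragraphs can be sketched as a small generator. The name non_empty_paragraphs is hypothetical, and the sketch only assumes that each paragraph exposes a text attribute, as python-docx paragraphs do. Simple stand-in objects are used here so the example runs without a Word file:

```python
from types import SimpleNamespace


def non_empty_paragraphs(paragraphs):
    """Yield only the paragraphs that contain more than whitespace."""
    for paragraph in paragraphs:
        # strip() removes spaces, tabs, and newlines; an empty result
        # means the paragraph has nothing worth processing
        if paragraph.text.strip():
            yield paragraph


# Stand-ins for the paragraphs of a document (illustrative data only)
sample = [SimpleNamespace(text=t) for t in ["Title", "", "\t\n", "Body text"]]
texts = [p.text for p in non_empty_paragraphs(sample)]
print(texts)
```

With python-docx installed, the same function should work on the paragraphs attribute of a document opened with docx.Document.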
In the How to do it… section, step 2 opens the file and step 3 shows how to access the core properties. These are properties that are defined in Word as document metadata, such as the author or creation date.
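As a rough illustration, the metadata can be collected into a dictionary. The helper summarize_core_properties is a hypothetical name; author, title, created, and modified are attributes of the core_properties object in python-docx, but the stand-in object and its values below are made up so the sketch runs without a file:

```python
from datetime import datetime
from types import SimpleNamespace


def summarize_core_properties(core_properties,
                              names=("author", "title", "created", "modified")):
    """Collect selected metadata fields into a plain dictionary."""
    # getattr with a default keeps the sketch safe if a field is missing
    return {name: getattr(core_properties, name, None) for name in names}


# Stand-in for doc.core_properties (illustrative values only)
sample_properties = SimpleNamespace(
    author="Jane Doe",
    title="",
    created=datetime(2018, 1, 1, 12, 0),
    modified=None,
)
metadata = summarize_core_properties(sample_properties)
print(metadata["author"], metadata["created"])
```

Fields that were never filled in are commonly empty strings or None, so code that consumes them should handle both cases.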
The paragraphs of the document can be iterated over and their raw text extracted, as shown in step 6. This text carries no styling information, and it is typically the most useful form for processing the data automatically.
If the styling information is required, the runs can be used, as in steps 7 and 8. Each paragraph can contain one or more runs, which are smaller units of text that share the same styling. For example, if a sentence is Word1 word2 word3, with Word1 in italics, word2 underlined, and word3 in bold, there will be three runs, one per style. In addition, the whitespace between the words may be stored in intermediate runs of regular text, making a total of five runs.
The styling can be detected individually on properties such as bold, italic, or underline.
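The five-run example can be sketched as follows. The helper run_styles is hypothetical, and the stand-in objects mimic only the text, bold, italic, and underline attributes of a python-docx run. In python-docx these properties may be None when the style is not set explicitly and is inherited instead, so the sketch treats None the same as False:

```python
from types import SimpleNamespace


def run_styles(run):
    """Return the names of the styles explicitly enabled on a run."""
    # None (inherited) and False are both treated as "not enabled" here
    return [name for name in ("bold", "italic", "underline")
            if getattr(run, name)]


def make_run(text, **styles):
    """Build a stand-in run; unspecified styles default to None."""
    values = {"bold": None, "italic": None, "underline": None}
    values.update(styles)
    return SimpleNamespace(text=text, **values)


# "Word1 word2 word3" split into five runs, as described above
runs = [
    make_run("Word1", italic=True),
    make_run(" "),
    make_run("word2", underline=True),
    make_run(" "),
    make_run("word3", bold=True),
]
described = [(run.text, run_styles(run)) for run in runs]
for text, styles in described:
    print(repr(text), styles)
```

Joining the text of all the runs gives back the raw text of the paragraph.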
Be aware that, because Word documents are produced by so many different sources, many properties may not be set, leaving it to the tool that opens the document to decide which values to use. For example, it is very common to keep the default font, which may mean that the font information is left empty.
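One way to cope with empty font information is to fall back from the run to the paragraph style, and then to a placeholder. This is only a sketch under assumptions: effective_font_name and its fallback text are hypothetical, the stand-ins mimic the run.font.name and paragraph.style.font.name attributes of python-docx, and the chain of base styles is deliberately not followed:

```python
from types import SimpleNamespace


def effective_font_name(run, paragraph, fallback="(application default)"):
    """Return the first font name set on the run or its paragraph style."""
    # A value of None means the font was not set explicitly at that level
    for name in (run.font.name, paragraph.style.font.name):
        if name:
            return name
    # Nothing was set: the application opening the file decides the font
    return fallback


def with_font(name):
    """Build a stand-in object exposing a font.name attribute."""
    return SimpleNamespace(font=SimpleNamespace(name=name))


styled_paragraph = SimpleNamespace(style=with_font("Calibri"))
plain_paragraph = SimpleNamespace(style=with_font(None))

explicit = effective_font_name(with_font("Arial"), styled_paragraph)
inherited = effective_font_name(with_font(None), styled_paragraph)
unset = effective_font_name(with_font(None), plain_paragraph)
print(explicit, inherited, unset)
```

The same pattern of checking for an unset value before using it applies to other optional properties, such as the font size.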