How it works...

The most important peculiarity of Word documents is that the data is structured in paragraphs rather than in pages. The font size, line spacing, and other layout settings can change the number of pages, so page boundaries are not a reliable reference.

Many paragraphs are also typically empty, or contain only new lines, tabs, or other whitespace characters. It is a good idea to check whether a paragraph is empty and skip it.
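The check can be sketched as a small generator. This is a minimal sketch, not the recipe's own code: the helper name non_empty_paragraphs is hypothetical, and the stand-in objects only mimic the text attribute of a paragraph so that the snippet runs without a Word file. With a real document, the paragraphs collection of the opened file would be passed instead.

```python
from types import SimpleNamespace


def non_empty_paragraphs(paragraphs):
    """Yield only the paragraphs that contain non-whitespace text."""
    for paragraph in paragraphs:
        # str.strip() removes spaces, tabs, and new lines; an empty
        # result means the paragraph has no visible text.
        if paragraph.text.strip():
            yield paragraph


# Stand-in paragraphs with hypothetical content (assumption: each
# paragraph exposes its raw text through a .text attribute).
sample = [
    SimpleNamespace(text='A title'),
    SimpleNamespace(text=''),
    SimpleNamespace(text='\t\n  '),
    SimpleNamespace(text='Some body text'),
]

texts = [paragraph.text for paragraph in non_empty_paragraphs(sample)]
print(texts)
```

Using a generator keeps the filter lazy, which is convenient for long documents where only the first few meaningful paragraphs are needed.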

In the How to do it… section, step 2 opens the file and step 3 shows how to access the core properties. These are properties that are defined in Word as document metadata, such as the author or creation date.

This information needs to be taken with a grain of salt, as many tools other than Microsoft Office that produce Word documents won't necessarily fill it in. Double-check before relying on that information.
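One defensive way to read the metadata is to treat missing or blank values as absent. The following is only a sketch: the helper safe_metadata is hypothetical, the field names author, created, and title are assumed to match the attributes of the core properties object, and a stand-in object replaces a real document so the snippet runs on its own.

```python
from types import SimpleNamespace


def safe_metadata(core_properties, fields=('author', 'created', 'title')):
    """Return a dict of metadata, with None for missing or blank fields."""
    result = {}
    for field in fields:
        value = getattr(core_properties, field, None)
        # Tools that do not fill in the metadata may leave an empty
        # string; normalise that to None so callers test one case only.
        if isinstance(value, str) and not value.strip():
            value = None
        result[field] = value
    return result


# Stand-in for the core properties of a document produced by a tool
# that left most of the metadata empty (hypothetical values).
props = SimpleNamespace(author='', created=None, title='Report')

metadata = safe_metadata(props)
print(metadata)
```

Callers can then decide on a fallback, such as using the file's modification time when created is None.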

The paragraphs of the document can be iterated over and their raw text extracted, as shown in step 6. This text carries no styling information, and it is typically the most useful form for processing the data automatically.

If the styling information is required, the runs can be used, as in steps 7 and 8. Each paragraph can contain one or more runs, which are smaller units that share the same styling. For example, if a sentence is Word1 word2 word3, there will be three runs, one with italic text (Word1), another with underlined text (word2), and another with bold text (word3). Moreover, there can be intermediate runs of regular text that contain just the whitespace, making a total of five runs.

The styling of each run can be detected individually through properties such as bold, italic, or underline.
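The five-run example can be sketched as follows. This section does not name the library, so the code assumes the python-docx convention in which each run exposes bold, italic, and underline attributes that are True, False, or None (None meaning the value is inherited rather than set on the run). The helper active_styles is hypothetical, and stand-in objects replace real runs so the snippet is self-contained.

```python
from types import SimpleNamespace

STYLE_ATTRIBUTES = ('bold', 'italic', 'underline')


def active_styles(run):
    """Return the names of the style attributes set directly on a run."""
    # None (inherited) and False are both reported as "not active";
    # any other value is treated as active in this sketch.
    return tuple(name for name in STYLE_ATTRIBUTES
                 if getattr(run, name, None))


def make_run(text, **styles):
    """Build a stand-in run; unspecified styles default to None."""
    values = {name: styles.get(name) for name in STYLE_ATTRIBUTES}
    return SimpleNamespace(text=text, **values)


# The sentence "Word1 word2 word3" split into five runs.
sample_runs = [
    make_run('Word1', italic=True),
    make_run(' '),
    make_run('word2', underline=True),
    make_run(' '),
    make_run('word3', bold=True),
]

styled = [(run.text, active_styles(run)) for run in sample_runs]
for text, styles in styled:
    print(repr(text), styles)
```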

The division into runs can be quite complicated. Due to the way editors work, it is not uncommon to find a word split across two runs, sometimes with the same properties. Do not rely on the number of runs; analyse the content instead. In particular, double-check whether a part with a particular style is divided into two or more runs. A good example is the words lore m (it should be lorem) in step 8.
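One way to avoid depending on the number of runs is to merge adjacent runs that share the same styling before analysing the text. This is a sketch under assumptions: merge_runs and style_key are hypothetical helpers, the bold styling given to the lore and m fragments is invented for illustration, and the stand-in objects only mimic the text, bold, italic, and underline attributes of a run.

```python
from itertools import groupby
from types import SimpleNamespace


def style_key(run):
    """Reduce a run's styling to a comparable (bold, italic, underline) tuple."""
    # bool() collapses None (inherited) and False into the same value,
    # which is a simplification made by this sketch.
    return (bool(run.bold), bool(run.italic), bool(run.underline))


def merge_runs(runs):
    """Join the text of consecutive runs that have identical styling."""
    merged = []
    for key, group in groupby(runs, key=style_key):
        merged.append((''.join(run.text for run in group), key))
    return merged


# A word split into two runs with the same styling, followed by
# regular text (hypothetical styling values).
sample_runs = [
    SimpleNamespace(text='lore', bold=True, italic=None, underline=None),
    SimpleNamespace(text='m', bold=True, italic=None, underline=None),
    SimpleNamespace(text=' ipsum', bold=None, italic=None, underline=None),
]

merged = merge_runs(sample_runs)
print(merged)
```

Because groupby only groups consecutive items, runs with the same styling that are separated by a differently styled run stay separate, which preserves the order of the text.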

Be aware that, because Word documents are produced by so many sources, a lot of properties may not be set, leaving the choice of specifics to the tool that opens the document. For example, it is very common to keep the default font, which may mean that the font information is left empty.
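Code that reads the font should therefore expect an empty value. A minimal sketch follows, assuming each run exposes a font object with a name attribute that is None when the default font is kept. The helper font_name and the placeholder string are hypothetical, and stand-in objects are used so the snippet runs without a document.

```python
from types import SimpleNamespace


def font_name(run, default='(document default)'):
    """Return the run's font name, or a placeholder when it is not set."""
    font = getattr(run, 'font', None)
    name = getattr(font, 'name', None)
    # An unset name means the font is inherited from a style or from
    # the document defaults, so report the placeholder instead.
    return name if name else default


# Stand-in runs: one keeps the default font, one sets it explicitly.
sample_runs = [
    SimpleNamespace(text='plain', font=SimpleNamespace(name=None)),
    SimpleNamespace(text='custom', font=SimpleNamespace(name='Arial')),
]

names = [font_name(run) for run in sample_runs]
print(names)
```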
