Batch-processing documents

We will now read a larger set of 2,225 BBC News articles (see GitHub for data source details) that belong to five categories and are stored in individual text files. We need to do the following:

  1. Call the .glob() method of pathlib's Path object.
  2. Iterate over the paths it yields (.glob() returns a generator, not a list).
  3. Read all lines of the news article excluding the heading in the first line.
  4. Append the cleaned result to a list:
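from pathlib import Path

# Recursively match every .txt file below ../data/bbc
files = Path('..', 'data', 'bbc').glob('**/*.txt')
bbc_articles = []
for i, file in enumerate(files):
    # Path parts are ('..', 'data', 'bbc', <topic>, <file name>)
    _, _, _, topic, file_name = file.parts
    with file.open(encoding='latin1') as f:
        lines = f.readlines()
        # Skip the heading in the first line and join the rest into one string
        body = ' '.join([l.strip() for l in lines[1:]]).strip()
        bbc_articles.append(body)
len(bbc_articles)
2225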
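The loop above discards the topic and the heading once each file has been read. As a minimal sketch, assuming the same one-folder-per-category layout, the steps could be wrapped in a function that keeps both alongside the body. The name read_articles, the tuple layout, and the tiny two-file corpus are hypothetical and only there so the example runs without the BBC data:

```python
from pathlib import Path
import tempfile


def read_articles(root, encoding='latin1'):
    """Return (topic, heading, body) tuples for every .txt file under root."""
    articles = []
    # sorted() gives a deterministic order; glob() itself does not promise one
    for file in sorted(Path(root).glob('**/*.txt')):
        topic = file.parent.name  # assumes one folder per category
        lines = file.read_text(encoding=encoding).splitlines()
        if not lines:
            continue  # skip empty files rather than fail on lines[0]
        heading = lines[0].strip()
        body = ' '.join(l.strip() for l in lines[1:]).strip()
        articles.append((topic, heading, body))
    return articles


# Made-up miniature corpus (not the BBC data) to exercise the function
with tempfile.TemporaryDirectory() as tmp:
    for topic, name, text in [
        ('business', '001.txt', 'Profits rise\n\nShares jumped.\nAnalysts cheered.\n'),
        ('sport', '001.txt', 'Cup final\n\nThe match ended 2-1.\n'),
    ]:
        folder = Path(tmp, topic)
        folder.mkdir()
        (folder / name).write_text(text, encoding='latin1')
    sample = read_articles(tmp)

print(len(sample))
print(sample[0])
```

Taking the topic from file.parent.name rather than unpacking file.parts means the function does not depend on how many directories sit above the data folder.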