Let's start by defining a few common terms, so that we remove any ambiguity in how we use them. Since you can read, you likely already have some understanding of these terms. I apologize if this seems pedantic, but I promise it relates directly to the models we talk about next:
- Word: The atomic element of most of the systems we will be using. While some character-level models do exist, we won't be talking about them today.
- Sentence: A collection of words that expresses a statement, question, and so on.
- Document: A collection of sentences. It might be a single sentence, but more likely it is several.
- Corpus: A collection of documents.
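
To make the nesting of these terms concrete, here is a minimal Python sketch. The toy `corpus` and the helper names `split_sentences` and `split_words` are made up for illustration, and the splitting rules are deliberately naive (they would mishandle abbreviations such as "Dr."), so treat this as a picture of the hierarchy rather than a real tokenizer.

```python
import re

# Hypothetical toy corpus: a list of documents, each stored as a plain string.
corpus = [
    "The cat sat on the mat. It purred.",
    "Dogs bark. Do cats bark? No!",
]

def split_sentences(document):
    # Naive assumption: a sentence ends at '.', '?' or '!' followed by whitespace.
    return [s for s in re.split(r"(?<=[.?!])\s+", document.strip()) if s]

def split_words(sentence):
    # Naive assumption: a word is a run of letters, digits, or apostrophes.
    return re.findall(r"[A-Za-z0-9']+", sentence)

# corpus -> documents -> sentences -> words, as nested lists.
nested = [[split_words(s) for s in split_sentences(doc)] for doc in corpus]

for doc_index, document in enumerate(nested):
    for sentence in document:
        print(doc_index, sentence)
```

Each level of nesting in `nested` corresponds to one of the terms above: the outer list is the corpus, each element is a document, each document is a list of sentences, and each sentence is a list of words.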