N-grams

What are N-grams?

N-grams are contiguous sequences of n items drawn from a sample of text or speech. In natural language processing, the items are usually words or characters. Because each n-gram records which items occur next to one another, n-grams capture local linguistic structure, such as word or character dependencies. They are used in NLP tasks such as language modeling, text classification, and information retrieval.

Examples of N-grams (word-level, taken from a text that begins “the cat sat on the”):

  • Unigrams (n = 1): Single words or characters, e.g., “the”, “cat”, “sat”.
  • Bigrams (n = 2): Sequences of two words or characters, e.g., “the cat”, “cat sat”, “sat on”.
  • Trigrams (n = 3): Sequences of three words or characters, e.g., “the cat sat”, “cat sat on”, “sat on the”.
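
The extraction described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the helper name `ngrams`, the sample sentence "the cat sat on the mat", and the naive whitespace tokenization are all assumptions made for the example. The same sliding-window helper works on a list of words or on a string of characters.

```python
def ngrams(items, n):
    """Return every contiguous window of n items from a sequence, as tuples."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A sequence of length L has L - n + 1 windows; none if it is shorter than n.
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


# Word-level n-grams. Splitting on whitespace is a simplifying assumption;
# real pipelines usually use a proper tokenizer.
tokens = "the cat sat on the mat".split()
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

# Character-level n-grams: a string is a sequence of characters,
# so the same helper applies (spaces count as characters here).
char_trigrams = ["".join(gram) for gram in ngrams("cat sat", 3)]

print(bigrams)
print(trigrams)
print(char_trigrams)
```

A text with fewer than n items yields an empty list, which is usually the convenient behavior when processing short documents in bulk.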
