Introduction to NLP with spaCy
Learning objectives
Understand the basic concepts of probabilistic language models and how these can be applied to different NLP tasks including sentiment analysis, genre classification and named-entity recognition.
Understand and know how to apply n-grams and word embeddings for feature extraction in a classification pipeline.
Have a conceptual understanding of transformers and deep learning techniques for NLP.
Understand and know how to use the Python library spaCy for NLP and to develop reproducible text-processing pipelines.
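As a rough illustration of the n-gram feature extraction mentioned in the objectives, the sketch below counts word unigrams and bigrams in a single review using only the standard library. The helper names `word_ngrams` and `ngram_features` and the sample sentence are hypothetical, and whitespace splitting stands in for a proper tokenizer; a real pipeline would more likely rely on spaCy tokens or a library vectorizer.

```python
from collections import Counter


def word_ngrams(text, n):
    """Return all word n-grams (as tuples) of a lowercased, whitespace-split text."""
    tokens = text.lower().split()  # assumption: naive whitespace tokenization
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_features(text, n_values=(1, 2)):
    """Count every n-gram of the requested sizes; the counts act as sparse features."""
    counts = Counter()
    for n in n_values:
        counts.update(word_ngrams(text, n))
    return counts


# Hypothetical example review
review = "the food was great and the service was great"
features = ngram_features(review)

print(features[("was", "great")])  # count of the bigram "was great"
print(features[("the",)])          # count of the unigram "the"
```

Counts like these can be turned into a fixed-length vector (one position per n-gram in the vocabulary) and passed to a classifier for tasks such as sentiment analysis.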
spaCy
spaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python. It’s designed specifically for production use and helps you build applications that process and “understand” large volumes of text. It can be used to build information extraction or natural language understanding systems.
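To give a first feel for the library, here is a minimal sketch that tokenizes one sentence. It assumes spaCy is installed and uses a blank English pipeline (`spacy.blank("en")`), which contains only the tokenizer and needs no model download; the sample sentence is made up for illustration.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only, no trained components).
nlp = spacy.blank("en")

# Calling the pipeline on a string returns a Doc, a sequence of Token objects.
doc = nlp("spaCy processes large volumes of text.")

tokens = [token.text for token in doc]                   # every token, punctuation included
words = [token.text for token in doc if token.is_alpha]  # alphabetic tokens only

print(tokens)
print(words)
```

Tasks such as part-of-speech tagging or named-entity recognition require a trained pipeline (for example `en_core_web_sm`), which is downloaded separately and loaded with `spacy.load`.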
Suggested study plan
Read the lecture notes here.
Go through the interactive online course Advanced NLP with spaCy. Alternatively, get started with a shorter tutorial from Real Python, NLP with spaCy in Python.
Go through a more detailed explanation of word2vec:
word2vec explained by Julian Gilyadov (more technical).
Have a go at the Dutch restaurant reviews dataset to predict the next Dutch Michelin-starred restaurants. A series of blog posts by the Analytics Lab provides a step-by-step explanation using R. Try to reproduce these with spaCy.
Go through the example notebooks in the following section.
Recommended material for further study
Artificial Intelligence: A Modern Approach (4th edition) provides a thorough overview in two chapters:
Chapter 23 - Natural Language Processing: explains language models, grammars and parsing.
Chapter 24 - Deep Learning for NLP: explains word embeddings, RNNs, sequence-to-sequence models, transformers and transfer learning.
The book has to be purchased and is not cheap, but it is warmly recommended as a desk reference.