Caroline Eppright | Content Strategist | March 25, 2021
Natural language processing (NLP) is a branch of artificial intelligence (AI) that enables computers to comprehend, generate, and manipulate human language. Natural language processing can interrogate data with natural language text or voice; this is also called “language in.” Most consumers have probably interacted with NLP without realizing it. For instance, NLP is the core technology behind virtual assistants, such as the Oracle Digital Assistant (ODA), Siri, Cortana, or Alexa. When we ask questions of these virtual assistants, NLP is what enables them not only to understand the user’s request, but also to respond in natural language. NLP applies to both written text and speech, and can be applied to all human languages. Other examples of tools powered by NLP include web search, email spam filtering, automatic translation of text or speech, document summarization, sentiment analysis, and grammar/spell checking. For example, some email programs can automatically suggest an appropriate reply to a message based on its content—these programs use NLP to read, analyze, and respond to your message.
There are several other terms that are roughly synonymous with NLP. Natural language understanding (NLU) and natural language generation (NLG) refer to using computers to understand and produce human language, respectively. NLG can provide a verbal description of what has happened; this is also called “language out,” summarizing meaningful information into text using a concept known as the “grammar of graphics.”
In practice, NLU is often used to mean NLP: the understanding by computers of the structure and meaning of all human languages, allowing developers and users to interact with computers using natural sentences and communication. Computational linguistics (CL) is the scientific field that studies computational aspects of human language, while NLP is the engineering discipline concerned with building computational artifacts that understand, generate, or manipulate human language.
Research on NLP began shortly after the invention of digital computers in the 1950s, and NLP draws on both linguistics and AI. However, the major breakthroughs of the past few years have been powered by machine learning, which is a branch of AI that develops systems that learn and generalize from data. Deep learning is a kind of machine learning that can learn very complex patterns from large datasets, which means that it is ideally suited to learning the complexities of natural language from datasets sourced from the web.
Automate routine tasks: Chatbots powered by NLP can process a large number of routine tasks that are handled by human agents today, freeing up employees to work on more challenging and interesting tasks. For example, chatbots and Digital Assistants can recognize a wide variety of user requests, match them to the appropriate entry in a corporate database, and formulate an appropriate response to the user.
Improve search: NLP can improve on keyword matching search for document and FAQ retrieval by disambiguating word senses based on context (for example, “carrier” means something different in biomedical and industrial contexts), matching synonyms (for example, retrieving documents mentioning “car” given a search for “automobile”), and taking morphological variation into account (which is important for non-English queries). Effective NLP-powered academic search systems can dramatically improve access to relevant cutting-edge research for doctors, lawyers, and other specialists.
Search engine optimization: NLP is a great tool for getting your business ranked higher in online search by analyzing searches to optimize your content. Search engines use NLP to rank their results—and knowing how to effectively use these techniques makes it easier to be ranked above your competitors. This will lead to greater visibility for your business.
Analyzing and organizing large document collections: NLP techniques such as document clustering and topic modeling simplify the task of understanding the diversity of content in large document collections, such as corporate reports, news articles, or scientific documents. These techniques are often used for legal discovery purposes.
Social media analytics: NLP can analyze customer reviews and social media comments to make better sense of huge volumes of information. Sentiment analysis identifies positive and negative comments in a stream of social-media comments, providing a direct measure of customer sentiment in real time. This can lead to huge payoffs down the line, such as increased customer satisfaction and revenue.
Market insights: With NLP working to analyze the language of your business’ customers, you’ll have a better handle on what they want, and also a better idea of how to communicate with them. Aspect-oriented sentiment analysis detects the sentiment associated with specific aspects or products in social media (for example, “the keyboard is great, but the screen is too dim”), providing directly actionable information for product design and marketing.
Moderating content: If your business attracts large amounts of user or customer comments, NLP enables you to moderate what’s being said in order to maintain quality and civility by analyzing not only the words, but also the tone and intent of comments.
These are just a few examples of how NLP simplifies and automates a wide range of business processes, especially ones that involve large amounts of unstructured text such as emails, surveys, and social media conversations. With NLP, businesses are better able to analyze their data and make the right decisions.
Machine learning models for NLP: We mentioned earlier that modern NLP relies heavily on an approach to AI called machine learning. Machine learning models make predictions by generalizing over examples in a dataset. This dataset is called the training data, and machine learning algorithms train on this training data to produce a machine learning model that accomplishes a target task.
For example, sentiment analysis training data consists of sentences together with their sentiment (for example, positive, negative, or neutral sentiment). A machine-learning algorithm reads this dataset and produces a model which takes sentences as input and returns their sentiments. This kind of model, which takes sentences or documents as inputs and returns a label for that input, is called a document classification model. Document classifiers can also be used to classify documents by the topics they mention (for example, as sports, finance, politics, etc.).
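To make this concrete, here is a minimal sketch of a document classification model for sentiment, assuming the scikit-learn library; the handful of labeled sentences below simply stands in for a real training dataset:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Tiny illustrative training data: sentences paired with sentiment labels.
    train_sentences = [
        "I love this phone, the battery lasts all day",
        "Terrible service, I want a refund",
        "Great value and fast shipping",
        "The screen cracked after one week, very disappointed",
    ]
    train_labels = ["positive", "negative", "positive", "negative"]

    # The algorithm reads the dataset and produces a model that maps sentences to labels.
    model = make_pipeline(TfidfVectorizer(), LogisticRegression())
    model.fit(train_sentences, train_labels)

    # The trained model takes new sentences as input and returns a sentiment label.
    print(model.predict(["The support team was terrible"]))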
Another kind of model is used to recognize and classify entities in documents. For each word in a document, the model predicts whether that word is part of an entity mention, and if so, what kind of entity is involved. For example, in “XYZ Corp shares traded for $28 yesterday”, “XYZ Corp” is a company entity, “$28” is a currency amount, and “yesterday” is a date. The training data for entity recognition is a collection of texts, where each word is labeled with the kinds of entities the word refers to. This kind of model, which produces a label for each word in the input, is called a sequence labeling model.
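As an illustration, the pretrained pipelines in the spaCy library (an assumed choice, not required by the text) include a sequence labeling model for entity recognition; it requires installing spaCy and downloading its small English pipeline:

    import spacy

    # Load a pretrained English pipeline (python -m spacy download en_core_web_sm).
    nlp = spacy.load("en_core_web_sm")
    doc = nlp("XYZ Corp shares traded for $28 yesterday")

    # Each recognized entity span carries a type label such as ORG, MONEY, or DATE.
    for ent in doc.ents:
        print(ent.text, ent.label_)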
Sequence-to-sequence models are a very recent addition to the family of models used in NLP. A sequence-to-sequence (or seq2seq) model takes an entire sentence or document as input (as in a document classifier), but it produces a sentence or some other sequence (for example, a computer program) as output. (A document classifier only produces a single symbol as output.) Example applications of seq2seq models include machine translation, which, for example, takes an English sentence as input and returns its French translation as output; document summarization (where the output is a summary of the input); and semantic parsing (where the input is a query or request in English, and the output is a computer program implementing that request).
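For example, a pretrained translation model can be run in a few lines with the Hugging Face transformers library (the specific Helsinki-NLP/opus-mt-en-fr checkpoint is one illustrative choice among many):

    from transformers import pipeline

    # A pretrained seq2seq model: English sentence in, French sentence out.
    translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")
    print(translator("Natural language processing is changing how we search."))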
Deep learning, pretrained models, and transfer learning: Deep learning is the most widely-used kind of machine learning in NLP. In the 1980s, researchers developed neural networks, in which a large number of primitive machine learning models are combined into a single network: by analogy with brains, the simple machine learning models are sometimes called “neurons.” These neurons are arranged in layers, and a deep neural network is one with many layers. Deep learning is machine learning using deep neural network models.
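As a toy illustration of “neurons arranged in layers,” the PyTorch snippet below (an assumed toolkit; any deep learning library would do) stacks a few layers of simple units into a small deep network:

    import torch
    from torch import nn

    # Each Linear layer is a bank of simple "neurons"; stacking layers makes the network deep.
    deep_net = nn.Sequential(
        nn.Linear(300, 128), nn.ReLU(),  # hidden layer 1
        nn.Linear(128, 64), nn.ReLU(),   # hidden layer 2
        nn.Linear(64, 3),                # output layer, e.g. three sentiment classes
    )

    # A made-up 300-dimensional input (for example, an averaged word embedding).
    scores = deep_net(torch.randn(1, 300))
    print(scores.shape)  # torch.Size([1, 3])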
Because of their complexity, it generally takes a lot of data to train a deep neural network, and processing it takes a lot of compute power and time. Modern deep neural network NLP models are trained from a diverse array of sources, such as all of Wikipedia and data scraped from the web. The training data might be on the order of 10 GB or more in size, and it might take a week or more on a high-performance cluster to train the deep neural network. (Researchers find that training even deeper models from even larger datasets yields even higher performance, so currently there is a race to train bigger and bigger models from larger and larger datasets.)
The voracious data and compute requirements of deep neural networks would seem to severely limit their usefulness. However, transfer learning enables a trained deep neural network to be further trained to achieve a new task with much less training data and compute effort. The simplest kind of transfer learning is called fine-tuning. It consists simply of first training the model on a large generic dataset (for example, Wikipedia) and then further training (“fine-tuning”) the model on a much smaller task-specific dataset that is labeled with the actual target task. Perhaps surprisingly, the fine-tuning datasets can be extremely small, maybe containing only hundreds or even tens of training examples, and fine-tuning training only requires minutes on a single CPU. Transfer learning makes it easy to deploy deep learning models throughout the enterprise.
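A hedged sketch of fine-tuning, using the Hugging Face transformers and datasets libraries (both are assumptions, as is the distilbert-base-uncased checkpoint), might look like this; the pretrained model is simply trained a little further on a tiny labeled dataset:

    from datasets import Dataset
    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              Trainer, TrainingArguments)

    checkpoint = "distilbert-base-uncased"  # generic pretrained model
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

    # Tiny illustrative task-specific dataset; real projects might use tens to hundreds of examples.
    data = Dataset.from_dict({
        "text": ["great product", "awful experience", "works perfectly", "broke immediately"],
        "label": [1, 0, 1, 0],
    }).map(lambda row: tokenizer(row["text"], truncation=True, padding="max_length", max_length=32))

    # Further train ("fine-tune") the pretrained weights on the small labeled dataset.
    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="finetuned-model", num_train_epochs=3),
        train_dataset=data,
    )
    trainer.train()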
There is now an entire ecosystem of providers delivering pretrained deep learning models that are trained on different combinations of languages, datasets, and pretraining tasks. These pretrained models can be downloaded and fine-tuned for a wide variety of different target tasks.
Tokenization: Tokenization splits raw text (for example, a sentence or a document) into a sequence of tokens, such as words or subword pieces. Tokenization is often the first step in an NLP processing pipeline. Tokens are commonly recurring sequences of text that are treated as atomic units in later processing. They may be words, subword units called morphemes (for example, prefixes such as “un-” or suffixes such as “-ing” in English), or even individual characters.
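For instance, a pretrained subword tokenizer (here, BERT’s, loaded through the Hugging Face transformers library as an assumed choice) keeps common words whole and splits rarer words into smaller pieces:

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

    # Common words stay whole; rarer words are split into "##"-prefixed subword pieces.
    print(tokenizer.tokenize("Tokenization is unavoidable in NLP pipelines."))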
Bag-of-words models: Bag-of-words models treat documents as unordered collections of tokens or words (a bag is like a set, except that it tracks the number of times each element appears). Because they completely ignore word order, bag-of-words models will confuse a sentence such as “dog bites man” with “man bites dog.” However, bag-of-words models are often used for efficiency reasons on large information retrieval tasks such as search engines. They can produce close to state-of-the-art results with longer documents.
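The point about word order can be seen directly with scikit-learn’s CountVectorizer (one common bag-of-words implementation, used here purely as an illustration):

    from sklearn.feature_extraction.text import CountVectorizer

    vectorizer = CountVectorizer()
    counts = vectorizer.fit_transform(["dog bites man", "man bites dog"])

    # Both sentences map to the same counts, since only word frequencies are kept.
    print(vectorizer.get_feature_names_out())  # ['bites' 'dog' 'man']
    print(counts.toarray())                    # both rows are [1 1 1]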
Stop word removal: A “stop word” is a token that is ignored in later processing. They are typically short, frequent words such as “a,” “the,” or “an.” Bag-of-words models and search engines often ignore stop words in order to reduce processing time and storage within the database. Deep neural networks typically do take word-order into account (that is, they are not bag-of-words models) and do not do stop word removal because stop words can convey subtle distinctions in meaning (for example, “the package was lost” and “a package is lost” don’t mean the same thing, even though they are the same after stop word removal).
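A minimal illustration of stop word removal (with a hand-picked stop list; real systems use much longer lists) also shows how it can erase the distinction mentioned above:

    STOP_WORDS = {"a", "an", "the", "is", "was"}

    def remove_stop_words(text):
        # Keep only the tokens that are not in the stop list.
        return [token for token in text.lower().split() if token not in STOP_WORDS]

    print(remove_stop_words("The package was lost"))  # ['package', 'lost']
    print(remove_stop_words("A package is lost"))     # ['package', 'lost'] -- same result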
Stemming and lemmatization: Morphemes are the smallest meaning-bearing elements of language. Typically morphemes are smaller than words. For example, “revisited” consists of the prefix “re-“, the stem “visit,” and the past-tense suffix “-ed.” Stemming and lemmatization map words to their stem forms (for example, “revisit” + PAST). Stemming and lemmatization are crucial steps in pre-deep learning models, but deep learning models generally learn these regularities from their training data, and so do not require explicit stemming or lemmatization steps.
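For example, with the NLTK library (an assumed choice; the lemmatizer additionally needs the WordNet data, downloaded via nltk.download("wordnet")):

    from nltk.stem import PorterStemmer, WordNetLemmatizer

    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()

    # Both map an inflected word back toward its stem or dictionary form.
    print(stemmer.stem("revisited"))               # 'revisit'
    print(lemmatizer.lemmatize("revisited", "v"))  # 'revisit'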
Part-of-speech tagging and syntactic parsing: Part-of-speech (PoS) tagging is the process of labeling each word with its part of speech (for example, noun, verb, adjective, etc.). A syntactic parser identifies how words combine to form phrases, clauses, and entire sentences. PoS tagging is a sequence labeling task, syntactic parsing is an extended kind of sequence labeling task, and deep neural networks are the state-of-the-art technology for both PoS tagging and syntactic parsing. Before deep learning, PoS tagging and syntactic parsing were essential steps in sentence understanding. However, modern deep learning NLP models generally only benefit marginally (if at all) from PoS or syntax information, so neither PoS tagging nor syntactic parsing are widely used in deep learning NLP.
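A pretrained pipeline such as spaCy’s (used here purely as an illustration) performs both steps at once, labeling each token with a part of speech and a dependency relation to its syntactic head:

    import spacy

    nlp = spacy.load("en_core_web_sm")  # pretrained English pipeline

    for token in nlp("The quick brown fox jumps over the lazy dog"):
        # word, part-of-speech tag, dependency label, and the token's syntactic head
        print(token.text, token.pos_, token.dep_, token.head.text)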
NLP libraries and toolkits are generally available in Python, and for this reason by far the majority of NLP projects are developed in Python. Python’s interactive development environment makes it easy to develop and test new code.
For processing large amounts of data, C++ and Java are often preferred because they can support more efficient code.
Here are examples of some popular NLP libraries.
TensorFlow and PyTorch: These are the two most popular deep learning toolkits. They are freely available for research and commercial purposes. While they support multiple languages, their primary language is Python. They come with large libraries of prebuilt components, so even very sophisticated deep learning NLP models often only require plugging these components together. They also support high-performance computing infrastructure, such as clusters of machines with graphics processing unit (GPU) accelerators. They have excellent documentation and tutorials.
AllenNLP: This is a library of high-level NLP components (for example, simple chatbots) implemented in PyTorch and Python. The documentation is excellent.
HuggingFace: This company distributes hundreds of different pretrained deep learning NLP models, as well as a plug-and-play software toolkit in TensorFlow and PyTorch that enables developers to rapidly evaluate how well different pretrained models perform on their specific tasks.
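As a sketch of that plug-and-play style (the details here are illustrative, not a complete recipe), a pretrained model can be downloaded and tried on sample inputs in a few lines:

    from transformers import pipeline

    # pipeline() downloads a default pretrained model for the requested task.
    classifier = pipeline("sentiment-analysis")
    print(classifier("Transfer learning makes these pretrained models easy to reuse."))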
Spark NLP: Spark NLP is an open source text processing library for advanced NLP for the Python, Java, and Scala programming languages. Its goal is to provide an application programming interface (API) for natural language processing pipelines. It offers pretrained neural network models, pipelines, and embeddings, as well as support for training custom models.
SpaCy NLP: SpaCy is a free, open source library for advanced NLP in Python, and it is specifically designed to help build applications that can process and understand large volumes of text. SpaCy is known to be highly intuitive and can handle many of the tasks needed in common NLP projects.
In summary, natural language processing is an exciting area of artificial intelligence development that fuels a wide range of new products such as search engines, chatbots, recommendation systems, and speech-to-text systems. As human interfaces with computers continue to move away from buttons, forms, and domain-specific languages, demand for natural language processing will continue to grow. For this reason, Oracle Cloud Infrastructure is committed to providing on-premises performance with our performance-optimized compute shapes and tools for NLP. Oracle Cloud Infrastructure offers an array of GPU shapes that you can deploy in minutes to begin experimenting with NLP.