How does pattern-based coding work?

This feature is available in NVivo Plus edition.

This topic explains how NVivo uses existing coding patterns to determine what to code. This is an experimental feature.

When you auto code using existing patterns, NVivo compares each text passage (for example, a sentence or paragraph) to the content already coded to existing nodes. If the wording of the text passage is similar to content already coded to a node, then the text passage will be coded to that node.

NVivo first analyzes the text in existing nodes, then prepares the content to be coded, and finally does a comparison in order to determine what to code.

During pattern-based auto coding, words in the content to be coded are compared to the words in previously coded nodes. Stop words are ignored when comparing text content to existing nodes.

Analysis of text in existing nodes

NVivo first examines all of the nodes in your project for existing coding patterns and creates a 'training set'. Coding references created by previous pattern-based coding—that is, any coding by the user 'NVivo'—are ignored.

A 'training set' is created as follows:

Each node is broken down into a list of words. During this process, stop words are removed from the list. For stemming languages, words with the same stem—for example house, houses and housing—are grouped together.

For best results, make sure the text content language is set to match the language of your source materials—refer to Set the text content language and stop words for more information.

For each term (group of words with the same stem) within the node, the 'term frequency-inverse document frequency' (tf-idf) is calculated.

In this context, the tf-idf reflects how important a term is to a specific node. It is based on a term frequency calculation within the node that is offset by the frequency of the term in all of the nodes. For more information, refer to the Wikipedia article on the tf-idf numerical statistic.

The list of terms and their associated tf-idf values are then represented as a one-dimensional array—known as a vector—for each node. The vectors for each node are used later on when determining what to code.
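The steps above can be sketched in Python. This is only an illustration of the general technique, not NVivo's implementation: the stop-word list, the crude_stem helper, the smoothed idf formula and the use of a dictionary (a sparse form of the vector) are all assumptions made for the example.

```python
import math
import re
from collections import Counter

# Hypothetical, tiny stop-word list. The real list depends on the
# text content language set for the project.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "is", "to", "for", "are", "on"}


def crude_stem(word):
    """Very rough suffix stripper, standing in for a real stemmer (assumption)."""
    for suffix in ("ing", "es", "s", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def to_terms(text):
    """Break text into words, drop stop words, and group words by stem."""
    words = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(w) for w in words if w not in STOP_WORDS]


def build_training_set(node_texts):
    """node_texts maps a node name to all text coded at that node.

    Returns (vectors, idf): vectors maps each node name to a
    {term: tf-idf} dictionary, and idf maps each term to its weight.
    """
    term_counts = {name: Counter(to_terms(text)) for name, text in node_texts.items()}
    n_nodes = len(term_counts)

    # Number of nodes in which each term appears.
    node_frequency = Counter()
    for counts in term_counts.values():
        node_frequency.update(counts.keys())

    # Smoothed idf, one common variant (assumption): terms that appear
    # in fewer nodes receive a higher weight.
    idf = {t: math.log((1 + n_nodes) / (1 + nf)) + 1 for t, nf in node_frequency.items()}

    vectors = {}
    for name, counts in term_counts.items():
        total = sum(counts.values()) or 1  # guard against an empty node
        vectors[name] = {t: (c / total) * idf[t] for t, c in counts.items()}
    return vectors, idf


nodes = {
    "Housing": "Houses and housing costs",
    "Fishing": "Fishing for fish in the river",
}
vectors, idf = build_training_set(nodes)
print(vectors["Housing"])
```

In this sketch, house, houses and housing all reduce to the same term, so the term counts twice in the Housing node and receives a higher tf-idf value than a term that appears once.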

Preparation of content to be coded

NVivo then prepares the content you have selected to code.

The content that is to be coded is divided into text passages—sentences, paragraphs or cells—depending on the option selected on the Wizard.
Each text passage is further divided into a list of words. During this process, stop words are removed from the list. Words with the same stem are grouped together into a single term; for example, fish, fishing, and fishes are grouped together as fish.
For each term (group of words with the same stem) within the text passage, the term frequency-inverse document frequency (tf-idf) is calculated.

In this context, the tf-idf reflects how important a term is to a text passage. It is based on a term frequency calculation within the text passage that is offset by the frequency of the term in all of the nodes. For more information, refer to the Wikipedia article on the tf-idf numerical statistic.

The list of terms and their associated 'tf-idf' values are then represented as a one-dimensional array—known as a vector—for each text passage. The vectors for each text passage are used later on when determining what to code.
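As a rough illustration of this preparation step, the Python sketch below splits content into passages and builds a tf-idf vector for each one. It is not NVivo's code: the splitting rules, the stop-word list, the sample idf weights and the choice to drop terms that appear in no node are assumptions made for the example, and stemming is left as an optional stem argument.

```python
import re
from collections import Counter

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "is", "to", "for", "are", "on", "we"}


def split_passages(text, unit="sentence"):
    """Divide content into passages, loosely mirroring the sentence and paragraph options."""
    if unit == "paragraph":
        parts = re.split(r"\n\s*\n", text)       # blank line between paragraphs
    elif unit == "sentence":
        parts = re.split(r"(?<=[.!?])\s+", text)  # naive sentence boundary
    else:
        raise ValueError(f"unsupported unit: {unit}")
    return [p.strip() for p in parts if p.strip()]


def passage_vector(passage, idf, stem=lambda word: word):
    """Return a {term: tf-idf} dictionary for one passage.

    idf holds per-term weights derived from the nodes, so the term
    frequency in the passage is offset by how common the term is
    across all nodes. Terms absent from every node are dropped
    (assumption): they could not match a node vector anyway.
    """
    words = re.findall(r"[a-z]+", passage.lower())
    counts = Counter(stem(w) for w in words if w not in STOP_WORDS)
    total = sum(counts.values()) or 1  # guard against an empty passage
    return {t: (c / total) * idf[t] for t, c in counts.items() if t in idf}


# Hypothetical idf weights, as if computed from the existing nodes.
idf = {"fish": 1.4, "river": 1.4, "cost": 1.4}
content = "We fish in the river. Rent is a big cost!"
for passage in split_passages(content):
    print(passage, passage_vector(passage, idf))
```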

Determining what to code

NVivo then compares each text passage to content already coded to the selected nodes and checks for similarity of wording to determine what to code.

Each text passage vector is compared against each node vector. NVivo calculates the cosine similarity between each pair. The cosine similarity is a value between 0 and 1. A higher number indicates a greater degree of similarity. For more information, refer to the Wikipedia article on Cosine similarity.
If the cosine similarity between a text passage vector and a node vector is above a particular threshold, then the text passage will be coded at that node. (The threshold can be adjusted with the slider in the Wizard.)
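The comparison can be sketched in Python as follows. This is a minimal example of cosine similarity with a cut-off, not NVivo's implementation: the dictionary representation of the vectors, the sample values and the threshold of 0.5 are assumptions chosen for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two {term: weight} vectors.

    With non-negative weights such as tf-idf values, the result
    lies between 0 and 1.
    """
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is similar to nothing
    return dot / (norm_a * norm_b)


def nodes_to_code(passage_vec, node_vecs, threshold=0.5):
    """Names of the nodes whose similarity to the passage is above the threshold.

    The default of 0.5 is an arbitrary placeholder for the slider setting.
    """
    return [
        name
        for name, node_vec in node_vecs.items()
        if cosine_similarity(passage_vec, node_vec) > threshold
    ]


# Hypothetical vectors for two selected nodes and one text passage.
node_vecs = {
    "Fishing": {"fish": 0.9, "river": 0.45},
    "Housing": {"hous": 0.9, "cost": 0.45},
}
passage_vec = {"fish": 0.7, "river": 0.7}
print(nodes_to_code(passage_vec, node_vecs))
```

Here the passage shares both of its terms with the Fishing node and none with the Housing node, so only Fishing passes the threshold. Raising the threshold makes the coding more selective; lowering it codes more passages.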

NOTE While the 'training set' includes vectors for all of the nodes in the project, only the vectors for the nodes selected in the Wizard are used to code the text passages.