Michael Chen | Content Strategist | July 17, 2024
Unsupervised learning is a machine learning technique that uses unlabeled data sets for training. With unsupervised learning, a model has no established guidelines for desired outputs or relationships. Instead, the goal is to explore the data and, in doing so, discover patterns, trends, and relationships.
Unsupervised learning is a strong fit for a machine learning project with a large amount of unlabeled, often diverse, data, where patterns and relationships aren’t yet known. The algorithm will often uncover insights that may not otherwise have been found. For example, examining a data set of purchasing histories can reveal clusters of customers who buy in similar, previously unknown, ways. Decision-makers might use that information to develop new sales programs.
Because of its exploratory nature, unsupervised learning works best for specific scenarios. These include the following:
Raw data analysis: Unsupervised learning algorithms can explore very large, unstructured volumes of data, such as text, to find patterns and trends. For example, an unsupervised learning algorithm can explore an unstructured data set of historical customer email inquiries. Though there’s no labeling to define the quality or purpose of these interactions, the algorithm can detect patterns that might highlight opportunities for improvement, such as a high volume of inquiries about the same technical issue.
Groupings: For data segmentation, unsupervised learning can examine the traits of data points to determine commonalities and patterns and create groups. An example of this comes from a project to train a large language model (LLM) to reply to customer input. Using unstructured customer feedback from chatbots and messages, the algorithm can learn to identify categories based on the text, such as billing question, positive or negative feedback, technical question, or employment inquiry. This categorization then helps the model identify appropriate responses in terms of both language and tone.
Relationships: Similar to groupings, unsupervised learning can determine how data points are connected by looking at three things: weight (the importance of the features or inputs that data points share), distance (the measure of overall similarity between data points), and the quality of the relationships. Consider a fraud detection algorithm that goes beyond binary flagging of questionable records by examining different related data points, such as similar purchases made by previously flagged accounts or other purchases by the account in question. Relationship analysis provides context, letting institutions determine if the flagged record was a one-off instance, part of a larger behavior pattern, or fraud.
In each of these cases, unsupervised learning identifies patterns and characteristics within the data. This process can lead to a better understanding of what can be learned to drive decision-making.
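As a rough illustration of the distance idea described above, the following sketch measures how similar data points are using Euclidean distance. The customer names, the two features, and the `nearest` helper are hypothetical and chosen only for illustration; real projects often use other distance measures and scale the features first.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical customers described by two made-up features:
# (orders per month, average order value in tens of dollars).
customers = {
    "A": (2.0, 3.0),
    "B": (2.5, 3.5),
    "C": (9.0, 12.0),
}

def nearest(name, points):
    """Return the name of the point closest to `name` (excluding itself)."""
    others = (k for k in points if k != name)
    return min(others, key=lambda k: euclidean_distance(points[name], points[k]))

print(nearest("A", customers))  # the customer most similar to A
```

A smaller distance means two records behave more alike, which is the raw signal that grouping and relationship analysis build on.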
Unsupervised learning is a type of machine learning where the algorithm is trained on unlabeled data. An unsupervised learning project starts with establishing the problem to be solved or other goal. With that information, the project’s leads can choose the type of algorithm for the project. This selection is usually based on the desired outcome: clustering, relationships, or dimensionality reduction (the process of identifying which features or variables in a data set matter and trimming the rest). The goals and the chosen algorithm type also determine what kind of training data is needed.
Once these pieces are set, the algorithm undergoes training, iteratively adjusting itself to better capture structure in the data, such as tighter groupings or stronger associations, until an acceptable performance standard is met. Because there are no labeled outputs to check against, data experts analyze the results to see if the model has uncovered the desired insights, then iterate by refining it and tweaking parameters to improve performance.
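One way to picture this iterative training is k-means clustering, a common clustering algorithm that the article does not name specifically. The sketch below is a minimal version for two-dimensional points; the sample data, the choice of two clusters, and the naive "first k points" starting centroids are all assumptions made for illustration.

```python
def kmeans(points, k, max_iters=100):
    """Minimal k-means for 2-D points; returns (centroids, labels)."""
    # Naive initialization (an assumption): use the first k points as centroids.
    centroids = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(max_iters):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: (p[0] - centroids[c][0]) ** 2
                                        + (p[1] - centroids[c][1]) ** 2)
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:
                new_centroids.append((sum(m[0] for m in members) / len(members),
                                      sum(m[1] for m in members) / len(members)))
            else:
                new_centroids.append(centroids[c])  # leave an empty cluster in place
        # Stop once neither the assignments nor the centroids change.
        if new_labels == labels and new_centroids == centroids:
            break
        labels, centroids = new_labels, new_centroids
    return centroids, labels

# Hypothetical data: two loose groups of points.
points = [(1.0, 1.0), (8.0, 8.0), (1.5, 2.0), (9.0, 9.5), (1.0, 0.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
print(labels)
```

No labels are supplied at any point: the loop simply repeats "assign, then update" until the groupings stop changing, and a person still has to judge whether the resulting clusters are meaningful.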
The decision to use unsupervised learning comes with caveats. Because there is no labeled data to validate results against, unsupervised learning is harder to evaluate than supervised or semi-supervised learning, and it generally requires oversight by experts who can verify the model’s performance. Thus, while unsupervised learning is a hands-off process from a data labeling and preparation standpoint, it needs close supervision to stay on the right path. For example, in a generative AI model tasked with producing realistic illustrations, domain experts will need to review results closely to ensure that the patterns and relationships powering image generation are accurate in areas such as lighting, anatomy, and structural feasibility. Otherwise, you might end up with extra fingers or toes.
The most common types of unsupervised learning are as follows:
Clustering: When the algorithm seeks out groups of similar data and the commonalities between them. Real-world examples include customer segmentation and auto-sorting email filtering.
Association rule: When the algorithm examines relationships between data points, whether surface level or hidden several layers deep. Real-world examples include customer purchase patterns and symptom relationships for medical diagnosis.
Dimensionality reduction: When the model examines a data set to reduce the number of irrelevant features (dimensions) used. Real-world examples include image recognition and data compression algorithms.
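To make the association rule idea above more concrete, here is a small sketch that computes two standard measures, support and confidence, over a handful of shopping baskets. The baskets and item names are invented for illustration, and production systems typically use dedicated algorithms to search many candidate rules rather than checking one by hand.

```python
# Hypothetical shopping baskets (each set is one purchase).
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]

def count(itemset, baskets):
    """Number of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for b in baskets if itemset <= b)

def support(itemset, baskets):
    """Fraction of all baskets that contain the itemset."""
    return count(itemset, baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Of the baskets containing `antecedent`, the fraction that also contain `consequent`."""
    both = count(set(antecedent) | set(consequent), baskets)
    return both / count(antecedent, baskets)

print(support({"bread", "butter"}, transactions))       # share of baskets with both items
print(confidence({"bread"}, {"butter"}, transactions))  # how often bread buyers also buy butter
```

In this toy data, bread and butter appear together in three of five baskets, and three of the four bread purchases also include butter, which is the kind of purchase pattern the list above refers to.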
Unsupervised machine learning lets companies discover patterns and insights in large, diverse, unstructured data sets that lack predefined categories or labels, without requiring people to label the data first. It’s akin to sifting through thousands of grains of sand for flecks of gold, potentially unlocking new opportunities for growth and innovation.
What are the two types of unsupervised learning?
Unsupervised learning techniques are often grouped into two main types: clustering and association rule learning. Clustering refers to the process of grouping data based on traits. Its analysis methods include hierarchical clustering, which arranges clusters in hierarchical trees (for example, customer purchasing power based on zip code), and probabilistic clustering, which uses probability scores to calculate the likelihood that a data point belongs to a group (for example, a customer’s risk characteristics in loan analysis). Association rule learning refers to the process of identifying relationships between data points to determine patterns and trends. Its methods include quantitative association, which links data points through numerical or quantitative attributes (for example, purchasing trends by age), and multirelational association, which links data points across multiple possible variables (for example, a pro athlete’s performance based on age, quality of teammates, salary, and college program).
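The hierarchical clustering mentioned above can be sketched as a bottom-up (agglomerative) merge: every value starts as its own cluster, and the two closest clusters are merged repeatedly. This is a minimal sketch under stated assumptions: it works on one-dimensional numbers, uses single-linkage distance (the gap between the closest members), and stops at a target cluster count, none of which the article specifies. The sample values are hypothetical.

```python
def agglomerative(values, target_clusters):
    """Merge the closest clusters until only `target_clusters` remain."""
    target_clusters = max(1, target_clusters)
    clusters = [[v] for v in sorted(values)]  # start with one cluster per value
    while len(clusters) > target_clusters:
        best = None  # (distance, index_i, index_j) of the closest pair so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the two closest members.
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge cluster j into cluster i
        del clusters[j]
    return clusters

# Hypothetical spending figures (in thousands of dollars).
print(agglomerative([10, 1, 25, 2, 11, 3], target_clusters=3))
```

Recording the order of the merges, rather than stopping at a fixed count, is what produces the tree structure that gives hierarchical clustering its name.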
What’s a good example of unsupervised learning?
A good example of unsupervised learning is an artificial intelligence LLM for the health care industry. In this case, the LLM trains on unstructured data sets, such as medical textbooks, patient records, and study data. Using iterative training, the LLM learns relationships and patterns, with the eventual goal for the LLM to answer queries using appropriate medical language with a high level of accuracy.
What’s the difference between supervised and unsupervised learning?
Supervised learning uses labeled data sets in algorithm training. With clear input and output labels, supervised learning builds off a foundation of established definitions. For example, an algorithm for identifying cats trains off photos clearly labeled as either having cats or not having cats. Unsupervised learning uses unlabeled data sets in training. Without labels, the algorithm explores the data sets to identify patterns and trends. Using the same example of identifying cats, the system could pretrain with large unlabeled data sets of general encyclopedia-style text and images to learn visual patterns and concepts related to cats, then refine by training on smaller image data sets for specific items, such as cat faces, paws, and tails.
What’s an example of unsupervised feature learning?
In machine learning, features are variables found in a data set. An example of a feature for a weather algorithm is the day of the year. In the specific case of unsupervised learning, features are identified as the algorithm explores the data. Going back to that weather example, the model may find via exploration that date is an important factor in making predictions and thus determine that’s a required input feature for the model.
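As a very simple, label-free illustration of assessing features, the sketch below drops columns that never vary, since a constant column cannot help distinguish one record from another. This is only one basic heuristic and not necessarily how the weather model described above would pick its inputs; the records, field names, and the `informative_features` helper are hypothetical.

```python
from statistics import pvariance

# Hypothetical weather records; station_id is the same in every row.
records = [
    {"day_of_year": 15, "station_id": 7, "humidity": 0.61},
    {"day_of_year": 120, "station_id": 7, "humidity": 0.55},
    {"day_of_year": 200, "station_id": 7, "humidity": 0.70},
    {"day_of_year": 330, "station_id": 7, "humidity": 0.64},
]

def informative_features(rows, min_variance=0.0):
    """Return the feature names whose variance across rows exceeds `min_variance`."""
    names = list(rows[0].keys())
    return [n for n in names if pvariance([r[n] for r in rows]) > min_variance]

print(informative_features(records))  # station_id is filtered out because it never changes
```

Because variance depends on each feature's units, a real project would usually rescale the features before comparing them against a nonzero threshold.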