Visualizing Clustered Data in R

If you are performing unsupervised learning for exploratory data analysis, you’ll probably need to visualize the subgroups – or clusters – within the data. Color-coding the clusters is common, but here I’ll show you how to use the vegan package in R to make those clusters really stand out.

Full code is at the end of the post in case you want to skip the step-by-step tutorial.

Step 1: Load the vegan package

library(vegan)

Step 2: Create example data with four clusters

set.seed(531) # set seed for reproducibility

df = data.frame(
  X1=rnorm(50, mean=0, sd=1), Y1=rnorm(50, mean=2, sd=1), 
  X2=rnorm(50, mean=3, sd=1), Y2=rnorm(50, mean=0, sd=1),
  X3=rnorm(50, mean=6, sd=1), Y3=rnorm(50, mean=3, sd=1),
  X4=rnorm(50, mean=8, sd=2), Y4=rnorm(50, mean=-1, sd=1))

# reorder data for ordihull function
data = cbind(
  c(df$X1, df$X2, df$X3, df$X4), 
  c(df$Y1, df$Y2, df$Y3, df$Y4))

# create data label variable for ordihull function
grouping = cbind(rep(1,50),rep(2,50),rep(3,50),rep(4,50))

Step 3: Initial plot of the data clusters (color-coded)

plot(df$X1, df$Y1, pch=20, col="cornflowerblue", xlim=c(-4,12), ylim=c(-4,5), xlab="", ylab="", xaxt='n', yaxt='n')
points(df$X2, df$Y2, pch=20, col="darkseagreen")
points(df$X3, df$Y3, pch=20, col="darkgoldenrod")
points(df$X4, df$Y4, pch=20, col="deeppink")
Color-coded data on a plot
Plot of the data. Color-coding helps identify the four data clusters.

Step 4: Use ordihull function to make data clusters stand out

ordihull(data, groups=grouping, label=TRUE)
Color-coded data on a plot with each color enclosed by a black line
Using the ordihull function makes the data clusters even more apparent.

There are additional ordi plot functions that can achieve similar results, depending on your visualization needs and tastes. Here is an example with the ordispider function:

ordispider(data, groups=grouping, label=TRUE)
Color-coded data on a plot with each color marked by a spider plot
The ordispider function results in a different look, but still makes the data clusters easily identifiable.

It’s worth giving the ordi functions in the vegan package a thorough look if you’re performing unsupervised learning and need to visualize data clusters. The help manual for the vegan package can be found here.

Full code

library(vegan) # load package

set.seed(531) # set seed for reproducibility

df = data.frame(
  X1=rnorm(50, mean=0, sd=1), Y1=rnorm(50, mean=2, sd=1), 
  X2=rnorm(50, mean=3, sd=1), Y2=rnorm(50, mean=0, sd=1),
  X3=rnorm(50, mean=6, sd=1), Y3=rnorm(50, mean=3, sd=1),
  X4=rnorm(50, mean=8, sd=2), Y4=rnorm(50, mean=-1, sd=1))

# reorder data for ordihull function
data = cbind(
  c(df$X1, df$X2, df$X3, df$X4), 
  c(df$Y1, df$Y2, df$Y3, df$Y4))

# create data label variable for ordihull function
grouping = cbind(rep(1,50),rep(2,50),rep(3,50),rep(4,50))

plot(df$X1, df$Y1, pch=20, col="cornflowerblue", xlim=c(-4,12), ylim=c(-4,5), xlab="", ylab="", xaxt='n', yaxt='n')
points(df$X2, df$Y2, pch=20, col="darkseagreen")
points(df$X3, df$Y3, pch=20, col="darkgoldenrod")
points(df$X4, df$Y4, pch=20, col="deeppink")

ordihull(data, groups=grouping, label=TRUE)

Did the ordihull function help you visualize your data clusters? What other function and packages do you use? Let me know in the comments below!


Check out more at ProjectsByPeter.com/Projects


Post content and images © 2021 ProjectsByPeter.com – All rights reserved.

Leave a Reply