Visualizing Clustered Data in R

If you are performing unsupervised learning for exploratory data analysis, you’ll probably need to visualize the subgroups – or clusters – within the data. Color-coding the clusters is common, but here I’ll show you how to use the vegan package in R to make those clusters really stand out.

Full code is at the end of the post in case you want to skip the step-by-step tutorial.

Step 1: Load the vegan package

library(vegan)

Step 2: Create example data with four clusters

set.seed(531) # set seed for reproducibility

df = data.frame(
  X1=rnorm(50, mean=0, sd=1), Y1=rnorm(50, mean=2, sd=1), 
  X2=rnorm(50, mean=3, sd=1), Y2=rnorm(50, mean=0, sd=1),
  X3=rnorm(50, mean=6, sd=1), Y3=rnorm(50, mean=3, sd=1),
  X4=rnorm(50, mean=8, sd=2), Y4=rnorm(50, mean=-1, sd=1))

# reorder data for ordihull function
data = cbind(
  c(df$X1, df$X2, df$X3, df$X4), 
  c(df$Y1, df$Y2, df$Y3, df$Y4))

# create one group label per row of data (a vector, in the same order as the rows)
grouping = rep(1:4, each=50)
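
Before plotting, it is worth confirming that there is exactly one label per row of the coordinate matrix, since the ordi functions match labels to points by position. The following is a minimal sketch of that check, not part of the original code: the names `xy` and `labels` are made up for illustration, and the random draws are not guaranteed to match the ones above.

```r
set.seed(531)  # any seed works; the draws need not match the post's data

n <- 50  # points per cluster, as in the post

# Stack x and y coordinates for four clusters into a 200 x 2 matrix
xy <- cbind(
  c(rnorm(n, 0, 1), rnorm(n, 3, 1), rnorm(n, 6, 1), rnorm(n, 8, 2)),
  c(rnorm(n, 2, 1), rnorm(n, 0, 1), rnorm(n, 3, 1), rnorm(n, -1, 1)))

# One label per row, in the same order the clusters were stacked
labels <- rep(1:4, each = n)

# The check: labels must be a plain vector with one entry per row
stopifnot(is.null(dim(labels)), nrow(xy) == length(labels))
print(table(labels))  # number of points carrying each label
```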

Step 3: Initial plot of the data clusters (color-coded)

plot(df$X1, df$Y1, pch=20, col="cornflowerblue", xlim=c(-4,12), ylim=c(-4,5), xlab="", ylab="", xaxt='n', yaxt='n')
points(df$X2, df$Y2, pch=20, col="darkseagreen")
points(df$X3, df$Y3, pch=20, col="darkgoldenrod")
points(df$X4, df$Y4, pch=20, col="deeppink")

Plot of the data. Color-coding helps identify the four data clusters.

Step 4: Use ordihull function to make data clusters stand out

ordihull(data, groups=grouping, label=TRUE)

Using the ordihull function makes the data clusters even more apparent.
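
To see what ordihull is doing, it can help to reproduce the idea in base R: a convex hull is the smallest convex polygon that encloses all of a group's points. The sketch below is my own approximation, not vegan's implementation. It uses `chull()` from grDevices and `polygon()` on a smaller two-cluster example, and the names `xy`, `labels`, and `hulls` are hypothetical.

```r
set.seed(531)
n <- 50

# Two clusters stacked into a 100 x 2 matrix, with one label per row
xy <- cbind(c(rnorm(n, 0), rnorm(n, 3)), c(rnorm(n, 2), rnorm(n, 0)))
labels <- rep(1:2, each = n)

plot(xy, pch = 20, col = c("cornflowerblue", "darkseagreen")[labels],
     xlab = "", ylab = "")

# For each group, find the row numbers of the points on its convex hull
hulls <- lapply(split(seq_len(nrow(xy)), labels), function(rows) {
  rows[chull(xy[rows, , drop = FALSE])]
})

# Draw each hull as a closed polygon on top of the scatter plot
for (h in hulls) polygon(xy[h, ], border = "black")
```

Points on the hull are the only ones that determine the outline, so a single outlier can stretch a hull considerably.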

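There are additional ordi plot functions that can achieve similar results, depending on your visualization needs and tastes. Here is an example with the ordispider function. Like ordihull, it draws on the current plot, so rerun the Step 3 plotting code first if you don’t want the spiders drawn on top of the hulls: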

ordispider(data, groups=grouping, label=TRUE)

The ordispider function results in a different look, but still makes the data clusters easily identifiable.
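
By default, the spider look comes from joining every point to the centroid of its group. A rough base R sketch of that idea follows; it is an illustration under my own assumptions (unweighted mean centroids, two clusters, hypothetical names `xy`, `labels`, and `centroids`), not vegan's code.

```r
set.seed(531)
n <- 50

# Two clusters stacked into a 100 x 2 matrix, with one label per row
xy <- cbind(c(rnorm(n, 0), rnorm(n, 3)), c(rnorm(n, 2), rnorm(n, 0)))
labels <- rep(1:2, each = n)

# Group centroids: column sums per label divided by the group sizes
centroids <- rowsum(xy, labels) / as.vector(table(labels))

plot(xy, pch = 20, col = c("cornflowerblue", "darkseagreen")[labels],
     xlab = "", ylab = "")

# One segment from each group's centroid to each of its points
own <- as.character(labels)  # look up centroid rows by label name
segments(centroids[own, 1], centroids[own, 2], xy[, 1], xy[, 2], col = "grey40")
```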

It’s worth giving the ordi functions in the vegan package a thorough look if you’re performing unsupervised learning and need to visualize data clusters. The reference manual for the vegan package is available from the package’s CRAN page.
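
In an actual unsupervised analysis the group labels are not known in advance; they come from a clustering algorithm. As a hedged sketch of how that would plug in, the example below assumes k-means with four centers (my choice, not something this post prescribes) and passes the estimated cluster labels to ordihull. The names `xy` and `km` are hypothetical, and the ordihull call is skipped if vegan is not installed.

```r
set.seed(531)
n <- 50

# Simulated data shaped like the example above: 200 points, four clusters
xy <- cbind(
  c(rnorm(n, 0, 1), rnorm(n, 3, 1), rnorm(n, 6, 1), rnorm(n, 8, 2)),
  c(rnorm(n, 2, 1), rnorm(n, 0, 1), rnorm(n, 3, 1), rnorm(n, -1, 1)))

# Assumption: k-means with 4 centers; nstart = 10 tries several random starts
km <- kmeans(xy, centers = 4, nstart = 10)

# Color points by estimated cluster (integer labels 1 to 4)
plot(xy, pch = 20, col = km$cluster, xlab = "", ylab = "")

# Outline the estimated clusters if vegan is available
if (requireNamespace("vegan", quietly = TRUE)) {
  vegan::ordihull(xy, groups = km$cluster, label = TRUE)
}
```

Because the simulated clusters overlap, the k-means labels will not necessarily match the groups the points were drawn from; the hulls outline the estimated clusters.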

Full code

library(vegan) # load package

set.seed(531) # set seed for reproducibility

df = data.frame(
  X1=rnorm(50, mean=0, sd=1), Y1=rnorm(50, mean=2, sd=1), 
  X2=rnorm(50, mean=3, sd=1), Y2=rnorm(50, mean=0, sd=1),
  X3=rnorm(50, mean=6, sd=1), Y3=rnorm(50, mean=3, sd=1),
  X4=rnorm(50, mean=8, sd=2), Y4=rnorm(50, mean=-1, sd=1))

# reorder data for ordihull function
data = cbind(
  c(df$X1, df$X2, df$X3, df$X4), 
  c(df$Y1, df$Y2, df$Y3, df$Y4))

# create one group label per row of data (a vector, in the same order as the rows)
grouping = rep(1:4, each=50)

plot(df$X1, df$Y1, pch=20, col="cornflowerblue", xlim=c(-4,12), ylim=c(-4,5), xlab="", ylab="", xaxt='n', yaxt='n')
points(df$X2, df$Y2, pch=20, col="darkseagreen")
points(df$X3, df$Y3, pch=20, col="darkgoldenrod")
points(df$X4, df$Y4, pch=20, col="deeppink")

ordihull(data, groups=grouping, label=TRUE)

Did the ordihull function help you visualize your data clusters? What other functions and packages do you use? Let me know in the comments below!


Post content and images © 2021 ProjectsByPeter.com – All rights reserved.