Visualizing Clustered Data in R

February 4, 2021 / April 30, 2021, by Peter, in Coding

If you are performing unsupervised learning for exploratory data analysis, you’ll probably need to visualize the subgroups – or clusters – within the data. Color-coding the clusters is common, but here I’ll show you how to use the vegan package in R to make those clusters really stand out.

Full code is at the end of the post in case you want to skip the step-by-step tutorial.

Step 1: Load the vegan package

library(vegan)

Step 2: Create example data with four clusters

set.seed(531) # set seed for reproducibility

df = data.frame(
  X1=rnorm(50, mean=0, sd=1), Y1=rnorm(50, mean=2, sd=1), 
  X2=rnorm(50, mean=3, sd=1), Y2=rnorm(50, mean=0, sd=1),
  X3=rnorm(50, mean=6, sd=1), Y3=rnorm(50, mean=3, sd=1),
  X4=rnorm(50, mean=8, sd=2), Y4=rnorm(50, mean=-1, sd=1))

# stack the four clusters into one two-column matrix (x, y) for ordihull
data = cbind(
  c(df$X1, df$X2, df$X3, df$X4), 
  c(df$Y1, df$Y2, df$Y3, df$Y4))

# create one group label per row of data for ordihull
grouping = rep(1:4, each=50)
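As a quick sanity check, the layout that the ordi functions work with can be verified in base R: one row per point, and one group label per row in the same order. The sketch below rebuilds the example data under the same seed; the names `pts` and `labels` are my own choices for illustration and are not part of vegan.

```r
set.seed(531) # same seed as the example above

# simulate the four clusters with the same parameters as above
df <- data.frame(
  X1 = rnorm(50, mean = 0, sd = 1), Y1 = rnorm(50, mean = 2, sd = 1),
  X2 = rnorm(50, mean = 3, sd = 1), Y2 = rnorm(50, mean = 0, sd = 1),
  X3 = rnorm(50, mean = 6, sd = 1), Y3 = rnorm(50, mean = 3, sd = 1),
  X4 = rnorm(50, mean = 8, sd = 2), Y4 = rnorm(50, mean = -1, sd = 1))

# stack the clusters: one row per point, columns are x and y
pts <- cbind(
  x = c(df$X1, df$X2, df$X3, df$X4),
  y = c(df$Y1, df$Y2, df$Y3, df$Y4))

# one label per row, in the same order as the stacked rows
labels <- rep(1:4, each = 50)

dim(pts)       # 200 rows, 2 columns
table(labels)  # 50 points in each of the 4 clusters

# the label vector must line up with the rows of the matrix
stopifnot(length(labels) == nrow(pts))
```

If the labels and rows ever get out of step (for example after filtering rows), the hulls will enclose the wrong points, so a check like the last line is cheap insurance.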

Step 3: Initial plot of the data clusters (color-coded)

plot(df$X1, df$Y1, pch=20, col="cornflowerblue", xlim=c(-4,12), ylim=c(-4,5), xlab="", ylab="", xaxt='n', yaxt='n')
points(df$X2, df$Y2, pch=20, col="darkseagreen")
points(df$X3, df$Y3, pch=20, col="darkgoldenrod")
points(df$X4, df$Y4, pch=20, col="deeppink")

[Figure: color-coded data on a plot.] Color-coding helps identify the four data clusters.
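The same color-coded plot can also be drawn in a single call by indexing a color vector with the group labels. This is a minimal sketch in base R; `labels`, `x`, `y`, and `cluster_cols` are hypothetical names, and the data is a stand-in drawn from the same distributions as above, so the individual points will differ from the original figure.

```r
set.seed(531)

# one label per point: 50 points in each of four clusters
labels <- rep(1:4, each = 50)

# stand-in data: same means and standard deviations as the example above
x <- rnorm(200, mean = c(0, 3, 6, 8)[labels], sd = c(1, 1, 1, 2)[labels])
y <- rnorm(200, mean = c(2, 0, 3, -1)[labels], sd = 1)

# one color per cluster, looked up by label
cluster_cols <- c("cornflowerblue", "darkseagreen", "darkgoldenrod", "deeppink")

# a single call draws all four clusters; axis limits are chosen automatically
plot(x, y, pch = 20, col = cluster_cols[labels],
     xlab = "", ylab = "", xaxt = "n", yaxt = "n")
```

Indexing by label scales to any number of clusters without adding a `points()` call for each one.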

Step 4: Use ordihull function to make data clusters stand out

ordihull(data, groups=grouping, label=TRUE)

[Figure: color-coded data on a plot with each color enclosed by a black line.] Using the ordihull function makes the data clusters even more apparent.
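If black outlines feel too plain, ordihull can also fill each hull. The sketch below assumes a reasonably recent vegan release in which `draw = "polygon"` and a vector-valued `col` are supported; the data is a stand-in drawn from the same distributions as above, and `pts`, `labels`, and `cluster_cols` are hypothetical names.

```r
library(vegan)

set.seed(531)

# stand-in data: four clusters, one label per row
labels <- rep(1:4, each = 50)
pts <- cbind(
  x = rnorm(200, mean = c(0, 3, 6, 8)[labels], sd = c(1, 1, 1, 2)[labels]),
  y = rnorm(200, mean = c(2, 0, 3, -1)[labels], sd = 1))

cluster_cols <- c("cornflowerblue", "darkseagreen", "darkgoldenrod", "deeppink")

# color-coded scatter plot as the base layer
plot(pts, pch = 20, col = cluster_cols[labels],
     xlab = "", ylab = "", xaxt = "n", yaxt = "n")

# fill each convex hull with its cluster color and label it
hulls <- ordihull(pts, groups = labels, draw = "polygon",
                  col = cluster_cols, label = TRUE)

# the returned object has a summary method (hull centroids and areas)
summary(hulls)
```

Filled hulls tend to read better on slides, while plain outlines keep overlapping clusters easier to tell apart.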

There are additional ordi plot functions that can achieve similar results, depending on your visualization needs and tastes. Here is an example with the ordispider function:

ordispider(data, groups=grouping, label=TRUE)

[Figure: color-coded data on a plot with each color marked by a spider plot.] The ordispider function results in a different look, but still makes the data clusters easily identifiable.
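Another member of the same family is ordiellipse, which draws an ellipse around each group instead of a hull or spider. This is a sketch rather than a recommendation: the choice of `kind = "sd"` with `conf = 0.95` is my own assumption for illustration, the data is a stand-in drawn from the same distributions as above, and `pts`, `labels`, and `cluster_cols` are hypothetical names.

```r
library(vegan)

set.seed(531)

# stand-in data: four clusters, one label per row
labels <- rep(1:4, each = 50)
pts <- cbind(
  x = rnorm(200, mean = c(0, 3, 6, 8)[labels], sd = c(1, 1, 1, 2)[labels]),
  y = rnorm(200, mean = c(2, 0, 3, -1)[labels], sd = 1))

cluster_cols <- c("cornflowerblue", "darkseagreen", "darkgoldenrod", "deeppink")

plot(pts, pch = 20, col = cluster_cols[labels],
     xlab = "", ylab = "", xaxt = "n", yaxt = "n")

# ellipses based on each group's standard deviation, scaled to a 95% limit
ell <- ordiellipse(pts, groups = labels, kind = "sd", conf = 0.95,
                   col = cluster_cols, label = TRUE)
```

Unlike a hull, an ellipse is not stretched by a single outlying point, which can make it a calmer choice for noisy clusters.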

It’s worth giving the ordi functions in the vegan package a thorough look if you’re performing unsupervised learning and need to visualize data clusters. The reference manual for the vegan package is available on the package's CRAN page.
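The example data in this post comes with known group labels, but in real unsupervised work the labels come from a clustering algorithm. Here is a minimal sketch of that workflow, assuming k-means (from base R's stats package) with four clusters; k-means and the choice of four centers are my assumptions for illustration, and any method that returns one label per row would work the same way. The data is a stand-in drawn from the same distributions as above.

```r
library(vegan)

set.seed(531)

# stand-in data: four simulated clusters, with no labels passed to the plot
sim <- rep(1:4, each = 50)
pts <- cbind(
  x = rnorm(200, mean = c(0, 3, 6, 8)[sim], sd = c(1, 1, 1, 2)[sim]),
  y = rnorm(200, mean = c(2, 0, 3, -1)[sim], sd = 1))

# cluster the points; nstart = 10 tries several random starts
km <- kmeans(pts, centers = 4, nstart = 10)

cluster_cols <- c("cornflowerblue", "darkseagreen", "darkgoldenrod", "deeppink")

# color by the estimated cluster, then outline each estimated cluster
plot(pts, pch = 20, col = cluster_cols[km$cluster],
     xlab = "", ylab = "", xaxt = "n", yaxt = "n")
ordihull(pts, groups = km$cluster, label = TRUE)
```

Note that the cluster numbers assigned by k-means are arbitrary, so the colors and labels will not necessarily match the order in which the data was simulated.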

Full code

library(vegan) # load package

set.seed(531) # set seed for reproducibility

df = data.frame(
  X1=rnorm(50, mean=0, sd=1), Y1=rnorm(50, mean=2, sd=1), 
  X2=rnorm(50, mean=3, sd=1), Y2=rnorm(50, mean=0, sd=1),
  X3=rnorm(50, mean=6, sd=1), Y3=rnorm(50, mean=3, sd=1),
  X4=rnorm(50, mean=8, sd=2), Y4=rnorm(50, mean=-1, sd=1))

# stack the four clusters into one two-column matrix (x, y) for ordihull
data = cbind(
  c(df$X1, df$X2, df$X3, df$X4), 
  c(df$Y1, df$Y2, df$Y3, df$Y4))

# create one group label per row of data for ordihull
grouping = rep(1:4, each=50)

plot(df$X1, df$Y1, pch=20, col="cornflowerblue", xlim=c(-4,12), ylim=c(-4,5), xlab="", ylab="", xaxt='n', yaxt='n')
points(df$X2, df$Y2, pch=20, col="darkseagreen")
points(df$X3, df$Y3, pch=20, col="darkgoldenrod")
points(df$X4, df$Y4, pch=20, col="deeppink")

ordihull(data, groups=grouping, label=TRUE)

Did the ordihull function help you visualize your data clusters? What other functions and packages do you use? Let me know in the comments below!

Check out more at ProjectsByPeter.com/Projects
