Long H. Nguyen

Knowledge is worth sharing!

My collection of small and re-usable source code for R.

Feel free to use it if you find it helpful.

Correlation plot

This could be the first step of your data analytics project: visualize the pairwise distributions of your data and see how the variables are correlated with each other.

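library(corrgram)
# Scatter plots below the diagonal, correlation coefficients above it, densities on it
corrgram(data,
         main="Correlation matrix for your data",
         lower.panel=panel.pts, upper.panel=panel.cor,
         diag.panel=panel.density)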
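
To back the plot with numbers, a minimal sketch (not part of the original snippet) could rank the variable pairs by absolute correlation using base R only. The mtcars columns here are just stand-ins for your own numeric data, and the name cor_pairs is made up for illustration.

```r
# Stand-in data: replace with the numeric columns of your own data frame
num_data <- mtcars[, c("mpg", "disp", "hp", "wt")]

# Pairwise Pearson correlations, ignoring missing values pair by pair
corr <- cor(num_data, use = "pairwise.complete.obs")

# Keep each pair once (upper triangle) and sort by absolute correlation
idx <- which(upper.tri(corr), arr.ind = TRUE)
cor_pairs <- data.frame(var1 = rownames(corr)[idx[, 1]],
                        var2 = colnames(corr)[idx[, 2]],
                        r    = corr[idx])
cor_pairs <- cor_pairs[order(-abs(cor_pairs$r)), ]
print(cor_pairs)
```

The strongest pairs at the top of the table are the ones worth a closer look in the corrgram panels.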

Overlay data distributions for comparisons

Given two groups of data, we may need to compare their distributions. Overlaying their density plots on the same axes makes the differences easy to see.

library(ggplot2)

# Tag each group so the legend can tell them apart
group1$partition <- 'group1'
group2$partition <- 'group2'
pltData <- rbind(group1, group2)
# Remove the helper column again so the original data frames stay unchanged
group1$partition <- NULL
group2$partition <- NULL
# imp_vars: character vector with the names of the columns to compare
for (var in imp_vars) {
  # Axis label: variable name plus mean +/- sd of each group
  xl <- paste0(var, '\n',
               'group1: ', round(mean(group1[[var]], na.rm = TRUE), 2), ' +/- ', round(sd(group1[[var]], na.rm = TRUE), 2), '\n',
               'group2: ', round(mean(group2[[var]], na.rm = TRUE), 2), ' +/- ', round(sd(group2[[var]], na.rm = TRUE), 2))
  print(ggplot(pltData, aes(.data[[var]], fill = partition)) + geom_density(alpha = 0.2) + xlab(xl))
  cat(xl, '\n', sep = '')
  # Pause between plots (only has an effect in an interactive session)
  invisible(readline(prompt = "Press [enter] to continue"))
}
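
If ggplot2 is not available, roughly the same overlay can be sketched with base R graphics. This is only an illustration on simulated data: the column name x, the sample sizes and the distribution parameters are all assumptions, not part of the original snippet.

```r
set.seed(1)
# Simulated stand-ins for the two groups
group1 <- data.frame(x = rnorm(200, mean = 0, sd = 1))
group2 <- data.frame(x = rnorm(200, mean = 1, sd = 1.5))

# Kernel density estimate for each group
d1 <- density(group1$x, na.rm = TRUE)
d2 <- density(group2$x, na.rm = TRUE)

# Draw the first curve with axes wide enough for both, then add the second
plot(d1, xlim = range(d1$x, d2$x), ylim = c(0, max(d1$y, d2$y)),
     main = "Overlaid densities", xlab = "x")
lines(d2, lty = 2)
legend("topright", legend = c("group1", "group2"), lty = c(1, 2))
```

Setting xlim and ylim from both estimates keeps either curve from being clipped.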

Histogram of important features vs label

A sample of my code from a Kaggle competition.

It is very helpful for creating ‘golden’ features. :)

library(readr)
library(ggplot2)
library(ggthemes)
train <- read_csv("../input/train.csv")
# This importance list is generated by my model; only the top features are listed here.
important_list <- c('PropertyField37','SalesField5','PersonalField9','Field7'
                  ,'PersonalField2','PersonalField1','SalesField4','PersonalField10A'
                  , 'SalesField1B', 'PersonalField10B', 'PersonalField12')
train$QuoteConversion_Flag <- as.factor(train$QuoteConversion_Flag)
for (att in important_list) {
  # Overlaid histograms of the attribute, one fill colour per label value
  p <- ggplot(train, aes(.data[[att]], fill = QuoteConversion_Flag)) +
    geom_histogram(alpha = 0.5, position = 'identity') +
    ggtitle(paste0('Histogram of attribute ', att))
  # A ggplot object is only drawn inside a loop when printed explicitly
  print(p)
}

For example, Field7 at bin value ~25 could be useful for creating an additional variable.

Histogram of Important Feature
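
As a sketch of the Field7 idea above, an indicator variable could flag the rows that fall on the interesting value. The exact cut-off of 25 is an assumption read off the "~25" in the text, and train_demo and Field7_is_25 are hypothetical names on toy data; with the real data the same line would be applied to train.

```r
# Toy stand-in for the training data (values are made up)
train_demo <- data.frame(Field7 = c(1, 25, 24, 25, 7, NA))

# 1 when Field7 equals the assumed value of interest, 0 otherwise (NA counts as 0)
train_demo$Field7_is_25 <- as.integer(!is.na(train_demo$Field7) &
                                      train_demo$Field7 == 25)
print(train_demo)
```

Whether such a feature actually helps should be checked by retraining the model with and without it.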