[clean, wrangling, scraping, follow, likes, network]


Clean Data Scraped From Social Media

Overview

This function cleans abbreviated counts, such as follower and like totals, in social media data scraped from the web.

Scraped social media counts are often abbreviated with letter suffixes such as K (thousands), M (millions), and B (billions). You won't be able to analyze these values numerically until you replace each letter with the corresponding multiplier, so that 21.5K becomes 21500.
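To make the conversion concrete, here is a minimal sketch for a single value, assuming the suffix always follows the number. The names `suffix_multiplier`, `raw_value`, `number_part`, and `suffix_part` are illustrative and not part of the function shown under Code.

```r
# Hypothetical lookup table: suffix -> multiplier (illustrative names)
suffix_multiplier <- c(K = 1e3, M = 1e6, B = 1e9)

raw_value <- "21.5K"

# Strip the trailing letters to keep the number
number_part <- as.numeric(sub("[A-Za-z]+$", "", raw_value))

# Strip the leading digits and separators to keep the suffix
suffix_part <- toupper(sub("^[0-9.,]+", "", raw_value))

# Multiply the number by the value its suffix stands for
converted <- number_part * suffix_multiplier[[suffix_part]]
print(converted)
```

The same idea, applied to a whole vector at once and with a default multiplier of 1 for values without a suffix, is what the full function does.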

Code

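# Function to convert textual social media counts (e.g. '21.5k') to numbers
social_media_cleanup <- function(x) {
  # Numeric input needs no conversion, so return it unchanged
  if (is.numeric(x)) {
    warning('Input is already numeric.')
    return(x)
  }
  # Keep the number: drop letters and thousands separators
  numerics <- gsub('[A-Za-z,]', '', x)
  # Keep the unit suffix: drop digits, decimal points, and commas
  units <- gsub('[0-9.,]', '', x)
  # Default multiplier is 1 for values without a suffix
  multipliers <- rep(1, length(x))
  multipliers[grepl('K', units, ignore.case = TRUE)] <- 1e3
  multipliers[grepl('M', units, ignore.case = TRUE)] <- 1e6
  multipliers[grepl('B', units, ignore.case = TRUE)] <- 1e9

  # Entries with no digits left (such as the string 'NA') become NA
  return(as.numeric(numerics) * multipliers)
}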

# Example
social_media_cleanup(c('21.5k', '214m', '1204', 'NA', '642b'))
# Returns 21500, 214000000, 1204, NA, 642000000000 (R may print these in scientific notation)
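
After converting, it can be worth checking which entries failed to parse and printing the large values in a readable form. The sketch below hardcodes the example input and the values the example call returns, so it runs on its own. The names `raw`, `cleaned`, `unparsed`, and `readable` are illustrative.

```r
# Example input and the values social_media_cleanup() returns for it
raw <- c('21.5k', '214m', '1204', 'NA', '642b')
cleaned <- c(21500, 214e6, 1204, NA, 642e9)

# Entries that could not be converted to a number
unparsed <- raw[is.na(cleaned)]
print(unparsed)

# Print without scientific notation, with thousands separators
readable <- trimws(format(cleaned, big.mark = ',', scientific = FALSE))
print(readable)
```

Inspecting the unparsed entries helps separate values that were genuinely missing from ones the scraper captured in an unexpected format.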