[clean, wrangling, scraping, follow, likes, network, social media, function, numeric conversion]

Clean Numeric Data Scraped From Social Media



This handy function cleans numeric counts, such as followers and likes, that were scraped from social media pages.

When you scrape social media, counts often come back as abbreviated text with a suffix: K (thousands), M (millions), or B (billions). You cannot analyze these values numerically until you strip the suffix and multiply the remaining number by the matching factor, so that '21.5k' becomes 21500.
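To illustrate the problem, here is a minimal sketch in base R. The vector `followers` holds made-up values and is not data from the original post. Converting such strings directly with `as.numeric()` turns every suffixed entry into `NA`.

```r
# Hypothetical scraped counts (illustrative values only)
followers <- c('21.5k', '214m', '1204')

# Direct conversion cannot interpret the suffixes: the first two entries
# become NA (R also raises a coercion warning, silenced here)
direct <- suppressWarnings(as.numeric(followers))
print(is.na(direct))
```

Only the plain '1204' survives the conversion, which is why the suffixes have to be handled explicitly.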


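# Function to convert textual social media counts to numeric values
social_media_cleanup <- function(x) {
  # Numeric input needs no conversion: warn and return it unchanged
  if (is.numeric(x)) {
    warning('Input is already numeric.')
    return(x)
  }
  # Keep the numeric part, e.g. '21.5k' -> '21.5' (commas are assumed to be thousands separators)
  numerics <- gsub('[A-Za-z,]', '', x)
  # Keep the unit suffix, e.g. '21.5k' -> 'k'
  units <- gsub('[0-9]|[.]|[,]', '', x)
  multipliers <- rep(1, length(x))
  multipliers[grepl('K', units, ignore.case = TRUE)] <- 1000
  multipliers[grepl('M', units, ignore.case = TRUE)] <- 1E6
  multipliers[grepl('B', units, ignore.case = TRUE)] <- 1E9
  # Entries without a numeric part, such as 'NA', become NA
  suppressWarnings(as.numeric(numerics)) * multipliers
}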


# Example
social_media_cleanup(c('21.5k', '214m', '1204', 'NA', '642b'))
# Returns the numeric values 21500, 214000000, 1204, NA, 642000000000
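As a possible variation, the suffix-to-multiplier mapping can be kept in a named vector, so that adding a new unit takes one entry instead of another `grepl` line. The sketch below is not the contributor's method; the names `suffix_multipliers` and `parse_count` are hypothetical, and it assumes the suffix is always the last character and that commas are thousands separators.

```r
# Hypothetical lookup table: unit suffix -> multiplier
suffix_multipliers <- c(k = 1e3, m = 1e6, b = 1e9)

# Hypothetical helper: parse counts such as '21.5k' using the lookup table
parse_count <- function(x) {
  x <- tolower(trimws(as.character(x)))
  # The last character is treated as a unit only if it is a known suffix
  suffix <- substr(x, nchar(x), nchar(x))
  has_suffix <- suffix %in% names(suffix_multipliers)
  # Drop the suffix where present, then remove thousands separators
  digits <- ifelse(has_suffix, substr(x, 1, nchar(x) - 1), x)
  digits <- gsub(',', '', digits, fixed = TRUE)
  multiplier <- ifelse(has_suffix, suffix_multipliers[suffix], 1)
  # Anything that is not a number, such as 'NA', becomes NA
  suppressWarnings(as.numeric(digits)) * unname(multiplier)
}

counts <- parse_count(c('21.5k', '214m', '1204', 'NA', '642b'))
print(counts)
```

Unlike the `grepl` approach, this variant looks only at the final character, so trailing words such as 'likes' would need to be removed first.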
Contributed by Thierry Lahaije