Overview
This is a handy function that can be used to clean social media data scraped from the web.
Scraped social media data often abbreviates large counts with suffixes such as K (thousands), M (millions) and B (billions), for example 21.5K. You won’t be able to analyze these values numerically until you replace each suffix with the number it stands for.
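To make the problem concrete, here is a minimal sketch of what happens when suffixed counts are converted directly. The values in `raw_counts` are made up for illustration and do not come from real scraped data.

```r
# Hypothetical scraped follower counts (illustrative values only)
raw_counts <- c("21.5K", "3M", "870")

# Direct conversion fails for suffixed values: they become NA.
# as.numeric() also raises a coercion warning, silenced here for readability.
converted <- suppressWarnings(as.numeric(raw_counts))
print(converted)

# What the suffixes stand for: 21.5K = 21.5 * 1e3 and 3M = 3 * 1e6
intended <- c(21.5 * 1e3, 3 * 1e6, 870)
print(intended)
```

Only the value without a suffix survives the direct conversion, which is why the suffix has to be separated from the number and turned into a multiplier first.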
Code
# Function to convert textual social media counts (e.g. '21.5k') to numbers
social_media_cleanup <- function(x) {
  # Numeric input needs no conversion, so warn and return it unchanged
  if (is.numeric(x)) {
    warning('Input is already numeric.')
    return(x)
  }
  # Keep only the numeric part; commas are treated as thousands separators
  numerics <- gsub('[A-Za-z,]', '', x)
  # Keep only the unit suffix (K, M or B)
  units <- gsub('[0-9.,]', '', x)
  # Values without a suffix keep a multiplier of 1
  multipliers <- rep(1, length(x))
  multipliers[grepl('K', units, ignore.case = TRUE)] <- 1e3
  multipliers[grepl('M', units, ignore.case = TRUE)] <- 1e6
  multipliers[grepl('B', units, ignore.case = TRUE)] <- 1e9
  return(as.numeric(numerics) * multipliers)
}
# Example
# Returns the values 21500, 214000000, 1204, NA and 642000000000
social_media_cleanup(c('21.5k', '214m', '1204', 'NA', '642b'))
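In practice the counts usually sit in a column of a data frame. The sketch below shows one way to apply the same suffix logic to such a column. The data frame `posts`, its column names and the helper `convert_counts` are hypothetical. The helper is a compact copy of the logic above, repeated only so that the snippet runs on its own.

```r
# Compact, hypothetical copy of the suffix logic so this snippet is self-contained
convert_counts <- function(x) {
  numerics <- as.numeric(gsub("[A-Za-z]", "", x))
  units <- gsub("[0-9.,]", "", x)
  multipliers <- rep(1, length(x))
  multipliers[grepl("K", units, ignore.case = TRUE)] <- 1e3
  multipliers[grepl("M", units, ignore.case = TRUE)] <- 1e6
  multipliers[grepl("B", units, ignore.case = TRUE)] <- 1e9
  numerics * multipliers
}

# Made-up example data: one row per post, likes stored as scraped text
posts <- data.frame(
  account = c("a", "b", "c"),
  likes = c("1.5k", "3M", "450"),
  stringsAsFactors = FALSE
)

# Add a numeric column next to the raw text so the original stays available
posts$likes_num <- convert_counts(posts$likes)

# The numeric column can now be summarized
print(posts)
print(mean(posts$likes_num))
```

Storing the result in a new column rather than overwriting `likes` makes it easy to check the converted values against the raw text.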