I'm in the middle of prepping a new blog post based on some exciting old data. Rather than rush it, I decided to get everything nicely organised so I can publish the experiment on Gorilla open materials and put the raw data, preprocessing and analysis code on OSF. This has meant getting serious with R and doing lots of tutorials online, so it's taking longer than expected (the 33+ degree heat wave isn't helping either!).
Now that I'm starting to get more comfortable with R, I thought I'd revisit an old data visualisation tweet from a couple of years ago and try to improve upon it.
Back in 2017, I went from working in other people's labs to becoming a fully independent researcher and starting my own lab. At this time, one of the big talking points on scientific twitter was how early career researchers name their new labs. The Scientific Twitter Police were quite adamant that labs shouldn't be named after the person in charge, but rather the lab name should reflect its scientific goals. I wasted months on this! What kind of science was my lab going to produce? And could it be boiled down to a couple of words?
For some scientists, coming up with a lab name is simple enough because they have a clear research theme like language, attention, or the properties of a specific brain area like the cerebellum. This is actually a good career move: branding and marketing are just as important in science as in industry! In 2017/2018 I was leading projects on EEG methodology, brain parcellation and neuroanatomy, perceptual decision making, neuroeconomics in autism, and mapping a mouse brain onto a human brain. There isn't one word or topic that really ties all this together! So I decided to solve this problem by letting the data speak for itself and creating a wordcloud based on my scientific outputs.
When I made my first wordcloud in 2017 I just copied and pasted all my abstracts manually into a wordcloud generator webpage. Now I've embraced R and started using the easyPubMed and ggwordcloud packages to see about scripting this process. Let's start with getting your scientific information from PubMed.
To install easyPubMed you can use the following command in R
install.packages("easyPubMed")
Once installed, add the package to the library.
library(easyPubMed)
The next part is to set up your query. Here, just type whatever you would type into your PubMed search. In my case, I used my name with [AU] to constrain this to the author field.
my_query <- 'Balsters J[AU]'
These next parts probably won't need to change too much as they just extract the PubMed IDs that match the query above and then use the IDs to get all the extra information such as co-authors, titles, dates, abstract, keywords etc.
my_entrez_id <- get_pubmed_ids(my_query)
my_abstracts_xml <- fetch_pubmed_data(pubmed_id_list = my_entrez_id)
This next part selects the specific data you're interested in from PubMed. I just wanted the abstracts, hence I used "AbstractText" in the command below. To understand the different fields I just printed the my_abstracts_xml variable in the console and scrolled through to find what I wanted.
text <- custom_grep(my_abstracts_xml, "AbstractText", "char")
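Before going any further it can be worth a quick look at what came back. This is only a minimal sketch, assuming the result is a plain character vector with one element per AbstractText field. The vector text_demo below is a made-up stand-in, not real PubMed output, and the clean-up step is my own suggestion rather than something easyPubMed does for you.

```r
# Hypothetical stand-in for the character vector returned by custom_grep()
text_demo <- c("Connectivity of the cerebellum.",
               "",
               "Connectivity of the cerebellum.",
               "Motor cortex and social decision making.")

# Drop empty strings and exact duplicates before any text mining
text_clean <- unique(text_demo[nzchar(text_demo)])

length(text_clean)      # how many text chunks survived
sum(nchar(text_clean))  # total number of characters left to analyse
```

If the counts look far too small or too large for your publication list, that is a hint the query needs tightening.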
And there you have it! Five lines of code and you can automatically extract whatever data you need from PubMed. This is obviously much faster and more efficient than my previous copy and paste approach. So what now?
The best way to visualise the recurring themes in my research (and therefore what my lab should be called) was to produce a wordcloud plot. For anyone not familiar, a wordcloud is a type of data visualisation that looks at word frequency. If a word is used more often then it's bigger in the picture.
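To make the "word frequency" idea concrete, here is a minimal base R sketch that counts words in a single made-up sentence. The sentence and variable names are purely illustrative.

```r
# Made-up sentence standing in for abstract text
sentence <- "the motor cortex and the social cortex share connectivity with the cortex"

# Split on white space and count how often each word appears
words <- strsplit(sentence, "\\s+")[[1]]
freq <- sort(table(words), decreasing = TRUE)

head(freq, 3)
```

In this toy example "the" is used just as often as "cortex", which is exactly why common stop words usually get removed before plotting.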
After collecting all your text from PubMed, the easiest option is to print the variable 'text' and copy and paste the output into an online wordcloud generator. I created mine on https://www.wordclouds.com/. I got the image to make the brain shape here; you just need any image where the background is transparent. I turned the mask on to highlight the brain a bit more clearly and changed the font to Cabin Sketch.
In the second half of this blog I try to recreate this wordcloud with R... in short it didn't go well and I'd recommend just using an online wordcloud generator like the one I've suggested. However, if you're interested in text mining and want to try and use R to create this I've written down my steps below.
And there you have it! A beautiful figure that clearly highlights my key areas of research - connectivity, cortex, social, motor, ASD. So what did I learn from doing this? Did I start the connectivity, cortex, social, motor, ASD lab? Well no, CCSMAL isn't a great acronym... I was a bit surprised that connectivity was the most commonly used word in my research and that did push me to focus more on my brain connectivity research. For me, this exercise hammered home the diversity of my research career. Rather than agonise about how to consolidate my research into my unique scientific selling point, I realised that I wanted to embrace my variety and keep doing work in any area that excited me - the Exciting and Applied Research (EAR) lab.
Now for those of you that want to create the wordcloud in R, here's where I got to. In short the text mining features are robust, easy to use, and can be quite useful if you want to clean your text data. However, the wordcloud packages leave a lot to be desired and I couldn't make anything close to what I made using online tools.
You'll need to install the following packages and add them to the library
install.packages("tm") # for text mining
install.packages("SnowballC") # for text stemming - optional
install.packages("ggwordcloud") # word-cloud generator
install.packages("RColorBrewer") # for color palettes
library("tm")
library("SnowballC")
library("ggwordcloud")
library("RColorBrewer")
Use the Corpus function from the tm package to load the documents. You can use the inspect function to print this out in the console window.
docs <- Corpus(VectorSource(text))
inspect(docs)
There may be some special characters you want to remove from the text data. The lines below show you how to create a function toSpace that will change a given character into white space. Lines 2-3 give two examples which use toSpace to remove / and @ from your text.
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "@")
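If you want to see what that substitution does without building a corpus, here is a minimal sketch using gsub directly on a made-up string. One thing to watch: gsub treats the pattern as a regular expression, so characters with a special meaning (such as |) need fixed = TRUE or escaping. The example strings are hypothetical.

```r
# The same substitution that toSpace wraps, applied to a made-up string
x <- "EEG/fMRI data, contact me@example.org"
x <- gsub("/", " ", x)
x <- gsub("@", " ", x)
x

# Characters that are special in regular expressions are safer with fixed = TRUE
y <- gsub("|", " ", "EEG|MRI", fixed = TRUE)
y
```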
The next few lines are quite standard when working with text data. They remove:
capital letters (converted to lower case)
numbers
stop words (e.g. it, and, the, to)
custom words
punctuation
extra white space - including the gaps left where the special characters were replaced earlier
# 1 Convert the text to lower case
docs <- tm_map(docs, content_transformer(tolower))
# 2 Remove numbers
docs <- tm_map(docs, removeNumbers)
# 3 Remove common English stop words
docs <- tm_map(docs, removeWords, stopwords("english"))
# 4 Remove your own stop words
# specify your stop words as a character vector
docs <- tm_map(docs, removeWords, c("blabla1", "blabla2"))
# 5 Remove punctuation
docs <- tm_map(docs, removePunctuation)
# 6 Eliminate extra white spaces
docs <- tm_map(docs, stripWhitespace)
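To get a feel for what each of these steps does, here is a rough base R walk-through on one made-up sentence. These are my own approximations of the steps, not the tm internals, and the three stop words are hand-picked for the example rather than taken from the full English list.

```r
# Made-up sentence standing in for one abstract
x <- "The 2 EEG datasets, and the 3 MRI scans, were analysed."

x <- tolower(x)                                        # 1 lower case
x <- gsub("[[:digit:]]+", "", x)                       # 2 remove numbers
x <- gsub("\\b(the|and|were)\\b", "", x, perl = TRUE)  # 3/4 remove a few stop words
x <- gsub("[[:punct:]]+", "", x)                       # 5 remove punctuation
x <- trimws(gsub("\\s+", " ", x))                      # 6 collapse spaces (trimws is an extra)
x
```

Printing x after each line is a handy way to check that a step is doing what you expect before running it over the whole corpus.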
Everything so far has used the tm package. This next line uses the optional SnowballC package to perform text stemming. stemDocument applies a stemming algorithm (the Snowball stemmer for English, provided by SnowballC) that strips common word endings, rather than looking words up in a dictionary. This means that variations of the same word (e.g. moved, moving) can be reduced to a shared stem (move). When I applied this to my text data I saw that "connectivity", my most frequent word, got larger because it was shortened to "connect" along with "connections". So this may or may not be useful to you.
# Text stemming
docs<-tm_map(docs, stemDocument)
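If you're not sure whether stemming will help, you can preview it on a handful of words first. A minimal sketch, assuming SnowballC is installed and calling its wordStem function directly. The word list is just an example.

```r
library(SnowballC)

# A few example words to preview before stemming the whole corpus
words <- c("connectivity", "connections", "moved", "moving")
stems <- wordStem(words, language = "english")

# Side-by-side view of each word and its stem
data.frame(word = words, stem = stems)
```

Words that end up sharing a stem will be counted together, so their combined frequency (and size in the wordcloud) goes up.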
The next step is to convert your text data into a TermDocumentMatrix. This essentially turns your text into a table of words and how frequently they occur.
dtm<-TermDocumentMatrix(docs)
m<-as.matrix(dtm)
v<-sort(rowSums(m),decreasing=TRUE)
d<-data.frame(word=names(v),freq=v)
head(d, 10)
The head function above prints the ten most frequent words in the console.
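To see what the matrix, rowSums and sort steps are doing, here is the same logic on a tiny hand-made matrix. The words and counts are invented for illustration. Rows are words and columns are abstracts, which is the layout a TermDocumentMatrix uses.

```r
# Toy term-document matrix: rows are words, columns are abstracts (made-up counts)
m_demo <- matrix(c(2, 0, 1,
                   1, 3, 0,
                   0, 1, 1),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("cortex", "connectivity", "motor"),
                                 c("abstract1", "abstract2", "abstract3")))

# Total count per word across all abstracts, most frequent first
v_demo <- sort(rowSums(m_demo), decreasing = TRUE)
d_demo <- data.frame(word = names(v_demo), freq = v_demo)
d_demo
```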
You can easily plot this using a bar plot function
par(mar=c(6,4,4,1)+.1) # This makes the margins bigger so you don't cut off axis labels
barplot(d[1:10,]$freq, ylim=c(0,70), las = 2, names.arg = d[1:10,]$word,col ="lightblue", main ="Most frequent words", ylab = "Word frequencies")
But the objective of today is to produce a wordcloud! To do this I've been using the ggwordcloud package. This lets you use the ggplot grammar for defining aesthetics, followed by geom_text_wordcloud_area but you can also use the ggwordcloud and ggwordcloud2 functions to bypass the need for ggplot grammar and just define elements directly.
There are lots of examples online here. In the example below I reduced the words to the top 100 to plot more quickly and test out code. I also set the shape to be a circle.
nsample<-100
d2<-d[1:nsample,]
ggplot(d2, aes(label = word, size = freq,
               color = factor(sample.int(8, nsample, replace = TRUE)))) +
  geom_text_wordcloud_area(
    shape = "circle",
    eccentricity = 1,
    rm_outside = TRUE) +
  scale_color_brewer(palette = "Dark2") +
  theme_minimal() +
  scale_size_area(max_size = 10)
This is about as far as I got. There is a mask option, so I could theoretically use the brain mask from before, but I couldn't get that working. There's a lot about this on the forums so it's not just me. I also found that half the useful features were in ggwordcloud (e.g. filtering words based on minimum frequency or capping the number of words), while the other half were in ggwordcloud2 (using shapes and masks).
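One possible workaround for the frequency filtering and word cap is to apply them to the data frame yourself before plotting, so it doesn't matter which plotting function you end up using. A minimal base R sketch with a made-up frequency table in the same shape as d. The thresholds min_freq and max_words are hypothetical names and values.

```r
# Made-up frequency table in the same shape as d (columns word and freq)
d_demo <- data.frame(word = c("connectivity", "cortex", "social", "motor", "asd", "mouse"),
                     freq = c(40, 31, 22, 18, 9, 2),
                     stringsAsFactors = FALSE)

min_freq  <- 5  # drop words used fewer than 5 times
max_words <- 4  # keep at most the 4 most frequent words

# Filter by minimum frequency, then keep the top max_words rows
d_filtered <- d_demo[d_demo$freq >= min_freq, ]
d_filtered <- head(d_filtered[order(-d_filtered$freq), ], max_words)
d_filtered
```

The filtered data frame can then be passed to the plotting code in place of d2.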
At the end of the day, this works but it's much less impressive (and much slower) than what I created using the free wordcloud generator website. You have to know when to give up and take the easy option =o)