Every medical presentation has a section on epidemiology. Instead of making a boring slide with bland numbers, one can come up with an interesting data viz. The problem, unfortunately, is that we don’t have the necessary local data (state or even national level) in a form we can visualize.
That’s where Wikipedia comes to the rescue. It is a rich source of readily available data on a variety of topics. Best of all, it’s free. In this tutorial we shall see how to leverage Wikipedia for data scraping. You might be wondering why anyone should use R (or any programming language) for doing this. Why not just copy and paste the data into Excel and do the analysis/visualization there?
The advantages of R over other tools are
- Data management – if you want to merge data from several sources (as we shall see here), R is much stronger than any GUI tool
- Documentation – the cleaning process is well documented and anyone can see how you arrived at your results
- Future use – the R code you write can be stored in a file and reused later. After all, if many of your presentations have similar requirements, it makes sense to have a template
- Robustness – the same code works even if the source data changes, as may occur with websites that update their data often
The workflow is summarized in the following diagram
As shown here, the visualization is an iterative process, regardless of the tool you use.
Example: You are called to make a presentation on obesity. You want to make a data viz on Indian obesity and see which of our states have the highest obesity rates.
I googled ‘obesity India wikipedia’. The first hit takes me to this page – Obesity in India. The page has data on obesity rates by state, for both males and females. The data comes from the National Family Health Survey (NFHS 2007). Sadly there’s no data about Pondicherry 😦.
This will be our primary data source. At this stage, we won’t get into how they defined obesity (although it is vitally important in a real-life scenario).
I will be using a new package called ‘datapasta’ for web scraping. It allows you to simply copy and paste data into RStudio, and the code is written automatically. For well-formatted HTML tables, it works well. As we shall see later, we need a more complicated tool if the source table is not well formatted.
Data cleaning and visualization
#load the required libraries
library(tidyverse) # for data manipulation, primarily dplyr
library(rvest) # for webscraping
library(scales) #for formatting percentages
# use datapasta - click the addins tab in RStudio and use paste as tribble after
# copying data from the wikipedia page given above
data2 <- tribble(
~States, ~Males...., ~Males.rank, ~Females...., ~Females.rank,
"India", 12.1, 14, 16, 15,
"Punjab", 30.3, 1, 37.5, 1,
"Kerala", 24.3, 2, 34, 2,
"Goa", 20.8, 3, 27, 3,
"Tamil Nadu", 19.8, 4, 24.4, 4,
"Andhra Pradesh", 17.6, 5, 22.7, 10,
"Sikkim", 17.3, 6, 21, 8,
"Mizoram", 16.9, 7, 20.3, 17,
"Himachal Pradesh", 16, 8, 19.5, 12,
"Maharashtra", 15.9, 9, 18.1, 13,
"Gujarat", 15.4, 10, 17.7, 7,
"Haryana", 14.4, 11, 17.6, 6,
"Karnataka", 14, 12, 17.3, 9,
"Manipur", 13.4, 13, 17.1, 11,
"Uttarakhand", 11.4, 15, 14.8, 14,
"Arunachal Pradesh", 10.6, 16, 12.5, 19,
"Uttar Pradesh", 9.9, 17, 12, 18,
"Jammu and Kashmir", 8.7, 18, 11.1, 5,
"Bihar", 8.5, 19, 10.5, 29,
"Nagaland", 8.4, 20, 10.2, 22,
"Rajasthan", 8.4, 20, 9, 20,
"Meghalaya", 8.2, 22, 8.9, 26,
"Odisha", 6.9, 23, 8.6, 25,
"Assam", 6.7, 24, 7.8, 21,
"Chhattisgarh", 6.5, 25, 7.6, 27,
"West Bengal", 6.1, 26, 7.1, 16,
"Madhya Pradesh", 5.4, 27, 6.7, 23,
"Jharkhand", 5.3, 28, 5.9, 28,
"Tripura", 5.2, 29, 5.3, 24,
"Delhi", 45.5, 36, 49.8, 64
)
#remove India row and rename Males and Females
data <- data2 %>% rename(Males=Males....,Females=Females....) %>%
  filter(States != "India")
# draw the plot
plot <- data %>% arrange(Males) %>% ggplot(aes(x=reorder(States,Males,mean),y=Males,fill=Males>mean(Males)))+
  geom_col()+ coord_flip()+ theme_bw()+ # horizontal bars, black and white theme
  xlab("States")+
  ylab("Percentage of Obese Males") +guides(fill="none")
plot
The plot uses the black and white theme (a personal choice) and splits the states into those with a male obesity percentage above the average (mean) and those below it. The point is not the usefulness of such a classification; it is just to show how it is done. The coord_flip makes the bars horizontal. The resulting plot looks like this
Contrary to popular belief, a choropleth map was harder to read, so I settled on a barplot. Tamil Nadu appears among the most obese states. Curiously, the states/regions at the top also appear to be more developed and wealthier.
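The above/below-mean split that drives the bar colours can also be checked as a table. The following is only a minimal sketch: it uses five rows copied from the table above, so the mean here is the mean of this subset, not of all states, and the names subset_data and flagged are made up for illustration.

```r
library(dplyr)

# Five rows taken from the obesity table above (a subset, for illustration only)
subset_data <- tibble(
  States = c("Punjab", "Kerala", "Tamil Nadu", "Bihar", "Tripura"),
  Males  = c(30.3, 24.3, 19.8, 8.5, 5.2)
)

# Same logical test as the fill aesthetic: TRUE when a state is above the mean
flagged <- subset_data %>%
  mutate(above_mean = Males > mean(Males)) %>%
  arrange(desc(Males))

flagged
```

The above_mean column holds exactly the TRUE/FALSE values that ggplot maps to the two fill colours.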
You wonder what would happen if you added the wealth dimension to this simple graph. Far too often people think of 3D plots when speaking about adding another dimension. What I mean is mapping the color of the bars to some indicator of wealth to show its effect. This does NOT mean a 3D plot. (For the love of God, avoid making 3D bar graphs whenever possible.)
Data acquisition (source 2)
Now you want to add the wealth dimension. I will choose the state’s GDP as a proxy for wealth. Once again this is debatable and probably there are better markers. The focus is on the process here, not the specific choices.
Once again, we search Google for state GDP and head over to the Wikipedia page State wise GDP.
This time, however, copy-pasting into RStudio doesn’t work as intended: it produces a single-column data frame. The time has come to change tactics. We will use the rvest package to scrape the data this time. This requires writing some more code and cleaning, but it works.
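The rvest pattern can be tried offline before pointing it at a live page. The sketch below parses a small made-up HTML string instead of downloading anything; the table contents ("State A", "State B" and their figures) are placeholders, not real data, and toy_gdp is a hypothetical name.

```r
library(rvest)

# Made-up HTML standing in for a downloaded page; the values are placeholders
page <- read_html('
<html><body>
  <table>
    <tr><th>State/UT</th><th>GSDP</th></tr>
    <tr><td>State A</td><td>10.5 lakh crore</td></tr>
    <tr><td>State B</td><td>7.2 lakh crore</td></tr>
  </table>
</body></html>')

# Collect every <table> node on the page, then convert the one we want
tables <- html_nodes(page, "table")
toy_gdp <- html_table(tables[[1]])

# Make the column names syntactic so they can be used with $ ("State/UT" -> "State.UT")
names(toy_gdp) <- make.names(names(toy_gdp))
toy_gdp
```

On a real page there are usually several tables, so it helps to look at length(tables) and inspect a few of them before settling on an index.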
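# Store the url and use a temporary list to download the data programmatically.
# Then extract the required table from that list.
gdpurl <- "https://en.wikipedia.org/wiki/List_of_Indian_states_and_union_territories_by_GDP"
temp <- gdpurl %>% read_html() %>% html_nodes("table")
gdp <- html_table(temp[[1]]) # assumption: the first table is the required one; inspect temp and adjust the index
names(gdp) <- make.names(names(gdp)) # syntactic names; the code below expects State.UT and GSDP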
The resulting data shows some weird symbols – <U+20B9>. This is actually the Unicode representation of the Indian rupee symbol. The GSDP column is also a character vector, and it carries a description of the units (billions etc.) that we need to get rid of. We need just the numbers. This calls for a little bit of data cleaning.
# remove non ASCII character that represents indian rupee symbol
gdp$GSDP <- iconv(gdp$GSDP,"latin1","ASCII",sub="")
# Separate out the lakh crore and other description. Uses tidyr package.
# remove the row that shows India. select only numbers
gdp <- gdp %>% separate(col=GSDP,into=c("value","useless"),sep=" ") %>%
  filter(State.UT !="India") %>%
  select(State.UT,value)
#coerce the character vector as numbers
gdp$value <- as.numeric(gdp$value)
# rename the state column so that it matches the obesity data
gdp <- gdp %>% rename(States=State.UT)
#attempt an inner join
fulldata <- inner_join(gdp,data,by="States")
In the first step we use the iconv function to weed out non-ASCII characters. Then we separate the value from the description using tidyr’s separate function. tidyr is loaded automatically with the tidyverse package, so you don’t have to load it separately.
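The cleaning steps can be sketched on a toy data frame that needs no download. Everything in it is made up for illustration: the state names, the figures and the name toy are placeholders shaped like the scraped GSDP column. Because the string literals here are UTF-8, iconv is called with from = "UTF-8", and extra = "drop" is passed to separate only to silence the warning about the discarded third word.

```r
library(dplyr)
library(tidyr)

# Placeholder rows shaped like the scraped table; "\u20b9" is the rupee sign
toy <- tibble(
  State.UT = c("India", "State A", "State B"),
  GSDP     = c("\u20b9190.5 lakh crore", "\u20b910.5 lakh crore", "\u20b97.2 lakh crore")
)

# Drop every non-ASCII byte, which removes the rupee sign
toy$GSDP <- iconv(toy$GSDP, from = "UTF-8", to = "ASCII", sub = "")

clean <- toy %>%
  # keep the number in "value"; the unit words are discarded
  separate(col = GSDP, into = c("value", "useless"), sep = " ", extra = "drop") %>%
  filter(State.UT != "India") %>%
  select(State.UT, value) %>%
  mutate(value = as.numeric(value)) %>%
  rename(States = State.UT)

clean
```

After these steps value is a numeric column and the state column carries the same name as in the obesity data, which is what the join needs.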
Finally, we merge the GDP data with the original data set. Note that there are differences between the two datasets: the gdp dataset has data about union territories but not about Jammu and Kashmir. So we are left with 28 states/territories in the final dataset.
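One way to see which rows an inner join drops is dplyr's anti_join, which returns the rows of the first table that have no match in the second. The sketch below uses tiny stand-in tables: the obesity figures are copied from the table above, while gdp_toy and its Chandigarh row are picked purely for illustration and are not taken from the scraped data.

```r
library(dplyr)

# Three rows from the obesity table above
obesity_toy <- tibble(
  States = c("Punjab", "Kerala", "Jammu and Kashmir"),
  Males  = c(30.3, 24.3, 8.7)
)

# Stand-in for the GDP table; only the key column matters here
gdp_toy <- tibble(States = c("Punjab", "Kerala", "Chandigarh"))

# Rows that would be lost from each side of the join
only_in_obesity <- anti_join(obesity_toy, gdp_toy, by = "States")
only_in_gdp     <- anti_join(gdp_toy, obesity_toy, by = "States")

# Rows that survive the inner join
matched <- inner_join(gdp_toy, obesity_toy, by = "States")

only_in_obesity
only_in_gdp
matched
```

Running the two anti joins on the real data and gdp objects is a quick check that the unmatched rows are genuinely missing and not just spelled differently in the two sources.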
Now we do the visualization again. This time we colour the bars by GDP, separating the states above the median from those below it.
# redraw the plot in R and color the states as those above median GDP and those below
plot <- fulldata %>% arrange(Males) %>% ggplot(aes(x=reorder(States,Males,mean),y=Males,fill=value>median(value)))+
  geom_col()+ coord_flip()+ theme_bw()+ # horizontal bars, black and white theme
  xlab("States")+
  ylab("Percentage of Obese Males")+
  labs(fill="GDP above median")
plot
This produces the following plot
It looks like colouring by GDP has helped. Yet we see some states like Mizoram and Sikkim at the top, and states like West Bengal at the bottom! Something appears strange. Perhaps GDP isn’t a good proxy for income. Maybe the data source is incorrect. Maybe the definition of obesity is wrong. Maybe the data quality is not uniform across the states. We go back to the original question and start the data acquisition part again... and the loop continues.
If you don’t like the colors, there are plenty of ways to adjust that in R. Have a look at the ggthemes package. Arguably the same thing can be done much more easily in Stata (although you have to copy and paste the data, so it cannot be fully automated). Once this script is saved, you can reuse it for a future talk on, say, infertility or anything else for that matter.