Academia

Getting started with case reports

A case report is the perfect starting point for a resident new to scholarly publishing. It is easy to write, requires little creativity (after all, it is just the documentation of a patient who came to see the doctor) and, though it has limited impact, it has good educational value. More than anything else, it lowers the barrier to scientific writing.


There is a catch though – case reports are low-hanging fruit. Accordingly, there is quite a bit of competition: a lot of people want to write, but very few publishers want to publish. This has created a vacuum which has been filled by speciality case report journals. These journals publish only case reports and therefore have a much higher acceptance rate, somewhere in the range of 30 to 70%. The increased demand also creates a situation where publishers may resort to questionable practices. In fact, almost half of these journals have been found to be dubious.



How to identify the genuine journals?


The trick is to find those case report journals which are PubMed Indexed. Only one PubMed Indexed journal (published by Baishideng group) is known to indulge in questionable practices [refer to the Excel file linked at the end of the article]. So a case report journal that is PubMed Indexed is highly likely to be genuine. For example, my first publication was a case report in BMJ Case Reports. BMJ Case Reports has a decent acceptance rate, but in order to submit, one of the authors or the institution must have a subscription. An individual subscription costs around 185 GBP (around Rs. 15000), but just one subscription in a department is more than enough. Be sure to check if your institution has a subscription, in which case you can contact the librarian to get the submission access code. BMJ Case Reports doesn’t have an impact factor as such (many case-report-only journals don’t). However, you can use the scimagojr 2-year citations per article as a reasonable proxy.


Of course, case reports are also published by journals that publish other stuff like reviews and original articles. However the acceptance rate is likely to be lower in these journals. If you are confident of your material, it is best to try in a general journal first before trying a case reports only journal. When in doubt, ask an expert.



A master list of case reports only journals can be accessed in Excel format here. Sadly I couldn’t get a master list of submission fees – if you have details on that, do let me know. If you found this post useful, please share with your friends.


Further reading


New journals for publishing medical case reports

Online workflow for writing articles

It has become increasingly common for people to collaborate on writing projects. The tools that enable such collaboration have improved over the years too and currently allow for a completely online workflow. Unfortunately many residents and early career researchers don’t take advantage of the recent developments. In this post, I will outline a completely online workflow for writing articles.
This way, you can work with any number of people on the same project and all of you can have access to the same digital library from which you can cite. You might point out that the team library functionality has been available for quite some time now in popular reference management software like Zotero. However, without jumping through a few hoops, you can’t get Zotero to work seamlessly with Google Docs.
Of late, I am increasingly using Google Docs for my document preparation needs. Sure, it isn’t MS Word, but few people need the full power of MS Word for their routine documents. The ‘portability’ of a Google Docs document is particularly attractive to me since I have computers running different operating systems.
Here’s my completely online workflow. Every component of the workflow is free (as in free beer).

workflow

The advantages of this online workflow include

  • No need to install any software
  • You always get the latest and greatest version
  • OS/device independent
  • Collaboration is easy and seamless

The F1000 workspace also has a desktop client, and you can start working even if you already have a pdf collection. It also has a Word add-in, if you prefer to write in MS Word. Try it out for your next article. You will be pleasantly surprised.

5 Rupees medicines and diabetes

 

Story time.
Once upon a time, there lived a conman. He decided to make a quick buck. He sold 1000 lottery tickets, each at a price of $5. The bumper prize was $1000. So you could buy a ticket for $5 and, if you were lucky, get $1000. He marketed it by saying that if you lost, you lost just $5. But think of what you would get if you won – $1000. That is 200 times your initial investment, or an insane profit of close to 20000%. There was a mad rush to buy the tickets. The numbers on the tickets were 8 characters long and alphanumeric. It was a clever ploy. Had he numbered the tickets sequentially from 1 to 1000, people would easily have remembered them. His alphanumeric system precluded that possibility. A few days later, the results were released. Photos were flashed of the winner getting a cheque. Many who had bought the tickets were disappointed. But they went about their lives as usual. After all, the loss was tiny. One of the guys who lost was a little suspicious. No one he knew had won the prize. He decided to dig deeper. When he confronted the conman with questions, he was told to either prove his theory in court (a legal battle that would ruin him financially) or take his $5 as a refund. The poor guy got his $5 back, while the conman made $4995. No one actually won the lottery.

Recently I was asked by a patient about the effectiveness and side effects of two drugs called BGR-34 and IME-9. I must admit I hadn’t heard of those drugs. As an allopathic doctor and an endocrinologist, the names sounded odd to me. Such names are reminiscent of candidate molecules being tested. Nevertheless I decided to dig deeper and find out more.

BGR-34 is an ayurvedic medicine developed jointly by the National Botanical Research Institute (NBRI) and the Central Institute for Medicinal and Aromatic Plant (CIMAP), both funded by the government. BGR-34 stands for Blood Glucose Regulator with 34 active phyto-ingredients. The NBRI website unfortunately doesn’t give many details. The drug is touted as having a 67% ‘success rate’ based on animal studies. No human data are available. The senior principal scientist AK Rawat has said, “The drug has extracts from four plants mentioned in Ayurveda and that makes it safe”. I have no idea how that will make a drug safe! Tell me if you do.

IME-9 stands for Insulin Management Expert. It was developed by another government body called CCRAS (Central Council for Research in Ayurvedic Sciences). This one doesn’t have much information listed either.
I decided to check for any publications on BGR-34 and IME-9. Unfortunately I couldn’t find one. Next I proceeded to the AIMIL pharmaceuticals page (the company that is licensed to manufacture and market this drug). When I tried to see the products page, it told me helpfully that I was not authorized to view that page and that I should login! Why should I login to view a company’s drug landing page? I’ve never had to ‘login’ to see the details of any drug in the past!

aimil

Even more brazenly, the privately owned entity has used the DRDO logo on its website to give itself an ‘official’ veneer. In an event attended by Dr Man Mohan Singh and JP Nadda, the company was also awarded the AYUSH company of the year!

aimil2

So I turned to the mother of all search engines, Google, for help. There were a few blog posts. One of them lamented that the drug had actually increased the blood glucose in the author’s mother! Now the ridiculousness of checking blog posts for evidence of efficacy is an injury. To watch YouTube videos on the same topic is insult to that injury. Nevertheless, I did that too. The comments on the YouTube videos were mostly negative [there are tools to do a formal ‘sentiment analysis’, though I haven’t done one here]. Yet one common thread was the feeling that native medicines were free of side effects. How many more years will it take for people to understand that there is no such thing as a drug without side effects? (Paracelsus said this about five hundred years ago.)

All drugs are poisons. Only dose makes the difference.

                                                                   – Paracelsus

Curiously these drugs are marketed as 5 rupee medicines for diabetes. Metformin and sulphonylureas, the two most commonly used diabetes medications with a huge evidence base, cost much less than 5 rupees. Yet no one has ever marketed them as 5 rupee wonder drugs! There’s even a Facebook page for this drug with a rating of 4.5! Ever seen Facebook pages for drugs before? Me neither. It is available on Amazon, eBay and Snapdeal, not to mention some less well known online retailers.
Just as we can’t accept anecdotal evidence for efficacy as valid, we can’t accept anecdotal evidence for the lack of efficacy as valid either. Much as I hated it, I was forced to say that this drug’s efficacy was ‘unknown’. More importantly, the safety was questionable too.
We have no data at all. So who is tasked with regulating this market? Why is our tax money used to fund such projects incompletely? What prevents them from testing the drugs? How are they made to trend in pill selling apps like 1mg? Why are they marketed as government approved drugs? Why are these drugs marketed to patients directly? Too many questions, too few answers.
At the end of the day, diabetes is as much a disease of behaviour as it is of beta cells. The lure of a cheap drug without side effects, with no need for any pesky lifestyle changes, becomes irresistible to the common man. Thus clever marketing always works (just like it works for pharma on doctors). Direct to consumer marketing must be banned, regardless of the system of medicine practiced, for the simple reason that the patient isn’t qualified to make an informed choice. Otherwise we will always have the lottery tickets and dubious drugs.

Science shouldn’t be sacrificed at the altar of business.

Further reading
https://www.ayurtimes.com/bgr-34-for-diabetes/

www.aimilpharmaceuticals.com

Scientific publishing: Online platforms for writing

The early computers didn’t have a pretty user interface. They were geared towards the nerds and hobbyists – that was until Steve Jobs laid his eyes on the GUI (graphical user interface) developed by Xerox at its Palo Alto Research Center. The rest, as they say, is history.
Around the same period, a Stanford computer scientist, Donald Knuth, developed an entire document preparation system. It was robust and it was immediately lapped up by the mathematics community, for it was a great way to format equations beautifully. This system was later made a little more friendly by Leslie Lamport and was called \LaTeX.
It had many advantages

  • It delinked content and formatting
  • It could be scripted /automated with macros
  • It was just plain text with markup – thus making version control easy
  • It could output to a variety of formats that could be viewed from practically any device
  • It was free and open source

Unfortunately \LaTeX was and still is a little hard to learn. Consequently the life sciences community heavily uses WYSIWYG (What You See Is What You Get) programs like MS Word for scientific writing.

\LaTeX needed a front end –

  • one that is easy to use
  • does not require any installation (or is at least open source and cross platform)
  • has an easy way to add tables, figures and citations (which are not so easy to do if you use LaTeX and a customized citation style)

LyX was the first step – but it requires a local \LaTeX installation. Then came the likes of Overleaf and ShareLaTeX. They were both online, thus freeing the user from the need to install anything locally. Unfortunately they still very much retain the \LaTeX flavor and are thus not suitable for the average doctor. Then came Authorea – it had a freemium model, it was online, it was easy to use and it almost felt like the long battle to develop an unintimidating face for \LaTeX had succeeded. It has a few templates for life science journals, but the operative word here is few.
I had thought that the whole process was complete. You could write papers online, collaboratively with anyone from anywhere, and produce camera ready pdfs!
Well, it turns out life isn’t that simple. Enter the publisher. Each publisher has specific formatting guidelines for clarity, uniqueness and good design. So most life science publishers prefer to receive the manuscript in MS Word format and wouldn’t want to touch your tex files. (There are, of course, some exceptions.) This formatting of everything as per in-house style (including citations) keeps the journal unique, and the problem is unlikely to be solved anytime soon. As I said before, LaTeX can have a nice front end, and it can be scripted and automated. So what if we could have a web app that does everything Authorea does and generates a journal specific pdf at the click of a button? It would solve the problem for the authors as well as the publishers!
Typeset is one such tool. It is online, free (currently in beta), incredibly easy to use (text, tables, images and citations), can be versioned and can generate a pdf as per the journal requirements at the click of a button! I haven’t been this excited about the possibilities of a piece of academic software in a long time. I will outline some in the next post. For now, feel free to check out Typeset.

Here’s a comparison of these web apps

Feature Overleaf Sharelatex Authorea Typeset
Free option
Online
LaTeX usage
Version control ✔ (premium account only)
For those who don’t know LaTeX
Journal Specific Styles ✔ ( only few for medical journals) ✔ (4500+ journals)
Reproducible research(text+data+code)
Social Tools(comments,chat)

Bonus: If you have a Mac and would prefer an installed app with similar functionality, check out Manuscripts app.

Note: Of course, the most popular and straightforward online option is Google Docs, but I guess you are already quite proficient in its use.

Virtual Journal Club – a first experience

The journal club is one of the most rewarding experiences of postgraduate training. Almost every department has a journal club, in which the interesting and important articles are discussed. The scope of this discussion though is limited to the department. For instance, if a paper is presented in the Medicine department of a college, only those in Dept of Medicine of that college benefit.

For a long time, there was no way to improve the “scope” of any presentation. With the advent of the web, though, this has changed dramatically. It is possible today to have a presentation that reaches across the country via the web. This results in greater participation and a cross pollination of ideas.

Several portals can be thought of

Portal | Pros | Cons
WhatsApp | Easy to use | Difficult to search; does not approximate a real presentation; difficult to index and catalogue
Screencast | Approximates a “real” presentation; allows people from different timezones to contribute to discussion | Technical complexity; time consuming; doesn’t allow direct interaction
Webinar | Closest to a real presentation | Technical complexity; may require additional person
Forum | Content can be indexed and catalogued; allows people from different timezones to contribute; more permanent; access control | Technically most complex; hosting and managing expenses
Telemedicine | Official; best two-way interactivity | Requires institutional setup; expensive
Facebook Group | Same as WhatsApp, but no limit on number of members | Same as WhatsApp

The idea of a journal club for early career Endocrinologists was mooted in the FRENDOS WhatsApp group. Since the idea was fairly new, we wanted to try out a few different approaches before settling on one. Indeed I am not even sure if we should settle on one – as variety adds spice.

So I made a YouTube presentation yesterday – the first virtual journal club in the Indian endocrine setting. The article is from JAMA. You can access the original article here.

You can access the original journal club here. The original video had weak audio. So I have enhanced the volume and the upgraded link can be accessed by clicking the image below.

presentation

The presentation can be downloaded from here.

We had more than a hundred attendees – a number that is virtually impossible in traditional journal club settings. Indeed this will work for any presentation. Eventually we ended up using the audio-video platform of YouTube and the ease of use of WhatsApp to make the first virtual journal club.

It turns out that making a screencast is a rather enjoyable process. It does require a presentation, but most residents already make one anyway. The technical bottlenecks are familiarity with screencasting software and a decent quality audio input (a $30 headset is more than enough).

The following are the minimum requirements

  1. Presentation – made in LaTeX beamer. Of course it can be made in PowerPoint or any tool of your choice
  2. A headphone / mic – to record audio
  3. Screen recording software – Camtasia Studio for Windows is a premium all in one package for this. If you’re on Linux, you can get the same functionality by installing Vokoscreen or SimpleScreenRecorder. After recording, the video can be edited with Pitivi (if you prefer the GNOME desktop environment like me) or Kdenlive (if you use KDE as the desktop environment)
  4. Audio Editing – Audacity, an open source software for enhancing audio. This is extremely useful for removing noise and amplifying the voice.
  5. YouTube account – for uploading the video. You can use any of the other video uploading sites as well.

As you can see, the whole toolchain can be completely free and opensource. A screencast can thus be produced and shared at absolutely zero cost. We did the discussions after watching the video on WhatsApp. Since a YouTube link can be shared with anyone, the potential reach can be theoretically very high.

We will be experimenting with other portals soon and hopefully improve next time. If you feel the paper/presentation is useful, please share with your PGs, juniors or friends.

Titulus Insipidus

The first thing we read when we see an academic article is its ‘title’. The attention span of readers has been going down steadily, not because the academic content has somehow taken a downward tumble – but because the other digital content we consume has become more and more catchy. It isn’t uncommon to find people skimming through articles and learning as much as they can from the title. Thus more than anything else, it is the title that should deserve maximum attention while writing an article. Unfortunately this doesn’t always happen.

Take any speciality or general academic journal, go through the titles of the articles it publishes and you are apt to find the symptoms of ‘Titulus Insipidus’ – the malady of bland and boring titles. [Checked one item off my bucket list. I always wanted to coin a term ;-)] One might wonder why a researcher would deliberately choose a boring title when a more interesting title is possible.

Why does this happen?

There are perhaps a few reasons why this might happen

  • Title may not be a conscious choice – the writer defaults to the first idea that strikes him/her. This happens more often than we might think

  • Disciplinary conventions
    Certain disciplines are more rigid in following conventions and the young researcher often learns the tricks of the trade by mimicking the articles in the most popular journal of his field. This encourages conformity and discourages experimenting with stylistic choices.

  • The idea that offbeat titles are just a gimmick and a way to sell poor content

  • SEO: Experienced writers are well aware of the need for search engine optimization. This is just a fancy way of saying that your article should be easy to find. As Google’s PageRank algorithm is one of the world’s most closely guarded secrets, any recipe for such search engine optimization is only an educated guess. However, in a recent paper, the authors argued that since longer titles and abstracts are more search engine friendly and have more ‘hooks’ for the computer, they tend to show up on the first page or so of a search query. Since most people don’t go beyond the first page of search results for their information needs, this creates a situation where writing long and stodgy prose is actually advantageous! This unintentional but perverse system of disincentives might dissuade one from sticking to simple titles.

To give an example, “Aggressive serpentine movement in a controlled aviation environment: A descriptive study” is more likely to be noticed and cited than ‘Snakes on a plane’! However, for regular readers of a journal, the click rates may be higher for the latter.

Assessment of catchiness – learning from dating apps

The essential question then is how to manage the tradeoff. Should the title be informative or interesting? How does one decide which article is interesting? I chanced upon a recent web app called Papr, written in R/Shiny by JT Leek. The idea is similar to dating apps like Tinder. The web app collects information on whether the title is exciting and correct by directly asking the users. A random selection of articles is used. You can even download your likes. By analyzing the characteristics of the likes, one can get an idea of what works and what doesn’t. Sadly this is for bioRxiv, not PubMed. (Making one for PubMed will be my personal project for Feb 2017.)

Some tips

Disclaimer: Your mileage might vary depending on a lot of factors.

Here are some tips and tricks you can use while titling your next article. I believe these are more important for case reports or small research studies. Titling is neither a science nor an art. It is actually a craft. Hence it can be learnt by anyone who is willing to learn.

  1. Remember that titles signal intent. It can be serious, engaging, challenging or even frivolous. Ask yourself if the title adequately captures your intent.
  2. Get away from your comfort zone – if you have been using a colon in all your published work, get rid of it. Think of a way to say the same thing without a colon. A question, a claim or an unexpected term can be used.
  3. Breach the boundaries of your discipline. Just for the heck of it, visit a hard science journal – mathematics, physics or chemistry. I promise that you won’t regret those two minutes.
  4. Impose a Twitter-like character limit on yourself. Don’t use more words than are absolutely necessary.
  5. Take the first title that comes to your mind. Promise yourself that you won’t use it. Actively think of alternatives. If the alternatives all seem to be bad choices, then(and only then) use the original title.
  6. Visit your favorite journal. Scan the page for titles you like. Pause and ponder – why did you like a particular title? What was so interesting about it?
  7. Identify the titles that didn’t interest you in the same journal. Attempt an improvement on those titles.
  8. Cut yourself some slack – there is sometimes no option but to fall back on the predictable and time-tested approaches. Bide your time – after all, in the larger scheme of things, titles are but a small piece of the puzzle. Your academic reputation is far more important and for that no amount of intellectual circus with titling will help. Focus on the science and the good things will follow. Acche din will come soon 😉
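
Tips 2 and 4 are easy to check mechanically. Here is a minimal sketch in R; `check_title` is a hypothetical helper of my own, and the 140 character default is only an assumption borrowed from Twitter's old limit, not a rule from any journal.

```r
# Hypothetical helper: report the length of a title and whether it uses a colon.
# The limit of 140 characters is an assumption, adjust it to taste.
check_title <- function(title, limit = 140) {
  n <- nchar(title)
  list(
    n_char       = n,                                 # number of characters
    within_limit = n <= limit,                        # tip 4: self-imposed limit
    has_colon    = grepl(":", title, fixed = TRUE)    # tip 2: the colon habit
  )
}

res <- check_title("Snakes on a plane")
res
```

Running it on a few candidate titles side by side makes it obvious which ones are carrying extra weight.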

Feel free to give your comments and suggestions below.

Additional reading

  1. Kpaka S, Krou-danho N, Lingani S, et al. Relation between online “hit counts” and subsequent citations: prospective study of research papers in the BMJ. 2004;318(September):3-4.

  2. Weinberger CJ, Evans JA, Allesina S, et al. Ten Simple (Empirical) Rules for Writing Science. PLOS Comput Biol. 2015;11(4):e1004205. doi:10.1371/journal.pcbi.1004205.

  3. Letchford A, Preis T, Moat HS. The advantage of simple paper abstracts. J Informetr. 2016;10(1):1-8. doi:10.1016/j.joi.2015.11.001.

Scraping wikipedia for medical presentations

Every medical presentation has a section on epidemiology. Instead of making a boring slide with bland numbers, one can come up with an interesting data viz. The problem, unfortunately, is that we don’t have the necessary local data (state or even national level) in a form we can visualize.

That’s where Wikipedia comes to the rescue. It is a rich source of readily available data on a variety of topics. Best of all, it’s free. In this tutorial we shall see how to leverage Wikipedia for data scraping. You might be wondering why anyone should use R (or any programming language) for doing this. Why not just copy paste the data into Excel and do the analysis/visualization there?

The advantages of R over other tools are

  1. Data management – if you want to merge data from several sources (as we shall see here) R is much stronger than any other GUI tool
  2. Documentation – the cleaning process is well documented and anyone can see how you arrived at your results
  3. Reproducibility
  4. Future use – the R code you write can be stored in a file and can be reused later. After all if many of your presentations have similar requirements it makes sense to have a template
  5. Robustness – the same code works even if the source data changes, as may occur with websites which update their data often

The workflow is summarized in the following diagram

process

As shown here, the visualization is an iterative process, regardless of the tool you use.

Example: You are called to make a presentation on obesity. You want to make a data viz on Indian obesity and see which of our states have the highest obesity rates.

Data acquisition

I googled ‘obesity India wikipedia’. The first hit takes me to this page – Obesity in India. The page has data on obesity rates by state, for both males and females. The data comes from the National Family Health Survey (NFHS 2007). Sadly there’s no data about Pondicherry 😦 .

This will be our primary data source. At this stage, we won’t get into how they defined obesity (although it is vitally important in a real life scenario).

I will be using a new package called ‘datapasta’ for webscraping. It allows you to simply copy paste data in RStudio and the code is written automatically. For well formatted HTML tables, it works well. As we shall see later, we would need a more complicated tool if the source table is not well formatted.

Data cleaning and visualization

#load the required libraries
library(tidyverse)  # for data manipulation, primarily dplyr
library(rvest) # for webscraping
library(scales) #for formatting percentages

# use datapasta - click the addins tab in RStudio and use paste as tribble after
# copying data from the wikipedia page given above
data2 <- tribble(
~States, ~Males...., ~Males.rank, ~Females...., ~Females.rank,
"India", 12.1, 14, 16, 15,
"Punjab", 30.3, 1, 37.5, 1,
"Kerala", 24.3, 2, 34, 2,
"Goa", 20.8, 3, 27, 3,
"Tamil Nadu", 19.8, 4, 24.4, 4,
"Andhra Pradesh", 17.6, 5, 22.7, 10,
"Sikkim", 17.3, 6, 21, 8,
"Mizoram", 16.9, 7, 20.3, 17,
"Himachal Pradesh", 16, 8, 19.5, 12,
"Maharashtra", 15.9, 9, 18.1, 13,
"Gujarat", 15.4, 10, 17.7, 7,
"Haryana", 14.4, 11, 17.6, 6,
"Karnataka", 14, 12, 17.3, 9,
"Manipur", 13.4, 13, 17.1, 11,
"Uttarakhand", 11.4, 15, 14.8, 14,
"Arunachal Pradesh", 10.6, 16, 12.5, 19,
"Uttar Pradesh", 9.9, 17, 12, 18,
"Jammu and Kashmir", 8.7, 18, 11.1, 5,
"Bihar", 8.5, 19, 10.5, 29,
"Nagaland", 8.4, 20, 10.2, 22,
"Rajasthan", 8.4, 20, 9, 20,
"Meghalaya", 8.2, 22, 8.9, 26,
"Odisha", 6.9, 23, 8.6, 25,
"Assam", 6.7, 24, 7.8, 21,
"Chhattisgarh", 6.5, 25, 7.6, 27,
"West Bengal", 6.1, 26, 7.1, 16,
"Madhya Pradesh", 5.4, 27, 6.7, 23,
"Jharkhand", 5.3, 28, 5.9, 28,
"Tripura", 5.2, 29, 5.3, 24,
"Delhi", 45.5, 36, 49.8, 64
)

#remove India row and rename Males and Females
data <- data2 %>% rename(Males=Males....,Females=Females....) %>%
filter(States !="India")

# draw the plot
plot <- data %>% arrange(Males) %>% ggplot(aes(x=reorder(States,Males,mean),y=Males,fill=Males>mean(Males)))+
geom_bar(stat="identity")+coord_flip()+theme_bw()+
xlab("")+
ylab("Percentage of Obese Males ") +guides(fill="none")+ # "none" hides the legend; fill=FALSE is deprecated in newer ggplot2
geom_text(aes(label=percent(Males/100)),nudge_y=2,color="black",size=3)
plot

The plot uses the black and white theme (a personal choice) and splits the states into those with a male obesity percentage above the average (mean) and those below it. The point is not the usefulness of such a classification. It is just to show how it is done. The coord_flip makes the bars horizontal. The resulting plot looks like this

baseplot
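
The above/below average split that drives the bar colours can be checked without drawing anything. This is a small sketch on four rows copied from the table above; the data frame name `obesity` and the column `above_mean` are my own labels for illustration.

```r
# Four rows taken from the obesity table above, enough to illustrate the split
obesity <- data.frame(
  States = c("Punjab", "Kerala", "Bihar", "Tripura"),
  Males  = c(30.3, 24.3, 8.5, 5.2)
)

# Same logical test that is mapped to `fill` in the plot:
# TRUE when a state is above the mean male obesity percentage
obesity$above_mean <- obesity$Males > mean(obesity$Males)

# Sort from highest to lowest, as the flipped bar chart does
obesity[order(-obesity$Males), ]
```

Because the mean is computed on whatever rows are present, the split changes if you filter states in or out, which is worth remembering when comparing two versions of the plot.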

Contrary to what you might expect, a choropleth map was harder to read. So I settled on a barplot. Tamil Nadu appears among the top obese states. Curiously, the states/regions at the top also appear to have higher development/wealth.

You wonder what would happen if you added the wealth dimension to this simple graph. Far too often people think of 3D plots when speaking about adding another dimension. What I mean is mapping the color of the bars to some indicator of wealth to show its effect. This does NOT mean a 3D plot. (For the love of God avoid making 3D bar graphs whenever possible)

Data acquisition (source 2)

Now you want to add the wealth dimension. I will choose the state’s GDP as a proxy for wealth. Once again this is debatable and probably there are better markers. The focus is on the process here, not the specific choices.

This time we will do the same: search Google for state GDP and head over to the Wikipedia page State wise GDP.

This time however, copy pasting in RStudio doesn’t work as intended. It produces a single column dataframe. The time has come to change tactics. We will use the rvest package to scrape data this time. This requires writing some more code and cleaning, but it works.

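# Store the url, download the page and collect every HTML table on it.
# Then extract the required table from that list.
gdpurl <- "https://en.wikipedia.org/wiki/List_of_Indian_states_and_union_territories_by_GDP"
temp <- gdpurl %>% 
 read_html() %>%   # read_html() replaces the deprecated html()
 html_nodes("table")
# Assumption: the table index was lost from the original post. Inspect temp and
# change [[1]] if the state-wise table is not the first one on the page.
# data.frame() turns a header such as "State/UT" into State.UT.
gdp <- data.frame(html_table(temp[[1]]))
gdp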
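
If you want to try the rvest steps without hitting Wikipedia, here is a minimal offline sketch. The HTML snippet and its numbers are made up for illustration; `minimal_html()` and `html_elements()` assume rvest 1.0 or later.

```r
library(rvest)

# A tiny, made-up page with one table (hypothetical values, not real GSDP data)
page <- minimal_html('
  <table>
    <tr><th>State/UT</th><th>GSDP</th></tr>
    <tr><td>State A</td><td>1.5 lakh crore</td></tr>
    <tr><td>State B</td><td>2.5 lakh crore</td></tr>
  </table>')

# Collect every table node, then parse the first one into a data frame
tables   <- html_elements(page, "table")
gdp_demo <- html_table(tables[[1]])
gdp_demo
```

On a real page there are usually several tables, so printing `length(tables)` and peeking at each element is the quickest way to find the index you need.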

The resulting data shows some weird symbols – <U+20B9>. This is actually the unicode representation of Indian rupee symbol. The column is also a character vector. We also need to get rid of the description in billions etc. We need just the numbers. This calls for a little bit of data cleaning.

Data cleaning

# remove non ASCII character that represents indian rupee symbol
gdp$GSDP <- iconv(gdp$GSDP,"latin1","ASCII",sub="")

# Separate out the lakh crore and other description. Uses tidyr package.
# remove the row that shows India. select only numbers
gdp <- gdp %>% separate(col=GSDP,into=c("value","useless"),sep=" ") %>% 
 filter(State.UT !="India") %>% 
 select(Rank,State.UT,value)
#coerce the character vector as numbers
gdp$value <- as.numeric(gdp$value)
# rename the state column so that it matches the obesity data
gdp <- gdp %>% rename(States=State.UT)

#attempt an inner join
fulldata <- inner_join(gdp,data)

In the first step we use the iconv function to weed out non-ASCII characters. Then we separate the value from the description with tidyr’s separate function. When you load the tidyverse package, tidyr is loaded automatically, so you don’t have to load it separately.
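
As an alternative sketch of the same cleaning step, a single regular expression can strip everything that is not a digit or a decimal point. This is a swapped-in technique rather than the iconv plus separate route used above, and the two strings are made-up values for illustration.

```r
# Made-up GSDP strings in the shape the scraped column has:
# rupee symbol (U+20B9), a number, then a description
raw_gsdp <- c("\u20b927.96 lakh crore", "\u20b915.96 lakh crore")

# Keep only digits and the decimal point, then coerce to numeric
value <- as.numeric(gsub("[^0-9.]", "", raw_gsdp))
value
```

The regex route is shorter, but it silently mangles anything with a second number in the description, so the step by step approach is easier to debug on messy tables.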

We then finally merge the data with the original data set. If you look closely, there are differences between the two datasets. The gdp dataset has data about union territories but not about Jammu and Kashmir. Since an inner join keeps only the rows present in both, we are left with 28 states/territories in the final dataset.
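
Before trusting an inner join, it helps to list what it will drop. A minimal sketch in base R follows; the two short vectors are stand-ins I made up for the `States` columns of the two datasets.

```r
# Stand-ins for the state names in each dataset (illustrative subset only)
obesity_states <- c("Punjab", "Kerala", "Jammu and Kashmir")
gdp_states     <- c("Punjab", "Kerala", "Puducherry")

# Names that would be lost from each side of an inner join
only_in_obesity <- setdiff(obesity_states, gdp_states)
only_in_gdp     <- setdiff(gdp_states, obesity_states)

# Names that survive the join
matched <- intersect(obesity_states, gdp_states)

only_in_obesity
only_in_gdp
matched
```

The same check also catches spelling mismatches such as "Odisha" versus "Orissa", which would otherwise vanish from the plot without any warning.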

Data visualization

Now we again do the visualization. This time we colour the bars by GDP, splitting the states into those above and those below the median GDP.

# redraw the plot in R and color the states as those above median GDP and those below
plot <- fulldata %>% arrange(Males) %>% ggplot(aes(x=reorder(States,Males,mean),y=Males,fill=value>median(value)))+
 geom_bar(stat="identity")+coord_flip()+theme_bw()+
 xlab("")+
 ylab("Percentage of Obese Males ")+
 geom_text(aes(label=percent(Males/100)),nudge_y=2,color="black",size=3)+
 labs(fill="Income above average")+
 guides(fill=guide_legend(reverse=TRUE))
plot

This produces the following plot

incomeplot

It looks like colouring by GDP has helped. Yet we see some states like Mizoram and Sikkim at the top, and states like West Bengal at the bottom! Something appears strange. Perhaps GDP isn’t a good proxy for income. Maybe the data source is incorrect. Maybe the definition of obesity is wrong. Maybe the data quality is not uniform across the states. We go back to the original question and start the data acquisition part again... And the loop continues.

If you don’t like the colors, there are plenty of ways to adjust them in R. Have a look at the ggthemes package. Arguably the same thing can be done much more easily in Stata (although you have to copy paste the data, so the process cannot be fully automated). Once this script is saved, you can use it for a future talk on, say, infertility or anything for that matter.