Project Overview

An exploratory data analysis of the drug review dataset, which provides patient reviews of specific drugs along with their related conditions and a 10-star patient rating reflecting overall patient satisfaction.

Source Code

Introduction

My goal for this project is to exercise my data analysis skills. I have chosen to work with the Drug Reviews dataset (Gräßer et al. 2018) after spending some time in the UCI Machine Learning Repository (Dua and Graff 2017).

I will focus on some illustrative data visualizations, as well as some word frequency analysis.

Preprocessing and Cleaning

Load and check the data

library(magrittr)

train <- readr::read_tsv(here::here("data-raw", "drugsComTrain_raw.tsv"))
## New names:
## Rows: 161297 Columns: 7
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "\t" chr
## (4): drugName, condition, review, date dbl (3): ...1, rating, usefulCount
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
test <- readr::read_tsv(here::here("data-raw", "drugsComTest_raw.tsv"))
## New names:
## Rows: 53766 Columns: 7
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "\t" chr
## (4): drugName, condition, review, date dbl (3): ...1, rating, usefulCount
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
full <- 
  dplyr::bind_rows(train, test)

# Check our data
tibble::glimpse(full)
## Rows: 215,063
## Columns: 7
## $ ...1        <dbl> 206461, 95260, 92703, 138000, 35696, 155963, 165907, 10265…
## $ drugName    <chr> "Valsartan", "Guanfacine", "Lybrel", "Ortho Evra", "Bupren…
## $ condition   <chr> "Left Ventricular Dysfunction", "ADHD", "Birth Control", "…
## $ review      <chr> "\"It has no side effect, I take it in combination of Byst…
## $ rating      <dbl> 9, 8, 5, 8, 9, 2, 1, 10, 1, 8, 9, 10, 4, 4, 3, 9, 9, 9, 10…
## $ date        <chr> "May 20, 2012", "April 27, 2010", "December 14, 2009", "No…
## $ usefulCount <dbl> 27, 192, 17, 10, 37, 43, 5, 32, 11, 1, 19, 54, 8, 1, 10, 2…
colnames(full)
## [1] "...1"        "drugName"    "condition"   "review"      "rating"     
## [6] "date"        "usefulCount"

Based on the output, we have 7 variables (columns) and 215,063 observations (rows). That’s a pretty lengthy dataset!

Our data dictionary is as follows:

Variable Name  Description
...1           Identifier
drugName       Name of the drug
condition      Name of the condition
review         Patient review text
rating         10-star patient rating
date           Date of the review entry
usefulCount    Number of users who found the review useful

Basic Cleaning

We do some basic cleaning.

The first column, ...1, looks to be an identifier, so we rename it to id, and then we do some minor normalization to enforce consistency. Additionally, we will use the {lubridate} package to break the date column down into a census month and a census year, to see if they give us any additional insight.

We will also remove any special characters in the drugName and condition columns and turn whitespace into underscores, while keeping the original text of the review for now.

full <-
  full %>%
  dplyr::rename(id = "...1")

full$drugName <- stringr::str_replace_all(full$drugName, "[^[:alnum:]]", "_")
full$condition <- stringr::str_replace_all(full$condition, "[^[:alnum:]]", "_")
full$drugName <- tolower(full$drugName)
full$condition <- tolower(full$condition)
full$date <- lubridate::mdy(full$date)
full$rating <- as.numeric(full$rating)
full$usefulCount <- as.numeric(full$usefulCount)

full <-
  full %>%
  dplyr::mutate(census_month = lubridate::month(date)) %>%
  dplyr::mutate(census_year = as.character(lubridate::year(date))) %>%
  dplyr::rename(full_date = date)

The dataset now looks like this:

full %>%
dplyr::slice_head(n = 5) %>%
tibble::glimpse()
## Rows: 5
## Columns: 9
## $ id           <dbl> 206461, 95260, 92703, 138000, 35696
## $ drugName     <chr> "valsartan", "guanfacine", "lybrel", "ortho_evra", "bupre…
## $ condition    <chr> "left_ventricular_dysfunction", "adhd", "birth_control", …
## $ review       <chr> "\"It has no side effect, I take it in combination of Bys…
## $ rating       <dbl> 9, 8, 5, 8, 9
## $ full_date    <date> 2012-05-20, 2010-04-27, 2009-12-14, 2015-11-03, 2016-11-2…
## $ usefulCount  <dbl> 27, 192, 17, 10, 37
## $ census_month <dbl> 5, 4, 12, 11, 11
## $ census_year  <chr> "2012", "2010", "2009", "2015", "2016"

Summary statistics

Doing some summarization, we can see count distributions for each column. This will also help us determine whether any columns contain NA values.

Drugs

drug_count <-
  full %>%
  dplyr::select(drugName) %>%
  dplyr::group_by(drugName) %>%
  dplyr::summarise(count = dplyr::n())

What are the top 10 most reviewed drugs?

drug_count %>%
dplyr::arrange(desc(count)) %>%
dplyr::slice_head(n = 10) %>%
knitr::kable()
drugName count
levonorgestrel 4930
etonogestrel 4421
ethinyl_estradiol___norethindrone 3753
nexplanon 2892
ethinyl_estradiol___norgestimate 2790
ethinyl_estradiol___levonorgestrel 2503
phentermine 2085
sertraline 1868
escitalopram 1747
mirena 1673

What are the top 10 least reviewed drugs?

drug_count %>%
dplyr::arrange(count) %>%
dplyr::slice_head(n = 10) %>%
knitr::kable()
drugName count
a___d_cracked_skin_relief 1
abacavir_lamivudine_zidovudine 1
absorbine_jr_ 1
abstral 1
accupril 1
acetaminophen_caffeine_pyrilamine 1
acetaminophen_chlorpheniramine_dextromethorphan___pseudoephedrine 1
acetaminophen_dextromethorphan_doxylamine 1
acetaminophen_dextromethorphan_guaifenesin___phenylephrine 1
acetaminophen_dextromethorphan_guaifenesin___pseudoephedrine 1

The top 10 least reviewed drugs are actually misleading, as there are 798 drugs that all have a count of 1.

Are there any NAs or NULLs for drugName?

drug_count %>%
dplyr::filter(is.na(drugName) | is.null(drugName)) %>%
dplyr::select(count) %>%
knitr::kable()
count

Patient Condition

condition_count <-
  full %>%
  dplyr::select(condition) %>%
  dplyr::group_by(condition) %>%
  dplyr::summarise(count = dplyr::n())

What are the top 10 patient conditions?

condition_count %>%
dplyr::arrange(desc(count)) %>%
dplyr::slice_head(n = 10) %>%
knitr::kable()
condition count
birth_control 38436
depression 12164
pain 8245
anxiety 7812
acne 7435
bipolar_disorde 5604
insomnia 4904
weight_loss 4857
obesity 4757
adhd 4509

What are the least common patient conditions?

condition_count %>%
dplyr::arrange(count) %>%
dplyr::slice_head(n = 35) %>%
knitr::kable()
condition count
100__span__users_found_this_comment_helpful_ 1
105__span__users_found_this_comment_helpful_ 1
110__span__users_found_this_comment_helpful_ 1
121__span__users_found_this_comment_helpful_ 1
123__span__users_found_this_comment_helpful_ 1
135__span__users_found_this_comment_helpful_ 1
145__span__users_found_this_comment_helpful_ 1
146__span__users_found_this_comment_helpful_ 1
26__span__users_found_this_comment_helpful_ 1
30__span__users_found_this_comment_helpful_ 1
37__span__users_found_this_comment_helpful_ 1
38__span__users_found_this_comment_helpful_ 1
40__span__users_found_this_comment_helpful_ 1
47__span__users_found_this_comment_helpful_ 1
48__span__users_found_this_comment_helpful_ 1
54__span__users_found_this_comment_helpful_ 1
61__span__users_found_this_comment_helpful_ 1
62__span__users_found_this_comment_helpful_ 1
63__span__users_found_this_comment_helpful_ 1
64__span__users_found_this_comment_helpful_ 1
70__span__users_found_this_comment_helpful_ 1
72__span__users_found_this_comment_helpful_ 1
76__span__users_found_this_comment_helpful_ 1
77__span__users_found_this_comment_helpful_ 1
79__span__users_found_this_comment_helpful_ 1
83__span__users_found_this_comment_helpful_ 1
84__span__users_found_this_comment_helpful_ 1
92__span__users_found_this_comment_helpful_ 1
94__span__users_found_this_comment_helpful_ 1
95__span__users_found_this_comment_helpful_ 1
98__span__users_found_this_comment_helpful_ 1
amyotrophic_lateral_sclerosis 1
anti_nmda_receptor_encephalitis 1
asystole 1
av_heart_block 1

I am unsure what to make of the *_span_users_found_this_comment_helpful_ entries, which look like scraping artifacts, so we will remove those later. Ignoring that weirdness, we can see that amyotrophic_lateral_sclerosis is the least mentioned condition, followed by anti_nmda_receptor_encephalitis.

Are there any NAs or NULLs for patient condition?

na_condition <-
  condition_count %>%
  dplyr::filter(is.na(condition) | is.null(condition)) %>%
  dplyr::select(count)

There appear to be 1,194 NA or NULL conditions. Since there are so many, I have decided to leave them in and simply convert each NA/NULL condition to an unknown_condition value.

full$condition <- tidyr::replace_na(full$condition, "unknown_condition")

# Check
full %>%
dplyr::filter(condition == "unknown_condition") %>%
dplyr::count() %>%
knitr::kable()
n
1194

Ratings

Since the ratings are 1-10, we will categorize them into 3 levels:

  • low_rating = 0-3
  • mid_rating = 4-7
  • high_rating = 8-10

First, we create our row levels, then we add a new drug_rating column.

rating_row_lvl <- 
  c("Low [0-3]", 
    "Mid [4-7]", 
    "High [8-10]", 
    "Unknown")

full_proc_rating <-
  full %>%
  dplyr::mutate(
    drug_rating = factor(
      dplyr::case_when(
        rating > 0 & rating <= 3 ~ rating_row_lvl[1],
        rating >= 4 & rating <= 7 ~ rating_row_lvl[2],
        rating >= 8 & rating <= 10 ~ rating_row_lvl[3],
        TRUE ~ rating_row_lvl[4]
      ), levels = rating_row_lvl))

Then, we utilize the {gtsummary} package to see the count and frequency distributions.

full_proc_rating %>%
dplyr::select(drug_rating) %>%
gtsummary::tbl_summary(label = list(drug_rating ~ "Drug Rating")) %>%
gtsummary::bold_labels() %>%
gtsummary::modify_footnote(update = dplyr::everything() ~ NA)
Characteristic N = 215,063
Drug Rating
    Low [0-3] 46,901 (22%)
    Mid [4-7] 38,403 (18%)
    High [8-10] 129,759 (60%)
    Unknown 0 (0%)

Additionally, we can plot it.

full_proc_rating %>%
ggplot2::ggplot(ggplot2::aes(x = drug_rating)) +
ggplot2::geom_bar(fill = "#0099f9") + 
ggplot2::theme_classic() +
ggplot2::scale_y_continuous(
  labels = scales::label_number(suffix = "K", scale = 1e-3)
) +
ggplot2::labs(
  title = "Patient rating distribution",
  x = "Drug Rating Categories",
  y = "Patient Count"
)

We see that the general consensus leans towards a high rating for each drug. Let’s drill down further to see which rating is the most common overall.

full %>%
dplyr::mutate(rating__char = as.character(rating)) %>%
dplyr::select(rating__char) %>%
gtsummary::tbl_summary(label = list(rating__char ~ "Drug Rating")) %>%
gtsummary::bold_labels() %>%
gtsummary::modify_footnote(update = dplyr::everything() ~ NA)
Characteristic N = 215,063
Drug Rating
    1 28,918 (13%)
    10 68,005 (32%)
    2 9,265 (4.3%)
    3 8,718 (4.1%)
    4 6,671 (3.1%)
    5 10,723 (5.0%)
    6 8,462 (3.9%)
    7 12,547 (5.8%)
    8 25,046 (12%)
    9 36,708 (17%)

Let’s plot our summaries.

full %>%
ggplot2::ggplot(ggplot2::aes(x = rating)) + 
ggplot2::geom_bar(fill = "#0099f9") +
ggplot2::theme_classic() +
ggplot2::labs(
  title = "Patient rating distribution",
  x = "Drug Ratings",
  y = "Patient Count"
)

rating__sums <-
  full %>%
  dplyr::mutate(rating__char = as.character(rating)) %>%
  dplyr::group_by(rating__char) %>%
  dplyr::summarise(count = dplyr::n())

max_tmp_cnt <- max(rating__sums$count)
max_tmp_rating <-
  rating__sums %>%
  dplyr::filter(count == max_tmp_cnt) %>%
  dplyr::select(rating__char)

We see that patients most often rated their drugs 10, with a maximum count of 68,005. This tells us that patients’ attitudes towards their drugs are largely positive.

Review Dates

Using the census_month and census_year, we should be able to see count distributions per month and per year. To make our analysis easier, we will summarize the census month and year separately.

First, we convert each month’s numerical representation to its abbreviated string representation, and then summarize with {gtsummary}.

review_census_month <-
  full %>%
  dplyr::mutate(
    census_month__abbr = factor(month.abb[census_month], levels = month.abb)
  )

review_census_month %>%
dplyr::select(census_month__abbr) %>%
gtsummary::tbl_summary(
  label = list(
    census_month__abbr ~ "Census Month of Review"
  )
) %>%
gtsummary::bold_labels() %>%
gtsummary::modify_footnote(update = dplyr::everything() ~ NA)
Characteristic N = 215,063
Census Month of Review
    Jan 18,347 (8.5%)
    Feb 16,133 (7.5%)
    Mar 18,675 (8.7%)
    Apr 18,224 (8.5%)
    May 17,673 (8.2%)
    Jun 16,939 (7.9%)
    Jul 18,504 (8.6%)
    Aug 19,491 (9.1%)
    Sep 17,880 (8.3%)
    Oct 18,636 (8.7%)
    Nov 18,430 (8.6%)
    Dec 16,131 (7.5%)
review_census_month %>%
ggplot2::ggplot(ggplot2::aes(x = census_month__abbr)) + 
ggplot2::geom_bar(fill = "#0099f9") + 
ggplot2::theme_classic() +
ggplot2::scale_y_continuous(
  labels = scales::label_number(suffix = "K", scale = 1e-3)
) +
ggplot2::labs(
  title = "Patient Review per Month",
  x = "Census Month",
  y = "Patient Count"
)

Based on the summary table, reviews look pretty evenly distributed across the months, although August has the most, with 19,491 reviews.

Next, we take a look at the census year.

full %>%
dplyr::select(census_year) %>%
gtsummary::tbl_summary(
  label = list(census_year ~ "Census Year of Review")
) %>%
gtsummary::bold_labels() %>%
gtsummary::modify_footnote(update = dplyr::everything() ~ NA)
Characteristic N = 215,063
Census Year of Review
    2008 6,785 (3.2%)
    2009 15,642 (7.3%)
    2010 11,227 (5.2%)
    2011 15,454 (7.2%)
    2012 13,382 (6.2%)
    2013 16,359 (7.6%)
    2014 16,104 (7.5%)
    2015 36,192 (17%)
    2016 46,607 (22%)
    2017 37,311 (17%)
full %>%
dplyr::select(census_year) %>%
ggplot2::ggplot(ggplot2::aes(x = census_year)) + 
ggplot2::theme_classic() +
ggplot2::scale_y_continuous(
  labels = scales::label_number(suffix = "K", scale = 1e-3)
) +
ggplot2::geom_bar(fill = "#0099f9") + 
ggplot2::labs(
  title = "Patient Review per Year",
  x = "Census Year",
  y = "Patient Count"
)

At first glance, it looks like a steady increase in the number of patient reviews. We especially see a jump from 2014 to 2015, with 2016 having the most reviews at 46,607.

Census year and month

Looking at the census year and month together, we can see how patient reviews are distributed across each month of each year.

Let’s first look at the trend of reviews.

census_month_year <-
  full %>%
  dplyr::select(census_month, census_year) %>%
  dplyr::mutate(
    census_month = factor(month.abb[census_month], levels = month.abb))

census_month_year %>%
dplyr::group_by(census_month, census_year) %>%
dplyr::summarise(rev_count = dplyr::n(), .groups = "keep") %>%
dplyr::arrange(census_year, census_month) %>%
dplyr::ungroup() %>%
dplyr::mutate(time = dplyr::row_number()) %>%
ggplot2::ggplot(ggplot2::aes(x = time, y = rev_count)) +
ggplot2::geom_point(shape = 21, color = "black", fill = "#69b3a2", size = 2) + 
ggplot2::geom_line(color = "black") +
ggplot2::theme_classic() +
ggplot2::scale_y_continuous(
  labels = scales::label_number(suffix = "K", scale = 1e-3)
) +
ggplot2::labs(
  title = "Patient Review Count over Time",
  x = "Time period (t)",
  y = "Review Count",
)

We see a couple of peaks and valleys. At around t=23 we see the first peak, but by t=25 we get the biggest drop in patient responses. The biggest climb then runs from roughly t=80 to t=90.

Let’s look at some summaries.

census_month_year %>%
gtsummary::tbl_summary(
  by = census_year,
  label = list(census_month ~ "Census Month")
) %>%
gtsummary::bold_labels() %>%
gtsummary::modify_footnote(update = dplyr::everything() ~ NA)
Characteristic 2008, N = 6,785 2009, N = 15,642 2010, N = 11,227 2011, N = 15,454 2012, N = 13,382 2013, N = 16,359 2014, N = 16,104 2015, N = 36,192 2016, N = 46,607 2017, N = 37,311
Census Month
    Jan 0 (0%) 731 (4.7%) 1,239 (11%) 1,226 (7.9%) 1,670 (12%) 1,489 (9.1%) 1,293 (8.0%) 1,949 (5.4%) 4,508 (9.7%) 4,242 (11%)
    Feb 142 (2.1%) 949 (6.1%) 839 (7.5%) 994 (6.4%) 1,154 (8.6%) 1,170 (7.2%) 1,350 (8.4%) 1,808 (5.0%) 4,021 (8.6%) 3,706 (9.9%)
    Mar 789 (12%) 1,174 (7.5%) 956 (8.5%) 1,120 (7.2%) 1,311 (9.8%) 1,300 (7.9%) 1,264 (7.8%) 2,570 (7.1%) 4,039 (8.7%) 4,152 (11%)
    Apr 716 (11%) 1,221 (7.8%) 944 (8.4%) 1,121 (7.3%) 1,341 (10%) 1,190 (7.3%) 1,226 (7.6%) 2,788 (7.7%) 3,910 (8.4%) 3,767 (10%)
    May 608 (9.0%) 1,164 (7.4%) 937 (8.3%) 1,121 (7.3%) 1,094 (8.2%) 1,349 (8.2%) 1,378 (8.6%) 2,809 (7.8%) 3,560 (7.6%) 3,653 (9.8%)
    Jun 648 (9.6%) 1,195 (7.6%) 853 (7.6%) 1,195 (7.7%) 974 (7.3%) 1,285 (7.9%) 1,294 (8.0%) 2,739 (7.6%) 3,676 (7.9%) 3,080 (8.3%)
    Jul 673 (9.9%) 1,405 (9.0%) 888 (7.9%) 1,234 (8.0%) 1,119 (8.4%) 1,620 (9.9%) 1,328 (8.2%) 3,366 (9.3%) 3,922 (8.4%) 2,949 (7.9%)
    Aug 664 (9.8%) 1,483 (9.5%) 1,052 (9.4%) 1,393 (9.0%) 1,082 (8.1%) 1,514 (9.3%) 1,220 (7.6%) 3,612 (10.0%) 4,326 (9.3%) 3,145 (8.4%)
    Sep 606 (8.9%) 1,620 (10%) 880 (7.8%) 1,446 (9.4%) 948 (7.1%) 1,302 (8.0%) 1,202 (7.5%) 3,537 (9.8%) 3,657 (7.8%) 2,682 (7.2%)
    Oct 663 (9.8%) 1,748 (11%) 912 (8.1%) 1,637 (11%) 914 (6.8%) 1,577 (9.6%) 1,356 (8.4%) 3,772 (10%) 3,577 (7.7%) 2,480 (6.6%)
    Nov 667 (9.8%) 1,645 (11%) 827 (7.4%) 1,578 (10%) 804 (6.0%) 1,554 (9.5%) 1,552 (9.6%) 3,491 (9.6%) 3,669 (7.9%) 2,643 (7.1%)
    Dec 609 (9.0%) 1,307 (8.4%) 900 (8.0%) 1,389 (9.0%) 971 (7.3%) 1,009 (6.2%) 1,641 (10%) 3,751 (10%) 3,742 (8.0%) 812 (2.2%)
census_month_year %>%
dplyr::group_by(census_month, census_year) %>%
dplyr::summarise(rev_count = dplyr::n(), .groups = "keep") %>%
ggplot2::ggplot(
  ggplot2::aes(x = census_year, y = census_month, fill = rev_count)
) +
ggplot2::geom_tile() +
ggplot2::theme_classic() +
ggplot2::scale_fill_gradient(low="white", high="blue") +
ggplot2::labs(
  title = "Patient Review per Month per Year",
  x = "Census Year",
  y = "Census Month",
  fill = "Review Count"
)

The heatmap provides a great visualization of when patient responses increase. From March 2015 through early 2017, we see strong response numbers. Then, from about April 2017, we see a decline all the way to November of that same year.

Key Takeaways

Excluding the usefulCount column (I would like to lump that together with the patient reviews), and setting aside the sentiment of the reviews, we can draw the following conclusions:

  • Levonorgestrel, a progestin, is the most reviewed drug, followed by Etonogestrel.
  • Most patients rated their drug highly, mostly 8, 9, or 10, with 32% of all ratings being a perfect 10/10.
  • Reviews are happening consistently every month.
  • However, we see a marked increase in reviews in 2015, 2016, and 2017, with 2016 being the year with the most reviews.

Patient reviews

Since there are so many observations, my laptop cannot handle tokenizing all of the reviews. So, I have decided to capture only the first 1,000 observations in each year and work with those.

full <-
  full %>%
  dplyr::group_by(census_year) %>%
  dplyr::slice_head(n = 1000) %>%
  dplyr::ungroup()

Let’s take a look at 5 sample patients’ review texts to see what we are dealing with.

sample_txt <-
  full %>%
  dplyr::select(id, review) %>%
  dplyr::slice_sample(n = 5)

Patient 1’s review reads: “Ritalin made me able to block out everything and focus on reading my boring, tedious nursing books. Take it at least 12 hours before you go to sleep, or else it will keep you up at night with thoughts racing. It suppresses appetite and makes most people lose weight- I actually gained weight because I would be starving and binge eat at the end of the night. I did like Ritalin, but it made me too irritable so I switched to Adderall. That is much more pleasant and even more effective. Focalin 10 mg is good for people with mild ADHD symptoms.”

Patient 2’s review reads: “Harvoni Rules !!!!! I was cured in 4 weeks of treatment !!!!!”

Patient 3’s review reads: “I was real scared about getting Implanon after reading about it. The process about 15 mins. They numb your arm and it does burn going in for about a min. Then they use another needle to insert. I felt the pressure of them putting it in. I turned my head the whole time so I didn't see anything. Afterwards I had to sit for a couple of minutes to make sure I wouldn't have an allergic reaction. A couple of days later, the small hole in my arm started to itch really bad because it was healing. The first month nothing happened. I had it inserted while on my period. The second month I was on period for a whole month but it wasn't a regular period. It wasn't as heavy but it was darker than usual. I called my doctor and she said that was normal. I recommend Implanon.”

Patient 4’s review reads: “This medicine (along with talk therapy) has helped me get my life back on track after being 98% disabled.”

Patient 5’s review reads: “After taking Ceftin (Cefuroxime) for 3 days, my throat swelled and my chest became more constricted. It felt as though I could not swallow well. I was dizzy, and anxious. I consulted the pharmacist who told me to take a Benadryl and if I didn't improve quickly, to get myself to the hospital. Within 5 minutes of taking the Benadryl I could feel my throat relaxing. After relating this situation to my daughter, she told me she had a similar experience with this drug when it was prescribed for her a year ago.”

From this, we get the following initial observations:

  • Capitalization of names.
  • A mix of numeric and character representations of numbers (e.g., 5 mg vs. two weeks).
  • Word contractions – it’s, didn’t, I’m, etc.
  • Special characters – dash (-), ellipsis (…), parentheses.
  • HTML character encodings that need handling.

Basic Preprocessing

For some basic string cleaning, we can do the following:

  • Remove trailing and leading whitespaces.
  • Remove any newlines (\n) or tabs (\t).
  • Normalize capitalizations.
  • Convert HTML character encodings to their proper representations using {textutils}, and enforce a consistent character encoding (UTF-8).
  • Convert ellipses to periods.

sample_txt_cleanup <- sample_txt[[2]][[3]]

cleaned_txt <-
  sample_txt_cleanup %>%
  textutils::HTMLdecode() %>%
  iconv("UTF-8", sub = "byte") %>%
  tolower() %>%
  stringr::str_replace_all("[^[:alnum:][.]['][!]]", " ") %>%
  stringr::str_trim() %>%
  stringr::str_squish() %>%
  stringr::str_replace_all("[.]{2,3}", ".") # convert ellipsis to a period

The following is an example of a cleaned text of a patient’s review:

Original: “I was real scared about getting Implanon after reading about it. The process about 15 mins. They numb your arm and it does burn going in for about a min. Then they use another needle to insert. I felt the pressure of them putting it in. I turned my head the whole time so I didn't see anything. Afterwards I had to sit for a couple of minutes to make sure I wouldn't have an allergic reaction. A couple of days later, the small hole in my arm started to itch really bad because it was healing. The first month nothing happened. I had it inserted while on my period. The second month I was on period for a whole month but it wasn't a regular period. It wasn't as heavy but it was darker than usual. I called my doctor and she said that was normal. I recommend Implanon.”

Cleaned: i was real scared about getting implanon after reading about it. the process about 15 mins. they numb your arm and it does burn going in for about a min. then they use another needle to insert. i felt the pressure of them putting it in. i turned my head the whole time so i didn’t see anything. afterwards i had to sit for a couple of minutes to make sure i wouldn’t have an allergic reaction. a couple of days later the small hole in my arm started to itch really bad because it was healing. the first month nothing happened. i had it inserted while on my period. the second month i was on period for a whole month but it wasn’t a regular period. it wasn’t as heavy but it was darker than usual. i called my doctor and she said that was normal. i recommend implanon.

cleaned_review <-
  full$review %>%
  textutils::HTMLdecode() %>%
  iconv("UTF-8", sub = "byte") %>%
  tolower() %>%
  stringr::str_replace_all("[^[:alnum:][.]['][!]]", " ") %>%
  stringr::str_trim() %>%
  stringr::str_squish() %>%
  stringr::str_replace_all("[.]{2,3}", ".") # convert ellipsis to a period

Advanced Preprocessing

For the rest of our preprocessing tasks, we will leverage the {quanteda} package. {quanteda} is an amazing suite of text analytics functions (Benoit et al. 2018).

Tokenization

Tokenization is the process of splitting long strings into individual tokens, usually words. For example, the sentence “Let’s go to N.Y.!” can be tokenized into the following tokens: {Let, 's, go, to, N.Y., !}

Why do we do it?

Computers don’t understand language, but they are really good at counting pieces of language. The most informative pieces are (often) words.
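As a quick illustration, {quanteda} tokenizes with quanteda::tokens(); this is a minimal sketch, and the exact token boundaries depend on quanteda’s default word tokenizer, which may split the example sentence slightly differently from the illustration above.

```r
library(quanteda)

# Tokenize a single example sentence; quanteda's default word tokenizer
# splits on whitespace and separates most punctuation into its own tokens.
toks <- quanteda::tokens("Let's go to N.Y.!")
as.character(toks)
```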

Stemming

Reducing words to their stem.

Why do we do it?

When we do not want to distinguish between different verb forms (walk, walk-ing, walk-ed, walk-s) or between singular and plural (cat, cat-s).
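For illustration, {quanteda} exposes stemming via quanteda::tokens_wordstem(), which applies the Porter (Snowball) stemmer; a minimal sketch:

```r
library(quanteda)

# Stem a handful of inflected forms; Porter stemming collapses
# walk/walking/walked/walks to "walk" and cats to "cat".
toks <- quanteda::tokens("walk walking walked walks cats")
quanteda::tokens_wordstem(toks, language = "english")
```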

Lemmatization

Lemmatization refers to doing things properly, using a vocabulary and a morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, known as the lemma.

For sentiment analysis, I think it would be more important to lemmatize words instead of stem them so that we do not lose the meaning of targeted words.

Lemmatization is more accurate but it requires data about language, and it also takes much more time.

original <- 
  cleaned_txt %>%
  quanteda::tokens()

# Lemmatized texts
lemmas <- 
  quanteda::tokens_replace(
    original, 
    pattern = lexicon::hash_lemmas$token, 
    replacement = lexicon::hash_lemmas$lemma
  ) %>%
  paste0()

Original: i, was, real, scared, about, getting, implanon, after, reading, about, it, ., the, process, about, 15, mins, ., they, numb, your, arm, and, it, does, burn, going, in, for, about, a, min, ., then, they, use, another, needle, to, insert, ., i, felt, the, pressure, of, them, putting, it, in, ., i, turned, my, head, the, whole, time, so, i, didn’t, see, anything, ., afterwards, i, had, to, sit, for, a, couple, of, minutes, to, make, sure, i, wouldn’t, have, an, allergic, reaction, ., a, couple, of, days, later, the, small, hole, in, my, arm, started, to, itch, really, bad, because, it, was, healing, ., the, first, month, nothing, happened, ., i, had, it, inserted, while, on, my, period, ., the, second, month, i, was, on, period, for, a, whole, month, but, it, wasn’t, a, regular, period, ., it, wasn’t, as, heavy, but, it, was, darker, than, usual, ., i, called, my, doctor, and, she, said, that, was, normal, ., i, recommend, implanon, .

Lemmatized: i, be, real, scare, about, get, implanon, after, read, about, it, ., the, process, about, 15, min, ., they, numb, your, arm, and, it, do, burn, go, in, for, about, a, min, ., then, they, use, another, needle, to, insert, ., i, feel, the, pressure, of, them, putting, it, in, ., i, turn, my, head, the, whole, time, so, i, didn’t, see, anything, ., afterwards, i, have, to, sit, for, a, couple, of, minute, to, make, sure, i, wouldn’t, have, a, allergic, reaction, ., a, couple, of, day, late, the, small, hole, in, my, arm, start, to, itch, really, bad, because, it, be, heal, ., the, first, month, nothing, happen, ., i, have, it, insert, while, on, my, period, ., the, 2, month, i, be, on, period, for, a, whole, month, but, it, wasn’t, a, regular, period, ., it, wasn’t, as, heavy, but, it, be, dark, than, usual, ., i, call, my, doctor, and, she, say, that, be, normal, ., i, recommend, implanon, .

Stopwords removal

This removes common function words (stopwords) from our tokens.

Why do we do it?

Some words are simply not interesting, but do occur often. For instance, words such as i, me, my, myself, is, was, ours, ourselves, you, your, etc.

stopwords_removed <-
  quanteda::tokens_remove(original, quanteda::stopwords("english"))

Original: i, was, real, scared, about, getting, implanon, after, reading, about, it, ., the, process, about, 15, mins, ., they, numb, your, arm, and, it, does, burn, going, in, for, about, a, min, ., then, they, use, another, needle, to, insert, ., i, felt, the, pressure, of, them, putting, it, in, ., i, turned, my, head, the, whole, time, so, i, didn’t, see, anything, ., afterwards, i, had, to, sit, for, a, couple, of, minutes, to, make, sure, i, wouldn’t, have, an, allergic, reaction, ., a, couple, of, days, later, the, small, hole, in, my, arm, started, to, itch, really, bad, because, it, was, healing, ., the, first, month, nothing, happened, ., i, had, it, inserted, while, on, my, period, ., the, second, month, i, was, on, period, for, a, whole, month, but, it, wasn’t, a, regular, period, ., it, wasn’t, as, heavy, but, it, was, darker, than, usual, ., i, called, my, doctor, and, she, said, that, was, normal, ., i, recommend, implanon, .

Stopwords removed: real, scared, getting, implanon, reading, ., process, 15, mins, ., numb, arm, burn, going, min, ., use, another, needle, insert, ., felt, pressure, putting, ., turned, head, whole, time, see, anything, ., afterwards, sit, couple, minutes, make, sure, allergic, reaction, ., couple, days, later, small, hole, arm, started, itch, really, bad, healing, ., first, month, nothing, happened, ., inserted, period, ., second, month, period, whole, month, regular, period, ., heavy, darker, usual, ., called, doctor, said, normal, ., recommend, implanon, .

Full processing application

We apply all concepts into our working dataset.

reviews <-
  cleaned_review %>%
  quanteda::tokens(remove_punct = TRUE) %>%
  quanteda::tokens_replace(
    pattern = lexicon::hash_lemmas$token,
    replacement = lexicon::hash_lemmas$lemma
  ) %>% 
  quanteda::tokens_remove(quanteda::stopwords("english"))

reviews_df <- 
  data.frame(
    id = seq_along(reviews),
    text = sapply(reviews, paste, collapse = " "),
    row.names = NULL
  )
saveRDS(full, here::here("data", "cleaned_review.rds"))
readr::write_csv(reviews_df, here::here("data", "review_tokens.csv"))

Corpus Statistics

For this section, we utilize {quanteda} to do some corpus statistics. We first have to build our corpus from the tokens we cleaned in the 03_patient_reviews.Rmd notebook.

tokens_csv <- readr::read_csv(here::here("data", "review_tokens.csv"))
## Rows: 10000 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): text
## dbl (1): id
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
corpus <- quanteda::corpus(tokens_csv, text_field = "text")

Document-term Matrix

Given our corpus, we are going to create a document-term matrix: simply a matrix describing the frequency of each term occurring in each of our texts.

{quanteda} uses the quanteda::dfm() function to generate a document-term matrix. DFM stands for Document-Feature Matrix, which is the same concept as a document-term matrix under different terminology.

dtm <- quanteda::dfm(quanteda::tokens(corpus))

dtm
## Document-feature matrix of: 10,000 documents, 12,978 features (99.74% sparse) and 1 docvar.
##        features
## docs    medication change life panic attack control barely able leave house
##   text1          1      3    1     1      1       1      1    1     1     1
##   text2          0      0    0     0      0       1      0    0     0     0
##   text3          0      0    0     2      2       0      0    0     0     0
##   text4          0      0    0     0      0       0      0    0     0     0
##   text5          0      0    0     0      0       0      0    0     0     0
##   text6          0      0    0     0      0       0      0    0     0     0
## [ reached max_ndoc ... 9,994 more documents, reached max_nfeat ... 12,968 more features ]

We see that we have 10,000 documents with 12,978 features. Pretty cool! The matrix is also reported as 99.74% sparse, meaning 99.74% of the cells in our document-term matrix contain the value 0. This makes sense: the corpus uses 12,978 unique words, and any single document contains only a small fraction of them.
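If we want these numbers programmatically rather than reading them off the printed matrix, {quanteda} exposes helpers for both (a quick sketch, assuming the dtm object from above):

```r
# Proportion of cells in the matrix that are zero
quanteda::sparsity(dtm)

# The ten most frequent features across all documents
quanteda::topfeatures(dtm, 10)
```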

Trim less frequent words

Next, it is also a good idea to remove the least frequent words, so that our corpus statistics focus on the more interesting ones.

For now, we'll remove the terms in our dtm that occur fewer than 10 times.

trimmed_dtm <- 
  dtm %>%
  quanteda::dfm_trim(min_termfreq = 10)

trimmed_dtm
## Document-feature matrix of: 10,000 documents, 2,874 features (98.90% sparse) and 1 docvar.
##        features
## docs    medication change life panic attack control barely able leave house
##   text1          1      3    1     1      1       1      1    1     1     1
##   text2          0      0    0     0      0       1      0    0     0     0
##   text3          0      0    0     2      2       0      0    0     0     0
##   text4          0      0    0     0      0       0      0    0     0     0
##   text5          0      0    0     0      0       0      0    0     0     0
##   text6          0      0    0     0      0       0      0    0     0     0
## [ reached max_ndoc ... 9,994 more documents, reached max_nfeat ... 2,864 more features ]

We now see that our feature count decreased from 12,978 to 2,874. Removing the least frequent words is usually worthwhile because they tend to be less informative than the most frequent ones, and for text analysis it also helps to keep the number of terms manageable.
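Note that quanteda::dfm_trim() is not limited to raw term counts; it can also trim by document frequency, or by quantile. A sketch of the alternatives (illustrative only, not applied to our data):

```r
# Keep only features that appear in at least 5 distinct documents
docfreq_dtm <- quanteda::dfm_trim(dtm, min_docfreq = 5)

# Keep only features above the 90th percentile of term frequency
quantile_dtm <- quanteda::dfm_trim(
  dtm,
  min_termfreq = 0.9,
  termfreq_type = "quantile"
)
```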

Analysis

For our analysis, we are going to make some word clouds. Word clouds are visual representations of text in which more frequent words are drawn larger. For this, we will use {quanteda}’s textplot_wordcloud(); to use it, we need to install {quanteda.textplots}.

trimmed_dtm %>%
  quanteda.textplots::textplot_wordcloud(max_words = 50, color = c("blue", "red"))

The red words are the most frequent, and the blue ones the least frequent. We can also look at term frequency as a data frame; for this, we install {quanteda.textstats}.

txt_freq <-
  trimmed_dtm %>%
  quanteda.textstats::textstat_frequency(n = 20)

So, we see the 20 most frequent words:

feature  frequency  rank  docfreq  group
take          7984     1     4926    all
day           5619     2     3542    all
get           4874     3     3285    all
good          4278     4     3324    all
year          4058     5     3087    all
feel          3994     6     2738    all
month         3906     7     2739    all
go            3845     8     2892    all
work          3843     9     3036    all
much          3776    10     2802    all
effect        3418    11     2795    all
now           3263    12     2638    all
week          3252    13     2375    all
side          3247    14     2736    all
start         3223    15     2459    all
time          2955    16     2334    all
pain          2943    17     1768    all
2             2684    18     2126    all
first         2491    19     2056    all
use           2468    20     1883    all

Plotting this, we can see the following:

txt_freq %>%
  ggplot2::ggplot(ggplot2::aes(x = reorder(feature, frequency), y = frequency)) +
  ggplot2::geom_point() +
  ggplot2::coord_flip() + 
  ggplot2::labs(x = NULL, y = "Frequency")

Conclusion

I think this is where we will finish this project. Based on our corpus statistics, out of the 1,000 reviews we selected per year, take is the most frequent word used. This makes sense because our corpus consists of drug reviews, and taking a drug is essentially the main thing you do with medicine. Even setting aside medicines administered by injection, such as vaccinations, it fits: these are reviews from patients describing drugs they administered themselves.

All in all, with more computing power I could run this analysis on the full data set. Perhaps in the future the computation, along with any storage I may need, could be done in the cloud.

Packages Used

Package             Version
dplyr               1.0.10
ggplot2             3.3.6
gtsummary           1.6.2
here                1.0.1
knitr               1.40
lexicon             1.2.1
lubridate           1.8.0
magrittr            2.0.3
quanteda            3.2.3
quanteda.textplots  0.94.2
quanteda.textstats  0.96
readr               2.1.3
renv                0.13.1
rmarkdown           2.17
scales              1.2.1
stringr             1.4.0
textutils           0.2-1
tibble              3.1.8
tidyr               1.2.1

References

Benoit, Kenneth, Kohei Watanabe, Haiyan Wang, Paul Nulty, Adam Obeng, Stefan Müller, and Akitaka Matsuo. 2018. “Quanteda: An R Package for the Quantitative Analysis of Textual Data.” Journal of Open Source Software 3 (30): 774. https://doi.org/10.21105/joss.00774.

Dua, Dheeru, and Casey Graff. 2017. “UCI Machine Learning Repository.” University of California, Irvine, School of Information; Computer Sciences. http://archive.ics.uci.edu/ml.

Gräßer, Felix, Surya Kallumadi, Hagen Malberg, and Sebastian Zaunseder. 2018. “Aspect-Based Sentiment Analysis of Drug Reviews Applying Cross-Domain and Cross-Data Learning.” In Proceedings of the 2018 International Conference on Digital Health, 121–25. DH ’18. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/3194658.3194677.