Project Overview

An exploratory data analysis of the drug review dataset, which provides patient reviews of specific drugs along with their related conditions and a 10-star patient rating reflecting overall patient satisfaction.

Source Code

Introduction

My goal for this project is to exercise my data analysis skills. I have chosen to work with the Drug Reviews dataset (Gräßer et al. 2018) after spending some time in the UCI Machine Learning Repository (Dua and Graff 2017).

I will focus on some illustrative data visualizations, as well as some word frequency analysis.

Preprocessing and Cleaning

Load and check the data

library(magrittr)

train <- readr::read_tsv(here::here("data-raw", "drugsComTrain_raw.tsv"))
## New names:
## Rows: 161297 Columns: 7
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "\t" chr
## (4): drugName, condition, review, date dbl (3): ...1, rating, usefulCount
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
test <- readr::read_tsv(here::here("data-raw", "drugsComTest_raw.tsv"))
## New names:
## Rows: 53766 Columns: 7
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "\t" chr
## (4): drugName, condition, review, date dbl (3): ...1, rating, usefulCount
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
full <- 
  dplyr::bind_rows(train, test)

# Check our data
tibble::glimpse(full)
## Rows: 215,063
## Columns: 7
## $ ...1        <dbl> 206461, 95260, 92703, 138000, 35696, 155963, 165907, 10265…
## $ drugName    <chr> "Valsartan", "Guanfacine", "Lybrel", "Ortho Evra", "Bupren…
## $ condition   <chr> "Left Ventricular Dysfunction", "ADHD", "Birth Control", "…
## $ review      <chr> "\"It has no side effect, I take it in combination of Byst…
## $ rating      <dbl> 9, 8, 5, 8, 9, 2, 1, 10, 1, 8, 9, 10, 4, 4, 3, 9, 9, 9, 10…
## $ date        <chr> "May 20, 2012", "April 27, 2010", "December 14, 2009", "No…
## $ usefulCount <dbl> 27, 192, 17, 10, 37, 43, 5, 32, 11, 1, 19, 54, 8, 1, 10, 2…
colnames(full)
## [1] "...1"        "drugName"    "condition"   "review"      "rating"     
## [6] "date"        "usefulCount"

Based on the output, we have 7 variables (columns) and 215,063 observations (rows). That’s a pretty lengthy dataset!

Our data dictionary is as follows:

Variable Name  Description
...1           Identifier
drugName       Name of the drug
condition      Name of the condition
review         Patient review text
rating         10-star patient rating
date           Date of the review entry
usefulCount    Number of users who found the review useful

Basic Cleaning

We do some basic cleaning.

The first column, ...1, looks to be an identifier, so we rename it to id, and then we do some minor normalization to enforce consistency. Additionally, we will use the {lubridate} package to break the date column down into a census month and a census year, to see if they give us any additional insight.

We will also remove any special characters in the drugName and condition columns and turn whitespace into underscores, while keeping the original text of the review for now.

full <-
  full %>%
  dplyr::rename(id = "...1")

full$drugName <- stringr::str_replace_all(full$drugName, "[^[:alnum:]]", "_")
full$condition <- stringr::str_replace_all(full$condition, "[^[:alnum:]]", "_")
full$drugName <- tolower(full$drugName)
full$condition <- tolower(full$condition)
full$date <- lubridate::mdy(full$date)
full$rating <- as.numeric(full$rating)
full$usefulCount <- as.numeric(full$usefulCount)

full <-
  full %>%
  dplyr::mutate(census_month = lubridate::month(date)) %>%
  dplyr::mutate(census_year = as.character(lubridate::year(date))) %>%
  dplyr::rename(full_date = date)

The dataset now looks like this:

full %>%
dplyr::slice_head(n = 5) %>%
tibble::glimpse()
## Rows: 5
## Columns: 9
## $ id           <dbl> 206461, 95260, 92703, 138000, 35696
## $ drugName     <chr> "valsartan", "guanfacine", "lybrel", "ortho_evra", "bupre…
## $ condition    <chr> "left_ventricular_dysfunction", "adhd", "birth_control", …
## $ review       <chr> "\"It has no side effect, I take it in combination of Bys…
## $ rating       <dbl> 9, 8, 5, 8, 9
## $ full_date    <date> 2012-05-20, 2010-04-27, 2009-12-14, 2015-11-03, 2016-11-2…
## $ usefulCount  <dbl> 27, 192, 17, 10, 37
## $ census_month <dbl> 5, 4, 12, 11, 11
## $ census_year  <chr> "2012", "2010", "2009", "2015", "2016"

Summary statistics

Doing some summarization, we can see count distributions for each column. This will also help us determine whether any columns contain NA values.

Drugs

drug_count <-
  full %>%
  dplyr::select(drugName) %>%
  dplyr::group_by(drugName) %>%
  dplyr::summarise(count = dplyr::n())

What are the top 10 most reviewed drugs?

drug_count %>%
dplyr::arrange(desc(count)) %>%
dplyr::slice_head(n = 10) %>%
knitr::kable()
drugName count
levonorgestrel 4930
etonogestrel 4421
ethinyl_estradiol___norethindrone 3753
nexplanon 2892
ethinyl_estradiol___norgestimate 2790
ethinyl_estradiol___levonorgestrel 2503
phentermine 2085
sertraline 1868
escitalopram 1747
mirena 1673

What are the top 10 least reviewed drugs?

drug_count %>%
dplyr::arrange(count) %>%
dplyr::slice_head(n = 10) %>%
knitr::kable()
drugName count
a___d_cracked_skin_relief 1
abacavir_lamivudine_zidovudine 1
absorbine_jr_ 1
abstral 1
accupril 1
acetaminophen_caffeine_pyrilamine 1
acetaminophen_chlorpheniramine_dextromethorphan___pseudoephedrine 1
acetaminophen_dextromethorphan_doxylamine 1
acetaminophen_dextromethorphan_guaifenesin___phenylephrine 1
acetaminophen_dextromethorphan_guaifenesin___pseudoephedrine 1

The top 10 least reviewed drugs are actually misleading, as there are 798 drugs that all have a count of 1.

Are there any NAs or NULLs for drugName?

drug_count %>%
dplyr::filter(is.na(drugName) | is.null(drugName)) %>%
dplyr::select(count) %>%
knitr::kable()
count

Patient Condition

condition_count <-
  full %>%
  dplyr::select(condition) %>%
  dplyr::group_by(condition) %>%
  dplyr::summarise(count = dplyr::n())

What are the top 10 patient conditions?

condition_count %>%
dplyr::arrange(desc(count)) %>%
dplyr::slice_head(n = 10) %>%
knitr::kable()
condition count
birth_control 38436
depression 12164
pain 8245
anxiety 7812
acne 7435
bipolar_disorde 5604
insomnia 4904
weight_loss 4857
obesity 4757
adhd 4509

What are the least common patient conditions?

condition_count %>%
dplyr::arrange(count) %>%
dplyr::slice_head(n = 35) %>%
knitr::kable()
condition count
100__span__users_found_this_comment_helpful_ 1
105__span__users_found_this_comment_helpful_ 1
110__span__users_found_this_comment_helpful_ 1
121__span__users_found_this_comment_helpful_ 1
123__span__users_found_this_comment_helpful_ 1
135__span__users_found_this_comment_helpful_ 1
145__span__users_found_this_comment_helpful_ 1
146__span__users_found_this_comment_helpful_ 1
26__span__users_found_this_comment_helpful_ 1
30__span__users_found_this_comment_helpful_ 1
37__span__users_found_this_comment_helpful_ 1
38__span__users_found_this_comment_helpful_ 1
40__span__users_found_this_comment_helpful_ 1
47__span__users_found_this_comment_helpful_ 1
48__span__users_found_this_comment_helpful_ 1
54__span__users_found_this_comment_helpful_ 1
61__span__users_found_this_comment_helpful_ 1
62__span__users_found_this_comment_helpful_ 1
63__span__users_found_this_comment_helpful_ 1
64__span__users_found_this_comment_helpful_ 1
70__span__users_found_this_comment_helpful_ 1
72__span__users_found_this_comment_helpful_ 1
76__span__users_found_this_comment_helpful_ 1
77__span__users_found_this_comment_helpful_ 1
79__span__users_found_this_comment_helpful_ 1
83__span__users_found_this_comment_helpful_ 1
84__span__users_found_this_comment_helpful_ 1
92__span__users_found_this_comment_helpful_ 1
94__span__users_found_this_comment_helpful_ 1
95__span__users_found_this_comment_helpful_ 1
98__span__users_found_this_comment_helpful_ 1
amyotrophic_lateral_sclerosis 1
anti_nmda_receptor_encephalitis 1
asystole 1
av_heart_block 1

I am unsure what to make of the *_span_users_found_this_comment_helpful_ entries, which look like scraping artifacts, so we will remove those later. Ignoring that weirdness, we can see that amyotrophic_lateral_sclerosis is the least mentioned condition, followed by anti_nmda_receptor_encephalitis.

Are there any NAs or NULLs for patient condition?

na_condition <-
  condition_count %>%
  dplyr::filter(is.na(condition) | is.null(condition)) %>%
  dplyr::select(count)

There appear to be 1,194 NA or NULL conditions. Since there are so many, I have decided to leave them in and simply convert each NA/NULL condition to an unknown_condition value.

full$condition <- tidyr::replace_na(full$condition, "unknown_condition")

# Check
full %>%
dplyr::filter(condition == "unknown_condition") %>%
dplyr::count() %>%
knitr::kable()
n
1194

Ratings

Since the ratings are 1-10, we will categorize them into 3 levels:

  • low_rating = 0-3
  • mid_rating = 4-7
  • high_rating = 8-10

First, we create our row levels, then we add a new drug_rating column.

rating_row_lvl <- 
  c("Low [0-3]", 
    "Mid [4-7]", 
    "High [8-10]", 
    "Unknown")

full_proc_rating <-
  full %>%
  dplyr::mutate(
    drug_rating = factor(
      dplyr::case_when(
        rating > 0 & rating <= 3 ~ rating_row_lvl[1],
        rating >= 4 & rating <= 7 ~ rating_row_lvl[2],
        rating >= 8 & rating <= 10 ~ rating_row_lvl[3],
        TRUE ~ rating_row_lvl[4]
      ), levels = rating_row_lvl))

Then, we utilize the {gtsummary} package to see the count and frequency distributions.

full_proc_rating %>%
dplyr::select(drug_rating) %>%
gtsummary::tbl_summary(label = list(drug_rating ~ "Drug Rating")) %>%
gtsummary::bold_labels() %>%
gtsummary::modify_footnote(update = dplyr::everything() ~ NA)
Characteristic N = 215,063
Drug Rating
    Low [0-3] 46,901 (22%)
    Mid [4-7] 38,403 (18%)
    High [8-10] 129,759 (60%)
    Unknown 0 (0%)

Additionally, we can plot it.

full_proc_rating %>%
ggplot2::ggplot(ggplot2::aes(x = drug_rating)) +
ggplot2::geom_bar(fill = "#0099f9") + 
ggplot2::theme_classic() +
ggplot2::scale_y_continuous(
  labels = scales::label_number(suffix = "K", scale = 1e-3)
) +
ggplot2::labs(
  title = "Patient rating distribution",
  x = "Drug Rating Categories",
  y = "Patient Count"
)

We see that the general consensus leans towards a high rating for each drug. Let’s drill down further to see which rating is the most common overall.

full %>%
dplyr::mutate(rating__char = as.character(rating)) %>%
dplyr::select(rating__char) %>%
gtsummary::tbl_summary(label = list(rating__char ~ "Drug Rating")) %>%
gtsummary::bold_labels() %>%
gtsummary::modify_footnote(update = dplyr::everything() ~ NA)
Characteristic N = 215,063
Drug Rating
    1 28,918 (13%)
    10 68,005 (32%)
    2 9,265 (4.3%)
    3 8,718 (4.1%)
    4 6,671 (3.1%)
    5 10,723 (5.0%)
    6 8,462 (3.9%)
    7 12,547 (5.8%)
    8 25,046 (12%)
    9 36,708 (17%)

Let’s plot our summaries.

full %>%
ggplot2::ggplot(ggplot2::aes(x = rating)) + 
ggplot2::geom_bar(fill = "#0099f9") +
ggplot2::theme_classic() +
ggplot2::labs(
  title = "Patient rating distribution",
  x = "Drug Ratings",
  y = "Patient Count"
)

rating__sums <-
  full %>%
  dplyr::mutate(rating__char = as.character(rating)) %>%
  dplyr::group_by(rating__char) %>%
  dplyr::summarise(count = dplyr::n())

max_tmp_cnt <- max(rating__sums$count)
max_tmp_rating <-
  rating__sums %>%
  dplyr::filter(count == max_tmp_cnt) %>%
  dplyr::select(rating__char)

We see that patients most often rated their drugs 10, with a maximum count of 68,005. This tells us that patients’ attitudes towards their drugs are largely positive.

Review Dates

Using the census_month and census_year, we should be able to see count distributions per month and per year. To make our analysis easier, we will summarize the census month and year separately.

First, we convert each month’s numerical representation to its abbreviated string representation, and then summarize with {gtsummary}.

review_census_month <-
  full %>%
  dplyr::mutate(
    census_month__abbr = factor(month.abb[census_month], levels = month.abb)
  )

review_census_month %>%
dplyr::select(census_month__abbr) %>%
gtsummary::tbl_summary(
  label = list(
    census_month__abbr ~ "Census Month of Review"
  )
) %>%
gtsummary::bold_labels() %>%
gtsummary::modify_footnote(update = dplyr::everything() ~ NA)
Characteristic N = 215,063
Census Month of Review
    Jan 18,347 (8.5%)
    Feb 16,133 (7.5%)
    Mar 18,675 (8.7%)
    Apr 18,224 (8.5%)
    May 17,673 (8.2%)
    Jun 16,939 (7.9%)
    Jul 18,504 (8.6%)
    Aug 19,491 (9.1%)
    Sep 17,880 (8.3%)
    Oct 18,636 (8.7%)
    Nov 18,430 (8.6%)
    Dec 16,131 (7.5%)
review_census_month %>%
ggplot2::ggplot(ggplot2::aes(x = census_month__abbr)) + 
ggplot2::geom_bar(fill = "#0099f9") + 
ggplot2::theme_classic() +
ggplot2::scale_y_continuous(
  labels = scales::label_number(suffix = "K", scale = 1e-3)
) +
ggplot2::labs(
  title = "Patient Review per Month",
  x = "Census Month",
  y = "Patient Count"
)

Based on the summary table, reviews look pretty evenly distributed across the months, although August has the most, with 19,491 reviews.

Next, we take a look at the census year.

full %>%
dplyr::select(census_year) %>%
gtsummary::tbl_summary(
  label = list(census_year ~ "Census Year of Review")
) %>%
gtsummary::bold_labels() %>%
gtsummary::modify_footnote(update = dplyr::everything() ~ NA)
Characteristic N = 215,063
Census Year of Review
    2008 6,785 (3.2%)
    2009 15,642 (7.3%)
    2010 11,227 (5.2%)
    2011 15,454 (7.2%)
    2012 13,382 (6.2%)
    2013 16,359 (7.6%)
    2014 16,104 (7.5%)
    2015 36,192 (17%)
    2016 46,607 (22%)
    2017 37,311 (17%)
full %>%
dplyr::select(census_year) %>%
ggplot2::ggplot(ggplot2::aes(x = census_year)) + 
ggplot2::theme_classic() +
ggplot2::scale_y_continuous(
  labels = scales::label_number(suffix = "K", scale = 1e-3)
) +
ggplot2::geom_bar(fill = "#0099f9") + 
ggplot2::labs(
  title = "Patient Review per Year",
  x = "Census Year",
  y = "Patient Count"
)

At first glance, it looks like a steady increase in the number of patient reviews. We especially see a jump from 2014 to 2015, with 2016 having the most reviews at 46,607.

Census year and month

Looking at the census year and month together, we can see how patient reviews are distributed across each month of each year.

Let’s first look at the trend of reviews.

census_month_year <-
  full %>%
  dplyr::select(census_month, census_year) %>%
  dplyr::mutate(
    census_month = factor(month.abb[census_month], levels = month.abb))

census_month_year %>%
dplyr::group_by(census_month, census_year) %>%
dplyr::summarise(rev_count = dplyr::n(), .groups = "keep") %>%
dplyr::arrange(census_year, census_month) %>%
dplyr::ungroup() %>%
dplyr::mutate(time = dplyr::row_number()) %>%
ggplot2::ggplot(ggplot2::aes(x = time, y = rev_count)) +
ggplot2::geom_point(shape = 21, color = "black", fill = "#69b3a2", size = 2) + 
ggplot2::geom_line(color = "black") +
ggplot2::theme_classic() +
ggplot2::scale_y_continuous(
  labels = scales::label_number(suffix = "K", scale = 1e-3)
) +
ggplot2::labs(
  title = "Patient Review Count over Time",
  x = "Time period (t)",
  y = "Review Count",
)

We see a couple of peaks and valleys. At around t=23 we see the first peak, but by t=25 we get the biggest drop in patient responses. The biggest climb then runs from roughly t=80 to t=90.

Let’s look at some summaries.

census_month_year %>%
gtsummary::tbl_summary(
  by = census_year,
  label = list(census_month ~ "Census Month")
) %>%
gtsummary::bold_labels() %>%
gtsummary::modify_footnote(update = dplyr::everything() ~ NA)
Characteristic 2008, N = 6,785 2009, N = 15,642 2010, N = 11,227 2011, N = 15,454 2012, N = 13,382 2013, N = 16,359 2014, N = 16,104 2015, N = 36,192 2016, N = 46,607 2017, N = 37,311
Census Month
    Jan 0 (0%) 731 (4.7%) 1,239 (11%) 1,226 (7.9%) 1,670 (12%) 1,489 (9.1%) 1,293 (8.0%) 1,949 (5.4%) 4,508 (9.7%) 4,242 (11%)
    Feb 142 (2.1%) 949 (6.1%) 839 (7.5%) 994 (6.4%) 1,154 (8.6%) 1,170 (7.2%) 1,350 (8.4%) 1,808 (5.0%) 4,021 (8.6%) 3,706 (9.9%)
    Mar 789 (12%) 1,174 (7.5%) 956 (8.5%) 1,120 (7.2%) 1,311 (9.8%) 1,300 (7.9%) 1,264 (7.8%) 2,570 (7.1%) 4,039 (8.7%) 4,152 (11%)
    Apr 716 (11%) 1,221 (7.8%) 944 (8.4%) 1,121 (7.3%) 1,341 (10%) 1,190 (7.3%) 1,226 (7.6%) 2,788 (7.7%) 3,910 (8.4%) 3,767 (10%)
    May 608 (9.0%) 1,164 (7.4%) 937 (8.3%) 1,121 (7.3%) 1,094 (8.2%) 1,349 (8.2%) 1,378 (8.6%) 2,809 (7.8%) 3,560 (7.6%) 3,653 (9.8%)
    Jun 648 (9.6%) 1,195 (7.6%) 853 (7.6%) 1,195 (7.7%) 974 (7.3%) 1,285 (7.9%) 1,294 (8.0%) 2,739 (7.6%) 3,676 (7.9%) 3,080 (8.3%)
    Jul 673 (9.9%) 1,405 (9.0%) 888 (7.9%) 1,234 (8.0%) 1,119 (8.4%) 1,620 (9.9%) 1,328 (8.2%) 3,366 (9.3%) 3,922 (8.4%) 2,949 (7.9%)
    Aug 664 (9.8%) 1,483 (9.5%) 1,052 (9.4%) 1,393 (9.0%) 1,082 (8.1%) 1,514 (9.3%) 1,220 (7.6%) 3,612 (10.0%) 4,326 (9.3%) 3,145 (8.4%)
    Sep 606 (8.9%) 1,620 (10%) 880 (7.8%) 1,446 (9.4%) 948 (7.1%) 1,302 (8.0%) 1,202 (7.5%) 3,537 (9.8%) 3,657 (7.8%) 2,682 (7.2%)
    Oct 663 (9.8%) 1,748 (11%) 912 (8.1%) 1,637 (11%) 914 (6.8%) 1,577 (9.6%) 1,356 (8.4%) 3,772 (10%) 3,577 (7.7%) 2,480 (6.6%)
    Nov 667 (9.8%) 1,645 (11%) 827 (7.4%) 1,578 (10%) 804 (6.0%) 1,554 (9.5%) 1,552 (9.6%) 3,491 (9.6%) 3,669 (7.9%) 2,643 (7.1%)
    Dec 609 (9.0%) 1,307 (8.4%) 900 (8.0%) 1,389 (9.0%) 971 (7.3%) 1,009 (6.2%) 1,641 (10%) 3,751 (10%) 3,742 (8.0%) 812 (2.2%)
census_month_year %>%
dplyr::group_by(census_month, census_year) %>%
dplyr::summarise(rev_count = dplyr::n(), .groups = "keep") %>%
ggplot2::ggplot(
  ggplot2::aes(x = census_year, y = census_month, fill = rev_count)
) +
ggplot2::geom_tile() +
ggplot2::theme_classic() +
ggplot2::scale_fill_gradient(low="white", high="blue") +
ggplot2::labs(
  title = "Patient Review per Month per Year",
  x = "Census Year",
  y = "Census Month",
  fill = "Review Count"
)

The heatmap provides a great visualization of when patient responses increase. From March 2015 through early 2017, we see strong response numbers. Then, from about April 2017, we see a decline all the way to November of that same year.

Key Takeaways

Excluding the usefulCount column (I would like to lump that together with the patient reviews), and setting aside the sentiment of the reviews, we can draw the following conclusions:

  • Levonorgestrel, a progestin, is the most reviewed drug, followed by Etonogestrel.
  • Most patients rated their drug highly, mostly 8, 9, or 10, with 32% of all ratings being a perfect 10/10.
  • Reviews are happening consistently every month.
  • However, we see a marked increase in reviews in 2015, 2016, and 2017, with 2016 being the year with the most reviews.

Patient reviews

Since there are so many observations, my laptop cannot handle tokenizing all of the reviews. So, I have decided to capture only the first 1,000 observations in each year and work with those.

full <-
  full %>%
  dplyr::group_by(census_year) %>%
  dplyr::slice_head(n = 1000) %>%
  dplyr::ungroup()

Let’s take a look at 5 sample patients’ review texts to see what we are dealing with.

sample_txt <-
  full %>%
  dplyr::select(id, review) %>%
  dplyr::slice_sample(n = 5)

Patient 1’s review reads: “Ritalin made me able to block out everything and focus on reading my boring, tedious nursing books. Take it at least 12 hours before you go to sleep, or else it will keep you up at night with thoughts racing. It suppresses appetite and makes most people lose weight- I actually gained weight because I would be starving and binge eat at the end of the night. I did like Ritalin, but it made me too irritable so I switched to Adderall. That is much more pleasant and even more effective. Focalin 10 mg is good for people with mild ADHD symptoms.”

Patient 2’s review reads: “Harvoni Rules !!!!! I was cured in 4 weeks of treatment !!!!!”

Patient 3’s review reads: “I was real scared about getting Implanon after reading about it. The process about 15 mins. They numb your arm and it does burn going in for about a min. Then they use another needle to insert. I felt the pressure of them putting it in. I turned my head the whole time so I didn't see anything. Afterwards I had to sit for a couple of minutes to make sure I wouldn't have an allergic reaction. A couple of days later, the small hole in my arm started to itch really bad because it was healing. The first month nothing happened. I had it inserted while on my period. The second month I was on period for a whole month but it wasn't a regular period. It wasn't as heavy but it was darker than usual. I called my doctor and she said that was normal. I recommend Implanon.”

Patient 4’s review reads: “This medicine (along with talk therapy) has helped me get my life back on track after being 98% disabled.”

Patient 5’s review reads: “After taking Ceftin (Cefuroxime) for 3 days, my throat swelled and my chest became more constricted. It felt as though I could not swallow well. I was dizzy, and anxious. I consulted the pharmacist who told me to take a Benadryl and if I didn't improve quickly, to get myself to the hospital. Within 5 minutes of taking the Benadryl I could feel my throat relaxing. After relating this situation to my daughter, she told me she had a similar experience with this drug when it was prescribed for her a year ago.”

From this, we get the following initial observations:

  • Capitalization of names.
  • A mix of numeric and character representations of numbers (e.g., 5 mg vs. two weeks).
  • Word contractions – it’s, didn’t, I’m, etc.
  • Special characters – dash (-), ellipsis (…), parentheses.
  • HTML character encodings that need handling.

Basic Preprocessing

For some basic string cleaning, we can do the following:

  • Remove trailing and leading whitespaces.
  • Remove any newlines (\n) or tabs (\t).
  • Normalize capitalizations.
  • Convert HTML character encodings to their proper representations using {textutils}, and enforce a consistent character encoding (UTF-8).
  • Convert ellipses to periods.

sample_txt_cleanup <- sample_txt[[2]][[3]]

cleaned_txt <-
  sample_txt_cleanup %>%
  textutils::HTMLdecode() %>%
  iconv("UTF-8", sub = "byte") %>%
  tolower() %>%
  stringr::str_replace_all("[^[:alnum:][.]['][!]]", " ") %>%
  stringr::str_trim() %>%
  stringr::str_squish() %>%
  stringr::str_replace_all("[.]{2,3}", ".") # convert ellipsis to a period

The following is an example of a cleaned text of a patient’s review:

Original: “I was real scared about getting Implanon after reading about it. The process about 15 mins. They numb your arm and it does burn going in for about a min. Then they use another needle to insert. I felt the pressure of them putting it in. I turned my head the whole time so I didn't see anything. Afterwards I had to sit for a couple of minutes to make sure I wouldn't have an allergic reaction. A couple of days later, the small hole in my arm started to itch really bad because it was healing. The first month nothing happened. I had it inserted while on my period. The second month I was on period for a whole month but it wasn't a regular period. It wasn't as heavy but it was darker than usual. I called my doctor and she said that was normal. I recommend Implanon.”

Cleaned: i was real scared about getting implanon after reading about it. the process about 15 mins. they numb your arm and it does burn going in for about a min. then they use another needle to insert. i felt the pressure of them putting it in. i turned my head the whole time so i didn’t see anything. afterwards i had to sit for a couple of minutes to make sure i wouldn’t have an allergic reaction. a couple of days later the small hole in my arm started to itch really bad because it was healing. the first month nothing happened. i had it inserted while on my period. the second month i was on period for a whole month but it wasn’t a regular period. it wasn’t as heavy but it was darker than usual. i called my doctor and she said that was normal. i recommend implanon.

cleaned_review <-
  full$review %>%
  textutils::HTMLdecode() %>%
  iconv("UTF-8", sub = "byte") %>%
  tolower() %>%
  stringr::str_replace_all("[^[:alnum:][.]['][!]]", " ") %>%
  stringr::str_trim() %>%
  stringr::str_squish() %>%
  stringr::str_replace_all("[.]{2,3}", ".") # convert ellipsis to a period

Advanced Preprocessing

For the rest of our preprocessing tasks, we will leverage the {quanteda} package. {quanteda} is an amazing suite of text analytics functions (Benoit et al. 2018).

Tokenization

Tokenization is the process of splitting long strings into individual tokens, usually words. For example, the sentence “Let’s go to N.Y.!” can be tokenized into the following tokens: {Let, 's, go, to, N.Y., !}

Why do we do it?

Computers don’t understand language, but they are really good at counting pieces of language. The most informative pieces are (often) words.
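As a quick illustration, {quanteda} tokenizes with quanteda::tokens(); this is a minimal sketch, and the exact token boundaries depend on quanteda’s default word tokenizer, which may split the example sentence slightly differently from the illustration above.

```r
library(quanteda)

# Tokenize a single example sentence; quanteda's default word tokenizer
# splits on whitespace and separates most punctuation into its own tokens.
toks <- quanteda::tokens("Let's go to N.Y.!")
as.character(toks)
```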

Stemming

Reducing words to their stem.

Why do we do it?

When we do not want to distinguish between different verb forms (walk, walk-ing, walk-ed, walk-s) or between singular and plural (cat, cat-s).
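For illustration, {quanteda} exposes stemming via quanteda::tokens_wordstem(), which applies the Porter (Snowball) stemmer; a minimal sketch:

```r
library(quanteda)

# Stem a handful of inflected forms; Porter stemming collapses
# walk/walking/walked/walks to "walk" and cats to "cat".
toks <- quanteda::tokens("walk walking walked walks cats")
quanteda::tokens_wordstem(toks, language = "english")
```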

Lemmatization

Lemmatization refers to doing things properly, using a vocabulary and a morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, known as the lemma.

For sentiment analysis, I think it would be more important to lemmatize words instead of stem them so that we do not lose the meaning of targeted words.

Lemmatization is more accurate but it requires data about language, and it also takes much more time.

original <- 
  cleaned_txt %>%
  quanteda::tokens()

# Lemmatized texts
lemmas <- 
  quanteda::tokens_replace(
    original, 
    pattern = lexicon::hash_lemmas$token, 
    replacement = lexicon::hash_lemmas$lemma
  ) %>%
  paste0()

Original: i, was, real, scared, about, getting, implanon, after, reading, about, it, ., the, process, about, 15, mins, ., they, numb, your, arm, and, it, does, burn, going, in, for, about, a, min, ., then, they, use, another, needle, to, insert, ., i, felt, the, pressure, of, them, putting, it, in, ., i, turned, my, head, the, whole, time, so, i, didn’t, see, anything, ., afterwards, i, had, to, sit, for, a, couple, of, minutes, to, make, sure, i, wouldn’t, have, an, allergic, reaction, ., a, couple, of, days, later, the, small, hole, in, my, arm, started, to, itch, really, bad, because, it, was, healing, ., the, first, month, nothing, happened, ., i, had, it, inserted, while, on, my, period, ., the, second, month, i, was, on, period, for, a, whole, month, but, it, wasn’t, a, regular, period, ., it, wasn’t, as, heavy, but, it, was, darker, than, usual, ., i, called, my, doctor, and, she, said, that, was, normal, ., i, recommend, implanon, .

Lemmatized: i, be, real, scare, about, get, implanon, after, read, about, it, ., the, process, about, 15, min, ., they, numb, your, arm, and, it, do, burn, go, in, for, about, a, min, ., then, they, use, another, needle, to, insert, ., i, feel, the, pressure, of, them, putting, it, in, ., i, turn, my, head, the, whole, time, so, i, didn’t, see, anything, ., afterwards, i, have, to, sit, for, a, couple, of, minute, to, make, sure, i, wouldn’t, have, a, allergic, reaction, ., a, couple, of, day, late, the, small, hole, in, my, arm, start, to, itch, really, bad, because, it, be, heal, ., the, first, month, nothing, happen, ., i, have, it, insert, while, on, my, period, ., the, 2, month, i, be, on, period, for, a, whole, month, but, it, wasn’t, a, regular, period, ., it, wasn’t, as, heavy, but, it, be, dark, than, usual, ., i, call, my, doctor, and, she, say, that, be, normal, ., i, recommend, implanon, .

Stopwords removal

This removes common function words (stopwords) from our tokens.

Why do we do it?

Some words are simply not interesting, but do occur often. For instance, words such as i, me, my, myself, is, was, ours, ourselves, you, your, etc.

stopwords_removed <-
  quanteda::tokens_remove(original, quanteda::stopwords("english"))

Original: i, was, real, scared, about, getting, implanon, after, reading, about, it, ., the, process, about, 15, mins, ., they, numb, your, arm, and, it, does, burn, going, in, for, about, a, min, ., then, they, use, another, needle, to, insert, ., i, felt, the, pressure, of, them, putting, it, in, ., i, turned, my, head, the, whole, time, so, i, didn’t, see, anything, ., afterwards, i, had, to, sit, for, a, couple, of, minutes, to, make, sure, i, wouldn’t, have, an, allergic, reaction, ., a, couple, of, days, later, the, small, hole, in, my, arm, started, to, itch, really, bad, because, it, was, healing, ., the, first, month, nothing, happened, ., i, had, it, inserted, while, on, my, period, ., the, second, month, i, was, on, period, for, a, whole, month, but, it, wasn’t, a, regular, period, ., it, wasn’t, as, heavy, but, it, was, darker, than, usual, ., i, called, my, doctor, and, she, said, that, was, normal, ., i, recommend, implanon, .

Stopwords removed: real, scared, getting, implanon, reading, ., process, 15, mins, ., numb, arm, burn, going, min, ., use, another, needle, insert, ., felt, pressure, putting, ., turned, head, whole, time, see, anything, ., afterwards, sit, couple, minutes, make, sure, allergic, reaction, ., couple, days, later, small, hole, arm, started, itch, really, bad, healing, ., first, month, nothing, happened, ., inserted, period, ., second, month, period, whole, month, regular, period, ., heavy, darker, usual, ., called, doctor, said, normal, ., recommend, implanon, .

Full processing application

We apply all concepts into our working dataset.

reviews <-
  cleaned_review %>%
  quanteda::tokens(remove_punct = TRUE) %>%
  quanteda::tokens_replace(
    pattern = lexicon::hash_lemmas$token,
    replacement = lexicon::hash_lemmas$lemma
  ) %>% 
  quanteda::tokens_remove(quanteda::stopwords("english"))

reviews_df <- 
  data.frame(
    id = seq_along(reviews),
    text = sapply(reviews, paste, collapse = " "),
    row.names = NULL
  )
saveRDS(full, here::here("data", "cleaned_review.rds"))
readr::write_csv(reviews_df, here::here("data", "review_tokens.csv"))

Corpus Statistics

For this section, we utilize {quanteda} to do some corpus statistics. We first have to build our corpus from the tokens we cleaned in the 03_patient_reviews.Rmd notebook.

tokens_csv <- readr::read_csv(here::here("data", "review_tokens.csv"))
## Rows: 10000 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): text
## dbl (1): id
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
corpus <- quanteda::corpus(tokens_csv, text_field = "text")

Document-term Matrix

Given our corpus, we are going to create a document-term matrix: simply a matrix describing the frequency of each term occurring in each of our texts.

{quanteda} uses the quanteda::dfm() function to generate a document-term matrix. DFM stands for Document-Feature Matrix, which is the same concept as a document-term matrix under different terminology.

dtm <- quanteda::dfm(quanteda::tokens(corpus))

dtm
## Document-feature matrix of: 10,000 documents, 12,978 features (99.74% sparse) and 1 docvar.
##        features
## docs    medication change life panic attack control barely able leave house
##   text1          1      3    1     1      1       1      1    1     1     1
##   text2          0      0    0     0      0       1      0    0     0     0
##   text3          0      0    0     2      2       0      0    0     0     0
##   text4          0      0    0     0      0       0      0    0     0     0
##   text5          0      0    0     0      0       0      0    0     0     0
##   text6          0      0    0     0      0       0      0    0     0     0
## [ reached max_ndoc ... 9,994 more documents, reached max_nfeat ... 12,968 more features ]

We see that we have 10,000 documents with 12,978 features. Pretty cool! The matrix is also reported as 99.74% sparse, meaning 99.74% of the cells in our document-term matrix contain the value 0. This makes sense: the corpus uses 12,978 unique words, and any single document contains only a small fraction of them.
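If we want these numbers programmatically rather than reading them off the printed matrix, {quanteda} exposes helpers for both (a quick sketch, assuming the dtm object from above):

```r
# Proportion of cells in the matrix that are zero
quanteda::sparsity(dtm)

# The ten most frequent features across all documents
quanteda::topfeatures(dtm, 10)
```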

Trim less frequent words

Next, it is also a good idea to remove the least frequent words, so that our corpus statistics focus on the more interesting ones.

For now, we'll remove the terms in our dtm that occur fewer than 10 times.

trimmed_dtm <- 
  dtm %>%
  quanteda::dfm_trim(min_termfreq = 10)

trimmed_dtm
## Document-feature matrix of: 10,000 documents, 2,874 features (98.90% sparse) and 1 docvar.
##        features
## docs    medication change life panic attack control barely able leave house
##   text1          1      3    1     1      1       1      1    1     1     1
##   text2          0      0    0     0      0       1      0    0     0     0
##   text3          0      0    0     2      2       0      0    0     0     0
##   text4          0      0    0     0      0       0      0    0     0     0
##   text5          0      0    0     0      0       0      0    0     0     0
##   text6          0      0    0     0      0       0      0    0     0     0
## [ reached max_ndoc ... 9,994 more documents, reached max_nfeat ... 2,864 more features ]

We now see that our feature count decreased from 12,978 to 2,874. Removing the least frequent words is usually worthwhile because they tend to be less informative than the most frequent ones, and for text analysis it also helps to keep the number of terms manageable.
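Note that quanteda::dfm_trim() is not limited to raw term counts; it can also trim by document frequency, or by quantile. A sketch of the alternatives (illustrative only, not applied to our data):

```r
# Keep only features that appear in at least 5 distinct documents
docfreq_dtm <- quanteda::dfm_trim(dtm, min_docfreq = 5)

# Keep only features above the 90th percentile of term frequency
quantile_dtm <- quanteda::dfm_trim(
  dtm,
  min_termfreq = 0.9,
  termfreq_type = "quantile"
)
```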

Analysis

For our analysis, we are going to make some word clouds. Word clouds are visual representations of text in which more frequent words are drawn larger. For this, we will use {quanteda}’s textplot_wordcloud(); to use it, we need to install {quanteda.textplots}.

trimmed_dtm %>%
  quanteda.textplots::textplot_wordcloud(max_words = 50, color = c("blue", "red"))

The red words are the most frequent, and the blue ones the least frequent. We can also look at term frequency as a data frame; for this, we install {quanteda.textstats}.

txt_freq <-
  trimmed_dtm %>%
  quanteda.textstats::textstat_frequency(n = 20)

So, we see the 20 most frequent words:

feature  frequency  rank  docfreq  group
take          7984     1     4926    all
day           5619     2     3542    all
get           4874     3     3285    all
good          4278     4     3324    all
year          4058     5     3087    all
feel          3994     6     2738    all
month         3906     7     2739    all
go            3845     8     2892    all
work          3843     9     3036    all
much          3776    10     2802    all
effect        3418    11     2795    all
now           3263    12     2638    all
week          3252    13     2375    all
side          3247    14     2736    all
start         3223    15     2459    all
time          2955    16     2334    all
pain          2943    17     1768    all
2             2684    18     2126    all
first         2491    19     2056    all
use           2468    20     1883    all

Plotting this, we can see the following:

txt_freq %>%
  ggplot2::ggplot(ggplot2::aes(x = reorder(feature, frequency), y = frequency)) +
  ggplot2::geom_point() +
  ggplot2::coord_flip() + 
  ggplot2::labs(x = NULL, y = "Frequency")

Conclusion

I think this is where we will finish this project. Based on our corpus statistics, out of the 1,000 reviews we selected per year, take is the most frequent word used. This makes sense because our corpus consists of drug reviews, and taking a drug is essentially the main thing you do with medicine. Even setting aside medicines administered by injection, such as vaccinations, it fits: these are reviews from patients describing drugs they administered themselves.

All in all, with more computing power I could run this analysis on the full data set. Perhaps in the future the computation, along with any storage I may need, could be done in the cloud.

Packages Used

Package             Version
dplyr               1.0.10
ggplot2             3.3.6
gtsummary           1.6.2
here                1.0.1
knitr               1.40
lexicon             1.2.1
lubridate           1.8.0
magrittr            2.0.3
quanteda            3.2.3
quanteda.textplots  0.94.2
quanteda.textstats  0.96
readr               2.1.3
renv                0.13.1
rmarkdown           2.17
scales              1.2.1
stringr             1.4.0
textutils           0.2-1
tibble              3.1.8
tidyr               1.2.1

References

Benoit, Kenneth, Kohei Watanabe, Haiyan Wang, Paul Nulty, Adam Obeng, Stefan Müller, and Akitaka Matsuo. 2018. “Quanteda: An R Package for the Quantitative Analysis of Textual Data.” Journal of Open Source Software 3 (30): 774. https://doi.org/10.21105/joss.00774.

Dua, Dheeru, and Casey Graff. 2017. “UCI Machine Learning Repository.” University of California, Irvine, School of Information; Computer Sciences. http://archive.ics.uci.edu/ml.

Gräßer, Felix, Surya Kallumadi, Hagen Malberg, and Sebastian Zaunseder. 2018. “Aspect-Based Sentiment Analysis of Drug Reviews Applying Cross-Domain and Cross-Data Learning.” In Proceedings of the 2018 International Conference on Digital Health, 121–25. DH ’18. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/3194658.3194677.