An exploratory data analysis of the Drug Review dataset, which provides patient reviews of specific drugs, along with related conditions and a 10-star patient rating reflecting overall patient satisfaction.
My goal for this project is to exercise my data analysis skills. I have chosen to work with the Drug Reviews dataset (Gräßer et al. 2018) after spending some time in the UCI Machine Learning Repository (Dua and Graff 2017).
I will focus on some illustrative data visualizations, as well as some word frequency analysis.
library(magrittr)
train <- readr::read_tsv(here::here("data-raw", "drugsComTrain_raw.tsv"))
## New names:
## Rows: 161297 Columns: 7
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "\t" chr
## (4): drugName, condition, review, date dbl (3): ...1, rating, usefulCount
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
test <- readr::read_tsv(here::here("data-raw", "drugsComTest_raw.tsv"))
## New names:
## Rows: 53766 Columns: 7
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "\t" chr
## (4): drugName, condition, review, date dbl (3): ...1, rating, usefulCount
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
full <-
dplyr::bind_rows(train, test)
# Check our data
tibble::glimpse(full)
## Rows: 215,063
## Columns: 7
## $ ...1 <dbl> 206461, 95260, 92703, 138000, 35696, 155963, 165907, 10265…
## $ drugName <chr> "Valsartan", "Guanfacine", "Lybrel", "Ortho Evra", "Bupren…
## $ condition <chr> "Left Ventricular Dysfunction", "ADHD", "Birth Control", "…
## $ review <chr> "\"It has no side effect, I take it in combination of Byst…
## $ rating <dbl> 9, 8, 5, 8, 9, 2, 1, 10, 1, 8, 9, 10, 4, 4, 3, 9, 9, 9, 10…
## $ date <chr> "May 20, 2012", "April 27, 2010", "December 14, 2009", "No…
## $ usefulCount <dbl> 27, 192, 17, 10, 37, 43, 5, 32, 11, 1, 19, 54, 8, 1, 10, 2…
colnames(full)
## [1] "...1" "drugName" "condition" "review" "rating"
## [6] "date" "usefulCount"
Based on our output, we get 7 variables (columns) and 215,063 observations (rows). That’s a pretty lengthy dataset!
Our data dictionary is as follows:
| Variable Name | Description |
|---|---|
| ...1 | Identifier |
| drugName | Name of drug |
| condition | Name of condition |
| review | Patient review |
| rating | 10-star patient rating |
| date | Date of review entry |
| usefulCount | Number of users who found review useful |
We do some basic cleaning.
The first column, ...1, looks to be an identifier, so we rename it to id, and then we do some minor text normalization to enforce consistency. Additionally, we will utilize the {lubridate} package to break the date column into a census month and a census year, to see if they give us some additional insights.
We will also remove any special characters in the drugName and condition columns and turn white spaces into underscores, while keeping the original review text untouched for now.
full <-
full %>%
dplyr::rename(id = "...1")
full$drugName <- stringr::str_replace_all(full$drugName, "[^[:alnum:]]", "_")
full$condition <- stringr::str_replace_all(full$condition, "[^[:alnum:]]", "_")
full$drugName <- tolower(full$drugName)
full$condition <- tolower(full$condition)
full$date <- lubridate::mdy(full$date)
full$rating <- as.numeric(full$rating)
full$usefulCount <- as.numeric(full$usefulCount)
full <-
full %>%
dplyr::mutate(census_month = lubridate::month(date)) %>%
dplyr::mutate(census_year = as.character(lubridate::year(date))) %>%
dplyr::rename(full_date = date)
We get a new-looking dataset.
full %>%
dplyr::slice_head(n = 5) %>%
tibble::glimpse()
## Rows: 5
## Columns: 9
## $ id <dbl> 206461, 95260, 92703, 138000, 35696
## $ drugName <chr> "valsartan", "guanfacine", "lybrel", "ortho_evra", "bupre…
## $ condition <chr> "left_ventricular_dysfunction", "adhd", "birth_control", …
## $ review <chr> "\"It has no side effect, I take it in combination of Bys…
## $ rating <dbl> 9, 8, 5, 8, 9
## $ full_date <date> 2012-05-20, 2010-04-27, 2009-12-14, 2015-11-03, 2016-11-2…
## $ usefulCount <dbl> 27, 192, 17, 10, 37
## $ census_month <dbl> 5, 4, 12, 11, 11
## $ census_year <chr> "2012", "2010", "2009", "2015", "2016"
Doing some summarizations, we can see count distributions for each column. This will also help us determine whether any NA values exist.
drug_count <-
full %>%
dplyr::select(drugName) %>%
dplyr::group_by(drugName) %>%
dplyr::summarise(count = dplyr::n())
What are the top 10 most reviewed drugs?
drug_count %>%
dplyr::arrange(desc(count)) %>%
dplyr::slice_head(n = 10) %>%
knitr::kable()
| drugName | count |
|---|---|
| levonorgestrel | 4930 |
| etonogestrel | 4421 |
| ethinyl_estradiol___norethindrone | 3753 |
| nexplanon | 2892 |
| ethinyl_estradiol___norgestimate | 2790 |
| ethinyl_estradiol___levonorgestrel | 2503 |
| phentermine | 2085 |
| sertraline | 1868 |
| escitalopram | 1747 |
| mirena | 1673 |
What are the top 10 least reviewed drugs?
drug_count %>%
dplyr::arrange(count) %>%
dplyr::slice_head(n = 10) %>%
knitr::kable()
| drugName | count |
|---|---|
| a___d_cracked_skin_relief | 1 |
| abacavir_lamivudine_zidovudine | 1 |
| absorbine_jr_ | 1 |
| abstral | 1 |
| accupril | 1 |
| acetaminophen_caffeine_pyrilamine | 1 |
| acetaminophen_chlorpheniramine_dextromethorphan___pseudoephedrine | 1 |
| acetaminophen_dextromethorphan_doxylamine | 1 |
| acetaminophen_dextromethorphan_guaifenesin___phenylephrine | 1 |
| acetaminophen_dextromethorphan_guaifenesin___pseudoephedrine | 1 |
This bottom-10 list is actually misleading, as there are 798 drugs that each have a count of 1.
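We can confirm that kind of claim directly. Below is a minimal sketch on a toy stand-in for drug_count (the real check only needs the count column; the toy values here are made up for illustration):

```r
library(magrittr)

# Hypothetical stand-in for drug_count: two of the four drugs have one review
toy_drug_count <- tibble::tibble(
  drugName = c("phentermine", "sertraline", "abstral", "accupril"),
  count    = c(2085, 1868, 1, 1)
)

# Number of drugs reviewed exactly once
n_singletons <-
  toy_drug_count %>%
  dplyr::filter(count == 1) %>%
  nrow()
```

Run against the real drug_count, n_singletons should come out to 798.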
Are there any NAs or NULLs for drugName?
drug_count %>%
dplyr::filter(is.na(drugName) | is.null(drugName)) %>%
dplyr::select(count) %>%
knitr::kable()
| count |
|---|
condition_count <-
full %>%
dplyr::select(condition) %>%
dplyr::group_by(condition) %>%
dplyr::summarise(count = dplyr::n())
What are the top 10 patient conditions?
condition_count %>%
dplyr::arrange(desc(count)) %>%
dplyr::slice_head(n = 10) %>%
knitr::kable()
| condition | count |
|---|---|
| birth_control | 38436 |
| depression | 12164 |
| pain | 8245 |
| anxiety | 7812 |
| acne | 7435 |
| bipolar_disorde | 5604 |
| insomnia | 4904 |
| weight_loss | 4857 |
| obesity | 4757 |
| adhd | 4509 |
What are the least common patient conditions?
condition_count %>%
dplyr::arrange(count) %>%
dplyr::slice_head(n = 35) %>%
knitr::kable()
| condition | count |
|---|---|
| 100__span__users_found_this_comment_helpful_ | 1 |
| 105__span__users_found_this_comment_helpful_ | 1 |
| 110__span__users_found_this_comment_helpful_ | 1 |
| 121__span__users_found_this_comment_helpful_ | 1 |
| 123__span__users_found_this_comment_helpful_ | 1 |
| 135__span__users_found_this_comment_helpful_ | 1 |
| 145__span__users_found_this_comment_helpful_ | 1 |
| 146__span__users_found_this_comment_helpful_ | 1 |
| 26__span__users_found_this_comment_helpful_ | 1 |
| 30__span__users_found_this_comment_helpful_ | 1 |
| 37__span__users_found_this_comment_helpful_ | 1 |
| 38__span__users_found_this_comment_helpful_ | 1 |
| 40__span__users_found_this_comment_helpful_ | 1 |
| 47__span__users_found_this_comment_helpful_ | 1 |
| 48__span__users_found_this_comment_helpful_ | 1 |
| 54__span__users_found_this_comment_helpful_ | 1 |
| 61__span__users_found_this_comment_helpful_ | 1 |
| 62__span__users_found_this_comment_helpful_ | 1 |
| 63__span__users_found_this_comment_helpful_ | 1 |
| 64__span__users_found_this_comment_helpful_ | 1 |
| 70__span__users_found_this_comment_helpful_ | 1 |
| 72__span__users_found_this_comment_helpful_ | 1 |
| 76__span__users_found_this_comment_helpful_ | 1 |
| 77__span__users_found_this_comment_helpful_ | 1 |
| 79__span__users_found_this_comment_helpful_ | 1 |
| 83__span__users_found_this_comment_helpful_ | 1 |
| 84__span__users_found_this_comment_helpful_ | 1 |
| 92__span__users_found_this_comment_helpful_ | 1 |
| 94__span__users_found_this_comment_helpful_ | 1 |
| 95__span__users_found_this_comment_helpful_ | 1 |
| 98__span__users_found_this_comment_helpful_ | 1 |
| amyotrophic_lateral_sclerosis | 1 |
| anti_nmda_receptor_encephalitis | 1 |
| asystole | 1 |
| av_heart_block | 1 |
I am unsure what to make of the __span__users_found_this_comment_helpful_ entries, so we will remove those later. But ignoring that weirdness, we can see that amyotrophic_lateral_sclerosis is the least mentioned condition, followed by anti_nmda_receptor_encephalitis.
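A minimal sketch of that later removal, assuming the junk values all contain the span__users_found_this_comment_helpful marker seen in the table (the conditions vector below is a hypothetical stand-in, not the real column):

```r
conditions <- c(
  "adhd",
  "100__span__users_found_this_comment_helpful_",
  "asystole"
)

# Keep only conditions that are not scrape artifacts
clean_conditions <-
  conditions[!stringr::str_detect(conditions, "span__users_found_this_comment_helpful")]
```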
Are there any NAs or NULLs for patient condition?
na_condition <-
condition_count %>%
dplyr::filter(is.na(condition) | is.null(condition)) %>%
dplyr::select(count)
There appear to be 1,194 NA or NULL conditions. Since there are so many, I have decided to leave them in and simply convert each NA/NULL condition to an unknown_condition value.
full$condition <- tidyr::replace_na(full$condition, "unknown_condition")
# Check
full %>%
dplyr::filter(condition == "unknown_condition") %>%
dplyr::count() %>%
knitr::kable()
| n |
|---|
| 1194 |
Since the ratings are 1-10, we will categorize them into 3 levels: Low [0-3], Mid [4-7], and High [8-10], with an Unknown fallback.
First, we create our row levels, then we add a new drug_rating column.
rating_row_lvl <-
c("Low [0-3]",
"Mid [4-7]",
"High [8-10]",
"Unknown")
full_proc_rating <-
full %>%
dplyr::mutate(
drug_rating = factor(
dplyr::case_when(
rating > 0 & rating <= 3 ~ rating_row_lvl[1],
rating >= 4 & rating <= 7 ~ rating_row_lvl[2],
rating >= 8 & rating <= 10 ~ rating_row_lvl[3],
TRUE ~ rating_row_lvl[4]
), levels = rating_row_lvl))
Then, we utilize the {gtsummary} package to see the count and frequency distributions.
full_proc_rating %>%
dplyr::select(drug_rating) %>%
gtsummary::tbl_summary(label = list(drug_rating ~ "Drug Rating")) %>%
gtsummary::bold_labels() %>%
gtsummary::modify_footnote(update = dplyr::everything() ~ NA)
| Characteristic | N = 215,063 |
|---|---|
| Drug Rating | |
| Low [0-3] | 46,901 (22%) |
| Mid [4-7] | 38,403 (18%) |
| High [8-10] | 129,759 (60%) |
| Unknown | 0 (0%) |
We can plot it as well.
full_proc_rating %>%
ggplot2::ggplot(ggplot2::aes(x = drug_rating)) +
ggplot2::geom_bar(fill = "#0099f9") +
ggplot2::theme_classic() +
ggplot2::scale_y_continuous(
labels = scales::label_number(suffix = "K", scale = 1e-3)
) +
ggplot2::labs(
title = "Patient rating distribution",
x = "Drug Rating Categories",
y = "Patient Count"
)
We see that the general consensus leans towards a high rating for each drug. Let’s drill down further to see which rating is the most common overall.
full %>%
dplyr::mutate(rating__char = as.character(rating)) %>%
dplyr::select(rating__char) %>%
gtsummary::tbl_summary(label = list(rating__char ~ "Drug Rating")) %>%
gtsummary::bold_labels() %>%
gtsummary::modify_footnote(update = dplyr::everything() ~ NA)
| Characteristic | N = 215,063 |
|---|---|
| Drug Rating | |
| 1 | 28,918 (13%) |
| 10 | 68,005 (32%) |
| 2 | 9,265 (4.3%) |
| 3 | 8,718 (4.1%) |
| 4 | 6,671 (3.1%) |
| 5 | 10,723 (5.0%) |
| 6 | 8,462 (3.9%) |
| 7 | 12,547 (5.8%) |
| 8 | 25,046 (12%) |
| 9 | 36,708 (17%) |
Let’s plot our summaries.
full %>%
ggplot2::ggplot(ggplot2::aes(x = rating)) +
ggplot2::geom_bar(fill = "#0099f9") +
ggplot2::theme_classic() +
ggplot2::labs(
title = "Patient rating distribution",
x = "Drug Ratings",
y = "Patient Count"
)
rating__sums <-
full %>%
dplyr::mutate(rating__char = as.character(rating)) %>%
dplyr::group_by(rating__char) %>%
dplyr::summarise(count = dplyr::n())
max_tmp_cnt <- max(rating__sums$count)
max_tmp_rating <-
rating__sums %>%
dplyr::filter(count == max_tmp_cnt) %>%
dplyr::select(rating__char)
We see that patients mostly rated their drugs 10, with a maximum count of 68,005. This suggests that patients’ attitudes towards their drugs are generally positive.
Using the census_month and census_year, we should be able to see count distributions per month and per year. To make our analysis easier, we will summarize the census month and year separately.
First, we convert each month numerical representation to their correct string representation, and then summarize with {gtsummary}.
review_census_month <-
full %>%
dplyr::mutate(
census_month__abbr = factor(month.abb[census_month], levels = month.abb)
)
review_census_month %>%
dplyr::select(census_month__abbr) %>%
gtsummary::tbl_summary(
label = list(
census_month__abbr ~ "Census Month of Review"
)
) %>%
gtsummary::bold_labels() %>%
gtsummary::modify_footnote(update = dplyr::everything() ~ NA)
| Characteristic | N = 215,063 |
|---|---|
| Census Month of Review | |
| Jan | 18,347 (8.5%) |
| Feb | 16,133 (7.5%) |
| Mar | 18,675 (8.7%) |
| Apr | 18,224 (8.5%) |
| May | 17,673 (8.2%) |
| Jun | 16,939 (7.9%) |
| Jul | 18,504 (8.6%) |
| Aug | 19,491 (9.1%) |
| Sep | 17,880 (8.3%) |
| Oct | 18,636 (8.7%) |
| Nov | 18,430 (8.6%) |
| Dec | 16,131 (7.5%) |
review_census_month %>%
ggplot2::ggplot(ggplot2::aes(x = census_month__abbr)) +
ggplot2::geom_bar(fill = "#0099f9") +
ggplot2::theme_classic() +
ggplot2::scale_y_continuous(
labels = scales::label_number(suffix = "K", scale = 1e-3)
) +
ggplot2::labs(
title = "Patient Review per Month",
x = "Census Month",
y = "Patient Count"
)
Based on the summary table, reviews look pretty evenly distributed across the months, although August has the most reviews, with a count of 19,491.
Next, we take a look at the census year.
full %>%
dplyr::select(census_year) %>%
gtsummary::tbl_summary(
label = list(census_year ~ "Census Year of Review")
) %>%
gtsummary::bold_labels() %>%
gtsummary::modify_footnote(update = dplyr::everything() ~ NA)
| Characteristic | N = 215,063 |
|---|---|
| Census Year of Review | |
| 2008 | 6,785 (3.2%) |
| 2009 | 15,642 (7.3%) |
| 2010 | 11,227 (5.2%) |
| 2011 | 15,454 (7.2%) |
| 2012 | 13,382 (6.2%) |
| 2013 | 16,359 (7.6%) |
| 2014 | 16,104 (7.5%) |
| 2015 | 36,192 (17%) |
| 2016 | 46,607 (22%) |
| 2017 | 37,311 (17%) |
full %>%
dplyr::select(census_year) %>%
ggplot2::ggplot(ggplot2::aes(x = census_year)) +
ggplot2::theme_classic() +
ggplot2::scale_y_continuous(
labels = scales::label_number(suffix = "K", scale = 1e-3)
) +
ggplot2::geom_bar(fill = "#0099f9") +
ggplot2::labs(
title = "Patient Review per Year",
x = "Census Year",
y = "Patient Count"
)
At first glance, it looks like we see a steady increase in the number of patient reviews. We especially see a jump from 2014 to 2015, with 2016 having the most reviews at 46,607.
Looking at the census year and month together, we can see how patient reviews are distributed across the months of each year.
Let’s first look at the trend of reviews.
census_month_year <-
full %>%
dplyr::select(census_month, census_year) %>%
dplyr::mutate(
census_month = factor(month.abb[census_month], levels = month.abb))
census_month_year %>%
dplyr::group_by(census_month, census_year) %>%
dplyr::summarise(rev_count = dplyr::n(), .groups = "keep") %>%
dplyr::arrange(census_year, census_month) %>%
dplyr::ungroup() %>%
dplyr::mutate(time = dplyr::row_number()) %>%
ggplot2::ggplot(ggplot2::aes(x = time, y = rev_count)) +
ggplot2::geom_point(shape = 21, color = "black", fill = "#69b3a2", size = 2) +
ggplot2::geom_line(color = "black") +
ggplot2::theme_classic() +
ggplot2::scale_y_continuous(
labels = scales::label_number(suffix = "K", scale = 1e-3)
) +
ggplot2::labs(
    title = "Patient Review count over time",
x = "Time period (t)",
y = "Review Count",
)
We see a couple of peaks and valleys here and there. At around t=23 we see the first peak, but by t=25 we get the biggest drop in patient responses. The biggest climb then runs from roughly t=80 to t=90.
Let’s look at some summaries.
census_month_year %>%
gtsummary::tbl_summary(
by = census_year,
label = list(census_month ~ "Census Month")
) %>%
gtsummary::bold_labels() %>%
gtsummary::modify_footnote(update = dplyr::everything() ~ NA)
| Characteristic | 2008, N = 6,785 | 2009, N = 15,642 | 2010, N = 11,227 | 2011, N = 15,454 | 2012, N = 13,382 | 2013, N = 16,359 | 2014, N = 16,104 | 2015, N = 36,192 | 2016, N = 46,607 | 2017, N = 37,311 |
|---|---|---|---|---|---|---|---|---|---|---|
| Census Month | ||||||||||
| Jan | 0 (0%) | 731 (4.7%) | 1,239 (11%) | 1,226 (7.9%) | 1,670 (12%) | 1,489 (9.1%) | 1,293 (8.0%) | 1,949 (5.4%) | 4,508 (9.7%) | 4,242 (11%) |
| Feb | 142 (2.1%) | 949 (6.1%) | 839 (7.5%) | 994 (6.4%) | 1,154 (8.6%) | 1,170 (7.2%) | 1,350 (8.4%) | 1,808 (5.0%) | 4,021 (8.6%) | 3,706 (9.9%) |
| Mar | 789 (12%) | 1,174 (7.5%) | 956 (8.5%) | 1,120 (7.2%) | 1,311 (9.8%) | 1,300 (7.9%) | 1,264 (7.8%) | 2,570 (7.1%) | 4,039 (8.7%) | 4,152 (11%) |
| Apr | 716 (11%) | 1,221 (7.8%) | 944 (8.4%) | 1,121 (7.3%) | 1,341 (10%) | 1,190 (7.3%) | 1,226 (7.6%) | 2,788 (7.7%) | 3,910 (8.4%) | 3,767 (10%) |
| May | 608 (9.0%) | 1,164 (7.4%) | 937 (8.3%) | 1,121 (7.3%) | 1,094 (8.2%) | 1,349 (8.2%) | 1,378 (8.6%) | 2,809 (7.8%) | 3,560 (7.6%) | 3,653 (9.8%) |
| Jun | 648 (9.6%) | 1,195 (7.6%) | 853 (7.6%) | 1,195 (7.7%) | 974 (7.3%) | 1,285 (7.9%) | 1,294 (8.0%) | 2,739 (7.6%) | 3,676 (7.9%) | 3,080 (8.3%) |
| Jul | 673 (9.9%) | 1,405 (9.0%) | 888 (7.9%) | 1,234 (8.0%) | 1,119 (8.4%) | 1,620 (9.9%) | 1,328 (8.2%) | 3,366 (9.3%) | 3,922 (8.4%) | 2,949 (7.9%) |
| Aug | 664 (9.8%) | 1,483 (9.5%) | 1,052 (9.4%) | 1,393 (9.0%) | 1,082 (8.1%) | 1,514 (9.3%) | 1,220 (7.6%) | 3,612 (10.0%) | 4,326 (9.3%) | 3,145 (8.4%) |
| Sep | 606 (8.9%) | 1,620 (10%) | 880 (7.8%) | 1,446 (9.4%) | 948 (7.1%) | 1,302 (8.0%) | 1,202 (7.5%) | 3,537 (9.8%) | 3,657 (7.8%) | 2,682 (7.2%) |
| Oct | 663 (9.8%) | 1,748 (11%) | 912 (8.1%) | 1,637 (11%) | 914 (6.8%) | 1,577 (9.6%) | 1,356 (8.4%) | 3,772 (10%) | 3,577 (7.7%) | 2,480 (6.6%) |
| Nov | 667 (9.8%) | 1,645 (11%) | 827 (7.4%) | 1,578 (10%) | 804 (6.0%) | 1,554 (9.5%) | 1,552 (9.6%) | 3,491 (9.6%) | 3,669 (7.9%) | 2,643 (7.1%) |
| Dec | 609 (9.0%) | 1,307 (8.4%) | 900 (8.0%) | 1,389 (9.0%) | 971 (7.3%) | 1,009 (6.2%) | 1,641 (10%) | 3,751 (10%) | 3,742 (8.0%) | 812 (2.2%) |
census_month_year %>%
dplyr::group_by(census_month, census_year) %>%
dplyr::summarise(rev_count = dplyr::n(), .groups = "keep") %>%
ggplot2::ggplot(
ggplot2::aes(x = census_year, y = census_month, fill = rev_count)
) +
ggplot2::geom_tile() +
ggplot2::theme_classic() +
ggplot2::scale_fill_gradient(low="white", high="blue") +
ggplot2::labs(
title = "Patient Review per Month per Year",
x = "Census Year",
y = "Census Month",
fill = "Review Count"
)
The heatmap provides a great visualization of when we see an increase in patient responses. From March of 2015 all the way to May of 2017, we see strong numbers of responses. Then, at about April of 2017, we see some decline all the way to November of that same year.
Excluding the usefulCount column (I would like to clump that together with the patient reviews), and setting aside the sentiment of the reviews, we can draw some conclusions.
Since there are so many observations, my laptop does not have the computing power to tokenize all of the reviews. So, I have decided to capture only the first 1,000 observations in each year and work with those.
full <-
full %>%
dplyr::group_by(census_year) %>%
dplyr::slice_head(n = 1000) %>%
dplyr::ungroup()
Let’s take a look at 5 sample patients’ review texts to see what we are dealing with.
sample_txt <-
full %>%
dplyr::select(id, review) %>%
dplyr::slice_sample(n = 5)
Patient 1’s review reads: “Ritalin made me able to block out everything and focus on reading my boring, tedious nursing books. Take it at least 12 hours before you go to sleep, or else it will keep you up at night with thoughts racing. It suppresses appetite and makes most people lose weight- I actually gained weight because I would be starving and binge eat at the end of the night. I did like Ritalin, but it made me too irritable so I switched to Adderall. That is much more pleasant and even more effective. Focalin 10 mg is good for people with mild ADHD symptoms.”
Patient 2’s review reads: “Harvoni Rules !!!!! I was cured in 4 weeks of treatment !!!!!”
Patient 3’s review reads: “I was real scared about getting Implanon after reading about it. The process about 15 mins. They numb your arm and it does burn going in for about a min. Then they use another needle to insert. I felt the pressure of them putting it in. I turned my head the whole time so I didn't see anything. Afterwards I had to sit for a couple of minutes to make sure I wouldn't have an allergic reaction. A couple of days later, the small hole in my arm started to itch really bad because it was healing. The first month nothing happened. I had it inserted while on my period. The second month I was on period for a whole month but it wasn't a regular period. It wasn't as heavy but it was darker than usual. I called my doctor and she said that was normal. I recommend Implanon.”
Patient 4’s review reads: “This medicine (along with talk therapy) has helped me get my life back on track after being 98% disabled.”
Patient 5’s review reads: “After taking Ceftin (Cefuroxime) for 3 days, my throat swelled and my chest became more constricted. It felt as though I could not swallow well. I was dizzy, and anxious. I consulted the pharmacist who told me to take a Benadryl and if I didn't improve quickly, to get myself to the hospital. Within 5 minutes of taking the Benadryl I could feel my throat relaxing. After relating this situation to my daughter, she told me she had a similar experience with this drug when it was prescribed for her a year ago.”
From this, we get the following initial observations:
For some basic string cleaning, we can do the following:

- Remove any newlines (\n) or tabs (\t).
- Decode HTML entities with {textutils} and enforce a consistent character encoding, UTF-8.
- Lowercase the text, strip special characters, and collapse extra whitespace.

sample_txt_cleanup <- sample_txt[[2]][[3]]
cleaned_txt <-
sample_txt_cleanup %>%
textutils::HTMLdecode() %>%
iconv("UTF-8", sub = "byte") %>%
tolower() %>%
stringr::str_replace_all("[^[:alnum:][.]['][!]]", " ") %>%
stringr::str_trim() %>%
stringr::str_squish() %>%
stringr::str_replace_all("[.]{2,3}", ".") # convert ellipsis to a period
The following is an example of a cleaned text of a patient’s review:
Original: “I was real scared about getting Implanon after reading about it. The process about 15 mins. They numb your arm and it does burn going in for about a min. Then they use another needle to insert. I felt the pressure of them putting it in. I turned my head the whole time so I didn't see anything. Afterwards I had to sit for a couple of minutes to make sure I wouldn't have an allergic reaction. A couple of days later, the small hole in my arm started to itch really bad because it was healing. The first month nothing happened. I had it inserted while on my period. The second month I was on period for a whole month but it wasn't a regular period. It wasn't as heavy but it was darker than usual. I called my doctor and she said that was normal. I recommend Implanon.”
Cleaned: i was real scared about getting implanon after reading about it. the process about 15 mins. they numb your arm and it does burn going in for about a min. then they use another needle to insert. i felt the pressure of them putting it in. i turned my head the whole time so i didn’t see anything. afterwards i had to sit for a couple of minutes to make sure i wouldn’t have an allergic reaction. a couple of days later the small hole in my arm started to itch really bad because it was healing. the first month nothing happened. i had it inserted while on my period. the second month i was on period for a whole month but it wasn’t a regular period. it wasn’t as heavy but it was darker than usual. i called my doctor and she said that was normal. i recommend implanon.
cleaned_review <-
full$review %>%
textutils::HTMLdecode() %>%
iconv("UTF-8", sub = "byte") %>%
tolower() %>%
stringr::str_replace_all("[^[:alnum:][.]['][!]]", " ") %>%
stringr::str_trim() %>%
stringr::str_squish() %>%
stringr::str_replace_all("[.]{2,3}", ".") # convert ellipsis to a period
For the rest of our preprocessing tasks, we will leverage the {quanteda} package, an amazing suite of text analytic functions (Benoit et al. 2018).
Tokenization is the process of splitting long strings into tokens, usually words. For example, the sentence “Let’s go to N.Y.!” can be tokenized into: {Let, 's, go, to, N.Y., !}
Computers don’t understand language, but they are really good at counting pieces of language, and the most informative pieces are (often) words.
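As a minimal sketch, the same sentence can be pushed through quanteda::tokens() directly (exact token boundaries depend on quanteda's underlying word tokenizer, so the split may differ slightly from the hand-worked example above):

```r
# Default tokenization keeps punctuation as separate tokens
toks <- quanteda::tokens("Let's go to N.Y.!")

# remove_punct = TRUE drops the punctuation tokens, as we do for the reviews later
toks_nopunct <- quanteda::tokens("Let's go to N.Y.!", remove_punct = TRUE)
```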
Stemming reduces words to their stem. It is useful when we do not want to distinguish between different verb forms (walk, walk-ing, walk-ed, walk-s) or between singular and plural (cat, cat-s).
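In {quanteda}, stemming is available via tokens_wordstem(), which wraps the Snowball (Porter) stemmer; a minimal sketch on the walk/cat examples:

```r
toks <- quanteda::tokens("walking walked walks cats")

# Porter stemming collapses the inflected forms onto a common stem
stemmed <- quanteda::tokens_wordstem(toks, language = "english")
```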
Lemmatization refers to doing this properly, using a vocabulary and a morphological analysis of words: it normally aims to remove inflectional endings only and to return the base or dictionary form of a word, known as the lemma.
For sentiment analysis, I think it would be more important to lemmatize words instead of stem them so that we do not lose the meaning of targeted words.
Lemmatization is more accurate but it requires data about language, and it also takes much more time.
original <-
cleaned_txt %>%
quanteda::tokens()
# Lemmatized texts
lemmas <-
quanteda::tokens_replace(
original,
pattern = lexicon::hash_lemmas$token,
replacement = lexicon::hash_lemmas$lemma
) %>%
paste0()
Original: i, was, real, scared, about, getting, implanon, after, reading, about, it, ., the, process, about, 15, mins, ., they, numb, your, arm, and, it, does, burn, going, in, for, about, a, min, ., then, they, use, another, needle, to, insert, ., i, felt, the, pressure, of, them, putting, it, in, ., i, turned, my, head, the, whole, time, so, i, didn’t, see, anything, ., afterwards, i, had, to, sit, for, a, couple, of, minutes, to, make, sure, i, wouldn’t, have, an, allergic, reaction, ., a, couple, of, days, later, the, small, hole, in, my, arm, started, to, itch, really, bad, because, it, was, healing, ., the, first, month, nothing, happened, ., i, had, it, inserted, while, on, my, period, ., the, second, month, i, was, on, period, for, a, whole, month, but, it, wasn’t, a, regular, period, ., it, wasn’t, as, heavy, but, it, was, darker, than, usual, ., i, called, my, doctor, and, she, said, that, was, normal, ., i, recommend, implanon, .
Lemmatized: i, be, real, scare, about, get, implanon, after, read, about, it, ., the, process, about, 15, min, ., they, numb, your, arm, and, it, do, burn, go, in, for, about, a, min, ., then, they, use, another, needle, to, insert, ., i, feel, the, pressure, of, them, putting, it, in, ., i, turn, my, head, the, whole, time, so, i, didn’t, see, anything, ., afterwards, i, have, to, sit, for, a, couple, of, minute, to, make, sure, i, wouldn’t, have, a, allergic, reaction, ., a, couple, of, day, late, the, small, hole, in, my, arm, start, to, itch, really, bad, because, it, be, heal, ., the, first, month, nothing, happen, ., i, have, it, insert, while, on, my, period, ., the, 2, month, i, be, on, period, for, a, whole, month, but, it, wasn’t, a, regular, period, ., it, wasn’t, as, heavy, but, it, be, dark, than, usual, ., i, call, my, doctor, and, she, say, that, be, normal, ., i, recommend, implanon, .
Next, we remove stopwords.
Some words are simply not interesting, but do occur often. For instance, words such as i, me, my, myself, is, was, ours, ourselves, you, your, etc.
stopwords_removed <-
quanteda::tokens_remove(original, quanteda::stopwords("english"))
Original: i, was, real, scared, about, getting, implanon, after, reading, about, it, ., the, process, about, 15, mins, ., they, numb, your, arm, and, it, does, burn, going, in, for, about, a, min, ., then, they, use, another, needle, to, insert, ., i, felt, the, pressure, of, them, putting, it, in, ., i, turned, my, head, the, whole, time, so, i, didn’t, see, anything, ., afterwards, i, had, to, sit, for, a, couple, of, minutes, to, make, sure, i, wouldn’t, have, an, allergic, reaction, ., a, couple, of, days, later, the, small, hole, in, my, arm, started, to, itch, really, bad, because, it, was, healing, ., the, first, month, nothing, happened, ., i, had, it, inserted, while, on, my, period, ., the, second, month, i, was, on, period, for, a, whole, month, but, it, wasn’t, a, regular, period, ., it, wasn’t, as, heavy, but, it, was, darker, than, usual, ., i, called, my, doctor, and, she, said, that, was, normal, ., i, recommend, implanon, .
Stopwords removed: real, scared, getting, implanon, reading, ., process, 15, mins, ., numb, arm, burn, going, min, ., use, another, needle, insert, ., felt, pressure, putting, ., turned, head, whole, time, see, anything, ., afterwards, sit, couple, minutes, make, sure, allergic, reaction, ., couple, days, later, small, hole, arm, started, itch, really, bad, healing, ., first, month, nothing, happened, ., inserted, period, ., second, month, period, whole, month, regular, period, ., heavy, darker, usual, ., called, doctor, said, normal, ., recommend, implanon, .
We now apply all of these concepts to our working dataset.
reviews <-
cleaned_review %>%
quanteda::tokens(remove_punct = TRUE) %>%
quanteda::tokens_replace(
pattern = lexicon::hash_lemmas$token,
replacement = lexicon::hash_lemmas$lemma
) %>%
quanteda::tokens_remove(quanteda::stopwords("english"))
reviews_df <-
data.frame(
id = seq_along(reviews),
text = sapply(reviews, paste, collapse = " "),
row.names = NULL
)
saveRDS(full, here::here("data", "cleaned_review.rds"))
readr::write_csv(reviews_df, here::here("data", "review_tokens.csv"))
For this section, we utilize {quanteda} to compute some corpus statistics. We first have to build our corpus from the tokens we cleaned in the 03_patient_reviews.Rmd notebook.
tokens_csv <- readr::read_csv(here::here("data", "review_tokens.csv"))
## Rows: 10000 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): text
## dbl (1): id
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
corpus <- quanteda::corpus(tokens_csv, text_field = "text")
Given our corpus, we are going to create a document-term matrix: a mathematical matrix that describes the frequency of the terms occurring in our texts.
{quanteda} uses the quanteda::dfm() function to generate one. DFM stands for Document-Feature Matrix, which is essentially the same thing as a document-term matrix, just different terminology.
dtm <- quanteda::dfm(quanteda::tokens(corpus))
dtm
## Document-feature matrix of: 10,000 documents, 12,978 features (99.74% sparse) and 1 docvar.
## features
## docs medication change life panic attack control barely able leave house
## text1 1 3 1 1 1 1 1 1 1 1
## text2 0 0 0 0 0 1 0 0 0 0
## text3 0 0 0 2 2 0 0 0 0 0
## text4 0 0 0 0 0 0 0 0 0 0
## text5 0 0 0 0 0 0 0 0 0 0
## text6 0 0 0 0 0 0 0 0 0 0
## [ reached max_ndoc ... 9,994 more documents, reached max_nfeat ... 12,968 more features ]
We see that we have 10,000 documents with 12,978 features. Pretty cool! We also see that the matrix is 99.74% sparse, meaning that 99.74% of all the cells in our document-term matrix contain the value 0. This makes a lot of sense: there are 12,978 unique words in our corpus, and each document contains only a small portion of them.
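That sparsity figure is just the proportion of zero cells, which we can verify on a toy dfm (a sketch with a made-up two-document corpus; quanteda also ships a sparsity() helper that reports the same number):

```r
toy_corpus <- c(d1 = "drug helped my pain", d2 = "no side effect at all")
toy_dtm <- quanteda::dfm(quanteda::tokens(toy_corpus))

# Proportion of cells equal to zero, computed by hand
manual_sparsity <- sum(as.matrix(toy_dtm) == 0) / prod(dim(toy_dtm))

# quanteda's built-in equivalent
builtin_sparsity <- quanteda::sparsity(toy_dtm)
```

Here the two documents share no words, so 9 of the 18 cells are zero and both values come out to 0.5.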
Next, it is also a good idea to remove the least frequent words, so that we can run our corpus statistics on the more interesting ones.
For now, we’ll remove the terms in our dtm that occur fewer than 10 times.
trimmed_dtm <-
dtm %>%
quanteda::dfm_trim(min_termfreq = 10)
trimmed_dtm
## Document-feature matrix of: 10,000 documents, 2,874 features (98.90% sparse) and 1 docvar.
## features
## docs medication change life panic attack control barely able leave house
## text1 1 3 1 1 1 1 1 1 1 1
## text2 0 0 0 0 0 1 0 0 0 0
## text3 0 0 0 2 2 0 0 0 0 0
## text4 0 0 0 0 0 0 0 0 0 0
## text5 0 0 0 0 0 0 0 0 0 0
## text6 0 0 0 0 0 0 0 0 0 0
## [ reached max_ndoc ... 9,994 more documents, reached max_nfeat ... 2,864 more features ]
We now see that our feature count decreased from 12,978 to 2,874. Removing the least frequent words is usually worthwhile because they tend to be less informative than the more frequent ones, and many text analyses also benefit from not carrying too many terms.
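Conceptually, dfm_trim(min_termfreq = 10) just drops the columns of the matrix whose total count across all documents falls below the threshold. A base-R sketch on a toy matrix (with a threshold of 2, purely for illustration):

```r
# Toy document-term matrix (hypothetical counts, for illustration only)
m <- matrix(c(1, 0, 0, 1,
              0, 2, 0, 0,
              1, 0, 1, 3),
            nrow = 3, byrow = TRUE,
            dimnames = list(paste0("text", 1:3),
                            c("take", "day", "rare", "work")))

# Keep only terms whose total frequency meets the threshold
min_termfreq <- 2
trimmed <- m[, colSums(m) >= min_termfreq, drop = FALSE]
colnames(trimmed)  # "rare" (total count 1) is dropped
```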
For our analysis, we are going to draw some word clouds: visual representations of text that give greater prominence to words that appear more frequently. We will use {quanteda}'s textplot_wordcloud() function, which requires installing {quanteda.textplots}.
trimmed_dtm %>%
quanteda.textplots::textplot_wordcloud(max_words = 50, color = c("blue", "red"))
The red words are the most frequent, and the blue ones the least frequent. We can also look at term frequency as a data frame; for this, we install {quanteda.textstats}.
txt_freq <-
trimmed_dtm %>%
quanteda.textstats::textstat_frequency(n = 20)
So, we see the 20 most frequent words:
| feature | frequency | rank | docfreq | group |
|---|---|---|---|---|
| take | 7984 | 1 | 4926 | all |
| day | 5619 | 2 | 3542 | all |
| get | 4874 | 3 | 3285 | all |
| good | 4278 | 4 | 3324 | all |
| year | 4058 | 5 | 3087 | all |
| feel | 3994 | 6 | 2738 | all |
| month | 3906 | 7 | 2739 | all |
| go | 3845 | 8 | 2892 | all |
| work | 3843 | 9 | 3036 | all |
| much | 3776 | 10 | 2802 | all |
| effect | 3418 | 11 | 2795 | all |
| now | 3263 | 12 | 2638 | all |
| week | 3252 | 13 | 2375 | all |
| side | 3247 | 14 | 2736 | all |
| start | 3223 | 15 | 2459 | all |
| time | 2955 | 16 | 2334 | all |
| pain | 2943 | 17 | 1768 | all |
| 2 | 2684 | 18 | 2126 | all |
| first | 2491 | 19 | 2056 | all |
| use | 2468 | 20 | 1883 | all |
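The frequency and docfreq columns above measure different things: frequency is a term's total number of occurrences, while docfreq counts how many documents contain the term at least once. Given a document-term matrix, both are simple column summaries, sketched here in base R on a toy matrix (hypothetical counts):

```r
# Toy 3-document x 3-term matrix (hypothetical counts, for illustration only)
m <- matrix(c(3, 0, 1,
              1, 2, 0,
              0, 0, 4),
            nrow = 3, byrow = TRUE,
            dimnames = list(paste0("text", 1:3), c("take", "day", "pain")))

frequency <- colSums(m)      # total occurrences of each term
docfreq   <- colSums(m > 0)  # number of documents containing each term
frequency  # take = 4, day = 2, pain = 5
docfreq    # take = 2, day = 1, pain = 2
```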
Plotting this, we can see the following:
txt_freq %>%
ggplot2::ggplot(ggplot2::aes(x = reorder(feature, frequency), y = frequency)) +
ggplot2::geom_point() +
ggplot2::coord_flip() +
ggplot2::labs(x = NULL, y = "Frequency")
I think this is where we will finish this project. Based on our corpus statistics, out of the 1,000 reviews we selected per year, take is the most frequent word. That makes sense: our corpus consists of drug reviews, and taking a drug is the central thing patients do with their medication. Even for drugs administered by injection, such as vaccines, the point largely stands, since these are reviews from patients describing drugs they administered themselves.
All in all, with more computing power I could run this analysis on the full data set. Perhaps in the future the computation, as well as any storage I may need, could be done in the cloud.
| Package | Version |
|---|---|
| dplyr | 1.0.10 |
| ggplot2 | 3.3.6 |
| gtsummary | 1.6.2 |
| here | 1.0.1 |
| knitr | 1.40 |
| lexicon | 1.2.1 |
| lubridate | 1.8.0 |
| magrittr | 2.0.3 |
| quanteda | 3.2.3 |
| quanteda.textplots | 0.94.2 |
| quanteda.textstats | 0.96 |
| readr | 2.1.3 |
| renv | 0.13.1 |
| rmarkdown | 2.17 |
| scales | 1.2.1 |
| stringr | 1.4.0 |
| textutils | 0.2-1 |
| tibble | 3.1.8 |
| tidyr | 1.2.1 |
Benoit, Kenneth, Kohei Watanabe, Haiyan Wang, Paul Nulty, Adam Obeng, Stefan Müller, and Akitaka Matsuo. 2018. “Quanteda: An R Package for the Quantitative Analysis of Textual Data.” Journal of Open Source Software 3 (30): 774. https://doi.org/10.21105/joss.00774.
Dua, Dheeru, and Casey Graff. 2017. “UCI Machine Learning Repository.” University of California, Irvine, School of Information; Computer Sciences. http://archive.ics.uci.edu/ml.
Gräßer, Felix, Surya Kallumadi, Hagen Malberg, and Sebastian Zaunseder. 2018. “Aspect-Based Sentiment Analysis of Drug Reviews Applying Cross-Domain and Cross-Data Learning.” In Proceedings of the 2018 International Conference on Digital Health, 121–25. DH ’18. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/3194658.3194677.