A Wearable Smart Device EDA for Bellabeat Inc.


Utilizing open data from a non-Bellabeat wearable smart device (FitBit) and survey data from smart device users

R Programming for Analysis, Markdown for Documentation, and HTML for Sharing

By Jose F Oviedo - February 2023

A little about Bellabeat…

Bellabeat is a company focused on women's health: it manufactures health-focused wearable smart devices (Ivy, Leaf Urban, and Leaf Chakra) and develops mobile software (the Bellabeat Wellness Coach app). Bellabeat offers a subscription-based membership program giving users access to personalized guidance on a healthy lifestyle through its mobile app. Bellabeat has positioned itself as a tech-driven wellness company for women that puts design and fashion alongside functionality.

The goals of this EDA on wearable smart devices:

  • Better understand smart device user behaviors and preferences
  • Log which variables are being currently tracked by non-Bellabeat smart devices
  • Use transformed data to guide new product features and refine existing product features
  • Choose a Bellabeat product to receive increased marketing spend for Q2 2023

Key stakeholders for this project include:

  • Urška Sršen, Chief Creative Officer
  • Sando Mur, Executive Officer
  • Bellabeat Marketing Analytics Team

Overview of data sets used in this EDA:

Kaggle data set: FitBit Fitness Tracker Data (CC0: Public Domain, file types: .csv)

* dailyActivity_merged.csv(111.29 kB)
* hourlyCalories_merged.csv
* hourlySteps_merged.csv
* sleepDay_merged.csv(18.1 kB)
* weightLogInfo_merged.csv(6.73 kB)

Non-Bellabeat wearable smart device data collected from 30 consenting FitBit users includes various outputs for daily physical activity, hourly calories, hourly steps, sleep monitoring, and weight logs. Data have been collected using Amazon Mechanical Turk and produced by Möbius as an open data set on Kaggle. [Collection Begin: 04-12-2016, End: 05-12-2016]

  • Reliability : LOW - the data set has potential sample bias issues (small sample size)
  • Originality : LOW - aggregated third-party quantitative data
  • Comprehensive : MEDIUM - the data set covers a one-month range of date/time variables
  • Current : LOW - the data is 6+ years old and dated relative to current popular technology
  • Cited : HIGH - data collection and source are documented

DataverseNL data set: User and Device Agency for understanding human and computer interaction (CC0: Public Domain, file type: xlsx)

* UserRelationshipWithTheirSmartDevice_Dataset.xlsx(181.7 kB)
* Q2.xlsx(14.6 kB)
* Q11.xlsx(25.9 kB)

Survey data collected directly from 811 consenting participants. The data set is mostly qualitative and focuses on how users describe the smart device models and brands they use regularly, and how they would describe their relationship with those smart devices. Data have been collected using Qualtrics, produced by Tilburg University. [Begin date: 1-05-2020, End date: 1-07-2020]

  • Reliability : MEDIUM - the data set has a good sample size
  • Originality : HIGH - directly collected qualitative data using a survey
  • Comprehensive : MEDIUM - limited scope of survey questions for this analysis
  • Current : LOW - the data is 3+ years old and dated relative to current popular technology
  • Cited : HIGH - data collection and source are well documented

Setup in R: Kaggle data set

R programming, RStudio

Set up the code environment by loading these R packages that will be used for data importing, exploring, organizing, cleaning, analyzing, visualizing, code reporting/reproducing (R Markdown), and sharing/presenting (file format type, aesthetics).

## call R packages

library(tidyverse)
library(scales)
library(vctrs)
library(knitr)
library(rmarkdown)
library(prettydoc)

Data Preparation/Processing: Kaggle data set-Part 1

R programming, tidyverse package and other R packages used to import and explore the data set.

Kaggle data set: FitBit Fitness Tracker Data

Data sets are imported using tidyverse, and data frame objects are created and named.

## set working directory to where data set files are located

setwd("Fitabase Data 4.12.16-5.12.16")

## use readr package from tidyverse to import data and assign data frame

da <- read_csv("dailyActivity_merged.csv")
sleep <- read_csv("sleepDay_merged.csv")
hcal <- read_csv("hourlyCalories_merged.csv")
hsteps <- read_csv("hourlySteps_merged.csv")
weight <- read_csv("weightLogInfo_merged.csv")

It is time to explore the data frames using head, glimpse, and/or tibble functions to get a feel for the structure, data types, variables, categorical variables, and number of observations.

## head function to view data structure and data types

head(da)
head(sleep)
head(hcal)
head(hsteps)
head(weight)
## glimpse function to view data structure and data types

glimpse(da)
glimpse(sleep)
glimpse(hcal)
glimpse(hsteps)
glimpse(weight)
## tibble function to view data structure and data types

tibble(da)
tibble(sleep)
tibble(hcal)
tibble(hsteps)
tibble(weight)

After exploring the data set, the data validation goals are to increase the ease of use, reliability, consistency, and completeness of the data set.

First, collapse duplicate values of the ‘Id’ variable and count the unique values in each data frame, using ‘Id’ as the criterion (because it is a variable shared across the data set). This will give me an idea of the initial completeness (or lack thereof) of the data set as a whole.

I will be considering sample size bias and other data merging issues to see which data frames I will be using or omitting going forward.

## nested length(unique()) for total unique 'Id' values for each data frame
## a lot of missing/NA values is a red flag

length(unique(da$Id))
## [1] 33
length(unique(sleep$Id))
## [1] 24
length(unique(hcal$Id))
## [1] 33
length(unique(hsteps$Id))
## [1] 33
length(unique(weight$Id))
## [1] 8

Generally, with a relatively large sample size I would omit missing/NA values from the analysis. Because the sample size here is relatively small (33 unique Ids), I will instead either replace missing/NA values with the mean of the respective variable's observations or omit the data frame entirely from the analysis. The deciding factor for keeping or omitting a data frame is avoiding sample size bias.

Looking at the sleep data frame, only ~27% of the unique ‘Id’ values are missing (24 of 33 are present), which I do not consider significant enough to omit the data frame entirely from the analysis. I will assume the gaps are most likely participant or smart device error, and replace missing/NA values with the mean of each respective variable.

However, looking at the weight data frame, ~76% of the unique ‘Id’ values are missing (only 8 of 33 are present). Because this data frame has so many missing Ids, and filling in missing values is unreliable at such a small sample size, it is best to omit it from the analysis going forward and stick with the data frames that are more valid and complete.
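The keep-or-omit decision above can be expressed as a small helper. This is an illustrative sketch of my own (the function name is not part of the original script); it reports the share of baseline Ids absent from any data frame:

```r
## illustrative helper: share of baseline Ids absent from a data frame
## ('da', with its 33 unique Ids, serves as the baseline in this analysis)
share_missing_ids <- function(df, baseline) {
  base_ids <- unique(baseline$Id)
  length(setdiff(base_ids, unique(df$Id))) / length(base_ids)
}

## e.g. share_missing_ids(sleep, da) is ~0.27 and share_missing_ids(weight, da)
## is ~0.76, matching the counts above
```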

## remove the weight df to reduce clutter in the objects pane

remove(weight)

Data Preparation/Processing: Kaggle data set-Part 2

R programming, tidyverse package and other R packages used to organize, clean, wrangle, validate and transform the data set.

Separate and rename the sleep data frame variable ‘SleepDay’ so it matches the da data frame variable ‘ActivityDate’, which will simplify merging the data frames. Also, remove unused variables from the data frames before merging.

sleep <- sleep |>
  separate(SleepDay, c("ActivityDate", "SleepdaystartHour"), " ") |>
  select(1:2, 5:6)
da <- da |>
  select(1:3, 11:15)

Separate and rename the hcal and hsteps data frames' variable ‘ActivityHour’ into ‘ActivityDate’, ‘ActivityHour’, and ‘AMPM’. Then unite ‘ActivityHour’ and ‘AMPM’ as ‘ActivityHour’. Finally, select the variables to be merged.

hcal <- hcal |>
  separate(ActivityHour, c("ActivityDate", "ActivityHour", "AMPM"), " ") |>
  unite("ActivityHour", ActivityHour:AMPM, sep = " ", remove = FALSE) |>
  select(1:3, 5)
hsteps <- hsteps |>
  separate(ActivityHour, c("ActivityDate", "ActivityHour", "AMPM"), " ") |>
  unite("ActivityHour", ActivityHour:AMPM, sep = " ", remove = FALSE) |>
  select(1:3, 5)

After reviewing the hcal, hsteps, da, and sleep data frames individually, I will merge da with sleep, and hcal with hsteps, using a full outer join (merge with all = TRUE) so no observations are dropped.

## da and sleep by 2 shared variables

da_sleep_merged <- 
  merge(x = da, y = sleep, by = c("Id", "ActivityDate"),
        all = TRUE)

The hourly data frame hcal_hsteps_merged will have a mutate call added so that the time variable ‘ActivityHour’ is converted to 24-hour time and can be correctly ordered and visualized later in charts, plots, etc.

## hcal and hsteps by 3 shared variables
## mutate converts 12-hour AM/PM clock time to 24-hour time so later visualizations sort correctly
hcal_hsteps_merged <-
  merge(x = hcal, y = hsteps, by = c("Id", "ActivityDate", "ActivityHour"),
        all = TRUE) |>
    mutate(ActivityHour = format(strptime(ActivityHour, format = "%I:%M:%S %p"), "%H:%M:%S"))

Use summary function to inspect the reliability, consistency, and completeness of the merged data frames and check for missing/NA values.

## da_sleep_merged has 2 variables with 530 missing/NA values after merge

summary(da_sleep_merged)

The summary function results for da_sleep_merged show 530 missing/NA values out of 943 total observations for the variables ‘TotalMinutesAsleep’ and ‘TotalTimeInBed’ after the merge of da and sleep. Both variables are 56% missing/NA values, and I am consciously choosing to keep them in the analysis. Variables related to user sleeping patterns are crucial to that part of the analysis, so I will complete them from the values that are present, while admitting some degree of sample size bias.

## hcal_hsteps_merged has no missing/NA values after merge

summary(hcal_hsteps_merged)

The summary function hcal_hsteps_merged results show there are no missing/NA values. The data frame is ready for the next phase.

The missing/NA values in the da_sleep_merged data frame will be completed using the mean of each respective variable's observations, and the result will be named new_da_sleep_merged.

## new df will contain no missing/NA values

new_da_sleep_merged <- da_sleep_merged

new_da_sleep_merged$TotalMinutesAsleep <- ifelse(is.na(new_da_sleep_merged$TotalMinutesAsleep),
                                         mean(new_da_sleep_merged$TotalMinutesAsleep,
                                              na.rm = TRUE),
                                         new_da_sleep_merged$TotalMinutesAsleep)

new_da_sleep_merged$TotalTimeInBed <- ifelse(is.na(new_da_sleep_merged$TotalTimeInBed),
                                         mean(new_da_sleep_merged$TotalTimeInBed,
                                              na.rm = TRUE),
                                         new_da_sleep_merged$TotalTimeInBed)

The new_da_sleep_merged summary function results show it is complete with no missing/NA values.

## there are no missing/NA values

summary(new_da_sleep_merged)

Now, I will manipulate the new_da_sleep_merged data frame to create new variables, ‘TimeUntilSleep’ and ‘StepsPerCalorie’, that will be defined as (‘TotalTimeInBed’ - ‘TotalMinutesAsleep’) and (‘TotalSteps’ / ‘Calories’), respectively. The transformed data frame will be called v2_new_da_sleep and I will use this data frame for the analysis going forward.

v2_new_da_sleep <- new_da_sleep_merged |>
  mutate(TimeUntilSleep = (TotalTimeInBed - TotalMinutesAsleep))

v2_new_da_sleep <- v2_new_da_sleep |>
  mutate(StepsPerCalorie = (TotalSteps / Calories))

Inspect the new variables in the v2_new_da_sleep data frame using the summary function; the results show the new variable ‘StepsPerCalorie’ has 4 missing/NA values.

## there are 4 missing/NA values in v2_new_da_sleep$StepsPerCalorie

summary(v2_new_da_sleep)
## impute the missing/NA values with the variable mean, consistent with the earlier approach

v2_new_da_sleep$StepsPerCalorie <- ifelse(is.na(v2_new_da_sleep$StepsPerCalorie),
                                         mean(v2_new_da_sleep$StepsPerCalorie,
                                              na.rm = TRUE),
                                         v2_new_da_sleep$StepsPerCalorie)

The v2_new_da_sleep summary function results show variable ‘StepsPerCalorie’ does not have missing/NA values.

## quick summary on variable that had missing/NA values to confirm completeness

summary(v2_new_da_sleep$StepsPerCalorie)

The last thing I want to do is create a new variable called ‘IdGroupby’ that groups observations by unique user Id, which will come in handy later when performing analysis. I will transform both the v2_new_da_sleep and hcal_hsteps_merged data frames so each has this new variable.

## dplyr package to mutate and groupby 'Id'
## to show each individual user over data collection period

v2_new_da_sleep_groupby_Id <- v2_new_da_sleep |>
  group_by(Id) |>
  mutate(IdGroupby = cur_group_id())
## dplyr to mutate and groupby 'Id'
## to show each individual user over data collection period

hcal_hsteps_merged_groupby_Id <- hcal_hsteps_merged |>
  group_by(Id) |>
  mutate(IdGroupby = cur_group_id())

The v2_new_da_sleep and hcal_hsteps_merged data frames are validated, complete, consistent and ready for the next phase and will be the versions used in the analysis going forward.

Data Preparation/Processing: DataverseNL data set

MS Excel used to validate data and organize variables into individual files that will be uploaded into the tag generator <wordclouds.com>

DataverseNL data set: User and Device Agency for understanding human and computer interaction

  • The DataverseNL data set is mostly qualitative and relatively wide in scope. I will explore it and select any variables that are relevant and usable in this analysis using MS Excel (the format in which the data set files were published/distributed); next time I would import the files directly in R (e.g. readxl::read_xlsx for .xlsx files).

  • The selected variables from this data set will be used to get a high level idea of what users are saying about their smart device brand preferences (some are wearable smart device manufacturers) and user product feature preferences (some are/could be implemented in wearable smart devices).

The DataverseNL data set is loaded into MS Excel. Use the Remove Duplicates and TRIM functions to begin data validation and organization. Select columns Q2 and Q11 only, separate them into their own workbooks, and save the workbooks as Q2.xlsx and Q11.xlsx so each can be uploaded individually into the tag generator <wordclouds.com> to produce two different data visualizations.

Variables selected:

  • “Q2: Please specify the product that you have in mind, including the brand name:”

  • “Q11: We know that we are really testing your patience at this point. However, if you feel that something was not covered by this survey and you want to further describe your relationship with your device, you are welcome to write a couple of lines (or more) here below.”
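For a future pass that stays in R, the Excel cleanup (TRIM plus Remove Duplicates) translates to a few lines of tidyverse code. This is a hedged sketch, assuming the workbooks have already been imported and their text columns are character vectors; the function name is my own:

```r
library(dplyr)
library(stringr)

## hedged R equivalent of the Excel steps: trim whitespace on text columns,
## then drop exact duplicate rows (mirrors TRIM + Remove Duplicates)
clean_survey <- function(df) {
  df |>
    mutate(across(where(is.character), str_trim)) |>
    distinct()
}
```

The same function could then be applied to the Q2 and Q11 columns before exporting them for the tag generator.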

Data Analysis: Kaggle data set

R programming, tidyverse package and other R packages used for data analysis.

Kaggle data set: FitBit Fitness Tracker Data

The validated v2_new_da_sleep and hcal_hsteps_merged data frames can now be analyzed and described thoroughly using summary function, histograms, and other basic plots/charts.

The goal in this phase is to quickly comb for insights, or the ability to infer something useful, and to home in on which aspects of the data frames could be visualized in a more polished way later on.

summary(v2_new_da_sleep)
##        Id            ActivityDate         TotalSteps    VeryActiveMinutes
##  Min.   :1.504e+09   Length:943         Min.   :    0   Min.   :  0.00   
##  1st Qu.:2.320e+09   Class :character   1st Qu.: 3795   1st Qu.:  0.00   
##  Median :4.445e+09   Mode  :character   Median : 7439   Median :  4.00   
##  Mean   :4.858e+09                      Mean   : 7652   Mean   : 21.24   
##  3rd Qu.:6.962e+09                      3rd Qu.:10734   3rd Qu.: 32.00   
##  Max.   :8.878e+09                      Max.   :36019   Max.   :210.00   
##  FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes    Calories   
##  Min.   :  0.00      Min.   :  0          Min.   :   0.0   Min.   :   0  
##  1st Qu.:  0.00      1st Qu.:127          1st Qu.: 729.0   1st Qu.:1830  
##  Median :  7.00      Median :199          Median :1057.0   Median :2140  
##  Mean   : 13.63      Mean   :193          Mean   : 990.4   Mean   :2308  
##  3rd Qu.: 19.00      3rd Qu.:264          3rd Qu.:1229.0   3rd Qu.:2796  
##  Max.   :143.00      Max.   :518          Max.   :1440.0   Max.   :4900  
##  TotalMinutesAsleep TotalTimeInBed  TimeUntilSleep   StepsPerCalorie 
##  Min.   : 58.0      Min.   : 61.0   Min.   :  0.00   Min.   : 0.000  
##  1st Qu.:419.5      1st Qu.:458.6   1st Qu.: 28.50   1st Qu.: 1.991  
##  Median :419.5      Median :458.6   Median : 39.17   Median : 3.194  
##  Mean   :419.5      Mean   :458.6   Mean   : 39.17   Mean   : 3.233  
##  3rd Qu.:419.5      3rd Qu.:458.6   3rd Qu.: 39.17   3rd Qu.: 4.412  
##  Max.   :796.0      Max.   :961.0   Max.   :371.00   Max.   :14.346

First, let's examine the frequency of observations for a given variable using histograms to see what the distribution looks like at a high level. Then we can compare it with the mean and see whether there are outliers that warrant a deeper look. We will use the hist function from the base R ‘graphics’ package to do this.

## hist() daily TotalSteps

histTotalSteps <- hist(v2_new_da_sleep$TotalSteps,
     col = "lightblue",
     main = "Total Daily Observations Frequency from 'TotalSteps' Variable",
     xlab = "TOTAL DAILY STEPS PER USER")

Looking at the histogram of the daily ‘TotalSteps’ variable, we can see a near-even split among the top 3 most frequent observation ranges. Assuming an average user has a mix of these, an average user is very likely to land in one of three daily-total-step ranges on any given day: 5,000 or fewer total daily steps; 5,000 to 10,000; or 10,000 to 15,000, with the 5,000 to 10,000 range slightly more likely than the other two. Users with more than 15,000 total daily steps are very unlikely to be observed.

Comparing the histogram's most likely observation range with the mean from the summary function, the two reinforce each other: the mean of 7,652 total daily steps from the ‘TotalSteps’ variable falls within the most likely observation range.
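The three ranges read off the histogram can also be tallied directly with cut() rather than eyeballed. A quick sketch (the break points mirror the histogram bins; the function name is mine):

```r
## share of observations in each daily-step range from the histogram
step_range_share <- function(total_steps) {
  bins <- cut(total_steps,
              breaks = c(0, 5000, 10000, 15000, Inf),
              include.lowest = TRUE,
              labels = c("0-5k", "5k-10k", "10k-15k", "15k+"))
  prop.table(table(bins))
}

## e.g. step_range_share(v2_new_da_sleep$TotalSteps)
```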

## hist() daily Calories

histCalories <- hist(v2_new_da_sleep$Calories,
     col = "lightgreen",
     main = "Total Daily Observations Frequency from 'Calories' Variable",
     xlab = "TOTAL DAILY CALORIES PER USER")

Now, let's analyze the histogram of the total daily ‘Calories’ variable. This distribution is more of a bell curve, and I will make the same assumption as with the ‘TotalSteps’ histogram regarding possible observation outcomes for the average user. Using the same logic, the average user will likely spend 1,500 to 2,000 total daily calories, 2,000 to 2,500, or 2,500 to 3,000 on any given day, with the 1,500 to 2,000 range slightly more likely than the other two. Users with fewer than 1,500 or more than 3,000 total daily calories are significantly less likely to be observed.

When using the histogram to compare the most frequent observations with the mean results from the summary function, there is a small discrepancy. The mean result of 2,308 daily total calories spent is further to the right than the most frequent observation range.

The v2_new_da_sleep data frame currently being used for these plots contains two new variables, ‘TimeUntilSleep’ and ‘StepsPerCalorie’.

## hist() daily TimeUntilSleep

histTimeUntilSleep <- hist(v2_new_da_sleep$TimeUntilSleep,
     col = "violet",
     main = "Total Daily Observations Frequency from 'TimeUntilSleep' Variable",
     xlab = "MINUTES UNTIL ASLEEP")

Because the observations are so clustered, I will use the log function to “zoom in” on the ‘TimeUntilSleep’ variable and see whether there is anything to glean from the most frequent observations.

##  hist(log()) daily TimeUntilSleep

## drop zero-minute observations before taking the log (log(0) is -Inf and breaks hist)
histLogTimeUntilSleep <- hist(log(v2_new_da_sleep$TimeUntilSleep[v2_new_da_sleep$TimeUntilSleep > 0]),
     col = "purple",
     main = "Most Frequent Daily Observations from 'TimeUntilSleep' Variable",
     xlab = "LOG(MINUTES UNTIL ASLEEP)")

Looking at the histogram of the ‘TimeUntilSleep’ variable we can see that most users take between 20 to 40 minutes to fall asleep with a significant amount of users taking 35 to 40 minutes as that is the most frequent observation range by far.

Also the mean results of 39.17 minutes until asleep from the ‘TimeUntilSleep’ variable does fall within the most frequent observation range.

## hist() daily StepsPerCalorie

histStepsPerCalorie <- hist(v2_new_da_sleep$StepsPerCalorie,
     col = "lavender",
     main = "Total Daily Observations Frequency from 'StepsPerCalorie' Variable",
     xlab = "TOTAL DAILY STEPS PER CALORIE")

Looking at the histogram of the daily group ‘StepsPerCalorie’ variable, the majority of observations are between 0 to 6 steps per calorie with a relatively equal frequency distribution. When compared with the mean result of 3.23 StepsPerCalorie we can see it falls within the range of the most frequent observations.

0 to 1 steps per calorie could indicate the user is engaging in physical activity that doesn't track steps, like weight lifting, or sitting at a desk for work. 2 to 3 steps per calorie could indicate the user is likely walking a good portion of the day. 4 to 6 or more steps per calorie could indicate the user went jogging or running that day.
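That reading of the ranges can be made explicit as a labeled classification. A sketch using dplyr::case_when, where the labels and thresholds are my interpretation of the ranges above, not tracked variables:

```r
library(dplyr)

## hypothetical activity labels derived from the StepsPerCalorie ranges discussed above
label_activity <- function(steps_per_calorie) {
  case_when(
    steps_per_calorie <= 1 ~ "low-step activity (lifting, desk work)",
    steps_per_calorie <= 3 ~ "walking",
    TRUE                   ~ "jogging/running"
  )
}

## e.g. v2_new_da_sleep |>
##   mutate(ActivityGuess = label_activity(StepsPerCalorie)) |>
##   count(ActivityGuess)
```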

I will expand on the ‘StepsPerCalorie’ variable and test my hypothesis below by plotting the total daily ‘Calories’ variable against the ‘StepsPerCalorie’ variable to see if there is some correlation. Using the lm function to add a fitted line, we can see the slight trend.

“Does increased total daily calories spent mean a higher total daily steps per calorie value?”

## daily StepsPerCalorie vs Calories

plot(v2_new_da_sleep$Calories ~ v2_new_da_sleep$StepsPerCalorie,
     col = v2_new_da_sleep$StepsPerCalorie,
     main = "Total Daily 'Calories' Spent Relative to 'StepsPerCalorie'",
     ylab = "TOTAL DAILY CALORIES",
     xlab = "DAILY STEPS PER CALORIE")
abline(lm(v2_new_da_sleep$Calories ~ v2_new_da_sleep$StepsPerCalorie), col = "blue")

Looking at the plot above, even when total daily calories spent is high there are still plenty of instances where ‘StepsPerCalorie’ is between 1 and 2 steps per calorie. This supports the assumption that some users reach those higher calorie totals through activities that don't require many steps, such as weight lifting or sitting at a desk working.

Overall, though, I can reasonably assume that increased total daily calories spent means higher daily steps-per-calorie values.

To test the next hypothesis, I will plot the daily ‘TimeUntilSleep’ variable relative to the total daily ‘Calories’ variable to see whether there is a correlation. Using the lm function to add a fitted line, we can see the slight trend.

“Does increased total daily calories spent mean a lower time until asleep value?”

## daily Calories vs TimeUntilSleep

plot(v2_new_da_sleep$Calories ~ v2_new_da_sleep$TimeUntilSleep,
     col = v2_new_da_sleep$Calories,
     main = "Total Daily 'Calories' Spent Relative to 'TimeUntilSleep'",
     xlab = "MINUTES UNTIL ASLEEP",
     ylab = "TOTAL DAILY CALORIES")
abline(lm(v2_new_da_sleep$Calories ~ v2_new_da_sleep$TimeUntilSleep), col = "blue")

Looking at the plot above we can see that most users fall asleep in less than 40 minutes regardless of total daily calories spent.

However, the data also shows far fewer observations with high total daily calories spent that take longer than 40 minutes to fall asleep, compared with observations in the 1,500 to 2,000 total daily calories spent range.

Now, let's “zoom in” using the log function on the ‘TimeUntilSleep’ variable and see whether there is anything to glean from the most frequent observations.

## daily Calories vs log(TimeUntilSleep)

## keep only positive TimeUntilSleep so log() stays finite
pos <- v2_new_da_sleep$TimeUntilSleep > 0
plot(v2_new_da_sleep$Calories[pos] ~ log(v2_new_da_sleep$TimeUntilSleep[pos]),
     col = v2_new_da_sleep$Calories[pos],
     main = "Most Frequent Total Daily 'Calories' Spent Relative to 'TimeUntilSleep'",
     xlab = "LOG(MINUTES UNTIL ASLEEP)",
     ylab = "TOTAL DAILY CALORIES")
abline(lm(v2_new_da_sleep$Calories[pos] ~ log(v2_new_da_sleep$TimeUntilSleep[pos])), col = "blue")

Looking at the “zoomed in” plot above, there is not anything new to learn from the data. It still shows that higher total daily calories spent means a better chance of falling asleep in less than 40 minutes, compared to observations in the 1,500 to 2,000 total daily calories spent range.

Based on the data, I can assume that increased total daily calories spent means lower time until asleep values.

Now, let's examine the hourly hcal_hsteps_merged data frame using the summary function, histograms, and other basic plots/charts, and see what information about the data could be useful in the analysis.

## analysis and description of hcal_hsteps_merged data frame

summary(hcal_hsteps_merged)
##        Id            ActivityDate       ActivityHour          Calories     
##  Min.   :1.504e+09   Length:22099       Length:22099       Min.   : 42.00  
##  1st Qu.:2.320e+09   Class :character   Class :character   1st Qu.: 63.00  
##  Median :4.445e+09   Mode  :character   Mode  :character   Median : 83.00  
##  Mean   :4.848e+09                                         Mean   : 97.39  
##  3rd Qu.:6.962e+09                                         3rd Qu.:108.00  
##  Max.   :8.878e+09                                         Max.   :948.00  
##    StepTotal      
##  Min.   :    0.0  
##  1st Qu.:    0.0  
##  Median :   40.0  
##  Mean   :  320.2  
##  3rd Qu.:  357.0  
##  Max.   :10554.0

The summary function results for the hourly hcal_hsteps_merged data frame show a mean of 97.39 calories spent per hour and 320.2 steps per hour. To compare the mean results of hourly ‘Calories’ and ‘StepTotal’ against their distributions, I will plot each of them using the hist function. If something out of the ordinary sticks out, it can be investigated further.

## hist() hourly Calories

histHourlyCalories <- hist(hcal_hsteps_merged$Calories,
     col = "green",
     main = "Total Hourly Observations Frequency from 'Calories' Variable",
     xlab = "TOTAL CALORIES BURNED PER HOUR")

Looking at the histogram of the hourly ‘Calories’ variable, most of the observations fall in the range of 50 to 100 calories spent per hour, with a significant drop-off in frequency for observations in the 100 to 200 range. This is not surprising, since sustaining the level of activity required to spend that many calories per hour for long stretches is very difficult. Users spending more than 100 calories per hour are likely engaging in some activity like working out.

Now, let's “zoom in” using the log function on the hourly ‘Calories’ variable and see whether there is anything to glean from the most frequent observations.

## hist(log()) hourly Calories

histHourlylogCalories <- hist(log(hcal_hsteps_merged$Calories),
     col = "darkgreen",
     main = "Most Frequent Hourly Observations from 'Calories' Variable",
     xlab = "LOG(TOTAL CALORIES BURNED PER HOUR)")

Looking at the “zoomed in” hourly ‘Calories’ plot above, the mean value of 97.39 from the ‘Calories’ variable is greater than the most frequent observations. This view gives a more precise picture of resting or workday hourly calories spent, but it does not show workout or active hourly calories spent very well.

## hist() hourly StepTotal

histHourlyStepTotal <- hist(hcal_hsteps_merged$StepTotal,
     col = "blue",
     main = "Total Hourly Observations Frequency from 'StepTotal' Variable",
     xlab = "TOTAL STEPS PER HOUR")

The histogram of the total hourly ‘StepTotal’ variable shows that most users take fewer than 1,000 steps in a given hour, which suggests there are very few long-distance runners.

Because the observations are so clustered, I will use the log function to “zoom in” on that cluster and see what the data can tell us.

## hist(log()) hourly StepTotal 

## drop zero-step hours before taking the log (log(0) is -Inf and breaks hist)
histHourlylogStepTotal <- hist(log(hcal_hsteps_merged$StepTotal[hcal_hsteps_merged$StepTotal > 0]),
     col = "darkblue",
     main = "Most Frequent Hourly Observations from 'StepTotal' Variable",
     xlab = "LOG(TOTAL STEPS PER HOUR)")

Looking at the hourly ‘StepTotal’ histogram, we get a closer look at the majority of observations: the mean value of 320.2 from the ‘StepTotal’ variable is greater than the most frequent observations, and most users take about 40 to 75 steps in any given hour when not running or mall walking for a workout.

I think this view of the data shows a more precise understanding of the resting or workday hourly step total but it does not show the workout or active hourly step total very well.

Now, I want to test the hypothesis that the majority of users engage in some physical activity that increases hourly calories spent, and reject the null hypothesis that only a few users with very high hourly calories spent are skewing the data.

“Did the majority of users track their workout (at least more than once) during the data collection period?”

Plot the total hourly ‘Calories’ variable relative to the unique ‘Id’ variable to compare the ranges and total observations of hourly calories burned for each participant during the data collection period. This plot is a visual representation of how active each participant was while this data was being recorded, assuming they tracked their workouts.

## hourly Calories vs IdGroupby TOTAL NUMBER OF HOURLY OBSERVATIONS

plot(hcal_hsteps_merged_groupby_Id$Calories ~ hcal_hsteps_merged_groupby_Id$IdGroupby,
     col = hcal_hsteps_merged_groupby_Id$IdGroupby,
     main = "Total Observations of Hourly 'Calories' Spent Relative to User 'Id'",
     xlab = "UNIQUE USER ID REPRESENTED BY COLUMNS",
     ylab = "HOURLY 'CALORIES' SPENT OBSERVATIONS FREQUENCY")
abline(lm(hcal_hsteps_merged_groupby_Id$Calories ~ hcal_hsteps_merged_groupby_Id$IdGroupby), col = "red")

We can see that most of the participants had instances where they burned over 200 calories in a given hour, which I assume reflects some kind of physical activity like working out or running/walking. So we can assume that wearable smart devices are used by people who tend to work out at least some of the time, and I can now reject the null hypothesis stated earlier.
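The same conclusion can be tabulated rather than read off the plot. A hedged sketch that counts, per user, hours above the 200-calorie mark (the function and column names below are mine):

```r
library(dplyr)

## per-user count of hours above a calorie threshold, then how many users
## cleared it at least once during the collection period
users_over_threshold <- function(df, threshold = 200) {
  df |>
    group_by(Id) |>
    summarise(HoursOver = sum(Calories > threshold), .groups = "drop") |>
    summarise(UsersWithWorkouts = sum(HoursOver > 0), TotalUsers = n())
}

## e.g. users_over_threshold(hcal_hsteps_merged)
```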

Data Analysis: DataverseNL data set

MS Excel files uploaded to Tag generator <wordclouds.com> and used for ad-hoc analysis using wordclouds

DataverseNL data set: User and Device Agency for understanding human and computer interaction

The scope of this data set is broader than we want for the purposes of this analysis, so I will use the tag generator's filters and parameters to emphasize wearable smart devices rather than the general category of smart devices where possible.

Although this data set may not be 100% directly related to wearable smart devices, we can still get a glimpse into how users interact with and use their smart devices in general and which brands are top of mind. This could be important for understanding what users expect of their smart devices (including wearable smart devices).

Use the wordcloud to compare current Bellabeat products and features with wearable and non-wearable smart devices that are highly ranked in the word cloud.
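Before uploading to <wordclouds.com>, the survey text can be previewed in R to see which terms would dominate the cloud. A minimal sketch with hypothetical survey responses standing in for the real Q2.xlsx answers:

```r
## Preview which brand/feature terms would dominate a word cloud
## (hypothetical responses; real data would be read from Q2.xlsx).
responses <- c("I use my Apple Watch daily", "Fitbit tracks my sleep",
               "Apple phone and watch", "my Fitbit counts steps")

# Tokenize, lowercase, and drop a few filler words
words <- tolower(unlist(strsplit(responses, "\\s+")))
words <- words[!words %in% c("i", "my", "and", "use", "the")]

# Frequency table: the top terms are what the cloud will emphasize
sort(table(words), decreasing = TRUE)
```

The same low-frequency filtering done on the website can be previewed here by dropping terms with a count of 1.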

Data Visualizations: Kaggle data set

R programming, tidyverse, ggplot2 and other R packages used for data visualizations.

Kaggle data set: FitBit Fitness Tracker Data

## ggplot() daily Calories vs TimeUntilSleep

ggTimeUntilSleepvsCalories <- 
  ggplot(v2_new_da_sleep, mapping = aes(log(TimeUntilSleep), Calories)) +
  geom_point(size = 1, shape = 3, color = v2_new_da_sleep_groupby_Id$IdGroupby) +
  labs(title = "Daily User Activity: Calories Spent Relative to Time In Bed Before Sleep", 
       subtitle = "Users with higher daily total calories spent likely fall asleep quicker",
       x = "MINUTES UNTIL ASLEEP (x10)",
       y = "TOTAL DAILY CALORIES")

ggTimeUntilSleepvsCalories

This may seem intuitive, but it can be tracked and shown visually to users. Sleep metrics are high value to most users and could therefore be implemented as a future product development or subscription-based premium feature.

After examining the summary results, I am curious about the relationship between daily total ‘Calories’ spent and daily average ‘StepsPerCalorie’. We will use a plot in the data viz phase to “eye test” the relationship between daily total ‘Calories’ spent and increased or decreased ‘StepsPerCalorie’.

## ggplot() daily Calories vs StepsPerCalorie

ggStepsPerCalorievsCalories <- v2_new_da_sleep |>
  ggplot(aes(StepsPerCalorie, Calories)) +
  geom_point(size = 1, shape = 3, color = v2_new_da_sleep_groupby_Id$IdGroupby) +
  labs(title = "Daily User Activity: Total Calories Spent Relative to Steps Per Calorie",
       subtitle = "Users with higher daily total calories spent likely have higher steps per calorie",
       x = "STEPS PER CALORIE",
       y = "DAILY TOTAL CALORIES")

ggStepsPerCalorievsCalories

Does increased or decreased Daily Total ‘Calories’ spent have an effect on ‘StepsPerCalorie’ variable? We will use a plot to “eye” test the correlation between Daily Total ‘Calories’ spent and increased or decreased ‘StepsPerCalorie’ to see if further investigation is required.

The data viz shows that Daily Total Calories Spent is positively correlated with Daily Steps Per Calorie.

This data could be used to identify both the group's and an individual user's Steps Per Calorie and track it over time to spot trends, and to see whether specific changes made by the user affect the trend. This could all be tracked in the Bellabeat app and implemented as a future product development or subscription-based premium feature.
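A minimal sketch of that trend-tracking idea, using a hypothetical week of daily totals in place of the real v2_new_da_sleep frame:

```r
## Track a user's StepsPerCalorie over time and flag the direction of the
## trend (hypothetical daily totals; illustration only).
daily <- data.frame(
  ActivityDate = as.Date("2016-04-12") + 0:6,
  TotalSteps   = c(6000, 7200, 6800, 8100, 7900, 8800, 9100),
  Calories     = c(1900, 2000, 1950, 2100, 2050, 2150, 2200)
)
daily$StepsPerCalorie <- daily$TotalSteps / daily$Calories

# Simple linear trend; a positive slope means the ratio is rising over time
trend <- lm(StepsPerCalorie ~ ActivityDate, data = daily)
coef(trend)[["ActivityDate"]] > 0
# TRUE for this synthetic week
```

In-app, the sign and size of that slope could drive a "your efficiency is trending up/down" message.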

The following visualization shows the total hourly Calories burned by hour of day over the course of the data collection period.

## ggplot() hourly Calories vs ActivityHour

ggHourlyCaloriesvsActivityHour <- hcal_hsteps_merged |>
  ggplot(aes(ActivityHour, Calories)) +
  geom_point(size = .75, color = "darkorange") +
  labs(title = "Hourly Group Activity: Calories Relative to Hour of Day (24hrs)",
       y = "TOTAL HOURLY CALORIES SPENT",
       x = "HOUR OF DAY (24HRS)") +
  theme(axis.text.x = element_text(angle = 50, size = 7))

ggHourlyCaloriesvsActivityHour

The following visualization shows the total hourly steps taken by hour of day over the course of the data collection period.

## ggplot() hourly StepTotal vs ActivityHour

ggHourlyStepTotalvsActivityHour <- hcal_hsteps_merged |>
  ggplot(aes(ActivityHour, StepTotal)) +
  geom_point(size = .75, color = "darkgreen") +
  labs(title = "Hourly Group Activity: Total Steps Relative to Hour of Day (24hrs)",
       y = "TOTAL HOURLY STEPS",
       x = "HOUR OF DAY (24HRS)") +
  theme(axis.text.x = element_text(angle = 50, size = 7))

ggHourlyStepTotalvsActivityHour
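Because ActivityHour is a raw date-time string, the scatter plots above repeat each clock hour once per day. One way to get a cleaner daily profile is to parse the timestamp and average by hour of day; a sketch with a few synthetic rows standing in for hcal_hsteps_merged:

```r
## Average StepTotal by hour of day (synthetic rows; the real frame has the
## same "m/d/Y h:mm:ss AM/PM" timestamp format as the FitBit export).
hcal_hsteps_merged <- data.frame(
  ActivityHour = c("4/12/2016 8:00:00 AM", "4/13/2016 8:00:00 AM",
                   "4/12/2016 6:00:00 PM", "4/13/2016 6:00:00 PM"),
  StepTotal    = c(400, 600, 1200, 1000)
)

# Parse the 12-hour timestamps, then extract the 24-hour clock hour
parsed <- as.POSIXct(hcal_hsteps_merged$ActivityHour,
                     format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")
hcal_hsteps_merged$HourOfDay <- as.integer(format(parsed, "%H"))

aggregate(StepTotal ~ HourOfDay, data = hcal_hsteps_merged, FUN = mean)
# hour 8 averages 500 steps; hour 18 averages 1100 steps
```

The resulting HourOfDay column could replace ActivityHour on the x-axis of either plot above.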

Data Visualizations: DataverseNL data set

Tag generator <wordclouds.com> used to create visualization

DataverseNL data set: User and Device Agency for understanding human and computer interaction

Q2.xlsx file uploaded to <wordclouds.com> to visualize which smart devices and brands the users taking this survey value most. After uploading to the website, aesthetic changes were made and low-frequency words were filtered out to reduce overplotting and make a cleaner viz.

Word cloud using Q2.xlsx

  • This “target user” word cloud illustrates technology and device brands that are popular with smart device users from the survey. We can see which brands are prevalent in the market but also which are emerging in the space.

  • There should be a separate analysis of the Apple, Samsung, Garmin, and FitBit products and services that directly compete with Bellabeat in the wearable smart device market, since they are at the top of users' minds.

Q11.xlsx file uploaded to <wordclouds.com> to visualize keywords that describe how smart device users view interactions with their smart devices and what aspects of device functionality are important to them.

Word cloud using Q11.xlsx

  • This “wearable smart device” word cloud visually describes the functions smart device users interact with or value.

  • We can see that users value a complex mix of functions and characteristics like simplicity, time, privacy, smart phone connectivity and others.

  • We should compare these top results with Bellabeat's product offerings, note what is similar or different, and ask ourselves, “is this aligning with what smart device users say is important to them?”

Executive Summary, Conclusions and Actionable Next Steps:

R Programming, R Markdown, knitr, prettydoc and other R packages used for code reproducibility (R Markdown) and reporting (file format, aesthetics).

The data supports several takeaways that I think are relevant to the initial questions we set out to answer. We will address each question and what the data is telling us. I will do my best to set my personal views and opinions aside, being as objective and data-driven as possible while keeping the organization's interest at the forefront.

  • Better understand user behaviors and preferences:

    • Although the weight data frame was intentionally omitted from this analysis, the data (or lack thereof) still supports some high-level assumptions that could improve future and current Bellabeat data collection operations. From privacy disclosures to collection practices, there is a lot we can glean from the lack of weight-related observations and why users are apparently reluctant to share them. My hypothesis is that body weight logging is uncomfortable for some, if not most, participants when they know it will be shared and made public (even though it is anonymous). Body weight is, not surprisingly, sensitive data for most people, and they are very selective about who they trust with it. User privacy preferences should be factored in when Bellabeat carries out its own user data collection campaign. As a side note, I think Bellabeat should monitor the percentage of users who allow their body weight logs to be tracked and/or shared anonymously and use it as a possible metric for user confidence in Bellabeat's handling of personal or “sensitive” data.

    • Using the cleaned and organized data, we created interesting yet quick and lightweight plots. The daily data show notable user sleep metrics, and the hourly data show the times of day users are most likely to be active, among much more; those are some of the biggest takeaways. Please refer to the “Data Analysis: Kaggle data set” and “Data Visualizations: Kaggle data set” sections for more information on process, code reproduction, and explanations.

    • Word cloud visualizations are one way to get a high-level view of string-type data, and they are even more effective when the data is closely related to the specific questions in the analysis. The data used here is not a perfect fit, but it still has some relevance. Please refer to the “Data Visualizations: DataverseNL data set” section.

  • Log which variables are currently being tracked by non-Bellabeat smart devices and use those variables to develop new product features and refine existing ones:

    • The data used are archived and logged for future use. There was also an opportunity to manipulate the data to create new variables from the existing ones, which is a great way to add value from existing assets.

    • The new ‘TimeUntilSleep’ variable could be used to develop a product feature, such as a phone/tablet app or smart device alert, that tells users exactly when they should get in bed to achieve their desired duration of restful sleep. It could also support subscription-based services with premium features such as an app displaying data trends with visualizations.
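The bedtime alert described above reduces to simple date arithmetic. A sketch with hypothetical inputs (the user's average minutes awake in bed, a desired sleep duration, and a wake time):

```r
## Bedtime-alert sketch: when should the user get in bed to hit a sleep goal?
## All inputs are hypothetical illustrations.
avg_time_until_sleep_min <- 25    # user's average minutes awake before sleep
desired_sleep_hours      <- 8
wake_time <- as.POSIXct("2023-04-01 06:30:00", tz = "UTC")

# Back off the desired sleep plus the typical time needed to fall asleep
get_in_bed_by <- wake_time - desired_sleep_hours * 3600 -
                 avg_time_until_sleep_min * 60
format(get_in_bed_by, "%H:%M")    # "22:05"
```

In practice the average would come from the user's own tracked data rather than a constant.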

    • The new ‘StepsPerCalorie’ variable could be built into a product feature that alerts users, via phone/tablet app or smart device, to exactly how many calories have been burned in real time and how many more steps are required to reach their desired calorie burn for the day/week/year, etc. A subscription-based feature could also alert users periodically about their real-time progress and provide motivational value.
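That real-time alert reduces to a ratio and a subtraction. A sketch with hypothetical running totals:

```r
## Steps-remaining sketch: estimate the extra steps needed to reach a daily
## calorie-burn goal from the running StepsPerCalorie ratio (hypothetical).
steps_so_far    <- 5200
calories_so_far <- 1600
calorie_goal    <- 2200

steps_per_calorie <- steps_so_far / calories_so_far   # 3.25 steps per Calorie
steps_remaining   <- ceiling((calorie_goal - calories_so_far) * steps_per_calorie)
steps_remaining   # 1950 more steps to hit the goal
```

A production version would have to handle non-step calorie burn (resting metabolism, workouts), which this naive ratio ignores.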

    • The maximum value of hourly variables ‘Calories’ or ‘StepTotal’ could be used to develop product features that show users the percentile rank they fall under for that day or hour. The 1st Quartile ‘Calories’ value could be used to refine algorithms for resting or baseline calories burned per hour.
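The percentile-rank feature could lean on the empirical CDF of the group's hourly values. A sketch with hypothetical hourly calorie figures:

```r
## Percentile-rank sketch: where does one hour's burn fall in the group's
## distribution of hourly Calories? (hypothetical values)
group_hourly_calories <- c(65, 70, 72, 80, 95, 110, 140, 180, 240, 310)

# ecdf() returns a function giving the fraction of observations <= x
pct_rank <- ecdf(group_hourly_calories)
round(100 * pct_rank(180))   # this hour ties or beats 80% of group hours
```

The same function evaluated at the 1st Quartile would anchor the resting/baseline calorie refinement mentioned above.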

  • Choose a Bellabeat product to receive increased marketing spend for Q2 2023: Leaf Product Line

    • After reviewing the results of the analysis, I think the Bellabeat wearable smart device product line that will benefit most from increased marketing spend in Q2 2023 is the Leaf product line rather than the Ivy product line. Based on the data used for this analysis, the core features of the Leaf product line (activity, sleep) as well as its pricing are more in line with the product segment (FitBit and FitBit-like products) from which the data for this analysis was collected.

Actionable next steps:

  • Product test the proposed feature changes and create an implementation strategy for those that best align with the Bellabeat vision and wearable smart device market positioning. If no proposed feature changes make the cut, consider another EDA using more, or more targeted, data.

  • Collect then analyze sales data on the Bellabeat product line (Leaf Product Line) that received increased marketing budget spend for Q2.

  • Develop wearable smart device target-user-profiles using behavior and preference data for future marketing and product development strategies.

  • Research other competing brands and products in the wearable smart device market. Possibly acquire data from a wearable smart device more closely related to the Bellabeat Ivy product line (i.e., from a device other than FitBit). Then combine all the data, ask the same questions from this analysis, and see whether the conclusions vary.

Final Words:

Unfortunately, as with all EDAs, there are some questions that could not be fully answered by the data, and rather than force the question to fit the data we will acknowledge where the data was incompatible, incomplete, or both.

  • Some things we can do as an organization to better prepare for similar future analyses:

    • Acknowledge that even though Bellabeat products are targeted toward women, we used data where the ‘gender’ variable was removed or never collected. It would be more helpful next time to include data sets with a gender-type variable, but we must also ask whether that aligns with Bellabeat's mission statement, culture, ethics, and perhaps the Bellabeat community.

    • Collect, process, and compare our own Bellabeat first-party data along with third-party open data sets like those used in this analysis.

    • Spend a little more time in the data processing phase to make sure we have the most current and reliable data sets, and more importantly, the most relevant and useful ones for the questions we want to answer. Use data sets with good metadata and ethical collection methods.

Hopefully we can learn from this and apply it in the ask or preparation phase of similar subsequent EDAs so expectations can be properly managed.

