Predicting Sleep Efficiency

1. Introduction

Sleep is key to maintaining a healthy lifestyle, yet it is often overlooked in today’s fast-paced world. Sleep efficiency plays a crucial role in insomnia research and is a commonly used metric for assessing sleep quality. Using this sleep efficiency dataset from Morocco, we aim to explore the factors influencing sleep efficiency in a different cultural and geographical context. Understanding the dynamics of sleep patterns and efficiency in Morocco can provide valuable insights into the broader issue of sleep deprivation and its impact on individuals and societies.

While it is true that Singapore is often associated with its competitive nature and struggles with sleep, it is crucial to acknowledge that such issues are not unique to Singapore. Morocco, like many other nations, faces its own unique challenges when it comes to sleep patterns and efficiency. By examining sleep efficiency in Morocco, this project will contribute to the broader conversation about sleep health and its implications, ultimately helping to spur data-driven solutions that can improve sleep patterns and overall well-being on a global scale. Through this project, we aim to predict sleep efficiency based on various factors like age, sleep patterns, and lifestyle choices using a multiple linear regression model.

Dataset

Our dataset is sourced from Kaggle. The variables include ID, Age, Gender, Bedtime, Wakeup time, Sleep duration, Sleep efficiency, REM sleep percentage, Deep sleep percentage, Light sleep percentage, Awakenings, Caffeine consumption, Alcohol consumption, Smoking status and Exercise frequency. The meaning and data type of each variable are further explained in our codebook.

Connection to Spark

We initiated a connection to a local Spark cluster and read our Sleep_Efficiency.csv file into a Spark DataFrame using spark_read_csv(). We then wrote the data to a Parquet file and read it back using the spark_write_parquet() and spark_read_parquet() functions. We chose the Apache Parquet format as it allows for efficient data storage and retrieval, especially when handling large datasets.

library(arrow)
library(sparklyr)
library(dplyr)
library(ggplot2)
library(dbplot)
library(corrr)
library(doParallel)
library(plotly)
library(plumber)

sc <- spark_connect(master = "local", version = "3.4.0")
sleep_csv <- spark_read_csv(sc,
                        name = "Sleep",
                        path = paste0("file://", getwd(), "/Sleep_Efficiency.csv"))

parquet_path <- "Sleep_parquet"
spark_write_parquet(sleep_csv, path = parquet_path, mode="overwrite")
sleep <- spark_read_parquet(sc, path = parquet_path)
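
As a quick sanity check (a minimal sketch), we can confirm that the Parquet copy has the same dimensions as the data read from the CSV:

# Sketch: the Parquet round-trip should preserve the dimensions of the CSV
sdf_nrow(sleep_csv)   # rows in the DataFrame read from the CSV
sdf_nrow(sleep)       # rows in the DataFrame read back from Parquet
sdf_ncol(sleep)       # number of columns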

Data Cleaning

We start our analysis with data cleaning: first removing rows containing nulls, then dropping the ID column, as it is not needed in our analysis. We then recoded Smoking_status and Gender as binary integer variables instead of categorical variables, converted Awakenings to integer, and extracted the hour from Bedtime and Wakeup_time. This leaves 386 observations of 14 variables in our final sleep_clean dataset.

sleep_clean <- sleep |>
               filter_all(all_vars(!is.na(.)))|>
               select(-ID) |>
               mutate(Smoking_status = as.integer(ifelse(Smoking_status == "Yes", 1, 0)),
                      Awakenings = as.integer(Awakenings),
                      Gender = as.integer(ifelse(Gender == "Female", 1, 0)),
                      Bedtime = (hour(Bedtime)),
                      Wakeup_time = (hour(Wakeup_time)))
glimpse(sleep_clean)
Rows: ??
Warning in arrow_collect(object, ...): NAs introduced by coercion to integer
range
Columns: 14
Database: spark_connection
$ Age                    <int> 65, 69, 40, 40, 57, 27, 53, 41, 11, 50, 55, 30,…
$ Gender                 <int> 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1,…
$ Bedtime                <int> 1, 2, 21, 2, 1, 21, 0, 2, 1, 0, 22, 2, 1, 1, 22…
$ Wakeup_time            <int> 7, 9, 5, 8, 9, 3, 10, 8, 10, 8, 6, 11, 9, 10, 5…
$ Sleep_duration         <dbl> 6.0, 7.0, 8.0, 6.0, 8.0, 6.0, 10.0, 6.0, 9.0, 8…
$ Sleep_efficiency       <dbl> 0.88, 0.66, 0.89, 0.51, 0.76, 0.54, 0.90, 0.79,…
$ REM_sleep_percentage   <int> 18, 24, 20, 28, 27, 28, 28, 28, 18, 28, 18, 24,…
$ Deep_sleep_percentage  <int> 70, 28, 70, 25, 55, 25, 57, 60, 35, 57, 60, 63,…
$ Light_sleep_percentage <int> 10, 53, 10, 52, 18, 52, 20, 17, 45, 20, 22, 18,…
$ Awakenings             <int> 0, 3, 1, 3, 3, 2, 0, 3, 4, 1, 0, 0, 4, 2, 0, 4,…
$ Caffeine_consumption   <dbl> 0, 0, 0, 50, 0, 50, 50, 50, 0, 50, 0, 50, 0, 25…
$ Alcohol_consumption    <dbl> 0, 3, 0, 5, 3, 0, 0, 0, 3, 0, 0, 0, 0, 1, 2, 2,…
$ Smoking_status         <int> 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1,…
$ Exercise_frequency     <dbl> 3, 3, 3, 1, 3, 1, 3, 1, 0, 3, 3, 1, 3, 0, 5, 0,…

Summary Statistics

Next, we examined the summary statistics to understand our data further. The average age of the participants is 41 years, ranging from 9 to 69 years. The average sleep efficiency is 79%, slightly below the National Sleep Foundation (NSF) recommended level of 85%. On average, participants wake up 1 to 2 times throughout the night. Exercise frequency is relatively low, typically once or twice a week. Participants also consume relatively little caffeine and alcohol on average, about 23 mg and 1.2 oz respectively. Summary statistics for the binary variables were computed to confirm that their minimum and maximum values are 0 and 1 respectively.

sleep_clean|>
  sdf_describe()
# Source: spark<?> [?? x 15]
  summary Age         Gender Bedtime Wakeup_time Sleep_duration Sleep_efficiency
  <chr>   <chr>       <chr>  <chr>   <chr>       <chr>          <chr>           
1 count   386         386    386     386         386            386             
2 mean    40.9715025… 0.497… 10.601… 6.89637305… 7.45077720207… 0.7902849740932…
3 stddev  13.2919941… 0.500… 10.549… 2.00768324… 0.88503395905… 0.1352925512472…
4 min     9           0      0       3           5.0            0.5             
5 max     69          1      23      12          10.0           0.99            
# ℹ 8 more variables: REM_sleep_percentage <chr>, Deep_sleep_percentage <chr>,
#   Light_sleep_percentage <chr>, Awakenings <chr>, Caffeine_consumption <chr>,
#   Alcohol_consumption <chr>, Smoking_status <chr>, Exercise_frequency <chr>

2. Exploratory Data Analysis and Visualization

In this section, we performed exploratory data analysis to identify trends among our variables.

Histogram Frequency plot

Firstly, we explored the 14 variables in the dataset by generating histograms to understand their distributions. From our frequency plots, we can draw the following insights:

  1. Sleep efficiency is relatively high, mostly ranging between 0.8 and 0.9.

  2. Age of the participants is concentrated around 25 to 50 years old.

  3. The number of males and females in the dataset appears to be relatively balanced.

  4. Most of the participants experience 7 to 8 hours of sleep per night.

  5. The majority of the participants exercise 0 or 3 times a week.

  6. The contrasting patterns of Deep_sleep_percentage and Light_sleep_percentage suggest a potentially high correlation between the two variables.

  7. The majority of the participants woke up 0 or 1 time throughout the night which could explain the relatively high sleep efficiency.

  8. The distribution of Caffeine_consumption is skewed to the right, suggesting that most participants consumed little to no caffeine.

  9. Alcohol consumption is relatively low as the majority appears to consume 0 oz throughout the day.

sleep_clean_df <- sleep_clean |>
                  collect() 
  
for (col in names(sleep_clean_df)){
  graphs <- ggplot(sleep_clean_df, aes(x = .data[[col]])) + 
    geom_histogram(binwidth = 0.15, fill = "skyblue", color = "black")+
    labs(title = paste("Histogram of", col), x = col, y = "Frequency")
  
  print(graphs)
}

Correlation plot

From the correlation plot, Deep_sleep_percentage has the strongest correlation with Light_sleep_percentage, with a correlation coefficient of -0.99. Awakenings has a negative relationship with Sleep_efficiency (correlation coefficient of -0.57), suggesting that as the frequency of awakenings increases, sleep efficiency decreases. Additionally, Deep_sleep_percentage has a strong positive correlation with Sleep_efficiency, while Light_sleep_percentage has a strong negative correlation with it.

sleep_corr <- sleep_clean |>
            correlate(use = "pairwise.complete.obs", method = "pearson") |>
            shave(upper = TRUE) |>
            rplot(print_cor = TRUE) +
            scale_x_discrete(guide = guide_axis(angle = 45))
sleep_corr
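
To read the strongest relationships off the plot more precisely, the pairwise correlations can also be listed as a table. Below is a sketch computed on the collected data with corrr::stretch(); the exact values depend on the sample.

# Sketch: list the most strongly correlated variable pairs on the collected data
sleep_clean_df |>
  correlate(method = "pearson") |>
  shave(upper = TRUE) |>
  stretch(na.rm = TRUE) |>
  arrange(desc(abs(r))) |>
  head(5)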

grouped_sleep1 <- sleep_clean |>
                   mutate(Alcohol_status = ifelse(Alcohol_consumption %in% c(1,2,3,4,5),1,0))|>
                   group_by(Alcohol_status, Smoking_status)|>
                   summarise(n = n())|>
                   group_by(Alcohol_status)|>
                   summarize(count = sum(n), prop = sum(Smoking_status * n) / sum(n))|> #prop of those who smoke
                   collect()
grouped_sleep1 |> 
  mutate(Alcohol_status = as.factor(Alcohol_status))|>
  ggplot(aes(x = Alcohol_status, y = prop))+
  geom_col(width = 0.3)+
  labs(title = "Proportion of smokers who consume alcohol")+
  xlab("Alcohol consumption status")+
  ylab("Proportion")

Chi-square test for independence for Alcohol consumption and Smoking status

The above plot shows that Alcohol_consumption and Smoking_status do not seem to be associated with each other, as there is no marked difference in the proportion of smokers between those who do and do not consume alcohol. This motivates the use of the chi-square test for independence. Since the p-value = 0.5436 > 0.05, we do not reject the null hypothesis at the 5% level of significance and conclude that there is insufficient evidence of an association between smoking status and alcohol consumption.

contingency_table <- sleep_clean|>
                     mutate(Alcohol_status = as.integer(ifelse(Alcohol_consumption %in% c(1,2,3,4,5),1,0)))|>
                     sdf_crosstab("Smoking_status", "Alcohol_status")|>
                     collect()|>
                     sapply(as.numeric)|>
                     as.matrix()

contingency_matrix <- matrix(c(60,111,60,94), nrow = 2)
chisq.test(contingency_matrix)

    Pearson's Chi-squared test with Yates' continuity correction

data:  contingency_matrix
X-squared = 0.36889, df = 1, p-value = 0.5436
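
Rather than hard-coding the counts, the same test could be run on the crosstab collected above. The following is a sketch that assumes the count columns produced by sdf_crosstab() are named "0" and "1"; the resulting statistic should match the one reported above.

# Sketch: run the chi-square test directly on the collected crosstab counts
counts <- contingency_table[, c("0", "1")]  # assumes the count columns are named "0" and "1"; drops the label column
chisq.test(counts)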

Relationship between Average sleep efficiency and Smoking status

This plot shows the 95% confidence interval of average sleep efficiency for smokers and non-smokers respectively. The dashed line represents the average sleep efficiency (about 0.79) of all participants regardless of smoking status. Non-smokers experience a higher average sleep efficiency than smokers (0.82 versus 0.72), and their sleep efficiency is also higher than the overall average. Hence, smoking appears to be associated with lower sleep efficiency.

smoke_prop_data <- sleep_clean |>
                   group_by(Smoking_status) |>
                   summarise(
                      avg_sleep_efficiency = mean(Sleep_efficiency, na.rm = TRUE),
                      sd_sleep_efficiency = sd(Sleep_efficiency, na.rm = TRUE),
                      n = n()) |>
                   mutate(se = sd_sleep_efficiency/sqrt(n)) |>
                    collect()

average_sleep_eff_df <- sleep_clean |>
                        summarise(avg_sleep_efficiency = mean(Sleep_efficiency))|>
                        collect()

average_sleep_eff_vec <- average_sleep_eff_df$avg_sleep_efficiency

#plotting relationship between smoking status and sleep efficiency
smoke_prop_data |>
  mutate(Smoking_status = as.factor(Smoking_status))|>
  ggplot(aes(x = Smoking_status, y = avg_sleep_efficiency))+
  geom_point(size = 2)+
  geom_errorbar(aes(ymin = avg_sleep_efficiency - 1.96*se,
                    ymax = avg_sleep_efficiency + 1.96*se),
                    width = 0.1) +
  geom_hline(yintercept = average_sleep_eff_vec, linetype = "dashed")+
  labs(title = "Average sleep efficiency based on smoking status")+
  ylab("Average sleep efficiency")+
  xlab("Smoking status")

Relationship between Average sleep efficiency and Gender

Similarly, the interval plot shows the 95% confidence interval for the average sleep efficiency of males and females. The dashed line represents the average sleep efficiency regardless of gender. Females experience slightly higher sleep efficiency than males and than the overall average. However, there is no significant difference in average sleep efficiency between the two groups, suggesting that gender has little effect on sleep efficiency.

gender_prop_data <- sleep_clean |>
                    group_by(Gender) |>
                    summarise(avg_sleep_efficiency = mean(Sleep_efficiency),
                              sd_sleep_efficiency = sd(Sleep_efficiency),
                              n = n())|>
                    mutate(se = sd_sleep_efficiency/sqrt(n)) |>
                    collect()

gender_prop_data |>
  mutate(Gender = as.factor(Gender)) |>
  ggplot(aes(x = Gender, y = avg_sleep_efficiency))+
  geom_point(size = 2)+
  geom_errorbar(aes(ymin = avg_sleep_efficiency - 1.96*se,
                    ymax = avg_sleep_efficiency + 1.96*se),
                    width = 0.1) +
  geom_hline(yintercept = average_sleep_eff_vec, linetype = "dashed")+
  labs(title = "Average sleep efficiency based on gender")+
  ylab("Average sleep efficiency")+
  xlab("Gender")

Relationship between Age, Sleep duration and Sleep efficiency

From the line graphs, participants in their early 30s experienced a significant drop in sleep efficiency despite having the second-longest average sleep duration across all age groups. This suggests that they might have trouble falling asleep and spend more time awake while in bed. Therefore, a longer sleep duration does not necessarily lead to higher sleep efficiency.

grouped_sleep2 <- sleep_clean |> 
                  group_by(Age) |>
                  summarise(avg_sleep_efficiency = mean(Sleep_efficiency))|>
                  collect() 
grouped_sleep2 |>
  ggplot(aes(x = Age, y = avg_sleep_efficiency)) +
  geom_line()+
  labs(title = "Average sleep efficiency by age")+
  ylab("Average sleep efficiency")+
  xlab("Age")+
  coord_cartesian(ylim = c(0.4, 1))

grouped_sleep3 <- sleep_clean |> 
                group_by(Age) |>
                summarise(avg_sleep_duration = mean(Sleep_duration))|>
                collect() 
grouped_sleep3 |>
  ggplot(aes(x = Age, y = avg_sleep_duration)) +
  geom_line()+
  labs(title = "Average sleep duration by age")+
  ylab("Average sleep duration")+
  xlab("Age")

Relationship between Sleep efficiency and Awakenings

From the box plot, sleep efficiency is highest when participants did not wake up during the night: the average sleep efficiency for this group is 0.92, surpassing the NSF’s recommended value of 0.85. Sleep efficiency is approximately 0.7 for individuals who woke up 2, 3 or 4 times during the night. This indicates that Awakenings might be a useful predictor of sleep efficiency, particularly for distinguishing individuals with less disrupted sleep.

sleep_clean |>
  ggplot(aes(x = as.factor(Awakenings), y = Sleep_efficiency)) +
  geom_boxplot() +
  xlab("Awakenings") +
  ylab("Sleep efficiency")

Relationship between Sleep efficiency and Caffeine consumption

One can experience varying levels of sleep efficiency even when 0 mg of caffeine is consumed. As shown in the raster plot, there are outliers where sleep efficiency is as high as 0.9 even when a high dose of caffeine is consumed. Hence, caffeine consumption might not be a significant variable. Moreover, the box plot shows that caffeine consumption does not seem to affect Awakenings either. This suggests that caffeine might not necessarily affect sleep efficiency in this context. Nonetheless, a larger sample size or additional information about the timing of caffeine consumption might provide a clearer understanding of this relationship.

sleep_clean |>
  dbplot_raster(x = Caffeine_consumption, y = Sleep_efficiency, fill = n(), resolution = 10)

grouped_sleep4 <- sleep_clean |>
                  group_by(Awakenings)|>
                  summarise(Caffeine_consumption_percentile = percentile(Caffeine_consumption, array(0.25,0.5,0.75)))|>
                  mutate(Caffeine_consumption_percentile = explode(Caffeine_consumption_percentile))|>
                  collect()

grouped_sleep4 |>
  mutate(Awakenings = as.factor(Awakenings))|>
  ggplot(aes(x = Awakenings, y = Caffeine_consumption_percentile))+
  geom_boxplot()+
  labs(title = "The effects of Caffeine consumption on Awakenings")+
  ylab("Caffeine consumption")+
  xlab("Awakenings")

Relationship between Sleep efficiency and Alcohol consumption

The interval plot shows the 95% confidence interval for average sleep efficiency at each alcohol consumption level. In general, higher alcohol consumption is associated with lower sleep efficiency.

alcohol_prop_data <- sleep_clean |>
  group_by(Alcohol_consumption) |>
  summarise(
    avg_sleep_efficiency = mean(Sleep_efficiency, na.rm = TRUE),
    sd_sleep_efficiency = sd(Sleep_efficiency, na.rm = TRUE),
    n = n()
  ) |>
  mutate(se = sd_sleep_efficiency/sqrt(n)) |>
  arrange(Alcohol_consumption) |>
  collect()

alcohol_prop_data |>
  ggplot(aes(x = Alcohol_consumption, y = avg_sleep_efficiency))+
  geom_point(size = 2)+
  geom_errorbar(aes(ymin = avg_sleep_efficiency - 1.96*se,
                    ymax = avg_sleep_efficiency + 1.96*se),
                    width = 0.1) +
  geom_hline(yintercept = average_sleep_eff_vec, linetype = "dashed")+
  labs(title = "Average sleep efficiency based alcohol consumption")+
  ylab("Average sleep efficiency")+
  xlab("Alcohol consumption (in oz)")

Relationship between Sleep efficiency and Exercise frequency

The interval plot shows the 95% confidence interval for average sleep efficiency at each exercise frequency. We can observe that exercising generally improves sleep efficiency, but the benefits of exercising more than once a week are hard to discern because the confidence intervals for frequencies 1 to 5 all overlap the average sleep efficiency level.

exercise_prop_data <- sleep_clean |>
                    group_by(Exercise_frequency) |>
                    summarise(avg_sleep_efficiency = mean(Sleep_efficiency),
                              sd_sleep_efficiency = sd(Sleep_efficiency),
                              n = n())|>
                    mutate(se = sd_sleep_efficiency/sqrt(n)) |>
                    arrange(Exercise_frequency) |>
                    collect()

#plotting relationship between exercise and sleep efficiency
exercise_prop_data |>
  ggplot(aes(x = Exercise_frequency, y = avg_sleep_efficiency))+
  geom_point(size = 2)+
  geom_errorbar(aes(ymin = avg_sleep_efficiency - 1.96*se,
                    ymax = avg_sleep_efficiency + 1.96*se),
                    width = 0.1) +
  geom_hline(yintercept = average_sleep_eff_vec, linetype = "dashed")+
  labs(title = "Average sleep efficiency based on exercise")+
  ylab("Average sleep efficiency")+
  xlab("Exercise Frequency (per week)")

Relationship between Bedtime and Sleep efficiency

The plot shows the average sleep efficiency across various bedtimes. The size of the points represents the number of individuals who went to bed at each time. The majority of participants go to bed at either 10pm or 12am. Sleep efficiency is highest for those who go to bed at 10pm and drops for later bedtimes.

For a closer look, the second plot shows the same relationship after filtering for cases where average sleep efficiency was above the overall level of 0.79 and sleep duration was more than 6 hours.

grouped_sleep5 <- sleep_clean |>
                  group_by(Bedtime)|>
                  summarise(n = n(), ave_sleep = mean(Sleep_efficiency))|>
                  arrange(desc(ave_sleep))|>
                  collect()

bedtime_plot <- grouped_sleep5 |>
                mutate(Bedtime = as.factor(Bedtime)) |>
                ggplot(aes(x=Bedtime, y=ave_sleep))+
                geom_point(aes(size = n))+
                labs(title = "The effects of Bedtime on Sleep efficiency")+
                ylab("Average sleep efficiency")+
                xlab("Bedtime in 24 Hours")+
                theme_minimal()
ggplotly(bedtime_plot)
grouped_sleep6 <- sleep_clean |>
                  group_by(Bedtime, Sleep_duration)|>
                  summarise(n = n(), ave_sleep = mean(Sleep_efficiency))|>
                  arrange(desc(ave_sleep))|>
                  filter(ave_sleep>0.79 & Sleep_duration>6) |>
                  collect()
bedtime_plot <- grouped_sleep6 |>
                mutate(Bedtime = as.factor(Bedtime)) |>
                ggplot(aes(x=Bedtime, y=ave_sleep))+
                geom_point(aes(size = n))+
                geom_point(aes(colour= Sleep_duration))+
                labs(title = "The effects of Bedtime on Sleep efficiency")+
                ylab("Average sleep efficiency")+
                xlab("Bedtime in 24 Hours")+
                theme_minimal()
ggplotly(bedtime_plot)

3. Data Modelling in Spark

In this section, we use machine learning models for predictive modelling, employing multiple linear regression to predict sleep efficiency. The first model includes all variables except Bedtime and Wakeup_time, to avoid a perfect collinearity problem, as they have a linear relationship with Sleep_duration. In the second model, Deep_sleep_percentage is removed to avoid multicollinearity, as it is highly correlated with Light_sleep_percentage. Additionally, our visualisations suggest that Gender, Sleep_duration and Caffeine_consumption do not affect sleep efficiency, and their p-values in the first regression are greater than 0.05, suggesting they are not statistically significant; hence, they are excluded from the second model. Lastly, the third model is the same as the second but adds back Caffeine_consumption.

sleep_split <- sleep_clean |>
               sdf_random_split(training = 0.8,
                                testing = 0.2,
                                seed = 1337)

sleep_train <- sleep_split$training
sleep_test <- sleep_split$testing
ml1 <- sleep_train |> 
       ml_linear_regression(formula = Sleep_efficiency ~ . -Bedtime -Wakeup_time)|>
        tidy()
ml1
# A tibble: 12 × 5
   term                    estimate std.error statistic  p.value
   <chr>                      <dbl>     <dbl>     <dbl>    <dbl>
 1 (Intercept)             0.811     0.148        5.49  8.12e- 8
 2 Age                     0.000611  0.000265     2.30  2.19e- 2
 3 Gender                  0.00247   0.00753      0.328 7.43e- 1
 4 Sleep_duration         -0.000614  0.00372     -0.165 8.69e- 1
 5 REM_sleep_percentage    0.00410   0.00105      3.91  1.15e- 4
 6 Deep_sleep_percentage   0.000852  0.00163      0.523 6.01e- 1
 7 Light_sleep_percentage -0.00482   0.00169     -2.86  4.49e- 3
 8 Awakenings             -0.0307    0.00267    -11.5   0       
 9 Caffeine_consumption    0.000145  0.000129     1.12  2.63e- 1
10 Alcohol_consumption    -0.00723   0.00227     -3.18  1.62e- 3
11 Smoking_status         -0.0472    0.00725     -6.51  2.96e-10
12 Exercise_frequency      0.00568   0.00250      2.27  2.38e- 2
ml2 <- sleep_train |> 
       ml_linear_regression(formula = Sleep_efficiency ~ . -Bedtime -Wakeup_time-Gender-Sleep_duration-Deep_sleep_percentage-Caffeine_consumption)|>
       tidy()
ml2
# A tibble: 8 × 5
  term                    estimate std.error statistic  p.value
  <chr>                      <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)             0.884     0.0241       36.7  0       
2 Age                     0.000531  0.000248      2.14 3.32e- 2
3 REM_sleep_percentage    0.00406   0.000865      4.69 4.07e- 6
4 Light_sleep_percentage -0.00567   0.000262    -21.7  0       
5 Awakenings             -0.0312    0.00260     -12.0  0       
6 Alcohol_consumption    -0.00753   0.00225      -3.34 9.31e- 4
7 Smoking_status         -0.0475    0.00711      -6.68 1.05e-10
8 Exercise_frequency      0.00490   0.00236       2.07 3.88e- 2
ml3 <- sleep_train |> 
       ml_linear_regression(formula = Sleep_efficiency ~ . -Bedtime -Wakeup_time-Gender-Sleep_duration-Deep_sleep_percentage)|>
       tidy()

ml3
# A tibble: 9 × 5
  term                    estimate std.error statistic  p.value
  <chr>                      <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)             0.883     0.0241       36.6  0       
2 Age                     0.000568  0.000250      2.27 2.36e- 2
3 REM_sleep_percentage    0.00383   0.000883      4.34 1.96e- 5
4 Light_sleep_percentage -0.00570   0.000262    -21.7  0       
5 Awakenings             -0.0307    0.00263     -11.7  0       
6 Caffeine_consumption    0.000157  0.000125      1.26 2.09e- 1
7 Alcohol_consumption    -0.00722   0.00226      -3.19 1.57e- 3
8 Smoking_status         -0.0472    0.00710      -6.65 1.29e-10
9 Exercise_frequency      0.00534   0.00239       2.24 2.59e- 2
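
Since the fitted models above were piped straight into tidy(), only the coefficient tables were kept. The sketch below shows one way to obtain an in-sample fit summary for Model 3; it assumes sparklyr's broom-style glance() method is available for linear regression models.

# Sketch: keep the Model 3 fit object and summarise its in-sample fit
ml3_fit <- sleep_train |> 
       ml_linear_regression(formula = Sleep_efficiency ~ . -Bedtime -Wakeup_time-Gender-Sleep_duration-Deep_sleep_percentage)
glance(ml3_fit)  # in-sample R-squared, RMSE, etc. (assuming glance() is available)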

10-Fold Cross-Validation

To find the best model, we employed 10-fold cross-validation using ml_cross_validator() to compare the models’ Root Mean Squared Error (RMSE). The parameters elastic_net_param and reg_param are set to zero so that the fit is ordinary least squares (ridge regression with zero penalty). The RMSE plot shows that Model 2 has the lowest RMSE (0.05933). Nonetheless, Model 3, which adds Caffeine_consumption back in, is still chosen as the best model to predict sleep efficiency. Because the plot’s y-axis covers a narrow range, the differences in RMSE between the models are small in absolute terms despite the large drop suggested by the line graph.

Despite the slight increase in Model 3’s RMSE, we still chose to include Caffeine_consumption in our final model. The prevalence of tea drinking in Moroccan culture may explain the relatively low amount of caffeine consumed (about 23 mg per person). However, caffeine remains an important variable for predicting sleep efficiency for individuals outside Morocco, where higher levels of caffeine consumption are more common.

pipeline1 <- ml_pipeline(sc) |>
             ft_r_formula(formula = Sleep_efficiency ~ . -Bedtime -Wakeup_time)|> 
             ml_linear_regression()

grid <- list(linear_regression = list(elastic_net_param = 0, reg_param = 0))

cv <- ml_cross_validator(
  sc,
  estimator = pipeline1,
  evaluator = ml_regression_evaluator(
              sc, 
              label_col = "Sleep_efficiency"),
  estimator_param_maps = grid,
  num_folds = 10,
  parallelism = 1,
  seed = 1337
)

pipeline_model1 <- ml_fit(cv,sleep_train)
rmse1 <- ml_validation_metrics(pipeline_model1);rmse1
        rmse reg_param_1 elastic_net_param_1
1 0.05987477           0                   0
pipeline2 <- ml_pipeline(sc) |>
             ft_r_formula(formula = Sleep_efficiency ~ . -Bedtime -Wakeup_time-Gender-Sleep_duration-Deep_sleep_percentage-Caffeine_consumption)|> 
             ml_linear_regression()

grid <- list(linear_regression = list(elastic_net_param = 0, reg_param = 0))

cv <- ml_cross_validator(
  sc,
  estimator = pipeline2,
  evaluator = ml_regression_evaluator(
              sc, 
              label_col = "Sleep_efficiency"),
  estimator_param_maps = grid,
  num_folds = 10,
  parallelism = 1,
  seed = 1337
)

pipeline_model2 <- ml_fit(cv,sleep_train)
rmse2 <- ml_validation_metrics(pipeline_model2);rmse2
        rmse reg_param_1 elastic_net_param_1
1 0.05933476           0                   0
pipeline3 <- ml_pipeline(sc) |>
             ft_r_formula(formula = Sleep_efficiency ~ . -Bedtime -Wakeup_time-Gender-Sleep_duration-Deep_sleep_percentage)|> 
             ml_linear_regression()

grid <- list(linear_regression = list(elastic_net_param = 0, reg_param = 0))

cv <- ml_cross_validator(
  sc,
  estimator = pipeline3,
  evaluator = ml_regression_evaluator(
              sc, 
              label_col = "Sleep_efficiency"),
  estimator_param_maps = grid,
  num_folds = 10,
  parallelism = 1,
  seed = 1337
)

pipeline_model3 <- ml_fit(cv,sleep_train)
rmse3 <- ml_validation_metrics(pipeline_model3);rmse3
        rmse elastic_net_param_1 reg_param_1
1 0.05943701                   0           0
rmse_df <- data.frame(model = c(1,2,3), 
                      rmse = c(rmse1$rmse,rmse2$rmse,rmse3$rmse)) |>
           mutate(model = as.factor(model))

rmse_plot <- rmse_df |>
            ggplot(aes(x = model, y = rmse, group = 1)) +
            geom_point() +  
            geom_line(color = "deepskyblue") + 
            geom_text(aes(label = round(rmse,5)), vjust = -0.5)+
            ylim(0.058, 0.06)+
            labs(title = "Plot of RMSE by Model", x = "Model", y = "RMSE")  

rmse_plot

4. Spark ML pipeline

Improving Model 3 by tuning its hyperparameters

Final model after hyperparameter tuning (the numeric predictors are standardised to mean 0 and standard deviation 1, and the smoking term is the one-hot-encoded indicator, which equals 1 for non-smokers):

Sleep_efficiency = 0.757 + 0.0075 Age + 0.0144 REM_sleep_percentage -0.0861 Light_sleep_percentage -0.0422 Awakenings + 0.0044 Caffeine_consumption -0.0128 Alcohol_consumption + 0.0080 Exercise_frequency + 0.0472 Smoking_status_encoded

Firstly, a pipeline of 6 stages is created. All numerical variables in Model 3 are standardised to mean 0 and standard deviation 1. Next, Smoking_status is indexed and one-hot encoded. Finally, the standardised and encoded features are assembled into a single vector and passed to the ml_linear_regression() stage. The pipeline was built with for-loops, as we treated our dataset as if it were big data with many more variables; in reality, there could be many more categorical variables to index and encode, so this approach is more efficient and improves the readability of the pipeline.

all_str_variables <- sleep |> 
                     select_if(is.character)|>
                     names()

all_variables_ml3 <- ml3$term[-1]
str_variables_ml3 <- intersect(all_str_variables, all_variables_ml3)
num_variables_ml3 <- setdiff(all_variables_ml3, str_variables_ml3)
pipeline <- ml_pipeline(sc) |>
            ft_vector_assembler(input_col = num_variables_ml3,
                                output_col = "features") |>
            ft_standard_scaler(input_col = "features",
                               output_col = "stdz_features",
                               with_mean = TRUE) 

encoder_input_vec <- c()
for (variable in str_variables_ml3) {
  output_col <- paste0(variable, "_indexed")
  pipeline <- pipeline |>
                        ft_string_indexer(
                        input_col = variable, 
                        output_col = paste0(variable, "_indexed")) 
  encoder_input_vec <- c(encoder_input_vec, output_col)
}


encoder_output_vec <- c()
for (variable in encoder_input_vec) {
  output_col <- paste0(sub("_indexed", "", variable), "_encoded")
  pipeline <- pipeline |>
                        ft_one_hot_encoder(
                        input_cols = variable, 
                        output_col = paste0(sub("_indexed", "", variable), "_encoded"))
  encoder_output_vec <- c(encoder_output_vec , output_col)
}

input_vector <- c("stdz_features", encoder_output_vec)

pipeline <- pipeline |>
            ft_vector_assembler(
            input_cols =  input_vector, 
            output_col = "final_features") |>
            
            ml_linear_regression(
            features_col = "final_features", 
            label_col = "Sleep_efficiency")

Then, cross-validation over combinations of elastic_net_param (from 0 to 1) and reg_param is employed. The grid for reg_param starts from 0.001 instead of 0 because this gave a model with a lower RMSE. The tuned model has a slightly lower RMSE of 0.05939 than Model 3, with the hyperparameters elastic_net_param_1 = 0 and reg_param_1 = 0.00289.

ncores <- detectCores(logical = FALSE)
grid <- list(
        linear_regression = list(
        elastic_net_param = seq(from = 0, to = 1, length = 11), 
        reg_param = seq(from = 0.001, to = 0.01, length = 20)))

cv <- ml_cross_validator(
  sc,
  estimator = pipeline,
  estimator_param_maps = grid,
  evaluator = ml_regression_evaluator(sc, 
                                      label_col = "Sleep_efficiency"),
  num_folds = 10,
  parallelism = ncores,
  seed = 1337
)
cv_model <- ml_fit(cv,sleep_train);cv_model
CrossValidatorModel (Transformer)
<cross_validator__de058fc5_a302_4604_9e25_7c466b836d9c> 
 (Parameters -- Tuning)
  estimator: Pipeline
             <pipeline__89600ec1_cf63_4de2_9b62_a5ee1b54610e> 
  evaluator: RegressionEvaluator
             <regression_evaluator__312a4dcb_0e1e_4621_a46c_cedd6fc9821b> 
    with metric rmse 
  num_folds: 10 
  [Tuned over 220 hyperparameter sets]
 (Best Model)
  PipelineModel (Transformer) with 6 stages
  <pipeline__89600ec1_cf63_4de2_9b62_a5ee1b54610e> 
    Stages 
    |--1 VectorAssembler (Transformer)
    |    <vector_assembler__dc637456_1dad_4839_a9bb_c9d7b05a8b53> 
    |     (Parameters -- Column Names)
    |      input_cols: Age, REM_sleep_percentage, Light_sleep_percentage, Awakenings, Caffeine_consumption, Alcohol_consumption, Exercise_frequency
    |      output_col: features
    |--2 StandardScalerModel (Transformer)
    |    <standard_scaler__70e92643_d090_4997_b652_95a46b41964a> 
    |     (Parameters -- Column Names)
    |      input_col: features
    |      output_col: stdz_features
    |     (Transformer Info)
    |      mean:  num [1:7] 41.1 22.98 25.08 1.66 22.31 ... 
    |      std:  num [1:7] 13.39 3.9 15.49 1.39 28.03 ... 
    |--3 StringIndexerModel (Transformer)
    |    <string_indexer__d11e90e3_92af_4714_91ac_2992f74fb923> 
    |     (Parameters -- Column Names)
    |      input_col: Smoking_status
    |      output_col: Smoking_status_indexed
    |     (Transformer Info)
    |      labels:  chr [1:2] "0" "1" 
    |--4 OneHotEncoderModel (Transformer)
    |    <one_hot_encoder__41bbd743_a79b_45bd_a424_9079990b292f> 
    |     (Parameters -- Column Names)
    |      input_cols: Smoking_status_indexed
    |      output_col: one_hot_encoder__41bbd743_a79b_45bd_a424_9079990b292f__output
    |      output_cols: Smoking_status_encoded
    |     (Transformer Info)
    |      category_sizes:  int 2 
    |--5 VectorAssembler (Transformer)
    |    <vector_assembler__fcb97d87_f2dd_48c5_b6c8_f0046e5b859e> 
    |     (Parameters -- Column Names)
    |      input_cols: stdz_features, Smoking_status_encoded
    |      output_col: final_features
    |--6 LinearRegressionModel (Transformer)
    |    <linear_regression__a78e9a1b_2d71_4828_b5b0_69d3bffa66eb> 
    |     (Parameters -- Column Names)
    |      features_col: final_features
    |      label_col: Sleep_efficiency
    |      prediction_col: prediction
    |     (Transformer Info)
    |      coefficients:  num [1:8] 0.0075 0.01438 -0.08611 -0.04219 0.00438 ... 
    |      intercept:  num 0.757 
    |      num_features:  int 8 
    |      scale:  num 1 
ml_validation_metrics(cv_model) |>
              arrange(rmse)|>
              head(3)
        rmse elastic_net_param_1 reg_param_1
1 0.05939788                   0 0.002894737
2 0.05939823                   0 0.003368421
3 0.05939934                   0 0.002421053
ml_stage(cv_model$best_model, stage = "linear_regression")
LinearRegressionModel (Transformer)
<linear_regression__a78e9a1b_2d71_4828_b5b0_69d3bffa66eb> 
 (Parameters -- Column Names)
  features_col: final_features
  label_col: Sleep_efficiency
  prediction_col: prediction
 (Transformer Info)
  coefficients:  num [1:8] 0.0075 0.01438 -0.08611 -0.04219 0.00438 ... 
  intercept:  num 0.757 
  num_features:  int 8 
  scale:  num 1 
bestmodel_coeff <- ml_stage(cv_model$best_model, stage = "linear_regression")$coefficients
bestmodel_coeff
[1]  0.007503705  0.014378519 -0.086109289 -0.042190082  0.004375454
[6] -0.012812382  0.008021552  0.047199543
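
Because the coefficients are reported in the order of the final vector assembler (the seven standardised numeric predictors, followed by the one-hot-encoded Smoking_status column, which equals 1 for non-smokers since the label "0" is indexed first), it helps to label them. A sketch reusing the variable vectors built earlier:

# Sketch: pair each coefficient with its feature name
coef_names <- c(num_variables_ml3, paste0(str_variables_ml3, "_encoded"))
setNames(bestmodel_coeff, coef_names)
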
predictions <- ml_transform(cv_model, sleep_test)|>
               collect()
Warning in arrow_enabled_object.spark_jobj(sdf): Arrow disabled due to columns:
features, stdz_features, Smoking_status_encoded, final_features
glimpse(predictions)
Rows: 61
Columns: 20
$ Age                    <int> 18, 21, 22, 22, 23, 24, 24, 24, 25, 27, 27, 27,…
$ Gender                 <int> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1,…
$ Bedtime                <int> 0, 0, 21, 23, 23, 0, 0, 2, 22, 23, 23, 0, 1, 21…
$ Wakeup_time            <int> 8, 9, 3, 6, 7, 7, 9, 8, 6, 6, 7, 7, 8, 4, 4, 5,…
$ Sleep_duration         <dbl> 8.0, 8.5, 6.0, 7.0, 8.0, 7.5, 8.5, 6.0, 8.0, 7.…
$ Sleep_efficiency       <dbl> 0.68, 0.87, 0.72, 0.91, 0.80, 0.58, 0.88, 0.78,…
$ REM_sleep_percentage   <int> 20, 26, 27, 25, 15, 24, 26, 26, 28, 24, 25, 28,…
$ Deep_sleep_percentage  <int> 30, 63, 55, 65, 65, 28, 63, 60, 57, 25, 20, 60,…
$ Light_sleep_percentage <int> 48, 18, 18, 17, 15, 53, 18, 16, 21, 52, 55, 17,…
$ Awakenings             <int> 3, 0, 1, 1, 4, 1, 0, 1, 2, 3, 3, 1, 3, 2, 2, 1,…
$ Caffeine_consumption   <dbl> 25, 50, 0, 75, 75, 0, 50, 0, 25, 75, 0, 50, 25,…
$ Alcohol_consumption    <dbl> 2, 0, 5, 1, 0, 2, 0, 0, 0, 2, 0, 1, 0, 0, 0, 0,…
$ Smoking_status         <int> 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1,…
$ Exercise_frequency     <dbl> 0, 1, 3, 4, 2, 0, 1, 0, 3, 2, 2, 1, 1, 1, 2, 2,…
$ features               <list> <18, 20, 48, 3, 25, 2, 0>, <21, 26, 18, 0, 50,…
$ stdz_features          <list> <-1.72550559, -0.76333571, 1.47963540, 0.96541…
$ Smoking_status_indexed <dbl> 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1,…
$ Smoking_status_encoded <list> 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0…
$ final_features         <list> <-1.72550559, -0.76333571, 1.47963540, 0.96541…
$ prediction             <dbl> 0.5970352, 0.8565372, 0.8423357, 0.8885480, 0.7…

We then predict sleep efficiency on the test set using the best model after hyperparameter tuning. The plot shows actual against predicted sleep efficiency. The model performs relatively well for individuals with sleep efficiency between 0.85 and 0.90.

predictions |> 
  ggplot(aes(x = prediction, y = Sleep_efficiency)) +
  geom_point(color = "deepskyblue4", size = 2) +
  geom_abline(color = "deepskyblue2", linetype = "dashed") +
  labs(
    x = "Predicted sleep efficiency",
    y = "Actual sleep efficiency",
    title = "Predicted vs Actual sleep efficiency"
  )
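
To quantify this beyond the scatter plot, the test-set RMSE can be computed from the collected predictions; a short sketch:

# Sketch: test-set RMSE of the tuned model on the held-out data
predictions |>
  summarise(test_rmse = sqrt(mean((Sleep_efficiency - prediction)^2)))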