The company Cyclistic offers a bike-share service to a network of 692 stations in Chicago. There are three types of pricing plans:
Casual riders: customers who use single-ride or full-day passes.
Annual members: customers who purchase annual memberships.
Finance analysts at Cyclistic have determined that annual members are more profitable than casual riders.
The current goal of Cyclistic is to develop marketing strategies that will convert casual riders into annual members. In order for that to happen, the marketing team needs to understand how casual riders differ from annual members.
This report analyzes how casual riders and annual members use Cyclistic bikes differently using data from last year, 2021.
The data is publicly available here from the source company Divvy, designated as Cyclistic in this report. The data is provided under this license. The data is provided by the month, and contains anonymized information about:
The dataset in question is:
Cyclistic itself.The data was concatenated using this Jupyter Notebook.
Since the 2021 data is now all in one file, we can load it and take a peek at its structure:
df <- read_csv("D:\\Datasets\\Divvy_Trips_Data\\Divvy_Trips_2021\\2021_all_data.csv")
## New names:
## * `` -> ...1
## Rows: 5595063 Columns: 14
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (5): ...1, start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(df)
## spec_tbl_df [5,595,063 x 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ...1 : num [1:5595063] 0 1 2 3 4 5 6 7 8 9 ...
## $ ride_id : chr [1:5595063] "E19E6F1B8D4C42ED" "DC88F20C2C55F27F" "EC45C94683FE3F27" "4FA453A75AE377DB" ...
## $ rideable_type : chr [1:5595063] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : POSIXct[1:5595063], format: "2021-01-23 16:14:19" "2021-01-27 18:43:08" ...
## $ ended_at : POSIXct[1:5595063], format: "2021-01-23 16:24:44" "2021-01-27 18:47:12" ...
## $ start_station_name: chr [1:5595063] "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" ...
## $ start_station_id : chr [1:5595063] "17660" "17660" "17660" "17660" ...
## $ end_station_name : chr [1:5595063] NA NA NA NA ...
## $ end_station_id : chr [1:5595063] NA NA NA NA ...
## $ start_lat : num [1:5595063] 41.9 41.9 41.9 41.9 41.9 ...
## $ start_lng : num [1:5595063] -87.7 -87.7 -87.7 -87.7 -87.7 ...
## $ end_lat : num [1:5595063] 41.9 41.9 41.9 41.9 41.9 ...
## $ end_lng : num [1:5595063] -87.7 -87.7 -87.7 -87.7 -87.7 ...
## $ member_casual : chr [1:5595063] "member" "member" "member" "member" ...
## - attr(*, "spec")=
## .. cols(
## .. ...1 = col_double(),
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_datetime(format = ""),
## .. ended_at = col_datetime(format = ""),
## .. start_station_name = col_character(),
## .. start_station_id = col_character(),
## .. end_station_name = col_character(),
## .. end_station_id = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
We rename the “…1” column and make sure that all columns have consistent nomenclature:
df <- rename(df, index = "...1")
df <- rename(df, Member_Type = "member_casual")
# let's convert the member types to uppercase:
df$Member_Type[df$Member_Type=="casual"] <- "Casual"
df$Member_Type[df$Member_Type=="member"] <- "Member"
colnames(df)
## [1] "index" "ride_id" "rideable_type"
## [4] "started_at" "ended_at" "start_station_name"
## [7] "start_station_id" "end_station_name" "end_station_id"
## [10] "start_lat" "start_lng" "end_lat"
## [13] "end_lng" "Member_Type"
clean_names(df)
## # A tibble: 5,595,063 x 14
## index ride_id rideable_type started_at ended_at
## <dbl> <chr> <chr> <dttm> <dttm>
## 1 0 E19E6F1B8D4C42ED electric_bike 2021-01-23 16:14:19 2021-01-23 16:24:44
## 2 1 DC88F20C2C55F27F electric_bike 2021-01-27 18:43:08 2021-01-27 18:47:12
## 3 2 EC45C94683FE3F27 electric_bike 2021-01-21 22:35:54 2021-01-21 22:37:14
## 4 3 4FA453A75AE377DB electric_bike 2021-01-07 13:31:13 2021-01-07 13:42:55
## 5 4 BE5E8EB4E7263A0B electric_bike 2021-01-23 02:24:02 2021-01-23 02:24:45
## 6 5 5D8969F88C773979 electric_bike 2021-01-09 14:24:07 2021-01-09 15:17:54
## 7 6 09275CC10F854E9E electric_bike 2021-01-04 05:05:04 2021-01-04 05:10:39
## 8 7 DF7A32A217AEFB14 electric_bike 2021-01-14 15:07:00 2021-01-14 15:13:40
## 9 8 C2EFC62379EB716C electric_bike 2021-01-09 09:57:55 2021-01-09 10:00:26
## 10 9 B9F73448DFBE0D45 classic_bike 2021-01-24 19:15:38 2021-01-24 19:22:51
## # ... with 5,595,053 more rows, and 9 more variables: start_station_name <chr>,
## # start_station_id <chr>, end_station_name <chr>, end_station_id <chr>,
## # start_lat <dbl>, start_lng <dbl>, end_lat <dbl>, end_lng <dbl>,
## # member_type <chr>
Lets make sure our datetime columns are formatted as such:
df[["started_at"]] <- as.POSIXct(df[["started_at"]],
format = "%Y-%m-%d %H:%M:%S")
df[["ended_at"]] <- as.POSIXct(df[["ended_at"]],
format = "%Y-%m-%d %H:%M:%S")
# lets check to see if the formatting has worked:
sapply(df, class)
## $index
## [1] "numeric"
##
## $ride_id
## [1] "character"
##
## $rideable_type
## [1] "character"
##
## $started_at
## [1] "POSIXct" "POSIXt"
##
## $ended_at
## [1] "POSIXct" "POSIXt"
##
## $start_station_name
## [1] "character"
##
## $start_station_id
## [1] "character"
##
## $end_station_name
## [1] "character"
##
## $end_station_id
## [1] "character"
##
## $start_lat
## [1] "numeric"
##
## $start_lng
## [1] "numeric"
##
## $end_lat
## [1] "numeric"
##
## $end_lng
## [1] "numeric"
##
## $Member_Type
## [1] "character"
We look at null values for our data frame by column:
colSums(is.na.data.frame(df))
## index ride_id rideable_type started_at
## 0 0 0 0
## ended_at start_station_name start_station_id end_station_name
## 0 690809 690806 739170
## end_station_id start_lat start_lng end_lat
## 739170 0 0 4771
## end_lng Member_Type
## 4771 0
We have a lot of null values in certain columns. Let’s get rid of the values in the largest column (end_station_name):
df <- drop_na(df, end_station_name)
# checking to see if na values are dropped:
colSums(is.na.data.frame(df))
## index ride_id rideable_type started_at
## 0 0 0 0
## ended_at start_station_name start_station_id end_station_name
## 0 267591 267590 0
## end_station_id start_lat start_lng end_lat
## 0 0 0 0
## end_lng Member_Type
## 0 0
We still have around 270k null values, so we keep cleaning:
df <- drop_na(df, start_station_name)
# check again to see if na values are dropped:
colSums(is.na.data.frame(df))
## index ride_id rideable_type started_at
## 0 0 0 0
## ended_at start_station_name start_station_id end_station_name
## 0 0 0 0
## end_station_id start_lat start_lng end_lat
## 0 0 0 0
## end_lng Member_Type
## 0 0
Great! Now the data frame is free of null values.
We can proceed to check for duplicates in the data. Since the ride_id is a primary key, we can use that to see if we have any duplicates in our data:
get_dupes(df, ride_id)
## No duplicate combinations found of: ride_id
## # A tibble: 0 x 15
## # ... with 15 variables: ride_id <chr>, dupe_count <int>, index <dbl>,
## # rideable_type <chr>, started_at <dttm>, ended_at <dttm>,
## # start_station_name <chr>, start_station_id <chr>, end_station_name <chr>,
## # end_station_id <chr>, start_lat <dbl>, start_lng <dbl>, end_lat <dbl>,
## # end_lng <dbl>, Member_Type <chr>
Now our data is:
The steps taken above verify the dataset’s integrity.
In terms of bias, this does not seem to be an issue with the dataset. All bike rides are included (except those taken by service staff as they would skew the data), and each ride is anonymized.
In terms of how this data will help us answer the business task, we will need to engineer some new features in the next section.
We should create a column with ride_length in seconds, so we can see how long each ride lasted:
df <- df %>%
mutate(ride_length = as.numeric(difftime(ended_at,started_at,units="secs"),
units="secs"))
We can also create a weekday value for our analysis:
df <- df %>%
mutate(weekday = weekdays(df$started_at))
Lastly, in order to prepare our data fully, we can include a column for the time of day. This is subjective, however, according to the English definitions of times of day, we get the following:
# we can find the starting time of day for each ride (morning, afternoon, evening):
# {
# morning: 00:00:00-11:59:59
# afternoon: 12:00:00-17:59:59
# evening: 18:00:00-23:59:59
# }
# src: https://stackoverflow.com/questions/62419598/r-convert-time-to-time-of-day
get_time_of_day <- function(start_time) {
hour_time <- hour(start_time)
case_when(hour_time >= 0 & hour_time <= 11 ~ 'Morning',
hour_time >= 12 & hour_time <= 17 ~ 'Afternoon',
hour_time >= 18 & hour_time <= 23 ~ 'Evening')
}
df <- df %>%
mutate(time_of_day = get_time_of_day(started_at))
Since the last function’s application can insert null values, lets take a peek at our data again:
str(df)
## tibble [4,588,302 x 17] (S3: tbl_df/tbl/data.frame)
## $ index : num [1:4588302] 9 10 11 12 13 14 15 16 17 18 ...
## $ ride_id : chr [1:4588302] "B9F73448DFBE0D45" "457C7F4B5D3DA135" "57C750326F9FDABE" "4D518C65E338D070" ...
## $ rideable_type : chr [1:4588302] "classic_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : POSIXct[1:4588302], format: "2021-01-24 19:15:38" "2021-01-23 12:57:38" ...
## $ ended_at : POSIXct[1:4588302], format: "2021-01-24 19:22:51" "2021-01-23 13:02:10" ...
## $ start_station_name: chr [1:4588302] "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" ...
## $ start_station_id : chr [1:4588302] "17660" "17660" "17660" "17660" ...
## $ end_station_name : chr [1:4588302] "Wood St & Augusta Blvd" "California Ave & North Ave" "Wood St & Augusta Blvd" "Wood St & Augusta Blvd" ...
## $ end_station_id : chr [1:4588302] "657" "13258" "657" "657" ...
## $ start_lat : num [1:4588302] 41.9 41.9 41.9 41.9 41.9 ...
## $ start_lng : num [1:4588302] -87.7 -87.7 -87.7 -87.7 -87.7 ...
## $ end_lat : num [1:4588302] 41.9 41.9 41.9 41.9 41.9 ...
## $ end_lng : num [1:4588302] -87.7 -87.7 -87.7 -87.7 -87.7 ...
## $ Member_Type : chr [1:4588302] "Member" "Member" "Casual" "Casual" ...
## $ ride_length : num [1:4588302] 433 272 587 537 609 ...
## $ weekday : chr [1:4588302] "Sunday" "Saturday" "Saturday" "Saturday" ...
## $ time_of_day : chr [1:4588302] "Evening" "Afternoon" "Afternoon" "Afternoon" ...
colSums(is.na.data.frame(df))
## index ride_id rideable_type started_at
## 0 0 0 0
## ended_at start_station_name start_station_id end_station_name
## 0 0 0 0
## end_station_id start_lat start_lng end_lat
## 0 0 0 0
## end_lng Member_Type ride_length weekday
## 0 0 0 0
## time_of_day
## 0
Here are some basic statistics about our member types:
ordered_weekdays <- c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday",
"Friday", "Saturday")
# src: https://www.tutorialspoint.com/r/r_mean_median_mode.htm
mode_calc <- function (v) {
uniqv <- unique(v)
uniqv[which.max(tabulate(match(v, uniqv)))]
}
df %>%
group_by(Member_Type) %>%
summarize(count = n(),
mean_ride_len_s = mean(ride_length),
mean_ride_len_m = mean(ride_length)/60,
max_ride_len_s = max(ride_length),
max_ride_len_m = max(ride_length)/60,
most_common_wkday = mode_calc(weekday)) %>%
datatable(caption = "Table 1: Basic Statistics Of Each Member Type",
class = "hover",
options = list(
scrollX = TRUE,
initComplete = JS(
"function(settings, json) {",
"$(this.api().table().header()).css({'background-color': '#3288bd', 'color': 'white'});",
"}")
)) %>%
formatRound(c("mean_ride_len_s", "mean_ride_len_m", "max_ride_len_m"), 1)
From this table, we can conclude the following:
Since weekday matters to the member type, we will investigate further. Let’s calculate the most common days of the week by member type:
df %>%
group_by(Member_Type, weekday) %>%
summarize(observations = n()) %>%
pivot_wider(names_from = weekday, values_from = observations) %>%
select(ordered_weekdays) %>%
datatable(caption = "Table 2: Number Of Rides Per Day By Member Type",
class = "hover",
options = list(
initComplete = JS(
"function(settings, json) {",
"$(this.api().table().header()).css({'background-color': '#3288bd', 'color': 'white'});",
"}")
))
## `summarise()` has grouped output by 'Member_Type'. You can override using the
## `.groups` argument.
## Note: Using an external vector in selections is ambiguous. i Use
## `all_of(ordered_weekdays)` instead of `ordered_weekdays` to silence this
## message. i See
## <https://tidyselect.r-lib.org/reference/faq-external-vector.html>. This message
## is displayed once per session.
## Adding missing grouping variables: `Member_Type`
ggplot(df, mapping=aes(x=weekday, fill=rideable_type)) +
geom_bar() +
facet_wrap(~Member_Type) +
labs(x = "Weekday",
y = "Count",
title = "Number Of Rides Per Day By Member Type") +
scale_x_discrete(limits = ordered_weekdays) +
scale_y_continuous(labels = label_number_si()) +
scale_fill_brewer(name = "Rideable Type", labels = c("Classic Bike", "Docked Bike", "Electric Bike"),
palette = "Blues") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45))
From this visualization, we can conclude:
Let’s see how this correlates to the average ride length by day:
df %>%
group_by(Member_Type, weekday) %>%
summarize(avg_ride_len = mean(ride_length)) %>%
pivot_wider(names_from = weekday, values_from = avg_ride_len) %>%
select(ordered_weekdays) %>%
datatable(caption = "Table 3: Average Ride Length By Weekday Based On Member Type",
class = "hover",
options = list(
initComplete = JS(
"function(settings, json) {",
"$(this.api().table().header()).css({'background-color': '#3288bd', 'color': 'white'});",
"}")
)) %>%
formatRound(ordered_weekdays, 1)
## `summarise()` has grouped output by 'Member_Type'. You can override using the
## `.groups` argument.
## Adding missing grouping variables: `Member_Type`
ggplot(df,mapping = aes(x = weekday, y = ride_length)) +
geom_bar(stat = "summary", fun = "mean", fill = "#3288bd") +
facet_wrap(~Member_Type) +
labs(x = "Weekday",
y = "Average Ride Length",
title = "Average Ride Length By Weekday Based On Member Type") +
scale_x_discrete(limits = ordered_weekdays) +
scale_y_continuous(labels = label_number_si(accuracy = 0.1)) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45),
legend.position = "none")
The above chart has taught us:
We can also look at common times of day for each member type:
df %>%
group_by(Member_Type, time_of_day) %>%
summarize(observations = n()) %>%
pivot_wider(names_from = time_of_day, values_from = observations) %>%
select(c("Morning", "Afternoon", "Evening")) %>%
datatable(caption = "Table 4: Time Of Day Usage Based On Member Type",
class = "hover",
options = list(
initComplete = JS(
"function(settings, json) {",
"$(this.api().table().header()).css({'background-color': '#3288bd', 'color': 'white'});",
"}")
))
## `summarise()` has grouped output by 'Member_Type'. You can override using the
## `.groups` argument.
## Adding missing grouping variables: `Member_Type`
ggplot(df, mapping = aes(x = time_of_day)) +
geom_bar(fill = "#3288bd") +
facet_wrap(~Member_Type) +
labs(x = "Time Of Day",
y = "Count",
title = "Time Of Day Based On Member Type") +
scale_x_discrete(limits = c("Morning", "Afternoon", "Evening")) +
scale_y_continuous(labels = label_number_si()) +
theme_minimal()
We learn the following from the above chart:
We can see that bike usage differs vastly between annual members and casual riders. It seems as if casual riders are using the bikes for fun (usage peak at weekends), while annual members have more uniform usage throughout the week. On possible explanation is that annual members are using the bikes for commute purposes, which would explain their uniform average ride length and usage in morning/afternoon times.
In terms of similarity between the two groups, we see that classic bikes are used far more than electric or docked bikes. We also see that both member types go on afternoon rides most often.
Based on the above analysis, here are my top three recommendations for the marketing team:
Annual members: customers who purchase annual memberships.
Casual riders: customers who use single-ride or full-day passes.
Docked bikes: bikes that are borrowed from and returned to docking stations (docks). They can be returned to a different station.
Dockless bikes: bikes that do not require a docking station. These are typically parked within the bike share district at a bike rack or on the sidewalk.
Electric bikes: a motorized bicycle.
| Description | File |
|---|---|
| Jupyter notebook used for data concatenation. | data_concatenation.ipynb |
| General analysis/scratch file script. | Bike_Share_Analysis.R |
| This report in R markdown notebook format. | Bike_Share_Analysis_Report.Rmd |