Bike Share Analysis Report

Current Status

The company Cyclistic offers a bike-share service to a network of 692 stations in Chicago. There are three types of pricing plans:

single-ride pass
full-day pass
annual membership

Casual riders: customers who use single-ride or full-day passes.
Annual members: customers who purchase annual memberships.

Finance analysts at Cyclistic have determined that annual members are more profitable than casual riders.

Business Task

The current goal of Cyclistic is to develop marketing strategies that will convert casual riders into annual members. In order for that to happen, the marketing team needs to understand how casual riders differ from annual members.
This report analyzes how casual riders and annual members use Cyclistic bikes differently using data from last year, 2021.

Data Sources

The data is publicly available here from the source company Divvy, designated as Cyclistic in this report. The data is provided under this license. The data is provided by the month, and contains anonymized information about:

Trip start datetime
Trip end datetime
Trip start station
Trip end station
Trip start latitude/longitude
Trip end latitude/longitude
Rider membership type
Bicycle type

The dataset in question is:

Reliable: From the source company Cyclistic itself.
Original: This is novel data.
Comprehensive: This data covers enough features for us to analyze how casual riders and annual members differ.
Current: Covers the previous year, 2021.
Cited: see Data Sources

Data Cleaning

The data was concatenated using this Jupyter Notebook.

Since the 2021 data is now all in one file, we can load it and take a peek at its structure:

df <- read_csv("D:\\Datasets\\Divvy_Trips_Data\\Divvy_Trips_2021\\2021_all_data.csv")

## New names:
## * `` -> ...1

## Rows: 5595063 Columns: 14
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (5): ...1, start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

str(df)

## spec_tbl_df [5,595,063 x 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ...1              : num [1:5595063] 0 1 2 3 4 5 6 7 8 9 ...
##  $ ride_id           : chr [1:5595063] "E19E6F1B8D4C42ED" "DC88F20C2C55F27F" "EC45C94683FE3F27" "4FA453A75AE377DB" ...
##  $ rideable_type     : chr [1:5595063] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
##  $ started_at        : POSIXct[1:5595063], format: "2021-01-23 16:14:19" "2021-01-27 18:43:08" ...
##  $ ended_at          : POSIXct[1:5595063], format: "2021-01-23 16:24:44" "2021-01-27 18:47:12" ...
##  $ start_station_name: chr [1:5595063] "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" ...
##  $ start_station_id  : chr [1:5595063] "17660" "17660" "17660" "17660" ...
##  $ end_station_name  : chr [1:5595063] NA NA NA NA ...
##  $ end_station_id    : chr [1:5595063] NA NA NA NA ...
##  $ start_lat         : num [1:5595063] 41.9 41.9 41.9 41.9 41.9 ...
##  $ start_lng         : num [1:5595063] -87.7 -87.7 -87.7 -87.7 -87.7 ...
##  $ end_lat           : num [1:5595063] 41.9 41.9 41.9 41.9 41.9 ...
##  $ end_lng           : num [1:5595063] -87.7 -87.7 -87.7 -87.7 -87.7 ...
##  $ member_casual     : chr [1:5595063] "member" "member" "member" "member" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ...1 = col_double(),
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_datetime(format = ""),
##   ..   ended_at = col_datetime(format = ""),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_character(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_character(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

We rename the “…1” column and make sure that all columns have consistent nomenclature:

df <- rename(df, index = "...1")
df <- rename(df, Member_Type = "member_casual")

# let's convert the member types to uppercase:
df$Member_Type[df$Member_Type=="casual"] <- "Casual"
df$Member_Type[df$Member_Type=="member"] <- "Member"

colnames(df)

##  [1] "index"              "ride_id"            "rideable_type"     
##  [4] "started_at"         "ended_at"           "start_station_name"
##  [7] "start_station_id"   "end_station_name"   "end_station_id"    
## [10] "start_lat"          "start_lng"          "end_lat"           
## [13] "end_lng"            "Member_Type"

clean_names(df)

## # A tibble: 5,595,063 x 14
##    index ride_id          rideable_type started_at          ended_at           
##    <dbl> <chr>            <chr>         <dttm>              <dttm>             
##  1     0 E19E6F1B8D4C42ED electric_bike 2021-01-23 16:14:19 2021-01-23 16:24:44
##  2     1 DC88F20C2C55F27F electric_bike 2021-01-27 18:43:08 2021-01-27 18:47:12
##  3     2 EC45C94683FE3F27 electric_bike 2021-01-21 22:35:54 2021-01-21 22:37:14
##  4     3 4FA453A75AE377DB electric_bike 2021-01-07 13:31:13 2021-01-07 13:42:55
##  5     4 BE5E8EB4E7263A0B electric_bike 2021-01-23 02:24:02 2021-01-23 02:24:45
##  6     5 5D8969F88C773979 electric_bike 2021-01-09 14:24:07 2021-01-09 15:17:54
##  7     6 09275CC10F854E9E electric_bike 2021-01-04 05:05:04 2021-01-04 05:10:39
##  8     7 DF7A32A217AEFB14 electric_bike 2021-01-14 15:07:00 2021-01-14 15:13:40
##  9     8 C2EFC62379EB716C electric_bike 2021-01-09 09:57:55 2021-01-09 10:00:26
## 10     9 B9F73448DFBE0D45 classic_bike  2021-01-24 19:15:38 2021-01-24 19:22:51
## # ... with 5,595,053 more rows, and 9 more variables: start_station_name <chr>,
## #   start_station_id <chr>, end_station_name <chr>, end_station_id <chr>,
## #   start_lat <dbl>, start_lng <dbl>, end_lat <dbl>, end_lng <dbl>,
## #   member_type <chr>

Lets make sure our datetime columns are formatted as such:

df[["started_at"]] <- as.POSIXct(df[["started_at"]],
                                 format = "%Y-%m-%d %H:%M:%S")
df[["ended_at"]] <- as.POSIXct(df[["ended_at"]],
                               format = "%Y-%m-%d %H:%M:%S")

# lets check to see if the formatting has worked:
sapply(df, class)

## $index
## [1] "numeric"
## 
## $ride_id
## [1] "character"
## 
## $rideable_type
## [1] "character"
## 
## $started_at
## [1] "POSIXct" "POSIXt" 
## 
## $ended_at
## [1] "POSIXct" "POSIXt" 
## 
## $start_station_name
## [1] "character"
## 
## $start_station_id
## [1] "character"
## 
## $end_station_name
## [1] "character"
## 
## $end_station_id
## [1] "character"
## 
## $start_lat
## [1] "numeric"
## 
## $start_lng
## [1] "numeric"
## 
## $end_lat
## [1] "numeric"
## 
## $end_lng
## [1] "numeric"
## 
## $Member_Type
## [1] "character"

We look at null values for our data frame by column:

colSums(is.na.data.frame(df))

##              index            ride_id      rideable_type         started_at 
##                  0                  0                  0                  0 
##           ended_at start_station_name   start_station_id   end_station_name 
##                  0             690809             690806             739170 
##     end_station_id          start_lat          start_lng            end_lat 
##             739170                  0                  0               4771 
##            end_lng        Member_Type 
##               4771                  0

We have a lot of null values in certain columns. Let’s get rid of the values in the largest column (end_station_name):

df <- drop_na(df, end_station_name)

# checking to see if na values are dropped:
colSums(is.na.data.frame(df))

##              index            ride_id      rideable_type         started_at 
##                  0                  0                  0                  0 
##           ended_at start_station_name   start_station_id   end_station_name 
##                  0             267591             267590                  0 
##     end_station_id          start_lat          start_lng            end_lat 
##                  0                  0                  0                  0 
##            end_lng        Member_Type 
##                  0                  0

We still have around 270k null values, so we keep cleaning:

df <- drop_na(df, start_station_name)

# check again to see if na values are dropped:
colSums(is.na.data.frame(df))

##              index            ride_id      rideable_type         started_at 
##                  0                  0                  0                  0 
##           ended_at start_station_name   start_station_id   end_station_name 
##                  0                  0                  0                  0 
##     end_station_id          start_lat          start_lng            end_lat 
##                  0                  0                  0                  0 
##            end_lng        Member_Type 
##                  0                  0

Great! Now the data frame is free of null values.
We can proceed to check for duplicates in the data. Since the ride_id is a primary key, we can use that to see if we have any duplicates in our data:

get_dupes(df, ride_id)

## No duplicate combinations found of: ride_id

## # A tibble: 0 x 15
## # ... with 15 variables: ride_id <chr>, dupe_count <int>, index <dbl>,
## #   rideable_type <chr>, started_at <dttm>, ended_at <dttm>,
## #   start_station_name <chr>, start_station_id <chr>, end_station_name <chr>,
## #   end_station_id <chr>, start_lat <dbl>, start_lng <dbl>, end_lat <dbl>,
## #   end_lng <dbl>, Member_Type <chr>

Now our data is:

Correctly named
Correctly formatted
Free of null values
Free of duplicates

The steps taken above verify the dataset’s integrity.
In terms of bias, this does not seem to be an issue with the dataset. All bike rides are included (except those taken by service staff as they would skew the data), and each ride is anonymized.
In terms of how this data will help us answer the business task, we will need to engineer some new features in the next section.

Data Preparation-Processing

We should create a column with ride_length in seconds, so we can see how long each ride lasted:

df <- df %>%
  mutate(ride_length = as.numeric(difftime(ended_at,started_at,units="secs"),
                                  units="secs"))

We can also create a weekday value for our analysis:

df <- df %>%
  mutate(weekday = weekdays(df$started_at))

Lastly, in order to prepare our data fully, we can include a column for the time of day. This is subjective, however, according to the English definitions of times of day, we get the following:

# we can find the starting time of day for each ride (morning, afternoon, evening):
# {
#   morning:    00:00:00-11:59:59
#   afternoon:  12:00:00-17:59:59
#   evening:    18:00:00-23:59:59
# }
# src: https://stackoverflow.com/questions/62419598/r-convert-time-to-time-of-day
get_time_of_day <- function(start_time) {
  hour_time <- hour(start_time)
  case_when(hour_time >= 0 & hour_time <= 11 ~ 'Morning',
            hour_time >= 12 & hour_time <= 17 ~ 'Afternoon',
            hour_time >= 18 & hour_time <= 23 ~ 'Evening')
}
df <- df %>%
  mutate(time_of_day = get_time_of_day(started_at))

Since the last function’s application can insert null values, lets take a peek at our data again:

str(df)

## tibble [4,588,302 x 17] (S3: tbl_df/tbl/data.frame)
##  $ index             : num [1:4588302] 9 10 11 12 13 14 15 16 17 18 ...
##  $ ride_id           : chr [1:4588302] "B9F73448DFBE0D45" "457C7F4B5D3DA135" "57C750326F9FDABE" "4D518C65E338D070" ...
##  $ rideable_type     : chr [1:4588302] "classic_bike" "electric_bike" "electric_bike" "electric_bike" ...
##  $ started_at        : POSIXct[1:4588302], format: "2021-01-24 19:15:38" "2021-01-23 12:57:38" ...
##  $ ended_at          : POSIXct[1:4588302], format: "2021-01-24 19:22:51" "2021-01-23 13:02:10" ...
##  $ start_station_name: chr [1:4588302] "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" ...
##  $ start_station_id  : chr [1:4588302] "17660" "17660" "17660" "17660" ...
##  $ end_station_name  : chr [1:4588302] "Wood St & Augusta Blvd" "California Ave & North Ave" "Wood St & Augusta Blvd" "Wood St & Augusta Blvd" ...
##  $ end_station_id    : chr [1:4588302] "657" "13258" "657" "657" ...
##  $ start_lat         : num [1:4588302] 41.9 41.9 41.9 41.9 41.9 ...
##  $ start_lng         : num [1:4588302] -87.7 -87.7 -87.7 -87.7 -87.7 ...
##  $ end_lat           : num [1:4588302] 41.9 41.9 41.9 41.9 41.9 ...
##  $ end_lng           : num [1:4588302] -87.7 -87.7 -87.7 -87.7 -87.7 ...
##  $ Member_Type       : chr [1:4588302] "Member" "Member" "Casual" "Casual" ...
##  $ ride_length       : num [1:4588302] 433 272 587 537 609 ...
##  $ weekday           : chr [1:4588302] "Sunday" "Saturday" "Saturday" "Saturday" ...
##  $ time_of_day       : chr [1:4588302] "Evening" "Afternoon" "Afternoon" "Afternoon" ...

colSums(is.na.data.frame(df))

##              index            ride_id      rideable_type         started_at 
##                  0                  0                  0                  0 
##           ended_at start_station_name   start_station_id   end_station_name 
##                  0                  0                  0                  0 
##     end_station_id          start_lat          start_lng            end_lat 
##                  0                  0                  0                  0 
##            end_lng        Member_Type        ride_length            weekday 
##                  0                  0                  0                  0 
##        time_of_day 
##                  0

Analysis

Here are some basic statistics about our member types:

ordered_weekdays <- c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday",
                      "Friday", "Saturday")

# src: https://www.tutorialspoint.com/r/r_mean_median_mode.htm
mode_calc <- function (v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}
df %>%
  group_by(Member_Type) %>%
  summarize(count = n(),
            mean_ride_len_s = mean(ride_length),
            mean_ride_len_m = mean(ride_length)/60,
            max_ride_len_s = max(ride_length),
            max_ride_len_m = max(ride_length)/60,
            most_common_wkday = mode_calc(weekday)) %>%
  datatable(caption = "Table 1: Basic Statistics Of Each Member Type",
            class = "hover",
            options = list(
              scrollX = TRUE,
              initComplete = JS(
                "function(settings, json) {",
                "$(this.api().table().header()).css({'background-color': '#3288bd', 'color': 'white'});",
                "}")
              )) %>%
  formatRound(c("mean_ride_len_s", "mean_ride_len_m", "max_ride_len_m"), 1)

From this table, we can conclude the following:

Annual members have higher usage overall (already confirmed by finance analysts).
Casual members ride for a longer average time per ride.
Maximum ride time is significantly higher for casual members.
Weekday matters to the member type.

Since weekday matters to the member type, we will investigate further. Let’s calculate the most common days of the week by member type:

df %>%
  group_by(Member_Type, weekday) %>%
  summarize(observations = n()) %>%
  pivot_wider(names_from = weekday, values_from = observations) %>%
  select(ordered_weekdays) %>%
  datatable(caption = "Table 2: Number Of Rides Per Day By Member Type",
            class = "hover",
            options = list(
              initComplete = JS(
                "function(settings, json) {",
                "$(this.api().table().header()).css({'background-color': '#3288bd', 'color': 'white'});",
                "}")
            ))

## `summarise()` has grouped output by 'Member_Type'. You can override using the
## `.groups` argument.
## Note: Using an external vector in selections is ambiguous. i Use
## `all_of(ordered_weekdays)` instead of `ordered_weekdays` to silence this
## message. i See
## <https://tidyselect.r-lib.org/reference/faq-external-vector.html>. This message
## is displayed once per session.
## Adding missing grouping variables: `Member_Type`

ggplot(df, mapping=aes(x=weekday, fill=rideable_type)) +
  geom_bar() +
  facet_wrap(~Member_Type) +
  labs(x = "Weekday",
       y = "Count",
       title = "Number Of Rides Per Day By Member Type") +
  scale_x_discrete(limits = ordered_weekdays) +
  scale_y_continuous(labels = label_number_si()) +
  scale_fill_brewer(name = "Rideable Type", labels = c("Classic Bike", "Docked Bike", "Electric Bike"),
                    palette = "Blues") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45))

From this visualization, we can conclude:

Annual members do not significantly use docked bikes.
Casual rider usage peaks on weekends.
Annual members use bikes throughout the week.
Classic and electric bikes are the most common for both casual riders and annual members.
Both member types use classic bikes most.

Let’s see how this correlates to the average ride length by day:

df %>%
  group_by(Member_Type, weekday) %>%
  summarize(avg_ride_len = mean(ride_length)) %>%
  pivot_wider(names_from = weekday, values_from = avg_ride_len) %>%
  select(ordered_weekdays) %>%
  datatable(caption = "Table 3: Average Ride Length By Weekday Based On Member Type",
            class = "hover",
            options = list(
              initComplete = JS(
                "function(settings, json) {",
                "$(this.api().table().header()).css({'background-color': '#3288bd', 'color': 'white'});",
                "}")
            )) %>%
  formatRound(ordered_weekdays, 1)

## `summarise()` has grouped output by 'Member_Type'. You can override using the
## `.groups` argument.
## Adding missing grouping variables: `Member_Type`

ggplot(df,mapping = aes(x = weekday, y = ride_length)) +
  geom_bar(stat = "summary", fun = "mean", fill = "#3288bd") +
  facet_wrap(~Member_Type) +
  labs(x = "Weekday",
       y = "Average Ride Length",
       title = "Average Ride Length By Weekday Based On Member Type") +
  scale_x_discrete(limits = ordered_weekdays) +
  scale_y_continuous(labels = label_number_si(accuracy = 0.1)) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45),
        legend.position = "none")

The above chart has taught us:

Casual riders have 2x average ride length of annual members.

We can also look at common times of day for each member type:

df %>%
  group_by(Member_Type, time_of_day) %>%
  summarize(observations = n()) %>%
  pivot_wider(names_from = time_of_day, values_from = observations) %>%
  select(c("Morning", "Afternoon", "Evening")) %>%
  datatable(caption = "Table 4: Time Of Day Usage Based On Member Type",
            class = "hover",
            options = list(
              initComplete = JS(
                "function(settings, json) {",
                "$(this.api().table().header()).css({'background-color': '#3288bd', 'color': 'white'});",
                "}")
            ))

## `summarise()` has grouped output by 'Member_Type'. You can override using the
## `.groups` argument.
## Adding missing grouping variables: `Member_Type`

ggplot(df, mapping = aes(x = time_of_day)) +
  geom_bar(fill = "#3288bd") +
  facet_wrap(~Member_Type) +
  labs(x = "Time Of Day",
       y = "Count",
       title = "Time Of Day Based On Member Type") +
  scale_x_discrete(limits = c("Morning", "Afternoon", "Evening")) +
  scale_y_continuous(labels = label_number_si()) +
  theme_minimal()

We learn the following from the above chart:

Both members types go on afternoon rides most often.

Analysis Summary

Annual members have higher usage overall (already confirmed by finance analysts).
Casual members ride for a longer average time per ride.
Maximum ride time is significantly higher for casual members.
Weekday matters to the member type.
Annual members do not significantly use docked bikes.
Casual rider usage peaks on weekends.
Annual members use bikes throughout the week.
Classic and electric bikes are the most common for both casual riders and annual members.
Both member types use classic bikes most.
Casual riders have 2x average ride length of annual members.
Both members types go on afternoon rides most often.
Annual members are probably using bikes for commute purposes.

We can see that bike usage differs vastly between annual members and casual riders. It seems as if casual riders are using the bikes for fun (usage peak at weekends), while annual members have more uniform usage throughout the week. On possible explanation is that annual members are using the bikes for commute purposes, which would explain their uniform average ride length and usage in morning/afternoon times.
In terms of similarity between the two groups, we see that classic bikes are used far more than electric or docked bikes. We also see that both member types go on afternoon rides most often.

Actions

Based on the above analysis, here are my top three recommendations for the marketing team:

Target ads temporally. Casual riders should get ads pre-weekend, perhaps Thursday or Friday.
Casual riders should have ads that emphasize the utility of this service. The goal here would be to change their thinking on the use of the bikes (for commute instead of just fun).
Casual riders should also have ads that emphasize using bikes during the day (morning and afternoon), since they currently use them during the afternoon and evening.

Appendix

Definitions

Annual members: customers who purchase annual memberships.
Casual riders: customers who use single-ride or full-day passes.
Docked bikes: bikes that are borrowed from and returned to docking stations (docks). They can be returned to a different station.
Dockless bikes: bikes that do not require a docking station. These are typically parked within the bike share district at a bike rack or on the sidewalk.
Electric bikes: a motorized bicycle.

Files Used

Description	File
Jupyter notebook used for data concatenation.	data_concatenation.ipynb
General analysis/scratch file script.	Bike_Share_Analysis.R
This report in R markdown notebook format.	Bike_Share_Analysis_Report.Rmd