Exploring the Ford GoBike Dataset: Uncovering Insights into Bike-Sharing in the San Francisco Bay Area

                                                                  BLOG №7


Table of Contents:

  1. Preliminary Wrangling
  2. Univariate Exploration
  3. Bivariate Exploration
  4. Multivariate Exploration
  5. Conclusion

Introduction: 

Bike-sharing systems have become increasingly popular in cities around the world, providing a convenient and eco-friendly mode of transportation. In this blog post, we dive into the Ford GoBike dataset, a rich source of information about individual rides made in the greater San Francisco Bay Area. Join us as we explore the data and uncover insights about bike-sharing trends and patterns in this region. You can download the HTML files of the project from this repository.

Preliminary Wrangling: 

Before we begin our exploration, it's crucial to understand and prepare the dataset. This step assesses and cleans the data, addressing issues such as missing values, incorrect data types, and outliers, so that the analyses that follow rest on consistent, complete records. The code used in the wrangling step is shown below.

# Import pandas and load the dataset
import pandas as pd

df = pd.read_csv("201902-fordgobike-tripdata.csv")
# Display the first few rows of the DataFrame
df.head(5)



# Display information about the dataset, including column names and data types
df.info()
After careful consideration, I decided to drop the rows with missing values in the station columns (start_station_id, start_station_name, end_station_id, end_station_name) and in the member_birth_year and member_gender columns. These rows account for a small proportion of the dataset, and removing them ensures that every record used in the analysis is complete.

# Drop rows with missing values in the specified columns
df.dropna(subset=['end_station_id', 'end_station_name', 'start_station_id', 'start_station_name'], inplace=True)
df.dropna(subset=['member_birth_year', 'member_gender'], inplace=True)

# Check for missing values in the dataset
print(df.isnull().sum())




After dropping the missing values, the summary statistics of the dataset changed slightly. I expected this: removing rows changes the sample over which the mean, standard deviation, and quartiles are computed.

Because the dropped rows represent a small portion of the dataset, the impact on the overall analysis is minimal. I accepted this trade-off to keep the analysis on complete records, and I do not believe the slight shifts in the summary statistics compromise the validity of the conclusions.
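
One way to check the claim that dropping rows barely moves the statistics is to compare the summaries before and after the drop. The sketch below does this on a small invented DataFrame (the values are hypothetical and not taken from the Ford GoBike file); on the real data the same calls would be applied to df before calling dropna.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the trip data; all values are invented for illustration.
toy = pd.DataFrame({
    "duration_sec": [300, 450, 600, 900, 1200, 5000],
    "member_birth_year": [1985.0, 1990.0, np.nan, 1978.0, 1995.0, np.nan],
    "member_gender": ["Male", "Female", None, "Male", "Other", None],
})

# Drop incomplete records and measure how much of the data is lost.
cleaned = toy.dropna(subset=["member_birth_year", "member_gender"])
dropped_share = 1 - len(cleaned) / len(toy)
print(f"Share of rows dropped: {dropped_share:.1%}")

# Put the summary statistics of the duration column side by side.
comparison = pd.concat(
    {
        "before": toy["duration_sec"].describe(),
        "after": cleaned["duration_sec"].describe(),
    },
    axis=1,
)
print(comparison)
```

If the dropped share is small and the two columns of the comparison are close, the decision to drop is easy to defend; a large gap would suggest the missing rows differ systematically from the rest.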

# Convert 'start_time' and 'end_time' columns to datetime
df['start_time'] = pd.to_datetime(df['start_time'])
df['end_time'] = pd.to_datetime(df['end_time'])

We convert the start_station_id and end_station_id columns to an integer data type (int64 on most platforms) using the astype() method. pandas typically reads an ID column that contains missing values as floats; station IDs are identifiers rather than measurements, so integers represent them more naturally and store them more compactly.


# Convert 'start_station_id' and 'end_station_id' columns to int
df['start_station_id'] = df['start_station_id'].astype(int)
df['end_station_id'] = df['end_station_id'].astype(int)
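
The order of the wrangling steps matters here: astype(int) cannot represent missing values, so the conversion only works once the rows with missing station IDs are gone. A minimal sketch on an invented Series (the IDs are placeholders) shows the failure and two ways around it; the nullable "Int64" dtype is an alternative that this project did not use.

```python
import numpy as np
import pandas as pd

# Hypothetical station IDs as pandas would read them when one is missing.
ids = pd.Series([58.0, 67.0, np.nan], name="start_station_id")

# A plain integer cast fails while a missing value is present.
conversion_failed = False
try:
    ids.astype(int)
except (ValueError, TypeError) as err:
    conversion_failed = True
    print(f"Conversion failed: {err}")

# Option 1 (the approach in this post): drop missing values, then cast.
as_int = ids.dropna().astype("int64")

# Option 2 (an alternative): keep the rows and use the nullable integer dtype.
as_nullable = ids.astype("Int64")

print(as_int.dtype, as_nullable.dtype)
```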
  • Printing the unique values in the user_type column shows the different types of users in the bike-sharing system.
  • Printing the unique values in the member_gender column shows which gender categories are recorded for users.
  • Printing the unique values in the bike_share_for_all_trip column shows how trips are flagged for Bike Share for All, the system's discounted membership program for low-income residents.
# Looping through columns to print unique values
columns = ['user_type', 'member_gender', 'bike_share_for_all_trip']
for column in columns:
    print(df[column].unique())

# Get the count of each unique value in the 'user_type' column
print(df['user_type'].value_counts())
Subscribers account for nearly ten times as many rides as customers (158,386 versus 16,566).
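
To put a number on the gap between the two user types, the counts can be turned into shares and a ratio. The sketch below rebuilds the user_type counts reported in this post (158,386 subscribers and 16,566 customers) as a small Series; on the real data the same figures would come from df['user_type'].value_counts().

```python
import pandas as pd

# Counts as reported in this post for the cleaned dataset.
counts = pd.Series({"Subscriber": 158_386, "Customer": 16_566}, name="rides")

# Share of all rides made by each user type.
shares = counts / counts.sum()

# How many subscriber rides there are for each customer ride.
ratio = counts["Subscriber"] / counts["Customer"]

print(shares.round(3))
print(f"Subscriber-to-customer ratio: {ratio:.2f}")
```

On a DataFrame, value_counts(normalize=True) returns the shares directly.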

# Display the count of each unique value in the 'member_gender' column
print(df['member_gender'].value_counts())

The data show a much higher proportion of rides by male users than by female users or users of other genders in the San Francisco Bay Area bike-sharing system: 130,500 rides by male users, compared with 40,805 by female users and 3,647 by users of other genders. This suggests that the bike-sharing system is more popular among male riders.

The following code counts the unique values in the bike_share_for_all_trip column. It shows how many trips were taken under the Bike Share for All program ('Yes') compared with those that were not ('No').
# Show the count of each unique value in the 'bike_share_for_all_trip' column
print(df['bike_share_for_all_trip'].value_counts())

# Define the function to print value counts
def print_value_counts(column):
    value_counts = df[column].value_counts()
    print(value_counts)

# Print value counts of the station ID and station name columns
for column in ['start_station_id', 'end_station_id', 'start_station_name', 'end_station_name']:
    print(f"Value counts of the '{column}' column")
    print_value_counts(column)
    print("\n")
 


What is the structure of the dataset?
Dataset Structure


After dropping the missing values, the dataset now consists of 174,952 rows and 16 columns. Each row represents an individual ride made in the bike-sharing system covering the greater San Francisco Bay area. The dataset includes the following columns:

  • duration_sec: The duration of the ride in seconds (numeric)
  • start_time: The start time of the ride (datetime)
  • end_time: The end time of the ride (datetime)
  • start_station_id: The ID of the start station (numeric)
  • start_station_name: The name of the start station (string)
  • start_station_latitude: The latitude of the start station (numeric)
  • start_station_longitude: The longitude of the start station (numeric)
  • end_station_id: The ID of the end station (numeric)
  • end_station_name: The name of the end station (string)
  • end_station_latitude: The latitude of the end station (numeric)
  • end_station_longitude: The longitude of the end station (numeric)
  • bike_id: The ID of the bike used for the ride (numeric)
  • user_type: The type of user (either "Customer" or "Subscriber")
  • member_birth_year: The birth year of the user (numeric)
  • member_gender: The gender of the user (either "Male", "Female", or "Other")
  • bike_share_for_all_trip: Indicates whether the trip was taken under the Bike Share for All program for low-income residents (either "Yes" or "No")

This structure provides an overview of the columns and their respective data types, which will be helpful for analyzing and visualizing the data.

What are the main features of interest in my dataset?
Main Features of Interest

From my perspective, the main features of interest in the dataset are as follows:

  1. Start Station ID and Name: The unique values in the start_station_id and start_station_name columns provide insights into the starting point of the bike rides. By examining the count of unique values in these columns, we can identify the most frequently used start stations. For example, the station with ID 58 has the highest count of 3,649 rides, followed by station 67 with 3,408 rides. Understanding popular start-stations helps in analyzing user preferences and planning station infrastructure.

  2. End Station ID and Name: Similarly, the unique values in the end_station_id and end_station_name columns give us information about the destination of the bike rides. By analyzing the count of unique values, we can identify the most commonly used end stations. For instance, the station with ID 67 has the highest count of 4,624 rides, followed by station 58 with 3,709 rides. This helps us understand popular destinations and can be useful for station planning and optimizing bike availability.

  3. User Type: The values in the user_type column show the types of users in the bike-sharing system. Counting them gives the split between subscribers and customers: the dataset contains 158,386 rides by subscribers and 16,566 by customers. This information helps us understand the user base and tailor services accordingly.

  4. Member Gender: The values in the member_gender column show the gender distribution among users. The counts are 130,500 rides by male users, 40,805 by female users, and 3,647 by users of other genders. Understanding the gender distribution helps in designing targeted marketing campaigns and identifying any gender-based patterns or preferences.

  5. Bike Share for All Trip: The bike_share_for_all_trip column flags trips taken under the Bike Share for All program for low-income residents. The large majority of rides (157,606) were not taken under the program, while a smaller portion (17,346) were. This information helps in understanding how much the program is used and how its trips differ in duration.

These features provide insights into the usage patterns, popular routes, user preferences, and demographic distribution within the bike-sharing system. Analyzing start and end stations helps in identifying demand patterns, optimizing station locations, and improving the overall user experience. Additionally, understanding user types, gender distribution, and Bike Share for All participation helps in tailoring services and marketing strategies to meet user needs and preferences.

Univariate Exploration: 

In this section, we examine individual variables in the dataset to gain insights into their distributions and characteristics. We analyze features like trip duration, user age, and user type to understand their ranges, central tendencies, and potential outliers. Visualizations such as histograms, bar plots, and box plots will aid us in exploring these variables.
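
User age is not a column in the dataset, so it has to be derived from member_birth_year. The sketch below assumes a reference year of 2019, taken from the file name 201902-fordgobike-tripdata.csv, and flags implausibly old riders with the common 1.5 × IQR rule. Both choices are illustrative, and the birth years are invented.

```python
import pandas as pd

# Invented birth years, including one implausible entry.
toy = pd.DataFrame({"member_birth_year": [1985, 1990, 1978, 1995, 1900, 2000]})

# Assumption: the rides were recorded in 2019 (from the file name).
REFERENCE_YEAR = 2019
toy["age"] = REFERENCE_YEAR - toy["member_birth_year"]

# Flag ages above the upper fence of the 1.5 * IQR rule.
q1, q3 = toy["age"].quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)
toy["age_outlier"] = toy["age"] > upper_fence

print(toy)
print(f"Upper fence: {upper_fence}")
```

Whether flagged rows are removed, capped, or simply noted is a judgment call; a histogram of the age column makes the effect of each choice visible.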

I follow the Question-Visualization-Observation framework in the explorations below: each one poses a question, shows a plot, and records what the plot reveals. The framework keeps an analysis well structured and easy to follow.

You can check the code in the repository by downloading the HTML file.

Bivariate Exploration: 

Moving beyond individual variables, we now explore the relationships between pairs of variables. We can uncover interesting connections and dependencies by examining the interplay between factors like user age and trip duration or user type and bike usage patterns. Scatter plots, heatmaps, and grouped bar plots are some of the visualizations we'll employ to facilitate this analysis.

Question 12: What is the relationship between the user type and the ride duration?



Observation:

The box plot visualizes the relationship between the user type and the ride duration. Here are the observations:

Subscribers: The box for subscribers is narrower and sits lower than the box for customers. The median ride duration for subscribers is around 500 seconds (approximately 8 minutes). The lower and upper quartiles indicate that the middle half of subscriber rides fall within the range of approximately 300 to 800 seconds (5 to 13 minutes).

Customers: The box plot for customers displays a wider and higher box compared to subscribers. The median ride duration for customers is around 800 seconds (approximately 13 minutes). The lower and upper quartiles suggest that the ride durations for customers are more varied, ranging from approximately 500 to 1400 seconds (8 to 23 minutes).
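
The numbers read off a box plot (median, lower and upper quartiles) can also be computed directly, which avoids estimating them by eye. This is a minimal sketch on invented durations, not the project's plotting code; the column names match the dataset, the values do not.

```python
import pandas as pd

# Invented ride durations for the two user types.
toy = pd.DataFrame({
    "user_type": ["Subscriber"] * 5 + ["Customer"] * 5,
    "duration_sec": [300, 400, 500, 650, 800, 500, 700, 800, 1100, 1400],
})

# Quartiles per user type: the same values a box plot draws as its box.
quartiles = (
    toy.groupby("user_type")["duration_sec"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
quartiles.columns = ["q1", "median", "q3"]

# The interquartile range is the height of the box.
quartiles["iqr"] = quartiles["q3"] - quartiles["q1"]
print(quartiles)
```

A larger IQR corresponds to the "wider box" described for customers above.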

Question 13: How does the ride duration vary between different genders?



Question 14: Is there a relationship between the top 10 start stations and the user type?


Observation:

The stacked bar plot visualizes the relationship between the top 10 start stations and user types. Here are the observations:

Market St at 10th St: The majority of rides from this start station are by subscribers, indicating that it is a popular choice among regular users of the bike-sharing system. The number of customer rides is relatively lower in comparison.

San Francisco Caltrain Station 2 (Townsend St at 4th St): Similar to the previous start station, subscribers make up the majority of rides, suggesting a preference among regular users. The number of customer rides is relatively lower.

Insight


The stacked bar plot provides insights into the user type distribution for the top 10 start stations. Subscribers are the predominant user type for most of the popular start stations, indicating a strong base of regular users in these areas. The number of customer rides is relatively lower, suggesting that these start stations are preferred by subscribers who use the bike-sharing system more frequently.
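
The table behind a stacked bar plot like this one is a cross-tabulation of station against user type, restricted to the busiest stations. The sketch below builds it for invented stations named A to D and a hypothetical TOP_N of 2; the analysis above uses the top 10.

```python
import pandas as pd

# Invented rides; station names are placeholders.
toy = pd.DataFrame({
    "start_station_name": ["A", "A", "A", "A", "B", "B", "B", "C", "C", "D"],
    "user_type": [
        "Subscriber", "Subscriber", "Subscriber", "Customer",
        "Subscriber", "Subscriber", "Customer",
        "Subscriber", "Customer",
        "Customer",
    ],
})

TOP_N = 2  # hypothetical; the post uses the top 10 stations

# Keep only rides that start at the busiest stations.
top_stations = toy["start_station_name"].value_counts().head(TOP_N).index
subset = toy[toy["start_station_name"].isin(top_stations)]

# Rides per station and user type, plus the subscriber share per station.
table = pd.crosstab(subset["start_station_name"], subset["user_type"])
subscriber_share = table["Subscriber"] / table.sum(axis=1)

print(table)
print(subscriber_share.round(2))
```

With matplotlib installed, table.plot(kind="bar", stacked=True) draws the stacked bars from this table.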


Question 15: Are there any relationships between the start day of the week, user type, and the duration of rides?




Observation:

The box plot visualizes the relationship between the start day of the week, user type, and the duration of rides. Here are the observations:

Overall, the box plot reveals that ride durations are influenced by both the start day of the week and the user type. Subscribers tend to have shorter and more consistent ride durations, while customers have longer and more variable ride durations. Weekends (Saturday and Sunday) exhibit slightly longer ride durations for both user types, possibly due to different usage patterns and leisurely rides during weekends.
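
The start day of the week is not stored in the dataset; it can be derived from start_time once that column is a datetime. The sketch below uses invented rides from February 2019 and a hypothetical day_kind column to compare weekday and weekend medians per user type.

```python
import pandas as pd

# Invented rides; 2019-02-04 was a Monday, 02-09 a Saturday, 02-10 a Sunday.
toy = pd.DataFrame({
    "start_time": pd.to_datetime([
        "2019-02-04 08:15:00", "2019-02-04 17:40:00",
        "2019-02-09 11:00:00", "2019-02-09 14:30:00",
        "2019-02-10 10:05:00", "2019-02-10 16:20:00",
    ]),
    "user_type": ["Subscriber", "Customer"] * 3,
    "duration_sec": [480, 900, 620, 1500, 700, 1300],
})

# Day name for plotting, and a weekday/weekend label for a coarse comparison.
toy["start_day"] = toy["start_time"].dt.day_name()
toy["day_kind"] = (
    toy["start_time"].dt.dayofweek.ge(5).map({True: "weekend", False: "weekday"})
)

# Median duration per user type for weekdays and weekends.
median_by_group = (
    toy.groupby(["user_type", "day_kind"])["duration_sec"].median().unstack()
)
print(median_by_group)
```

The median is used here because ride durations are typically right-skewed, so a few very long rides would pull a mean upward.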

Multivariate Exploration: 

In this section, we expand our exploration further by considering the simultaneous relationships among multiple variables. By visualizing three or more variables together, we can uncover complex patterns and interactions. For example, we may investigate how trip duration varies across different user types and age groups, considering factors like gender or day of the week. Through careful visualization and analysis, we'll derive meaningful insights from these multivariate relationships.

Question 16: How does the ride duration vary across different user types and member genders?


Observation:

The heatmap visualizes the mean ride duration across different user types and member genders. Here are the observations:

  • Male subscribers have the shortest average ride duration, at around 616 seconds (about 10 minutes), followed by female subscribers at approximately 696 seconds (about 12 minutes). Customers whose gender is recorded as Other have the longest average ride duration, exceeding 1,602 seconds (more than 26 minutes).

  • There is a clear distinction in ride duration between subscribers and customers within each gender category. Subscribers tend to have shorter average ride durations than customers, regardless of gender.

  • The heatmap gives a compact overview of ride duration across user types and member genders, allowing for easy comparison and identification of trends.

Insight:


The findings suggest that the combination of user type and member gender plays a role in ride duration. While males generally have shorter average ride durations than females, the distinction between subscribers and customers within each gender category is more pronounced. Subscribers, regardless of gender, have shorter average ride durations, indicating that they may use the bike-sharing system for more frequent and shorter trips, possibly for commuting or regular transportation purposes. Customers, and in particular those whose gender is recorded as Other, tend to have longer average ride durations, indicating a different usage pattern, potentially for leisure or recreational purposes.
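
The grid of means that a heatmap like this displays can be produced with a pivot table. The sketch below is an illustration on invented durations; the subscriber means were chosen to echo the 616 and 696 seconds quoted above, and nothing here comes from the real file.

```python
import pandas as pd

# Invented rides: two per combination of user type and gender.
toy = pd.DataFrame({
    "user_type": ["Subscriber"] * 6 + ["Customer"] * 6,
    "member_gender": ["Male", "Male", "Female", "Female", "Other", "Other"] * 2,
    "duration_sec": [600, 632, 690, 702, 900, 1000,
                     1200, 1300, 1400, 1500, 1600, 1700],
})

# Mean duration for every gender (rows) and user type (columns).
mean_duration = toy.pivot_table(
    index="member_gender",
    columns="user_type",
    values="duration_sec",
    aggfunc="mean",
)

# Locate the cells with the shortest and longest mean duration.
flat = mean_duration.unstack()  # index: (user_type, member_gender)
shortest, longest = flat.idxmin(), flat.idxmax()

print((mean_duration / 60).round(1))  # in minutes
print("Shortest:", shortest, "Longest:", longest)
```

If seaborn is available, seaborn.heatmap(mean_duration, annot=True) renders this table as an annotated heatmap.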


Question 17: How does the ride duration vary across different user types and days of the week?



Observation:

The point plot visualizes the ride duration across different combinations of user types, the day of the week, and member genders. Here are the observations:

For both subscribers and customers, ride durations are generally longer on weekends (Saturday and Sunday) compared to weekdays.

Insight:

The findings suggest that the combination of user type and day of the week influences ride durations. Weekends tend to have longer ride durations compared to weekdays for both subscribers and customers, indicating potential differences in usage patterns and trip purposes.


Question 18: How does the ride duration vary between different user types, genders, and Bike Share for All enrollment?


Observation:

The box plot matrix visualizes the relationship between ride duration, user type, member gender, and the bike_share_for_all_trip flag, which marks trips taken under the Bike Share for All program. Here are the observations for each subplot:

Subplot 1: Ride duration by user type and member gender

For both subscribers and customers, females tend to have slightly longer ride durations compared to males and other genders. The difference in ride durations between genders is more noticeable among subscribers.

Subplot 2: Ride duration by user type and Bike Share for All flag

Subscriber trips not taken under Bike Share for All tend to have shorter ride durations than subscriber trips taken under the program.

Subplot 3: Ride duration by member gender and Bike Share for All flag

Males and females have similar ride durations on trips outside the program. Among Bike Share for All trips, however, females tend to have slightly longer ride durations than males.

Subplot 4: Ride duration by user type, member gender, and Bike Share for All flag

Among both customers and subscribers, Bike Share for All trips by females have the longest ride durations, followed by Bike Share for All trips by users of other genders.

Subscriber trips outside the program have shorter and less variable ride durations, regardless of the member's gender.
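
Comparisons across three categorical variables split the data into many small groups, so it helps to look at how many rides sit behind each box before reading much into it. The sketch below, on invented rides, summarizes each combination of user_type, member_gender, and bike_share_for_all_trip with its ride count and median duration; MIN_RIDES is a hypothetical threshold.

```python
import pandas as pd

# Invented rides covering several combinations of the three variables.
toy = pd.DataFrame({
    "user_type": ["Subscriber"] * 5 + ["Customer"] * 3,
    "member_gender": ["Male", "Male", "Female", "Female", "Female",
                      "Male", "Female", "Other"],
    "bike_share_for_all_trip": ["No", "Yes", "No", "Yes", "Yes",
                                "No", "No", "No"],
    "duration_sec": [500, 700, 550, 800, 900, 1200, 1300, 1600],
})

# Ride count and median duration for every observed combination.
summary = (
    toy.groupby(["user_type", "member_gender", "bike_share_for_all_trip"])
    ["duration_sec"]
    .agg(rides="size", median_sec="median")
)

# Keep only combinations with enough rides to compare (threshold is arbitrary).
MIN_RIDES = 2
reliable = summary[summary["rides"] >= MIN_RIDES]

print(summary)
print(reliable)
```

Combinations that never occur in the data simply do not appear in the summary, which is itself worth noting when a subplot shows an empty position.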

Conclusion: 

As we conclude our analysis, we summarize the key findings and insights gained from exploring the Ford GoBike dataset. We reflect on the bike-sharing trends, user characteristics, and patterns we've uncovered, highlighting their significance for the San Francisco Bay area. We also discuss the potential implications of these insights for bike-sharing operators and the future of urban mobility. Finally, we suggest areas for further analysis and potential research directions to continue unraveling the intricacies of bike-sharing systems.

Throughout my data exploration, I gained valuable insights into the bike-sharing dataset, examining various aspects such as univariate distributions, relationships between variables, and multivariate interactions. Here are the key findings and reflections from my exploration:

  1. User Type Distribution: I observed that the majority of users in the dataset were subscribers, indicating a strong base of regular users who potentially use the bike-sharing system for commuting or regular transportation purposes. On the other hand, customers represented a smaller portion of the user base, suggesting that they may be more casual or occasional users of the system.

  2. Ride Duration: I found interesting patterns in ride durations. Subscribers tended to have shorter ride durations compared to customers, which aligns with the assumption that subscribers might use the system more frequently for shorter trips. As for customers, they had longer and more variable ride durations, suggesting a different usage pattern, potentially for leisure or recreational purposes.

  3. Gender Differences: I observed slight variations in ride durations between different member genders. Females tended to have slightly longer ride durations compared to males, indicating potential differences in riding behavior and trip purposes. However, it's important to note that the differences between genders were relatively small and not as prominent as the differences observed between user types.

  4. Start Station Analysis: By exploring the top 10 start stations, I found that subscribers were the dominant user type for the most popular start stations. This suggests a strong base of regular users in those areas, while customers represented a smaller portion of rides. This finding indicates that these start stations are preferred by subscribers who use the bike-sharing system more frequently.

  5. Day of the Week Analysis: I observed that the day of the week played a role in ride durations. Generally, ride durations were longer on weekends (Saturday and Sunday) compared to weekdays for both subscribers and customers. This trend suggests potential differences in usage patterns and trip purposes on weekends, possibly related to leisure or recreational activities.

  6. Multivariate Exploration: Through my multivariate exploration, I uncovered more complex relationships between variables. I observed interactions between user type, member gender, and ride duration, highlighting the importance of considering multiple factors simultaneously. The analysis also revealed interesting interactions between user type, the day of the week, and ride duration, providing insights into how riding behavior varies across different user types and days of the week.

In conclusion, this exploratory data analysis provided me with valuable insights into the bike-sharing dataset. I gained a better understanding of user types, ride durations, gender differences, and the influence of factors such as start stations and the day of the week on ride patterns.



