BLOG №7

Introduction:

Bike-sharing systems have become increasingly popular in cities around the world, providing a convenient and eco-friendly mode of transportation. In this blog post, we dive into the Ford GoBike dataset, a rich source of information about individual rides made in the greater San Francisco Bay area. Join us as we explore the data and uncover fascinating insights about bike-sharing trends and patterns in this vibrant region. You can download the HTML files of the project from this repository.

Preliminary Wrangling:

Before we begin our exploration, it's crucial to understand and prepare the dataset. We'll discuss the steps taken to clean and wrangle the data, addressing issues such as missing values, data types, and outliers. By ensuring the dataset's quality and consistency, we can conduct accurate and meaningful analyses. In this step, We assess and clean the data. Now we will provide you useful codes that will be used in the wrangling part.


# Load the dataset
df = pd.read_csv("201902-fordgobike-tripdata.csv")
# Display the first few rows of the DataFrame
df.head(5)


# Display information about the dataset, including column names and data types
df.info()


After careful consideration, I have decided to drop the rows with missing values in the member_birth_year and member_gender columns. These missing values account for a small proportion of the dataset, and I believe that maintaining accuracy in the analysis is crucial. By dropping these rows, I ensure that the remaining data used for analysis is complete and reliable.
# Drop rows with missing values in the specified columns
df.dropna(subset=['end_station_id', 'end_station_name', 'start_station_id', 'start_station_name'], inplace=True)
df.dropna(subset=['member_birth_year', 'member_gender'], inplace=True)


# Check for missing values in the dataset
print(df.isnull().sum())



Good now we don't have missing values, we can continue our job.
After dropping values, now I will start to get into the dataset by calling some functions



After dropping the missing values, I noticed that the summary statistics of the dataset have changed slightly. I expected this because the dropped rows had missing values in certain columns, which can affect calculations such as mean, standard deviation, and quartiles.
Although there are minor changes in the summary statistics, it is not a significant issue. Since the dropped rows represent a small portion of the total dataset, the impact on the overall analysis is minimal. The remaining data still provides valuable insights and can be used for meaningful analysis.
I made the decision to drop missing values in order to ensure the accuracy and integrity of my analysis. By removing incomplete or unreliable data points, I can focus on a more complete and representative subset of the dataset, which leads to more reliable conclusions and insights.
Overall, I believe that the slight changes in the summary statistics after dropping missing values are acceptable and do not compromise the validity of my analysis.
From the table above, duration_sec and member_birth_year are important.
We convert the 'start_time' and 'end_time' columns to the datetime data type using the pd.to_datetime() function. This will allow us to easily work with time-related data and perform various time-based operations.


# Convert 'start_time' and 'end_time' columns to datetime
df['start_time'] = pd.to_datetime(df['start_time'])
df['end_time'] = pd.to_datetime(df['end_time'])

We convert the start_station_id and nd_station_idcolumns to the int64 data type using the astype() method. This change was made because station IDs are typically represented as integers, and using the int64 data type allows for more efficient storage and supports integer-based operations.


# Convert 'start_station_id' and 'end_station_id' columns to int
df['start_station_id'] = df['start_station_id'].astype(int)
df['end_station_id'] = df['end_station_id'].astype(int)

Printing the unique values in the user_type column allows us to see the different types of users in the bike-sharing system.
Printing the unique values in the member_gender column helps us understand the distribution of gender among the users.
Printing the unique values in the bike_share_for_all_trip column provides insights into whether users share bikes for all their trips or not.
# Looping through columns to print unique values
columns = ['user_type', 'member_gender', 'bike_share_for_all_trip']
for column in columns:
    print(df[column].unique())

This code will provide the count of unique values in the user_type column, which helps in understanding the proportion of subscribers and customers in the bike-sharing system.
# Get the count of each unique value in the 'user_type' column
print(df['user_type'].value_counts())


You can see that subscribers are ten times bigger than customers

This code will display the count of unique values in the member_gender column, giving insights into the gender distribution among the users. It helps identify the dominant gender among the users.
# Display the count of each unique value in the 'member_gender' column-->
print(df['member_gender'].value_counts())



Indeed, from the available data, it appears that there is a higher proportion of male users

compared to female and other genders in the bike-sharing system in the San Francisco

Bay area. The count of male users is significantly higher (130,500) compared

to female users (40,805) and users with other genders (3,647). This suggests

that the bike-sharing system is more popular among male riders.

By using this code, we can get the count of unique values in the bike_share_for_all_trip column. It helps determine the frequency of users who share their bikes for the entire trip ('Yes') compared to those who don't ('No').

# Show the count of each unique value in the 'bike_share_for_all_trip' column
print(df['bike_share_for_all_trip'].value_counts())




# Define the function to print value counts
def print_value_counts(column):
    value_counts = df[column].value_counts()
    print(value_counts)

# Print value counts of the 'start_station_id' column
print("Value counts of the 'start_station_id' column")
print_value_counts('start_station_id')
print("\n")

# Print value counts of the 'end_station_id' column
print("Value counts of the 'end_station_id' column")
print_value_counts('end_station_id')
print("\n")

# Print value counts of the 'start_station_name' column
print("Value counts of the 'start_station_name' column")
print_value_counts('start_station_name')
print("\n")

# Print value counts of the 'end_station_name' column
print("Value counts of the 'end_station_name' column")
print_value_counts('end_station_name')
print("\n")

What is the structure of the dataset?
Dataset Structure

After dropping the missing values, the dataset now consists of 174,952 rows and 16 columns. Each row represents an individual ride made in the bike-sharing system covering the greater San Francisco Bay area. The dataset includes the following columns:

duration_sec: The duration of the ride in seconds (numeric)
start_time: The start time of the ride (datetime)
end_time: The end time of the ride (datetime)
start_station_id: The ID of the start station (numeric)
start_station_name: The name of the start station (string)
start_station_latitude: The latitude of the start station (numeric)
start_station_longitude: The longitude of the start station (numeric)
end_station_id: The ID of the end station (numeric)
end_station_name: The name of the end station (string)
end_station_latitude: The latitude of the end station (numeric)
end_station_longitude: The longitude of the end station (numeric)
bike_id: The ID of the bike used for the ride (numeric)
user_type: The type of user (either "Customer" or "Subscriber")
member_birth_year: The birth year of the user (numeric)
member_gender: The gender of the user (either "Male", "Female", or "Other")
bike_share_for_all_trip: Indicates whether the user shared the bike for the entire trip (either "Yes" or "No")

This structure provides an overview of the columns and their respective data types, which will be helpful for analyzing and visualizing the data.

What are the main features of interest in my dataset?
Main Features of Interest

From my perspective, the main features of interest in the dataset are as follows:

Start Station ID and Name: The unique values in the start_station_id and start_station_name columns provide insights into the starting point of the bike rides. By examining the count of unique values in these columns, we can identify the most frequently used start stations. For example, the station with ID 58 has the highest count of 3,649 rides, followed by station 67 with 3,408 rides. Understanding popular start-stations helps in analyzing user preferences and planning station infrastructure.
End Station ID and Name: Similarly, the unique values in the end_station_id and end_station_name columns give us information about the destination of the bike rides. By analyzing the count of unique values, we can identify the most commonly used end stations. For instance, the station with ID 67 has the highest count of 4,624 rides, followed by station 58 with 3,709 rides. This helps us understand popular destinations and can be useful for station planning and optimizing bike availability.
User Type: The unique values in the user_type column provides insights into the types of users in the bike-sharing system. By examining the count of unique values, we can determine the proportion of subscribers and customers. In our dataset, there are 158,386 subscribers and 16,566 customers. This information helps us understand the user base and tailor services accordingly.
Member Gender: The unique values in the member_gender column gives insights into the gender distribution among the users. By analyzing the count of unique values, we find that there are 130,500 male users, 40,805 female users, and 3,647 users of other genders. Understanding gender distribution helps in designing targeted marketing campaigns and identifying any gender-based patterns or preferences.
Bike Share for All Trip: The unique values in the bike_share_for_all_trip column provides insights into whether users share their bikes for the entire trip or not. By examining the count of unique values, we find that the majority of users (157,606) do not share their bikes for the entire trip, while a smaller portion (17,346) do. This information helps in understanding the level of bike sharing and its impact on trip durations and bike availability.

These features provide insights into the usage patterns, popular routes, user preferences, and demographic distribution within the bike-sharing system. Analyzing start and end stations helps in identifying demand patterns, optimizing station locations, and improving the overall user experience. Additionally, understanding user types, gender distribution, and bike-sharing behavior helps in tailoring services and marketing strategies to meet user needs and preferences.

Univariate Exploration:

In this section, we examine individual variables in the dataset to gain insights into their distributions and characteristics. We analyze features like trip duration, user age, and bike types to understand their ranges, central tendencies, and potential outliers. Visualizations such as histograms, bar plots, and box plots will aid us in exploring these variables.

In this section, I will follow the Question-Visualization-Observation framework in these upcoming explorations. It is always good to use this framework in your analysis. It makes your project outstanding and well-structured.

You can check the codes in the repository by downloading the HTML file.

Question 1: What is the distribution of ride durations?

Observation:

The histogram shows the distribution of ride durations in seconds. Most rides have durations between 0 and 2000-4000 seconds, with the peak around 450-700 seconds. There are only a few rides with longer durations, indicating that the majority of rides are relatively short.

Question 2: How does the number of rides vary across different user types?

Observation:

The bar plot displays the number of rides recorded for each user type. Subscribers have a significantly higher number of rides compared to customers. This suggests that the bike-sharing system has a large base of subscribers who use it more frequently.

Question 3: What is the proportion of customers and subscribers?

Observation:

The pie chart illustrates the proportion of user types in the bike-sharing system. Subscribers make up the majority, accounting for 90.5% of the users, while customers represent the remaining 9.5%.

Question 4: Which start stations are the most popular?

Observation:

The bar plot displays the top 10 most popular start stations based on the number of rides. Market St at 10th St is the most popular start station, followed by San Francisco Caltrain Station 2 and Berry St at 4th St.

Question 5: Which end stations are the most popular?

Observation:

The visualization displays the top 10 popular end stations in the bike-sharing system. The bar plot provides insights into the frequency of usage for each end station. Based on the number of rides, we can determine the popularity of different end stations. The station with the highest count, "San Francisco Caltrain Station 2 (Townsend St at 4th St)," is the most frequently used end station, followed by "Market St at 10th St" and "Montgomery St BART Station (Market St at 2nd St)." These stations serve as popular destinations for bike rides, indicating high demand and usage.

Question 6: What is the distribution of member genders in the dataset?

Observation:

The count plot illustrates the number of rides recorded for each member's gender. The dataset contains a significantly higher number of male users compared to females and other genders. This indicates a gender imbalance in the bike-sharing system, with males being the predominant users.

Observation:

The count plot showcases the number of rides based on the bike-sharing behavior of users. The majority of users do not share their bikes for the entire trip, while a smaller portion engages in bike sharing. This suggests that bike sharing is not a widespread practice among the users in the bike-sharing system.

Observation:

From the pie chart, we can observe the following:

Approximately 9.9% of the users share bikes for the entire trip, as indicated by the "Yes" slice.
The majority of users (90.1%) do not share bikes for the entire trip, as shown by the "No" slice.
The chart highlights the difference between the two bike-sharing behaviors, making it easy to understand the proportion of users who choose to share bikes for the entire trip.

Question 9: What is the distribution of member birth years?

Observation:

The histogram represents the distribution of member birth years. The majority of users have birth years between 1980 and 2000, with the highest frequency observed around the late 1980s to early 1990s.

Question 10: How does the number of rides vary across different days of the week?

Observation:

The bar plot displays the number of rides recorded for each day of the week. We can observe that weekdays (Monday to Friday) have higher ride counts compared to weekends (Saturday and Sunday). The highest number of rides is recorded on Thursdays, while the lowest number is on Saturdays.

Question 11: How does the number of rides vary across different months of the year?

Observation:

The dataset only contains data for February, it means that there is no variation across different months, and visualizing the number of rides by month may not provide meaningful insights.

Bivariate Exploration:

Moving beyond individual variables, we now explore the relationships between pairs of variables. We can uncover interesting connections and dependencies by examining the interplay between factors like user age and trip duration or user type and bike usage patterns. Scatter plots, heatmaps, and grouped bar plots are some of the visualizations we'll employ to facilitate this analysis.

Question 12: What is the relationship between the user type and the ride duration?

Observation:

The box plot visualizes the relationship between the user type and the ride duration. Here are the observations:

Subscribers: The box plot for subscribers shows a relatively narrower and lower box compared to customers. The median ride duration for subscribers is around 500 seconds (approximately 8 minutes). The lower and upper quartiles indicate that the majority of ride durations for subscribers fall within the range of approximately 300 to 800 seconds (5 to 13 minutes).

Customers: The box plot for customers displays a wider and higher box compared to subscribers. The median ride duration for customers is around 800 seconds (approximately 13 minutes). The lower and upper quartiles suggest that the ride durations for customers are more varied, ranging from approximately 500 to 1400 seconds (8 to 23 minutes).

Question 13: How does the ride duration vary between different genders?

Observation:

The violin plot illustrates the variation in ride duration between different member genders. Here are the observations:

Male: The violin plot for males shows a relatively wider distribution compared to females and other genders. The majority of ride durations for males are concentrated around the range of approximately 300 to 1200 seconds (5 to 20 minutes). The peak of the distribution is slightly skewed towards shorter ride durations, indicating that males tend to have shorter trips on average.

Female: The violin plot for females displays a narrower distribution compared to males. The ride durations for females are concentrated around the range of approximately 400 to 1000 seconds (7 to 17 minutes). The distribution is more symmetrical, with a slightly higher peak at the median ride duration, suggesting a more consistent pattern of ride durations.

Other Genders: The violin plot for other genders shows a distribution similar to that of females, with ride durations concentrated around the range of approximately 400 to 1000 seconds (7 to 17 minutes). The distribution is relatively narrower compared to males, indicating a more consistent pattern of ride durations.

Insight

To conclude for this visualization, the violin plot highlights the variation in ride durations between different member genders. While males tend to have a wider range of ride durations, females and other genders exhibit more consistent patterns with narrower distributions. These findings suggest potential differences in riding behavior and trip purposes among different member genders.

Question 14: Is there a relationship between the top 10 start stations and the user type?

Observation:

The stacked bar plot visualizes the relationship between the top 10 start stations and user types. Here are the observations:

Market St at 10th St: The majority of rides from this start station are by subscribers, indicating that it is a popular choice among regular users of the bike-sharing system. The number of customer rides is relatively lower in comparison.

San Francisco Caltrain Station 2 (Townsend St at 4th St): Similar to the previous start station, subscribers make up the majority of rides, suggesting a preference among regular users. The number of customer rides is relatively lower.

Insight

The stacked bar plot provides insights into the user type distribution for the top 10 start stations. Subscribers are the predominant user type for most of the popular start stations, indicating a strong base of regular users in these areas. The number of customer rides is relatively lower, suggesting that these start stations are preferred by subscribers who use the bike-sharing system more frequently.

Question 15: Are there any relationships between the start day of the week, user type, and the duration of rides?

Observation:

The box plot visualizes the relationship between the start day of the week, user type, and the duration of rides. Here are the observations:

Overall, the box plot reveals that ride durations are influenced by both the start day of the week and the user type. Subscribers tend to have shorter and more consistent ride durations, while customers have longer and more variable ride durations. Weekends (Saturday and Sunday) exhibit slightly longer ride durations for both user types, possibly due to different usage patterns and leisurely rides during weekends.

Multivariate Exploration:

In this section, we expand our exploration further by considering the simultaneous relationships among multiple variables. By visualizing three or more variables together, we can uncover complex patterns and interactions. For example, we may investigate how trip duration varies across different user types and age groups, considering factors like gender or bike type. Through careful visualization and analysis, we'll derive meaningful insights from these multivariate relationships.

Question 16: How does the ride duration vary across different user types and member genders?

Observation:

The heatmap visualizes the mean ride duration across different user types and member genders. Here are the observations:

-Male subscribers have the shortest average ride duration, followed by female subscribers. Other customers have the longest average ride duration among all categories.

-The shortest ride durations are observed for male subscribers, with an average duration of around 616 seconds (10 minutes). Female subscribers have slightly longer average ride durations of approximately 696 seconds (12 minutes). Other customers have the longest average ride duration, exceeding 1602 seconds (26 minutes).

-There is a clear distinction in ride duration between subscribers and customers within each member's gender category. Subscribers tend to have shorter average ride durations compared to customers, regardless of their gender.

-The heatmap provides a comprehensive overview of the ride duration patterns across different user types and member genders, allowing for easy comparison and identification of trends.

Insight:

The findings suggest that the combination of user type and member gender plays a role in ride duration. While males generally have shorter average ride durations compared to females, the distinction between subscribers and customers within each gender category is more pronounced. Subscribers, regardless of their gender, have shorter average ride durations, indicating that they may use the bike-sharing system for more frequent and shorter trips, possibly for commuting or regular transportation purposes. On the other hand, other customers tend to have longer average ride durations, indicating a different usage pattern, potentially for leisure or recreational purposes.

Question 17: How does the ride duration vary across different user types, the day of the week?

Observation:

The point plot visualizes the ride duration across different combinations of user types, the day of the week, and member genders. Here are the observations:

For both subscribers and customers, ride durations are generally longer on weekends (Saturday and Sunday) compared to weekdays.

Insight:

The findings suggest that the combination of user type, and the day of the week influences ride durations. Weekends tend to have longer ride durations compared to weekdays for both subscribers and customers, indicating potential differences in usage patterns and trip purposes.

Observation:

The box plot matrix visualizes the relationship between ride duration, user type, member gender, and bike-sharing behavior. Here are the observations for each subplot:

Subplot 1: Ride duration by user type and member gender

For both subscribers and customers, females tend to have slightly longer ride durations compared to males and other genders. The difference in ride durations between genders is more noticeable among subscribers.

Subplot 2: Ride duration by user type and bike-sharing behavior

Subscribers who do not share bikes for the entire trip tend to have shorter ride durations compared to subscribers who do share bikes.

Subplot 3: Ride duration by member gender and bike-sharing behavior

Males and females who do not share bikes for the entire trip have similar ride durations. However, among those who share bikes, females tend to have slightly longer ride durations compared to males.

Subplot 4: Ride duration by user type, member gender, and bike-sharing behavior

Among both customers and subscribers, females who share bikes for the entire trip have the longest ride durations, followed by other genders who share bikes.

Subscribers who do not share bikes for the entire trip have shorter and less variable ride durations, regardless of the member's gender.

Conclusion:

As we conclude our analysis, we summarize the key findings and insights gained from exploring the Ford GoBike dataset. We reflect on the bike-sharing trends, user characteristics, and patterns we've uncovered, highlighting their significance for the San Francisco Bay area. We also discuss the potential implications of these insights for bike-sharing operators and the future of urban mobility. Finally, we suggest areas for further analysis and potential research directions to continue unraveling the intricacies of bike-sharing systems.

Throughout my data exploration, I gained valuable insights into the bike-sharing dataset, examining various aspects such as univariate distributions, relationships between variables, and multivariate interactions. Here are the key findings and reflections from my exploration:

User Type Distribution: I observed that the majority of users in the dataset were subscribers, indicating a strong base of regular users who potentially use the bike-sharing system for commuting or regular transportation purposes. On the other hand, customers represented a smaller portion of the user base, suggesting that they may be more casual or occasional users of the system.
Ride Duration: I found interesting patterns in ride durations. Subscribers tended to have shorter ride durations compared to customers, which aligns with the assumption that subscribers might use the system more frequently for shorter trips. As for customers, they had longer and more variable ride durations, suggesting a different usage pattern, potentially for leisure or recreational purposes.
Gender Differences: I observed slight variations in ride durations between different member genders. Females tended to have slightly longer ride durations compared to males, indicating potential differences in riding behavior and trip purposes. However, it's important to note that the differences between genders were relatively small and not as prominent as the differences observed between user types.
Start Station Analysis: By exploring the top 10 start stations, I found that subscribers were the dominant user type for the most popular start stations. This suggests a strong base of regular users in those areas, while customers represented a smaller portion of rides. This finding indicates that these start stations are preferred by subscribers who use the bike-sharing system more frequently.
Day of the Week Analysis: I observed that the day of the week played a role in ride durations. Generally, ride durations were longer on weekends (Saturday and Sunday) compared to weekdays for both subscribers and customers. This trend suggests potential differences in usage patterns and trip purposes on weekends, possibly related to leisure or recreational activities.
Multivariate Exploration: Through my multivariate exploration, I uncovered more complex relationships between variables. I observed interactions between user type, member gender, and ride duration, highlighting the importance of considering multiple factors simultaneously. The analysis also revealed interesting interactions between user type, the day of the week, and ride duration, providing insights into how riding behavior varies across different user types and days of the week.

In conclusion, this exploratory data analysis provided me with valuable insights into the bike-sharing dataset. I gained a better understanding of user types, ride durations, gender differences, and the influence of factors such as start stations and the day of the week on ride patterns.

Exploring the Ford GoBike Dataset: Uncovering Insights into Bike-Sharing in the San Francisco Bay Area

Table of Contents:

Introduction:

Preliminary Wrangling:

What is the structure of the dataset? Dataset Structure

What are the main features of interest in my dataset?Main Features of Interest

Univariate Exploration:

Question 1: What is the distribution of ride durations?

Observation:

Question 2: How does the number of rides vary across different user types?

Observation:

Question 3: What is the proportion of customers and subscribers?

Observation:

Question 4: Which start stations are the most popular?

Observation:

Question 5: Which end stations are the most popular?

Observation:

Question 6: What is the distribution of member genders in the dataset?

Observation:

Question 7: What is the distribution of bike-sharing behavior (bike_share_for_all_trip)?

Observation:

Question 8: How many users share bikes for the entire trip?

Observation:

Question 9: What is the distribution of member birth years?

Observation:

Question 10: How does the number of rides vary across different days of the week?

Observation:

Question 11: How does the number of rides vary across different months of the year?

Observation:

Question 12: What is the relationship between the user type and the ride duration?

Observation:

Question 13: How does the ride duration vary between different genders?

Observation:

Insight

Question 14: Is there a relationship between the top 10 start stations and the user type?

Observation:

Insight

Question 15: Are there any relationships between the start day of the week, user type, and the duration of rides?

Observation:

Multivariate Exploration:

Question 16: How does the ride duration vary across different user types and member genders?

Observation:

Insight:

Question 17: How does the ride duration vary across different user types, the day of the week?

Observation:

Insight:

Question 18: How does the ride duration vary between different user types, genders, and bike-sharing behaviors?

Observation:

Conclusion:

Comments

Post a Comment

Popular posts from this blog

Examples of Data Analysis in Various Industries

Do I get BENEFIT if I follow you? Introduce yourself and your blog.

Demystifying the Data Analysis Process: Unveiling Insights through Questioning, Wrangling, and Exploration

What is the structure of the dataset?
Dataset Structure

What are the main features of interest in my dataset?
Main Features of Interest