Beginner learning data analysis using Python

As a new grad with little knowledge of data analysis, I have always been fascinated by it, but I was clueless and hesitant about where to begin. Being an avid Quora reader, I asked on the platform where someone interested in data analysis should start. The answers suggested taking some tutorials on YouTube, so I did that, read a few articles, and started doing some research of my own. With a little guidance from my beloved partner, I began learning as an amateur and have kept going one phase at a time.

Here I will share how it started, the materials I used to learn, and the project I did for practice.

Brushing up on the basics:

1. I enrolled in Udemy courses for both SQL and Python: SQL Bootcamp from Zero to Hero by Jose Portilla, and 2021 Complete Python Bootcamp from Zero to Hero. They were among the best courses I could have taken to kickstart my learning, and they were super cool.

2. I read up on basic concepts on W3Schools and GeeksforGeeks.

3. When I get stuck understanding a piece of code, I usually turn to Stack Overflow.

4. I used Kaggle to explore data set projects. It makes it quite easy to pull up data sets and open-source files to run a project.

Photo by Myriam Jessier on Unsplash

I followed up with a simple pet project: IPL data analysis using Python.

1. Importing libraries: Make sure Python is installed on your computer. If it is not, it can be downloaded via this link.

Next, install the necessary libraries by running the pip command pip install library_name in the terminal.

pip install numpy
pip install plotly
pip install pandas
pip install matplotlib
pip install seaborn

Once the libraries are installed successfully, start importing them.
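
The snippets in this post use the short names pd, plt and sns, plus a current_directory variable. A minimal sketch of the imports that would define them is below. Setting current_directory with os.getcwd() is my assumption here; point it at whatever folder holds your data directory.

```python
import os

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Assumption: the data folder sits inside the directory the script is run from.
current_directory = os.getcwd()
```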

2. Loading the dataset: Data from a CSV (comma-separated values) file is loaded into a pandas data frame using the following command.

df = pd.read_csv("filename.csv")

In my case the file is in Excel format, hence I used pd.read_excel. The IPL data set files are uploaded on my GitHub.

xl = pd.read_excel(current_directory + "/data/ipl_data_set/Players.xlsx")
print(xl)
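
If you do not have a file at hand yet, here is a small sketch that can be run as is. It feeds pd.read_csv an in-memory text buffer instead of a file path. The three rows and the column names team and city are sample values I typed in for illustration; they are not taken from the IPL files.

```python
import io

import pandas as pd

# Illustrative sample standing in for a real CSV file on disk.
csv_text = """team,city
Mumbai Indians,Mumbai
Chennai Super Kings,Chennai
Kolkata Knight Riders,Kolkata
"""

# read_csv accepts a file path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))
print(sample)
```

Swapping io.StringIO(csv_text) for a path such as "filename.csv" gives the usual file-based call.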

3. Displaying the type of data: To find out what type of data the file holds and whether the columns have values or not, and to pull out the first five and last five rows of the file, here are the commands.

# Team information (teams is assumed to be a data frame loaded as in step 2)
teams.info()
print(teams.head())  # first five rows
print(teams.tail())  # last five rows

The output shows that the columns have no empty values (all are non-null) and that one column is of the object type.
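
To see what these three calls return without the IPL files, here is a small sketch on a seven-row frame. The column names and values are placeholders I made up, not real team data.

```python
import pandas as pd

# Hypothetical seven-row frame; the values are placeholders, not IPL data.
sample = pd.DataFrame({
    "team": ["A", "B", "C", "D", "E", "F", "G"],
    "titles": [1, 0, 2, 0, 3, 1, 0],
})

sample.info()               # prints column names, non-null counts and dtypes
first_five = sample.head()  # rows 0 to 4
last_five = sample.tail()   # rows 2 to 6
print(first_five)
print(last_five)
```

Both head() and tail() take an optional number, so sample.head(3) would return only the first three rows.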

4. Displaying the number of rows and columns in a data frame:

df.shape

5. Displaying column labels in the data frame:

df.columns
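
As a quick sketch of what these two attributes hold, the example below builds a tiny frame with made-up values and unpacks shape into a row count and a column count.

```python
import pandas as pd

# Made-up values, only here to have something to measure.
sample = pd.DataFrame({"batsman": ["X", "Y", "Z"], "total_runs": [10, 20, 30]})

n_rows, n_cols = sample.shape        # shape is a (rows, columns) tuple
column_names = list(sample.columns)  # columns is an Index of labels
print(n_rows, n_cols)
print(column_names)
```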

6. Values to be replaced: Sometimes the values in the data file need to be replaced. In my case, I wanted to shorten the team names, so I replaced them with the team initials.

deliveries.replace(
    ['Mumbai Indians', 'Kolkata Knight Riders', 'Royal Challengers Bangalore', 'Deccan Chargers', 'Chennai Super Kings',
     'Rajasthan Royals', 'Delhi Daredevils', 'Gujarat Lions', 'Kings XI Punjab', 'Sunrisers Hyderabad',
     'Rising Pune Supergiants', 'Kochi Tuskers Kerala', 'Pune Warriors', 'Rising Pune Supergiant'],
    ['MI', 'KKR', 'RCB', 'DC', 'CSK', 'RR', 'DD', 'GL', 'KXIP', 'SRH', 'RPS', 'KTK', 'PW', 'RPS'],
    inplace=True)
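
The same kind of replacement can also be written with a dictionary, which keeps each full name next to its short form. This is only a sketch of that alternative: the frame and its column names batting_team and bowling_team are hypothetical, and only two of the teams are included.

```python
import pandas as pd

# Hypothetical two-column frame; only the team names come from the list above.
sample = pd.DataFrame({
    "batting_team": ["Mumbai Indians", "Chennai Super Kings"],
    "bowling_team": ["Chennai Super Kings", "Mumbai Indians"],
})

short_names = {"Mumbai Indians": "MI", "Chennai Super Kings": "CSK"}

# replace() with a dict swaps every matching cell and returns a new frame.
sample = sample.replace(short_names)
print(sample)
```

Assigning the result back, instead of passing inplace=True, makes it easier to keep the original frame around if you need it.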

7. Data Visualization:

# Umpires who umpired the most (umpire1 column)
plt.subplots(figsize=(14, 6))
ax = matches['umpire1'].value_counts().plot.bar(width=0.8, color=sns.color_palette('bright', 20))
plt.xlabel("umpires", fontsize=14)
plt.ylabel("count", fontsize=15)
plt.title("umpire-1 who umpired the most", fontsize=15)
plt.show()

# Umpires who umpired the most (umpire3 column), with the count written above each bar
plt.subplots(figsize=(14, 6))
ax = matches['umpire3'].value_counts().plot.bar(width=0.8, color=sns.color_palette('Reds'))
for p in ax.patches:
    ax.annotate(format(p.get_height()), (p.get_x() + 0.15, p.get_height() + 1))
plt.xlabel("umpires", fontsize=14)
plt.ylabel("count", fontsize=15)
plt.title("umpire-3 who umpired the most (from highest to lowest)", fontsize=15)
plt.show()

# Teams that won the toss
plt.subplots(figsize=(14, 6))
ax = matches['toss_winner'].value_counts().plot.bar(width=0.8, color=sns.color_palette('RdPu', 20))
for p in ax.patches:
    ax.annotate(format(p.get_height()), (p.get_x() + 0.15, p.get_height() + 1))
plt.xlabel("teams", fontsize=14)
plt.ylabel("count", fontsize=15)
plt.title("teams that won the toss (from highest to lowest)", fontsize=15)
plt.show()
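
All three charts follow one pattern: count how often each value appears, draw the counts as bars, then write each count above its bar. Here is a stripped-down sketch of that pattern that runs without the matches file. The six toss results are made up, and the Agg backend is my choice so that it also runs on a machine without a display.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up toss results; the real column comes from the matches data.
toss_winner = pd.Series(["MI", "CSK", "MI", "KKR", "MI", "CSK"])

counts = toss_winner.value_counts()  # sorted from most to least frequent
fig, ax = plt.subplots(figsize=(6, 4))
counts.plot.bar(ax=ax, width=0.8)

# One patch per bar: write each bar's height just above it.
for p in ax.patches:
    ax.annotate(format(p.get_height()), (p.get_x() + 0.15, p.get_height() + 0.05))

ax.set_xlabel("teams")
ax.set_ylabel("count")
```

In a notebook or an interactive session, drop the matplotlib.use line and finish with plt.show() to see the chart.
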
# Load the batting statistics used in the plots that follow
most_runs_average_strikerate = pd.read_csv(current_directory + '/data/ipl_data_set/most_runs_average_strikerate.csv')
most_runs_average_strikerate.info()

Scatter plot: Correlation between the total runs scored and the total number of balls for the first 300 players in the file.

import plotly.graph_objs as go
from plotly.offline import iplot  # renders the figure inside a Jupyter notebook

df = most_runs_average_strikerate.iloc[:300, :]

# Creating trace1
trace1 = go.Scatter(
    x=df.batsman,
    y=df.total_runs,
    mode="lines",
    name="total runs",
    marker=dict(color='rgba(16, 112, 2, 0.8)'),
    text=df.total_runs)
# Creating trace2
trace2 = go.Scatter(
    x=df.batsman,
    y=df.numberofballs,
    mode="lines+markers",
    name="number of balls",
    marker=dict(color='rgba(80, 26, 80, 0.8)'),
    text=df.total_runs)
data = [trace1, trace2]
layout = dict(title='total runs & number of balls vs batsman of Top 300 members',
              xaxis=dict(title='batsman', ticklen=5, zeroline=False))
fig = dict(data=data, layout=layout)
iplot(fig)
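
The plot shows the relationship visually. If you also want a single number for it, pandas can compute the Pearson correlation coefficient between two columns. This is just a sketch on five made-up rows, reusing the column names total_runs and numberofballs from the data frame above; I did not run it on the real file.

```python
import pandas as pd

# Made-up figures for five batsmen; column names follow the data frame above.
sample = pd.DataFrame({
    "total_runs": [100, 250, 400, 520, 700],
    "numberofballs": [90, 200, 310, 400, 560],
})

# Pearson correlation coefficient between the two columns (ranges from -1 to 1).
r = sample["total_runs"].corr(sample["numberofballs"])
print(round(r, 3))
```

A value close to 1 means the two columns rise together almost in a straight line.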

Creating a 3D scatter plot:

import plotly.express as px

df = most_runs_average_strikerate.iloc[:100, :]
fig = px.scatter_3d(df, x='batsman', y='total_runs', z='numberofballs')
fig.show()

These are a few of the data visualizations I used for the analysis. All the datasets I used are available on GitHub.

Thank you.

Happy learning!!

New grad, data enthusiast, ex-banker, and business process analyst. Mother of a toddler, coffee lover, and amateur gardener.