I understand you.
When we start something new, we tend to get so excited that we begin to google a lot of random stuff about it. And it becomes almost inevitable to get lost in the endless sea of information available on the internet. But with some effort, we can filter what is important to know based on our current level.
In this series of posts, you will get valuable information about three Python libraries that you should definitely look into:
- Pandas
- NumPy
- Matplotlib
You have probably already heard a thing or two about some of these. But if you didn't pay attention before - maybe because you thought they were not worth it - this is your chance to redeem yourself.
To better illustrate the next topics, consider the following dataset for demonstration purposes:
| | PurchaseDate | Region | State | Seller | Item | Units | UnitPrice |
|---|---|---|---|---|---|---|---|
| 0 | 10-Jun-2020 | Northeast | Bahia | Tobias | Stove | 62 | 400.99 |
| 1 | 11-Jun-2020 | Southeast | São Paulo | Nadia | Fridge | 29 | 100.99 |
| 2 | 3-Aug-2020 | Northeast | Ceará | Carlos | Stove | 55 | 1200.49 |
| 3 | 22-Aug-2020 | Northeast | Bahia | Pedro | Fridge | 81 | 1900.99 |
| 4 | 26-Aug-2020 | Midwest | Goiás | Tania | Blender | 42 | 2300.95 |
| 5 | 10-Sep-2020 | Northeast | Sergipe | Tobias | Carpet | 35 | 400.99 |
| 6 | 12-Sep-2020 | North | Pará | Carlos | Carpet | 3 | 2750.00 |
| 7 | 7-Oct-2020 | Northeast | Sergipe | Nadia | Blender | 2 | 1250.00 |
| 8 | 15-Oct-2020 | North | Amazonas | Pedro | Stove | 7 | 1000.29 |
| 9 | 27-Nov-2020 | Southeast | São Paulo | Nadia | Fridge | 16 | 1500.99 |
| 10 | 13-Dec-2020 | South | Paraná | Tania | Blender | 76 | 1450.99 |
file: order_data.csv
In this first post, we'll start with Pandas. If you want to check a specific functionality, use the table below to go straight to the point.
| Load and Transform | Visualize | Locate | Summarize |
|---|---|---|---|
| read_csv | head | loc | describe |
| read_excel | tail | iloc | info |
| sort_values | shape | duplicated | sum |
| set_index | index | query | count |
| reset_index | columns | df['col'] | min |
| drop | dtypes | df.your_col | max |
| copy | isnull | | mean |
| | values | | median |
| | | | corr |
Pandas
image by Eric Baccega
Pandas is a very powerful Python library widely used by data scientists and analysts for both manipulating and analyzing data. It also works well with many other Python modules, and its main advantage is an intuitive and practical interface that doesn't compromise on functionality.
For convenience, pandas is usually loaded into the project with the alias `pd`, as shown below:
import pandas as pd
This way, we can use the short alias instead of typing the whole package name every time we want to call a pandas function. This library also gives us two types of structures that make data manipulation easier: Series and DataFrames.
Series
According to the pandas official documentation, a series is a one-dimensional ndarray (an array belonging to the NumPy class `ndarray`) with axis labels.
OK, that may sound confusing at first. Put more simply, a pandas Series is nothing but a one-dimensional array (it has a single dimension) which can store any sort of data, with labels or indexes along its axis. In short, it is like a column of a DataFrame. Let's see how to create a Series in pandas and how to identify its structure. The following code can be replicated in your Jupyter notebook:
# imports
In [1]: import pandas as pd
# creates a series to store people names
In [2]: names = pd.Series(['Carlos', 'Sara', 'Louise', 'James'])
In [3]: names
Out[3]: 0    Carlos
        1      Sara
        2    Louise
        3     James
        dtype: object
Note that, since indexing in Python begins at 0, the index range of the series goes from 0 to n-1, where n is the number of elements in your series. This is very important to remember, and mixing it up can cause a lot of bugs in your code.
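As a quick check, here is what accessing elements by those default labels looks like, continuing the notebook session above:
# the first element sits at label 0, the last at n-1
In [4]: names[0]
Out[4]: 'Carlos'
In [5]: names[3]
Out[5]: 'James'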
We can also see that, since I haven't specified any index for my series, pandas automatically inserts the default indexation. However, if I want different labels for my axis, I can specify them when creating the object. See how it looks:
# creates a series to store animals species
In [2]: animals = pd.Series(['Dog', 'Elephant', 'Fox', 'Eagle'],
                            index=['A', 'B', 'C', 'D'])
In [3]: animals
Out[3]: A         Dog
        B    Elephant
        C         Fox
        D       Eagle
        dtype: object
These are, in fact, very simple examples. But they should help you get an idea of what a series looks like.
Dataframes
Unlike a series, a pandas DataFrame is a two-dimensional tabular structure where data is labeled by its own combination of column and row. This structure is size-mutable and potentially heterogeneous. That is, we can easily create a DataFrame with two columns and two rows, and then add new columns and rows to the same object. And we can store different types of data in the same DataFrame, which can be very convenient in many situations.
Let's see how it works on the jupyter notebook:
# imports
In [1]: import pandas as pd
# creates a dataframe to store people names, ages, and heights
In [2]: names = pd.DataFrame([['Carlos', 27, 1.78], ['Sara', 12, 1.35],
['Louise', 35, 1.62], ['James', 18, 1.87]],
columns=['name', 'age', 'height'],
index=['i', 'ii','iii', 'iv'])
In [3]: names
Out[3]: name age height
i Carlos 27 1.78
ii Sara 12 1.35
iii Louise 35 1.62
iv James 18 1.87
Can you see how they differ? Now we have a table storing a set of information that we can access either by index - returning all the information in the row - or by a combination of index and column - returning a single desired value.
And if we just want regular indexing, we only need to remove the `index` argument from the function.
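For instance, here is the same DataFrame built without the `index` argument, falling back to the default 0-based labels:
# same data, default integer index
In [4]: pd.DataFrame([['Carlos', 27, 1.78], ['Sara', 12, 1.35],
                      ['Louise', 35, 1.62], ['James', 18, 1.87]],
                     columns=['name', 'age', 'height'])
Out[4]:      name  age  height
        0  Carlos   27    1.78
        1    Sara   12    1.35
        2  Louise   35    1.62
        3   James   18    1.87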
See the official documentation for more information.
Just to remind you, all the operations from now on will use the dataset we defined at the very beginning of this post. Also, note that you will face many abbreviations and acronyms on your way through the data science world. I will provide a cheat sheet in later posts.
Now, let us finally see what pandas can bring us!
Loading data
Pandas has many functions to load data into your project. You can pull data from a CSV - a plain text file - or an Excel spreadsheet, for instance. But it is also possible to get information from SQL tables and queries, HTML tables, JSON strings, Google BigQuery, Stata .dta files, and so on. See here all the options pandas offers.
Here are two frequently used functions:
pd.read_csv
For comma-separated text files.
# locate and indicate your file path
df = pd.read_csv('C:\...\order_data.csv')
pd.read_excel
For excel files.
# locate and indicate your file path
# you can also indicate the sheet name if there is more than one
df = pd.read_excel('C:\...\your_file_here.xlsx',
                   sheet_name='sheet_name_here')
Transforming data
Sometimes we only need to perform some basic transformations on our DataFrame, and that's where these functions come in handy.
df.sort_values
Sort the DataFrame by the values of a chosen column or columns. It is possible to sort at more than one level by passing several columns as parameter. You can also choose the direction of the sorting by setting `ascending` to `True` or `False`.
See the documentation.
# sorts in ascending order by the column 'Seller'
df.sort_values(by='Seller', ascending=True)
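To sort by more than one column, pass a list. Here is a quick sketch on our dataset, with a direction set per column:
# sorts by 'Region' first, then by 'Units' in descending order within each region
df.sort_values(by=['Region', 'Units'], ascending=[True, False])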
df.set_index
Set an existing DataFrame column as the index. You can either use it to replace the original index or to expand it.
See the documentation.
# using append=True to expand the index
# column 'Seller' is picked
df.set_index('Seller', append=True)
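And to replace the original index instead of expanding it, just omit `append`, which defaults to False:
# 'Seller' replaces the default index entirely
df.set_index('Seller')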
df.reset_index
Reset the index of a DataFrame, using the default one instead (the default index begins at 0). You can either drop the current index or insert it back as a column of the DataFrame.
See the documentation
# drop=False keeps the current index as a column
df.reset_index(drop=False)
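Conversely, if you don't want to keep the old index at all:
# drop=True discards the current index instead of keeping it as a column
df.reset_index(drop=True)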
df.drop
Remove rows or columns by specifying row index or column name to drop.
For `axis=0`, the function searches through the DataFrame index. For `axis=1`, it searches through its columns.
See the documentation
# removes PurchaseDate from the columns
df.drop('PurchaseDate', axis=1)
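Dropping rows works the same way, except you pass index labels with `axis=0`:
# removes the rows labeled 0 and 1 from the index
df.drop([0, 1], axis=0)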
df.copy
Make a copy of the object's indices and data. If you set `deep=True`, none of the modifications to the original object will be reflected in the copy. However, setting `deep=False` makes a shallow copy, and modifications to the data of either the original or the copy will be reflected in the other.
See the documentation
# creates a copy df
# deep=True is the default parameter
df_2 = df.copy()
print(df_2)
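For comparison, here is a shallow copy. Treat this as a sketch: recent pandas versions are moving toward copy-on-write behavior, which changes how much is actually shared.
# deep=False shares the underlying data with the original
df_3 = df.copy(deep=False)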
Visualizing data
df.head
Return the first n rows. If `n` is not specified, it returns the first 5 rows by default. It is also possible to return everything except the last n rows by passing a negative value for `n`.
See the documentation
# returns the first 3 rows
df.head(3)
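And with a negative n, as mentioned above:
# returns everything except the last 3 rows
df.head(-3)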
df.tail
Return the last n rows. If `n` is not specified, it returns the last 5 rows by default. It is also possible to return everything except the first n rows by passing a negative value for `n`.
See the documentation
# returns everything except the first 6 rows
df.tail(-6)
df.shape
Return a tuple representing the dimensionality of the DataFrame. The first element represents the total number of rows, and the second element is the total number of columns.
See the documentation
# returns the DataFrame dimensionality
df.shape
df.index
Return the index (row labels) of the DataFrame. If the object uses the default indexation, it will return a RangeIndex object with `start`, `stop`, and `step` parameters.
See the documentation
# returns rows labels
df.index
df.columns
Return the column labels of the DataFrame.
See the documentation
# returns columns labels
df.columns
df.dtypes
Return a Series with the data type of each column. Columns with mixed types are stored with the `object` dtype.
See the documentation
# returns the data type of each column
df.dtypes
df.isnull
Used to detect missing values. As a result, it returns a boolean object with the same shape as the original, where True marks a missing value.
See the documentation
# detects missing values and returns as booleans
df.isnull()
df.values
Return an array-like representation of the DataFrame. This property takes all the values of the object and returns each row as a list of values stored in a NumPy array.
See the documentation
# returns the df values as lists into a numpy array
df.values
Locating data
df.loc
Access a group of rows and columns by their labels. You can use this property to access a single item, an entire row, or any slice of rows across a column or set of columns passed as input.
See the documentation for all the input possibilities.
# all rows up to and including label 3, columns Seller and Item
df.loc[:3, ['Seller', 'Item']]
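You can also pick out a single value by combining a row label and a column label:
# the value at row label 2, column 'Seller'
df.loc[2, 'Seller']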
df.iloc
It locates and accesses items of a DataFrame by integer position. Passing a single integer returns all the information in that row, but it also accepts slices and lists of positions.
See the documentation
# selects the indexed row 2 of the DataFrame
df.iloc[2]
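A quick example of slicing by position:
# first three rows and first two columns, by position
df.iloc[0:3, 0:2]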
df.duplicated
Return a boolean Series indicating duplicated rows. If no subset of columns is passed, this function considers rows duplicated only if they match entirely. You can also set the parameter `keep` to `{'first', 'last', False}` to indicate whether you want to mark all duplicates except the first or last occurrence as True, or mark all of them as True.
See the documentation
# check if there are duplicated rows when considering Seller and Item only
df.duplicated(['Seller', 'Item'])
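And with the `keep` parameter described above:
# marks every duplicated row as True, including the first occurrence
df.duplicated(['Seller', 'Item'], keep=False)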
df.query
Query the columns of a DataFrame and return the rows where the passed expression is True.
See the documentation
# queries all the values and returns where Seller is Carlos
df.query("Seller == 'Carlos'")
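Expressions can also combine conditions and compare numeric columns, as in this sketch:
# orders of more than 50 units placed in the Northeast region
df.query("Units > 50 and Region == 'Northeast'")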
df['col']
This is one of the most common and simple ways to locate all the values in an entire column of a DataFrame. All you have to do is pass the column name inside the brackets.
# returns all the values into the column UnitPrice
df['UnitPrice']
df.your_col
For a similar result to the previous locating method, you can also call the column name as if it were an attribute of the DataFrame.
# returns all the values into the column UnitPrice
df.UnitPrice
Summarizing data
df.describe
Generate a table of descriptive statistics. It includes measures such as count, mean, standard deviation, minimum, maximum, and percentiles.
See the documentation
# descriptive statistics with customized percentiles
df.describe(percentiles=[0.2, 0.5, 0.8])
df.info
Print a summary of the DataFrame. This summary includes dtypes, columns, and non-null values.
See the documentation
# prints summary info
df.info()
df.sum
Return the sum of the values. You can access a column first to get the result for that specific column.
See the documentation
# returns the sum for the column Units
df.Units.sum()
df.count
Count the cells with non-NA values for each row or column. NA values here are None, NaN, and NaT (and, depending on your pandas settings, numpy.inf).
See the documentation
# counts non-null cells for each column
df.count()
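To count along rows instead, pass `axis=1`:
# counts non-null cells for each row
df.count(axis=1)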
df.min
Return the minimum of the values over the requested axis. With `axis=0` (the default) you get the minimum of each column, while `axis=1` returns the minimum of each row.
See the documentation
# minimum value of each column
df.min()
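Since our dataset mixes text and numeric columns, a row-wise minimum only makes sense over the numeric ones. A quick sketch:
# row-wise minimum across the numeric columns only
df[['Units', 'UnitPrice']].min(axis=1)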
df.max
Return the maximum of the values over the requested axis.
See the documentation
# maximum value over the column UnitPrice
df.UnitPrice.max()
df.mean
Return the mean of the values over the requested axis.
See the documentation
# mean value over the column UnitPrice
df.UnitPrice.mean()
df.median
Return the median of the values over the requested axis.
See the documentation
# median value over the column UnitPrice
df.UnitPrice.median()
df.corr
Compute pairwise correlations of the DataFrame columns. It's very useful for understanding the correlation between two variables at a glance. A positive correlation indicates that the two variables move in the same direction, while a negative correlation indicates that they move in opposite directions. A correlation of 0, on the other hand, indicates no linear relationship between the variables.
See the documentation
# checking for correlations between the numeric variables
df.corr()
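A heads-up: newer pandas releases no longer silently drop non-numeric columns here, so on recent versions you may need to be explicit:
# restricts the computation to numeric columns on recent pandas versions
df.corr(numeric_only=True)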
Summary
We covered many useful functions of the pandas Python package. This was just a light demonstration of how the package can be used when working with data. For the purpose of this post, the chosen examples were pretty simple; but when we are working with large datasets, or creating machine learning models, these functions are undoubtedly life savers.
Don't settle for these simple examples: check the documentation to get a better understanding of everything you can do, and try it for yourself on your own dataset. You'll see how much easier things get once you give it a try and compare the results yourself!