
Python Pandas DataFrame basics

Pandas DataFrame

A Pandas DataFrame is a 2-dimensional, labeled data structure with columns of potentially different types. You can think of it like a spreadsheet, a database table, or a dict of Series objects. It is the most commonly used pandas object. A Pandas DataFrame accepts many different kinds of input: a dict of one-dimensional ndarrays, lists, dicts, or Series; a two-dimensional ndarray; a structured or record ndarray; a dictionary of Series; or another DataFrame.

df = pd.DataFrame(data=None, index=None, columns=None)

Here, df is the Pandas DataFrame and data can be a NumPy ndarray, dict, or DataFrame. Along with the data, you can optionally pass an index (row labels) and columns (column labels) as arguments. If you pass an index and/or columns, you are guaranteeing the index and/or columns of the resulting Pandas DataFrame. If they are not provided, both index and columns default to range(n), where n is the length of the data. When the data is a dictionary and columns are not specified, the DataFrame's column labels will be the dictionary's keys.
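For instance, here is a minimal sketch (the array contents and labels are made up purely for illustration) showing how explicitly passed index and columns labels replace the default range(n) labels:

>>> import numpy as np
>>> import pandas as pd
>>> data = np.arange(6).reshape(3, 2)   # 3 rows, 2 columns: [[0, 1], [2, 3], [4, 5]]
>>> pd.DataFrame(data)                  # no labels passed, so both default to range(n)
   0  1
0  0  1
1  2  3
2  4  5
>>> pd.DataFrame(data, index=['r1', 'r2', 'r3'], columns=['c1', 'c2'])
    c1  c2
r1   0   1
r2   2   3
r3   4   5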




Create Pandas DataFrame from Dictionary of Series/Dictionaries

>>> import pandas as pd
>>> dict_series = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
... 'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
>>> df = pd.DataFrame(dict_series)
>>> df
one two
a 1.0 1.0
b 2.0 2.0
c 3.0 3.0
d NaN 4.0
>>> df.shape
(4, 2)
>>> df.index
Index(['a', 'b', 'c', 'd'], dtype='object')
>>> df.columns
Index(['one', 'two'], dtype='object')
>>> list(df.columns)
['one', 'two']
>>> dicts_only = {'a':[1,2,3], 'b':[4,5,6]}
>>> dict_df = pd.DataFrame(dicts_only)
>>> dict_df
a b
0 1 4
1 2 5
2 3 6
>>> dict_df.index
RangeIndex(start=0, stop=3, step=1)

You need to import the pandas library first. Create a dictionary whose values are Series of one-dimensional data; you can create a Pandas DataFrame from such a dictionary of Series. Pass the dict_series dictionary as an argument to the DataFrame() class, which returns a Pandas DataFrame object. Index labels are passed as a list to the index parameter of each Series. If the index labels specified in the various Series are not the same, then the resulting index will be the union of all the index labels of the various Series. In DataFrame df, under column "one" there is no data element associated with index label "d", so NaN is inserted at that position. The number of rows and columns in a Pandas DataFrame is obtained using the shape attribute. Get the index labels of the DataFrame using the index attribute. With the columns attribute, you get all the columns of the DataFrame. Get the columns of the DataFrame as a list by passing the columns attribute to the list() function. You can also create a Pandas DataFrame from a dictionary without using the index and columns arguments. For the dict_df DataFrame, a and b are columns and the index labels are integers ranging from zero to two.
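As a quick sketch continuing the dict_series example above (assuming the same session), explicitly passed index and columns labels reindex the Series and select a subset of the columns:

>>> pd.DataFrame(dict_series, index=['d', 'b', 'a'])
   one  two
d  NaN  4.0
b  2.0  2.0
a  1.0  1.0
>>> pd.DataFrame(dict_series, index=['d', 'b', 'a'], columns=['two'])
   two
d  4.0
b  2.0
a  1.0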


Create DataFrame from ndarrays/lists/list of dictionaries

>>> import numpy as np
>>> import pandas as pd
>>> dict_ndarrays = {'one': np.random.random(5), 'two':np.random.random(5)}
>>> pd.DataFrame(dict_ndarrays)
one two
0 0.346580 0.827881
1 0.738850 0.577504
2 0.969715 0.781170
3 0.668432 0.746535
4 0.709333 0.440675
>>> pd.DataFrame([[1,2,3,4,5], [6,7,8,9,10]])
0 1 2 3 4
0 1 2 3 4 5
1 6 7 8 9 10
>>> dict_lists = {'one': [1, 2, 3, 4, 5], 'two': [5, 4, 3, 2, 1]}
>>> pd.DataFrame(dict_lists)
one two
0 1 5
1 2 4
2 3 3
3 4 2
4 5 1
>>> pd.DataFrame(dict_lists, index=['a', 'b', 'c', 'd', 'e'])
one two
a 1 5
b 2 4
c 3 3
d 4 2
e 5 1
>>> lists_dicts = [{'a':1, 'b':2}, {'a':5, 'b':10, 'c':20}]
>>> pd.DataFrame(lists_dicts)
a b c
0 1 2 NaN
1 5 10 20.0

The pandas library is built on top of NumPy. Here, dict_ndarrays is a dictionary of ndarrays from which you can create a Pandas DataFrame. Nested lists can also be used to create a Pandas DataFrame. If no index and columns are specified, then both index and columns will have integer labels. Keys are treated as column labels when a Pandas DataFrame is created from a dictionary, and the DataFrame columns are preserved in the same order as the dictionary keys. In the next example, index labels are specified for a DataFrame created from a dictionary. A Pandas DataFrame can also be created from a list of dictionaries. Since the DataFrame columns will be the union of all the keys in the list of dictionaries, elements for missing columns will be NaN.
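Similarly, as a small sketch reusing dict_lists and lists_dicts from above, the columns argument can be used to select and reorder the columns that end up in the DataFrame:

>>> pd.DataFrame(dict_lists, columns=['two', 'one'])   # reorder the columns
   two  one
0    5    1
1    4    2
2    3    3
3    2    4
4    1    5
>>> pd.DataFrame(lists_dicts, columns=['a', 'c'])      # keep only columns 'a' and 'c'
   a     c
0  1   NaN
1  5  20.0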


Pandas DataFrame Column Selection, Addition and Deletion

>>> import pandas as pd
>>> la_liga = {"Ranking":[1,2,3], "Team": ["Barcelona", "Atletico Madrid", "Real
Madrid"]}
>>> df = pd.DataFrame(la_liga)
>>> df
Ranking Team
0 1 Barcelona
1 2 Atletico Madrid
2 3 Real Madrid
>>> df['Team']
0 Barcelona
1 Atletico Madrid
2 Real Madrid
Name: Team, dtype: object
>>> df['Played'] = [34, 36, 38]
>>> df['Won'] = [27, 23, 22]
>>> df[['Played', 'Won']]
Played Won
0 34 27
1 36 23
2 38 22
>>> df['Points'] = df['Won'] * 2
>>> df
Ranking Team Played Won Points
0 1 Barcelona 34 27 54
1 2 Atletico Madrid 36 23 46
2 3 Real Madrid 38 22 44
>>> df['Lost'] = [1, 5, 6]
>>> df
Ranking Team Played Won Points Lost
0 1 Barcelona 34 27 54 1
1 2 Atletico Madrid 36 23 46 5
2 3 Real Madrid 38 22 44 6
>>> df['Drawn'] = df['Played'] - df['Won'] - df['Lost']
>>> df
Ranking Team Played Won Points Lost Drawn
0 1 Barcelona 34 27 54 1 6
1 2 Atletico Madrid 36 23 46 5 8
2 3 Real Madrid 38 22 44 6 10
>>> df['Year'] = 2018
>>> df
Ranking Team Played Won Points Lost Drawn Year
0 1 Barcelona 34 27 54 1 6 2018
1 2 Atletico Madrid 36 23 46 5 8 2018
2 3 Real Madrid 38 22 44 6 10 2018
>>> del df['Year']
>>> df.pop('Drawn')
0 6
1 8
2 10
Name: Drawn, dtype: int64
>>> df.insert(5, 'Goal Difference', [63, 38, 42])
>>> df
Ranking Team Played Won Points Goal Difference Lost
0 1 Barcelona 34 27 54 63 1
1 2 Atletico Madrid 36 23 46 38 5
2 3 Real Madrid 38 22 44 42 6
 >>> df.rename(columns = {'Team':'Club Team'})
Ranking Club Team Played Won Points Goal Difference Lost
0 1 Barcelona 34 27 54 63 1
1 2 Atletico Madrid 36 23 46 38 5
2 3 Real Madrid 38 22 44 42 6

Create the Pandas DataFrame df from the la_liga dictionary. You can select a particular column of a DataFrame by specifying the column name within quotes inside square brackets. You can add a new column to the DataFrame by specifying a new column label within the brackets and assigning data elements to it. Grab multiple columns from a DataFrame by passing a list of column labels. You can also create a new column from the data elements of existing columns: the column "Points" is inserted into df by multiplying all the data elements in column "Won" by 2, and you can perform other basic arithmetic operations on DataFrame columns in the same way. When inserting a scalar value, it will automatically be propagated to fill the whole column. Columns can be deleted with del or popped with pop(). By default, new columns get inserted at the end; the insert() method is available to insert a column at a specific position. You can rename a column label using the rename() method: pass the columns parameter a dictionary where the old column label is the key and the new column label is the value. Note that del, pop() and insert() modify df in place, whereas rename() returns a modified copy and leaves df unchanged unless you pass inplace=True or assign the result back, as shown below.
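A short sketch of that last point, continuing with the df built above (column order as shown in the previous output):

>>> df.columns.tolist()   # rename() above returned a copy, so df still has 'Team'
['Ranking', 'Team', 'Played', 'Won', 'Points', 'Goal Difference', 'Lost']
>>> df = df.rename(columns={'Team': 'Club Team'})   # assign the copy back (or use inplace=True)
>>> df.columns.tolist()
['Ranking', 'Club Team', 'Played', 'Won', 'Points', 'Goal Difference', 'Lost']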



Displaying Data in Pandas DataFrame

>>> import pandas as pd
>>> df = pd.DataFrame({'WorldCup_Winner':["Brazil", "Germany", "Argentina",
"Brazil", "Spain"], 'Year':[1962, 1974, 1986, 2002, 2010]})
>>> df.columns
Index(['WorldCup_Winner', 'Year'], dtype='object')
>>> df.head(2)
WorldCup_Winner Year
0 Brazil 1962
1 Germany 1974
>>> df.tail(2)
WorldCup_Winner Year
3 Brazil 2002
4 Spain 2010
>>> df['WorldCup_Winner'].unique()
array(['Brazil', 'Germany', 'Argentina', 'Spain'], dtype=object)
>>> df['WorldCup_Winner'].unique().tolist()
['Brazil', 'Germany', 'Argentina', 'Spain']
>>> df.transpose()
0 1 2 3 4
WorldCup_Winner Brazil Germany Argentina Brazil Spain
Year 1962 1974 1986 2002 2010
>>> df.sort_values(by=['Year'], ascending = False)
WorldCup_Winner Year
4 Spain 2010
3 Brazil 2002
2 Argentina 1986
1 Germany 1974
0 Brazil 1962
>>> df.sort_index(ascending = False)
WorldCup_Winner Year
4 Spain 2010
3 Brazil 2002
2 Argentina 1986
1 Germany 1974
0 Brazil 1962
>>> df['WorldCup_Winner'].value_counts()
Brazil 2
Argentina 1
Germany 1
Spain 1
Name: WorldCup_Winner, dtype: int64
>>> df['WorldCup_Winner'].value_counts().index.tolist()
['Brazil', 'Argentina', 'Germany', 'Spain']
>>> df['WorldCup_Winner'].value_counts().values.tolist()
[2, 1, 1, 1]

The DataFrame head(n) method returns the first n rows and the tail(n) method returns the last n rows. You can find the unique data elements in a column by chaining the unique() method with a DataFrame column using dot notation. The unique() method returns a one-dimensional array-like object, which can be converted to a list using the tolist() method. The transpose() method flips the DataFrame over its main diagonal, writing rows as columns and vice versa. The syntax of the sort_values() method is,

df.sort_values(by, axis=0, ascending=True)

where the by parameter can be a string, a list of strings, an index label, a column label, a list of index labels, or a list of column labels to sort by. If the value of axis is 0, then by may contain column labels; if the value of axis is 1, then by may contain index labels. By default, the value of the axis parameter is 0. The default value of the ascending parameter is True, in which case the data elements are sorted in ascending order; a False value sorts the data elements in descending order. By default, the sort_index() method performs sorting on the row labels in ascending order and returns a copy of the Pandas DataFrame; if the ascending parameter is set to False, then sort_index() sorts in descending order. The value_counts() method, when chained with a Pandas DataFrame column, returns a Series containing the counts of unique values. The resulting items are in descending order of count, so the first element is the most frequently occurring one. NA values are excluded by default. The index attribute returns the index (row labels) of that Series, and the values attribute returns a NumPy representation of it.
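For example, a brief sketch using the same df: by can be a list of column labels and ascending can be a matching list of flags, so you can sort on more than one key at once:

>>> df.sort_values(by=['WorldCup_Winner', 'Year'], ascending=[True, False])
  WorldCup_Winner  Year
2       Argentina  1986
3          Brazil  2002
0          Brazil  1962
1         Germany  1974
4           Spain  2010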


Using Pandas DataFrame assign() method

>>> import pandas as pd
>>> df_mountain = pd.DataFrame({"Mountain":['Mount Everest', 'K2',
'Kangchenjunga'], "Length":[8848, 8611, 8586]})
 >>> df_mountain.assign(Ranking = [1, 2, 3])
Length Mountain Ranking
0 8848 Mount Everest 1
1 8611 K2 2
2 8586 Kangchenjunga 3
 >>> df = pd.DataFrame({'A':[2, 4, 6], 'B':[3, 6, 9]})
>>> df.assign(C = lambda x:x['A'] ** 2)
A B C
0 2 3 4
1 4 6 16
2 6 9 36

DataFrame has an assign() method that allows you to easily create new columns that are potentially derived from existing columns. The assign() method always returns a copy of the data, leaving the original Pandas DataFrame untouched.
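A minimal sketch of that behavior (assuming Python 3.6 or later, where later keyword arguments to assign() may refer to columns created earlier in the same call):

>>> df.assign(C=lambda x: x['A'] + x['B'], D=lambda x: x['C'] * 10)
   A  B   C    D
0  2  3   5   50
1  4  6  10  100
2  6  9  15  150
>>> df          # the original DataFrame is untouched
   A  B
0  2  3
1  4  6
2  6  9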

Pandas DataFrame Indexing and Selecting Data

The Python and NumPy indexing operator [] and the dot operator . provide quick and easy access to a subset of the data elements in a Pandas DataFrame across a wide range of use cases. However, since the type of the data to be accessed isn't known in advance, directly using the standard operators has some optimization limits. For production code, it is highly recommended that you take advantage of the optimized pandas data access methods .loc[] and .iloc[], which are used to retrieve rows and columns. Note that .loc[] and .iloc[] are followed by square brackets [], not parentheses (), and are called indexers. The .loc[] indexer is primarily label based, but may also be used with a Boolean array; it raises KeyError when the items are not found. The inputs accepted by .loc[] are: a single label, e.g. 5 or 'a' (note that 5 is interpreted as a label of the index, not as an integer position along the index); a list or array of labels ['a', 'b', 'c']; a slice of labels 'a':'f' (note that contrary to usual Python slices, both the start and the stop are included when present in the index); and a Boolean array. The .iloc[] indexer is primarily integer-position based (from 0 to length-1 of the axis), but may also be used with a Boolean array; it raises IndexError if a requested indexer is out of bounds, except in the case of slice indexers, which allow out-of-bounds indexing (this conforms with Python/NumPy slice semantics). The allowed inputs for .iloc[] are: an integer, such as 5; a list or array of integers [4, 3, 0]; a slice object with ints 1:7; and a Boolean array. For example,

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(np.random.rand(5,5), index = ['row_1', 'row_2', 'row_3',
'row_4', 'row_5'], columns = ['col_1', 'col_2', 'col_3', 'col_4', 'col_5'])
>>> df
col_1 col_2 col_3 col_4 col_5
row_1 0.302179 0.067154 0.848890 0.291533 0.710989
row_2 0.668777 0.246157 0.339020 0.232109 0.390328
row_3 0.787487 0.703837 0.542948 0.839311 0.050887
row_4 0.905814 0.026933 0.381502 0.754635 0.399242
row_5 0.244861 0.343171 0.992433 0.058433 0.266207
>>> df.loc['row_1']
col_1 0.302179
col_2 0.067154
col_3 0.848890
col_4 0.291533
col_5 0.710989
Name: row_1, dtype: float64
>>> df.loc['row_2', 'col_3']
0.339020
>>> df.loc[['row_1', 'row_2'],['col_2', 'col_3']]
col_2 col_3
row_1 0.067154 0.84889
row_2 0.246157 0.33902
>>> df.loc[:, ['col_2', 'col_3']]
col_2 col_3
row_1 0.067154 0.848890
row_2 0.246157 0.339020
row_3 0.703837 0.542948
row_4 0.026933 0.381502
row_5 0.343171 0.992433
>>> df.iloc[1]
col_1 0.668777
col_2 0.246157
col_3 0.339020
col_4 0.232109
col_5 0.390328
Name: row_2, dtype: float64
>>> df.iloc[3:5, 0:2]
col_1 col_2
row_4 0.905814 0.026933
row_5 0.244861 0.343171
>>> df.iloc[:3, :]
col_1 col_2 col_3 col_4 col_5
row_1 0.302179 0.067154 0.848890 0.291533 0.710989
row_2 0.668777 0.246157 0.339020 0.232109 0.390328
row_3 0.787487 0.703837 0.542948 0.839311 0.050887
>>> df.iloc[:,:]
col_1 col_2 col_3 col_4 col_5
row_1 0.302179 0.067154 0.848890 0.291533 0.710989
row_2 0.668777 0.246157 0.339020 0.232109 0.390328
row_3 0.787487 0.703837 0.542948 0.839311 0.050887
row_4 0.905814 0.026933 0.381502 0.754635 0.399242
row_5 0.244861 0.343171 0.992433 0.058433 0.266207
>>> df.iloc[2:, 2:]
col_3 col_4 col_5
row_3 0.542948 0.839311 0.050887
row_4 0.381502 0.754635 0.399242
row_5 0.992433 0.058433 0.266207
>>> df.iloc[:,1]
row_1 0.067154
row_2 0.246157
row_3 0.703837
row_4 0.026933
row_5 0.343171
Name: col_2, dtype: float64
>>> df[df > 0.2]
col_1 col_2 col_3 col_4 col_5
row_1 0.302179 NaN 0.848890 0.291533 0.710989
row_2 0.668777 0.246157 0.339020 0.232109 0.390328
row_3 0.787487 0.703837 0.542948 0.839311 NaN
row_4 0.905814 NaN 0.381502 0.754635 0.399242
row_5 0.244861 0.343171 0.992433 NaN 0.266207

For the .loc[row_label_indexing, col_label_indexing] and .iloc[row_integer_indexing, col_integer_indexing] indexers, a single argument always refers to selecting data elements by row index in the Pandas DataFrame, not by column index. When col_label_indexing or col_integer_indexing is absent, all the columns for that particular row will be selected. For example, .loc['a'] is equivalent to .loc['a', :], and the same applies to .iloc[]. With df.loc[indexer] you know automatically that df.loc[] is selecting rows. In contrast, it is not clear whether df[indexer] will select rows or columns (or raise ValueError) without knowing the details of indexer and df. In the example above, df.loc['row_1'] selects the first row, labeled row_1, and all the columns of that row. df.loc['row_2', 'col_3'] selects the data element in the second row, labeled row_2, and the third column, labeled col_3. df.loc[['row_1', 'row_2'], ['col_2', 'col_3']] selects the values present in the first and second rows along with the corresponding columns. With df.loc[:, ['col_2', 'col_3']] you slice via labels and select all the rows under columns 2 and 3.
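As a further sketch with the same df, .loc[] also accepts label slices (both endpoints included) and Boolean arrays, for example a condition built from one of the columns:

>>> df.loc['row_2':'row_4', 'col_1':'col_3']        # label slices include both endpoints
          col_1     col_2     col_3
row_2  0.668777  0.246157  0.339020
row_3  0.787487  0.703837  0.542948
row_4  0.905814  0.026933  0.381502
>>> df.loc[df['col_1'] > 0.5, ['col_1', 'col_5']]   # Boolean array picks the matching rows
          col_1     col_5
row_2  0.668777  0.390328
row_3  0.787487  0.050887
row_4  0.905814  0.399242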


You can grab data based on position instead of labels using the .iloc[] indexer, which provides integer-based indexing. The semantics follow Python and NumPy slicing closely: indexing is zero based and, when slicing, the start bound is included while the upper bound is excluded. df.iloc[1] selects all the data elements of the second row across every column. Even though these rows and columns have labels, their integer indices still range from 0 to n - 1, where n is the length of the data. Slicing returns a subset of the data elements in the DataFrame along with their corresponding labels, and when a slice bound falls outside the range of the axis, the rows that do exist are still selected and the out-of-range part is handled gracefully. df.iloc[:3, :] selects all the data elements from the first to the third row across every column, and df.iloc[:, :] selects the entire DataFrame. df.iloc[2:, 2:] selects the rows from position three onwards and the columns from position three onwards, while df.iloc[:, 1] selects all the rows of the second column. An important feature of pandas is conditional selection using bracket notation, very similar to NumPy: df[df > 0.2] displays the data elements greater than 0.2 while the remaining positions are shown as NaN. Note that none of the above operations change the original data elements of the DataFrame.

Group By: split-apply-combine

Here, “group by” refers to a process involving one or more of the following steps:

  • Splitting the data into groups based on some criteria.
  • Applying a function to each group independently.
  • Combining the results into a data structure.
Out of these, the split step is the most straightforward. In fact, in many situations we may wish to split the data set into groups and do something with those groups. In the apply step, we might wish to do one of the following:

  • Aggregation: compute a summary statistic (or statistics) for each group. For example, compute group sums or means, or compute group sizes/counts.
  • Transformation: perform some group-specific computations and return a like-indexed object. For example, standardize data (zscore) within a group, or fill NAs within groups with a value derived from each group.
  • Filtration: discard some groups, according to a group-wise computation that evaluates to True or False. For example, discard data that belong to groups with only a few members, or filter out data based on the group sum or mean.

  • Some combination of the above: GroupBy will examine the results of the apply step and try to return a sensibly combined result if it does not fit into either of the above categories.



For example,

>>> import pandas as pd
>>> cars_data = {'Company':['General Motors','Ford', 'Toyota', 'General Motors',
'Ford', 'Toyota'], 'Model': ['Camaro', 'Mustang', 'Prius', 'Malibu', 'Fiesta', 'Camry'],
'Sold':[12285, 35273, 34287, 29325, 27459, 17621]}
>>> cars_df = pd.DataFrame(cars_data)
>>> cars_df
Company Model Sold
0 General Motors Camaro 12285
1 Ford Mustang 35273
2 Toyota Prius 34287
3 General Motors Malibu 29325
4 Ford Fiesta 27459
5 Toyota Camry 17621
>>> cars_df.groupby('Company').mean()
Company Sold
Ford 31366
General Motors 20805
Toyota 25954
>>> cars_df.groupby('Company').std()
Company Sold
Ford 5525.332388
General Motors 12049.099551
Toyota 11784.641615
>>> cars_df.groupby('Company').min()
Company Model Sold
Ford Fiesta 27459
General Motors Camaro 12285
Toyota Camry 17621
>>> cars_df.groupby('Company').max()
Company Model Sold
Ford Mustang 35273
General Motors Malibu 29325
Toyota Prius 34287
>>> cars_df.groupby('Company').sum()
Company Sold
Ford 62732
General Motors 41610
Toyota 51908
>>> cars_df.groupby('Company').describe()
Sold
Company count mean std min 25% 50% 75% max
Ford 2.0 31366.0 5525.332388 27459.0 29412.5 31366.0 33319.5 35273.0
General Motors 2.0 20805.0 12049.099551 12285.0 16545.0 20805.0 25065.0 29325.0
Toyota 2.0 25954.0 11784.641615 17621.0 21787.5 25954.0 30120.5 34287.0
 >>> cars_df.groupby('Company').count()
Company Model Sold
Ford 2 2
General Motors 2 2
Toyota 2 2
>>> cars_df.groupby('Company')['Company'].count()
Company
Ford 2
General Motors 2
Toyota 2
Name: Company, dtype: int64
>>> cars_df.groupby('Company')['Company'].count().tolist()
[2, 2, 2]
>>> cars_df.groupby('Company')['Company'].count().index.tolist()
['Ford', 'General Motors', 'Toyota']
 >>> cars_df.groupby(['Company','Model']).groups
{('Ford', 'Fiesta'): Int64Index([4], dtype='int64'), ('Ford', 'Mustang'):
Int64Index([1], dtype='int64'), ('General Motors', 'Camaro'): Int64Index([0],
dtype='int64'), ('General Motors', 'Malibu'): Int64Index([3], dtype='int64'),
('Toyota', 'Camry'): Int64Index([5], dtype='int64'), ('Toyota', 'Prius'):
Int64Index([2], dtype='int64')}
>>> grp_by_company = cars_df.groupby('Company')
 >>> for label, group in grp_by_company:
... print(label)
... print(group)
...
Ford
Company Model Sold
1 Ford Mustang 35273
4 Ford Fiesta 27459
General Motors
Company Model Sold
0 General Motors Camaro 12285
3 General Motors Malibu 29325
Toyota
Company Model Sold
2 Toyota Prius 34287
5 Toyota Camry 17621

In the above code, cars_df is the DataFrame on which the groupby() method is applied. The groupby() method allows you to group rows of data together based on a column name and call aggregate functions on them. For instance, let's group based on the "Company" column using the groupby() method. This returns a pandas DataFrameGroupBy object upon which you can call aggregate methods. If you need to count only one column, specify the name of the column within brackets, as shown above; you can then get the values and index labels as lists by chaining the tolist() method.
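The examples above all perform the aggregation step. As a brief sketch of the transformation and filtration steps described earlier (using the same cars_df), transform() returns a like-indexed result and filter() keeps or discards whole groups:

>>> cars_df.groupby('Company')['Sold'].transform('mean')   # group mean broadcast back to every row
0    20805.0
1    31366.0
2    25954.0
3    20805.0
4    31366.0
5    25954.0
Name: Sold, dtype: float64
>>> cars_df.groupby('Company').filter(lambda g: g['Sold'].sum() > 50000)   # drop groups that fail the test
  Company    Model   Sold
1    Ford  Mustang  35273
2  Toyota    Prius  34287
4    Ford   Fiesta  27459
5  Toyota    Camry  17621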


Aggregate Functions and Their Description:

mean(): Compute mean of groups
sum(): Compute sum of group values
size(): Compute group sizes
count(): Compute count of groups
std(): Compute standard deviation of groups
describe(): Generate descriptive statistics
min(): Compute minimum of group values
max(): Compute maximum of group values

The groups attribute is a dictionary whose keys are the computed unique groups and whose corresponding values are the index labels belonging to each group. Assign the DataFrameGroupBy object returned by the groupby() method to the grp_by_company variable. With the grp_by_company object, you can iterate through the grouped data by specifying two iterating variables: here the label variable returns the data element of the grouped column Company and the group variable returns the grouped data.

Concatenate, Append and Merge

The pandas library provides various facilities for easily combining/concatenating Series as well as Pandas DataFrame objects. The pandas library also has support for full-featured, high-performance in-memory merge operations, also called join operations. The pandas library provides a single function, merge(), as the entry point for all standard merge operations between different Pandas DataFrame objects.
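As a closing sketch (the sales_2019, sales_2020 and hq DataFrames below are made up purely for illustration), concat() stacks DataFrames while merge() performs an SQL-style join on a shared key column:

>>> sales_2019 = pd.DataFrame({'Company': ['Ford', 'Toyota'], 'Sold': [31000, 29000]})
>>> sales_2020 = pd.DataFrame({'Company': ['Ford', 'Toyota'], 'Sold': [27000, 33000]})
>>> pd.concat([sales_2019, sales_2020], ignore_index=True)   # stack the rows of both DataFrames
  Company   Sold
0    Ford  31000
1  Toyota  29000
2    Ford  27000
3  Toyota  33000
>>> hq = pd.DataFrame({'Company': ['Ford', 'Toyota'], 'Country': ['USA', 'Japan']})
>>> pd.merge(sales_2019, hq, on='Company')                   # join the two DataFrames on 'Company'
  Company   Sold Country
0    Ford  31000     USA
1  Toyota  29000   Japan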

