900字范文,内容丰富有趣,生活中的好帮手!
900字范文 > pandas数据处理实践五(透视表pivot_table 分组和透视表实战Grouper和pivot_table)

pandas数据处理实践五(透视表pivot_table 分组和透视表实战Grouper和pivot_table)

时间:2023-08-22 18:35:04

相关推荐

pandas数据处理实践五(透视表pivot_table 分组和透视表实战Grouper和pivot_table)

建议大家多研究pandas的官方文档

透视表:

DataFrame.pivot_table(values = None,index = None,columns = None,aggfunc ='mean',fill_value = None,margin = False,dropna = True,margins_name ='All')

创建一个电子表格样式的数据透视表作为DataFrame。数据透视表中的级别将存储在结果DataFrame的索引和列上的MultiIndex对象

In [7]: df = pd.DataFrame({"A": ["foo", "foo", "foo", "foo", "foo",^M...: "bar", "bar", "bar", "bar"],^M...: "B": ["one", "one", "one", "two", "two",^M...: "one", "one", "two", "two"],^M...: "C": ["small", "large", "large", "small",^M...: "small", "large", "small", "small",^M...: "large"],^M...: "D": [1, 2, 2, 3, 3, 4, 5, 6, 7]})...:In [8]: dfOut[8]:A BC D0 foo one small 11 foo one large 22 foo one large 23 foo two small 34 foo two small 35 bar one large 46 bar one small 57 bar two small 68 bar two large 7In [9]: table = pd.pivot_table(df, values='D', index=['A','B'], columns=['C'], aggfunc=np.sum)# 通过透视表,以A,B为索引对象,以c作为列,把D作为值填充In [10]: tableOut[10]:C large smallA Bbar one 4.0 5.0two 7.0 6.0foo one 4.0 1.0two NaN 6.0

再举一个例子:

import numpy as npimport pandas as pdfrom pandas import Series,DataFramedf = pd.read_excel('sales-funnel.xlsx')df.head() # 查看前五行数据AccountNameRepManagerProductQuantityPriceStatus0714466Trantow-BarrowsCraig BookerDebra HenleyCPU130000presented1714466Trantow-BarrowsCraig BookerDebra HenleySoftware110000presented2714466Trantow-BarrowsCraig BookerDebra HenleyMaintenance25000pending3737550Fritsch, Russel and AndersonCraig BookerDebra HenleyCPU135000declined4146832Kiehn-SpinkaDaniel HiltonDebra HenleyCPU265000won# 生成透视表# 从数据的显示来看,我们对顾客 购买的总价钱感兴趣,如何转换表格呢?pd.pivot_table(df,index=['Name'],aggfunc='sum')AccountPriceQuantityNameBarton LLC740150350001Fritsch, Russel and Anderson737550350001Herman LLC141962650002Jerde-Hilpert41229050002Kassulke, Ondricka and Metz30759970003Keeling LLC6889811000005Kiehn-Spinka146832650002Koepp Ltd1459666700004Kulas Inc437790500003Purdy-Kunde163416300001Stokes LLC478688150002Trantow-Barrows2143398450004pd.pivot_table(df,index=['Name','Rep','Manager'])AccountPriceQuantityNameRepManagerBarton LLCJohn SmithDebra Henley740150.035000.01.000000Fritsch, Russel and AndersonCraig BookerDebra Henley737550.035000.01.000000Herman LLCCedric MossFred Anderson141962.065000.02.000000Jerde-HilpertJohn SmithDebra Henley412290.05000.02.000000Kassulke, Ondricka and MetzWendy YuleFred Anderson307599.07000.03.000000Keeling LLCWendy YuleFred Anderson688981.0100000.05.000000Kiehn-SpinkaDaniel HiltonDebra Henley146832.065000.02.000000Koepp LtdWendy YuleFred Anderson729833.035000.02.000000Kulas IncDaniel HiltonDebra Henley218895.025000.01.500000Purdy-KundeCedric MossFred Anderson163416.030000.01.000000Stokes LLCCedric MossFred Anderson239344.07500.01.000000Trantow-BarrowsCraig BookerDebra Henley714466.015000.01.333333

分组和透视表的使用:

本试验的数据是飞机延误

In [15]: import numpy as np^M...: import pandas as pd^M...: from pandas import Series,DataFrame...:...:In [16]: df = pd.read_csv('usa_flights.csv')In [17]:In [17]: df.head()Out[17]:flight_date unique_carrier ...security_delay actual_elapsed_time0 02/01/ 0:00 AA ... NaN381.01 03/01/ 0:00 AA ... NaN358.02 04/01/ 0:00 AA ... NaN385.03 05/01/ 0:00 AA ... NaN389.04 06/01/ 0:00 AA ... 0.0424.0[5 rows x 14 columns]In [18]: df.shape # 查看数据的维度Out[18]: (64, 14)In [22]: df.columns # 查看数据的列标签Out[22]:Index(['flight_date', 'unique_carrier', 'flight_num', 'origin', 'dest','arr_delay', 'cancelled', 'distance', 'carrier_delay', 'weather_delay','late_aircraft_delay', 'nas_delay', 'security_delay','actual_elapsed_time'],dtype='object')

任务一:1.通过arr_delay排序观察延误时间最长top10

In [23]: df.sort_values('arr_delay',ascending=False).head(10)

2.计算延误和没有延误的比例

In [24]: df['cancelled'].value_counts() # 计算取消航班和正常航班的总次数Out[24]:0 19687314791Name: cancelled, dtype: int64In [25]: df['delayed'] = df['arr_delay'].apply(lambda x: x>0) #把延误的转为数值量In [26]: df.head()

In [27]: delay_data = df['delayed'].value_counts()In [28]: delay_dataOut[28]:False 103037True98627Name: delayed, dtype: int64In [29]: delay_data[0]Out[29]: 103037In [30]: delay_data[1] / (delay_data[0] + delay_data[1])Out[30]: 0.4890659711202793

3.每个航空公司的延误情况

In [31]: delay_group = df.groupby(['unique_carrier','delayed'])In [32]: df_delay = delay_group.size().unstack()In [33]: df_delayOut[33]:delayed False Trueunique_carrierAA8912 9841AS3527 2104B64832 4401DL 17719 9803EV 10596 11371F91103 1848HA1351 1354MQ4692 8060NK1550 2133OO9977 10804UA7885 8624US7850 6353VX1254 781WN 21789 21150In [34]: import matplotlib.pyplot as pltIn [35]: df_delay.plot()

本内容不代表本网观点和政治立场,如有侵犯你的权益请联系我们处理。
网友评论
网友评论仅供其表达个人看法,并不表明网站立场。