您好,欢迎访问代理记账网站
  • 价格透明
  • 信息保密
  • 进度掌控
  • 售后无忧

1001系列之案例0003如何对欧洲人口普查数据集整理挖掘

一、欧洲人口普查数据分析

本案例重点在于对数据质量的清洗、转换,、可视化。

import os                   #导入需要的库
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")
os.chdir(r"D:\Datalu\File")    #指定工作目录
%matplotlib inline              #可视化设置
plt.rcParams["font.sans-serif"] = ["KAITI"]
plt.rcParams["axes.unicode_minus"] = False

1.1 导入数据,并查看前五行

df = pd.read_csv('european_cities.csv')
df.head()
RankCityStatePopulationDate of census/estimate
01London[2]United Kingdom8,615,2461-Jun-14
12BerlinGermany3,437,91631-May-14
23MadridSpain3,165,2351-Jan-14
34RomeItaly2,872,08630-Sep-14
45ParisFrance2,273,3051-Jan-13

1.2 拷贝数据

df1 =df.copy()

二、提出分析问题

问题提出:

1、查看数据类型
2、查看Shape
3、是否有缺失值
4、对数据进行清洗
5、重新设置索引
6、查看哪个国家城市最多
7、查看哪个国家人口最多

2.1 查看数据类型

#1、查看数据类型
df.info(memory_usage="deep")
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 105 entries, 0 to 104
Data columns (total 5 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Rank                     105 non-null    int64 
 1   City                     105 non-null    object
 2   State                    105 non-null    object
 3   Population               105 non-null    object
 4   Date of census/estimate  105 non-null    object
dtypes: int64(1), object(4)
memory usage: 27.9 KB
#方法一用astype强制转换
df1["Date of census/estimate"] = df1["Date of census/estimate"].astype("datetime64")
df1.info(memory_usage="deep")
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 105 entries, 0 to 104
Data columns (total 5 columns):
 #   Column                   Non-Null Count  Dtype         
---  ------                   --------------  -----         
 0   Rank                     105 non-null    int64         
 1   City                     105 non-null    object        
 2   State                    105 non-null    object        
 3   Population               105 non-null    object        
 4   Date of census/estimate  105 non-null    datetime64[ns]
dtypes: datetime64[ns](1), int64(1), object(3)
memory usage: 22.0 KB
#方法二用to_datetime转换
df1["Date of census/estimate"] = pd.to_datetime(df1["Date of census/estimate"],errors='coerce')
df1.info(memory_usage="deep") #可以看到内存占用降低6.4KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 105 entries, 0 to 104
Data columns (total 5 columns):
 #   Column                   Non-Null Count  Dtype         
---  ------                   --------------  -----         
 0   Rank                     105 non-null    int64         
 1   City                     105 non-null    object        
 2   State                    105 non-null    object        
 3   Population               105 non-null    object        
 4   Date of census/estimate  105 non-null    datetime64[ns]
dtypes: datetime64[ns](1), int64(1), object(3)
memory usage: 24.0 KB
#2、查看Shape
df.shape
(105, 5)
#查看数据的统计特征
df.describe()
Rank
count105.000000
mean53.057143
std30.428298
min1.000000
25%27.000000
50%53.000000
75%79.000000
max105.000000

2.2 查看缺失率

#3、是否有缺失值,计算每一列的缺失值比例。结果保留2位小数
round(df.isnull().sum()/len(df),2)
Rank                       0.0
City                       0.0
State                      0.0
Population                 0.0
Date of census/estimate    0.0
dtype: float64

2.3 对数据进行清洗

2.3.1 处理人口特征

#将人口特征中的逗号去掉,变为数值型数据  方式一:用pandas的str方法
df1["Population"] = df1["Population"].str.replace(",","")
df1["Population"].head()
0    8615246
1    3437916
2    3165235
3    2872086
4    2273305
Name: Population, dtype: object
#方法二:将人口特征中的逗号去掉,变为数值型数据,用lambda函数方法
df1["Population"] = df1["Population"].apply(lambda x:x.replace(",",""))
df1["Population"].head()
0    8615246
1    3437916
2    3165235
3    2872086
4    2273305
Name: Population, dtype: object
df1["Population"] = df1["Population"].astype("int32")
df1.info(memory_usage="deep")
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 105 entries, 0 to 104
Data columns (total 5 columns):
 #   Column                   Non-Null Count  Dtype         
---  ------                   --------------  -----         
 0   State                    105 non-null    object        
 1   City                     105 non-null    object        
 2   Rank                     105 non-null    int64         
 3   Population               105 non-null    int32         
 4   Date of census/estimate  105 non-null    datetime64[ns]
dtypes: datetime64[ns](1), int32(1), int64(1), object(2)
memory usage: 3.8+ KB

2.3.2 处理城市和国家特征

#去掉国家这一列中的空格
df1["State"] = df1["State"].apply(lambda x : x.strip())
df1["State"][:1]
0    United Kingdom
Name: State, dtype: object
df1["State"] = df1["State"].astype("category")
df1.info(memory_usage="deep")
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 105 entries, 0 to 104
Data columns (total 5 columns):
 #   Column                   Non-Null Count  Dtype         
---  ------                   --------------  -----         
 0   State                    105 non-null    category      
 1   City                     105 non-null    object        
 2   Rank                     105 non-null    int64         
 3   Population               105 non-null    int32         
 4   Date of census/estimate  105 non-null    datetime64[ns]
dtypes: category(1), datetime64[ns](1), int32(1), int64(1), object(1)
memory usage: 11.8 KB
#将国家设置为索引
df1.set_index("State",inplace=True)
df1.head(2)
StateCityRankPopulationDate of census/estimate
0DenmarkAarhus923266762014-10-01
1SpainAlicante863346782012-01-01

2.3.3 处理时间类型数据

df1.reset_index(inplace=True)
df1.set_index("Date of census/estimate",inplace=True)
df1["Year"] = df1["Date of census/estimate"].dt.year
df1["Month"] = df1["Date of census/estimate"].dt.month
df1["Date"] = df1["Date of census/estimate"].dt.day
df1.head(2)

三、对数据变量挖掘

3.1 查看哪个国家城市最多

df2 = df1["State"].value_counts()
df2.plot(kind="bar",figsize=(14,4),rot=30)

在这里插入图片描述

3.2 查看哪个国家人口最多

df3 = pd.pivot_table(df1,index="State",values="Population",aggfunc="sum").sort_values(by="Population",ascending=False)
df3
Population
State
United Kingdom16011877
Germany15119548
Spain10041639
Italy8764067
Poland6267409
France4395271
Romania2527280
Netherlands2271771
Bulgaria1968755
Austria1794770
Sweden1747095
Hungary1744665
Belgium1686441
Czech Republic1625107
Greece986286
Denmark886116
Lithuania844040
Croatia790017
Latvia701185
Finland605523
Portugal547631
Ireland527612
Estonia434810
Slovakia Slovak Republic417389
#用pandas自带画图
df3.plot(kind="bar",figsize=(16,4),rot=30)

在这里插入图片描述

#用matplotlib画图,其中因变量是文本型,可以用名义变量画图,再将文本作为标签传入
label = list(df3.index)
plt.figure(figsize=(16,4))
plt.plot(np.arange(len(df3)),df3.Population.values)
plt.xticks(ticks=np.arange(len(df3)),labels=label,rotation=30)
plt.show()

在这里插入图片描述

#将两个图画在一起
fg, ax = plt.subplots(1,2,figsize=(18,4))
plt.subplots_adjust(wspace=1)

ax[0].bar(df2.index,df2.values)
ax[0].set_xlabel("不同国家城市数量")

ax[1].bar(np.arange(len(df3)),df3.Population.values)
ax[1].set_xlabel("不同国家人口数量")
ax[1].set_xticks(ticks=np.arange(len(df3)))
plt.tight_layout()
plt.show()

在这里插入图片描述

3.3 通过分类字段对数值字段进行透视

df1.reset_index(inplace=True)
df1 = df1.set_index(["State",'City']).sort_index(level=1)
df4 = pd.pivot_table(df1,index=["State","City"],values="Population",aggfunc="sum")
df4
Population
StateCity
AustriaVienna1794770
BelgiumAntwerp510610
Brussels[17]1175831
BulgariaPlovdiv341041
Sofia1291895
.........
United KingdomManchester510772
Nottingham308735
Sheffield557382
Wakefield327627
Wirral320229

105 rows × 1 columns


分享:

低价透明

统一报价,无隐形消费

金牌服务

一对一专属顾问7*24小时金牌服务

信息保密

个人信息安全有保障

售后无忧

服务出问题客服经理全程跟进