작성일자 : 2023-10-30
Ver 0.1.1
33 ~ 100 문제 풀이 : https://youtu.be/00rctVVSSoA?si=Du3KDMiQeHq7Gp-F
3. Grouping¶
뉴욕 airBnB : https://www.kaggle.com/ptoscano230382/air-bnb-ny-2019
DataUrl = 'https://raw.githubusercontent.com/Datamanim/pandas/main/AB_NYC_2019.csv'
import pandas as pd
df5 = pd.read_csv('https://raw.githubusercontent.com/Datamanim/pandas/main/AB_NYC_2019.csv')
df5.head()
id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2539 | Clean & quiet apt home by the park | 2787 | John | Brooklyn | Kensington | 40.64749 | -73.97237 | Private room | 149 | 1 | 9 | 2018-10-19 | 0.21 | 6 | 365 |
1 | 2595 | Skylit Midtown Castle | 2845 | Jennifer | Manhattan | Midtown | 40.75362 | -73.98377 | Entire home/apt | 225 | 1 | 45 | 2019-05-21 | 0.38 | 2 | 355 |
2 | 3647 | THE VILLAGE OF HARLEM....NEW YORK ! | 4632 | Elisabeth | Manhattan | Harlem | 40.80902 | -73.94190 | Private room | 150 | 3 | 0 | NaN | NaN | 1 | 365 |
3 | 3831 | Cozy Entire Floor of Brownstone | 4869 | LisaRoxanne | Brooklyn | Clinton Hill | 40.68514 | -73.95976 | Entire home/apt | 89 | 1 | 270 | 2019-07-05 | 4.64 | 1 | 194 |
4 | 5022 | Entire Apt: Spacious Studio/Loft by central park | 7192 | Laura | Manhattan | East Harlem | 40.79851 | -73.94399 | Entire home/apt | 80 | 10 | 9 | 2018-11-19 | 0.10 | 1 | 0 |
Q45. 데이터의 각 host_name의 빈도수를 구하고 index로 정렬하여 상위 5개를 출력하라
df5['host_name'].value_counts().sort_index().head()
host_name 'Cil 1 (Ari) HENRY LEE 1 (Email hidden by Airbnb) 6 (Mary) Haiy 1 -TheQueensCornerLot 1 Name: count, dtype: int64
Q46. 데이터의 각 host_name의 빈도수를 구하고 빈도수 기준 내림차순 정렬한 데이터 프레임을 만들어라. 빈도수 컬럼은 counts로 명명하라
df5.host_name.value_counts().to_frame().rename(columns = {'count' : 'counts'} )
counts | |
---|---|
host_name | |
Michael | 417 |
David | 403 |
Sonder (NYC) | 327 |
John | 294 |
Alex | 279 |
... | ... |
Rhonycs | 1 |
Brandy-Courtney | 1 |
Shanthony | 1 |
Aurore And Jamila | 1 |
Ilgar & Aysel | 1 |
11452 rows × 1 columns
Q47. neighbourhood_group의 값에 따른 neighbourhood컬럼 값의 갯수를 구하여라
df5[['neighbourhood_group', 'neighbourhood']].groupby(['neighbourhood_group', 'neighbourhood'], as_index = False).size()
neighbourhood_group | neighbourhood | size | |
---|---|---|---|
0 | Bronx | Allerton | 42 |
1 | Bronx | Baychester | 7 |
2 | Bronx | Belmont | 24 |
3 | Bronx | Bronxdale | 19 |
4 | Bronx | Castle Hill | 9 |
... | ... | ... | ... |
216 | Staten Island | Tottenville | 7 |
217 | Staten Island | West Brighton | 18 |
218 | Staten Island | Westerleigh | 2 |
219 | Staten Island | Willowbrook | 1 |
220 | Staten Island | Woodrow | 1 |
221 rows × 3 columns
Q48. neighbourhood_group의 값에 따른 neighbourhood컬럼 값 중 neighbourhood_group그룹의 최댓값들을 출력하라
ww = df5[['neighbourhood_group', 'neighbourhood']].groupby(['neighbourhood_group', 'neighbourhood'], as_index = False).size()
pd.set_option('display.max_rows', 250)
ww.sort_values(['neighbourhood_group','size'],ascending = [True, False]).drop_duplicates('neighbourhood_group', keep = 'first')
neighbourhood_group | neighbourhood | size | |
---|---|---|---|
18 | Bronx | Kingsbridge | 70 |
93 | Brooklyn | Williamsburg | 3920 |
105 | Manhattan | Harlem | 2658 |
128 | Queens | Astoria | 900 |
212 | Staten Island | St. George | 48 |
Q49. neighbourhood_group 값에 따른 price값의 평균, 분산, 최대, 최소 값을 구하여라
df5[['neighbourhood_group', 'price']].groupby('neighbourhood_group').agg(['mean','var','max','min'])
price | ||||
---|---|---|---|---|
mean | var | max | min | |
neighbourhood_group | ||||
Bronx | 87.496792 | 11386.885081 | 2500 | 0 |
Brooklyn | 124.383207 | 34921.719135 | 10000 | 0 |
Manhattan | 196.875814 | 84904.159185 | 10000 | 0 |
Queens | 99.517649 | 27923.130227 | 10000 | 10 |
Staten Island | 114.812332 | 77073.088342 | 5000 | 13 |
Q50. neighbourhood_group 값에 따른 reviews_per_month 평균, 분산, 최대, 최소 값을 구하여라
df5[['neighbourhood_group', 'reviews_per_month']].dropna().groupby('neighbourhood_group').agg(['mean','var','max','min'])
reviews_per_month | ||||
---|---|---|---|---|
mean | var | max | min | |
neighbourhood_group | ||||
Bronx | 1.837831 | 2.799878 | 10.34 | 0.02 |
Brooklyn | 1.283212 | 2.299040 | 14.00 | 0.01 |
Manhattan | 1.272131 | 2.651206 | 58.50 | 0.01 |
Queens | 1.941200 | 4.897848 | 20.94 | 0.01 |
Staten Island | 1.872580 | 2.840895 | 10.12 | 0.02 |
Q51. neighbourhood 값과 neighbourhood_group 값에 따른 price 의 평균을 구하라
print(df5.groupby(['neighbourhood', 'neighbourhood_group'])[['price']].mean().shape[0])
df5.groupby(['neighbourhood', 'neighbourhood_group'])[['price']].mean().head()
221
price | ||
---|---|---|
neighbourhood | neighbourhood_group | |
Allerton | Bronx | 87.595238 |
Arden Heights | Staten Island | 67.250000 |
Arrochar | Staten Island | 115.000000 |
Arverne | Queens | 171.779221 |
Astoria | Queens | 117.187778 |
Q52. neighbourhood 값과 neighbourhood_group 값에 따른 price 의 평균을 계층적 indexing 없이 구하라
df5.groupby(['neighbourhood', 'neighbourhood_group'], as_index = False)[['price']].mean().head()
neighbourhood | neighbourhood_group | price | |
---|---|---|---|
0 | Allerton | Bronx | 87.595238 |
1 | Arden Heights | Staten Island | 67.250000 |
2 | Arrochar | Staten Island | 115.000000 |
3 | Arverne | Queens | 171.779221 |
4 | Astoria | Queens | 117.187778 |
Q54. 데이터중 neighbourhood_group 값이 Queens값을 가지는 데이터들 중 neighbourhood 그룹별로 price값의 평균, 분산, 최대, 최소값을 구하라
df5.loc[df5['neighbourhood_group'].str.contains('Queens')].groupby('neighbourhood')[['price']].agg(['mean','var','max','min']).head()
price | ||||
---|---|---|---|---|
mean | var | max | min | |
neighbourhood | ||||
Arverne | 171.779221 | 37383.411141 | 1500 | 35 |
Astoria | 117.187778 | 122428.811196 | 10000 | 25 |
Bay Terrace | 142.000000 | 6816.400000 | 258 | 32 |
Bayside | 157.948718 | 166106.470985 | 2600 | 30 |
Bayswater | 87.470588 | 2330.889706 | 230 | 45 |
Q55. 데이터중 neighbourhood_group 값에 따른 room_type 컬럼의 숫자를 구하고 neighbourhood_group 값을 기준으로 각 값의 비율을 구하여라
Ans = df5[['neighbourhood_group','room_type']].groupby(['neighbourhood_group','room_type']).size().unstack()
Ans.loc[:,:] = (Ans.values /Ans.sum(axis=1).values.reshape(-1,1))
Ans
room_type | Entire home/apt | Private room | Shared room |
---|---|---|---|
neighbourhood_group | |||
Bronx | 0.347388 | 0.597617 | 0.054995 |
Brooklyn | 0.475478 | 0.503979 | 0.020543 |
Manhattan | 0.609344 | 0.368496 | 0.022160 |
Queens | 0.369926 | 0.595129 | 0.034945 |
Staten Island | 0.471850 | 0.504021 | 0.024129 |