작성일자 : 2023-10-31
Ver 0.1.1
33 ~ 100 문제 풀이 : https://youtu.be/00rctVVSSoA?si=Du3KDMiQeHq7Gp-F
7. Merge, Concat¶
국가별 5세이하 사망비율 통계 : https://www.kaggle.com/utkarshxy/who-worldhealth-statistics-2020-complete
데이터 변형
Dataurl = https://raw.githubusercontent.com/Datamanim/pandas/main/mergeTEst.csv
Q91. df1과 df2 데이터를 하나의 데이터 프레임으로 합쳐라
In [1]:
import pandas as pd
In [15]:
df = pd.read_csv('https://raw.githubusercontent.com/Datamanim/pandas/main/mergeTEst.csv', index_col=0)
In [16]:
df1 = df.iloc[:4,:]
df2 = df.iloc[4:,:]
display(df1)
display(df2)
2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | |
---|---|---|---|---|---|---|---|---|---|---|
Location | ||||||||||
Afghanistan | 64.023 | 61.640 | 59.367 | 57.170 | 55.08 | 53.107 | 51.267 | 49.560 | 47.983 | 46.453 |
Albania | 11.803 | 10.807 | 9.943 | 9.267 | 8.79 | 8.493 | 8.363 | 8.363 | 8.453 | 8.597 |
Algeria | 23.540 | 22.907 | 22.450 | 22.117 | 21.85 | 21.587 | 21.257 | 20.850 | 20.407 | 19.930 |
Andorra | 4.240 | 4.033 | 3.843 | 3.667 | 3.49 | 3.330 | 3.187 | 3.060 | 2.933 | 2.827 |
2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | |
---|---|---|---|---|---|---|---|---|---|---|
Location | ||||||||||
Angola | 75.713 | 71.280 | 67.233 | 63.570 | 60.430 | 57.757 | 55.510 | 53.460 | 51.757 | 50.093 |
Antigua and Barbuda | 8.667 | 8.223 | 7.807 | 7.420 | 7.070 | 6.757 | 6.483 | 6.230 | 6.000 | 5.783 |
Argentina | 12.887 | 12.380 | 11.840 | 11.283 | 10.733 | 10.203 | 9.683 | 9.177 | 8.680 | 8.227 |
Armenia | 16.497 | 15.677 | 14.897 | 14.170 | 13.477 | 12.817 | 12.183 | 11.583 | 11.007 | 10.497 |
Australia | 3.993 | 3.803 | 3.623 | 3.467 | 3.343 | 3.253 | 3.183 | 3.137 | 3.090 | 3.047 |
Austria | 3.573 | 3.463 | 3.333 | 3.210 | 3.113 | 3.043 | 2.987 | 2.943 | 2.897 | 2.843 |
In [17]:
pd.concat([df1,df2])
Out[17]:
2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | |
---|---|---|---|---|---|---|---|---|---|---|
Location | ||||||||||
Afghanistan | 64.023 | 61.640 | 59.367 | 57.170 | 55.080 | 53.107 | 51.267 | 49.560 | 47.983 | 46.453 |
Albania | 11.803 | 10.807 | 9.943 | 9.267 | 8.790 | 8.493 | 8.363 | 8.363 | 8.453 | 8.597 |
Algeria | 23.540 | 22.907 | 22.450 | 22.117 | 21.850 | 21.587 | 21.257 | 20.850 | 20.407 | 19.930 |
Andorra | 4.240 | 4.033 | 3.843 | 3.667 | 3.490 | 3.330 | 3.187 | 3.060 | 2.933 | 2.827 |
Angola | 75.713 | 71.280 | 67.233 | 63.570 | 60.430 | 57.757 | 55.510 | 53.460 | 51.757 | 50.093 |
Antigua and Barbuda | 8.667 | 8.223 | 7.807 | 7.420 | 7.070 | 6.757 | 6.483 | 6.230 | 6.000 | 5.783 |
Argentina | 12.887 | 12.380 | 11.840 | 11.283 | 10.733 | 10.203 | 9.683 | 9.177 | 8.680 | 8.227 |
Armenia | 16.497 | 15.677 | 14.897 | 14.170 | 13.477 | 12.817 | 12.183 | 11.583 | 11.007 | 10.497 |
Australia | 3.993 | 3.803 | 3.623 | 3.467 | 3.343 | 3.253 | 3.183 | 3.137 | 3.090 | 3.047 |
Austria | 3.573 | 3.463 | 3.333 | 3.210 | 3.113 | 3.043 | 2.987 | 2.943 | 2.897 | 2.843 |
Q92. df3과 df4 데이터를 하나의 데이터 프레임으로 합쳐라. 둘다 포함하고 있는 년도에 대해서만 고려한다
In [18]:
df3 = df.iloc[:2,:4]
df4 = df.iloc[5:,3:]
display(df3)
display(df4)
2010 | 2011 | 2012 | 2013 | |
---|---|---|---|---|
Location | ||||
Afghanistan | 64.023 | 61.640 | 59.367 | 57.170 |
Albania | 11.803 | 10.807 | 9.943 | 9.267 |
2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | |
---|---|---|---|---|---|---|---|
Location | |||||||
Antigua and Barbuda | 7.420 | 7.070 | 6.757 | 6.483 | 6.230 | 6.000 | 5.783 |
Argentina | 11.283 | 10.733 | 10.203 | 9.683 | 9.177 | 8.680 | 8.227 |
Armenia | 14.170 | 13.477 | 12.817 | 12.183 | 11.583 | 11.007 | 10.497 |
Australia | 3.467 | 3.343 | 3.253 | 3.183 | 3.137 | 3.090 | 3.047 |
Austria | 3.210 | 3.113 | 3.043 | 2.987 | 2.943 | 2.897 | 2.843 |
In [19]:
pd.concat([df3, df4], join = 'inner')
Out[19]:
2013 | |
---|---|
Location | |
Afghanistan | 57.170 |
Albania | 9.267 |
Antigua and Barbuda | 7.420 |
Argentina | 11.283 |
Armenia | 14.170 |
Australia | 3.467 |
Austria | 3.210 |
Q93. df3과 df4 데이터를 하나의 데이터 프레임으로 합쳐라. 모든 컬럼을 포함하고, 결측치는 0으로 대체한다
In [11]:
help(pd.concat)
Help on function concat in module pandas.core.reshape.concat: concat(objs: 'Iterable[NDFrame] | Mapping[HashableT, NDFrame]', *, axis: 'Axis' = 0, join: 'str' = 'outer', ignore_index: 'bool' = False, keys=None, levels=None, names=None, verify_integrity: 'bool' = False, sort: 'bool' = False, copy: 'bool | None' = None) -> 'DataFrame | Series' Concatenate pandas objects along a particular axis. Allows optional set logic along the other axes. Can also add a layer of hierarchical indexing on the concatenation axis, which may be useful if the labels are the same (or overlapping) on the passed axis number. Parameters ---------- objs : a sequence or mapping of Series or DataFrame objects If a mapping is passed, the sorted keys will be used as the `keys` argument, unless it is passed, in which case the values will be selected (see below). Any None objects will be dropped silently unless they are all None in which case a ValueError will be raised. axis : {0/'index', 1/'columns'}, default 0 The axis to concatenate along. join : {'inner', 'outer'}, default 'outer' How to handle indexes on other axis (or axes). ignore_index : bool, default False If True, do not use the index values along the concatenation axis. The resulting axis will be labeled 0, ..., n - 1. This is useful if you are concatenating objects where the concatenation axis does not have meaningful indexing information. Note the index values on the other axes are still respected in the join. keys : sequence, default None If multiple levels passed, should contain tuples. Construct hierarchical index using the passed keys as the outermost level. levels : list of sequences, default None Specific levels (unique values) to use for constructing a MultiIndex. Otherwise they will be inferred from the keys. names : list, default None Names for the levels in the resulting hierarchical index. verify_integrity : bool, default False Check whether the new concatenated axis contains duplicates. This can be very expensive relative to the actual data concatenation. sort : bool, default False Sort non-concatenation axis if it is not already aligned. copy : bool, default True If False, do not copy data unnecessarily. Returns ------- object, type of objs When concatenating all ``Series`` along the index (axis=0), a ``Series`` is returned. When ``objs`` contains at least one ``DataFrame``, a ``DataFrame`` is returned. When concatenating along the columns (axis=1), a ``DataFrame`` is returned. See Also -------- DataFrame.join : Join DataFrames using indexes. DataFrame.merge : Merge DataFrames by indexes or columns. Notes ----- The keys, levels, and names arguments are all optional. A walkthrough of how this method fits in with other tools for combining pandas objects can be found `here <https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html>`__. It is not recommended to build DataFrames by adding single rows in a for loop. Build a list of rows and make a DataFrame in a single concat. Examples -------- Combine two ``Series``. >>> s1 = pd.Series(['a', 'b']) >>> s2 = pd.Series(['c', 'd']) >>> pd.concat([s1, s2]) 0 a 1 b 0 c 1 d dtype: object Clear the existing index and reset it in the result by setting the ``ignore_index`` option to ``True``. >>> pd.concat([s1, s2], ignore_index=True) 0 a 1 b 2 c 3 d dtype: object Add a hierarchical index at the outermost level of the data with the ``keys`` option. >>> pd.concat([s1, s2], keys=['s1', 's2']) s1 0 a 1 b s2 0 c 1 d dtype: object Label the index keys you create with the ``names`` option. >>> pd.concat([s1, s2], keys=['s1', 's2'], ... names=['Series name', 'Row ID']) Series name Row ID s1 0 a 1 b s2 0 c 1 d dtype: object Combine two ``DataFrame`` objects with identical columns. >>> df1 = pd.DataFrame([['a', 1], ['b', 2]], ... columns=['letter', 'number']) >>> df1 letter number 0 a 1 1 b 2 >>> df2 = pd.DataFrame([['c', 3], ['d', 4]], ... columns=['letter', 'number']) >>> df2 letter number 0 c 3 1 d 4 >>> pd.concat([df1, df2]) letter number 0 a 1 1 b 2 0 c 3 1 d 4 Combine ``DataFrame`` objects with overlapping columns and return everything. Columns outside the intersection will be filled with ``NaN`` values. >>> df3 = pd.DataFrame([['c', 3, 'cat'], ['d', 4, 'dog']], ... columns=['letter', 'number', 'animal']) >>> df3 letter number animal 0 c 3 cat 1 d 4 dog >>> pd.concat([df1, df3], sort=False) letter number animal 0 a 1 NaN 1 b 2 NaN 0 c 3 cat 1 d 4 dog Combine ``DataFrame`` objects with overlapping columns and return only those that are shared by passing ``inner`` to the ``join`` keyword argument. >>> pd.concat([df1, df3], join="inner") letter number 0 a 1 1 b 2 0 c 3 1 d 4 Combine ``DataFrame`` objects horizontally along the x axis by passing in ``axis=1``. >>> df4 = pd.DataFrame([['bird', 'polly'], ['monkey', 'george']], ... columns=['animal', 'name']) >>> pd.concat([df1, df4], axis=1) letter number animal name 0 a 1 bird polly 1 b 2 monkey george Prevent the result from including duplicate index values with the ``verify_integrity`` option. >>> df5 = pd.DataFrame([1], index=['a']) >>> df5 0 a 1 >>> df6 = pd.DataFrame([2], index=['a']) >>> df6 0 a 2 >>> pd.concat([df5, df6], verify_integrity=True) Traceback (most recent call last): ... ValueError: Indexes have overlapping values: ['a'] Append a single row to the end of a ``DataFrame`` object. >>> df7 = pd.DataFrame({'a': 1, 'b': 2}, index=[0]) >>> df7 a b 0 1 2 >>> new_row = pd.Series({'a': 3, 'b': 4}) >>> new_row a 3 b 4 dtype: int64 >>> pd.concat([df7, new_row.to_frame().T], ignore_index=True) a b 0 1 2 1 3 4
In [20]:
pd.concat([df3, df4], join = 'outer').fillna(0)
Out[20]:
2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | |
---|---|---|---|---|---|---|---|---|---|---|
Location | ||||||||||
Afghanistan | 64.023 | 61.640 | 59.367 | 57.170 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
Albania | 11.803 | 10.807 | 9.943 | 9.267 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
Antigua and Barbuda | 0.000 | 0.000 | 0.000 | 7.420 | 7.070 | 6.757 | 6.483 | 6.230 | 6.000 | 5.783 |
Argentina | 0.000 | 0.000 | 0.000 | 11.283 | 10.733 | 10.203 | 9.683 | 9.177 | 8.680 | 8.227 |
Armenia | 0.000 | 0.000 | 0.000 | 14.170 | 13.477 | 12.817 | 12.183 | 11.583 | 11.007 | 10.497 |
Australia | 0.000 | 0.000 | 0.000 | 3.467 | 3.343 | 3.253 | 3.183 | 3.137 | 3.090 | 3.047 |
Austria | 0.000 | 0.000 | 0.000 | 3.210 | 3.113 | 3.043 | 2.987 | 2.943 | 2.897 | 2.843 |
Q94. df5과 df6 데이터를 하나의 데이터 프레임으로 merge함수를 이용하여 합쳐라. Algeria컬럼을 key로 하고 두 데이터 모두 포함하는 데이터만 출력하라
In [21]:
df5 = df.T.iloc[:7,:3]
df6 = df.T.iloc[6:,2:5]
display(df5)
display(df6)
Location | Afghanistan | Albania | Algeria |
---|---|---|---|
2010 | 64.023 | 11.803 | 23.540 |
2011 | 61.640 | 10.807 | 22.907 |
2012 | 59.367 | 9.943 | 22.450 |
2013 | 57.170 | 9.267 | 22.117 |
2014 | 55.080 | 8.790 | 21.850 |
2015 | 53.107 | 8.493 | 21.587 |
2016 | 51.267 | 8.363 | 21.257 |
Location | Algeria | Andorra | Angola |
---|---|---|---|
2016 | 21.257 | 3.187 | 55.510 |
2017 | 20.850 | 3.060 | 53.460 |
2018 | 20.407 | 2.933 | 51.757 |
2019 | 19.930 | 2.827 | 50.093 |
In [24]:
pd.merge(df5, df6, on = 'Algeria', how = 'inner')
Out[24]:
Location | Afghanistan | Albania | Algeria | Andorra | Angola |
---|---|---|---|---|---|
0 | 51.267 | 8.363 | 21.257 | 3.187 | 55.51 |
Q95. df5과 df6 데이터를 하나의 데이터 프레임으로 merge함수를 이용하여 합쳐라. Algeria컬럼을 key로 하고 합집합으로 합쳐라
In [25]:
pd.merge(df5, df6, on = 'Algeria', how = 'outer')
Out[25]:
Location | Afghanistan | Albania | Algeria | Andorra | Angola |
---|---|---|---|---|---|
0 | 64.023 | 11.803 | 23.540 | NaN | NaN |
1 | 61.640 | 10.807 | 22.907 | NaN | NaN |
2 | 59.367 | 9.943 | 22.450 | NaN | NaN |
3 | 57.170 | 9.267 | 22.117 | NaN | NaN |
4 | 55.080 | 8.790 | 21.850 | NaN | NaN |
5 | 53.107 | 8.493 | 21.587 | NaN | NaN |
6 | 51.267 | 8.363 | 21.257 | 3.187 | 55.510 |
7 | NaN | NaN | 20.850 | 3.060 | 53.460 |
8 | NaN | NaN | 20.407 | 2.933 | 51.757 |
9 | NaN | NaN | 19.930 | 2.827 | 50.093 |