python - How can I generalize my pandas data grouping to more than 3 dimensions?
I'm using the excellent pandas package to deal with a large amount of varied meteorological diagnostic data, and I'm running out of dimensions as I stitch the data together. Looking at the documentation, it may be that using a MultiIndex would solve my problem, but I'm not sure how to apply it to my situation - the documentation shows examples of creating MultiIndexes from random data and DataFrames, not from Series of pre-existing timeseries data.
Background
The basic data structure I'm using contains two main fields:
- metadata, a dictionary consisting of key-value pairs describing what the numbers are
- data, a pandas data structure containing the numbers themselves

Since the lowest common denominator is timeseries data, the basic structure has a pandas Series object as the data entry, and the metadata field describes what the numbers are (e.g. vector RMS error for 10-meter wind over the eastern Pacific for the 24-hour forecast from experiment test1).
I'm looking at taking these lowest-common-denominator objects and gluing together various timeseries to make the results more useful and allow easy combinations. For instance, I may want to look at different lead times - I have a filter routine that will take timeseries sharing the same metadata entries except for lead time (e.g. experiment, region, etc.) and return a new object whose metadata field consists of only the common entries (i.e. lead time has been removed) and whose data field is a pandas DataFrame with the column labels given by the lead time values. I can extend this again if I want to take the resulting frames and group them with one entry varying (e.g. experiment) to give me a pandas Panel, with its item index given by the experiment metadata values of the constituent frames, and the new object's metadata containing neither lead time nor experiment.
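For reference, a rough sketch of that second grouping step, with made-up stand-in frames (pd.Panel was pandas' 3-D container at the time this was written):

    import pandas as pd

    idx = pd.date_range('2012-04-16', periods=3)
    # stand-ins for two glued frames (lead times across the columns)
    frame_test1 = pd.DataFrame({24: [0, 1, 2], 48: [3, 4, 5]}, index=idx)
    frame_test2 = pd.DataFrame({24: [6, 7, 8], 48: [9, 10, 11]}, index=idx)

    # the Panel's item axis becomes the experiment dimension
    panel = pd.Panel({'test1': frame_test1, 'test2': frame_test2})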
When I iterate over these composite objects, I have an iterseries routine for the frame and an iterframes routine for the panel that reconstruct the appropriate metadata/data pairing as they drop one dimension (i.e. each series from a frame with lead time varying across the columns will have the metadata of its parent plus the lead time field restored, with the value taken from the column label). This works great.
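A minimal sketch of what iterseries does, assuming the attribute names used by the from_list method shown in the edit below (this is an illustration, not the actual routine):

    def iterseries(self):
        """Yield (metadata, series) pairs, restoring the varying
        field from each column label."""
        for label in self.data.columns:
            metadata = self.metadata.copy()
            metadata[self.column_key] = label  # restore the dropped field
            yield metadata, self.data[label]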
Problem
I've run out of dimensions (up to a 3-D Panel) and I'm also not able to use things like dropna to remove empty columns once everything is aligned in a Panel (this has led to several bugs when plotting summary statistics). Reading about working with higher-dimensional data in pandas has led me to the MultiIndex and its use. I've tried the examples given in the documentation, but I'm still a little unclear how to apply them to my situation. Any direction would be useful. Ideally, I'd like to be able to:
- Combine Series-based data into a multi-indexed DataFrame along an arbitrary number of dimensions (this would be great - it would eliminate one call to create frames from series, and another to create panels from frames); see the sketch after this list
- Iterate over the resulting multi-indexed DataFrame, dropping a single dimension so I can reset the component metadata
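A sketch of the first point, assuming pd.concat accepts a dict with tuple keys (it builds the column MultiIndex from them); the series contents and level names here are made up:

    import pandas as pd

    idx = pd.date_range('2012-04-16', periods=3)
    series = {('test1', 24): pd.Series([0, 1, 2], index=idx),
              ('test1', 48): pd.Series([3, 4, 5], index=idx),
              ('test2', 24): pd.Series([6, 7, 8], index=idx)}

    # tuple keys become a two-level column MultiIndex in one call,
    # skipping the intermediate frame and panel steps
    df = pd.concat(series, axis=1, names=['experiment', 'lead_time'])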
Edit - add code sample
Wes McKinney's answer below was almost exactly what I needed - the issue was in the initial translation from the Series-backed storage objects to the DataFrame-backed objects I have to work with once I start grouping elements together. The DataFrame-backed class has the following method, which takes in a list of the Series-based objects and the metadata field that will vary across the columns.
    import itertools
    import pandas

    @classmethod
    def from_list(cls, results_list, column_key):
        """
        Populate object from a list of results that share metadata
        except for the field `column_key`.
        """
        # Need two copies of the input results - one for building the
        # object's data and one for building the object's metadata.
        for_data, for_metadata = itertools.tee(results_list)
        self = cls()
        self.column_key = column_key
        self.metadata = next(for_metadata).metadata.copy()
        if column_key in self.metadata:
            del self.metadata[column_key]
        # `transform` is defined elsewhere in the module
        self.data = pandas.DataFrame(dict((transform(r[column_key]), r.data)
                                          for r in for_data))
        return self
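A hypothetical call, for illustration only (the class name FrameResult and the variable series_results are made up; only the method above is from my code):

    # each element of series_results is a Series-backed object whose
    # metadata differs only in the 'lead_time' entry (hypothetical)
    fr = FrameResult.from_list(series_results, 'lead_time')
    fr.metadata   # the common entries, with 'lead_time' removed
    fr.data       # DataFrame with one column per lead time value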
Once I have the frame given by this routine, I can easily apply the various operations suggested below - of particular utility is being able to use the names field when I call concat, which eliminates the need to store the name of the column key internally, since it's stored in the MultiIndex as the name of that index dimension.
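For instance (a small made-up example; the lead-time values are arbitrary):

    import pandas as pd

    idx = pd.date_range('2012-04-16', periods=2)
    s24 = pd.Series([1.0, 2.0], index=idx)
    s48 = pd.Series([3.0, 4.0], index=idx)

    # `names` labels the key level, so the column key travels with the
    # frame instead of being stored separately on the object
    df = pd.concat([s24, s48], axis=1, keys=[24, 48], names=['lead_time'])
    df.columns.name   # 'lead_time'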
I'd like to be able to implement the solution below by taking in a list of the matching Series-backed classes and a list of keys and doing the grouping sequentially. However, I don't know what the columns will be representing ahead of time, so:
- it doesn't make sense to me to store the series data in a 1-D DataFrame
- I don't see how to set the names of the index and the columns for the initial series -> frame grouping (see the sketch after this list)
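On the second point, the axis names can also be assigned directly after the initial grouping (a minimal sketch; the names here are made up):

    import pandas as pd

    df = pd.DataFrame({24: [1.0, 2.0], 48: [3.0, 4.0]},
                      index=pd.date_range('2012-04-16', periods=2))
    df.columns.name = 'lead_time'   # name the column axis after the fact
    df.index.name = 'valid_time'    # and the row axis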
I might suggest using pandas.concat along with its keys argument to glue together the Series into DataFrames and create a MultiIndex in the columns:
    In [20]: data
    Out[20]:
    {'a': 2012-04-16    0
    2012-04-17    1
    2012-04-18    2
    2012-04-19    3
    2012-04-20    4
    2012-04-21    5
    2012-04-22    6
    2012-04-23    7
    2012-04-24    8
    2012-04-25    9
    Freq: D,
     'b': 2012-04-16    0
    2012-04-17    1
    2012-04-18    2
    2012-04-19    3
    2012-04-20    4
    2012-04-21    5
    2012-04-22    6
    2012-04-23    7
    2012-04-24    8
    2012-04-25    9
    Freq: D,
     'c': 2012-04-16    0
    2012-04-17    1
    2012-04-18    2
    2012-04-19    3
    2012-04-20    4
    2012-04-21    5
    2012-04-22    6
    2012-04-23    7
    2012-04-24    8
    2012-04-25    9
    Freq: D}

    In [21]: df = pd.concat(data, axis=1, keys=['a', 'b', 'c'])

    In [22]: df
    Out[22]:
                a  b  c
    2012-04-16  0  0  0
    2012-04-17  1  1  1
    2012-04-18  2  2  2
    2012-04-19  3  3  3
    2012-04-20  4  4  4
    2012-04-21  5  5  5
    2012-04-22  6  6  6
    2012-04-23  7  7  7
    2012-04-24  8  8  8
    2012-04-25  9  9  9

    In [23]: df2 = pd.concat([df, df], axis=1, keys=['group1', 'group2'])

    In [24]: df2
    Out[24]:
               group1        group2
                    a  b  c       a  b  c
    2012-04-16      0  0  0       0  0  0
    2012-04-17      1  1  1       1  1  1
    2012-04-18      2  2  2       2  2  2
    2012-04-19      3  3  3       3  3  3
    2012-04-20      4  4  4       4  4  4
    2012-04-21      5  5  5       5  5  5
    2012-04-22      6  6  6       6  6  6
    2012-04-23      7  7  7       7  7  7
    2012-04-24      8  8  8       8  8  8
    2012-04-25      9  9  9       9  9  9
You would then have:
    In [25]: df2['group2']
    Out[25]:
                a  b  c
    2012-04-16  0  0  0
    2012-04-17  1  1  1
    2012-04-18  2  2  2
    2012-04-19  3  3  3
    2012-04-20  4  4  4
    2012-04-21  5  5  5
    2012-04-22  6  6  6
    2012-04-23  7  7  7
    2012-04-24  8  8  8
    2012-04-25  9  9  9
Or even:
    In [27]: df2.xs('b', axis=1, level=1)
    Out[27]:
                group1  group2
    2012-04-16       0       0
    2012-04-17       1       1
    2012-04-18       2       2
    2012-04-19       3       3
    2012-04-20       4       4
    2012-04-21       5       5
    2012-04-22       6       6
    2012-04-23       7       7
    2012-04-24       8       8
    2012-04-25       9       9
You can have arbitrarily many levels:
    In [29]: pd.concat([df2, df2], axis=1, keys=['tier1', 'tier2'])
    Out[29]:
                tier1                 tier2
               group1      group2    group1      group2
                    a  b  c  a  b  c      a  b  c  a  b  c
    2012-04-16      0  0  0  0  0  0      0  0  0  0  0  0
    2012-04-17      1  1  1  1  1  1      1  1  1  1  1  1
    2012-04-18      2  2  2  2  2  2      2  2  2  2  2  2
    2012-04-19      3  3  3  3  3  3      3  3  3  3  3  3
    2012-04-20      4  4  4  4  4  4      4  4  4  4  4  4
    2012-04-21      5  5  5  5  5  5      5  5  5  5  5  5
    2012-04-22      6  6  6  6  6  6      6  6  6  6  6  6
    2012-04-23      7  7  7  7  7  7      7  7  7  7  7  7
    2012-04-24      8  8  8  8  8  8      8  8  8  8  8  8
    2012-04-25      9  9  9  9  9  9      9  9  9  9  9  9
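Continuing with df2 from above, iterating while dropping one dimension, and the dropna operation that the Panel lacked, both fall out naturally (a sketch):

    # one sub-frame per top-level key, with that column level removed
    for key in df2.columns.get_level_values(0).unique():
        sub = df2[key]   # columns a, b, c; the group level is gone

    # and empty columns can be dropped directly, unlike with Panel
    df2.dropna(axis=1, how='all')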