python - How can I generalize my pandas data grouping to more than 3 dimensions?


I'm using the excellent pandas package to deal with a large amount of varied meteorological diagnostic data, and I'm running out of dimensions as I stitch the data together. Looking at the documentation, it seems that using a MultiIndex may solve my problem, but I'm not sure how to apply it to my situation - the documentation shows examples of creating MultiIndexes from random data and DataFrames, but not from Series holding pre-existing timeseries data.

Background

The basic data structure I'm using contains two main fields:

  • metadata, a dictionary consisting of key-value pairs describing what the numbers are
  • data, a pandas data structure containing the numbers themselves.

The lowest common denominator is timeseries data, so the basic structure has a pandas Series object as the data entry, and the metadata field describes what those numbers are (e.g. vector RMS error of the 10-meter wind over the eastern Pacific for a 24-hour forecast from experiment Test1).
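For concreteness, a minimal sketch of one such metadata/data pairing might look like the following (the DiagnosticResult name and the specific metadata keys are hypothetical, invented only to illustrate the structure described above):

    import pandas as pd

    # Hypothetical container: a plain metadata dict paired with a pandas Series.
    class DiagnosticResult:
        def __init__(self, metadata, data):
            self.metadata = metadata  # dict of key-value pairs describing the numbers
            self.data = data          # pandas Series indexed by time

    result = DiagnosticResult(
        metadata={'variable': '10m_wind_rms_error',
                  'region': 'eastern_pacific',
                  'lead_time': 24,
                  'experiment': 'Test1'},
        data=pd.Series(range(10),
                       index=pd.date_range('2012-04-16', periods=10, freq='D')),
    )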

I'm looking at taking this lowest common denominator and gluing the various timeseries together to make the results more useful and allow easy combinations. For instance, I may want to look at different lead times: I have a filter routine that takes timeseries sharing the same metadata entries except for lead time (e.g. experiment, region, etc.) and returns a new object whose metadata field consists of the common entries (i.e. lead time has been removed) and whose data field is a pandas DataFrame with column labels given by the lead time values. I can extend this again: I may want to take the resulting frames and group them with only one entry varying (e.g. experiment) to give me a pandas Panel, where the item index is given by the experiment metadata values of the constituent frames and the object's new metadata contains neither lead time nor experiment.
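A rough sketch of that series -> frame grouping step, assuming each result exposes a metadata dict and a data Series as in the sketch above (the function name and fields here are made up for illustration, not the actual filter routine):

    import pandas as pd

    def group_by_key(results, column_key):
        """Glue series that differ only in `column_key` into one DataFrame.

        Illustrative only: assumes each result has .metadata (dict) and .data (Series).
        """
        # Common metadata = parent metadata minus the varying key
        common = dict(results[0].metadata)
        common.pop(column_key, None)
        # One column per value of the varying key (e.g. lead time)
        frame = pd.DataFrame({r.metadata[column_key]: r.data for r in results})
        return common, frame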

When I iterate over these composite objects, I have an iterseries routine for the frame and an iterframes routine for the panel that reconstruct the appropriate metadata/data pairing as I drop one dimension (i.e. a series taken from a frame with lead time varying across the columns gets the metadata of its parent plus the lead time field restored, with the value taken from the column label). This works great.
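The matching iteration step might look roughly like this (purely illustrative; it just restores the dropped key from the column label, as described above):

    def iterseries(common_metadata, frame, column_key):
        """Yield (metadata, Series) pairs, restoring `column_key` from each column label."""
        for label in frame.columns:
            metadata = dict(common_metadata)
            metadata[column_key] = label   # e.g. lead_time = 24
            yield metadata, frame[label]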

Problem

I've run out of dimensions (up to the 3-D Panel), and I'm not able to use things like dropna to remove empty columns once everything is aligned in a Panel (this has led to several bugs when plotting summary statistics). Reading about using pandas with higher-dimensional data led me to MultiIndex and its use. I've tried the examples given in the documentation, but I'm still a little unclear on how to apply them to my situation. Any direction would be useful. Ideally, I'd like to be able to:

  • combine Series-based data into a multi-indexed DataFrame along an arbitrary number of dimensions (this would be great - it would eliminate one call to create frames from series and another to create panels from frames)
  • iterate over the resulting multi-indexed DataFrame, dropping a single dimension so I can reset the component metadata.

Edit - add code sample

Wes McKinney's answer below is almost what I need - the issue is in the initial translation from the Series-backed storage objects I have to work with to the DataFrame-backed objects I use once I start grouping elements together. The DataFrame-backed class has the following method, which takes in a list of the Series-based objects and the metadata field that will vary across the columns.

    @classmethod
    def from_list(cls, results_list, column_key):
        """
        Populate the object from a list of results that share all metadata
        except for the field `column_key`.
        """
        # Need two copies of the input results - one for building the object's
        # data and one for building the object's metadata
        for_data, for_metadata = itertools.tee(results_list)

        self            = cls()
        self.column_key = column_key
        self.metadata   = next(for_metadata).metadata.copy()
        if column_key in self.metadata:
            del self.metadata[column_key]
        self.data = pandas.DataFrame(dict((transform(r[column_key]), r.data)
                                          for r in for_data))
        return self

Once I have a frame from this routine, I can apply the various operations suggested below. Of particular utility is being able to use the names field when I call concat - it eliminates the need to store the name of the column key internally, since it is stored in the MultiIndex as the name of that index dimension.
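As a small illustration of that point (series_by_lead below is a hypothetical {lead_time: Series} mapping, not part of the actual classes):

    import pandas as pd

    idx = pd.date_range('2012-04-16', periods=3, freq='D')
    series_by_lead = {24: pd.Series([0, 1, 2], index=idx),
                      48: pd.Series([3, 4, 5], index=idx)}

    # The names argument labels the new column level, so the varying key no
    # longer has to be carried separately on the wrapper object.
    df = pd.concat(series_by_lead, axis=1, names=['lead_time'])
    print(df.columns.names)   # expected: ['lead_time']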

I'd like to be able to implement the solution below and just take in a list of the matching Series-backed classes plus a list of keys, then do the grouping sequentially. However, I don't know what the columns will be representing ahead of time, so:

  • it doesn't make sense to me to store the series data in a 1-D DataFrame
  • I don't see how to set the names of the index and columns for the initial series -> frame grouping (see the sketch below)
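One possible way to cover both points when the varying key is only known at runtime is to pass it straight through to concat as the level name. This is only a sketch - group_series and its arguments are invented here:

    import pandas as pd

    def group_series(series_by_key, column_key, index_name='time'):
        """Glue a {key_value: Series} mapping into a DataFrame whose single
        column level is named after `column_key` (illustrative sketch)."""
        frame = pd.concat(series_by_key, axis=1, names=[column_key])
        frame.index.name = index_name
        return frame

Calling the same thing again on a dict of such frames (e.g. keyed by experiment) should stack another named level on top, so the grouping can be repeated for however many keys turn up at runtime.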

Answer

I might suggest using pandas.concat along with its keys argument to glue the Series together into DataFrames and create a MultiIndex in the columns:

    In [20]: data
    Out[20]:
    {'a': 2012-04-16    0
    2012-04-17    1
    2012-04-18    2
    2012-04-19    3
    2012-04-20    4
    2012-04-21    5
    2012-04-22    6
    2012-04-23    7
    2012-04-24    8
    2012-04-25    9
    Freq: D,
     'b': 2012-04-16    0
    2012-04-17    1
    2012-04-18    2
    2012-04-19    3
    2012-04-20    4
    2012-04-21    5
    2012-04-22    6
    2012-04-23    7
    2012-04-24    8
    2012-04-25    9
    Freq: D,
     'c': 2012-04-16    0
    2012-04-17    1
    2012-04-18    2
    2012-04-19    3
    2012-04-20    4
    2012-04-21    5
    2012-04-22    6
    2012-04-23    7
    2012-04-24    8
    2012-04-25    9
    Freq: D}

    In [21]: df = pd.concat(data, axis=1, keys=['a', 'b', 'c'])

    In [22]: df
    Out[22]:
                a  b  c
    2012-04-16  0  0  0
    2012-04-17  1  1  1
    2012-04-18  2  2  2
    2012-04-19  3  3  3
    2012-04-20  4  4  4
    2012-04-21  5  5  5
    2012-04-22  6  6  6
    2012-04-23  7  7  7
    2012-04-24  8  8  8
    2012-04-25  9  9  9

    In [23]: df2 = pd.concat([df, df], axis=1, keys=['group1', 'group2'])

    In [24]: df2
    Out[24]:
                group1        group2
                     a  b  c       a  b  c
    2012-04-16       0  0  0       0  0  0
    2012-04-17       1  1  1       1  1  1
    2012-04-18       2  2  2       2  2  2
    2012-04-19       3  3  3       3  3  3
    2012-04-20       4  4  4       4  4  4
    2012-04-21       5  5  5       5  5  5
    2012-04-22       6  6  6       6  6  6
    2012-04-23       7  7  7       7  7  7
    2012-04-24       8  8  8       8  8  8
    2012-04-25       9  9  9       9  9  9

You would then have:

    In [25]: df2['group2']
    Out[25]:
                a  b  c
    2012-04-16  0  0  0
    2012-04-17  1  1  1
    2012-04-18  2  2  2
    2012-04-19  3  3  3
    2012-04-20  4  4  4
    2012-04-21  5  5  5
    2012-04-22  6  6  6
    2012-04-23  7  7  7
    2012-04-24  8  8  8
    2012-04-25  9  9  9

or even

    In [27]: df2.xs('b', axis=1, level=1)
    Out[27]:
                group1  group2
    2012-04-16       0       0
    2012-04-17       1       1
    2012-04-18       2       2
    2012-04-19       3       3
    2012-04-20       4       4
    2012-04-21       5       5
    2012-04-22       6       6
    2012-04-23       7       7
    2012-04-24       8       8
    2012-04-25       9       9

You can have arbitrarily many levels:

    In [29]: pd.concat([df2, df2], axis=1, keys=['tier1', 'tier2'])
    Out[29]:
                tier1                        tier2
                group1        group2         group1        group2
                     a  b  c       a  b  c        a  b  c       a  b  c
    2012-04-16       0  0  0       0  0  0        0  0  0       0  0  0
    2012-04-17       1  1  1       1  1  1        1  1  1       1  1  1
    2012-04-18       2  2  2       2  2  2        2  2  2       2  2  2
    2012-04-19       3  3  3       3  3  3        3  3  3       3  3  3
    2012-04-20       4  4  4       4  4  4        4  4  4       4  4  4
    2012-04-21       5  5  5       5  5  5        5  5  5       5  5  5
    2012-04-22       6  6  6       6  6  6        6  6  6       6  6  6
    2012-04-23       7  7  7       7  7  7        7  7  7       7  7  7
    2012-04-24       8  8  8       8  8  8        8  8  8       8  8  8
    2012-04-25       9  9  9       9  9  9        9  9  9       9  9  9
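Coming back to the iteration half of the question, one way to walk such a frame and drop a single column level again is to loop over that level's labels and select each sub-frame (a sketch against the df2 built above; the metadata bookkeeping is left out):

    # Iterate over the outer column level of df2; selecting with the label
    # returns a sub-frame whose columns are just the inner level ('a', 'b', 'c').
    for group_label in df2.columns.levels[0]:
        sub = df2[group_label]
        print(group_label)
        print(sub.head(2))

    # Selecting along an inner level instead works with xs, as shown above:
    # df2.xs('b', axis=1, level=1)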
