@jreback
Contributor

@jreback jreback commented Nov 12, 2014

This is implemented by storing the codes directly in the table, with the categories stored in a metadata VLArray.
Query and appending work as expected. The only quirk is that I don't allow you to append to a table unless the new data has exactly the same categories. Otherwise the codes become meaningless.
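
To illustrate why mismatched categories would make the stored codes meaningless, here is a small sketch using plain pandas (no HDF5 involved). The `can_append` helper is hypothetical: it only mirrors the rule described above and is not the check used in this PR.

```python
import pandas as pd

# Two categorical columns with different category sets.
a = pd.Series(list("abca")).astype("category")  # categories: a, b, c
b = pd.Series(list("bcdb")).astype("category")  # categories: b, c, d

# Both produce the same integer codes, but code 0 means 'a' in one
# column and 'b' in the other.
print(list(a.cat.categories), a.cat.codes.tolist())
print(list(b.cat.categories), b.cat.codes.tolist())


def can_append(existing, new):
    """Hypothetical check: codes are only comparable when the
    categories match exactly, including their order."""
    return list(existing.cat.categories) == list(new.cat.categories)


print(can_append(a, b))
```

If the codes of `b` were appended under the categories of `a`, every value would silently decode to the wrong label, which is why the append is refused.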

This has the nice property of drastically shrinking the storage cost compared to regular strings (which are stored as fixed width of the maximum for that particular column).
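
A rough back-of-the-envelope sketch of that saving, using NumPy only. It counts raw column bytes and ignores everything else in a real HDF5 file (index, other columns, chunking, compression), so the numbers are illustrative rather than a prediction of file size. The sample strings are assumptions chosen for the illustration.

```python
import numpy as np

# 50,000 short strings, one of which is much longer than the others.
values = np.array(["a", "foo", "bar", "a really long string", "baz"] * 10000)

# Fixed-width storage: every item takes as many bytes as the longest string.
fixed = values.astype("S")
fixed_bytes = fixed.nbytes

# Categorical-style storage: one small integer code per row,
# plus the unique categories stored once.
categories, codes = np.unique(fixed, return_inverse=True)
codes = codes.astype(np.int8)
cat_bytes = codes.nbytes + categories.nbytes

print(fixed.dtype.itemsize, fixed_bytes, cat_bytes)
```

The longer the longest string, the larger the gap, because the fixed-width column pays the maximum width on every row while the codes stay at one byte each.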

I bumped the actual HDF5 storage version to current (was 0.10.1). It's not strictly necessary as this is a completely optional feature, but I am adding the sub-group space 'meta' (which FYI we can use for other things, e.g. to store the column labels and avoid the 64KB limit in attrs; there is an issue about this somewhere).

In [14]: df = DataFrame({'a' : Series(list('abccdef')).astype('category'), 'b' : np.random.randn(7)})

In [15]: df
Out[15]:
   a         b
0  a -0.094609
1  b -1.814638
2  c  0.214974
3  c -0.195395
4  d  0.206022
5  e  1.130589
6  f -0.832810

In [19]: store = pd.HDFStore('test.h5',mode='w')

In [20]: store.append('df',df,data_columns=['a'])

In [21]: store.select('df',where=["a in ['b','d']"])
Out[21]:
   a         b
1  b -1.814638
4  d  0.206022

In [22]: store.select('df',where=["a in ['b','d']"]).dtypes
Out[22]:
a    category
b     float64
dtype: object

In [25]: store.get_storer('df').group
Out[25]:
/df (Group) u''
  children := ['table' (Table), 'meta' (Group)]

In [26]: store.get_storer('df').group.table
Out[26]:
/df/table (Table(7,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": Float64Col(shape=(1,), dflt=0.0, pos=1),
  "a": Int8Col(shape=(), dflt=0, pos=2)}
  byteorder := 'little'
  chunkshape := (3855,)
  autoindex := True
  colindexes := {
    "a": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "index": Index(6, medium, shuffle, zlib(1)).is_csi=False}

In [27]: store.get_storer('df').group.meta
Out[27]:
/df/meta (Group) u''
  children := ['a' (VLArray)]

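Conceptually, a query on category labels against a table that stores only integer codes has to translate the labels into codes first. The following is a NumPy sketch of that idea under stated assumptions: it is not the code path pandas or PyTables actually uses, and the arrays simply reproduce the column `a` from the session above.

```python
import numpy as np

# What the table stores for column 'a' (codes) and what the metadata
# array stores (categories), for the values a, b, c, c, d, e, f.
categories = np.array(list("abcdef"))
codes = np.array([0, 1, 2, 2, 3, 4, 5], dtype=np.int8)

# Query: a in ['b', 'd']
wanted = ["b", "d"]

# Step 1: map the requested labels to their integer codes.
wanted_codes = np.flatnonzero(np.isin(categories, wanted))

# Step 2: filter the stored codes, then decode the matching rows.
mask = np.isin(codes, wanted_codes)
rows = np.flatnonzero(mask)
labels = categories[codes[mask]]

print(rows.tolist(), labels.tolist())
```

The selected row positions (1 and 4) match the index of the result shown in `Out[21]`.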
@jankatins jankatins mentioned this pull request Nov 12, 2014
4 tasks
@jreback
Contributor Author

jreback commented Nov 12, 2014

@jreback jreback added Categorical Categorical Data Type IO HDF5 read_hdf, HDFStore labels Nov 12, 2014
@jreback jreback added this to the 0.15.2 milestone Nov 12, 2014
@jreback jreback force-pushed the cat_hdf branch 3 times, most recently from 6e25082 to 199de84 Compare November 13, 2014 11:33
@jankatins
Contributor

Not sure what to say here: I've no expertise in PyTables, sorry... :-/

@bashtage
Contributor

Does using VLArray affect performance with compression? Fixed-length strings can be compressed, while I think VLArray data cannot.

@jreback
Contributor Author

jreback commented Nov 14, 2014

@bashtage I actually changed this back to a regular Array, really for more 'visibility', e.g. you can actually inspect these objects, whereas VLArray objects get pickled. I don't really think there is any actual perf issue. This is just a single array of the categories, and compared to the size of a table it is usually much, much smaller.

@jreback
Contributor Author

jreback commented Nov 14, 2014

This allows future expandability, because the array can then be 2-d, for example.
cc @shoyer

In [2]: s = Series(list('aabbcdedfab')).astype('category').to_hdf('test.h5','s',mode='w',format='table')

In [3]: !ptdump -avd test.h5
/ (RootGroup) ''
  /._v_attrs (AttributeSet), 4 attributes:
   [CLASS := 'GROUP',
    PYTABLES_FORMAT_VERSION := '2.1',
    TITLE := '',
    VERSION := '1.0']
/s (Group) ''
  /s._v_attrs (AttributeSet), 15 attributes:
   [CLASS := 'GROUP',
    TITLE := '',
    VERSION := '1.0',
    data_columns := ['values'],
    encoding := None,
    index_cols := [(0, 'index')],
    info := {1: {'type': 'Index', 'names': [None]}, 'values': {'ordered': True}, 'index': {}},
    levels := 1,
    metadata := ['values'],
    nan_rep := 'nan',
    non_index_axes := [(1, ['values'])],
    pandas_type := 'series_table',
    pandas_version := '0.15.2',
    table_type := 'appendable_series',
    values_cols := ['values']]
/s/table (Table(11,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values": Int8Col(shape=(), dflt=0, pos=1)}
  byteorder := 'little'
  chunkshape := (7281,)
  autoindex := True
  colindexes := {
    "index": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "values": Index(6, medium, shuffle, zlib(1)).is_csi=False}
  /s/table._v_attrs (AttributeSet), 11 attributes:
   [CLASS := 'TABLE',
    FIELD_0_FILL := 0,
    FIELD_0_NAME := 'index',
    FIELD_1_FILL := 0,
    FIELD_1_NAME := 'values',
    NROWS := 11,
    TITLE := '',
    VERSION := '2.7',
    index_kind := 'integer',
    values_dtype := 'category',
    values_kind := ['values']]
  Data dump:
[0] (0, 0)
[1] (1, 0)
[2] (2, 1)
[3] (3, 1)
[4] (4, 2)
[5] (5, 3)
[6] (6, 4)
[7] (7, 3)
[8] (8, 5)
[9] (9, 0)
[10] (10, 1)
/s/meta (Group) ''
  /s/meta._v_attrs (AttributeSet), 3 attributes:
   [CLASS := 'GROUP',
    TITLE := '',
    VERSION := '1.0']
/s/meta/values (Array(6,)) ''
  atom := StringAtom(itemsize=1, shape=(), dflt='')
  maindim := 0
  flavor := 'numpy'
  byteorder := 'irrelevant'
  chunkshape := None
  /s/meta/values._v_attrs (AttributeSet), 5 attributes:
   [CLASS := 'ARRAY',
    FLAVOR := 'numpy',
    TITLE := '',
    VERSION := '2.4',
    kind := 'string']
  Data dump:
[0] a
[1] b
[2] c
[3] d
[4] e
[5] f

@bashtage
Contributor

That change makes sense. And with compression large chunks of whitespace might be less of an issue anyway.

@jreback
Contributor Author

jreback commented Nov 14, 2014

For example (the saving is actually a function of the max_length of the strings stored):

In [27]: df = DataFrame({'A' : np.random.randn(5), 'B' : Series(['a','foo','bar','a really long string','baz'])})

In [28]: df_cat = df.copy()

In [29]: df_cat['B'] = df_cat['B'].astype('category')

In [30]: pd.concat([df]*10000).to_hdf('test1.h5','df',mode='w',format='table')

In [31]: pd.concat([df_cat]*10000).to_hdf('test_cat.h5','df',mode='w',format='table')

In [33]: !ls -ltr *.h5
-rw-rw-r-- 1 jreback staff 1876493 Nov 14 10:02 test1.h5
-rw-rw-r-- 1 jreback staff  895756 Nov 14 10:02 test_cat.h5

@jreback
Contributor Author

jreback commented Nov 14, 2014

Member

store -> stored

@jorisvandenbossche
Member

Is this a format change? What will happen if someone wants to read with an older version of pandas an hdf file that is saved with 0.15.2 (or was such a thing never supported?)

@jreback
Contributor Author

jreback commented Nov 15, 2014

Ok, I updated to make this more explicit. It is now backwards AND forwards compatible, in that you can read a file written by >=0.15.2 in a prior version.

You will get the codes in the table (as that is how they are stored).
The categories are now stored as a regular pathed array, so they can also be retrieved.
So it is lossless in a forward way (but requires the user to use them, as the Categorical type did not exist prior to 0.15.0).

The following is in 0.15.2:

In [1]: dfc = DataFrame({ 'A' : Series(list('aabbcdba')).astype('category'),
   ...:                   'B' : np.random.randn(8) })

In [2]: store = pd.HDFStore('test.h5', mode='w')

In [3]: store.append('df', dfc, format='table', data_columns=['A'])

In [4]: result = store.select('df', where="A in ['b','c']")

In [5]: result
Out[5]:
   A         B
2  b  0.259910
3  b -0.489301
4  c -1.681019
6  b -2.147062

In [6]: result.dtypes
Out[6]:
A    category
B     float64
dtype: object

In [7]: store
Out[7]:
<class 'pandas.io.pytables.HDFStore'>
File path: test.h5
/df                        frame_table  (typ->appendable,nrows->8,ncols->2,indexers->[index],dc->[A])
/df/meta/A/meta            series       (shape->[1])

In [8]: store.select('df/meta/A/meta')
Out[8]:
0    a
1    b
2    c
3    d
dtype: object

and in 0.15.1 reading the same file

In [1]: store = pd.HDFStore('pandas/test.h5')

In [2]: store
Out[2]:
<class 'pandas.io.pytables.HDFStore'>
File path: pandas/test.h5
/df                        frame_table  (typ->appendable,nrows->8,ncols->2,indexers->[index],dc->[A])
/df/meta/A/meta            series       (shape->[1])

In [3]: store.select('df')
Out[3]:
   A         B
0  0 -0.906125
1  0  1.324821
2  1  0.259910
3  1 -0.489301
4  2 -1.681019
5  3  0.711411
6  1 -2.147062
7  0  0.797939

In [4]: store.select('df').dtypes
Out[4]:
A       int8
B    float64
dtype: object

In [5]: store.select('df/meta/A/meta')
Out[5]:
0    a
1    b
2    c
3    d
dtype: object

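For readers who end up with the raw codes and the separately stored categories, as in the 0.15.1 session above, the two pieces can be recombined by hand. A minimal sketch, assuming a pandas version that provides `Categorical.from_codes`; the lists are copied from the output above rather than read from a file.

```python
import pandas as pd

# Column 'A' as read by the older version: integer codes only.
codes = [0, 0, 1, 1, 2, 3, 1, 0]

# Contents of the 'df/meta/A/meta' array: the category labels.
categories = ["a", "b", "c", "d"]

# Rebuild the categorical column from the two pieces.
restored = pd.Categorical.from_codes(codes, categories=categories)

print(list(restored))
```

The result decodes back to the original values `a, a, b, b, c, d, b, a`, which is what makes the format lossless when read by an older version.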
Contributor Author

I already did explicity -> explicit (but haven't pushed yet)

@jorisvandenbossche
Member

You now added some docs to categorical.rst, but maybe also add (or refer to) something in io.rst#pytables?

@jreback
Contributor Author

jreback commented Nov 15, 2014

hmm ok sure

@jreback
Contributor Author

jreback commented Nov 15, 2014

ok, fixed up

@jreback
Contributor Author

jreback commented Nov 16, 2014

@jorisvandenbossche any further comments?

@jorisvandenbossche
Member

nope, no further comments! (but for the actual pytables interaction, I am not familiar with that)

jreback added a commit that referenced this pull request Nov 17, 2014
ENH: serialization of categorical to HDF5 (GH7621)
@jreback jreback merged commit e0680ec into pandas-dev:master Nov 17, 2014