Skip to content

ENH: read_html to handle rowspan, colspan #17054

@jowens

Description

@jowens

Code Sample, a copy-pastable example if possible

import pandas as pd pd.read_html('https://www.ssa.gov/policy/docs/statcomps/supplement/2015/5h.html')[0]

This has complex table headings:

annual_statistical_supplement__2015_-_beneficiary_families_with_oasdi_benefits_in_current-payment_status__5_h_

read_html output begins with:

 Year Retired-worker families Survivor families Disabled-worker families Unnamed: 4_level_0 Unnamed: 5_level_0 Unnamed: 6_level_0 Unnamed: 7_level_0 Unnamed: 8_level_0 Unnamed: 9_level_0 Unnamed: 10_level_0 Unnamed: 11_level_0 \ Worker only Worker and wife?a Non-disabled widow only Widowed mother or father and? Worker only Worker, wife,?b and? Worker and spouse Unnamed: 7_level_1 Unnamed: 8_level_1 Unnamed: 9_level_1 Unnamed: 10_level_1 Unnamed: 11_level_1 All Men Women 1?child 2?children 3 or more children All Men Women 1?child 2 or more children Unnamed: 11_level_2 0 NaN Number?(thousands) NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1 1945 416 338 78 181 95 86.00 48.00 24.00 .?.?. .?.?. .?.?. 

(row 0 of the output is probably something one would have to manually eliminate)

Problem description

For HTML headings with rowspan and colspan elements, read_html has undesirable behavior. Basically read_html packs all heading <th> elements in any particular row to the left, so any particular column no longer has any association with the <th> elements that are actually above it in the HTML table.

Ample discussion here about the analogous pandas+Excel test case: #4679

Relevant web discussions:

This may be an issue with the underlying parsers and cannot be solved well in pandas. This appears to be the behavior with both lxml and bs4/html5lib.

Expected Output

Each column should be associated with the <th> elements above it in the table. This might be a multi-row column name (as it is now) (a MultiIndex?) or a tuple (presumably if the argument tupleize_cols is set to True). Instead, currently, column n is associated with the n th <th> entry in the table row regardless of the settings of rowspan/colspan.

It may be this is possible to do properly in current pandas in which case I apologize for filing the issue (but I'd be happy to know how to do it).

Output of pd.show_versions()

INSTALLED VERSIONS ------------------ commit: None python: 2.7.13.final.0 python-bits: 64 OS: Darwin OS-release: 16.7.0 machine: x86_64 processor: i386 byteorder: little LC_ALL: None LANG: en_US.US-ASCII LOCALE: None.None

pandas: 0.20.3
pytest: None
pip: 9.0.1
setuptools: 36.2.0
Cython: 0.26
numpy: 1.13.1
scipy: 0.19.1
xarray: None
IPython: 5.3.0
sphinx: 1.6.3
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.7
xlrd: 1.0.0
xlwt: None
xlsxwriter: None
lxml: 3.7.3
bs4: 4.5.3
html5lib: 1.0b10
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

Metadata

Metadata

Assignees

No one assigned

    Labels

    EnhancementIO HTMLread_html, to_html, Styler.apply, Styler.applymap

    Type

    No type

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions