Several xml files to dataframe

mfernandes · Sep-20-2022, 03:46 PM

I have several XML files that I want to transform into a dataframe. Each XML file should become one row. Here is an example of an XML file:

    <?xml version='1.0' encoding='UTF-8'?>
    <compteRendu xmlns="http://schemas.assemblee-nationale.fr/referentiel">
      <uid>CRSANR5L15S2017E1N001</uid>
      <metadonnees>
        <day>04 june 2017</day>
      </metadonnees>
      <contenu>
        <point nivpoint="1" valeur_ptsodj="2" ordinal_prise="1" id_preparation="819547" ordre_absolu_seance="8" code_grammaire="TITRE_TEXTE_DISCUSSION" code_style="Titre" code_parole="" sommaire="1" id_syceron="981344" valeur="">
          <orateurs/>
          <texte>Déclaration de...</texte>
          <paragraphe valeur_ptsodj="2" ordinal_prise="1" id_preparation="819550" ordre_absolu_seance="11" id_acteur="PA345619" id_mandat="-1" id_nomination_oe="PM725692" id_nomination_op="-1" code_grammaire="DEBAT_1_10" code_style="NORMAL" code_parole="PAROLE_1_2" sommaire="1" id_syceron="981347" valeur="">
            <orateurs>
              <orateur>
                <name>M. Edouard Philippe</name>
              </orateur>
            </orateurs>
            <texte>Monsieur le président...</texte>
          </paragraphe>
        </point>
      </contenu>
    </compteRendu>

Here is my code:

    import os
    import xml.etree.ElementTree as ET
    import pandas as pd

    path = "whereIhavexmlfilessaved"

    # create a dict with first childs as key and descendants as values
    d = {'metadonnees':['day'],
         'contenu':['nom','texte']}

    # initialize two lists: `cols` and `data`
    cols, data = list(), list()
    df=pd.DataFrame()

    for filename in os.listdir(path):
        if filename.endswith('.xml'):
            tree = ET.parse(path+"/"+filename)
            root = tree.getroot()

            # loop through d.item
            for k, v in d.items():
                # find child
                child = root.find(f'{{*}}{k}')
                # use iter to check each descendant (`elem`)
                for elem in child.iter():
                    # get `tag_end` for each descendant,
                    # e.g. `texte` in "{http://schemas.assemblee-nationale.fr/referentiel}texte"
                    tag_end = elem.tag.split('}')[-1]
                    # check if `tag_end` in `v(alue)`
                    if tag_end in v:
                        # add `tag_end` and `elem.text` to appropriate list
                        cols.append(tag_end)
                        data.append(elem.text)

            dt = pd.DataFrame(data)

            # helper function to "increment" duplicate col names
            def f(lst):
                d = {}
                out = []
                for i in lst:
                    if i not in d:
                        out.append(i)
                        d[i] = 2
                    else:
                        out.append(i+str(d[i]))
                        d[i] += 1
                return out

            dt.columns = f(cols)
            df.append(dt)

My code only returns an empty dataframe. The original XML files are much longer. I should end up with a single "day" column, but several "name" and "text" columns. Not all XML files have exactly the same structure: for some the columns are day, text, name1, text1, ...; for others they are day, text, text1, name2, text2, ... Here is an example of the dataframe that I want to obtain:

    day           text               name1             text1                  name2  text2
    04 june 2017  Déclaration de...  Edouard Philippe  Monsieur le président  John   python cool
    05 june 2017  Hello world        NaN               World now              Mary   USA country
    ...

Could anyone help me improve my code?

**Larz60+** · Sep-20-2022, 06:25 PM

see https://pandas.pydata.org/docs/dev/refer...d_xml.html

mfernandes · Sep-20-2022, 07:58 PM

Thank you for the suggestion, but I already tried pd.read_xml(xml); I only get 3 columns: 'uid', 'day' and 'point'.
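
For anyone landing on this thread, here is a minimal sketch of one possible way to get one row per file. It is not a tested answer from the thread. Two things in the posted code likely explain the empty result: `df.append(dt)` returns a new DataFrame instead of modifying `df` (and `DataFrame.append` was removed in pandas 2.0), and `pd.DataFrame(data)` built from a flat list produces one column, not one row. The sketch collects one dict per file and builds the frame once at the end. The function names `xml_to_row` and `folder_to_frame` are hypothetical, the namespace is copied from the sample file, and it assumes that every element with a direct `<texte>` child counts as one "text" entry, with its speaker (if any) under `orateurs/orateur/name` as in the sample.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# Namespace taken from the sample file in the question.
NS = "{http://schemas.assemblee-nationale.fr/referentiel}"


def xml_to_row(root):
    """Flatten one <compteRendu> root element into a dict (one future row)."""
    row = {"day": root.findtext(f"{NS}metadonnees/{NS}day")}
    contenu = root.find(f"{NS}contenu")
    if contenu is None:
        return row

    index = 0
    # iter() walks <contenu> and all of its descendants in document order.
    for elem in contenu.iter():
        texte = elem.find(f"{NS}texte")  # direct child only
        if texte is None:
            continue
        # First entry is "text"; later ones are "text1", "text2", ...
        suffix = str(index) if index else ""
        # Assumption: the speaker sits under orateurs/orateur/name.
        name = elem.findtext(f"{NS}orateurs/{NS}orateur/{NS}name")
        if name is not None:
            row[f"name{suffix}"] = name.strip()
        row[f"text{suffix}"] = (texte.text or "").strip()
        index += 1
    return row


def folder_to_frame(folder):
    """Build one DataFrame with one row per .xml file in `folder`."""
    import pandas as pd  # imported here so xml_to_row works without pandas

    rows = [xml_to_row(ET.parse(p).getroot())
            for p in sorted(Path(folder).glob("*.xml"))]
    # A list of dicts gives one row per dict; keys missing in a file become NaN.
    return pd.DataFrame(rows)
```

Calling `folder_to_frame(path)` with the folder from the question should then give columns such as day, text, name1, text1, ..., with NaN where a file has no speaker for a given entry. As for `pd.read_xml`, its documentation describes it as best suited to shallow XML documents, which would be consistent with it stopping at 'uid', 'day' and 'point' here.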
