Python Forum
Several xml files to dataframe
#1
I have several XML files that I want to transform into a dataframe, with each XML file becoming one row. Here is an example of one of the files:
<?xml version='1.0' encoding='UTF-8'?>
<compteRendu xmlns="http://schemas.assemblee-nationale.fr/referentiel">
  <uid>CRSANR5L15S2017E1N001</uid>
  <metadonnees>
    <day>04 june 2017</day>
  </metadonnees>
  <contenu>
    <point nivpoint="1" valeur_ptsodj="2" ordinal_prise="1" id_preparation="819547" ordre_absolu_seance="8" code_grammaire="TITRE_TEXTE_DISCUSSION" code_style="Titre" code_parole="" sommaire="1" id_syceron="981344" valeur="">
      <orateurs/>
      <texte>Déclaration de...</texte>
      <paragraphe valeur_ptsodj="2" ordinal_prise="1" id_preparation="819550" ordre_absolu_seance="11" id_acteur="PA345619" id_mandat="-1" id_nomination_oe="PM725692" id_nomination_op="-1" code_grammaire="DEBAT_1_10" code_style="NORMAL" code_parole="PAROLE_1_2" sommaire="1" id_syceron="981347" valeur="">
        <orateurs>
          <orateur>
            <name>M. Edouard Philippe</name>
          </orateur>
        </orateurs>
        <texte>Monsieur le président...</texte>
      </paragraphe>
    </point>
  </contenu>
</compteRendu>
Here is my code:
import os
import xml.etree.ElementTree as ET
import pandas as pd

path = "whereIhavexmlfilessaved"

# create a dict with first children as keys and descendants as values
d = {'metadonnees': ['day'], 'contenu': ['nom', 'texte']}

# initialize two lists: `cols` and `data`
cols, data = list(), list()
df = pd.DataFrame()

for filename in os.listdir(path):
    if filename.endswith('.xml'):
        tree = ET.parse(path + "/" + filename)
        root = tree.getroot()

        # loop through d.items
        for k, v in d.items():
            # find child
            child = root.find(f'{{*}}{k}')
            # use iter to check each descendant (`elem`)
            for elem in child.iter():
                # get `tag_end` for each descendant,
                # e.g. `texte` in "{http://schemas.assemblee-nationale.fr/referentiel}texte"
                tag_end = elem.tag.split('}')[-1]
                # check if `tag_end` in `v(alue)`
                if tag_end in v:
                    # add `tag_end` and `elem.text` to appropriate list
                    cols.append(tag_end)
                    data.append(elem.text)

        dt = pd.DataFrame(data)

        # helper function to "increment" duplicate col names
        def f(lst):
            d = {}
            out = []
            for i in lst:
                if i not in d:
                    out.append(i)
                    d[i] = 2
                else:
                    out.append(i + str(d[i]))
                    d[i] += 1
            return out

        dt.columns = f(cols)
        df.append(dt)
My code only returns an empty dataframe. The original XML files are much longer. I should get only one "day" column, but several "name" and "text" columns. Not all XML files have exactly the same structure: for some the columns are day, text, name1, text1, ...; for others they are day, text, text1, name2, text2, ... Here is an example of the dataframe I want to obtain:
day           text               name1             text1                  name2  text2
04 june 2017  Déclaration de...  Edouard Philippe  Monsieur le président  John   python cool
05 june 2017  Hello world        NaN               World now              Mary   USA country
...
Could anyone help me improve my code?
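One likely reason the result stays empty: `DataFrame.append` never modified a frame in place. It returned a new object, which the loop above discards, and the method was removed in pandas 2.0. A minimal sketch of the usual alternative, collecting one small frame per file in a list and concatenating once at the end, is shown below. The `per_file_rows` values are hypothetical stand-ins for whatever is parsed out of each file.

```python
import pandas as pd

# Hypothetical per-file rows, standing in for values parsed from each XML file.
per_file_rows = [
    {"day": "04 june 2017", "text": "Déclaration de...", "name1": "M. Edouard Philippe"},
    {"day": "05 june 2017", "text": "Hello world"},
]

# Collect one single-row frame per file instead of calling df.append(...).
frames = []
for row in per_file_rows:
    frames.append(pd.DataFrame([row]))

# Concatenate once; columns missing from a given file are filled with NaN.
df = pd.concat(frames, ignore_index=True)
print(df)
```

Concatenating once at the end also avoids copying the growing frame on every iteration.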
#2
see https://pandas.pydata.org/docs/dev/refer...d_xml.html
#3
Thank you for your suggestion, but I already tried pd.read_xml(xml). I only get 3 columns: 'uid', 'day' and 'point'.
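For a nested layout like this, one option is to flatten each file into a dict by hand and let pandas build the rows afterwards. The sketch below uses only the standard library and rests on several assumptions: the column suffix follows the order in which elements holding a direct `texte` child appear, a speaker name is taken from the same parent element as its `texte`, and the tag is `name` as in the sample file (the posted dict lists `nom`). The function names `row_from_root` and `rows_from_folder` are hypothetical, and the sample is the posted file with its attributes trimmed.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def row_from_root(root):
    """Flatten one parsed compteRendu document into a dict (one future row)."""
    # `{*}` matches any namespace (supported by ElementTree since Python 3.8).
    row = {"day": root.findtext("{*}metadonnees/{*}day")}
    index = 0
    for elem in root.iter():
        texte = elem.find("{*}texte")  # direct child only
        if texte is None:
            continue
        # Assumption: first text is "text", later ones are "text1", "text2", ...
        suffix = "" if index == 0 else str(index)
        name = elem.findtext("{*}orateurs/{*}orateur/{*}name")
        if name is not None:
            row[f"name{suffix}"] = name.strip()
        row[f"text{suffix}"] = texte.text
        index += 1
    return row


def rows_from_folder(folder):
    """Return one dict per .xml file found directly inside `folder`."""
    return [row_from_root(ET.parse(p).getroot())
            for p in sorted(Path(folder).glob("*.xml"))]


# Trimmed copy of the sample file from the first post (attributes omitted).
SAMPLE = """<compteRendu xmlns="http://schemas.assemblee-nationale.fr/referentiel">
  <uid>CRSANR5L15S2017E1N001</uid>
  <metadonnees><day>04 june 2017</day></metadonnees>
  <contenu>
    <point>
      <orateurs/>
      <texte>Déclaration de...</texte>
      <paragraphe>
        <orateurs><orateur><name>M. Edouard Philippe</name></orateur></orateurs>
        <texte>Monsieur le président...</texte>
      </paragraphe>
    </point>
  </contenu>
</compteRendu>"""

row = row_from_root(ET.fromstring(SAMPLE))
print(row)
```

Passing the list from `rows_from_folder(path)` to `pd.DataFrame(...)` then gives one row per file, and keys that a file does not have show up as NaN in that row. Real files with several `point` elements may need a different numbering rule than the one assumed here.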