Complex xml structure #7
-
Hi there! I started using this library because I need to process big XML files, from a few megabytes up to 50 GB.
This is more or less the structure: every element has 1 to N children, so for example one FREGESIA can have multiple LOCALIDADE elements. What I need to achieve is to process every CPE as an item. If a CPE sits in the deepest level, like CPE 9999, I would like to get the names of its parents: "distrito", "concelho", "freguesia" and "localidade". But the first CPE, "123", is located directly inside DISTRITO, so no concelho/freguesia/localidade information is available for it; it only has distrito information and the rest would be null. I'm not sure where to start and which element to handle first. Thanks in advance, and I hope I was clear enough for you to understand this.
-
Hello, that's an interesting use case!
For a possible solution, let me assume that you want your results to be returned as dataclass instances. Let's use the following code to define the dataclass:

from dataclasses import dataclass
from typing import Optional

@dataclass
class Cpe:
    value: str
    distrito: Optional[str] = None
    concelho: Optional[str] = None
    # spelled like the FREGESIA tag, so that node.name.lower() matches it
    fregesia: Optional[str] = None
    localidade: Optional[str] = None
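As a quick check of how the defaults behave, here is a minimal sketch. The values "123" and "Lisboa" are made-up sample data, not taken from your file. Any level that is never set simply stays None, which is the "rest would be null" behaviour you described:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Cpe:
    value: str
    distrito: Optional[str] = None
    concelho: Optional[str] = None
    fregesia: Optional[str] = None
    localidade: Optional[str] = None


# Hypothetical CPE found directly under a DISTRITO node:
# only the distrito level is known, the other levels keep their default.
cpe = Cpe("123", distrito="Lisboa")
print(cpe)
# Cpe(value='123', distrito='Lisboa', concelho=None, fregesia=None, localidade=None)
```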
Now what I will do is to create a generic handler that will handle nodes of name DISTRITO, CONCELHO, FREGESIA and LOCALIDADE. When a Cpe instance comes back from inside such a node, the handler sets on it the attribute matching the node's name. And when the "nome" node (DISTRITO_NOME, CONCELHO_NOME, FREGUESIA_NOME or LOCALIDADE_NOME) is seen, its text is kept for the Cpe instances that follow. The code looks like this:

from bigxml import xml_handle_element

@xml_handle_element("CPE")
def handle_cpe(node):
    yield Cpe(node.text)

@xml_handle_element("DISTRITO")
@xml_handle_element("CONCELHO")
@xml_handle_element("FREGESIA")
@xml_handle_element("LOCALIDADE")
def handler(node):
    nome = None
    for item in node.iter_from(
        # the following is a shortcut to say that we want item to be a node of that name
        # https://bigxml.rogdham.net/handlers/#syntactic-sugar
        "FREGUESIA_NOME" if node.name == "FREGESIA" else f"{node.name}_NOME",
        # if we see a CPE tag, item will be a Cpe instance
        handle_cpe,
        # recursive call in which we only will get Cpe instance for item
        handler,
    ):
        if isinstance(item, Cpe):
            # Cpe instance, coming from whatever level of recursion
            setattr(item, node.name.lower(), nome)
            yield item
        else:
            # "nome" node
            nome = item.text

And finally, let's parse the XML, for example like so:

from bigxml import Parser

with open("filename.xml", "rb") as f:
    for cpe in Parser(f).iter_from(handler):
        print(cpe)

It prints one Cpe instance per CPE element; the levels a CPE is not nested in stay None.
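To make the recursion easier to picture, here is a minimal sketch of the same "stamp each Cpe on its way up" idea in plain Python, without bigxml. Everything in it is an assumption for illustration: the nested tuples stand in for the XML tree, walk is a hypothetical helper, and the place names are made-up sample data.

```python
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass
class Cpe:
    value: str
    distrito: Optional[str] = None
    concelho: Optional[str] = None
    fregesia: Optional[str] = None
    localidade: Optional[str] = None


# Hypothetical in-memory stand-in for the XML: (level, nome, children),
# where each child is either a CPE value (str) or another nested tuple.
TREE = ("distrito", "Lisboa", [
    "123",
    ("concelho", "Sintra", [
        ("fregesia", "Colares", [
            ("localidade", "Azenhas do Mar", ["9999"]),
        ]),
    ]),
])


def walk(level: str, nome: str, children: list) -> Iterator[Cpe]:
    for child in children:
        # A plain string plays the role of a CPE node,
        # a tuple plays the role of a nested level (recursive call).
        items = [Cpe(child)] if isinstance(child, str) else walk(*child)
        for cpe in items:
            # Every level stamps its own name on each Cpe passing through.
            setattr(cpe, level, nome)
            yield cpe


results = list(walk(*TREE))
for cpe in results:
    print(cpe)
```

One difference worth keeping in mind: in the bigxml handler, nome is only known once the "nome" node has been read, so a CPE that appears before the "nome" node of its parent would get None for that level.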
Of course you are then free to do whatever you want with the Cpe instances. Alternatively, we could go with a more descriptive approach using classes, like the following. There are quite a few duplicated pieces of code, but sometimes explicit is easier to understand than clever.
from bigxml import xml_handle_element, Parser
from dataclasses import dataclass
from typing import Optional

@dataclass
class Cpe:
    value: str
    distrito: Optional[str] = None
    concelho: Optional[str] = None
    fregesia: Optional[str] = None
    localidade: Optional[str] = None

@xml_handle_element("LOCALIDADE")
class Localidade:
    nome = None

    @xml_handle_element("LOCALIDADE_NOME")
    def handle_nome(self, node):
        self.nome = node.text

    @xml_handle_element("CPE")
    def handle_cpe(self, node):
        yield Cpe(node.text)

    def xml_handler(self, iterator):
        # stamp this level's name on every Cpe found below it
        for cpe in iterator:
            cpe.localidade = self.nome
            yield cpe

@xml_handle_element("FREGESIA")
class Fregesia:
    nome = None

    @xml_handle_element("FREGUESIA_NOME")
    def handle_nome(self, node):
        self.nome = node.text

    @xml_handle_element("CPE")
    def handle_cpe(self, node):
        yield Cpe(node.text)

    # nested level: delegate LOCALIDADE nodes to the class above
    handle_localidade = Localidade

    def xml_handler(self, iterator):
        for cpe in iterator:
            cpe.fregesia = self.nome
            yield cpe

@xml_handle_element("CONCELHO")
class Concelho:
    nome = None

    @xml_handle_element("CONCELHO_NOME")
    def handle_nome(self, node):
        self.nome = node.text

    @xml_handle_element("CPE")
    def handle_cpe(self, node):
        yield Cpe(node.text)

    handle_fregesia = Fregesia

    def xml_handler(self, iterator):
        for cpe in iterator:
            cpe.concelho = self.nome
            yield cpe

@xml_handle_element("DISTRITO")
class Distrito:
    nome = None

    @xml_handle_element("DISTRITO_NOME")
    def handle_nome(self, node):
        self.nome = node.text

    @xml_handle_element("CPE")
    def handle_cpe(self, node):
        yield Cpe(node.text)

    handle_concelho = Concelho

    def xml_handler(self, iterator):
        for cpe in iterator:
            cpe.distrito = self.nome
            yield cpe

with open("filename.xml", "rb") as f:
    # Distrito is the entry point here; the generic handler function
    # from the first example is not defined in this script
    for cpe in Parser(f).iter_from(Distrito):
        print(cpe)

I hope those examples give you the right pointers to move forward, tell me what you think!