When trying to get a hierarchical tree of all xpaths in a website (https://startpagina.nl) using Python, I first tried to get the xpaths of branches using: /html/body
:
from selenium import webdriver url = 'https://startpagina.nl' driver = webdriver.Firefox() driver.get(url) test = driver.find_elements_by_xpath('//*') print(len(test)) driver.close()
Based on @Prophet's answer, this will generate a list of all elements in the website. However, I haven't figured out how to get the xpath of these elements, nor how to sort them into a tree structure.
And the /html/body/div[6]
option generates a tree of length 1 instead.
Based on @Micheal Kay's answer, I tried "traversing xml" using the following Python code:
import requests from bs4 import BeautifulSoup import xml.etree.cElementTree as ET from lxml import etree unformatted_filename = "first.xml" formatted_filename = "first.xml" # Get XML from url. resp = requests.get("https://startpagina.nl") # resp = requests.get('https://stackoverflow.com') with open(unformatted_filename, "wb") as foutput: foutput.write(resp.content) # Improve XML formatting with open(unformatted_filename) as fp: soup = BeautifulSoup(fp, "xml") print(f"soup={soup}") with open(formatted_filename, "w") as f: f.write(soup.prettify()) # Parse XML tree = ET.parse(formatted_filename, parser=ET.XMLParser(encoding="utf-8")) root = tree.getroot() for child in root: child.tag, child.attrib tree = ET.parse(formatted_filename) for elem in tree.getiterator(): if elem.tag: print("my name:") print("\t" + elem.tag) if elem.text: print("my text:") print("\t" + (elem.text).strip()) if elem.attrib.items(): print("my attributes:") for key, value in elem.attrib.items(): print("\t" + "\t" + key + " : " + value) if list(elem): # use elem.getchildren() for python2.6 or before print("my no of child: %d" % len(list(elem))) else: print("No child") if elem.tail: print("my tail:") print("\t" + "%s" % elem.tail.strip()) print("$$$$$$$$$$")
However, I haven't figured out how to get the xpath of the individual elements.
So I want to ask:
How to use Python to get the tree of all xpaths in the website? (I wonder if the tree is cyclic, although I hope I will know once I figure out how to obtain the tree.).
Based on manually browsing HTML: I want the output to look like this:
| /html |-- //*[@id="browser-upgrade-notification"] |-- //*[@id="app"] |-- /html/head |-- /html/body |--/-- /html/body/noscript |--/-- /html/body/div[2] |--/-- /html/body/header/section |--/--/-- /html/body/header/section/div |--/--/--/-- /html/body/header/section/div/div[1] ....
This will be an example of a tree list.
The total number of XPaths that select one or more elements is unlimited (for example, it will include paths like
/a/b/../b/../b/../b
) , but if you restrict yourself to paths of the form/a[i]/b[j]/c[k]
, then the number of paths equals the number of elements, and the "tree" of XPaths is the same as the original XML Tree isomorphism.If you want different paths without numeric predicates, such as
/a/b/c
,/a/b/d
, then the easiest way is probably to iterate over the XML document , get the path of each element (in this form) and eliminate duplicates. If you want a tree structure instead of a simple list of paths, use nested maps/dictionaries to build it.The reason it complains about
/html/body/
is that a legal XPath expression cannot contain the trailing/
.