How to use a Python script to generate sitemap.xml

高洛峰
Release: 2017-02-04 11:51:41

Install lxml

First, install the lxml library with pip install lxml.

If you encounter the following error on Ubuntu:

#include "libxml/xmlversion.h"
compilation terminated.
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1
----------------------------------------
Cleaning up...
 Removing temporary dir /tmp/pip_build_root...
Command /usr/bin/python -c "import setuptools, tokenize;__file__='/tmp/pip_build_root/lxml/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /tmp/pip-O4cIn6-record/install-record.txt --single-version-externally-managed --compile failed with error code 1 in /tmp/pip_build_root/lxml
Exception information:
Traceback (most recent call last):
 File "/usr/lib/python2.7/dist-packages/pip/basecommand.py", line 122, in main
  status = self.run(options, args)
 File "/usr/lib/python2.7/dist-packages/pip/commands/install.py", line 283, in run
  requirement_set.install(install_options, global_options, root=options.root_path)
 File "/usr/lib/python2.7/dist-packages/pip/req.py", line 1435, in install
  requirement.install(install_options, global_options, *args, **kwargs)
 File "/usr/lib/python2.7/dist-packages/pip/req.py", line 706, in install
  cwd=self.source_dir, filter_stdout=self._filter_install, show_stdout=False)
 File "/usr/lib/python2.7/dist-packages/pip/util.py", line 697, in call_subprocess
  % (command_desc, proc.returncode, cwd))
InstallationError: Command /usr/bin/python -c "import setuptools, tokenize;__file__='/tmp/pip_build_root/lxml/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /tmp/pip-O4cIn6-record/install-record.txt --single-version-externally-managed --compile failed with error code 1 in /tmp/pip_build_root/lxml

The build cannot find the libxml2 headers. Install the following development packages, then run pip install lxml again:

sudo apt-get install libxml2-dev libxslt1-dev
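
Once the packages are installed and pip install lxml has been run again, a quick check from Python can confirm that the library is importable. This is a minimal sketch using only the standard library; has_module is a hypothetical helper name, not part of lxml or pip.

```python
import importlib.util


def has_module(name):
    """Return True if a top-level module called `name` can be found."""
    # find_spec returns None when no such top-level module is installed
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    print("lxml available:", has_module("lxml"))
```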

Python code

The following code generates a sitemap file and a sitemap index file. Pass in whatever parameters you need, or add further fields as required:

#!/usr/bin/env python
# -*- coding:utf-8 -*-

import io
import re
from xml.sax.saxutils import unescape
from lxml import etree


def generate_xml(filename, url_list):
  """Generate a new sitemap file from url_list."""
  root = etree.Element('urlset',
             xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
  for each in url_list:
    url = etree.Element('url')
    loc = etree.Element('loc')
    loc.text = each
    url.append(loc)
    root.append(url)

  header = u'<?xml version="1.0" encoding="UTF-8"?>\n'
  # encoding='unicode' returns text instead of bytes, so it can be
  # joined to the header on both Python 2 and Python 3
  s = etree.tostring(root, encoding='unicode', pretty_print=True)
  with io.open(filename, 'w', encoding='utf-8') as f:
    f.write(header + s)


def update_xml(filename, url_list):
  """Add url_list to the URLs already in an existing sitemap file."""
  with io.open(filename, 'r', encoding='utf-8') as f:
    lines = [i.strip() for i in f]

  old_url_list = []
  for each_line in lines:
    # accept http and https; non-greedy so each <loc> tag matches separately
    d = re.findall(r'<loc>(https?://.+?)</loc>', each_line)
    # undo XML escaping (&amp; etc.) so URLs are not escaped twice on rewrite
    old_url_list += [unescape(u) for u in d]

  # build a new list instead of modifying the caller's url_list
  generate_xml(filename, url_list + old_url_list)


def generate_xml_index(filename, sitemap_list, lastmod_list):
  """Generate a sitemap index xml file."""
  root = etree.Element('sitemapindex',
             xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
  for each_sitemap, each_lastmod in zip(sitemap_list, lastmod_list):
    sitemap = etree.Element('sitemap')
    loc = etree.Element('loc')
    loc.text = each_sitemap
    lastmod = etree.Element('lastmod')
    lastmod.text = each_lastmod
    sitemap.append(loc)
    sitemap.append(lastmod)
    root.append(sitemap)

  header = u'<?xml version="1.0" encoding="UTF-8"?>\n'
  s = etree.tostring(root, encoding='unicode', pretty_print=True)
  with io.open(filename, 'w', encoding='utf-8') as f:
    f.write(header + s)


if __name__ == '__main__':
  urls = ['http://www.baidu.com'] * 10
  mods = ['2004-10-01T18:23:17+00:00'] * 10
  generate_xml_index('index.xml', urls, mods)
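
As an alternative to passing xmlns as a keyword argument, the elements can be created in the sitemap namespace directly. The sketch below uses the standard library's xml.etree.ElementTree rather than lxml, so that it runs without extra packages; build_urlset and SITEMAP_NS are hypothetical names used only for this illustration.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_urlset(url_list):
    """Return a <urlset> document as text, with elements in the sitemap namespace."""
    # Serialize the sitemap namespace as the default one (no prefix)
    ET.register_namespace("", SITEMAP_NS)
    root = ET.Element("{%s}urlset" % SITEMAP_NS)
    for each in url_list:
        url = ET.SubElement(root, "{%s}url" % SITEMAP_NS)
        loc = ET.SubElement(url, "{%s}loc" % SITEMAP_NS)
        loc.text = each  # special characters such as & are escaped on output
    body = ET.tostring(root, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body
```

Because the serializer escapes the text for you, a URL containing & comes out as &amp; in the file, which keeps the sitemap well-formed.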

Result

The generated files should have the following formats:

sitemap format:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
 <url>
  <loc>http://www.example.com/foo.html</loc>
 </url>
</urlset>

sitemapindex format:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
 <sitemap>
  <loc>http://www.example.com/sitemap1.xml.gz</loc>
  <lastmod>2004-10-01T18:23:17+00:00</lastmod>
 </sitemap>
 <sitemap>
  <loc>http://www.example.com/sitemap2.xml.gz</loc>
  <lastmod>2005-01-01</lastmod>
 </sitemap>
</sitemapindex>
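
Both formats keep their addresses in loc elements, so they can be read back with a namespace-aware parser rather than a regular expression. A minimal sketch, assuming the standard library's xml.etree.ElementTree is acceptable in place of lxml; extract_locs is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def extract_locs(xml_text):
    """Return the text of every <loc> element in a sitemap or sitemap index."""
    # Parse from bytes so the encoding declaration in the header is honoured
    root = ET.fromstring(xml_text.encode("utf-8"))
    # iter() walks the whole tree, so this works for <urlset> and <sitemapindex>
    return [loc.text.strip() for loc in root.iter("{%s}loc" % SITEMAP_NS) if loc.text]
```

The parser also turns entities such as &amp; back into plain characters, so the returned URLs can be written out again without being escaped twice.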

lastmod time format

The lastmod value follows the ISO 8601 standard (the sitemap protocol specifies its W3C Datetime profile). On Linux/Unix, you can use the following function to get a file's modification time in this format:

import os
import time


def get_lastmod_time(filename):
  time_stamp = os.path.getmtime(filename)
  # 'Z' means UTC, so convert with gmtime rather than localtime
  t = time.gmtime(time_stamp)
  # for UTC+8 local time, use time.localtime and '%Y-%m-%dT%H:%M:%S+08:00'
  return time.strftime('%Y-%m-%dT%H:%M:%SZ', t)
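
The same value can also be produced with the datetime module, which writes the UTC offset explicitly, matching the +00:00 form used in the example data earlier. This is a sketch rather than the article's own method; format_lastmod and lastmod_for_file are hypothetical names.

```python
import os
from datetime import datetime, timezone


def format_lastmod(time_stamp):
    """Format a POSIX timestamp as a W3C Datetime string in UTC."""
    moment = datetime.fromtimestamp(time_stamp, tz=timezone.utc)
    # Drop fractional seconds so the output has whole seconds only
    return moment.replace(microsecond=0).isoformat()


def lastmod_for_file(filename):
    """Return the lastmod string for a file's modification time."""
    return format_lastmod(os.path.getmtime(filename))
```

Using an aware datetime avoids hard-coding a suffix such as Z or +08:00 that may not match the time actually being formatted.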

Optimization

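Generally speaking, building the whole document with lxml before writing it is slow and memory-hungry for large sitemaps, because every element is held in memory at once. You can instead write the XML directly with the file's write method.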

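import glob
import gzip
import os
import re
import time
from os.path import dirname, join
from xml.sax.saxutils import escape, unescape


def generate_xml(filename, url_list):
  # Python 3: "wt" opens the gzip file in text mode
  with gzip.open(filename, "wt", encoding="utf-8") as f:
    f.write("""<?xml version="1.0" encoding="utf-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n""")
    for i in url_list:
      # escape &, < and > so the output stays well-formed XML
      f.write("""<url><loc>%s</loc></url>\n""" % escape(i))
    f.write("""</urlset>""")


def append_xml(filename, url_list):
  with gzip.open(filename, "rt", encoding="utf-8") as f:
    for each_line in f:
      d = re.findall(r'<loc>(https?://.+?)</loc>', each_line)
      url_list.extend(unescape(u) for u in d)

  # rewrite only after the read handle is closed; set() drops duplicates
  generate_xml(filename, set(url_list))


def modify_time(filename):
  time_stamp = os.path.getmtime(filename)
  # 'Z' means UTC, so convert with gmtime rather than localtime
  t = time.gmtime(time_stamp)
  return time.strftime('%Y-%m-%dT%H:%M:%SZ', t)


def new_xml(filename, url_list):
  generate_xml(filename, url_list)
  root = dirname(filename)

  # the index is written one directory above the sitemap files
  with open(join(dirname(root), "sitemap.xml"), "w") as f:
    f.write('<?xml version="1.0" encoding="utf-8"?>\n<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
    for i in glob.glob(join(root, "*.xml.gz")):
      lastmod = modify_time(i)
      # CONFIG.SITEMAP_PATH comes from the author's own configuration and is
      # not defined in this article; it appears to be the local path prefix
      # that is stripped to turn the file path into a URL path
      i = i[len(CONFIG.SITEMAP_PATH):]
      f.write("<sitemap>\n<loc>http:/%s</loc>\n" % i)
      f.write("<lastmod>%s</lastmod>\n</sitemap>\n" % lastmod)
    f.write('</sitemapindex>')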
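
new_xml builds an index over every *.xml.gz file in a directory, which is useful because the sitemap protocol limits a single sitemap file to 50,000 URLs. A sketch of how a long URL list might be split before each part is passed to generate_xml; chunk_urls, plan_sitemaps and the sitemap%d.xml.gz naming pattern are hypothetical and not from the original article.

```python
MAX_URLS_PER_SITEMAP = 50000  # per-file URL limit in the sitemap protocol


def chunk_urls(url_list, size=MAX_URLS_PER_SITEMAP):
    """Yield successive slices of url_list holding at most `size` items."""
    for start in range(0, len(url_list), size):
        yield url_list[start:start + size]


def plan_sitemaps(url_list, pattern="sitemap%d.xml.gz", size=MAX_URLS_PER_SITEMAP):
    """Pair each chunk with a numbered file name, starting from 1."""
    return [(pattern % number, chunk)
            for number, chunk in enumerate(chunk_urls(url_list, size), 1)]
```

Each (file name, chunk) pair can then be written with generate_xml, after which the index is rebuilt over the resulting files.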

Summary

That is all for this article. I hope it is useful to anyone learning or using Python. If you have any questions, feel free to leave a comment.

source:php.cn