Verwenden Sie Python, um Artikel von der Prosa-Website zu crawlen-Python-Tutorial-php.cn

image.png

Python 2.7 konfigurieren

<code>    bs4

    requests</code>

Nach dem Login kopieren

Installation mit pip sudo pip install bs4

Sudo Pip-Installationsanfragen

Erklären Sie kurz die Verwendung von bs4, da es Webseiten crawlt, daher werde ich find und find_all vorstellen

Der Unterschied zwischen find und find_all besteht darin, dass das, was zurückgegeben wird, unterschiedlich ist. Das erste übereinstimmende Tag und der Inhalt im Tag sind unterschiedlich

find_all gibt eine Liste zurück

Zum Beispiel schreiben wir eine test.html, um den Unterschied zwischen find und find_all zu testen. Der Inhalt ist:

<html>
<head>
</head>
<body>
<div id="one"><a></a></div>
<div id="two"><a href="#">abc</a></div>
<div id="three"><a href="#">three a</a><a href="#">three a</a><a href="#">three a</a></div>
<div id="four"><a href="#">four<p>four p</p><p>four p</p><p>four p</p> a</a></div>
</body>
</html>

Nach dem Login kopieren

<code class="xml"><span class="hljs-tag"> </span></code>

Nach dem Login kopieren

Dann lautet der Code von test.py:

from bs4 import BeautifulSoup
import lxml

if __name__=='__main__':
  s = BeautifulSoup(open('test.html'),'lxml')
  print s.prettify()
  print "------------------------------"
  print s.find('div')
  print s.find_all('div')
  print "------------------------------"
  print s.find('div',id='one')
  print s.find_all('div',id='one')
  print "------------------------------"
  print s.find('div',id="two")
  print s.find_all('div',id="two")
  print "------------------------------"
  print s.find('div',id="three")
  print s.find_all('div',id="three")
  print "------------------------------"
  print s.find('div',id="four")
  print s.find_all('div',id="four")
  print "------------------------------"

Nach dem Login kopieren

<code class="python"><span class="hljs-keyword"> </span></code>

Nach dem Login kopieren

Nach dem Ausführen können wir das Ergebnis sehen. Beim Abrufen des angegebenen Tags gibt es keinen großen Unterschied zwischen den beiden. Beim Abrufen einer Gruppe von Tags wird der Unterschied zwischen den beiden angezeigt

image.png

Wir müssen also bei der Verwendung darauf achten, was wir brauchen, sonst tritt ein Fehler auf.

Der nächste Schritt besteht darin, Webseiteninformationen durch Anfragen zu erhalten. Ich verstehe nicht ganz, warum andere Dinge schreiben

Ich greife direkt auf die Webseite zu, rufe über die get-Methode die Webseiten der zweiten Ebene mehrerer Kategorien auf prose.com ab und übergebe dann einen Gruppentest, um alle Webseiten zu crawlen

<span style="color: #0000ff">def</span><span style="color: #000000"> get_html():
  url </span>= <span style="color: #800000">"</span><span style="color: #800000"></span><span style="color: #800000">"</span><span style="color: #000000">
  two_html </span>= [<span style="color: #800000">'</span><span style="color: #800000">sanwen</span><span style="color: #800000">'</span>,<span style="color: #800000">'</span><span style="color: #800000">shige</span><span style="color: #800000">'</span>,<span style="color: #800000">'</span><span style="color: #800000">zawen</span><span style="color: #800000">'</span>,<span style="color: #800000">'</span><span style="color: #800000">suibi</span><span style="color: #800000">'</span>,<span style="color: #800000">'</span><span style="color: #800000">rizhi</span><span style="color: #800000">'</span>,<span style="color: #800000">'</span><span style="color: #800000">novel</span><span style="color: #800000">'</span><span style="color: #000000">]
  </span><span style="color: #0000ff">for</span> doc <span style="color: #0000ff">in</span><span style="color: #000000"> two_html:
      i</span>=1
          <span style="color: #0000ff">if</span> doc==<span style="color: #800000">'</span><span style="color: #800000">sanwen</span><span style="color: #800000">'</span><span style="color: #000000">:
            </span><span style="color: #0000ff">print</span> <span style="color: #800000">"</span><span style="color: #800000">running sanwen -----------------------------</span><span style="color: #800000">"</span>
          <span style="color: #0000ff">if</span> doc==<span style="color: #800000">'</span><span style="color: #800000">shige</span><span style="color: #800000">'</span><span style="color: #000000">:
            </span><span style="color: #0000ff">print</span> <span style="color: #800000">"</span><span style="color: #800000">running shige ------------------------------</span><span style="color: #800000">"</span>
          <span style="color: #0000ff">if</span> doc==<span style="color: #800000">'</span><span style="color: #800000">zawen</span><span style="color: #800000">'</span><span style="color: #000000">:
            </span><span style="color: #0000ff">print</span> <span style="color: #800000">'</span><span style="color: #800000">running zawen -------------------------------</span><span style="color: #800000">'</span>
          <span style="color: #0000ff">if</span> doc==<span style="color: #800000">'</span><span style="color: #800000">suibi</span><span style="color: #800000">'</span><span style="color: #000000">:
            </span><span style="color: #0000ff">print</span> <span style="color: #800000">'</span><span style="color: #800000">running suibi -------------------------------</span><span style="color: #800000">'</span>
          <span style="color: #0000ff">if</span> doc==<span style="color: #800000">'</span><span style="color: #800000">rizhi</span><span style="color: #800000">'</span><span style="color: #000000">:
            </span><span style="color: #0000ff">print</span> <span style="color: #800000">'</span><span style="color: #800000">running ruzhi -------------------------------</span><span style="color: #800000">'</span>
          <span style="color: #0000ff">if</span> doc==<span style="color: #800000">'</span><span style="color: #800000">nove</span><span style="color: #800000">'</span><span style="color: #000000">:
            </span><span style="color: #0000ff">print</span> <span style="color: #800000">'</span><span style="color: #800000">running xiaoxiaoshuo -------------------------</span><span style="color: #800000">'</span>
      <span style="color: #0000ff">while</span>(i<10<span style="color: #000000">):
        par </span>= {<span style="color: #800000">'</span><span style="color: #800000">p</span><span style="color: #800000">'</span><span style="color: #000000">:i}
        res </span>= requests.get(url+doc+<span style="color: #800000">'</span><span style="color: #800000">/</span><span style="color: #800000">'</span>,params=<span style="color: #000000">par)
        </span><span style="color: #0000ff">if</span> res.status_code==200<span style="color: #000000">:
          soup(res.text)
              i</span>+=i

Nach dem Login kopieren

In diesem Teil des Codes habe ich den res.status_code, der nicht 200 ist, nicht verarbeitet. Das daraus resultierende Problem besteht darin, dass Fehler nicht angezeigt werden und der gecrawlte Inhalt verloren geht. Dann habe ich die Webseite von Sanwen.net analysiert und festgestellt, dass es sich um www.sanwen.net/rizhi/&p=1 handelt.

<code class="python"><span class="hljs-function"><span class="hljs-keyword"> </span></span></code>

Nach dem Login kopieren

Der Maximalwert von p beträgt 10. Ich verstehe nicht. Das letzte Mal, als ich die Festplatte gecrawlt habe. Es waren 100 Seiten. Ich werde es später analysieren. Rufen Sie dann den Inhalt jeder Seite über die get-Methode ab.

Nach Erhalt des Inhalts jeder Seite werden der Autor und der Titel analysiert. Der Code lautet wie folgt

<span style="color: #0000ff">def</span><span style="color: #000000"> soup(html_text):
  s </span>= BeautifulSoup(html_text,<span style="color: #800000">'</span><span style="color: #800000">lxml</span><span style="color: #800000">'</span><span style="color: #000000">)
  link </span>= s.find(<span style="color: #800000">'</span><span style="color: #800000">div</span><span style="color: #800000">'</span>,class_=<span style="color: #800000">'</span><span style="color: #800000">categorylist</span><span style="color: #800000">'</span>).find_all(<span style="color: #800000">'</span><span style="color: #800000">li</span><span style="color: #800000">'</span><span style="color: #000000">)
  </span><span style="color: #0000ff">for</span> i <span style="color: #0000ff">in</span><span style="color: #000000"> link:
    </span><span style="color: #0000ff">if</span> i!=s.find(<span style="color: #800000">'</span><span style="color: #800000">li</span><span style="color: #800000">'</span>,class_=<span style="color: #800000">'</span><span style="color: #800000">page</span><span style="color: #800000">'</span><span style="color: #000000">):
      title </span>= i.find_all(<span style="color: #800000">'</span><span style="color: #800000">a</span><span style="color: #800000">'</span>)[1<span style="color: #000000">]
      author </span>= i.find_all(<span style="color: #800000">'</span><span style="color: #800000">a</span><span style="color: #800000">'</span>)[2<span style="color: #000000">].text
      url </span>= title.attrs[<span style="color: #800000">'</span><span style="color: #800000">href</span><span style="color: #800000">'</span><span style="color: #000000">]
      sign </span>= re.compile(r<span style="color: #800000">'</span><span style="color: #800000">(//)|/</span><span style="color: #800000">'</span><span style="color: #000000">)
      match </span>=<span style="color: #000000"> sign.search(title.text)
      file_name </span>=<span style="color: #000000"> title.text
      </span><span style="color: #0000ff">if</span><span style="color: #000000"> match:
        file_name </span>= sign.sub(<span style="color: #800000">'</span><span style="color: #800000">a</span><span style="color: #800000">'</span>,str(title.text))

Nach dem Login kopieren

Beim Abrufen der Übersetzung ist ein Problem aufgetreten. Leute, warum fügt ihr beim Schreiben der Prosa nicht nur einen, sondern auch zwei Schrägstriche ein? Schreiben Sie also einen regulären Ausdruck und ich werde ihn für Sie ändern.

<code class="python"><span class="hljs-function"><span class="hljs-keyword"> </span></span></code>

Nach dem Login kopieren

Der letzte Schritt besteht darin, den Prosainhalt zu erhalten. Durch die Analyse jeder Seite wird die Artikeladresse ermittelt, und dann wird der Inhalt direkt durch Ändern der Webseite ermittelt Adresse, was Ärger erspart.

<span style="color: #0000ff">def</span><span style="color: #000000"> get_content(url):
  res </span>= requests.get(<span style="color: #800000">'</span><span style="color: #800000"></span><span style="color: #800000">'</span>+<span style="color: #000000">url)
  </span><span style="color: #0000ff">if</span> res.status_code==200<span style="color: #000000">:
    soup </span>= BeautifulSoup(res.text,<span style="color: #800000">'</span><span style="color: #800000">lxml</span><span style="color: #800000">'</span><span style="color: #000000">)
    contents </span>= soup.find(<span style="color: #800000">'</span><span style="color: #800000">div</span><span style="color: #800000">'</span>,class_=<span style="color: #800000">'</span><span style="color: #800000">content</span><span style="color: #800000">'</span>).find_all(<span style="color: #800000">'</span><span style="color: #800000">p</span><span style="color: #800000">'</span><span style="color: #000000">)
    content </span>= <span style="color: #800000">''</span>
    <span style="color: #0000ff">for</span> i <span style="color: #0000ff">in</span><span style="color: #000000"> contents:
      content</span>+=i.text+<span style="color: #800000">'</span><span style="color: #800000">\n</span><span style="color: #800000">'</span>
    <span style="color: #0000ff">return</span> content

Nach dem Login kopieren

Als letztes schreiben Sie die Datei und speichern sie in Ordnung

<code class="python"><span class="hljs-function"><span class="hljs-keyword"> </span></span></code>

Nach dem Login kopieren

   f = open(file_name+<span style="color: #800000">'</span><span style="color: #800000">.txt</span><span style="color: #800000">'</span>,<span style="color: #800000">'</span><span style="color: #800000">w</span><span style="color: #800000">'</span><span style="color: #000000">)

      </span><span style="color: #0000ff">print</span> <span style="color: #800000">'</span><span style="color: #800000">running w txt</span><span style="color: #800000">'</span>+file_name+<span style="color: #800000">'</span><span style="color: #800000">.txt</span><span style="color: #800000">'</span><span style="color: #000000">
      f.write(title.text</span>+<span style="color: #800000">'</span><span style="color: #800000">\n</span><span style="color: #800000">'</span><span style="color: #000000">)
      f.write(author</span>+<span style="color: #800000">'</span><span style="color: #800000">\n</span><span style="color: #800000">'</span><span style="color: #000000">)
      content</span>=<span style="color: #000000">get_content(url)     
      f.write(content)
      f.close()</span>

Nach dem Login kopieren

Die drei Funktionen erhalten die Prosa von Prose.com, aber es gibt ein Problem. Das Problem ist, dass ich nicht weiß, warum einige Prosa verloren gehen, was sich stark von den Artikeln unterscheidet von Prose.com, aber es ist wahr. Ich hoffe, jemand kann mir bei dieser Frage helfen. Vielleicht sollten wir die Webseite unzugänglich machen. Natürlich denke ich, dass es etwas mit dem kaputten Netzwerk in meinem Wohnheim zu tun hat

     f = open(file_name+<span style="color: #800000">'</span><span style="color: #800000">.txt</span><span style="color: #800000">'</span>,<span style="color: #800000">'</span><span style="color: #800000">w</span><span style="color: #800000">'</span><span style="color: #000000">)
      </span><span style="color: #0000ff">print</span> <span style="color: #800000">'</span><span style="color: #800000">running w txt</span><span style="color: #800000">'</span>+file_name+<span style="color: #800000">'</span><span style="color: #800000">.txt</span><span style="color: #800000">'</span><span style="color: #000000">
      f.write(title.text</span>+<span style="color: #800000">'</span><span style="color: #800000">\n</span><span style="color: #800000">'</span><span style="color: #000000">)
      f.write(author</span>+<span style="color: #800000">'</span><span style="color: #800000">\n</span><span style="color: #800000">'</span><span style="color: #000000">)
      content</span>=<span style="color: #000000">get_content(url)     
      f.write(content)
      f.close()</span>

Nach dem Login kopieren

Ich hätte die Renderings fast vergessen

Obwohl der Code chaotisch ist, höre ich nie auf

Das obige ist der detaillierte Inhalt vonVerwenden Sie Python, um Artikel von der Prosa-Website zu crawlen. Für weitere Informationen folgen Sie bitte anderen verwandten Artikeln auf der PHP chinesischen Website!