Importing CSV Files into a Database with Python Multiprocessing

Y2J

May 06, 2017 pm 02:54 PM

csv mysql python multiprocessing

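This article shares an approach to importing CSV file data into MySQL using multiple Python processes, together with the complete code. It may be useful if you have a similar requirement.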

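A while ago I helped a colleague import CSV data into MySQL. There were two large CSV files: one of 3 GB with 21 million records, and one of 7 GB with 35 million records. At this scale a simple single-process, single-threaded import takes a very long time, so I ended up using multiple processes. I won't go through the whole process, but here are the key points: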

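Insert in batches rather than row by row
To speed up the inserts, do not create indexes before loading
Use a producer/consumer model: the main process reads the file and several worker processes perform the inserts
Limit the number of workers to avoid putting too much pressure on MySQL
Handle the exceptions caused by dirty data
The source data is GBK-encoded, so it also has to be converted to UTF-8
Wrap everything in a command-line tool with click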
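
As a rough illustration of the batching and dirty-data points above, the following sketch groups parsed CSV rows into fixed-size batches, sets aside rows whose field count is wrong, and maps the literal string `NULL` to `None`. The names `clean_batches`, `expected_cols`, `batch_size`, and `dirty_rows` are hypothetical and do not appear in the script; this is a minimal database-free sketch, not the script's own implementation.

```python
import csv
import io


def clean_batches(reader, expected_cols, batch_size, dirty_rows):
    """Yield lists of cleaned row tuples, at most batch_size rows each.

    Rows whose field count differs from expected_cols are appended to
    dirty_rows and skipped. The string 'NULL' is converted to None.
    """
    batch = []
    for row in reader:
        if len(row) != expected_cols:
            dirty_rows.append(row)
            continue
        batch.append(tuple(None if x == 'NULL' else x for x in row))
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Flush the final, possibly smaller, batch.
        yield batch


# Small in-memory example: 3 expected columns, batches of 2 rows.
sample = io.StringIO("1,a,NULL\n2,b\n3,c,x\n4,d,y\n")
dirty = []
batches = list(clean_batches(csv.reader(sample), 3, 2, dirty))
```

Each yielded batch could then be handed to a single multi-row insert, so the database sees one statement per batch instead of one per row.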

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import codecs
import csv
import logging
import multiprocessing
import os
import warnings

import click
import MySQLdb
import sqlalchemy

warnings.filterwarnings('ignore', category=MySQLdb.Warning)

# Number of rows per batch insert
BATCH = 5000

DB_URI = 'mysql://root@localhost:3306/example?charset=utf8'

engine = sqlalchemy.create_engine(DB_URI)


def get_table_cols(table):
  sql = 'SELECT * FROM `{table}` LIMIT 0'.format(table=table)
  # Engine.execute() is the legacy API; it was removed in SQLAlchemy 2.0
  res = engine.execute(sql)
  return list(res.keys())


def insert_many(table, cols, rows, engine):
  sql = 'INSERT INTO `{table}` ({cols}) VALUES ({marks})'.format(
      table=table,
      cols=', '.join(cols),
      marks=', '.join(['%s'] * len(cols)))
  # Passing several parameter tuples makes the legacy Engine.execute()
  # run the statement as an executemany
  engine.execute(sql, *rows)
  logging.info('process %s inserted %s rows into table %s', os.getpid(), len(rows), table)


def insert_worker(table, cols, queue):
  rows = []
  # Each child process creates its own engine object
  engine = sqlalchemy.create_engine(DB_URI)
  while True:
    row = queue.get()
    if row is None:
      # End-of-work signal: flush the remaining rows and exit
      if rows:
        insert_many(table, cols, rows, engine)
      break

    rows.append(row)
    if len(rows) == BATCH:
      insert_many(table, cols, rows, engine)
      rows = []


def insert_parallel(table, reader, w=10):
  cols = get_table_cols(table)

  # Data queue: the main process reads the file and puts rows on it,
  # and the worker processes take rows off it.
  # Bound the queue size so that slow consumers cannot cause rows to
  # pile up and use too much memory.
  queue = multiprocessing.Queue(maxsize=w*BATCH*2)
  workers = []
  for i in range(w):
    p = multiprocessing.Process(target=insert_worker, args=(table, cols, queue))
    p.start()
    workers.append(p)
    logging.info('starting # %s worker process, pid: %s...', i + 1, p.pid)

  dirty_data_file = './{}_dirty_rows.csv'.format(table)
  xf = open(dirty_data_file, 'w')
  writer = csv.writer(xf, delimiter=reader.dialect.delimiter)

  for line in reader:
    # Record and skip dirty rows: field count does not match the column count
    if len(line) != len(cols):
      writer.writerow(line)
      continue

    # Replace the string 'NULL' with None
    clean_line = [None if x == 'NULL' else x for x in line]

    # Put the row on the queue
    queue.put(tuple(clean_line))
    if reader.line_num % 500000 == 0:
      logging.info('put %s tasks into queue.', reader.line_num)

  xf.close()

  # Send every worker the end-of-work signal
  logging.info('send close signal to worker processes')
  for i in range(w):
    queue.put(None)

  for p in workers:
    p.join()


def convert_file_to_utf8(f, rv_file=None):
  if not rv_file:
    name, ext = os.path.splitext(f)
    rv_file = name + '_utf8' + ext
  logging.info('start to process file %s', f)
  # Work on raw bytes so the same code runs on Python 2 and Python 3
  with open(f, 'rb') as infd:
    with open(rv_file, 'wb') as outfd:
      lines = []
      loop = 0
      chunk = 200000
      # The header line is not re-encoded; only a UTF-8 BOM and
      # surrounding whitespace are stripped
      first_line = infd.readline().strip(codecs.BOM_UTF8).strip() + b'\n'
      lines.append(first_line)
      for line in infd:
        # Decode as GB18030 (a superset of GBK) and re-encode as UTF-8
        clean_line = line.decode('gb18030').encode('utf8')
        clean_line = clean_line.rstrip() + b'\n'
        lines.append(clean_line)
        if len(lines) == chunk:
          outfd.writelines(lines)
          lines = []
          loop += 1
          logging.info('processed %s lines.', loop * chunk)

      outfd.writelines(lines)
      logging.info('processed %s lines.', loop * chunk + len(lines))


@click.group()
def cli():
  logging.basicConfig(level=logging.INFO,
            format='%(asctime)s - %(levelname)s - %(name)s - %(message)s')


@cli.command('gbk_to_utf8')
@click.argument('f')
def convert_gbk_to_utf8(f):
  convert_file_to_utf8(f)


@cli.command('load')
@click.option('-t', '--table', required=True, help='table name')
@click.option('-i', '--filename', required=True, help='input file')
@click.option('-w', '--workers', default=10, help='number of workers, default 10')
def load_fac_day_pro_nos_sal_table(table, filename, workers):
  with open(filename) as fd:
    fd.readline()  # skip header
    reader = csv.reader(fd)
    insert_parallel(table, reader, w=workers)


if __name__ == '__main__':
  cli()
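
On Python 3, the re-encoding step performed by `convert_file_to_utf8` could alternatively be written with text-mode files, letting `open` do the decoding and encoding. The sketch below assumes the whole input, header included, is valid GB18030. The function name `reencode_file` and its parameters are hypothetical, and the temporary file exists only for the demonstration.

```python
import os
import tempfile


def reencode_file(src, dst, src_encoding='gb18030', dst_encoding='utf-8'):
    """Stream src to dst line by line, converting the text encoding.

    Trailing whitespace is trimmed and each line ends with a single
    line feed. Returns the number of lines written.
    """
    count = 0
    # newline='' turns off newline translation, so a trailing '\r' is
    # removed by rstrip() and the output always uses '\n'.
    with open(src, encoding=src_encoding, newline='') as infd, \
         open(dst, 'w', encoding=dst_encoding, newline='') as outfd:
        for line in infd:
            outfd.write(line.rstrip() + '\n')
            count += 1
    return count


# Round-trip demo with a temporary GBK-encoded file.
tmpdir = tempfile.mkdtemp()
src_path = os.path.join(tmpdir, 'sample.csv')
dst_path = os.path.join(tmpdir, 'sample_utf8.csv')
with open(src_path, 'wb') as fd:
    fd.write('编号,名称\r\n1,北京\r\n'.encode('gbk'))

written = reencode_file(src_path, dst_path)
with open(dst_path, 'rb') as fd:
    converted = fd.read().decode('utf-8')
```

Because the file is streamed one line at a time, memory use stays flat regardless of file size, which matters for multi-gigabyte inputs like the ones described above.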
