Simple implementation of sorting and deduplication of large files-Linux Operation and Maintenance-php.cn

Home

Operation and Maintenance

Linux Operation and Maintenance

Simple implementation of sorting and deduplication of large files

巴扎黑

Sep 04, 2017 pm 02:28 PM

accomplish Simple

有一道校招生的面试题，是要给一个很大的文件（不能全部放内存，比如1T）按行来排序和去重。

一种简单解决方案就是分而治之，先打大文件分词大小均匀的若干个小文件，然后对小文件排好序，最后再Merge所有的小文件，在Merge的过程中去掉重复的内容。

在Linux下实现这个逻辑甚至不用自己写代码，只要用shell内置的一些命令: split, sort就足够了。我们把这个流程用脚本串起来，写到shell脚本文件里。文件名叫sort_uniq.sh.

#!/bin/bash
lines=$(wc -l $1 | sed &#39;s/ .*//g&#39;)
lines_per_file=`expr $lines / 20`
split -d -l $lines_per_file $1 __part_$1
for file in __part_*
do
{
  sort $file > sort_$file
} &
done
wait
sort -smu sort_* > $2
rm -f __part_*
rm -f sort_*

Copy after login

使用方法：./sort_uniq.sh file_to_be_sort file_sorted

这段代码把大文件分词20或21个小文件，后台并行排序各个小文件，最后合并结果并去重。

如果只要去重，不需要排序，还有另外一种思路：对文件的每一行计算hash值，按照hash值把该行内容放到某个小文件中，假设需要分词100个小文件，则可以按照（hash % 100）来分发文件内容，然后在小文件中实现去重就可以了。

The above is the detailed content of Simple implementation of sorting and deduplication of large files. For more information, please follow other related articles on the PHP Chinese website!

Statement of this Website

The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn

Hot AI Tools

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress images for free

Clothoff.io

AI clothes remover

AI Hentai Generator

Generate AI Hentai for free.

Hot Article

R.E.P.O. Energy Crystals Explained and What They Do (Yellow Crystal)

1 months ago By 尊渡假赌尊渡假赌尊渡假赌

R.E.P.O. Best Graphic Settings

1 months ago By 尊渡假赌尊渡假赌尊渡假赌

Assassin's Creed Shadows: Seashell Riddle Solution

3 weeks ago By DDD

What's New in Windows 11 KB5054979 & How to Fix Update Issues

2 weeks ago By DDD

Will R.E.P.O. Have Crossplay?

1 months ago By 尊渡假赌尊渡假赌尊渡假赌

Hot Tools

Notepad++7.3.1

Easy-to-use and free code editor

SublimeText3 Chinese version

Chinese version, very easy to use

Zend Studio 13.0.1

Powerful PHP integrated development environment

Dreamweaver CS6

Visual web development tools

SublimeText3 Mac version

God-level code editing software (SublimeText3)

Hot Topics

Where is the login entrance for gmail email?

7560

CakePHP Tutorial

1384

What is the format of the account name of steam

win11 activation key permanent

nyt connections hints and answers

Related knowledge

The easiest way to query the hard drive serial number Feb 26, 2024 pm 02:24 PM

The hard disk serial number is an important identifier of the hard disk and is usually used to uniquely identify the hard disk and identify the hardware. In some cases, we may need to query the hard drive serial number, such as when installing an operating system, finding the correct device driver, or performing hard drive repairs. This article will introduce some simple methods to help you check the hard drive serial number. Method 1: Use Windows Command Prompt to open the command prompt. In Windows system, press Win+R keys, enter "cmd" and press Enter key to open the command

How to implement dual WeChat login on Huawei mobile phones? Mar 24, 2024 am 11:27 AM

How to implement dual WeChat login on Huawei mobile phones? With the rise of social media, WeChat has become one of the indispensable communication tools in people's daily lives. However, many people may encounter a problem: logging into multiple WeChat accounts at the same time on the same mobile phone. For Huawei mobile phone users, it is not difficult to achieve dual WeChat login. This article will introduce how to achieve dual WeChat login on Huawei mobile phones. First of all, the EMUI system that comes with Huawei mobile phones provides a very convenient function - dual application opening. Through the application dual opening function, users can simultaneously

Use Java to write code to implement love animation Dec 23, 2023 pm 12:09 PM

Realizing love animation effects through Java code In the field of programming, animation effects are very common and popular. Various animation effects can be achieved through Java code, one of which is the heart animation effect. This article will introduce how to use Java code to achieve this effect and give specific code examples. The key to realizing the heart animation effect is to draw the heart-shaped pattern and achieve the animation effect by changing the position and color of the heart shape. Here is the code for a simple example: importjavax.swing.

PHP Programming Guide: Methods to Implement Fibonacci Sequence Mar 20, 2024 pm 04:54 PM

The programming language PHP is a powerful tool for web development, capable of supporting a variety of different programming logics and algorithms. Among them, implementing the Fibonacci sequence is a common and classic programming problem. In this article, we will introduce how to use the PHP programming language to implement the Fibonacci sequence, and attach specific code examples. The Fibonacci sequence is a mathematical sequence defined as follows: the first and second elements of the sequence are 1, and starting from the third element, the value of each element is equal to the sum of the previous two elements. The first few elements of the sequence

How to implement the WeChat clone function on Huawei mobile phones Mar 24, 2024 pm 06:03 PM

How to implement the WeChat clone function on Huawei mobile phones With the popularity of social software and people's increasing emphasis on privacy and security, the WeChat clone function has gradually become the focus of people's attention. The WeChat clone function can help users log in to multiple WeChat accounts on the same mobile phone at the same time, making it easier to manage and use. It is not difficult to implement the WeChat clone function on Huawei mobile phones. You only need to follow the following steps. Step 1: Make sure that the mobile phone system version and WeChat version meet the requirements. First, make sure that your Huawei mobile phone system version has been updated to the latest version, as well as the WeChat App.

Master how Golang enables game development possibilities Mar 16, 2024 pm 12:57 PM

In today's software development field, Golang (Go language), as an efficient, concise and highly concurrency programming language, is increasingly favored by developers. Its rich standard library and efficient concurrency features make it a high-profile choice in the field of game development. This article will explore how to use Golang for game development and demonstrate its powerful possibilities through specific code examples. 1. Golang’s advantages in game development. As a statically typed language, Golang is used in building large-scale game systems.

PHP Game Requirements Implementation Guide Mar 11, 2024 am 08:45 AM

PHP Game Requirements Implementation Guide With the popularity and development of the Internet, the web game market is becoming more and more popular. Many developers hope to use the PHP language to develop their own web games, and implementing game requirements is a key step. This article will introduce how to use PHP language to implement common game requirements and provide specific code examples. 1. Create game characters In web games, game characters are a very important element. We need to define the attributes of the game character, such as name, level, experience value, etc., and provide methods to operate these

How to implement exact division operation in Golang Feb 20, 2024 pm 10:51 PM

Implementing exact division operations in Golang is a common need, especially in scenarios involving financial calculations or other scenarios that require high-precision calculations. Golang's built-in division operator "/" is calculated for floating point numbers, and sometimes there is a problem of precision loss. In order to solve this problem, we can use third-party libraries or custom functions to implement exact division operations. A common approach is to use the Rat type from the math/big package, which provides a representation of fractions and can be used to implement exact division operations.

See all articles