Table of Contents
sudo add-apt-repository ppa:hsoft/ppasudo apt-get updatesudo apt-get install dupeguru*
Copy after login

方法三:使用Find命令解析" >
sudo add-apt-repository ppa:hsoft/ppasudo apt-get updatesudo apt-get install dupeguru*
Copy after login

方法三:使用Find命令解析

Find duplicate files using Linux

Aug 03, 2023 pm 03:51 PM
linux



Find duplicate files using Linux
#Method 1: Use the Find command

This section is an extended usage description of the powerful function of find. On the basis of find, we can combine it with other basic Linux commands (such as the xargs command) to create unlimited command line functions. For example, we can quickly find files in a Linux folder and its subfolders. List of duplicate files. The process to implement this function is relatively simple. Just search and traverse all the files, and then use the command to compare the MD5 of each file.

It sounds abstract, but in fact there is only one command:

find -not -empty -type f -printf "%s\n" | sort -rn | uniq -d | xargs -I{} -n1 find -type f -size {}c -print0 | xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate
Copy after login

  • find -not -empty -type f -printf "%sn" means use find The command searches out all non-empty files and prints out their sizes.

  • sort -rn Needless to say, this command is based on file size. Reverse sort

  • ##uniq -d | xargs -I{} -n1 find -type f -size {}c -print0 means only duplicate lines are printed , used here means to print out files with the same file name

  • ##uniq -w32 –all-repeated=separate Finally, this means to print out the first 32 bytes of MD5 In contrast, the entire process of filtering out duplicate files using the command line is so simple and easy.
  • Method 2: Use dupeGuru tool
DupeGuru is a cross-platform application, available in Linux, Windows and Mac OS X versions. It can pass file size, Various standards such as MD5 and file names are used to help users find duplicate files in Linux. Ubuntu users can install it directly by adding the following PPA source:

sudo add-apt-repository ppa:hsoft/ppasudo apt-get updatesudo apt-get install dupeguru*
Copy after login

方法三:使用Find命令解析

在工作生活当中,我们很可能会遇到查找重复文件的问题。比如从某游戏提取的游戏文本有重复的,我们希望找出所有重复的文本,让翻译只翻译其中一份,而其他的直接替换。那么这个问题该怎么做呢?当然方法多种多样,而且无论那种方法应该都不会太难,但笔者第一次遇到这个问题的时候第一反应是是用Linux的Shell脚本,所以文本介绍这种方式。

先上代码:

find -not -empty -type f -printf "%sn" | sort -rn |uniq -d | xargs -I{} -n1 find -type f -size {}c -print0 | xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate | cut -b 36-
Copy after login

大家先cd到自己想要查找重复文件的文件夹,然后copy上面代码就可以了,系统会对当前文件夹及子文件夹内的所有文件进行查重。

下面分析一下上面的命令。

首先看第一句:

find -not -empty -type f -printf "%sn"
Copy after login

find是查找命令;-not -empty是要寻找非空文件;-type f是指寻找常规文件;-printf “%sn”比较具有迷惑性,这里的%s并非C语言中的输出字符串,它实际表示的是文件的大小,单位为bytes(不懂就man,man一下find,就可以看到了),n是换行符。所以这句话的意思是输出所有非空文件的大小。

搜索公众号GitHub猿后台回复“UML”,获取一份惊喜礼包。

通过管道,上面的结果被传到第二句:

sort -rn
Copy after login

sort是排序,-n是指按大小排序,-r是指从大到小排序(逆序reverse)。

第三句:

uniq -d
Copy after login

uniq是把重复的只输出一次,而-d指只输出重复的部分(如9出现了5次,那么就输出1个9,而2只出现了1次,并非重复出现的数字,故不输出)。

第四句:

xargs -I{} -n1 find -type f -size {}c -print0
Copy after login

这一部分分两部分看,第一部分是xargs -I{} -n1,xargs命令将之前的结果转化为参数,供后面的find调用,其中-I{}是指把参数写成{},而-n1是指将之前的结果一个一个输入给下一个命令(-n8就是8个8个输入给下一句,不写-n就是把之前的结果一股脑的给下一句)。后半部分是find -type f -size {}c -print0,find指令我们前面见过,-size{}是指找出大小为{}bytes的文件,而-print0则是为了防止文件名里带空格而写的参数。

第五句:

xargs -0 md5sum
Copy after login

xargs我们之前说过,是将前面的结果转化为输入,那么这个-0又是什么意思?man一下xargs,我们看到-0表示读取参数的时候以null为分隔符读取,这也不难理解,毕竟null的二进制表示就是00。后面的md5sum是指计算输入的md5值。

第六句:sort是排序,这个我们前面也见过。

第七句:

uniq -w32 --all-repeated=separate
Copy after login

uniq -w32是指寻找前32个字符相同的行,原因在于md5值一定是32位的,而后面的--all-repeated=separate是指将重复的部分放在一类,分类输出。

第八句:

cut -b 36-
Copy after login

由于我们的结果带着md5值,不是很好看,所以我们截取md5值后面的部分,cut是文本处理函数,这里-b 36-是指只要每行36个字符之后的部分。

我们将上述每个命令用管道链接起来,存入result.txt:

find -not -empty -type f -printf "%sn" | sort -rn |uniq -d | xargs -I{} -n1 find -type f -size {}c -print0 | xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate | cut -b 36- >result.txt
Copy after login

虽然结果很好看,但是有一个问题,这是在Linux下很好看,实际上如果有朋友把输出文件放到Windows上,就会发现换行全没了,这是由于Linux下的换行是n,而windows要求nr,为了解决这个问题,我们最后执行一条指令,将n转换为nr:

cat result.txt | cut -c 36- | tr -s 'n'
Copy after login

The above is the detailed content of Find duplicate files using Linux. For more information, please follow other related articles on the PHP Chinese website!

Statement of this Website
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn

Hot AI Tools

Undresser.AI Undress

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress AI Tool

Undress images for free

Clothoff.io

Clothoff.io

AI clothes remover

AI Hentai Generator

AI Hentai Generator

Generate AI Hentai for free.

Hot Article

R.E.P.O. Energy Crystals Explained and What They Do (Yellow Crystal)
4 weeks ago By 尊渡假赌尊渡假赌尊渡假赌
R.E.P.O. Best Graphic Settings
4 weeks ago By 尊渡假赌尊渡假赌尊渡假赌
R.E.P.O. How to Fix Audio if You Can't Hear Anyone
4 weeks ago By 尊渡假赌尊渡假赌尊渡假赌
WWE 2K25: How To Unlock Everything In MyRise
1 months ago By 尊渡假赌尊渡假赌尊渡假赌

Hot Tools

Notepad++7.3.1

Notepad++7.3.1

Easy-to-use and free code editor

SublimeText3 Chinese version

SublimeText3 Chinese version

Chinese version, very easy to use

Zend Studio 13.0.1

Zend Studio 13.0.1

Powerful PHP integrated development environment

Dreamweaver CS6

Dreamweaver CS6

Visual web development tools

SublimeText3 Mac version

SublimeText3 Mac version

God-level code editing software (SublimeText3)

How to start apache How to start apache Apr 13, 2025 pm 01:06 PM

The steps to start Apache are as follows: Install Apache (command: sudo apt-get install apache2 or download it from the official website) Start Apache (Linux: sudo systemctl start apache2; Windows: Right-click the "Apache2.4" service and select "Start") Check whether it has been started (Linux: sudo systemctl status apache2; Windows: Check the status of the "Apache2.4" service in the service manager) Enable boot automatically (optional, Linux: sudo systemctl

What to do if the apache80 port is occupied What to do if the apache80 port is occupied Apr 13, 2025 pm 01:24 PM

When the Apache 80 port is occupied, the solution is as follows: find out the process that occupies the port and close it. Check the firewall settings to make sure Apache is not blocked. If the above method does not work, please reconfigure Apache to use a different port. Restart the Apache service.

How to optimize the performance of debian readdir How to optimize the performance of debian readdir Apr 13, 2025 am 08:48 AM

In Debian systems, readdir system calls are used to read directory contents. If its performance is not good, try the following optimization strategy: Simplify the number of directory files: Split large directories into multiple small directories as much as possible, reducing the number of items processed per readdir call. Enable directory content caching: build a cache mechanism, update the cache regularly or when directory content changes, and reduce frequent calls to readdir. Memory caches (such as Memcached or Redis) or local caches (such as files or databases) can be considered. Adopt efficient data structure: If you implement directory traversal by yourself, select more efficient data structures (such as hash tables instead of linear search) to store and access directory information

How to restart the apache server How to restart the apache server Apr 13, 2025 pm 01:12 PM

To restart the Apache server, follow these steps: Linux/macOS: Run sudo systemctl restart apache2. Windows: Run net stop Apache2.4 and then net start Apache2.4. Run netstat -a | findstr 80 to check the server status.

How to learn Debian syslog How to learn Debian syslog Apr 13, 2025 am 11:51 AM

This guide will guide you to learn how to use Syslog in Debian systems. Syslog is a key service in Linux systems for logging system and application log messages. It helps administrators monitor and analyze system activity to quickly identify and resolve problems. 1. Basic knowledge of Syslog The core functions of Syslog include: centrally collecting and managing log messages; supporting multiple log output formats and target locations (such as files or networks); providing real-time log viewing and filtering functions. 2. Install and configure Syslog (using Rsyslog) The Debian system uses Rsyslog by default. You can install it with the following command: sudoaptupdatesud

How to solve the problem that apache cannot be started How to solve the problem that apache cannot be started Apr 13, 2025 pm 01:21 PM

Apache cannot start because the following reasons may be: Configuration file syntax error. Conflict with other application ports. Permissions issue. Out of memory. Process deadlock. Daemon failure. SELinux permissions issues. Firewall problem. Software conflict.

Does the internet run on Linux? Does the internet run on Linux? Apr 14, 2025 am 12:03 AM

The Internet does not rely on a single operating system, but Linux plays an important role in it. Linux is widely used in servers and network devices and is popular for its stability, security and scalability.

How to fix apache vulnerability How to fix apache vulnerability Apr 13, 2025 pm 12:54 PM

Steps to fix the Apache vulnerability include: 1. Determine the affected version; 2. Apply security updates; 3. Restart Apache; 4. Verify the fix; 5. Enable security features.

See all articles