How to manipulate text data using Python?

王林

May 08, 2023 am 10:07 AM

Use python to process text data

Experimental purpose

Become familiar with Python's basic data structures and with file input and output.

Experimental data

The data comes from the evaluation task of the xx machine learning conference in xxxx and consists of a training set and a test set. The task is to use the given training data to predict whether each relationship in the test set is a positive or negative example, and to append 1 or 0 to the end of each sample.

The training data is laid out as follows. The first column is the relationship type, the second and third columns are the names of the two people, the fourth column is the title, and the fifth column marks the relationship as a positive example (1) or a negative example (0). The sixth column indicates the training set.

Event	Character 1	Character 2	Title	Relationship (0 or 1)	Training set
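As a minimal sketch of the column layout described above, a single training row can be split on tabs and unpacked into named fields. The sample row and the variable names are made up for illustration; they are not taken from the real data set.

```python
# Hypothetical sample row in the six-column, tab-separated training format.
sample_row = "rumored discord\tPerson A\tPerson B\tSome headline\t1\ttrain\n"

# Strip the trailing newline, then split on tabs.
fields = sample_row.rstrip("\n").split("\t")
relation, person_1, person_2, title, label, source = fields

# In the fifth column, 1 marks a positive example and 0 a negative one.
is_positive = label == "1"
print(relation, person_1, person_2, is_positive)
```

Unpacking into six names fails loudly with a ValueError if a row does not have exactly six columns, which can be a useful early check on the input.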

The test set is laid out as follows. Its format is essentially the same as the training set; the only difference is that it carries no positive or negative example label in the fifth column.

Relationship	Character 1	Character 2	Event

Experimental content

Process the training set data, keeping only the first five columns, and name the output file exp1_1.txt.

Split the data obtained in the first step into the 19 relationship types. Store the generated files in the exp1_train folder. Following the order in which the relationship types first appear, the data of the first type goes into 1.txt, the second into 2.txt, and so on up to 19.txt.

Split the test set by relationship type, using the same order of the 19 types as the training set, so that samples of the same relationship type go into one text file and 19 test files are generated. The line format stays the same as in the original test file. Store the files in the exp1_test folder, named 1_test.txt, 2_test.txt, and so on. At the same time, record the position of each sample in the original test set in index files that correspond one to one with the 19 test files. For example, the line numbers of the samples of the first type, "rumored discord", in the original file are saved in index1.txt, those of the second type in index2.txt, and so on.

Problem-solving ideas

1. The first task tests file operations and lists. The main work is to read the .new file, process each line as required, and write the result to a new txt file. The code is as follows:

# Collect each relationship type in order of first appearance
relation_types = []
# Replace xxx with the encoding of your own files
with open("task1.trainSentence.new", "r", encoding='xxx') as file_input:
    # "w" mode creates exp1_1.txt if it does not exist yet
    with open("exp1_1.txt", "w", encoding='xxx') as file_output:
        for line in file_input:                                  # iterate over every line
            arr = line.rstrip("\n").split('\t')                  # split the line on tabs
            if arr[0] not in relation_types:                     # if this relationship type is new
                relation_types.append(arr[0])                    # remember it
            file_output.write("\t".join(arr[:5]) + "\n")         # write only the first five columns
# Both files are closed automatically when the with blocks end
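The listing above indexes the columns directly, so a blank or truncated line would raise an IndexError. A minimal sketch of the same column trimming that skips such lines instead, run here on an in-memory stand-in for the input file (the function name `first_five_columns` and the sample rows are hypothetical):

```python
import io

def first_five_columns(lines):
    """Yield the first five tab-separated columns of each well-formed line."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # skip blank or truncated lines instead of raising IndexError
        yield "\t".join(fields[:5])

# Hypothetical stand-in for the training file: one good row, one blank, one short.
source_lines = io.StringIO(
    "typeA\tp1\tp2\ttitle one\t1\ttrain\n"
    "\n"
    "typeB\tp3\n"
)
result = list(first_five_columns(source_lines))
print(result)
```

Whether malformed lines should be skipped or reported depends on the data; skipping is only one possible choice.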

2. The second task is again about file operations. Starting from the file generated in task 1, the rows are grouped by relationship type, one output file per type, using a loop with a condition. The code is as follows:

import os

os.makedirs("exp1_train", exist_ok=True)                # create the output directory if needed
file_1 = open("exp1_1.txt", encoding='xxx')             # replace xxx with the encoding of your own files
current = None                                          # relationship type of the file being written
b = 0                                                   # sequence number of the current category file
file_2 = None
# This assumes that rows of the same relationship type are adjacent in exp1_1.txt
for line in file_1:                                     # read the file line by line
    arr = line.rstrip("\n").split("\t")                 # split the line on tabs
    if arr[0] != current:                               # the first column differs from the current type
        if file_2 is not None:
            file_2.close()                              # close the previous category file
        current = arr[0]
        b += 1                                          # move on to the next file number
        file_2 = open(os.path.join("exp1_train", "{}.txt".format(b)), "w", encoding='xxx')
    file_2.write("\t".join(arr[:5]) + "\n")             # write the row to its category file
file_1.close()                                          # close exp1_1.txt
if file_2 is not None:
    file_2.close()                                      # close the last category file
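The loop above starts a new file whenever the first column changes, which gives the right numbering only when rows of the same relationship type sit next to each other. If that cannot be assumed, one alternative is to group the rows in a dictionary first. This is a sketch of that idea, not the author's method; the function name and sample rows are hypothetical.

```python
def group_by_relation(lines):
    """Group lines by their first column, keeping categories in first-appearance order."""
    groups = {}  # dicts preserve insertion order in Python 3.7+
    for line in lines:
        relation = line.split("\t", 1)[0]
        groups.setdefault(relation, []).append(line)
    return groups

rows = [
    "typeA\tp1\tp2\tt1\t1\n",
    "typeB\tp3\tp4\tt2\t0\n",
    "typeA\tp5\tp6\tt3\t1\n",  # typeA shows up again after typeB
]
groups = group_by_relation(rows)

# Category numbers follow first appearance: typeA becomes 1, typeB becomes 2.
numbered = {i: lines for i, lines in enumerate(groups.values(), start=1)}
for number, lines in numbered.items():
    print("{}.txt would receive {} line(s)".format(number, len(lines)))
```

Each list in `numbered` could then be written to its own file in one pass, at the cost of holding all rows in memory.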

3. To split the test set into the same 19 categories as the training set, we can build a dictionary that maps each relationship type to its category number, then traverse the test data, look up the relationship of each line, and append lines with the same relationship to the same file.

import os

# Map each relationship type to its category number, in order of first appearance
with open("exp1_1.txt", encoding='xxx') as file_in1:       # replace xxx with the encoding of your own files
    categories = {}
    for line in file_in1:
        relation = line.split('\t')[0]                     # the first column is the relationship type
        if relation not in categories:
            categories[relation] = len(categories) + 1

os.makedirs("exp1_test", exist_ok=True)                    # directory for the per-category test files
index_dir = os.path.join("exp1_test", "exp1_index")        # index files go in exp1_test/exp1_index
os.makedirs(index_dir, exist_ok=True)

with open("task1.test.new", encoding='xxx') as file_in:
    for i, line in enumerate(file_in, start=1):            # i is the line number in the original test set
        relation = line.split('\t')[0]
        n = categories[relation]                           # KeyError if the type never occurs in training
        # "a" mode appends, so delete old output before running the script again
        with open(os.path.join("exp1_test", "{}_test.txt".format(n)), "a", encoding='xxx') as file_out:
            file_out.write(line)
        with open(os.path.join(index_dir, "index{}.txt".format(n)), "a", encoding='xxx') as file_out:
            file_out.write(line.rstrip("\n") + '\t' + str(i) + "\n")
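The index files record where each test sample sat in the original test set. One plausible use, assumed here rather than stated in the task, is to put per-category results back into the original order. A minimal in-memory sketch of that round trip follows; the function names and sample data are hypothetical.

```python
def split_with_index(test_lines, categories):
    """Return {category number: [(original line number, line), ...]}."""
    buckets = {}
    for line_no, line in enumerate(test_lines, start=1):
        relation = line.split("\t", 1)[0]
        number = categories[relation]
        buckets.setdefault(number, []).append((line_no, line))
    return buckets

def restore_order(buckets):
    """Rebuild the original test set order from the per-category buckets."""
    merged = [pair for pairs in buckets.values() for pair in pairs]
    return [line for _, line in sorted(merged)]  # sort by original line number

categories = {"typeA": 1, "typeB": 2}
test_lines = [
    "typeB\tp1\tp2\tt1\n",
    "typeA\tp3\tp4\tt2\n",
    "typeB\tp5\tp6\tt3\n",
]
buckets = split_with_index(test_lines, categories)
restored = restore_order(buckets)
print(restored == test_lines)
```

Because every line number is unique, sorting the (line number, line) pairs is enough to recover the original order.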

Use python to process numerical data

Experimental purpose

Become familiar with Python's basic data structures and with file input and output.

Experimental data

The data comes from the XX Tianchi Competition in XXXX, which is also the XXth Big Data Challenge of Chinese Universities. It includes two tables: the user behavior table mars_tianchi_user_actions.csv and the song artist table mars_tianchi_songs.csv. The competition provides sampled song and artist data, together with the user behavior history related to these artists over 6 months (20150301-20150831). Contestants need to predict each artist's play counts for the following 2 months, that is, 60 days (20150901-20151030).

Experimental content

Process the song artist data mars_tianchi_songs, counting the number of artists and the number of songs for each artist. The output file is exp2_1.csv: the first column is the artist ID, the second column is that artist's number of songs, and the last line gives the number of artists.
Merge the user behavior table and the song artist table into one large table, joined on song_id. The first to fifth columns keep the column names of the user behavior table, and the sixth to tenth columns take the names of the second to sixth columns of the song artist table. The output file is exp2_2.csv.
For each artist, compute the total number of plays of all of that artist's songs on each day. The output file is exp2_3.csv, with the columns artist id, date Ds, and total song plays. Note: only plays are counted here, not downloads or collections.

Problem-solving ideas (using the pandas library):

(1) Use .drop_duplicates() to delete duplicate values

(2) Use .loc[:, 'artist_id'].value_counts() to count how many times each artist appears, that is, the number of songs for each artist

(3) Use .loc[:, 'song_id'].value_counts() to check that no song is duplicated

import pandas as pd

data = pd.read_csv(r"C:\mars_tianchi_songs.csv")                # read the song artist table
Newdata = data.drop_duplicates(subset=['artist_id'])            # keep one row per artist
artist_sum = Newdata['artist_id'].count()                       # number of distinct artists
# artistChongFu_count = data.duplicated(subset=['artist_id']).count()
artistChongFu_count = data.loc[:, 'artist_id'].value_counts()   # occurrences per artist, i.e. songs per artist
songChongFu_count = data.loc[:, 'song_id'].value_counts()       # occurrences per song, to check for duplicate songs
artistChongFu_count.loc['artist_sum'] = artist_sum              # append the artist total as the last row
artistChongFu_count.to_csv('exp2_1.csv')                        # write the result to exp2_1.csv
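To see what these calls return, here is a minimal sketch on a made-up four-row table. The column names follow the text; the values are invented, and only two columns of the real song table are modelled.

```python
import pandas as pd

# Hypothetical miniature of the song artist table.
songs = pd.DataFrame({
    "song_id": ["s1", "s2", "s3", "s4"],
    "artist_id": ["a1", "a1", "a2", "a1"],
})

# value_counts() counts how often each artist_id occurs, i.e. songs per artist.
songs_per_artist = songs["artist_id"].value_counts()

# nunique() gives the number of distinct artists directly.
artist_total = songs["artist_id"].nunique()

# duplicated().any() is True if any song_id appears more than once.
has_duplicate_songs = bool(songs["song_id"].duplicated().any())

print(songs_per_artist)
print(artist_total, has_duplicate_songs)
```

`nunique()` is an alternative to dropping duplicates and then counting rows; both give the number of distinct artists.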

Use merge() to merge two tables

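import pandas as pd

data = pd.read_csv(r"C:\mars_tianchi_songs.csv")              # song artist table
data_two = pd.read_csv(r"C:\mars_tianchi_user_actions.csv")   # user behavior table
num = pd.merge(data_two, data, on='song_id')                  # join the two tables on song_id
num.to_csv('exp2_2.csv', index=False)                         # index=False keeps the row index out of the file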
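By default `pd.merge` performs an inner join, so user actions whose song has no entry in the song table are dropped. A minimal sketch on made-up data showing the difference between the default and `how="left"` (the rows are invented, and the tables are reduced to two columns each):

```python
import pandas as pd

# Hypothetical miniature tables.
actions = pd.DataFrame({
    "user_id": ["u1", "u2", "u3"],
    "song_id": ["s1", "s2", "s9"],  # s9 has no match in the song table
})
songs = pd.DataFrame({
    "song_id": ["s1", "s2"],
    "artist_id": ["a1", "a2"],
})

# Default how="inner": only actions with a matching song are kept.
inner = pd.merge(actions, songs, on="song_id")

# how="left": every action is kept; unmatched rows get NaN in artist_id.
left = pd.merge(actions, songs, on="song_id", how="left")

print(len(inner), len(left))
```

Which join is appropriate depends on whether actions on unknown songs should appear in the merged table.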

Use groupby()[].sum() to add up the values within each group

import pandas as pd

data = pd.read_csv('exp2_2.csv')
# Add up the gmt_create column within each (artist_id, Ds) group.
# Note: this does not filter out downloads or collections.
DataCHongfu = data.groupby(['artist_id', 'Ds'])['gmt_create'].sum()
DataCHongfu.to_csv('exp2_3.csv')
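The task asks for plays only, while the grouping above adds up a column without filtering by action. A minimal sketch of counting plays per artist per day follows. It assumes the merged table has an `action_type` column in which 1 marks a play; that column name and coding are assumptions that do not appear in the code above and should be checked against the real data.

```python
import pandas as pd

# Hypothetical miniature of the merged table.
merged = pd.DataFrame({
    "artist_id":   ["a1", "a1", "a1", "a2"],
    "Ds":          [20150301, 20150301, 20150301, 20150301],
    "action_type": [1, 1, 2, 1],  # assumption: 1 = play, other values = download or collect
})

# Keep only the rows that represent plays.
plays = merged[merged["action_type"] == 1]

# size() counts the rows in each (artist_id, Ds) group.
daily_plays = plays.groupby(["artist_id", "Ds"]).size().reset_index(name="plays")

print(daily_plays)
```

Counting rows with `size()` treats each play record as one play, which differs from summing the values of a column.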
