How to Split a Pandas DataFrame String Column into Two Columns?
When working with tabular data, it's often necessary to manipulate the data to extract specific pieces of information. One common task is splitting a single column of string values into multiple columns, each containing a portion of the original string.
Problem and Requirement
Suppose we have a DataFrame named df with one column called row that contains string values in the following format:
   row
0  00000 UNITED STATES
1  01000 ALABAMA
2  01001 Autauga County, AL
3  01003 Baldwin County, AL
4  01005 Barbour County, AL
Our goal is to split the row column into two columns: fips, which holds the five-digit code at the start of each string, and row, which is overwritten with the remaining text.
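To follow along, the sample data can be rebuilt by hand. The sketch below is an assumption about how df was created, since that step is not shown here. It also previews the goal with plain fixed-width slicing, which works only because every code is exactly five characters long; fips_preview and name_preview are illustrative names.

```python
import pandas as pd

# Sample data reproduced from the table above (construction is assumed)
df = pd.DataFrame({
    "row": [
        "00000 UNITED STATES",
        "01000 ALABAMA",
        "01001 Autauga County, AL",
        "01003 Baldwin County, AL",
        "01005 Barbour County, AL",
    ]
})

# Fixed-width preview of the goal: first five characters vs. the rest
fips_preview = df["row"].str[:5]
name_preview = df["row"].str[5:].str.strip()
```

Slicing is the most literal reading of "the first five characters", but it breaks as soon as a code has a different length, which is why the approaches below key on the separator or on a pattern instead.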
Solution using str.split()
One way to split the row column is to use the str.split() method. It accepts a literal string or a regular expression as the separator and splits each value wherever the separator matches. In our case, the separator is the run of spaces that follows the code:
r' +'
This regular expression matches one or more spaces. The pattern should not include a capturing group such as (\d{5}): under Python's re.split semantics, captured text is returned as additional pieces, so the result would no longer fit two target columns. Passing n=1 limits the split to the first match, so spaces inside the name (for example in "Autauga County, AL") are left alone. We can then assign the resulting columns to fips and row as follows:
import pandas as pd
# Split the 'row' column into 'fips' and 'row' columns
df[['fips', 'row']] = df['row'].str.split(r' +', n=1, expand=True)
The expand=True parameter is used to specify that the str.split() method should return a DataFrame with multiple columns, rather than a Series of lists.
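The effect of expand can be seen by running the same split both ways. This is a minimal sketch on two of the sample rows; it splits on a literal single space for simplicity, and the names as_lists, as_frame and the column label name are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({"row": ["00000 UNITED STATES", "01001 Autauga County, AL"]})

# Without expand, each row holds a list-like of the split parts
as_lists = df["row"].str.split(" ", n=1)

# With expand=True, the parts become columns 0 and 1 of a new DataFrame
as_frame = df["row"].str.split(" ", n=1, expand=True)
as_frame.columns = ["fips", "name"]
```

Because n=1 stops after the first separator, the expanded result has exactly two columns, which is what makes the two-column assignment possible.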
Result
After executing the above code, our DataFrame df will look like this:
   fips   row
0  00000  UNITED STATES
1  01000  ALABAMA
2  01001  Autauga County, AL
3  01003  Baldwin County, AL
4  01005  Barbour County, AL
Alternative Solution using str.extract()
Another way to split the row column is to use the str.extract() method. This method takes a regular expression and returns a DataFrame with one column per capture group, filled from the first match in each string. Because we want two columns, the pattern needs two capture groups:
r'(\d{5}) +(.*)'
This regular expression captures a sequence of five digits, skips one or more spaces, and captures the rest of the string. We can then assign the resulting DataFrame to the fips and row columns as follows:
import pandas as pd
# Extract the 'fips' code and the remaining text from the 'row' column
df[['fips', 'row']] = df['row'].str.extract(r'(\d{5}) +(.*)')
Result
After executing the above code, our DataFrame df will look like this:
   fips   row
0  00000  UNITED STATES
1  01000  ALABAMA
2  01001  Autauga County, AL
3  01003  Baldwin County, AL
4  01005  Barbour County, AL
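Since str.extract() produces one column per capture group, named groups can label those columns directly. The following is a sketch on two of the sample rows, not part of the solution above; the group name name and the variable parts are illustrative choices.

```python
import pandas as pd

df = pd.DataFrame({"row": ["00000 UNITED STATES", "01001 Autauga County, AL"]})

# One column per capture group; named groups supply the column names
parts = df["row"].str.extract(r"(?P<fips>\d{5}) +(?P<name>.*)")
```

With named groups the returned DataFrame already carries meaningful column names, so it can be inspected or joined back to df without a separate renaming step.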
Both solutions produce the same fips and row columns for this data. The str.split() solution only needs to know the separator, which keeps the call short and works with any separator pattern, while the str.extract() solution spells out the expected shape of each string and makes it explicit which part ends up in which column.
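One practical difference shows up when a value does not follow the expected format. The sketch below uses a made-up row, "UNKNOWN", which is not in the original data, to compare the two methods; extracted and split_out are illustrative names.

```python
import pandas as pd

# Second value is hypothetical: it has no five-digit code
df = pd.DataFrame({"row": ["01001 Autauga County, AL", "UNKNOWN"]})

# extract: a row that does not match the pattern yields missing values
extracted = df["row"].str.extract(r"(\d{5}) +(.*)")

# split: a row without a separator keeps its text in the first column
split_out = df["row"].str.split(" ", n=1, expand=True)
```

Here str.extract() leaves both columns empty for the malformed row, while str.split() places the whole string in the first column, so the extract form may be the safer choice when the input is not guaranteed to be clean.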