In this article, I will show you how to use Pandas to perform data analysis with SQL-style filtering. Most enterprise data is stored in databases that require SQL to retrieve and manipulate. For example, companies like Oracle, IBM, Microsoft have their own databases and their own SQL implementations.
Data scientists must deal with SQL at some point in their career because data is not always stored in CSV files. I personally prefer to use Oracle because most of my company's data is stored in Oracle.
Scenario – 1 Suppose we have a task to find all movies from our dataset that have the following criteria.
SELECT FROM WHERE title AS movie_title ,original_language AS movie_language ,popularityAS movie_popularity ,statusAS movie_status ,vote_count AS movie_vote_count movies_data original_languageIN ('en', 'es') AND status=('Released') AND popularitybetween 500 AND 1000 AND vote_count > 5000;
Now that you have seen the SQL statement that meets your needs, let us proceed step by step using pandas. I'm going to show you two ways.
1. Load the movies_data dataset into the DataFrame.
import pandas as pd movies = pd.read_csv("https://raw.githubusercontent.com/sasankac/TestDataSet/master/movies_data.csv")
Assign a variable to each condition.
languages = [ "en" , "es" ] condition_on_languages = movies . original_language . isin ( languages ) condition_on_status = movies . status == "Released" condition_on_popularity = movies . popularity . between ( 500 , 1000 ) condition_on_votecount = movies . vote_count > 5000
3. Combine all conditions (boolean array) together.
final_conditions = ( condition_on_languages & condition_on_status & condition_on_popularity & condition_on_votecount ) columns = [ "title" , "original_language" , "status" , "popularity" , "vote_count" ] # clubbing all together movies . loc [ final_conditions , columns ]
Title | original_language | Status | Popularity | vote_count< p> |
---|---|---|---|---|
##95星 | zhpublished | 724.247784 | 10867 | < /td> |
zh | Released | 514.569956 | 10995 |
Method 2:- .query() method.
If there are spaces in the column names, use the python replace function to replace them with underscores.
In my experience, I have found that the query() method is faster than previous methods when applied to larger DataFrames. < /p>
import pandas as pd movies = pd . read_csv ( "https://raw.githubusercontent.com/sasankac/TestDataSet/master/movies_data.csv" )
4. Build the query string and execute the method.
Please note that the .query method does not work with triple quoted strings that span multiple lines.final_conditions = (
"original_language in ['en','es']"
"and status == 'Released' "
"and popularity > 500 "
"and popularity < 1000"
"and vote_count > 5000"
) final_result = movies . query ( final_conditions )
final_result
##Budget | < /p>id | original_language | original_title | Popularity | Release Date | Revenue | Runtime | st | |
---|---|---|---|---|---|---|---|---|---|
165000000 | 157336 | zh | 星 | 724.247784 | May 11, 2014 | ##675120017< /p> ##169.0 | Relationship | ##788 | |
293660 | zh | < td>Deadpool514.569956 | September 2, 2016 | p>783112979 | 108.0 | Relationship | ## There are many more, usually in my coding I have Multiple values to check my "in" clause. So the syntax above is not ideal. Python variables can be referenced using the at symbol (@). |
movie_languages = [ 'en' , 'es' ] final_conditions = ( "original_language in @movie_languages " "and status == 'Released' " "and popularity > 500 " "and popularity < 1000" "and vote_count > 5000" ) final_result = movies . query ( final_conditions ) final_result
##Budget
original_language | original_title | Popularity | Release Date | Revenue |
| Runtimest | ## 95 | ||
---|---|---|---|---|---|---|---|---|---|
157336 zh |
星 |
724.247784 |
May 11, 2014 |
675120017 |
169.0 |
Relationship |
##788 | ##58000000 | |
zh | Deadpool | 514.569956September 2, 2016 | 783112979 | 108.0 p> | Relationship |
The above is the detailed content of How to select a subset of data in SQL query style in Pandas?. For more information, please follow other related articles on the PHP Chinese website!