Filtering a Pyspark DataFrame with a SQL-like IN clause can be achieved with string formatting.
In the given example:
sc = SparkContext() sqlc = SQLContext(sc) df = sqlc.sql('SELECT * from my_df WHERE field1 IN a')
Strings passed to SQLContext are evaluated in the SQL environment and do not capture closures. To pass variables explicitly, use string formatting:
df.registerTempTable("df") sqlContext.sql("SELECT * FROM df WHERE v IN {0}".format(("foo", "bar"))).count()
Alternatively, the DataFrame DSL provides a better option for dynamic queries:
from pyspark.sql.functions import col df.where(col("v").isin({"foo", "bar"})).count()
The above is the detailed content of How to Efficiently Filter PySpark DataFrames Using an IN Clause?. For more information, please follow other related articles on the PHP Chinese website!