pyspark.sql.functions.tuple_difference_integer
- pyspark.sql.functions.tuple_difference_integer(col1, col2)
Returns the set difference of two DataSketches TupleSketch objects with integer summaries, that is, the entries whose keys are in the first sketch but not in the second.
New in version 4.2.0.
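- Parameters
col1 : Column or column name
    The first TupleSketch with integer summaries.
col2 : Column or column name
    The second TupleSketch with integer summaries, whose keys are removed from the first.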
- Returns
Column
    The binary representation of the difference TupleSketch.
See also
Examples
>>> from pyspark.sql import functions as sf
>>> df = spark.createDataFrame([(1, 10, 4, 40), (2, 20, 4, 40), (3, 30, 5, 50), (4, 40, 5, 50)], ["key1", "v1", "key2", "v2"])  # noqa
>>> df = df.agg(
...     sf.tuple_sketch_agg_integer("key1", "v1").alias("sketch1"),
...     sf.tuple_sketch_agg_integer("key2", "v2").alias("sketch2")
... )
>>> df.select(sf.tuple_sketch_estimate_integer(sf.tuple_difference_integer(df.sketch1, "sketch2"))).show()  # noqa
+-------------------------------------------------------------------------+
|tuple_sketch_estimate_integer(tuple_difference_integer(sketch1, sketch2))|
+-------------------------------------------------------------------------+
|                                                                      3.0|
+-------------------------------------------------------------------------+
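To see where the estimate of 3.0 comes from, the following plain-Python sketch models the same data exactly, without Spark or DataSketches. It is only an illustration of the set-difference semantics described above, not the library's implementation: the helper `exact_tuple_difference` is hypothetical, a dict stands in for a sketch, and it assumes that the surviving entries keep the summaries of the first sketch. A real TupleSketch returns an approximate estimate once the number of distinct keys grows large.

```python
# Rows from the example above: (key1, v1, key2, v2).
rows = [(1, 10, 4, 40), (2, 20, 4, 40), (3, 30, 5, 50), (4, 40, 5, 50)]

# Exact stand-ins for the two sketches (assumption: a dict models a sketch).
# Each key1 appears once, so its summary is simply v1.
first = {key1: v1 for key1, v1, _, _ in rows}
# Only the keys of the second sketch matter for the difference.
second_keys = {key2 for _, _, key2, _ in rows}


def exact_tuple_difference(first, second_keys):
    """Hypothetical helper: keep entries of `first` whose key is not in `second_keys`."""
    return {key: summary for key, summary in first.items() if key not in second_keys}


difference = exact_tuple_difference(first, second_keys)
estimate = float(len(difference))  # number of distinct keys left after the difference

print(difference)
print(estimate)
```

Keys 1, 2 and 3 appear only under `key1`, while key 4 appears in both columns and is removed, which leaves three distinct keys.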