spark-hyperloglog

Algebird's HyperLogLog support for Apache Spark. This package can be used in concert with presto-hyperloglog to share HyperLogLog sets between Spark and Presto.

Installing

This project is published as mozilla/spark-hyperloglog on spark-packages.org, so is available via:

spark --packages mozilla:spark-hyperloglog:2.2.0

Example usage

import com.mozilla.spark.sql.hyperloglog.aggregates._
import com.mozilla.spark.sql.hyperloglog.functions._

val hllMerge = new HyperLogLogMerge
spark.udf.register("hll_merge", hllMerge)
spark.udf.register("hll_create", hllCreate _)
spark.udf.register("hll_cardinality", hllCardinality _)

val frame = sc.parallelize(List("a", "b", "c", "c")).toDF("id")
(frame
  .select(expr("hll_create(id, 12) as hll"))
  .groupBy()
  .agg(expr("hll_cardinality(hll_merge(hll)) as count"))
  .show())

yields:

+-----+
|count|
+-----+
|    3|
+-----+

Deployment

To publish a new version of the package, you need to create a new release on GitHub with a tag version starting with v like v2.2.0. The tag will trigger a CircleCI build that publishes to Mozilla's maven repo in S3.

The CircleCI build will also attempt to publish the new tag to spark-packages.org, but due to an outstanding bug in the sbt-spark-package plugin that publish will likely fail. You can retry locally until is succeeds by creating a GitHub personal access token and, exporting the environment variables GITHUB_USERNAME and GITHUB_PERSONAL_ACCESS_TOKEN, and then repeatedly running sbt spPublish until you get a non-404 response.

Name		Name	Last commit message	Last commit date
Latest commit History 48 Commits
.circleci		.circleci
project		project
python		python
src		src
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
build.sbt		build.sbt
scalastyle-config.xml		scalastyle-config.xml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

spark-hyperloglog

Installing

Example usage

Deployment

About

Releases 2

Packages

Languages

License

mozilla/spark-hyperloglog

Folders and files

Latest commit

History

Repository files navigation

spark-hyperloglog

Installing

Example usage

Deployment

About

Resources

License

Stars

Watchers

Forks

Releases 2

Packages 0

Languages

Packages