rodwill/sf-crime-data-p2

How did changing values on the SparkSession property parameters affect the throughput and latency of the data?

The only parameters that had any effect in standalone mode were "master" and "spark.default.parallelism". The number in brackets in the "master" value indicates the number of threads used by the workers.

What were the 2-3 most efficient SparkSession property key/value pairs? Through testing multiple variations on values, how can you tell these were the most optimal?

It's difficult to measure variation in results with a dataset of this small size (I produced around 200k messages in the Kafka topic). But in standalone mode, the two main parameters were "master" set to local[8], meaning 8 threads, and "spark.default.parallelism" set to 2.

I also tried changing the parameters spark.executor.cores, spark.executor.memory and spark.cores.max, but saw no change in the metric values.

Screenshots
