The Typesafe Reactive Platform and
Apache Spark: Experiences,
Challenges and Roadmaps
Stavros Kontopoulos, MSc
Fast Data and
Typesafe’s Reactive Platform
Fast Data for Reactive Applications
Typesafe’s Fast Data Strategy
• Reactive Platform, Fast Data Architecture
This strategy addresses different market needs:
• Microservice architecture with an analytics extension
• Analytics-oriented setup where the core infrastructure can be
managed by Mesos-like tools and where Kafka, HDFS and several DBs
like Riak and Cassandra are first-class citizens.
3
Fast Data for Reactive Applications
Reactive Platform (RP):
• Core elements: Play, Akka, Spark. Scala is the common language.
• ConductR is the glue for managing these elements.
Fast Data Architecture utilizes RP and is meant to provide end-to-end
solutions for highly scalable web apps, IoT and other use cases /
requirements.
4
Fast Data for Reactive Applications
5
Reactive Application traits
Partnerships
Fast Data Partnerships
• Databricks
•Scala insights, backpressure feature
• IBM
•Datapalooza, Big data university (check http://www.spark.tc/)
• Mesosphere
• We deliver a production-grade distro of Spark on Mesos and DCOS
Reactive Applications 7
“If I have seen further it is by standing on the shoulders of giants”
Isaac Newton
The Team
The Team
A dedicated team that
• Contributes to the Spark project: adds features, reviews PRs, tests
releases etc.
• Supports customers deploying Spark with online support and on-site
trainings.
• Promotes Spark technology and/or our RP through talks and other
activities.
• Educates the community with high-quality courses.
9
More on Contribution
The Project - Contributing
• Where to start?
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
Describes the steps to create a PR.
• Tip: Bug fixes, and specifically short fixes, can be easier to contribute.
Documentation updates etc.
• Things you need to understand, as usual:
• the local development/test/debugging lifecycle
• the code style: https://github.com/databricks/scala-style-guide
11
The Project - Contributing...
Tips about debugging:
• Your IDE is your friend, especially with debugging.
You could use SPARK_JAVA_OPTS with spark-shell
SPARK_JAVA_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005"
then remote debug your code.
For driver pass the value to: --driver-java-options
For executors pass the value to: spark.executor.extraJavaOptions (SparkConf)
• As long as your IDE has the sources of the code under examination (it could
be only Spark, for example), you can attach and debug that code
only.
12
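For example, a hypothetical spark-shell invocation that makes the driver wait for a debugger (the port number and suspend settings here are arbitrary choices, not values from the slides):

```
# Make the driver JVM wait for a remote debugger on port 5005
./bin/spark-shell \
  --driver-java-options "-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005"
# Then attach a remote-debug configuration from your IDE to localhost:5005.
```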
The project - A Software Engineering View
•Most active Apache project. Spark is big.
•Project size? A first impression via… LOC (physical number of lines, comment
lines + source lines). A weak metric, but...
•gives you an idea when you first jump into the code
•shows the relative size of each area
•you can derive comment density, which leads to some interesting
properties (Arafat, O.; Riehle, D.: The Comment Density of Open Source Software Code. IEEE ICSE 2009)
•of course you need to consider complexity, external libs, etc. when you actually
start reading the code…
13
The project - A Software Engineering View
LOC for Spark: 601396 (Scala/Java/Python)
LOC metrics for some major components:
spark/sql: 124898 (Scala), 132028 (Java)
spark/core: 114637 (Scala)
spark/mllib: 70278 (Scala)
spark/streaming: 25807 (Scala)
spark/graphx: 7508 (Scala)
14
The Project - Contributing...
Features:
• Spark Streaming backpressure for 1.5 (joint work with Databricks,
SPARK-7398)
• Add support for dynamic allocation in the Mesos coarse-grained
scheduler (SPARK-6287)
• Reactive Streams Receiver (SPARK-10420), ongoing work…
Integration Tests: Created missing integration tests for Mesos
deployments:
• https://github.com/typesafehub/mesos-spark-integration-tests
Other:
• Fixes
• PR reviews
• Voting (http://www.apache.org/foundation/voting.html)
15
Back-pressure
Spark Streaming - The Big Picture:
Receivers receive data streams and cut them into batches. Spark
processes the batches each batch interval and emits the output.
16
[Diagram (Spark Streaming): data streams → receivers → batches → Spark → output]
Back-pressure
The problem:
“Spark Streaming ingests data through receivers at the rate of the producer (or a
user-configured rate limit). When the processing time for a batch is longer than the
batch interval, the system is unstable, data queues up, exhaust resources and fails
(with an OOM, for example).”
17
[Diagram: the data stream enters the receiver on an executor, whose block
generator (rate-capped by spark.streaming.receiver.maxRate, default infinite,
in records per second) cuts it into blocks; block ids are reported to the
ReceiverTracker on the Spark Streaming driver, where on each clock tick the
JobGenerator emits a jobSet to the JobScheduler, which calls runJob on the
SparkContext in the Spark driver.]
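The instability described in the quote above can be illustrated with a toy simulation (a hypothetical Python sketch, not Spark code; all names here are ours): whenever per-batch processing time exceeds the batch interval, the backlog of queued batches grows without bound.

```python
def backlog_after(n_batches, batch_interval_s, processing_time_s):
    """Batches still queued after n_batches intervals, assuming one
    batch arrives per interval and batches are processed serially."""
    backlog = 0.0
    for _ in range(n_batches):
        backlog += 1                                   # a new batch arrives
        # the time available this interval lets us finish this many batches
        backlog -= min(backlog, batch_interval_s / processing_time_s)
    return backlog

# Stable: processing faster than the batch interval -> queue stays empty.
print(backlog_after(100, batch_interval_s=1.0, processing_time_s=0.5))  # 0.0
# Unstable: processing slower than the interval -> queue grows linearly.
print(backlog_after(100, batch_interval_s=1.0, processing_time_s=2.0))  # 50.0
```

In the unstable case each interval adds a net half batch to the queue; in a real deployment that backlog eventually exhausts memory, hence the OOM.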
Back-pressure
Solution:
For each completed batch, estimate a new rate based on the previous batch's
processing time and scheduling delay. Propagate the estimated rate to the block
generator (via the ReceiverTracker), which has a RateLimiter (Guava 13.0).
Details:
• We need to listen for batch completion
• We need an algorithm to actually estimate the new limit.
RateEstimator algorithm used: PID control
https://en.wikipedia.org/wiki/PID_controller
18
Back-pressure - PID Controller
K{p,i,d} are the coefficients.
What to use as the error signal: ingestion speed - processing speed.
It can be shown that the scheduling delay is kept within a constant factor of the
integral term, assuming the processing rate did not change much between two
calculations.
Default coefficients: proportional 1.0, integral 0.2, derivative 0.0
19
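The estimator can be sketched as follows (an illustrative simplification in Python, not Spark's actual Scala implementation; the names, the bootstrap behavior, and the minimum rate are our assumptions). The error is ingestion rate minus processing rate, and the integral term is derived from the scheduling delay, as described above.

```python
class PIDRateEstimator:
    """Simplified PID rate estimator (illustrative sketch only).
    Rates are in records/second; delays and intervals in seconds."""

    def __init__(self, batch_interval, kp=1.0, ki=0.2, kd=0.0, min_rate=100.0):
        self.batch_interval = batch_interval
        self.kp, self.ki, self.kd = kp, ki, kd
        self.min_rate = min_rate          # never throttle below this floor
        self.latest_rate = None
        self.latest_error = 0.0

    def compute(self, num_elements, processing_delay, scheduling_delay):
        processing_rate = num_elements / processing_delay
        if self.latest_rate is None:      # bootstrap from the first batch
            self.latest_rate = processing_rate
        # proportional term: how much faster we ingest than we process
        error = self.latest_rate - processing_rate
        # integral term: accumulated backlog, expressed as a rate
        historical_error = scheduling_delay * processing_rate / self.batch_interval
        # derivative term: how fast the error is changing
        d_error = (error - self.latest_error) / self.batch_interval
        new_rate = max(self.min_rate,
                       self.latest_rate - self.kp * error
                                        - self.ki * historical_error
                                        - self.kd * d_error)
        self.latest_rate, self.latest_error = new_rate, error
        return new_rate

# A 1000-record batch taking 2s to process bootstraps the rate at 500 rec/s;
# a second such batch that also left 1s of scheduling delay pulls it down.
est = PIDRateEstimator(batch_interval=1.0)
print(est.compute(1000, processing_delay=2.0, scheduling_delay=0.0))  # 500.0
print(est.compute(1000, processing_delay=2.0, scheduling_delay=1.0))  # 400.0
```

With the default coefficients, a positive error or a lingering scheduling delay pushes the estimated rate down, which is exactly the behavior that keeps the receiver from outrunning the processing pipeline.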
Back-pressure
Results:
• Backpressure prevents the receiver’s buffer from overflowing.
• Makes it possible to build end-to-end reactive applications.
• Composability becomes possible.
20
Dynamic Allocation
The problem: Auto-scaling executors in Spark, already available on Yarn, was
missing on Mesos.
The general model for cluster managers such as Yarn, Mesos:
Application driver/scheduler uses the cluster to acquire resources and create
executors to run its tasks.
Each executor runs tasks. How many executors do you need to run your tasks?
21
Dynamic Allocation
How does Spark (essentially the application side) request executors?
In coarse-grained mode, if the dynamic allocation flag is enabled
(the spark.dynamicAllocation.enabled property), an instance of
ExecutorAllocationManager (a thread) is started from within SparkContext.
Every 100 milliseconds it checks the executors assigned against the current
task load and adjusts the number of executors as needed.
22
Dynamic Allocation
The logic behind executor adjustment in ExecutorAllocationManager ...
Calculate the max number of executors needed:
maxNeeded = (pending + running + tasksPerExecutor - 1) /
tasksPerExecutor
numExecutorsTarget = min(maxNeeded,
spark.dynamicAllocation.maxExecutors)
if (numExecutorsTarget < oldTarget) downscale
if (the scheduling delay timer expires) upscale
Also check executor expiry times to kill idle executors.
23
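The adjustment arithmetic above can be sketched as follows (an illustrative Python version of the slide's formula; the function name is ours and max_executors stands in for the configured maximum):

```python
def num_executors_target(pending, running, tasks_per_executor, max_executors):
    """Ceiling-divide the outstanding task load by tasks per executor,
    then cap at the configured maximum."""
    max_needed = (pending + running + tasks_per_executor - 1) // tasks_per_executor
    return min(max_needed, max_executors)

# 30 outstanding tasks, 4 tasks per executor -> ceil(30/4) = 8 executors
print(num_executors_target(pending=20, running=10, tasks_per_executor=4,
                           max_executors=100))  # 8
```

Adding `tasks_per_executor - 1` before the integer division is the standard ceiling-division trick, so even one leftover task still claims a whole executor.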
Dynamic Allocation
Connecting to the cluster manager:
The executor-adjustment logic calls sc.requestTotalExecutors, which calls
the corresponding method in CoarseGrainedSchedulerBackend (the Yarn and
Mesos scheduler classes extend it), which does the actual executor
management.
• What we did is provide the appropriate methods in the Mesos
coarse-grained scheduler:
def doKillExecutors(executorIds: Seq[String])
def doRequestTotalExecutors(requestedTotal: Int)
24
Dynamic Allocation
In Yarn/Mesos you can call the following API to autoscale your app from
your SparkContext (supported only in coarse-grained mode):
sc.requestExecutors
sc.killExecutors
But… “the mesos coarse grain scheduler only supports scaling down
since it is already designed to run one executor per slave with the
configured amount of resources.“
“...can scale back up to the same amount of executors”
25
Dynamic Allocation
A smaller problem solved...
Dynamic allocation needs an external shuffle service.
However, there is no reliable way to let the shuffle service clean up the
shuffle data when the driver exits, since it may crash before it notifies
the shuffle service and shuffle data will be cached forever.
We need to implement a reliable way to detect driver termination and
clean up shuffle data accordingly.
SPARK-7820, SPARK-8873
26
Mesos Integration Tests
Why?
• This is joint work with Mesosphere.
• Good software engineering practice. Coverage (nice to have)...
• Prevent the Mesos Spark integration from breaking.
• Faster releases for Spark on Mesos.
• Give Spark developers the option to create a local Mesos cluster to
test their PRs. Anyone can use it; check our repo.
27
Mesos Integration Tests
• It is easy… just build your Spark distro, check out our repository
… and execute ./run_tests.sh distro.tgz
• Optimization of the dev lifecycle is needed (still under development).
• Consists of two parts:
• Scripts to create the cluster
• A test runner that runs the test suite against it.
28
Mesos Integration Tests
• Docker is the technology used to launch the cluster.
• Supports DCOS and local mode.
• Challenges we faced:
• Docker in bridge mode (not supported: SPARK-11638)
• Writing meaningful tests with good assertions.
• Currently the cluster integrates HDFS. We will integrate Zookeeper and
Apache Hive as well.
29
More on Support
Customer Support
• We provide SLAs for different needs, e.g. 24/7 production issues.
• We offer on-site training / online support.
• What customers want so far:
•Training
•On-site consulting / online support
• What do customers ask in support cases?
•Customers usually face problems learning the technology, e.g. how to get
started with Spark, but there are also more mature issues, e.g. large-scale
deployment problems.
31
Next Steps
RoadMap
• What is coming...
• Introduce Kerberos security - the challenge here is to deliver the whole
stack: authentication, authorization, encryption.
• Work with Mesosphere on the Typesafe Spark distro and the Mesos Spark
code area.
• Evaluate Tachyon.
• Officially offer support for other Spark libs (GraphX, MLlib)
•ConductR integration
•Spark notebook
33
©Typesafe 2015 – All Rights Reserved

Typesafe spark- Zalando meetup
