I have #16320/#16498 out to bring Spark up to existing standards for our examples. This bug is both a friction log of things I ran into while working on that PR, and a set of anticipated issues with Spark.
I organized this into two sections, Spark-specific-ish and Kubernetes-general. Some of these bullets have open issues and I'm documenting them as part of the vertical slice involved.
Please feel free to correct me if I got something obvious wrong about Spark or Kubernetes. I was a Spark user in a bygone era, but it's moved quite quickly since then, so any expertise has rusted.
## Spark specific
- Where?: RFC: Where to put official Kubernetes packages / templates / etc.? #17027 talks about whether this even belongs in `examples/`, since it works over multiple versions and isn't a teaching example.
- Config: Spark: Update to current example standards, add GCS connector #16320 puts up a static configuration of Spark. The Spark configuration alone is pretty large, and that's not including the Hadoop configuration that also influences it. It's difficult to say what a customer would actually want to tune in production, but it's clear that treating Spark as a full black box is probably wrong (though it has legs for a while). Don't get me wrong, it's not unreasonable to offer an out-of-the-box configuration that just works, but we're going to need a way to load in configuration. In that case, the container is also going to need to compose the incoming configuration with the configuration that "just works" (i.e., as a simple example, if we took in a secret with additional `spark-defaults.conf` entries, we should probably just splat the existing configuration together with it, or compose them to allow the additional config to override KVs in the base image; see the first sketch after this list). See Config Resource Proposal #6477, Improve user experience for creating / updating secrets #4822.
- Jar Files: We're going to need a way to quickly compose base images and other modules. See Image volumes and container volumes #831.
- Parameterized Resources: It's not clear how to merge spark and spark-gluster. The long-term answer is probably things like Facilitate generation of configs for common scenarios #10527, Config template parameterization #11492, etc. Short term, one option might be to assume that you have some storage available that can be mounted as `ReadWriteMany`, and instead of assuming gluster, rewrite the example as either gluster or NFS and use a PV claim (see the sketch after this list). But that's not the only example of a resource that needs to be parameterized / templatized in some fashion.
- Pod Local Storage: The existing example doesn't override `SPARK_LOCAL_DIRS`, so it's only useful for certain workloads, namely in-memory-only workloads. That used to be the bread and butter of Spark because it was all it could do, but it proved rather limiting not to spill. Using network or distributed storage for Spark's intermediate storage is somewhat naive, but it's the most convenient way on our system to configure storage for a ReplicationController. For the short term, given that the replication controller itself is `<N>` wide, for anyone on GCE we could add a script to this directory to provision a set of `<N>` volumes and use a claim to resolve them. (The `SPARK_LOCAL_DIRS` override itself is sketched after this list.)
- Storage Testing: Spark doesn't necessarily need storage that will outlive the life of the pod: if you blow away a worker on a running job, it will recompute the RDD around it, as you'd expect, since storage within Spark is largely annotated caching with `.persist()` plus shuffle spill, both of which are losable. The pod case is different from, e.g., Netflix's Chaos Monkey tests with Spark, though, because the pod comes back with a different name.
- Scheduling: I kept Spark configured in Standalone mode with a fixed-width ReplicationController for the worker pool. There are a lot of really long, interesting conversations to have about how to handle this correctly: Mesos does a lot here with fine-grained scheduling to pick up slack capacity on the cluster, YARN integrates slightly differently, and we could still take yet another approach. (Obviously a large TBD bullet.)
- Resiliency: I put some basic liveness probes on the master/workers, but there's some obvious stuff missing: the master needs to run either (a) in `FILESYSTEM` recovery mode (easiest) or (b) as multi-master backed by ZooKeeper (hardest) in order to withstand a restart with a running application. Similarly, the driver pod itself isn't protected at all right now, but before we protect that, we first want to look at the driver mode, how we want to support Spark app submission in the cluster, and the Zeppelin bullet below. (The liveness/recovery side is sketched after this list.)
- Test: If we actually want to productionize an app, we really should have tests of some sort.
- Zeppelin: We should definitely get Zeppelin working. I morphed the current example into one familiar to Googlers, a Shakespeare word count, but it's using the pyspark driver and lacks "pop".
- Spark UI: The Spark WebUI's worker links won't work because they're cluster IPs. Spark presents its worker IPs as if they're globally accessible, not anticipating that we're proxying the connection to the master's WebUI (which seems utterly sane on its part). We could obviously proxy all of the worker WebUIs as well, but Spark doesn't seem to support a URL-rewrite scheme for the master WebUI page, so we'd probably have to slap an nginx proxy in front of it. (See Service proxy for cluster IP instead of service name #16949 for one way to maybe solve it.)
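To make the Config bullet above concrete: a minimal, untested sketch of taking in extra `spark-defaults.conf` entries via a secret volume. The secret name, mount path, and image name are made up (this is not how #16320 does it), and a hypothetical entrypoint would be responsible for splatting the mounted fragment over the baked-in defaults:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: spark-config-overrides                 # hypothetical name
data:
  spark-defaults.conf: <base64 of the extra spark-defaults.conf fragment>
---
apiVersion: v1
kind: ReplicationController
metadata:
  name: spark-worker-controller
spec:
  replicas: 3
  selector:
    component: spark-worker
  template:
    metadata:
      labels:
        component: spark-worker
    spec:
      containers:
      - name: spark-worker
        image: gcr.io/example/spark-worker     # stand-in image name
        volumeMounts:
        - name: spark-overrides
          mountPath: /opt/spark/conf-overrides # hypothetical entrypoint splats this over conf/spark-defaults.conf
      volumes:
      - name: spark-overrides
        secret:
          secretName: spark-config-overrides
```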
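For the Parameterized Resources bullet, the short-term version might be as simple as having both variants consume a claim like the one below and leaving it to the cluster admin to back it with a gluster- or NFS-based PersistentVolume (name and size are placeholders):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: spark-data                 # placeholder name
spec:
  accessModes:
  - ReadWriteMany                  # satisfied by e.g. a glusterfs- or nfs-backed PersistentVolume
  resources:
    requests:
      storage: 100Gi               # placeholder size
```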
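For the Pod Local Storage bullet, the override itself is small; here's a sketch assuming we're OK with node-local scratch via `emptyDir` for now (paths and image name are made up, and a provisioned-PD claim could be swapped in for the volume):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: spark-worker-scratch-example
  labels:
    component: spark-worker
spec:
  containers:
  - name: spark-worker
    image: gcr.io/example/spark-worker   # stand-in image name
    env:
    - name: SPARK_LOCAL_DIRS             # point shuffle / spill scratch at the mounted volume
      value: /mnt/spark-local
    volumeMounts:
    - name: spark-local
      mountPath: /mnt/spark-local
  volumes:
  - name: spark-local
    emptyDir: {}                         # node-local scratch; lost with the pod, which is fine for spill
```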
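For the Resiliency bullet, roughly what the liveness probe plus `FILESYSTEM` recovery combination could look like on the master. This is an untested sketch: the image name, ports, and recovery path are assumptions, and the recovery volume is shown as `emptyDir`, which would really need to be something that outlives the pod:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: spark-master
  labels:
    component: spark-master
spec:
  containers:
  - name: spark-master
    image: gcr.io/example/spark-master   # stand-in image name
    ports:
    - containerPort: 7077                # master RPC port
    - containerPort: 8080                # master WebUI
    env:
    - name: SPARK_DAEMON_JAVA_OPTS       # standalone recovery settings
      value: "-Dspark.deploy.recoveryMode=FILESYSTEM -Dspark.deploy.recoveryDirectory=/spark-recovery"
    livenessProbe:
      httpGet:
        path: /
        port: 8080                       # probe the WebUI as a cheap liveness signal
      initialDelaySeconds: 30
      timeoutSeconds: 5
    volumeMounts:
    - name: recovery
      mountPath: /spark-recovery
  volumes:
  - name: recovery
    emptyDir: {}                         # would need durable storage to survive rescheduling
```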
## Kubernetes general
- Pod Hostname/Service Mismatch: During the conversion from a `spark-master` pod to a `spark-master` single-pod replication controller, the Spark master objected to the slaves because the master started with a hostname that didn't match the service name the slaves were contacting it on. In this case, the slaves connected to the DNS name `spark-master`, and the master saw messages addressed to `spark-master` and said "nope, that ain't me, I'm `spark-master-a1b2d3`, you must have me confused for someone else". (cf. Container downward/upward API umbrella issue #386; one workaround is sketched after this list.)
- Environment Variable Injection Issues: If you're not actually using the service environment variables, injection can actually throw you off when an environment variable from Docker or Kubernetes overrides something that a vendor script was expecting. For example, I had a service named `spark-master`, which resulted in `SPARK_MASTER_PORT`, an environment variable that Spark's `start-master.sh` script will happily pick up. Unfortunately, the script was expecting a single integer, not `tcp://<ip>:7070`. It would be nice if there was a way to disable service env variable injection. See Pods need to pre-declare service links iff they want the environment variables created #1768 (which seems to cover that possibility, maybe). (The same sketch after this list shows pinning the variable explicitly.)
- Best Practices Friction: Config best practices are at odds with how we test examples currently: the resources probably belong in the same file, but I didn't do it that way because of the example schema test. See examples_test.go tension with config best practices #16444, just filed.
- Slow Dev Iteration: This was probably the longest I've spent iterating on an ensemble of Docker containers, and I found it somewhat painful (until I automated it slightly) to bump the tag in the Makefile, bump it in the `.yaml`s, etc. Our best practices seem to suggest using hardcoded image tags (which makes sense), and I probably would've scripted it further if I'd had to go much longer. The process itself before presenting a PR is interesting, because prior to a PR you basically end up "hiding" the entire thing in a private project, then you display one "bump" to the world. Do we have any issues discussing how to iterate on "the package" (the set of `.yaml`s, Dockerfiles, etc.)? As we get anywhere near packaging, that's going to be one killer feature: the ability to rapidly iterate on actually developing package blobs for k8s (the resources and images both).
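For the Pod Hostname/Service Mismatch and Environment Variable Injection bullets above, the workaround I have in mind is just pinning the variables in the master's container spec, so the vendor scripts see the service name and a plain port instead of the pod hostname and the Docker-link-style `tcp://...` value. This is a sketch: `SPARK_MASTER_IP` is what the standalone `start-master.sh` consults for the host to advertise (if I'm reading it right), 7077 is the conventional standalone master port, the image name is a stand-in, and I'm assuming explicitly set container env vars shadow the injected service variables:

```yaml
apiVersion: v1
kind: ReplicationController
metadata:
  name: spark-master-controller
spec:
  replicas: 1
  selector:
    component: spark-master
  template:
    metadata:
      labels:
        component: spark-master
    spec:
      containers:
      - name: spark-master
        image: gcr.io/example/spark-master   # stand-in image name
        env:
        - name: SPARK_MASTER_IP              # advertise the service name, not the pod hostname
          value: spark-master
        - name: SPARK_MASTER_PORT            # plain integer; shadows the injected tcp://<ip>:<port> value
          value: "7077"
        ports:
        - containerPort: 7077
        - containerPort: 8080
```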