I have #16320/#16498 out to bring Spark up to existing standards for our examples. This bug is both a friction log of things I ran into while working on that PR, and a set of anticipated issues with Spark.
I organized this into two sections, Spark-specific-ish and Kubernetes-general. Some of these bullets have open issues and I'm documenting them as part of the vertical slice involved.
Please feel free to correct me if I got something obvious wrong about Spark or Kubernetes. I was a Spark user in a bygone era, but it's moved quite quickly since then, so any expertise has rusted.
## Spark specific
- Where?: RFC: Where to put official Kubernetes packages / templates / etc.? #17027 talks about whether this even belongs in `examples/`, since it works over multiple versions and isn't a teaching example.
- Config: Spark: Update to current example standards, add GCS connector #16320 puts up a static configuration of Spark. The Spark configuration alone is pretty large, and that's not including the Hadoop configuration that also influences it. It's difficult to say what a customer would actually want to tune in production, but it's clear that treating Spark as a full black box is probably wrong (though it has legs for a while). Don't get me wrong, it's not unreasonable to offer an out-of-the-box configuration that just works, but we're going to need a way to load in configuration. In that case, the container is also going to need to compose the incoming configuration with the configuration that "just works" (i.e., as a simple example, if we took in a secret with additional `spark-defaults.conf` entries, we should probably just splat the existing configuration together with it, or compose them to allow the additional config to override KVs in the base image; see the first sketch after this list). See Config Resource Proposal #6477, Improve user experience for creating / updating secrets #4822.
- Jar Files: We're going to need a way to quickly compose base images and other modules. See Image volumes and container volumes #831.
- Parameterized Resources: It's not clear how to merge spark and spark-gluster. The long-term answer is probably things like Facilitate generation of configs for common scenarios #10527, Config template parameterization #11492, etc. Short term, one option might be to assume that you have some storage available that can be mounted as `ReadWriteMany`, and instead of assuming gluster, rewrite the example as either gluster or NFS and use a PV claim (see the sketch after this list). But that's not the only example of a resource that needs to be parameterized / templatized in some fashion.
- Pod Local Storage: The existing example doesn't override `SPARK_LOCAL_DIRS`, so it's only useful for certain workloads, namely in-memory-only workloads. That used to be the bread and butter of Spark because it was all it could do, but it proved rather limiting not to spill. Using network or distributed storage for Spark's intermediate storage is somewhat naive, but it's the most convenient way on our system to configure storage for a ReplicationController. For the short term, given that the replication controller itself is `<N>` wide, for anyone on GCE we could add a script to this directory to provision a set of `<N>` volumes and use a claim to resolve them. (The `SPARK_LOCAL_DIRS` override itself is sketched after this list.)
- Storage Testing: Spark doesn't necessarily need storage that will outlive the life of the pod: if you blow away a worker on a running job, it will recompute the RDD around it, as you'd expect, since storage within Spark is largely annotated caching with `.persist()` plus shuffle spill, both of which are losable. The pod case is different from, e.g., Netflix's Chaos Monkey tests with Spark, though, because the pod comes back with a different name.
- Scheduling: I kept Spark configured in Standalone mode with a fixed-width ReplicationController for the worker pool. There are a lot of really long, interesting conversations to have about how to handle this correctly: Mesos does a lot here with fine-grained scheduling to pick up slack capacity on the cluster, YARN integrates slightly differently, and we could still take yet another approach. (Obviously a large TBD bullet.)
- Resiliency: I put some basic liveness probes on the master/workers, but there's some obvious stuff missing: the master needs to run either (a) in `FILESYSTEM` recovery mode (easiest) or (b) as multi-master backed by ZooKeeper (hardest) in order to withstand a restart with a running application. Similarly, the driver pod itself isn't protected at all right now, but before we protect that, we first want to look at the driver mode, how we want to support Spark app submission in the cluster, and the Zeppelin bullet below. (The liveness/recovery side is sketched after this list.)
- Test: If we actually want to productionize an app, we really should have tests of some sort.
- Zeppelin: We should definitely get Zeppelin working. I morphed the current example into one familiar to Googlers, a Shakespeare word count, but it's using the pyspark driver and lacks "pop".
- Spark UI: The Spark WebUI's worker links won't work because they're cluster IPs. Spark presents its worker IPs as if they're globally accessible, not anticipating that we're proxying the connection to the master's WebUI (which seems utterly sane on its part). We could obviously proxy all of the worker WebUIs as well, but Spark doesn't seem to support a URL-rewrite scheme for the master WebUI page, so we'd probably have to slap an nginx proxy in front of it. (See Service proxy for cluster IP instead of service name #16949 for one way to maybe solve it.)
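To make the Config bullet above concrete: a minimal, untested sketch of taking in extra `spark-defaults.conf` entries via a secret volume. The secret name, mount path, and image name are made up (this is not how #16320 does it), and a hypothetical entrypoint would be responsible for splatting the mounted fragment over the baked-in defaults:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: spark-config-overrides                 # hypothetical name
data:
  spark-defaults.conf: <base64 of the extra spark-defaults.conf fragment>
---
apiVersion: v1
kind: ReplicationController
metadata:
  name: spark-worker-controller
spec:
  replicas: 3
  selector:
    component: spark-worker
  template:
    metadata:
      labels:
        component: spark-worker
    spec:
      containers:
      - name: spark-worker
        image: gcr.io/example/spark-worker     # stand-in image name
        volumeMounts:
        - name: spark-overrides
          mountPath: /opt/spark/conf-overrides # hypothetical entrypoint splats this over conf/spark-defaults.conf
      volumes:
      - name: spark-overrides
        secret:
          secretName: spark-config-overrides
```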
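For the Parameterized Resources bullet, the short-term version might be as simple as having both variants consume a claim like the one below and leaving it to the cluster admin to back it with a gluster- or NFS-based PersistentVolume (name and size are placeholders):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: spark-data                 # placeholder name
spec:
  accessModes:
  - ReadWriteMany                  # satisfied by e.g. a glusterfs- or nfs-backed PersistentVolume
  resources:
    requests:
      storage: 100Gi               # placeholder size
```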
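For the Pod Local Storage bullet, the override itself is small; here's a sketch assuming we're OK with node-local scratch via `emptyDir` for now (paths and image name are made up, and a provisioned-PD claim could be swapped in for the volume):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: spark-worker-scratch-example
  labels:
    component: spark-worker
spec:
  containers:
  - name: spark-worker
    image: gcr.io/example/spark-worker   # stand-in image name
    env:
    - name: SPARK_LOCAL_DIRS             # point shuffle / spill scratch at the mounted volume
      value: /mnt/spark-local
    volumeMounts:
    - name: spark-local
      mountPath: /mnt/spark-local
  volumes:
  - name: spark-local
    emptyDir: {}                         # node-local scratch; lost with the pod, which is fine for spill
```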
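For the Resiliency bullet, roughly what the liveness probe plus `FILESYSTEM` recovery combination could look like on the master. This is an untested sketch: the image name, ports, and recovery path are assumptions, and the recovery volume is shown as `emptyDir`, which would really need to be something that outlives the pod:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: spark-master
  labels:
    component: spark-master
spec:
  containers:
  - name: spark-master
    image: gcr.io/example/spark-master   # stand-in image name
    ports:
    - containerPort: 7077                # master RPC port
    - containerPort: 8080                # master WebUI
    env:
    - name: SPARK_DAEMON_JAVA_OPTS       # standalone recovery settings
      value: "-Dspark.deploy.recoveryMode=FILESYSTEM -Dspark.deploy.recoveryDirectory=/spark-recovery"
    livenessProbe:
      httpGet:
        path: /
        port: 8080                       # probe the WebUI as a cheap liveness signal
      initialDelaySeconds: 30
      timeoutSeconds: 5
    volumeMounts:
    - name: recovery
      mountPath: /spark-recovery
  volumes:
  - name: recovery
    emptyDir: {}                         # would need durable storage to survive rescheduling
```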
## Kubernetes general
- Pod Hostname/Service Mismatch: During the conversion from a `spark-master` pod to a `spark-master` single-pod replication controller, the Spark master objected to the slaves because the master started with a hostname that didn't match the service name the slaves were contacting it on. In this case, the slaves connected to the DNS name `spark-master`, and the master saw messages addressed to `spark-master` and said "nope, that ain't me, I'm `spark-master-a1b2d3`, you must have me confused for someone else". (cf. Container downward/upward API umbrella issue #386; one workaround is sketched after this list.)
- Environment Variable Injection Issues: If you're not actually using the service environment variables, injection can actually throw you off when an environment variable from Docker or Kubernetes overrides something that a vendor script was expecting. For example, I had a service named `spark-master`, which resulted in `SPARK_MASTER_PORT`, an environment variable that Spark's `start-master.sh` script will happily pick up. Unfortunately, the script was expecting a single integer, not `tcp://<ip>:7070`. It would be nice if there was a way to disable service env variable injection. See Pods need to pre-declare service links iff they want the environment variables created #1768 (which seems to cover that possibility, maybe). (The same sketch after this list shows pinning the variable explicitly.)
- Best Practices Friction: Config best practices are at odds with how we test examples currently: the resources probably belong in the same file, but I didn't do it that way because of the example schema test. See examples_test.go tension with config best practices #16444, just filed.
- Slow Dev Iteration: This was probably the longest I've spent iterating on an ensemble of Docker containers, and I found it somewhat painful (until I automated it slightly) to bump the tag in the Makefile, bump it in the `.yaml`s, etc. Our best practices seem to suggest using hardcoded image tags (which makes sense), and I probably would've scripted it further if I'd had to go much longer. The process itself before presenting a PR is interesting, because prior to a PR you basically end up "hiding" the entire thing in a private project, then you display one "bump" to the world. Do we have any issues discussing how to iterate on "the package" (the set of `.yaml`s, Dockerfiles, etc.)? As we get anywhere near packaging, that's going to be one killer feature: the ability to rapidly iterate on actually developing package blobs for k8s (the resources and images both).
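For the Pod Hostname/Service Mismatch and Environment Variable Injection bullets above, the workaround I have in mind is just pinning the variables in the master's container spec, so the vendor scripts see the service name and a plain port instead of the pod hostname and the Docker-link-style `tcp://...` value. This is a sketch: `SPARK_MASTER_IP` is what the standalone `start-master.sh` consults for the host to advertise (if I'm reading it right), 7077 is the conventional standalone master port, the image name is a stand-in, and I'm assuming explicitly set container env vars shadow the injected service variables:

```yaml
apiVersion: v1
kind: ReplicationController
metadata:
  name: spark-master-controller
spec:
  replicas: 1
  selector:
    component: spark-master
  template:
    metadata:
      labels:
        component: spark-master
    spec:
      containers:
      - name: spark-master
        image: gcr.io/example/spark-master   # stand-in image name
        env:
        - name: SPARK_MASTER_IP              # advertise the service name, not the pod hostname
          value: spark-master
        - name: SPARK_MASTER_PORT            # plain integer; shadows the injected tcp://<ip>:<port> value
          value: "7077"
        ports:
        - containerPort: 7077
        - containerPort: 8080
```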