Tags: Yelp/mrjob

Docker, concurrent steps, and pooling
 * library requirement changes:
   * [emr] requires boto3>=1.10.0, botocore>=1.13.26 (#2193)
   * [google] requires google-cloud-dataproc<=1.1.0
 * cloud runners (Dataproc, EMR):
   * mrjob is now bootstrapped through py_files, not at bootstrap time
 * EMR runner:
   * default image_version is now 6.0.0
   * support Docker on 6.x AMIs (#2179)
     * added docker_client_config, docker_image, docker_mounts opts (see the
       sketch after this entry)
   * allow concurrent steps on EMR clusters (#2185)
     * max_concurrent_steps option
   * for multi-step jobs, can add steps to cluster one at a time
     * by default, does this if cluster supports concurrent steps
     * can be controlled directly with add_steps_in_batch option
   * pooling:
     * join pooled clusters based on YARN cluster metrics (#2191)
       * min_available_mb, min_available_virtual_cores opts
     * upgrades to timing and cluster management:
       * max_clusters_in_pool option (#2192)
       * pool_timeout_minutes (#2199)
       * pool_jitter_seconds to prevent race conditions (#2200)
       * wait for S3 sync after uploading to S3, not before launching cluster
       * don't wait pool_wait_minutes if no clusters to wait for (#2198)
   * get_job_steps() is deprecated
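
The new Docker and pooling options above normally live in mrjob.conf, but they
can also be passed straight to the EMR runner. A minimal sketch, assuming
typical values; the Docker image URI and the numbers are hypothetical, and the
option names come from the list above:

    from mrjob.emr import EMRJobRunner

    runner = EMRJobRunner(
        image_version='6.0.0',     # Docker support requires a 6.x AMI
        docker_image='123456789012.dkr.ecr.us-west-2.amazonaws.com/mrjob-env:latest',
        max_concurrent_steps=5,    # let several jobs' steps share one cluster
        add_steps_in_batch=False,  # submit a multi-step job's steps one at a time
        pool_clusters=True,
        max_clusters_in_pool=10,   # cap the size of the cluster pool
        pool_timeout_minutes=30,   # give up on pooling after this long
        pool_jitter_seconds=60,    # stagger launches to avoid race conditions
    )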

API-efficient cluster pooling
 * cluster pooling changes:
   * cluster locking now uses EMR tags, not S3 objects (#2160)
     * cluster locks always expire after one minute (#2162)
     * deprecated --max-mins-locked (terminate-idle-clusters), does nothing
   * pooling uses API more efficiently
     * most cluster pooling info is in job name (#2160)
     * don't list pooled clusters' steps (#2159)
     * use any matching cluster, not just the "best" one (#2164)
       * "best" cluster determined by NormalizedInstanceHours / hours run
   * matching rules are slightly more strict:
     * mrjob version must always match
     * application list must match exactly
   * terminate_idle_clusters no longer locks pooled clusters
 * Spark runner:
   * counters work when spark_tmp_dir is a local path (#2176)
   * manifest download script correctly handles errors with dash (#2175)

archives on all Spark platforms
 * archives work on non-YARN Spark installations (#1993)
   * mrjob.util.file_ext() ignores initial dots (see the sketch after this entry)
   * archives in setup scripts etc. are auto-named without file extension
   * bootstrap now recognizes archives with names like *.0.7.tar.gz
 * don't copy SSH key to master node when accessing other nodes on EMR (#1209)
   * added ssh_add_bin option
 * extra_cluster_params merges dict params rather than overwriting them (#2154)
 * default python_bin on Python 2 is now 'python2.7' (#2151)
 * ensure working PyYAML installs on Python 3.4 (#2149)
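
A quick sketch of the file_ext() change mentioned above; the return values in
the comments are my reading of "ignores initial dots" and are worth checking
against the docs:

    from mrjob.util import file_ext

    file_ext('README.rst')     # '.rst'
    file_ext('mrjob.tar.gz')   # '.tar.gz' -- everything from the first dot
    file_ext('.mrjob.conf')    # '.conf'   -- the leading dot no longer counts
    file_ext('.emacs')         # ''        -- a bare dotfile has no extension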

fall cleaning
 * moved support for AWS and Google Cloud to extras_require (#1935)
   * use e.g. `pip install mrjob[aws]`
 * removed support for non-Python MRJobs (#2087)
   * removed interpreter and steps_interpreter options (see below)
   * removed the `mrjob run` command
   * removed mr_wc.rb from mrjob/examples/
 * merged the MRJobLauncher class back into MRJob
   * MRJob classes initialized without args read them from sys.argv (#2124)
     * use SomeMRJob([]) to simulate running with no args (e.g. for tests);
       see the sketch after this entry
 * revamped and tested mrjob/examples/ (#2122)
   * mr_grep.py no longer errors on no matches
   * mr_log_sampler.py correctly randomizes lines
   * mr_spark_wordcount.py is no longer case sensitive
     * same with mr_spark_wordcount_script.py
   * mr_text_classifier.py now reads text files directly, no need to encode
     * public domain examples are in mrjob/examples/docs-to-classify
   * renamed mr_words_containing_u_freq_count.py
   * removed some examples that were difficult to test or maintain
 * mrjob audit-emr-usage no longer reads pre-v0.6.0 cluster pool names (#1815)
 * filesystem methods now have consistent arg naming
 * removed the following deprecated code:
   * runner options:
     * emr_api_params
     * interpreter
     * max_hours_idle
     * mins_to_end_of_hour
     * steps_interpreter
     * steps_python_bin
     * visible_to_all_users
   * singular switches (use --archives, etc.):
     * --archive
     * --dir
     * --file
     * --hadoop-arg
     * --libjar
     * --py-file
     * --spark-arg
   * --steps switch from MRJobs (#2046)
     * use --help -v to see help for --mapper etc.
   * MRJob:
     * optparse simulation:
       * add_file_option()
       * add_passthrough_option()
       * configure_options()
       * load_options()
       * pass_through_option()
       * self.args
       * self.OPTION_CLASS
     * parse_output_line()
   * MRJobRunner:
     * file_upload_args kwarg to runner constructor
     * stream_output()
   * mrjob.util:
     * parse_and_save_options()
     * read_file()
     * read_input()
   * filesystems:
     * arguments to CompositeFilesystem constructor (use add_fs())
     * useless local_tmp_dir arg to GCSFilesystem constructor
     * chunk_size arg to GCSFilesystem.put()
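
Because jobs now read sys.argv when constructed with no arguments, tests should
pass an explicit argument list. A minimal sketch using the bundled
word-frequency example and the usual sandbox()/make_runner() testing pattern
(the input text is made up):

    from io import BytesIO

    from mrjob.examples.mr_word_freq_count import MRWordFreqCount

    job = MRWordFreqCount([])  # explicit empty list: "run with no args"
    job.sandbox(stdin=BytesIO(b'one fish two fish\n'))

    with job.make_runner() as runner:
        runner.run()
        counts = dict(job.parse_output(runner.cat_output()))

    # counts should look like {'fish': 2, 'one': 1, 'two': 1}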

Spark log parsing
 * Python 3.4 is again supported, except for Google libraries (#2090)
 * can intermix positional (input file) args to MRJobs on Python 3.7 (#1701);
   see the sketch after this entry
 * all runners:
   * can parse logs to find cause of error in Spark (#2056)
 * EMR runner:
   * retrying on transient API errors now works with pagination (#2005)
   * default image_version (AMI) is now 5.27.0 (#2105)
   * restored m4.large as default instance type for pre-5.13.0 AMIs (#2098)
   * can override emr_configurations with !clear or by Classification (#2097)
 * Spark runner:
   * can run scripts with spark-submit without pyspark in $PYTHONPATH (#2091)
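
A small sketch of the Python 3.7 argument-parsing fix above; the file and
directory names are hypothetical:

    from mrjob.examples.mr_word_freq_count import MRWordFreqCount

    # input files may now appear between switches, not just at the end,
    # so both of these parse the same way on Python 3.7:
    MRWordFreqCount(['-r', 'local', 'part1.txt', '--output-dir', 'out', 'part2.txt'])
    MRWordFreqCount(['-r', 'local', '--output-dir', 'out', 'part1.txt', 'part2.txt'])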

official PyPy support
 * officially support PyPy (#1011)
   * when launched in PyPy, defaults python_bin to pypy or pypy3
 * Spark runner:
   * turn off internal protocol with --skip-internal-protocol (#1952); see the
     sketch after this entry
   * Spark harness can run inside EMR (#2070)
 * EMR runner:
   * default instance type is now m5.xlarge (#2071)
   * log DNS of master node as soon as we know it (#2074)
 * better error when reading YAML conf file without YAML library (#2047)

(This tag is one revision ahead of the released version on PyPI. The only
difference is in docs/requirements.txt, so that readthedocs.org builds work.)
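
A short sketch tying together two of the notes above. The switch name is taken
from the changelog; whether it is exposed as a job-level switch (as assumed
here) is worth confirming, and the input file is hypothetical:

    from mrjob.examples.mr_word_freq_count import MRWordFreqCount

    # Under PyPy nothing special is needed: python_bin now defaults to
    # 'pypy' or 'pypy3', matching the interpreter that launched the job.

    # On the Spark runner, mrjob's internal protocol can be switched off:
    job = MRWordFreqCount(['-r', 'spark', '--skip-internal-protocol', 'input.txt'])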

Better emulation
 * formally dropped support for Python 3.4
   * (still seems to work except for Google libraries)
 * jobs:
   * deprecated add_*_option() methods can take types as their type arg (#2058);
     see the sketch after this entry
 * all runners:
   * archives no longer go into working dir mirror (#2059)
     * fixes bug in v0.6.8 that could break archives on Hadoop
 * sim runners (local, inline):
   * simulated mapreduce.map.input.file is now a file:// URL (#2066)
 * Spark runner:
   * added emulate_map_input_file option (#2061)
     * can optionally emulate mapreduce.map.input.file in first step's mapper
   * increment_counter() emulation now uses correct arg names (#2060)
   * warns if spark_tmp_dir and master aren't both local/remote (#2062)
 * mrjob spark-submit can take switches to script without using "--" (#2055)
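
A sketch of the add_*_option() change above, using the deprecated
optparse-style API it applies to; the job and switch are made up for
illustration:

    from mrjob.job import MRJob

    class MRNGramCount(MRJob):  # hypothetical job

        def configure_options(self):
            super(MRNGramCount, self).configure_options()
            # passing a type object (int) now works here, as it does with the
            # newer add_passthru_arg(); previously optparse wanted type='int'
            self.add_passthrough_option('--ngram-size', type=int, default=2)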

Spark runner
 * updated library dependencies (#2019, #2025)
   * google-cloud-dataproc>=0.3.0
   * google-cloud-logging>=1.9.0
   * google-cloud-storage>=1.13.1
   * PyYAML>=3.10
 * jobs:
   * MRJobs are now Spark-serializable (without calling sandbox())
   * spark() can pass job methods to rdd.map() etc. (#2039); see the sketch
     after this entry
 * all runners:
   * inline runner runs Spark jobs through PySpark (#1965)
   * local runner runs Spark jobs on local-cluster master (#1361)
   * cat_output() now ignores files and subdirs starting with "." too (#1337)
     * this includes Spark checksum files (e.g. .part-00000.crc)
   * empty *_bin options mean use the default, not a no-args command (#1926)
     * affected gcloud_bin, hadoop_bin, sh_bin, ssh_bin
     * *python_bin options already worked this way
   * improved Spark support:
     * full support for setup scripts (was just YARN) (#2048)
     * fully supports uploading files to Spark working dir (#1922)
       * including renaming files (#2017)
       * uploading archives/dirs is still unsupported except on YARN
     * spark.yarn.appMasterEnv.* now only set on YARN (#1919)
     * add_file_arg() works on Spark
       * even on local[*] master (#2031)
     * uses file:// as appropriate when running locally (#1985)
     * won't hang if Hadoop or Spark binary can't be run (#2024)
     * spark master/deploy mode can't be overridden by jobconf (#2032)
     * can search for spark-submit binary in pyspark installation (#1984)
   * (Dataproc runner does not yet support Spark)
 * EMR runner:
   * fixed fs bug that prevented running with non-default temp bucket (#2015)
   * fewer API calls when job retries joining a pooled cluster (#1990)
   * extra_cluster_params can set nested sub-params (#1934)
     * e.g. Instances.EmrManagedMasterSecurityGroup
   * --subnet '' un-sets subnet set in mrjob.conf (#1931)
 * added Spark runner (#1940)
   * runs jobs entirely on Spark, uses `hadoop fs` for HDFS only
   * can use any fs mrjob supports (HDFS, EMR, Dataproc, local)
   * can run "classic" MRJobs normally run on Hadoop streaming (#1972)
     * supports mappers, combiners, reducers, including _init() and _final()
     * makes efficient use of combiners, if available (#1946)
     * supports Hadoop input/output format set in job (#1944)
     * can run consecutive MRSteps in a single Spark step (#1950)
     * respects SORT_VALUES (#1945)
     * emulates Hadoop output compression (#1943)
       * set the same jobconf variables you would in Hadoop
     * can control number of output files
       * set Hadoop jobconf variables to control # of reducers (#1953)
       * or use --max-output-files (#2040)
     * can simulate counters with accumulators (#1955)
     * can handle jobs that load file args in their constructor (#2044)
     * does not support commands (e.g. mapper_cmd(), mapper_pre_filter())
   * (Spark runner does not yet parse logs for probable cause of error)
   * Spark harness renamed to mrjob/spark/harness.py, no need to run directly
   * `mrjob spark-submit` now defaults to spark runner
     * works on emr, hadoop, and local runners as well (#1975)
 * runner filesystems:
   * added put() method to all filesystems (#1977)
     * part size for uploads is now set at fs init time
   * CompositeFilesystem can give up on an un-configured filesystem (#1974)
     * used by the Spark runner when GCS/S3 aren't set up
   * mkdir() can now create buckets (#2014)
   * fs-specific methods now accessed through fs.<name>
     * e.g. runner.fs.s3.make_s3_client()
   * deprecated useless local_tmp_dir arg to GCSFilesystem (#1961)
 * missing mrjob.examples support files now installed
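
A minimal sketch of the jobs change above: since MRJobs are now
Spark-serializable, a spark() step can hand bound job methods straight to RDD
transformations. The job below is illustrative (modeled on the bundled
mr_spark_wordcount.py), not part of mrjob itself:

    import re

    from mrjob.job import MRJob

    WORD_RE = re.compile(r'[\w]+')

    class MRSparkWordFreq(MRJob):  # hypothetical example job

        def spark(self, input_path, output_path):
            # pyspark is imported inside spark(), where the step actually runs
            from pyspark import SparkContext

            sc = SparkContext(appName='mrjob spark() sketch')

            (sc.textFile(input_path)
                .flatMap(self.get_words)         # a bound job method (#2039)
                .map(lambda word: (word, 1))
                .reduceByKey(lambda a, b: a + b)
                .saveAsTextFile(output_path))

            sc.stop()

        def get_words(self, line):
            return WORD_RE.findall(line.lower())

    if __name__ == '__main__':
        MRSparkWordFreq.run()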