Tags: Yelp/mrjob

v0.7.4

Docker, concurrent steps, and pooling

 * library requirement changes:
   * [emr] requires boto3>=1.10.0, botocore>=1.13.26 (#2193)
   * [google] requires google-cloud-dataproc<=1.1.0
 * cloud runners (Dataproc, EMR):
   * mrjob is now bootstrapped through py_files, not at bootstrap time
 * EMR Runner:
   * default image_version is now 6.0.0
   * support Docker on 6.x AMIs (#2179)
     * added docker_client_config, docker_image, docker_mounts opts
   * allow concurrent steps on EMR clusters (#2185)
     * max_concurrent_steps option
     * for multi-step jobs, can add steps to cluster one at a time
       * by default, does this if cluster supports concurrent steps
       * can be controlled directly with add_steps_in_batch option
   * pooling:
     * join pooled clusters based on YARN cluster metrics (#2191)
       * min_available_mb, min_available_virtual_cores opts
     * upgrades to timing and cluster management:
       * max_clusters_in_pool option (#2192)
       * pool_timeout_minutes (#2199)
       * pool_jitter_seconds to prevent race conditions (#2200)
         * wait for S3 sync after uploading to S3, not before launching cluster
       * don't wait pool_wait_minutes if no clusters to wait for (#2198)
   * get_job_steps() is deprecated
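
A minimal sketch of the new Docker, concurrent-step, and pooling options above, with illustrative (not default) values. They are shown here as keyword arguments to the EMR runner; in practice they would usually live under runners: emr: in mrjob.conf, and your job's make_runner() would build the runner for you.

    from mrjob.emr import EMRJobRunner

    runner = EMRJobRunner(
        image_version='6.0.0',       # new default AMI in v0.7.4
        # Docker on 6.x AMIs (#2179); image URI is made up
        docker_image='123456789012.dkr.ecr.us-west-2.amazonaws.com/my-image:latest',
        # concurrent steps (#2185)
        max_concurrent_steps=5,
        add_steps_in_batch=False,    # add steps to the cluster one at a time
        # pooling tuning (#2191, #2192, #2199, #2200)
        pool_clusters=True,
        min_available_mb=4096,
        min_available_virtual_cores=4,
        max_clusters_in_pool=10,
        pool_timeout_minutes=30,
        pool_jitter_seconds=60,
    )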

v0.7.3

API-efficient cluster pooling

 * cluster pooling changes:
   * cluster locking now uses EMR tags, not S3 objects (#2160)
     * cluster locks always expire after one minute (#2162)
       * deprecated --max-mins-locked (terminate-idle-clusters), does nothing
   * pooling uses API more efficiently
     * most cluster pooling info is in job name (#2160)
     * don't list pooled clusters' steps (#2159)
     * use any matching cluster, not just the "best" one (#2164)
       * "best" cluster determined by NormalizedInstanceHours / hours run
   * matching rules are slightly more strict:
     * mrjob version must always match
     * application list must match exactly
 * terminate_idle_clusters no longer locks pooled clusters
 * spark runner:
   * counters work when spark_tmp_dir is a local path (#2176)
 * manifest download script correctly handles errors with dash (#2175)

v0.7.2

archives on all Spark platforms

 * archives work on non-YARN Spark installations (#1993)
   * mrjob.util.file_ext() ignores initial dots
   * archives in setup scripts etc. are auto-named without file extension
   * bootstrap now recognizes archives with names like *.0.7.tar.gz
 * don't copy SSH key to master node when accessing other nodes on EMR (#1209)
   * added ssh_add_bin option
 * extra_cluster_params merges dict params rather than overwriting them (#2154)
 * default python_bin on Python 2 is now 'python2.7' (#2151)
 * ensure working PyYAML installs on Python 3.4 (#2149)
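
A quick illustration of the file_ext() change above; the return values shown are my reading of "ignores initial dots," not output copied from mrjob's tests.

    from mrjob.util import file_ext

    file_ext('foo.tar.gz')    # '.tar.gz', as before
    file_ext('.foo.tar.gz')   # now also '.tar.gz'; previously the leading dot
                              # made the whole filename look like one extension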

v0.7.1

Fixed some bugs and enabled VisibleToAllUsers by default

v0.7.0

fall cleaning

 * moved support for AWS and Google Cloud to extras_require (#1935)
   * use e.g. `pip install mrjob[aws]`
 * removed support for non-Python MRJobs (#2087)
   * removed interpreter and steps_interpreter options (see below)
   * removed the `mrjob run` command
   * removed mr_wc.rb from mrjob/examples/
   * merged the MRJobLauncher class back into MRJob
 * MRJob classes initialized without args read them from sys.argv (#2124)
   * use SomeMRJob([]) to simulate running with no args (e.g. for tests); see the sketch after this list
 * revamped and tested mrjob/examples/ (#2122)
   * mr_grep.py no longer errors on no matches
   * mr_log_sampler.py correctly randomizes lines
   * mr_spark_wordcount.py is no longer case sensitive
     * same with mr_spark_wordcount_script.py
   * mr_text_classifier.py now reads text files directly, no need to encode
     * public domain examples are in mrjob/examples/docs-to-classify
   * renamed mr_words_containing_u_freq_count.py
   * removed some examples that were difficult to test or maintain
 * mrjob audit-emr-usage no longer reads pre-v0.6.0 cluster pool names (#1815)
 * filesystem methods now have consistent arg naming
 * removed the following deprecated code:
   * runner options:
     * emr_api_params
     * interpreter
     * max_hours_idle
     * mins_to_end_of_hour
     * steps_interpreter
     * steps_python_bin
     * visible_to_all_users
   * singular switches (use --archives, etc.):
     * --archive
     * --dir
     * --file
     * --hadoop-arg
     * --libjar
     * --py-file
     * --spark-arg
   * --steps switch from MRJobs (#2046)
     * use --help -v to see help for --mapper etc.
   * MRJob:
     * optparse simulation:
       * add_file_option()
       * add_passthrough_option()
       * configure_options()
       * load_options()
       * pass_through_option()
       * self.args
       * self.OPTION_CLASS
     * parse_output_line()
   * MRJobRunner:
     * file_upload_args kwarg to runner constructor
     * stream_output()
   * mrjob.util:
     * parse_and_save_options()
     * read_file()
     * read_input()
   * filesystems:
     * arguments to CompositeFilesystem constructor (use add_fs())
     * useless local_tmp_dir arg to GCSFilesystem constructor
     * chunk_size arg to GCSFilesystem.put()
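
A short sketch of the new no-args constructor behavior (the SomeMRJob([]) note above). MRCharFreq is a toy job class invented for illustration, but the testing pattern is the standard make_runner()/parse_output() one.

    from io import BytesIO

    from mrjob.job import MRJob


    class MRCharFreq(MRJob):
        """Toy job used only to illustrate the constructor change."""

        def mapper(self, _, line):
            for c in line:
                yield c, 1

        def reducer(self, key, values):
            yield key, sum(values)


    # MRCharFreq() with no arguments now reads options from sys.argv, exactly
    # as if it were run from the command line. In tests, pass an explicit
    # (possibly empty) argument list instead:
    job = MRCharFreq([])
    job.sandbox(stdin=BytesIO(b'one fish\ntwo fish\n'))

    with job.make_runner() as runner:
        runner.run()
        for key, value in job.parse_output(runner.cat_output()):
            print(key, value)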

v0.6.12

unbreak Google

 * default image_version on Dataproc is now 1.3 (#2110)
 * local filesystem can now handle file:// URIs (#1986)
   * sim runners accept file:// URIs as input files, upload files/archives

v0.6.11

Spark log parsing

 * Python 3.4 is again supported, except for Google libraries (#2090)
 * can intermix positional (input file) args to MRJobs on Python 3.7 (#1701)
 * all runners
   * can parse logs to find cause of error in Spark (#2056)
 * EMR runner
   * retrying on transient API errors now works with pagination (#2005)
   * default image_version (AMI) is now 5.27.0 (#2105)
   * restored m4.large as default instance type for pre-5.13.0 AMIs (#2098)
   * can override emr_configurations with !clear or by Classification (#2097)
 * Spark runner
   * can run scripts with spark-submit without pyspark in $PYTHONPATH (#2091)

v0.6.10

official PyPy support

 * officially support PyPy (#1011)
   * when launched in PyPy, defaults python_bin to pypy or pypy3
 * Spark runner
   * turn off internal protocol with --skip-internal-protocol (#1952)
   * Spark harness can run inside EMR (#2070)
 * EMR runner
   * default instance type is now m5.xlarge (#2071)
   * log DNS of master node as soon as we know it (#2074)
 * better error when reading YAML conf file without YAML library (#2047)

v0.6.9

Better emulation (2019-05-29)

 * formally dropped support for Python 3.4
   * (still seems to work except for Google libraries)
 * jobs:
   * deprecated add_*_option() methods can take types as their type arg (#2058)
 * all runners
   * archives no longer go into working dir mirror (#2059)
     * fixes bug in v0.6.8 that could break archives on Hadoop
 * sim runners (local, inline)
   * simulated mapreduce.map.input.file is now a file:// URL (#2066)
 * Spark runner
   * added emulate_map_input_file option (#2061)
     * can optionally emulate mapreduce.map.input.file in first step's mapper
   * increment_counter() emulation now uses correct arg names (#2060)
   * warns if spark_tmp_dir and master aren't both local/remote (#2062)
 * mrjob spark-submit can take switches to script without using "--" (#2055)

(This tag is one revision ahead of the released version on PyPI. The only
difference is in docs/requirements.txt so that readthedocs.org builds)
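
A tiny example of a job that depends on mapreduce.map.input.file, read through mrjob.compat.jobconf_from_env(); with the new emulate_map_input_file option, the Spark runner can populate it for the first step's mapper, as the sim runners and Hadoop already do. MRCountByFile is invented for illustration.

    from mrjob.compat import jobconf_from_env
    from mrjob.job import MRJob


    class MRCountByFile(MRJob):
        """Toy job: count input lines per input file."""

        def mapper(self, _, line):
            yield jobconf_from_env('mapreduce.map.input.file'), 1

        def reducer(self, path, counts):
            yield path, sum(counts)


    if __name__ == '__main__':
        MRCountByFile.run()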

v0.6.8

Spark runner

 * updated library dependencies (#2019, #2025)
   * google-cloud-dataproc>=0.3.0
   * google-cloud-logging>=1.9.0
   * google-cloud-storage>=1.13.1
   * PyYAML>=3.10
 * jobs:
   * MRJobs are now Spark-serializable (without calling sandbox())
     * spark() can pass job methods to rdd.map() etc. (#2039); see the sketch at the end of this section
 * all runners:
   * inline runner runs Spark jobs through PySpark (#1965)
   * local runner runs Spark jobs on local-cluster master (#1361)
   * cat_output() now ignores files and subdirs starting with "." too (#1337)
     * this includes Spark checksum files (e.g. .part-00000.crc)
   * empty *_bin options mean use the default, not a no-args command (#1926)
     * affected gcloud_bin, hadoop_bin, sh_bin, ssh_bin
     * *python_bin options already worked this way
   * improved Spark support
     * full support for setup scripts (was just YARN) (#2048)
     * fully supports uploading files to Spark working dir (#1922)
       * including renaming files (#2017)
       * uploading archives/dirs is still unsupported except on YARN
     * spark.yarn.appMasterEnv.* now only set on YARN (#1919)
     * add_file_arg() works on Spark
       * even on local[*] master (#2031)
     * uses file:// as appropriate when running locally (#1985)
     * won't hang if Hadoop or Spark binary can't be run (#2024)
     * spark master/deploy mode can't be overridden by jobconf (#2032)
     * can search for spark-submit binary in pyspark installation (#1984)
     * (Dataproc runner does not yet support Spark)
 * EMR runner:
   * fixed fs bug that prevented running with non-default temp bucket (#2015)
   * fewer API calls when a job retries joining a pooled cluster (#1990)
   * extra_cluster_params can set nested sub-params (#1934)
     * e.g. Instances.EmrManagedMasterSecurityGroup
   * --subnet '' un-sets subnet set in mrjob.conf (#1931)
 * added Spark runner (#1940)
   * runs jobs entirely on Spark, uses `hadoop fs` for HDFS only
   * can use any fs mrjob supports (HDFS, EMR, Dataproc, local)
   * can run "classic" MRJobs normally run on Hadoop streaming (#1972)
     * supports mappers, combiners, reducers, including _init() and _final()
     * makes efficient use of combiners, if available (#1946)
     * supports Hadoop input/output format set in job (#1944)
     * can run consecutive MRSteps in a single Spark step (#1950)
     * respects SORT_VALUES (#1945)
     * emulates Hadoop output compression (#1943)
       * set the same jobconf variables you would in Hadoop
     * can control number of output files
       * set Hadoop jobconf variables to control # of reducers (#1953)
       * or use --max-output-files (#2040)
     * can simulate counters with accumulators (#1955)
     * can handle jobs that load file args in their constructor (#2044)
     * does not support commands (e.g. mapper_cmd(), mapper_pre_filter())
     * (Spark runner does not yet parse logs for probable cause of error)
   * Spark harness renamed to mrjob/spark/harness.py, no need to run directly
 * `mrjob spark-submit` now defaults to spark runner
   * works on emr, hadoop, and local runners as well (#1975)
 * runner filesystems:
   * added put() method to all filesystems (#1977)
     * part size for uploads is now set at fs init time
   * CompositeFilesystem can give up on an un-configured filesystem (#1974)
     * used by the Spark runner when GCS/S3 aren't set up
   * mkdir() can now create buckets (#2014)
   * fs-specific methods now accessed through fs.<name>
     * e.g. runner.fs.s3.make_s3_client()
   * deprecated useless local_tmp_dir arg to GCSFilesystem (#1961)
 * missing mrjob.examples support files now installed
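
A rough sketch of the Spark-serializable job methods noted under "jobs:" above (#2039); the job and method names are invented for illustration, but the spark() step signature is mrjob's usual one.

    from mrjob.job import MRJob


    class MRSparkNormalize(MRJob):
        """Toy Spark step that ships a bound job method to executors."""

        def normalize(self, line):
            return line.strip().lower()

        def spark(self, input_path, output_path):
            from pyspark import SparkContext

            sc = SparkContext(appName='normalize lines')

            # self.normalize can now be passed to rdd.map() without
            # calling sandbox() first
            (sc.textFile(input_path)
                .map(self.normalize)
                .saveAsTextFile(output_path))

            sc.stop()


    if __name__ == '__main__':
        MRSparkNormalize.run()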