Tags: Yelp/mrjob

Docker, concurrent steps, and pooling
 * library requirement changes:
   * [emr] requires boto3>=1.10.0, botocore>=1.13.26 (#2193)
   * [google] requires google-cloud-dataproc<=1.1.0
 * cloud runners (Dataproc, EMR):
   * mrjob is now bootstrapped through py_files, not at bootstrap time
 * EMR runner:
   * default image_version is now 6.0.0
   * support Docker on 6.x AMIs (#2179)
     * added docker_client_config, docker_image, docker_mounts opts (see the
       sketch after this entry)
   * allow concurrent steps on EMR clusters (#2185)
     * max_concurrent_steps option
   * for multi-step jobs, can add steps to cluster one at a time
     * by default, does this if cluster supports concurrent steps
     * can be controlled directly with add_steps_in_batch option
   * pooling:
     * join pooled clusters based on YARN cluster metrics (#2191)
       * min_available_mb, min_available_virtual_cores opts
     * upgrades to timing and cluster management:
       * max_clusters_in_pool option (#2192)
       * pool_timeout_minutes (#2199)
       * pool_jitter_seconds to prevent race conditions (#2200)
       * wait for S3 sync after uploading to S3, not before launching cluster
       * don't wait pool_wait_minutes if no clusters to wait for (#2198)
   * get_job_steps() is deprecated
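
The new Docker and pooling options above normally live in mrjob.conf, but they
can also be passed straight to the EMR runner. A minimal sketch, assuming
typical values; the Docker image URI and the numbers are hypothetical, and the
option names come from the list above:

    from mrjob.emr import EMRJobRunner

    runner = EMRJobRunner(
        image_version='6.0.0',     # Docker support requires a 6.x AMI
        docker_image='123456789012.dkr.ecr.us-west-2.amazonaws.com/mrjob-env:latest',
        max_concurrent_steps=5,    # let several jobs' steps share one cluster
        add_steps_in_batch=False,  # submit a multi-step job's steps one at a time
        pool_clusters=True,
        max_clusters_in_pool=10,   # cap the size of the cluster pool
        pool_timeout_minutes=30,   # give up on pooling after this long
        pool_jitter_seconds=60,    # stagger launches to avoid race conditions
    )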

API-efficient cluster pooling
 * cluster pooling changes:
   * cluster locking now uses EMR tags, not S3 objects (#2160)
     * cluster locks always expire after one minute (#2162)
     * deprecated --max-mins-locked (terminate-idle-clusters), does nothing
   * pooling uses API more efficiently
     * most cluster pooling info is in job name (#2160)
     * don't list pooled clusters' steps (#2159)
     * use any matching cluster, not just the "best" one (#2164)
       * "best" cluster determined by NormalizedInstanceHours / hours run
   * matching rules are slightly more strict:
     * mrjob version must always match
     * application list must match exactly
   * terminate_idle_clusters no longer locks pooled clusters
 * Spark runner:
   * counters work when spark_tmp_dir is a local path (#2176)
   * manifest download script correctly handles errors with dash (#2175)

archives on all Spark platforms
 * archives work on non-YARN Spark installations (#1993)
   * mrjob.util.file_ext() ignores initial dots (see the sketch after this entry)
   * archives in setup scripts etc. are auto-named without file extension
   * bootstrap now recognizes archives with names like *.0.7.tar.gz
 * don't copy SSH key to master node when accessing other nodes on EMR (#1209)
   * added ssh_add_bin option
 * extra_cluster_params merges dict params rather than overwriting them (#2154)
 * default python_bin on Python 2 is now 'python2.7' (#2151)
 * ensure working PyYAML installs on Python 3.4 (#2149)
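
A quick sketch of the file_ext() change mentioned above; the return values in
the comments are my reading of "ignores initial dots" and are worth checking
against the docs:

    from mrjob.util import file_ext

    file_ext('README.rst')     # '.rst'
    file_ext('mrjob.tar.gz')   # '.tar.gz' -- everything from the first dot
    file_ext('.mrjob.conf')    # '.conf'   -- the leading dot no longer counts
    file_ext('.emacs')         # ''        -- a bare dotfile has no extension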

fall cleaning
 * moved support for AWS and Google Cloud to extras_require (#1935)
   * use e.g. `pip install mrjob[aws]`
 * removed support for non-Python MRJobs (#2087)
   * removed interpreter and steps_interpreter options (see below)
   * removed the `mrjob run` command
   * removed mr_wc.rb from mrjob/examples/
 * merged the MRJobLauncher class back into MRJob
   * MRJob classes initialized without args read them from sys.argv (#2124)
     * use SomeMRJob([]) to simulate running with no args (e.g. for tests);
       see the sketch after this entry
 * revamped and tested mrjob/examples/ (#2122)
   * mr_grep.py no longer errors on no matches
   * mr_log_sampler.py correctly randomizes lines
   * mr_spark_wordcount.py is no longer case sensitive
     * same with mr_spark_wordcount_script.py
   * mr_text_classifier.py now reads text files directly, no need to encode
     * public domain examples are in mrjob/examples/docs-to-classify
   * renamed mr_words_containing_u_freq_count.py
   * removed some examples that were difficult to test or maintain
 * mrjob audit-emr-usage no longer reads pre-v0.6.0 cluster pool names (#1815)
 * filesystem methods now have consistent arg naming
 * removed the following deprecated code:
   * runner options:
     * emr_api_params
     * interpreter
     * max_hours_idle
     * mins_to_end_of_hour
     * steps_interpreter
     * steps_python_bin
     * visible_to_all_users
   * singular switches (use --archives, etc.):
     * --archive
     * --dir
     * --file
     * --hadoop-arg
     * --libjar
     * --py-file
     * --spark-arg
   * --steps switch from MRJobs (#2046)
     * use --help -v to see help for --mapper etc.
   * MRJob:
     * optparse simulation:
       * add_file_option()
       * add_passthrough_option()
       * configure_options()
       * load_options()
       * pass_through_option()
       * self.args
       * self.OPTION_CLASS
     * parse_output_line()
   * MRJobRunner:
     * file_upload_args kwarg to runner constructor
     * stream_output()
   * mrjob.util:
     * parse_and_save_options()
     * read_file()
     * read_input()
   * filesystems:
     * arguments to CompositeFilesystem constructor (use add_fs())
     * useless local_tmp_dir arg to GCSFilesystem constructor
     * chunk_size arg to GCSFilesystem.put()
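
Because jobs now read sys.argv when constructed with no arguments, tests should
pass an explicit argument list. A minimal sketch using the bundled
word-frequency example and the usual sandbox()/make_runner() testing pattern
(the input text is made up):

    from io import BytesIO

    from mrjob.examples.mr_word_freq_count import MRWordFreqCount

    job = MRWordFreqCount([])  # explicit empty list: "run with no args"
    job.sandbox(stdin=BytesIO(b'one fish two fish\n'))

    with job.make_runner() as runner:
        runner.run()
        counts = dict(job.parse_output(runner.cat_output()))

    # counts should look like {'fish': 2, 'one': 1, 'two': 1}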

Spark log parsing
 * Python 3.4 is again supported, except for Google libraries (#2090)
 * can intermix positional (input file) args to MRJobs on Python 3.7 (#1701);
   see the sketch after this entry
 * all runners:
   * can parse logs to find cause of error in Spark (#2056)
 * EMR runner:
   * retrying on transient API errors now works with pagination (#2005)
   * default image_version (AMI) is now 5.27.0 (#2105)
   * restored m4.large as default instance type for pre-5.13.0 AMIs (#2098)
   * can override emr_configurations with !clear or by Classification (#2097)
 * Spark runner:
   * can run scripts with spark-submit without pyspark in $PYTHONPATH (#2091)
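
A small sketch of the Python 3.7 argument-parsing fix above; the file and
directory names are hypothetical:

    from mrjob.examples.mr_word_freq_count import MRWordFreqCount

    # input files may now appear between switches, not just at the end,
    # so both of these parse the same way on Python 3.7:
    MRWordFreqCount(['-r', 'local', 'part1.txt', '--output-dir', 'out', 'part2.txt'])
    MRWordFreqCount(['-r', 'local', '--output-dir', 'out', 'part1.txt', 'part2.txt'])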

official PyPy support
 * officially support PyPy (#1011)
   * when launched in PyPy, defaults python_bin to pypy or pypy3
 * Spark runner:
   * turn off internal protocol with --skip-internal-protocol (#1952); see the
     sketch after this entry
   * Spark harness can run inside EMR (#2070)
 * EMR runner:
   * default instance type is now m5.xlarge (#2071)
   * log DNS of master node as soon as we know it (#2074)
 * better error when reading YAML conf file without YAML library (#2047)

(This tag is one revision ahead of the released version on PyPI. The only
difference is in docs/requirements.txt, so that readthedocs.org builds work.)
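
A short sketch tying together two of the notes above. The switch name is taken
from the changelog; whether it is exposed as a job-level switch (as assumed
here) is worth confirming, and the input file is hypothetical:

    from mrjob.examples.mr_word_freq_count import MRWordFreqCount

    # Under PyPy nothing special is needed: python_bin now defaults to
    # 'pypy' or 'pypy3', matching the interpreter that launched the job.

    # On the Spark runner, mrjob's internal protocol can be switched off:
    job = MRWordFreqCount(['-r', 'spark', '--skip-internal-protocol', 'input.txt'])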

Better emulation
 * formally dropped support for Python 3.4
   * (still seems to work except for Google libraries)
 * jobs:
   * deprecated add_*_option() methods can take types as their type arg (#2058);
     see the sketch after this entry
 * all runners:
   * archives no longer go into working dir mirror (#2059)
     * fixes bug in v0.6.8 that could break archives on Hadoop
 * sim runners (local, inline):
   * simulated mapreduce.map.input.file is now a file:// URL (#2066)
 * Spark runner:
   * added emulate_map_input_file option (#2061)
     * can optionally emulate mapreduce.map.input.file in first step's mapper
   * increment_counter() emulation now uses correct arg names (#2060)
   * warns if spark_tmp_dir and master aren't both local/remote (#2062)
 * mrjob spark-submit can take switches to script without using "--" (#2055)
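
A sketch of the add_*_option() change above, using the deprecated
optparse-style API it applies to; the job and switch are made up for
illustration:

    from mrjob.job import MRJob

    class MRNGramCount(MRJob):  # hypothetical job

        def configure_options(self):
            super(MRNGramCount, self).configure_options()
            # passing a type object (int) now works here, as it does with the
            # newer add_passthru_arg(); previously optparse wanted type='int'
            self.add_passthrough_option('--ngram-size', type=int, default=2)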

Spark runner
 * updated library dependencies (#2019, #2025)
   * google-cloud-dataproc>=0.3.0
   * google-cloud-logging>=1.9.0
   * google-cloud-storage>=1.13.1
   * PyYAML>=3.10
 * jobs:
   * MRJobs are now Spark-serializable (without calling sandbox())
   * spark() can pass job methods to rdd.map() etc. (#2039); see the sketch
     after this entry
 * all runners:
   * inline runner runs Spark jobs through PySpark (#1965)
   * local runner runs Spark jobs on local-cluster master (#1361)
   * cat_output() now ignores files and subdirs starting with "." too (#1337)
     * this includes Spark checksum files (e.g. .part-00000.crc)
   * empty *_bin options mean use the default, not a no-args command (#1926)
     * affected gcloud_bin, hadoop_bin, sh_bin, ssh_bin
     * *python_bin options already worked this way
   * improved Spark support:
     * full support for setup scripts (was just YARN) (#2048)
     * fully supports uploading files to Spark working dir (#1922)
       * including renaming files (#2017)
       * uploading archives/dirs is still unsupported except on YARN
     * spark.yarn.appMasterEnv.* now only set on YARN (#1919)
     * add_file_arg() works on Spark
       * even on local[*] master (#2031)
     * uses file:// as appropriate when running locally (#1985)
     * won't hang if Hadoop or Spark binary can't be run (#2024)
     * spark master/deploy mode can't be overridden by jobconf (#2032)
     * can search for spark-submit binary in pyspark installation (#1984)
   * (Dataproc runner does not yet support Spark)
 * EMR runner:
   * fixed fs bug that prevented running with non-default temp bucket (#2015)
   * fewer API calls when job retries joining a pooled cluster (#1990)
   * extra_cluster_params can set nested sub-params (#1934)
     * e.g. Instances.EmrManagedMasterSecurityGroup
   * --subnet '' un-sets subnet set in mrjob.conf (#1931)
 * added Spark runner (#1940)
   * runs jobs entirely on Spark, uses `hadoop fs` for HDFS only
   * can use any fs mrjob supports (HDFS, EMR, Dataproc, local)
   * can run "classic" MRJobs normally run on Hadoop streaming (#1972)
     * supports mappers, combiners, reducers, including _init() and _final()
     * makes efficient use of combiners, if available (#1946)
     * supports Hadoop input/output format set in job (#1944)
     * can run consecutive MRSteps in a single Spark step (#1950)
     * respects SORT_VALUES (#1945)
     * emulates Hadoop output compression (#1943)
       * set the same jobconf variables you would in Hadoop
     * can control number of output files
       * set Hadoop jobconf variables to control # of reducers (#1953)
       * or use --max-output-files (#2040)
     * can simulate counters with accumulators (#1955)
     * can handle jobs that load file args in their constructor (#2044)
     * does not support commands (e.g. mapper_cmd(), mapper_pre_filter())
   * (Spark runner does not yet parse logs for probable cause of error)
   * Spark harness renamed to mrjob/spark/harness.py, no need to run directly
   * `mrjob spark-submit` now defaults to spark runner
     * works on emr, hadoop, and local runners as well (#1975)
 * runner filesystems:
   * added put() method to all filesystems (#1977)
     * part size for uploads is now set at fs init time
   * CompositeFilesystem can give up on an un-configured filesystem (#1974)
     * used by the Spark runner when GCS/S3 aren't set up
   * mkdir() can now create buckets (#2014)
   * fs-specific methods now accessed through fs.<name>
     * e.g. runner.fs.s3.make_s3_client()
   * deprecated useless local_tmp_dir arg to GCSFilesystem (#1961)
 * missing mrjob.examples support files now installed
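
A minimal sketch of the jobs change above: since MRJobs are now
Spark-serializable, a spark() step can hand bound job methods straight to RDD
transformations. The job below is illustrative (modeled on the bundled
mr_spark_wordcount.py), not part of mrjob itself:

    import re

    from mrjob.job import MRJob

    WORD_RE = re.compile(r'[\w]+')

    class MRSparkWordFreq(MRJob):  # hypothetical example job

        def spark(self, input_path, output_path):
            # pyspark is imported inside spark(), where the step actually runs
            from pyspark import SparkContext

            sc = SparkContext(appName='mrjob spark() sketch')

            (sc.textFile(input_path)
                .flatMap(self.get_words)         # a bound job method (#2039)
                .map(lambda word: (word, 1))
                .reduceByKey(lambda a, b: a + b)
                .saveAsTextFile(output_path))

            sc.stop()

        def get_words(self, line):
            return WORD_RE.findall(line.lower())

    if __name__ == '__main__':
        MRSparkWordFreq.run()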