Mesos v0.14.0 Release Notes

    • The primary feature in this release is "Slave Recovery", which allows slaves that have been restarted (e.g., after deploys or crashes) to reconnect with old live executors/tasks. To enable slave recovery:

      • First, enable checkpointing on slaves with the "--checkpoint" flag.
      • Frameworks can opt in to this feature by setting "FrameworkInfo.checkpoint" when registering with the master.
      • Once a Framework opts in, a restarted slave will recover all the framework's tasks and executors. The tasks/executors stay alive through a slave restart and reconnect with the restarted slave.
      • Slave recovery also improves the reliability of delivering status updates.
    • The release also includes a new feature called "Resource Reservations", which allows reserving resources on a slave for particular roles (this is an experimental feature).

    • This release also includes a new Mesos plugin for Jenkins which allows Jenkins to dynamically launch Jenkins slaves on a Mesos cluster (This is an experimental feature).
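
    The two-sided opt-in for slave recovery described above (the slave's "--checkpoint" flag plus the framework's "FrameworkInfo.checkpoint" setting) can be illustrated with a small model. This is only a sketch under stated assumptions: the Framework class and recoverable_tasks function are hypothetical names made up for illustration and are not part of the Mesos API, and the sketch assumes that tasks of frameworks that did not opt in are not recovered.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Framework:
    """Hypothetical stand-in for a registered framework (not the Mesos API)."""
    name: str
    checkpoint: bool = False  # models the FrameworkInfo.checkpoint opt-in
    tasks: List[str] = field(default_factory=list)


def recoverable_tasks(slave_checkpointing: bool,
                      frameworks: List[Framework]) -> Dict[str, List[str]]:
    """Return, per framework, the tasks a restarted slave would reconnect with.

    Assumption: recovery requires both the slave-side checkpoint flag and
    the framework-side opt-in; anything else is treated as not recovered.
    """
    if not slave_checkpointing:
        return {}
    return {fw.name: list(fw.tasks) for fw in frameworks if fw.checkpoint}


frameworks = [
    Framework("analytics", checkpoint=True, tasks=["t1", "t2"]),
    Framework("batch", checkpoint=False, tasks=["t3"]),
]

# Only the framework that opted in has its tasks recovered.
print(recoverable_tasks(True, frameworks))   # {'analytics': ['t1', 't2']}
# Without checkpointing on the slave, nothing is recovered.
print(recoverable_tasks(False, frameworks))  # {}
```

    In a real deployment the slave-side switch is the "--checkpoint" flag and the framework-side switch is the "FrameworkInfo.checkpoint" field named above; the model only captures that both must be set.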

    There are also several bug fixes and stability improvements.

    All Issues:

    ** Sub-task

    • [MESOS-548] - Upgrade angular.js to use the full angular-ui.js
    • [MESOS-549] - Change truncated IDs to show on hover
    • [MESOS-630] - Improve the performance of Master::Http::stats().

    ** Bug

    • [MESOS-235] - Mesos daemon ignores --conf option
    • [MESOS-368] - HTTP.Endpoints test is flaky.
    • [MESOS-370] - The process based isolation module should walk the process tree to collect resource usage.
    • [MESOS-380] - Command Executor doesn't send TASK_KILLED for killed tasks.
    • [MESOS-434] - Process isolator libprocess throws exception
    • [MESOS-449] - CgroupsTests are flaky on Ubuntu
    • [MESOS-451] - Always update resources for reregistered executors.
    • [MESOS-461] - Freezer failure while in FREEZING state.
    • [MESOS-479] - SlaveRecoveryTest/0.CleanupExecutor failure.
    • [MESOS-485] - Latest trunk fails on strict aliasing on CentOS
    • [MESOS-490] - Update mesos-daemon.sh (and associated scripts) to work with new flags mechanisms.
    • [MESOS-497] - Queued tasks should be launched in the order they were received
    • [MESOS-499] - Local slave run crashes on startup
    • [MESOS-508] - Master crash due to Broken Pipe
    • [MESOS-514] - FaultToleranceTest.ReconcileIncompleteTasks is flaky
    • [MESOS-522] - ZooKeeperMasterDetectorTest.MasterDetectorExpireSlaveZKSessionNewMaster
    • [MESOS-534] - ReaperTest.TerminatedChildProcess is flaky on Jenkins.
    • [MESOS-545] - Remove hack in post-reviews.py for tracking parent branch
    • [MESOS-582] - HTTP.Endpoints is flaky
    • [MESOS-594] - Add CXXFLAGS='-fno-strict-aliasing' if using gcc 4.4.*.
    • [MESOS-597] - Set MESOS_NATIVE_LIBRARY or (DY)LD_LIBRARY_PATH before launching an executor in order to enable JVM based executors to easily find libmesos.so.
    • [MESOS-599] - Make sure stderr/stdout get launcher output.
    • [MESOS-607] - Slave recovery should properly handle executors that were cleanly terminated in the previous run
    • [MESOS-611] - Refactor slave recovery to ensure slave recovers its state first
    • [MESOS-612] - Slave should not recover completed executors
    • [MESOS-614] - Master should remove checkpointing slave that gets disconnected when the new slave tries to register
    • [MESOS-616] - The Master / Slave should not store frameworks as both active and completed.
    • [MESOS-619] - Master should properly reconcile KillTasks
    • [MESOS-627] - Slave should offer total disk instead of available disk by default
    • [MESOS-628] - A non-checkpointing slave should still cleanup the latest slave symlink
    • [MESOS-632] - Executor driver should commit suicide if it cannot re-connect with a slave after a timeout
    • [MESOS-633] - Master should inform a recovered slave about frameworks that were completed
    • [MESOS-635] - Master doesn't update the task state when it generates TASK_LOST
    • [MESOS-636] - Executors under cgroups isolator die immediately when a slave dies if it has a controlling TTY attached
    • [MESOS-637] - Executor should reregister with the updates in the same order as it received them
    • [MESOS-638] - Slave should not send command executor infos to master when it reregisters
    • [MESOS-640] - Duplicate status update with same UUID crashes the slave
    • [MESOS-644] - Slave doesn't correctly handle checkpointed terminal update whose ack doesn't reach the executor
    • [MESOS-646] - Slave recovery doesn't properly handle checkpointed queued tasks
    • [MESOS-648] - Slave should properly handle partial writes of status updates
    • [MESOS-657] - SlaveRecoveryTest/1.PartitionedSlave fails with cgroups
    • [MESOS-668] - SlaveRecoveryTest/0.MultipleFrameworks flaky
    • [MESOS-671] - CgroupsIsolator does not listen for OOM events on recovered executors.
    • [MESOS-673] - Task reconciliation does not properly release executor resources.
    • [MESOS-675] - CHECK failure in the Master.
    • [MESOS-676] - Slave::reregistered LOG(FATAL)s due to being in RECOVERING state.
    • [MESOS-689] - Master incorrectly rejects tasks for long lived executors if they don't have FrameworkID set

    ** Improvement

    • [MESOS-179] - Need to check for Python development headers
    • [MESOS-221] - New Allocators
    • [MESOS-329] - Add 'help' endpoints to libprocess.
    • [MESOS-552] - Jenkins scheduler should use the latest Mesos jar built from the repo
    • [MESOS-553] - Jenkins plugin should bundle the native Mesos library
    • [MESOS-554] - Jenkins scheduler should properly handle TASK_LOST
    • [MESOS-555] - Jenkins scheduler should reuse a Jenkins slave
    • [MESOS-557] - Upgrade to Bootstrap CSS v2.3.2
    • [MESOS-558] - Upgrade to full release of Angular JS
    • [MESOS-559] - Replace Bootstrap's JS with Angular UI Bootstrap
    • [MESOS-580] - Improve Command Executor
    • [MESOS-613] - Give better guidance when recovery fails
    • [MESOS-626] - Add the ability for example frameworks to checkpoint
    • [MESOS-634] - Make slave recovery more robust by ignoring absence of files
    • [MESOS-651] - Expose slave re-registration time in the Web UI
    • [MESOS-663] - Expose recovery errors when running recovery in --no-strict mode

    ** New Feature

    • [MESOS-110] - Slave Recovery: A slave restart should not restart tasks
    • [MESOS-203] - Killtree that recursively kills sessions
    • [MESOS-504] - Add weighted DRF.
    • [MESOS-505] - Add resource reservations/pools per role.
    • [MESOS-506] - Implement Jenkins scheduler for Mesos

    ** Task

    • [MESOS-643] - Revert the semantics of newly introduced changes to FrameworkReregistered messages
    • [MESOS-647] - Revert the default recovery mode to strict