Mesos v0.14.0 Release Notes
-
The primary feature in this release is "Slave Recovery", which allows a slave that has been restarted (e.g., after a deploy or a crash) to reconnect with its old, still-live executors/tasks. To enable slave recovery:
- First, enable checkpointing on slaves with the "--checkpoint" flag.
- Frameworks can opt in to this feature by setting "FrameworkInfo.checkpoint" when registering with the master.
- Once a Framework opts in, a restarted slave will recover all the framework's tasks and executors. The tasks/executors stay alive through a slave restart and reconnect with the restarted slave.
- Slave recovery also improves the reliability of delivering status updates.
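
The opt-in rules above can be sketched as a small Python model. This is only an illustration of the two conditions described in these notes (checkpointing enabled on the slave, and "FrameworkInfo.checkpoint" set by the framework); the `Framework` class and `recovered_tasks` function are hypothetical names and are not part of the Mesos API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Framework:
    """Hypothetical stand-in for a registered framework (not a Mesos type)."""
    name: str
    checkpoint: bool  # mirrors the FrameworkInfo.checkpoint opt-in
    tasks: List[str] = field(default_factory=list)


def recovered_tasks(slave_checkpoint: bool, frameworks: List[Framework]) -> List[str]:
    """Return the tasks a restarted slave reconnects with in this simplified model.

    Assumption: recovery applies only when the slave has checkpointing enabled
    AND the framework has opted in.
    """
    if not slave_checkpoint:
        return []
    recovered: List[str] = []
    for fw in frameworks:
        if fw.checkpoint:
            recovered.extend(fw.tasks)
    return recovered


frameworks = [
    Framework("batch", checkpoint=True, tasks=["t1", "t2"]),
    Framework("web", checkpoint=False, tasks=["t3"]),
]

# Only tasks of the opted-in framework are recovered.
print(recovered_tasks(True, frameworks))
# With checkpointing disabled on the slave, nothing is recovered.
print(recovered_tasks(False, frameworks))
```

In this model, a framework that does not set the checkpoint option is simply skipped on recovery, which matches the opt-in behavior described above.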
The release also includes a new feature called "Resource Reservations", which allows resources on a slave to be reserved for particular roles (this is an experimental feature).
This release also includes a new Mesos plugin for Jenkins, which allows Jenkins to dynamically launch Jenkins slaves on a Mesos cluster (this is an experimental feature).
There are also several bug fixes and stability improvements.
All Issues:
** Sub-task
- [MESOS-548] - Upgrade angular.js to use the full angular-ui.js
- [MESOS-549] - Change truncated IDs to show on hover
- [MESOS-630] - Improve the performance of Master::Http::stats().
** Bug
- [MESOS-235] - Mesos daemon ignores --conf option
- [MESOS-368] - HTTP.Endpoints test is flaky.
- [MESOS-370] - The process based isolation module should walk the process tree to collect resource usage.
- [MESOS-380] - Command Executor doesn't send TASK_KILLED for killed tasks.
- [MESOS-434] - Process isolator libprocess throws exception
- [MESOS-449] - CgroupsTests are flaky on Ubuntu
- [MESOS-451] - Always update resources for reregistered executors.
- [MESOS-461] - Freezer failure while in FREEZING state.
- [MESOS-479] - SlaveRecoveryTest/0.CleanupExecutor failure.
- [MESOS-485] - Latest trunk fails on strict aliasing on CentOS
- [MESOS-490] - Update mesos-daemon.sh (and associated scripts) to work with new flags mechanisms.
- [MESOS-497] - Queued tasks should be launched in the order they were received
- [MESOS-499] - Local slave run crashes on startup
- [MESOS-508] - Master crash due to Broken Pipe
- [MESOS-514] - FaultToleranceTest.ReconcileIncompleteTasks is flaky
- [MESOS-522] - ZooKeeperMasterDetectorTest.MasterDetectorExpireSlaveZKSessionNewMaster
- [MESOS-534] - ReaperTest.TerminatedChildProcess is flaky on Jenkins.
- [MESOS-545] - Remove hack in post-reviews.py for tracking parent branch
- [MESOS-582] - HTTP.Endpoints is flaky
- [MESOS-594] - Add CXXFLAGS='-fno-strict-aliasing' if using gcc 4.4.*.
- [MESOS-597] - Set MESOS_NATIVE_LIBRARY or (DY)LD_LIBRARY_PATH before launching an executor in order to enable JVM based executors to easily find libmesos.so.
- [MESOS-599] - Make sure stderr/stdout get launcher output.
- [MESOS-607] - Slave recovery should properly handle executors that were cleanly terminated in the previous run
- [MESOS-611] - Refactor slave recovery to ensure slave recovers its state first
- [MESOS-612] - Slave should not recover completed executors
- [MESOS-614] - Master should remove checkpointing slave that gets disconnected when the new slave tries to register
- [MESOS-616] - The Master / Slave should not store frameworks as both active and completed.
- [MESOS-619] - Master should properly reconcile KillTasks
- [MESOS-627] - Slave should offer total disk instead of available disk by default
- [MESOS-628] - A non-checkpointing slave should still cleanup the latest slave symlink
- [MESOS-632] - Executor driver should commit suicide if it cannot re-connect with a slave after a timeout
- [MESOS-633] - Master should inform a recovered slave about frameworks that were completed
- [MESOS-635] - Master doesn't update the task state when it generates TASK_LOST
- [MESOS-636] - Executors under cgroups isolator die immediately when a slave dies if it has a controlling TTY attached
- [MESOS-637] - Executor should reregister with the updates in the same order as it received them
- [MESOS-638] - Slave should not send command executor infos to master when it reregisters
- [MESOS-640] - Duplicate status update with same UUID crashes the slave
- [MESOS-644] - Slave doesn't correctly handle checkpointed terminal update whose ack doesn't reach the executor
- [MESOS-646] - Slave recovery doesn't properly handle checkpointed queued tasks
- [MESOS-648] - Slave should properly handle partial writes of status updates
- [MESOS-657] - SlaveRecoveryTest/1.PartitionedSlave fails with cgroups
- [MESOS-668] - SlaveRecoveryTest/0.MultipleFrameworks flaky
- [MESOS-671] - CgroupsIsolator does not listen for OOM events on recovered executors.
- [MESOS-673] - Task reconciliation does not properly release executor resources.
- [MESOS-675] - CHECK failure in the Master.
- [MESOS-676] - Slave::reregistered LOG(FATAL)s due to being in RECOVERING state.
- [MESOS-689] - Master incorrectly rejects tasks for long lived executors if they don't have FrameworkID set
** Improvement
- [MESOS-179] - Need to check for Python development headers
- [MESOS-221] - New Allocators
- [MESOS-329] - Add 'help' endpoints to libprocess.
- [MESOS-552] - Jenkins scheduler should use the latest Mesos jar built from the repo
- [MESOS-553] - Jenkins plugin should bundle the native Mesos library
- [MESOS-554] - Jenkins scheduler should properly handle TASK_LOST
- [MESOS-555] - Jenkins scheduler should reuse a Jenkins slave
- [MESOS-557] - Upgrade to Bootstrap CSS v2.3.2
- [MESOS-558] - Upgrade to full release of Angular JS
- [MESOS-559] - Replace Bootstrap's JS with Angular UI Bootstrap
- [MESOS-580] - Improve Command Executor
- [MESOS-613] - Give better guidance when recovery fails
- [MESOS-626] - Add the ability for example frameworks to checkpoint
- [MESOS-634] - Make slave recovery more robust by ignoring absence of files
- [MESOS-651] - Expose slave re-registration time in the Web UI
- [MESOS-663] - Expose recovery errors when running recovery in --no-strict mode
** New Feature
- [MESOS-110] - Slave Recovery: A slave restart should not restart tasks
- [MESOS-203] - Killtree that recursively kills sessions
- [MESOS-504] - Add weighted DRF.
- [MESOS-505] - Add resource reservations/pools per role.
- [MESOS-506] - Implement Jenkins scheduler for Mesos
** Task
- [MESOS-643] - Revert the semantics of newly introduced changes to FrameworkReregistered messages
- [MESOS-647] - Revert the default recovery mode to strict