[01:04:06] <icinga-wm>	 PROBLEM - etcd request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 operation={compareAndSwap,get,list} https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[01:04:25] <icinga-wm>	 PROBLEM - etcd request latencies on argon is CRITICAL: instance=10.64.32.133:6443 operation={compareAndSwap,get,list} https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[01:04:26] <icinga-wm>	 PROBLEM - Request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 verb={GET,LIST,PUT} https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[01:04:36] <icinga-wm>	 PROBLEM - Request latencies on argon is CRITICAL: instance=10.64.32.133:6443 verb={LIST,PATCH,PUT} https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[01:04:36] <icinga-wm>	 PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb={PATCH,PUT} https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[01:05:55] <icinga-wm>	 PROBLEM - etcd request latencies on neon is CRITICAL: instance=10.64.0.40:6443 operation={compareAndSwap,get,list} https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[01:16:09] <wikibugs>	 (03CR) 10Alex Monk: "This is looking fairly good, I'll give it a go later on" (031 comment) [software/certcentral] - 10https://gerrit.wikimedia.org/r/446618 (https://phabricator.wikimedia.org/T199717) (owner: 10Vgutierrez)
[02:06:46] <icinga-wm>	 RECOVERY - etcd request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[02:06:46] <icinga-wm>	 RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[02:07:35] <icinga-wm>	 RECOVERY - etcd request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[02:07:45] <icinga-wm>	 RECOVERY - etcd request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[02:07:55] <icinga-wm>	 RECOVERY - Request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[02:07:56] <icinga-wm>	 RECOVERY - Request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[02:34:27] <icinga-wm>	 PROBLEM - Host labservices1001 is DOWN: CRITICAL - Host Unreachable (208.80.155.117)
[02:41:06] <andrewbogott>	 !log rebooting labservices1001 from mgmt
[02:41:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:44:07] <icinga-wm>	 RECOVERY - Host labservices1001 is UP: PING OK - Packet loss = 0%, RTA = 0.19 ms
[02:47:21] <wikibugs>	 10Operations, 10ops-eqiad, 10cloud-services-team, 10Patch-For-Review: Labservices1001 crashed - https://phabricator.wikimedia.org/T196252 (10Andrew) This just happened again -- thanks to better paging I caught it sooner :)  There's nothing of interest in the syslog, just a sudden stop:   ``` Aug  5 02:32:0...
[02:48:26] <icinga-wm>	 PROBLEM - toolschecker: Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 185 bytes in 0.749 second response time
[02:48:45] <wikibugs>	 10Operations, 10ops-eqiad, 10cloud-services-team, 10Patch-For-Review: Labservices1001 crashed - https://phabricator.wikimedia.org/T196252 (10Andrew) Ah, there were some temp warnings a few minutes earlier:   ``` Aug  5 02:29:02 labservices1001 kernel: [3025868.972351] CPU3: Core temperature above threshold...
[02:49:08] <wikibugs>	 10Operations, 10ops-eqiad, 10cloud-services-team, 10Patch-For-Review: Labservices1001 crashing, probable overheating - https://phabricator.wikimedia.org/T196252 (10Andrew)
[02:53:36] <icinga-wm>	 RECOVERY - toolschecker: Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.407 second response time
[03:02:42] <wikibugs>	 10Operations, 10ops-eqiad, 10cloud-services-team, 10Patch-For-Review: Labservices1001 crashing, probable overheating - https://phabricator.wikimedia.org/T196252 (10Andrew) p:05Normal>03High
[03:21:46] <icinga-wm>	 RECOVERY - Memory correctable errors -EDAC- on mw2157 is OK: (C)4 ge (W)2 ge 1 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=mw2157&var-datasource=codfw%2520prometheus%252Fops
[03:27:25] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 866.64 seconds
[03:33:25] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 155.53 seconds
[04:55:36] <icinga-wm>	 PROBLEM - HHVM rendering on mw1278 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:56:58] <wikibugs>	 (03PS3) 10Nehajha: Removing gridengine as default and encouraging the use of Kubernetes [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/443190 (https://phabricator.wikimedia.org/T154504)
[04:57:55] <icinga-wm>	 RECOVERY - HHVM rendering on mw1278 is OK: HTTP OK: HTTP/1.1 200 OK - 74762 bytes in 0.570 second response time
[06:28:16] <icinga-wm>	 PROBLEM - puppet last run on mw1319 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ferm/ferm.conf]
[06:30:06] <icinga-wm>	 PROBLEM - puppet last run on ms-be1030 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/prometheus-puppet-agent-stats]
[06:31:46] <icinga-wm>	 PROBLEM - puppet last run on mw1289 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/cgroup-mediawiki-clean]
[06:33:15] <icinga-wm>	 PROBLEM - puppet last run on analytics1061 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/R/biocLite.R]
[06:55:56] <icinga-wm>	 RECOVERY - puppet last run on ms-be1030 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[06:57:46] <icinga-wm>	 RECOVERY - puppet last run on mw1289 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:59:15] <icinga-wm>	 RECOVERY - puppet last run on analytics1061 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[06:59:26] <icinga-wm>	 RECOVERY - puppet last run on mw1319 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[07:47:39] <wikibugs>	 (03CR) 10Vgutierrez: provide ACMEv2 support based on certbot/acme library (031 comment) [software/certcentral] - 10https://gerrit.wikimedia.org/r/446618 (https://phabricator.wikimedia.org/T199717) (owner: 10Vgutierrez)
[10:24:55] <wikibugs>	 (03PS1) 10Gergő Tisza: Give hewiki interface-admins the rights interface-editors have [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450441 (https://phabricator.wikimedia.org/T200698)
[10:24:57] <wikibugs>	 (03PS1) 10Gergő Tisza: Remove hewiki interface-editor group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450442 (https://phabricator.wikimedia.org/T200698)
[10:41:05] <wikibugs>	 (03CR) 10Merlijn van Deen: [C: 031] "lgtm; you could consider also setting the logo_width" [puppet] - 10https://gerrit.wikimedia.org/r/448999 (owner: 10Alex Monk)
[10:42:12] <wikibugs>	 (03CR) 10Merlijn van Deen: [C: 031] "...and we should probably update the logo and text to mention Toolforge rather than tool labs ;-)" [puppet] - 10https://gerrit.wikimedia.org/r/448999 (owner: 10Alex Monk)
[11:41:33] <wikibugs>	 (03PS1) 10Gergő Tisza: Revert "Configure group management for interface-admin" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450450
[12:14:08] <wikibugs>	 (03CR) 10Alex Monk: provide ACMEv2 support based on certbot/acme library (031 comment) [software/certcentral] - 10https://gerrit.wikimedia.org/r/446618 (https://phabricator.wikimedia.org/T199717) (owner: 10Vgutierrez)
[13:12:55] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 21 probes of 311 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[13:18:06] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 14 probes of 311 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[13:21:34] <wikibugs>	 (03PS1) 10Nehajha: Following pep8 coding conventions [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/450458
[13:22:06] <icinga-wm>	 PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [140.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[13:29:34] <wikibugs>	 (03CR) 10Zhuyifei1999: [C: 031] "Is there any other changes you want to do in this patch? Or shall I merge?" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/450458 (owner: 10Nehajha)
[13:30:27] <wikibugs>	 (03CR) 10Nehajha: "> Is there any other changes you want to do in this patch? Or shall I" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/450458 (owner: 10Nehajha)
[13:39:05] <icinga-wm>	 RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[13:40:26] <wikibugs>	 (03PS2) 10Nehajha: Following pep8 coding conventions [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/450458
[13:42:14] <wikibugs>	 (03CR) 10Nehajha: "> Is there any other changes you want to do in this patch? Or shall I" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/450458 (owner: 10Nehajha)
[15:00:05] <wikibugs>	 (03PS1) 10Ori.livneh: Declare and manage a /var/cache/coal_web dir [puppet] - 10https://gerrit.wikimedia.org/r/450468
[16:07:16] <wikibugs>	 (03PS1) 10Urbanecm: Add correct sitename for satwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450469 (https://phabricator.wikimedia.org/T198400)
[16:26:03] <Krenair>	 Doesn't happen to be anyone around who could tell me what OSes /^(actinium|alcyone|alsafi|aluminium)\.wikimedia\.org$/ run?
[16:26:06] <Krenair>	 it's looking like trusty to me
[16:34:49] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 032] Following pep8 coding conventions [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/450458 (owner: 10Nehajha)
[17:17:11] <wikibugs>	 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2054 - https://phabricator.wikimedia.org/T201245 (10Marostegui) a:03Papaul Can we get this disk replaced? Thanks!
[17:24:02] <wikibugs>	 (03PS2) 10Gergő Tisza: Allow all bureaucrats to remove interface-admin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450450
[20:09:26] <icinga-wm>	 PROBLEM - Host scb2006 is DOWN: PING CRITICAL - Packet loss = 100%
[20:10:05] <icinga-wm>	 RECOVERY - Host scb2006 is UP: PING OK - Packet loss = 0%, RTA = 36.19 ms