[00:03:10] RECOVERY - Host deployment-memc02 is UP: PING OK - Packet loss = 0%, RTA = 0.85 ms [00:03:11] hmmm I should add this to my list of autojoins [00:03:16] so.... [00:03:19] root@deployment-salt:/var/lib/git/operations/puppet# git rebase origin/production [00:03:22] It seems that there is already a rebase-merge directory, and [00:03:36] anyone know what's up there? nobody currently logged in either, but rebase sitting there half-done [00:04:01] bblack: just blow it away and do it again I guess [00:04:22] as in just hard reset to origin/prod, or should I look around for an obvious set of patches we were keeping on top of the rebase before? [00:04:38] bblack: git rebase —abort [00:04:38] I guess I could rebase --abort, too [00:04:38] ? [00:04:46] and then just fetch / rebase again [00:07:23] RECOVERY - Puppet failure on deployment-sca01 is OK: OK: Less than 1.00% above the threshold [0.0] [00:08:11] FLAPPINGSTART - Host deployment-memc02 is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 2369.91 ms [00:10:05] 10Beta-Cluster: Beta cluster intermittent failures - https://phabricator.wikimedia.org/T97033#1232934 (10thcipriani) This problem is still ongoing, although @coren and @andrew may have found the root cause: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1346917 After updating the kernel on labvirt1001 and... [00:10:06] 10Beta-Cluster: Beta cluster intermittent failures - https://phabricator.wikimedia.org/T97033#1232934 (10thcipriani) This problem is still ongoing, although @coren and @andrew may have found the root cause: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1346917 After updating the kernel on labvirt1001 and... [00:12:01] FLAPPINGSTART - Host deployment-elastic07 is UP: PING OK - Packet loss = 0%, RTA = 0.84 ms [00:12:37] ok deployment-salt ops/puppet is clean now. going to also cherry-pick a patch into there for testing myself now.... [00:13:53] FLAPPINGSTOP - Host integration-saltmaster is UP: PING OK - Packet loss = 0%, RTA = 0.52 ms [00:17:18] !log cherry-picked https://gerrit.wikimedia.org/r/#/c/196009/13/ onto deployment-salt ops/puppet [00:18:29] (I'm guessing this is where to do that, docs say wikimedia-qa :P) [00:19:07] heh which docs? [00:19:22] https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/How_code_is_updated#Cherry-picking_a_patch_from_gerrit [00:20:43] {{fixed}} [00:21:05] :) [00:26:40] 10Continuous-Integration: Zuul-cloner should use git cache via hard links - https://phabricator.wikimedia.org/T97098#1232972 (10Krinkle) 3NEW [00:26:45] 10Continuous-Integration: Zuul-cloner should use git cache via hard links - https://phabricator.wikimedia.org/T97098#1232972 (10Krinkle) 3NEW [00:26:52] PROBLEM - Host deployment-sentry2 is DOWN: CRITICAL - Host Unreachable (10.68.17.204) [00:27:15] 10Continuous-Integration: Zuul-cloner should use git cache via hard links - https://phabricator.wikimedia.org/T97098#1232972 (10Krinkle) I've tested this on integration-trusty-slave-1021 with mediawiki/core and noticed that it's not much faster to copy from `/mnt/git` than to clone from gerrit.wikimedia.org. In... [00:27:16] 10Continuous-Integration: Zuul-cloner should use git cache via hard links - https://phabricator.wikimedia.org/T97098#1232972 (10Krinkle) I've tested this on integration-trusty-slave-1021 with mediawiki/core and noticed that it's not much faster to copy from `/mnt/git` than to clone from gerrit.wikimedia.org. In... 
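(A minimal sketch of the stuck-rebase recovery and cherry-pick discussed above, from 00:03 to 00:17, assuming the standard operations/puppet checkout on deployment-salt and the usual Gerrit change-ref layout; the fetch URL and ref are illustrative.)

```bash
# On deployment-salt, in the checkout shown above.
cd /var/lib/git/operations/puppet

# A half-finished rebase leaves .git/rebase-merge (or .git/rebase-apply)
# behind; abort it before starting over.
git rebase --abort

# Re-sync with the upstream branch and replay any local commits on top.
git fetch origin
git rebase origin/production

# Re-apply a Gerrit change on top. Gerrit exposes patchsets as
# refs/changes/<NN>/<change>/<patchset>, NN = last two digits of the change.
git fetch https://gerrit.wikimedia.org/r/operations/puppet refs/changes/09/196009/13
git cherry-pick FETCH_HEAD
```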
[00:27:32] 10Continuous-Integration, 5Patch-For-Review: Set up git replication on integration slaves - https://phabricator.wikimedia.org/T96687#1224500 (10Krinkle) [00:27:33] 10Continuous-Integration, 5Patch-For-Review: Set up git replication on integration slaves - https://phabricator.wikimedia.org/T96687#1224500 (10Krinkle) [00:27:33] 10Continuous-Integration: Zuul-cloner should use git cache via hard links - https://phabricator.wikimedia.org/T97098#1232982 (10Krinkle) [00:27:34] 10Continuous-Integration: Zuul-cloner should use git cache via hard links - https://phabricator.wikimedia.org/T97098#1232982 (10Krinkle) [00:28:41] ok one more dumb question: https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/Overview -> "The mobile traffic (*.m.*.beta.wmflabs.org) is served by the deployment-cache-mobile03 instance" [00:28:54] $ ssh deployment-cache-mobile03.beta.wmflabs.org [00:28:55] Linux deployment-cache-text02 [00:29:28] yet when I hit http://en.m.wikipedia.beta.wmflabs.org/wiki/Main_Page the headers say it did flow through a deployment-cache-mobile03 ... [00:29:54] why does ssh-ing there drop me on text02? [00:31:52] RECOVERY - Host deployment-sentry2 is UP: PING OK - Packet loss = 0%, RTA = 0.74 ms [00:36:58] FLAPPINGSTOP - Host deployment-elastic07 is UP: PING OK - Packet loss = 0%, RTA = 298.11 ms [00:37:16] PROBLEM - Host deployment-fluoride is DOWN: PING CRITICAL - Packet loss = 44%, RTA = 3302.80 ms [00:42:13] RECOVERY - Host deployment-fluoride is UP: PING OK - Packet loss = 0%, RTA = 1.01 ms [00:51:02] PROBLEM - Host deployment-parsoidcache02 is DOWN: CRITICAL - Host Unreachable (10.68.16.145) [00:53:16] PROBLEM - Host deployment-restbase02 is DOWN: CRITICAL - Host Unreachable (10.68.17.189) [00:55:51] RECOVERY - Host deployment-restbase02 is UP: PING OK - Packet loss = 0%, RTA = 1.04 ms [00:55:53] RECOVERY - Host deployment-parsoidcache02 is UP: PING OK - Packet loss = 0%, RTA = 0.87 ms [00:59:52] PROBLEM - SSH on deployment-parsoidcache02 is CRITICAL: No route to host [01:01:16] PROBLEM - Host deployment-elastic07 is DOWN: PING CRITICAL - Packet loss = 57%, RTA = 7109.24 ms [01:01:38] PROBLEM - Host integration-saltmaster is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 2231.75 ms [01:01:54] thcipriani|afk: andrewbogott ^ [01:01:56] not sure if known? 
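(A sketch of how to reconcile the DNS name, the host ssh actually lands on, and the cache that served the mobile request; the specific header names checked, X-Cache and Via, are assumptions based on what Varnish typically emits.)

```bash
# Where does the public name point?
dig +short deployment-cache-mobile03.beta.wmflabs.org

# Which host does ssh really land on? (the prompt above showed text02)
ssh deployment-cache-mobile03.beta.wmflabs.org hostname -f

# Which cache instance actually handled a mobile request? The response
# headers usually name the Varnish host that served it.
curl -sI http://en.m.wikipedia.beta.wmflabs.org/wiki/Main_Page | grep -iE '^(x-cache|via|server):'
```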
[01:02:02] RECOVERY - Host deployment-elastic07 is UP: PING OK - Packet loss = 0%, RTA = 344.77 ms [01:02:03] or if they’re on non labsvirt1000 [01:02:05] 1 [01:08:48] PROBLEM - Host deployment-cache-bits01 is DOWN: CRITICAL - Host Unreachable (10.68.16.12) [01:11:31] RECOVERY - Host deployment-cache-bits01 is UP: PING OK - Packet loss = 0%, RTA = 0.54 ms [01:11:37] RECOVERY - Host integration-saltmaster is UP: PING OK - Packet loss = 0%, RTA = 0.57 ms [01:16:18] PROBLEM - Host deployment-logstash1 is DOWN: CRITICAL - Host Unreachable (10.68.16.134) [01:22:11] FLAPPINGSTART - Host deployment-fluoride is UP: PING OK - Packet loss = 0%, RTA = 242.34 ms [01:22:53] PROBLEM - Host deployment-logstash1 is DOWN: PING CRITICAL - Packet loss = 44%, RTA = 6631.08 ms [01:24:31] PROBLEM - Content Translation Server on deployment-cxserver03 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:25:55] PROBLEM - Host deployment-test is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 2005.45 ms [01:29:24] RECOVERY - Content Translation Server on deployment-cxserver03 is OK: HTTP OK: HTTP/1.1 200 OK - 1103 bytes in 0.009 second response time [01:35:54] RECOVERY - Host deployment-test is UP: PING OK - Packet loss = 0%, RTA = 0.59 ms [01:40:55] PROBLEM - Host integration-saltmaster is DOWN: CRITICAL - Host Unreachable (10.68.18.24) [01:41:45] !log re-cherry-picked ops/puppet https://gerrit.wikimedia.org/r/#/c/196009/13 on deployment-salt (analytics last-access testing) [01:46:36] RECOVERY - Host integration-saltmaster is UP: PING OK - Packet loss = 0%, RTA = 83.98 ms [01:50:53] FLAPPINGSTART - Host deployment-parsoidcache02 is UP: PING OK - Packet loss = 0%, RTA = 0.89 ms [01:51:21] FLAPPINGSTART - Host deployment-logstash1 is UP: PING OK - Packet loss = 0%, RTA = 0.57 ms [02:01:02] PROBLEM - Host deployment-kafka02 is DOWN: PING CRITICAL - Packet loss = 50%, RTA = 3843.12 ms [02:02:48] PROBLEM - Host deployment-elastic07 is DOWN: CRITICAL - Host Unreachable (10.68.17.187) [02:04:20] PROBLEM - Host deployment-cache-mobile03 is DOWN: CRITICAL - Host Unreachable (10.68.16.13) [02:06:08] RECOVERY - Host deployment-elastic07 is UP: PING OK - Packet loss = 0%, RTA = 0.75 ms [02:07:14] FLAPPINGSTOP - Host deployment-fluoride is UP: PING WARNING - Packet loss = 14%, RTA = 704.75 ms [02:07:57] PROBLEM - Host deployment-restbase01 is DOWN: PING CRITICAL - Packet loss = 37%, RTA = 2472.22 ms [02:08:06] YuviPanda: integration-saltmaster is on labvirt1005 (possibly) either way, wasn't on labvirt1001 or 1002 [02:08:23] RECOVERY - Host deployment-cache-mobile03 is UP: PING OK - Packet loss = 0%, RTA = 0.64 ms [02:14:19] PROBLEM - Host deployment-restbase01 is DOWN: PING CRITICAL - Packet loss = 37%, RTA = 4072.31 ms [02:17:53] RECOVERY - Host deployment-restbase01 is UP: PING OK - Packet loss = 0%, RTA = 0.61 ms [02:21:28] PROBLEM - Host deployment-cache-bits01 is DOWN: CRITICAL - Host Unreachable (10.68.16.12) [02:23:32] RECOVERY - Host deployment-cache-bits01 is UP: PING OK - Packet loss = 0%, RTA = 0.76 ms [02:29:24] PROBLEM - Host deployment-cache-mobile03 is DOWN: PING CRITICAL - Packet loss = 93%, RTA = 7295.20 ms [02:33:03] PROBLEM - Host deployment-parsoid05 is DOWN: CRITICAL - Host Unreachable (10.68.16.120) [02:33:11] FLAPPINGSTOP - Host deployment-memc02 is UP: PING OK - Packet loss = 0%, RTA = 0.76 ms [02:36:54] RECOVERY - Host deployment-parsoid05 is UP: PING OK - Packet loss = 0%, RTA = 0.93 ms [02:41:48] PROBLEM - Host integration-saltmaster is DOWN: PING CRITICAL - Packet loss = 73%, RTA = 2107.13 ms 
[02:42:04] Yippee, build fixed! [02:42:05] Project browsertests-PageTriage-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #515: FIXED in 1 min 4 sec: https://integration.wikimedia.org/ci/job/browsertests-PageTriage-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/515/ [02:42:54] FLAPPINGSTART - Host deployment-restbase01 is UP: PING OK - Packet loss = 0%, RTA = 0.59 ms [02:46:37] RECOVERY - Host integration-saltmaster is UP: PING OK - Packet loss = 0%, RTA = 0.77 ms [02:47:06] PROBLEM - Host deployment-elastic07 is DOWN: CRITICAL - Host Unreachable (10.68.17.187) [02:47:44] RECOVERY - Host deployment-elastic07 is UP: PING OK - Packet loss = 0%, RTA = 276.00 ms [02:48:36] PROBLEM - Puppet failure on deployment-elastic07 is CRITICAL: CRITICAL: 25.00% of data above the critical threshold [0.0] [02:53:42] PROBLEM - Host deployment-kafka02 is DOWN: CRITICAL - Host Unreachable (10.68.17.156) [02:55:01] RECOVERY - Host deployment-kafka02 is UP: PING OK - Packet loss = 0%, RTA = 0.54 ms [02:55:51] PROBLEM - Host deployment-test is DOWN: CRITICAL - Host Unreachable (10.68.16.149) [02:56:33] PROBLEM - Host deployment-cache-bits01 is DOWN: PING CRITICAL - Packet loss = 25%, RTA = 2574.53 ms [02:56:55] 10Continuous-Integration, 7Zuul: Zuul-cloner should use hard links when fetching from cache-dir - https://phabricator.wikimedia.org/T97106#1233116 (10Krinkle) 3NEW [02:58:34] RECOVERY - Host deployment-cache-bits01 is UP: PING OK - Packet loss = 0%, RTA = 0.99 ms [02:59:06] PROBLEM - Puppet failure on deployment-cache-mobile03 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [03:00:10] PROBLEM - Host deployment-kafka02 is DOWN: PING CRITICAL - Packet loss = 100% [03:00:54] RECOVERY - Host deployment-kafka02 is UP: PING OK - Packet loss = 0%, RTA = 68.03 ms [03:03:40] PROBLEM - Host deployment-elastic07 is DOWN: CRITICAL - Host Unreachable (10.68.17.187) [03:04:30] 10Beta-Cluster, 10Deployment-Systems: beta cluster scap failure - https://phabricator.wikimedia.org/T96920#1233131 (10greg) @mmodell @thcipriani ping (this looks unrelated to the flappiness of wmf labs as it started before all of that) [03:05:52] PROBLEM - Host deployment-test is DOWN: CRITICAL - Host Unreachable (10.68.16.149) [03:06:00] PROBLEM - Host deployment-kafka02 is DOWN: PING CRITICAL - Packet loss = 54%, RTA = 8120.05 ms [03:06:11] RECOVERY - Host deployment-elastic07 is UP: PING OK - Packet loss = 0%, RTA = 0.62 ms [03:10:55] RECOVERY - Host deployment-test is UP: PING OK - Packet loss = 0%, RTA = 496.01 ms [03:15:52] FLAPPINGSTART - Host deployment-test is DOWN: CRITICAL - Host Unreachable (10.68.16.149) [03:19:25] FLAPPINGSTOP - Host deployment-cache-mobile03 is UP: PING WARNING - Packet loss = 61%, RTA = 41.10 ms [03:21:21] FLAPPINGSTOP - Host deployment-logstash1 is UP: PING OK - Packet loss = 0%, RTA = 1.06 ms [03:23:07] FLAPPINGSTOP - Host deployment-restbase01 is UP: PING WARNING - Packet loss = 73%, RTA = 0.92 ms [03:26:33] PROBLEM - Host deployment-cache-bits01 is DOWN: PING CRITICAL - Packet loss = 28%, RTA = 2341.18 ms [03:28:30] PROBLEM - Host deployment-cache-mobile03 is DOWN: CRITICAL - Host Unreachable (10.68.16.13) [03:38:23] PROBLEM - Host deployment-cache-mobile03 is DOWN: CRITICAL - Host Unreachable (10.68.16.13) [03:38:23] PROBLEM - Puppet failure on deployment-cache-bits01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [03:38:35] RECOVERY - Puppet failure on deployment-elastic07 is OK: OK: Less than 1.00% above the threshold [0.0] [03:45:52] 
FLAPPINGSTOP - Host deployment-parsoidcache02 is UP: PING OK - Packet loss = 0%, RTA = 383.99 ms [03:46:18] PROBLEM - Host deployment-logstash1 is DOWN: CRITICAL - Host Unreachable (10.68.16.134) [03:48:24] RECOVERY - Host deployment-cache-mobile03 is UP: PING OK - Packet loss = 0%, RTA = 0.71 ms [03:51:20] RECOVERY - Host deployment-logstash1 is UP: PING OK - Packet loss = 0%, RTA = 0.77 ms [03:53:32] PROBLEM - Host deployment-cache-mobile03 is DOWN: PING CRITICAL - Packet loss = 86%, RTA = 2249.61 ms [03:54:57] PROBLEM - Puppet staleness on deployment-kafka02 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [43200.0] [03:55:57] FLAPPINGSTOP - Host deployment-kafka02 is UP: PING OK - Packet loss = 0%, RTA = 0.49 ms [03:58:23] RECOVERY - Host deployment-cache-mobile03 is UP: PING OK - Packet loss = 0%, RTA = 0.57 ms [04:01:09] FLAPPINGSTART - Host deployment-elastic07 is UP: PING OK - Packet loss = 0%, RTA = 0.62 ms [04:06:01] PROBLEM - Host deployment-restbase02 is DOWN: CRITICAL - Plugin timed out after 15 seconds [04:06:22] RECOVERY - Host deployment-restbase02 is UP: PING OK - Packet loss = 16%, RTA = 0.56 ms [04:07:03] 10Continuous-Integration: Update jobs to use zuul-cloner with git cache via hard links - https://phabricator.wikimedia.org/T97098#1233144 (10Krinkle) [04:07:12] 10Continuous-Integration: Update jobs to use zuul-cloner with git cache via hard links - https://phabricator.wikimedia.org/T97098#1232972 (10Krinkle) [04:07:13] 10Continuous-Integration, 7Zuul: Zuul-cloner should use hard links when fetching from cache-dir - https://phabricator.wikimedia.org/T97106#1233147 (10Krinkle) [04:09:03] 10Continuous-Integration: Update jobs to use zuul-cloner with git cache via hard links - https://phabricator.wikimedia.org/T97098#1232972 (10Krinkle) I applied https://review.openstack.org/#/c/117626/4 to zuul on the depooled integration-slave-trusty-1021. A quick test confirms it works properly and speeds up th... 
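(For the hard-link idea in T97098/T97106, a sketch of why a local clone beats a plain copy of the cache; the cache path /mnt/git/mediawiki/core.git and the workspace directories are illustrative, and hard links only happen when source and destination sit on the same filesystem.)

```bash
# Cloning from a local path hard-links the object files by default, so it
# costs almost no time or disk, unlike copying the directory outright.
git clone /mnt/git/mediawiki/core.git core-from-cache

# Alternative: keep Gerrit as the remote but borrow objects from the cache,
# fetching only what the cache is missing.
git clone --reference /mnt/git/mediawiki/core.git \
    https://gerrit.wikimedia.org/r/mediawiki/core core-from-gerrit

# For comparison, a straight copy duplicates every object on disk:
#   cp -a /mnt/git/mediawiki/core.git core-copied
```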
[04:25:01] FLAPPINGSTOP - Host deployment-mx is UP: PING WARNING - Packet loss = 0%, RTA = 992.08 ms [04:27:01] FLAPPINGSTOP - Host deployment-elastic07 is UP: PING OK - Packet loss = 0%, RTA = 112.36 ms [04:28:41] RECOVERY - App Server Main HTTP Response on deployment-mediawiki01 is OK: HTTP OK: HTTP/1.1 200 OK - 46153 bytes in 8.993 second response time [04:34:25] PROBLEM - Host deployment-cache-mobile03 is DOWN: PING CRITICAL - Packet loss = 73%, RTA = 9741.31 ms [04:46:10] PROBLEM - Host deployment-memc02 is DOWN: CRITICAL - Host Unreachable (10.68.16.14) [04:47:27] PROBLEM - Host deployment-db2 is DOWN: PING CRITICAL - Packet loss = 61%, RTA = 10674.78 ms [04:53:35] PROBLEM - Host deployment-cache-mobile03 is DOWN: PING CRITICAL - Packet loss = 100% [04:54:15] RECOVERY - Host deployment-cache-mobile03 is UP: PING OK - Packet loss = 0%, RTA = 0.57 ms [04:55:11] PROBLEM - Host deployment-mx is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 4022.02 ms [04:57:18] RECOVERY - Host deployment-db2 is UP: PING OK - Packet loss = 0%, RTA = 0.78 ms [05:01:29] PROBLEM - Host deployment-logstash1 is DOWN: PING CRITICAL - Packet loss = 100% [05:04:14] FLAPPINGSTART - Host deployment-cache-mobile03 is UP: PING OK - Packet loss = 0%, RTA = 0.59 ms [05:04:37] RECOVERY - Host deployment-mx is UP: PING OK - Packet loss = 0%, RTA = 0.67 ms [05:06:47] PROBLEM - Puppet failure on integration-slave-jessie-1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [05:06:47] RECOVERY - Host deployment-logstash1 is UP: PING OK - Packet loss = 0%, RTA = 0.60 ms [05:18:20] PROBLEM - Host integration-slave-trusty-1021 is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 2970.65 ms [05:18:21] PROBLEM - Host deployment-memc02 is DOWN: PING CRITICAL - Packet loss = 100% [05:18:54] beta o beta! 
[05:19:05] PROBLEM - Puppet failure on deployment-cache-mobile03 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [05:23:21] RECOVERY - Host integration-slave-trusty-1021 is UP: PING OK - Packet loss = 0%, RTA = 0.74 ms [05:25:03] PROBLEM - Host deployment-mx is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 4317.80 ms [05:25:57] PROBLEM - Host deployment-parsoidcache02 is DOWN: PING CRITICAL - Packet loss = 37%, RTA = 6045.16 ms [05:29:37] RECOVERY - Host deployment-mx is UP: PING OK - Packet loss = 0%, RTA = 0.76 ms [05:30:51] RECOVERY - Host deployment-parsoidcache02 is UP: PING OK - Packet loss = 0%, RTA = 0.78 ms [05:36:28] FLAPPINGSTOP - Host deployment-cache-bits01 is UP: PING OK - Packet loss = 0%, RTA = 0.49 ms [05:48:17] PROBLEM - Host deployment-memc02 is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 4092.52 ms [05:51:17] PROBLEM - Host deployment-logstash1 is DOWN: CRITICAL - Host Unreachable (10.68.16.134) [05:54:44] PROBLEM - Host deployment-mx is DOWN: PING CRITICAL - Packet loss = 30%, RTA = 5790.14 ms [05:58:11] RECOVERY - Host deployment-memc02 is UP: PING OK - Packet loss = 0%, RTA = 0.58 ms [05:59:37] RECOVERY - Host deployment-mx is UP: PING OK - Packet loss = 0%, RTA = 24.52 ms [06:01:19] RECOVERY - Host deployment-logstash1 is UP: PING OK - Packet loss = 0%, RTA = 460.42 ms [06:02:58] PROBLEM - English Wikipedia Mobile Main page on beta-cluster is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:05:50] PROBLEM - Host deployment-parsoidcache02 is DOWN: CRITICAL - Host Unreachable (10.68.16.145) [06:09:38] FLAPPINGSTART - Host deployment-mx is UP: PING OK - Packet loss = 0%, RTA = 0.71 ms [06:11:32] PROBLEM - Host deployment-cache-bits01 is DOWN: PING CRITICAL - Packet loss = 28%, RTA = 2194.36 ms [06:13:33] RECOVERY - Host deployment-cache-bits01 is UP: PING OK - Packet loss = 0%, RTA = 0.67 ms [06:17:06] PROBLEM - Host deployment-elastic07 is DOWN: CRITICAL - Host Unreachable (10.68.17.187) [06:18:11] FLAPPINGSTART - Host deployment-memc02 is UP: PING OK - Packet loss = 0%, RTA = 0.65 ms [06:22:01] RECOVERY - Host deployment-elastic07 is UP: PING OK - Packet loss = 0%, RTA = 30.70 ms [06:22:17] PROBLEM - Host deployment-parsoidcache02 is DOWN: PING CRITICAL - Packet loss = 100% [06:25:55] RECOVERY - Host deployment-parsoidcache02 is UP: PING OK - Packet loss = 0%, RTA = 0.88 ms [06:30:54] FLAPPINGSTOP - Host deployment-test is UP: PING OK - Packet loss = 0%, RTA = 0.70 ms [06:32:10] PROBLEM - Host deployment-fluoride is DOWN: CRITICAL - Host Unreachable (10.68.16.190) [06:37:11] RECOVERY - Host deployment-fluoride is UP: PING OK - Packet loss = 0%, RTA = 0.79 ms [06:51:26] PROBLEM - Host deployment-logstash1 is DOWN: PING CRITICAL - Packet loss = 58%, RTA = 7360.91 ms [07:15:00] RECOVERY - SSH on deployment-cache-mobile03 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0) [07:19:35] PROBLEM - Puppet failure on deployment-mx is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [07:26:20] PROBLEM - Host deployment-elastic07 is DOWN: PING CRITICAL - Packet loss = 100% [07:26:50] RECOVERY - Host deployment-elastic07 is UP: PING OK - Packet loss = 0%, RTA = 0.55 ms [07:30:20] FLAPPINGSTOP - Host deployment-upload is UP: PING OK - Packet loss = 0%, RTA = 136.31 ms [07:31:01] PROBLEM - Host deployment-parsoidcache02 is DOWN: PING CRITICAL - Packet loss = 45%, RTA = 2472.03 ms [07:32:09] PROBLEM - Host deployment-fluoride is DOWN: CRITICAL - Host Unreachable (10.68.16.190) [07:32:53] RECOVERY - English 
Wikipedia Mobile Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 29317 bytes in 3.570 second response time [07:33:31] PROBLEM - Host deployment-cache-bits01 is DOWN: CRITICAL - Host Unreachable (10.68.16.12) [07:36:28] RECOVERY - Host deployment-cache-bits01 is UP: PING OK - Packet loss = 0%, RTA = 0.69 ms [07:37:10] RECOVERY - Host deployment-fluoride is UP: PING OK - Packet loss = 0%, RTA = 0.74 ms [07:37:24] PROBLEM - Host deployment-db2 is DOWN: PING CRITICAL - Packet loss = 50%, RTA = 3714.69 ms [07:38:23] PROBLEM - Puppet failure on deployment-cache-bits01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [07:38:53] PROBLEM - English Wikipedia Mobile Main page on beta-cluster is CRITICAL: No route to host [07:42:03] FLAPPINGSTART - Host deployment-elastic07 is DOWN: PING CRITICAL - Packet loss = 14%, RTA = 4375.19 ms [07:42:19] RECOVERY - Host deployment-db2 is UP: PING OK - Packet loss = 0%, RTA = 0.69 ms [07:44:36] FLAPPINGSTOP - Host deployment-mx is UP: PING OK - Packet loss = 0%, RTA = 0.60 ms [07:45:54] RECOVERY - Host deployment-parsoidcache02 is UP: PING OK - Packet loss = 0%, RTA = 0.67 ms [07:48:51] RECOVERY - English Wikipedia Mobile Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 29317 bytes in 3.558 second response time [07:49:59] PROBLEM - Host deployment-mx is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 2082.32 ms [07:52:15] FLAPPINGSTART - Host deployment-fluoride is UP: PING OK - Packet loss = 0%, RTA = 445.04 ms [07:54:37] RECOVERY - Host deployment-mx is UP: PING OK - Packet loss = 0%, RTA = 38.43 ms [07:54:57] PROBLEM - English Wikipedia Mobile Main page on beta-cluster is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:57:20] PROBLEM - Host deployment-db2 is DOWN: PING CRITICAL - Packet loss = 16%, RTA = 3230.12 ms [07:59:04] PROBLEM - Host deployment-parsoidcache02 is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 2329.54 ms [07:59:26] PROBLEM - Host deployment-test is DOWN: CRITICAL - Host Unreachable (10.68.16.149) [08:00:52] RECOVERY - Host deployment-test is UP: PING OK - Packet loss = 0%, RTA = 0.55 ms [08:04:59] RECOVERY - English Wikipedia Mobile Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 29317 bytes in 9.301 second response time [08:05:38] 6Release-Engineering, 10MediaWiki-Debug-Logging, 6Security-Team, 6operations, 5Patch-For-Review: Store unsampled API and XFF logs - https://phabricator.wikimedia.org/T88393#1233253 (10Joe) @Andrew, @fgiunchedi is no one working on this? It is an old ticket, marked as high priority and it's unassigned. [08:05:53] FLAPPINGSTART - Host deployment-parsoidcache02 is UP: PING OK - Packet loss = 0%, RTA = 0.61 ms [08:09:49] RECOVERY - SSH on deployment-parsoidcache02 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [08:26:57] FLAPPINGSTOP - Host deployment-elastic07 is UP: PING OK - Packet loss = 0%, RTA = 232.15 ms [08:28:24] FLAPPINGSTOP - Host deployment-cache-mobile03 is UP: PING WARNING - Packet loss = 0%, RTA = 946.79 ms [08:35:55] FLAPPINGSTART - Host deployment-test is UP: PING OK - Packet loss = 0%, RTA = 0.74 ms [08:43:47] 10Continuous-Integration, 7Regression, 7Upstream: ERROR: Failed to notify endpoint 'HTTP:http://127.0.0.1:8001/jenkins_endpoint' - https://phabricator.wikimedia.org/T93321#1233301 (10hashar) >>! In T93321#1145026, @hashar wrote: > Before Zuul migrated to Gearman, the jobs had to notify Zuul on start and comp... 
[08:54:36] PROBLEM - Host deployment-mx is DOWN: CRITICAL - Host Unreachable (10.68.17.78) [08:55:06] !log Enabling puppet on deployment-eventlogging02.eqiad.wmflabs {{bug|T96921}} [08:58:23] PROBLEM - Host deployment-cache-mobile03 is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 2166.59 ms [08:58:37] FLAPPINGSTOP - Host deployment-cache-bits01 is UP: PING OK - Packet loss = 0%, RTA = 0.64 ms [08:59:13] RECOVERY - Host deployment-cache-mobile03 is UP: PING OK - Packet loss = 0%, RTA = 0.63 ms [08:59:23] PROBLEM - Host deployment-restbase01 is DOWN: PING CRITICAL - Packet loss = 50%, RTA = 7360.96 ms [08:59:37] RECOVERY - Host deployment-mx is UP: PING OK - Packet loss = 0%, RTA = 260.11 ms [09:02:53] RECOVERY - Host deployment-restbase01 is UP: PING OK - Packet loss = 0%, RTA = 362.24 ms [09:05:53] 10Beta-Cluster: upgrade salt on deployment-prep to 2014.7 - https://phabricator.wikimedia.org/T92276#1233322 (10ArielGlenn) Packages out two days ago, will be testing with those on deployment prep shortly. [09:06:24] RECOVERY - Puppet staleness on deployment-eventlogging02 is OK: OK: Less than 1.00% above the threshold [3600.0] [09:07:21] 10Beta-Cluster, 10Analytics-EventLogging: puppet agent disabled on beta cluster deployment-eventlogging02.eqiad.wmflabs instance - https://phabricator.wikimedia.org/T96921#1233323 (10hashar) Seems it has not been reenabled so I have did it and ran puppet again. From the diff it changes a bunch of files under /... [09:08:59] !log beta: manually rebased operations/puppet.git [09:10:00] FLAPPINGSTART - Host deployment-kafka02 is UP: PING OK - Packet loss = 0%, RTA = 202.09 ms [09:13:59] PROBLEM - Puppet failure on deployment-eventlogging02 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [09:22:19] FLAPPINGSTOP - Host deployment-db2 is UP: PING OK - Packet loss = 0%, RTA = 178.74 ms [09:23:23] PROBLEM - Host deployment-cache-mobile03 is DOWN: CRITICAL - Host Unreachable (10.68.16.13) [09:23:59] RECOVERY - Puppet failure on deployment-eventlogging02 is OK: OK: Less than 1.00% above the threshold [0.0] [09:24:13] RECOVERY - Host deployment-cache-mobile03 is UP: PING OK - Packet loss = 0%, RTA = 343.79 ms [09:30:54] FLAPPINGSTOP - Host deployment-parsoidcache02 is UP: PING OK - Packet loss = 0%, RTA = 0.49 ms [09:32:15] 10Beta-Cluster, 10Analytics-EventLogging: puppet agent disabled on beta cluster deployment-eventlogging02.eqiad.wmflabs instance - https://phabricator.wikimedia.org/T96921#1233348 (10hashar) I have disabled puppet agent again on the eventlogging02 instance and manually fixed the paths in all `/etc/eventlogging... 
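(The enable, run, disable cycle being !logged here, as a sketch; the disable message is just an example of leaving a note for the next person.)

```bash
# Re-enable the agent and do one foreground run to see the diff it applies.
sudo puppet agent --enable
sudo puppet agent --test

# Disable it again while hand-editing config files the agent would overwrite.
sudo puppet agent --disable "hand-fixing /etc/eventlogging* paths, see T96921"
```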
[09:34:38] 10Beta-Cluster: Can't connect to Beta Cluster database deployment-db1 - https://phabricator.wikimedia.org/T96905#1233359 (10hashar) [09:34:57] !sal [09:34:57] https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [09:36:58] 10Beta-Cluster: Can't connect to Beta Cluster database deployment-db1 or deployment-db2 (MariaDB down) - https://phabricator.wikimedia.org/T96905#1233363 (10hashar) [09:40:03] PROBLEM - Host deployment-mx is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 3144.61 ms [09:40:23] PROBLEM - Host deployment-upload is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 2050.93 ms [09:45:00] RECOVERY - Host deployment-mx is UP: PING OK - Packet loss = 0%, RTA = 0.45 ms [09:45:20] RECOVERY - Host deployment-upload is UP: PING OK - Packet loss = 0%, RTA = 167.00 ms [09:48:12] FLAPPINGSTOP - Host deployment-memc02 is UP: PING WARNING - Packet loss = 0%, RTA = 969.22 ms [09:48:20] PROBLEM - Host deployment-cache-mobile03 is DOWN: CRITICAL - Host Unreachable (10.68.16.13) [09:51:26] 10Beta-Cluster: Can't connect to Beta Cluster database deployment-db1 or deployment-db2 (MariaDB down) - https://phabricator.wikimedia.org/T96905#1233371 (10hashar) The beta cluster has two databases instances: 10.68.17.94 and 10.68.16.193 which are respectively deployment-db2 and deployment-db1. The instances... [09:52:11] FLAPPINGSTOP - Host deployment-fluoride is UP: PING OK - Packet loss = 0%, RTA = 0.69 ms [09:52:39] 10Beta-Cluster: Beta cluster intermittent failures - https://phabricator.wikimedia.org/T97033#1231310 (10hashar) From T96905 it seems MySQL/MariaDB is not started on boot and deployment-db1 got rebooted on Thu Apr 23 23:53. I have restated MySQL :-) [09:52:59] !log restarted mysql on both deployment-db1 and deployment-db2. The service is apparently not started on instance boot. [09:53:31] !log mysql down is https://phabricator.wikimedia.org/T96905 [09:53:33] PROBLEM - Host deployment-cache-mobile03 is DOWN: CRITICAL - Host Unreachable (10.68.16.13) [09:56:09] 10Beta-Cluster, 7Monitoring: Beta cluster: monitor MySQL on deployment-db1 and deployment-db2 - https://phabricator.wikimedia.org/T97120#1233374 (10hashar) 3NEW [09:56:44] 10Beta-Cluster: Can't connect to Beta Cluster database deployment-db1 or deployment-db2 (MariaDB down) - https://phabricator.wikimedia.org/T96905#1233383 (10hashar) a:3hashar [09:56:55] stupid log bot [09:58:37] PROBLEM - Host deployment-cache-bits01 is DOWN: CRITICAL - Host Unreachable (10.68.16.12) [10:00:01] FLAPPINGSTOP - Host deployment-kafka02 is UP: PING OK - Packet loss = 0%, RTA = 0.87 ms [10:03:37] 10Browser-Tests, 6Collaboration-Team, 10Collaboration-Team-Sprint-A-2015-05-06, 10Flow, 5Patch-For-Review: A5. Fix failed Flow browsertests Jenkins jobs - https://phabricator.wikimedia.org/T94153#1233388 (10hashar) [10:04:13] 10Browser-Tests, 6Collaboration-Team, 10Collaboration-Team-Sprint-A-2015-05-06, 10Flow, 5Patch-For-Review: A5. Fix failed Flow browsertests Jenkins jobs - https://phabricator.wikimedia.org/T94153#1156329 (10hashar) Rephrasing the task details. The Echo and Flow jobs are listed at https://integration.wiki... 
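(A sketch of the restart plus enable-on-boot sequence for the beta databases, matching the SysV-style Ubuntu instances in use here; the update-rc.d invocation is the fix that appears later in this log.)

```bash
# Bring MariaDB back up and confirm it answers.
sudo service mysql start
sudo service mysql status

# Is it registered to start on boot at all? No S??mysql symlink in the
# runlevel directories means it is not.
ls /etc/rc2.d/ | grep -i mysql || echo "mysql not enabled on boot"

# Register the init script so the next reboot does not take the wikis down.
sudo update-rc.d mysql defaults
```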
[10:04:59] PROBLEM - Host deployment-kafka02 is DOWN: CRITICAL - Host Unreachable (10.68.17.156) [10:15:54] FLAPPINGSTOP - Host deployment-test is UP: PING WARNING - Packet loss = 0%, RTA = 714.22 ms [10:16:18] FLAPPINGSTOP - Host deployment-logstash1 is UP: PING OK - Packet loss = 0%, RTA = 0.68 ms [10:30:01] PROBLEM - Host deployment-kafka02 is DOWN: CRITICAL - Host Unreachable (10.68.17.156) [10:33:24] PROBLEM - Puppet staleness on deployment-rsync01 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [43200.0] [10:33:40] PROBLEM - Host deployment-logstash1 is DOWN: CRITICAL - Host Unreachable (10.68.16.134) [10:35:27] 6Release-Engineering, 10MediaWiki-Debug-Logging, 6Security-Team, 6operations, 5Patch-For-Review: Store unsampled API and XFF logs - https://phabricator.wikimedia.org/T88393#1233411 (10fgiunchedi) no I don't think anyone is working on this, I mostly worked on it when on clinic duty, my plate is full alrea... [10:35:56] RECOVERY - Host deployment-kafka02 is UP: PING OK - Packet loss = 0%, RTA = 0.84 ms [10:36:18] RECOVERY - Host deployment-logstash1 is UP: PING OK - Packet loss = 0%, RTA = 47.03 ms [10:36:18] RECOVERY - Puppet failure on deployment-logstash1 is OK: OK: Less than 1.00% above the threshold [0.0] [10:39:55] PROBLEM - Puppet staleness on deployment-kafka02 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [43200.0] [10:48:31] FLAPPINGSTOP - Host deployment-cache-bits01 is UP: PING OK - Packet loss = 0%, RTA = 22.72 ms [10:49:21] PROBLEM - Host deployment-restbase01 is DOWN: PING CRITICAL - Packet loss = 58%, RTA = 3278.81 ms [10:52:55] RECOVERY - Host deployment-restbase01 is UP: PING OK - Packet loss = 0%, RTA = 0.51 ms [10:53:25] PROBLEM - Host deployment-cache-mobile03 is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 2308.11 ms [10:58:24] 10Continuous-Integration, 6Release-Engineering, 6Project-Creators: Create "Continuous-Integration-Config" component - https://phabricator.wikimedia.org/T96908#1233421 (10hashar) The idea is to stop using the too generic `Continuous-Integration` and split requests into two disjunct groups: * Continuous-Integ... [11:03:54] 10Continuous-Integration, 5Patch-For-Review: Disable xdebug's html formatting of PHP errors for Apache on Jenkins slaves - https://phabricator.wikimedia.org/T97040#1233424 (10hashar) XDebug overloads var_dump() to generate HTML formatted stracktraces. From http://xdebug.org/docs/display : > **xdebug.overload_... [11:16:28] 10Continuous-Integration, 10pywikibot-core, 5Patch-For-Review: run pep8 and pep257 for pywikibot/core - https://phabricator.wikimedia.org/T87169#1233436 (10Aklapper) Any decision / movement is welcome here (I repeat myself by saying that this task has "Unbreak now" priority since January 2015 so if this is n... [11:18:15] PROBLEM - Host deployment-memc02 is DOWN: PING CRITICAL - Packet loss = 28%, RTA = 2064.00 ms [11:28:21] 6Release-Engineering, 10Continuous-Integration-Infrastructure, 6Project-Creators: Create "Continuous-Integration-Config" component - https://phabricator.wikimedia.org/T96908#1233444 (10Aklapper) 5Open>3Resolved a:3Aklapper >>! In T96908#1233421, @hashar wrote: > So potentially rename #Continuous-Integr... [11:28:30] 10Beta-Cluster, 10Analytics-EventLogging: EventLogging schemas are not served properly on beta cluster - https://phabricator.wikimedia.org/T97047#1233447 (10Tgr) Weird, I distinctly remember I did that. 
[12:02:12] FLAPPINGSTART - Host deployment-fluoride is UP: PING OK - Packet loss = 0%, RTA = 250.74 ms [12:04:24] FLAPPINGSTOP - Host deployment-cache-mobile03 is DOWN: PING CRITICAL - Packet loss = 64%, RTA = 6776.60 ms [12:08:24] RECOVERY - Host deployment-cache-mobile03 is UP: PING OK - Packet loss = 0%, RTA = 110.26 ms [12:13:22] PROBLEM - Host deployment-cache-mobile03 is DOWN: CRITICAL - Host Unreachable (10.68.16.13) [12:14:14] RECOVERY - Host deployment-cache-mobile03 is UP: PING OK - Packet loss = 0%, RTA = 151.54 ms [12:28:23] FLAPPINGSTART - Host deployment-cache-mobile03 is UP: PING OK - Packet loss = 0%, RTA = 0.78 ms [12:39:38] FLAPPINGSTOP - Host deployment-mx is UP: PING WARNING - Packet loss = 0%, RTA = 599.15 ms [12:40:36] PROBLEM - App Server Main HTTP Response on deployment-mediawiki03 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:46:34] Yippee, build fixed! [12:46:35] Project UploadWizard-api-commons.wikimedia.beta.wmflabs.org build #1828: FIXED in 34 sec: https://integration.wikimedia.org/ci/job/UploadWizard-api-commons.wikimedia.beta.wmflabs.org/1828/ [12:50:29] RECOVERY - App Server Main HTTP Response on deployment-mediawiki03 is OK: HTTP OK: HTTP/1.1 200 OK - 48267 bytes in 1.408 second response time [12:53:13] FLAPPINGSTOP - Host deployment-memc02 is UP: PING OK - Packet loss = 0%, RTA = 0.76 ms [12:54:42] PROBLEM - Host deployment-mx is DOWN: CRITICAL - Host Unreachable (10.68.17.78) [12:55:00] RECOVERY - Host deployment-mx is UP: PING OK - Packet loss = 0%, RTA = 0.46 ms [13:00:15] RECOVERY - Host deployment-mx is UP: PING OK - Packet loss = 0%, RTA = 0.46 ms [13:01:30] PROBLEM - App Server Main HTTP Response on deployment-mediawiki03 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1532 bytes in 3.291 second response time [13:07:17] Project browsertests-PageTriage-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce build #524: FAILURE in 1 min 17 sec: https://integration.wikimedia.org/ci/job/browsertests-PageTriage-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce/524/ [13:08:30] FLAPPINGSTOP - Host deployment-cache-bits01 is UP: PING OK - Packet loss = 0%, RTA = 0.58 ms [13:09:20] 10Browser-Tests, 6Collaboration-Team, 10Collaboration-Team-Sprint-A-2015-05-06, 10Flow, 5Patch-For-Review: A5. Fix failed Flow browsertests Jenkins jobs - https://phabricator.wikimedia.org/T94153#1233562 (10SBisson) [13:09:23] 10Beta-Cluster, 6Release-Engineering, 10Continuous-Integration-Infrastructure, 10Parsoid: Parsoid patches don't update Beta Cluster automatically -- only deploy repo patches seem to update that code - https://phabricator.wikimedia.org/T92871#1233563 (10SBisson) [13:19:14] FLAPPINGSTOP - Host deployment-cache-mobile03 is UP: PING WARNING - Packet loss = 0%, RTA = 1918.99 ms [13:21:56] 6Release-Engineering, 10MediaWiki-Debug-Logging, 6Security-Team, 6operations, 5Patch-For-Review: Store unsampled API and XFF logs - https://phabricator.wikimedia.org/T88393#1233579 (10Anomie) Is there anything that actually //needs// doing besides just removing the 'sample' from the 'api' entry in wmgMon... [13:28:47] 10Beta-Cluster, 6Release-Engineering, 10Continuous-Integration-Infrastructure, 10Parsoid: Parsoid patches don't update Beta Cluster automatically -- only deploy repo patches seem to update that code - https://phabricator.wikimedia.org/T92871#1233581 (10greg) Unrelated known problem. [13:33:14] I’ll be rebooting yet more deployment instances today, in attempt to resolve the flapping issue. Most outages should only last a couple of minutes. 
[13:49:36] PROBLEM - Host deployment-mx is DOWN: CRITICAL - Host Unreachable (10.68.17.78) [13:50:00] RECOVERY - Host deployment-mx is UP: PING OK - Packet loss = 0%, RTA = 42.10 ms [14:00:07] PROBLEM - App Server Main HTTP Response on deployment-mediawiki02 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:08:04] PROBLEM - Puppet failure on deployment-mediawiki01 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [14:13:24] PROBLEM - Host deployment-cache-mobile03 is DOWN: PING CRITICAL - Packet loss = 28%, RTA = 2056.55 ms [14:14:14] RECOVERY - Host deployment-cache-mobile03 is UP: PING OK - Packet loss = 0%, RTA = 0.59 ms [14:16:19] PROBLEM - Host deployment-logstash1 is DOWN: CRITICAL - Host Unreachable (10.68.16.134) [14:19:42] (03PS1) 10Hashar: Beta: Add wikis for ContentTranslation [integration/config] - 10https://gerrit.wikimedia.org/r/206389 (https://phabricator.wikimedia.org/T90683) [14:20:49] (03CR) 10Hashar: [C: 04-2] "Hold until we can confirm that each database has been created and properly configured or the update database job will fail ( https://integ" [integration/config] - 10https://gerrit.wikimedia.org/r/206389 (https://phabricator.wikimedia.org/T90683) (owner: 10Hashar) [14:23:16] PROBLEM - Host deployment-mediawiki01 is DOWN: PING CRITICAL - Packet loss = 27%, RTA = 7161.55 ms [14:23:29] 10Beta-Cluster, 7Blocked-on-RelEng, 10ContentTranslation-Deployments, 10MediaWiki-extensions-ContentTranslation, and 3 others: Setup new wikis in Beta Cluster for Content Translation - https://phabricator.wikimedia.org/T90683#1233674 (10hashar) The wiki page https://wikitech.wikimedia.org/wiki/Nova_Resour... [14:25:08] PROBLEM - Host deployment-cache-mobile03 is DOWN: CRITICAL - Host Unreachable (10.68.16.13) [14:26:02] RECOVERY - Host deployment-mediawiki01 is UP: PING OK - Packet loss = 0%, RTA = 296.79 ms [14:26:23] PROBLEM - Host deployment-upload is DOWN: CRITICAL - Host Unreachable (10.68.16.189) [14:26:59] PROBLEM - Host deployment-elastic07 is DOWN: CRITICAL - Host Unreachable (10.68.17.187) [14:27:01] PROBLEM - Host deployment-parsoidcache02 is DOWN: CRITICAL - Host Unreachable (10.68.16.145) [14:27:05] PROBLEM - Host deployment-restbase02 is DOWN: CRITICAL - Host Unreachable (10.68.17.189) [14:27:09] (03PS1) 10Soeren.oldag: Added job for WikidataQuality extension. [integration/config] - 10https://gerrit.wikimedia.org/r/206392 [14:28:03] PROBLEM - Host deployment-memc02 is DOWN: CRITICAL - Host Unreachable (10.68.16.14) [14:29:03] Project browsertests-Wikidata-WikidataTests-linux-chrome-sauce build #1: FAILURE in 2.6 sec: https://integration.wikimedia.org/ci/job/browsertests-Wikidata-WikidataTests-linux-chrome-sauce/1/ [14:29:19] (03CR) 10jenkins-bot: [V: 04-1] Added job for WikidataQuality extension. [integration/config] - 10https://gerrit.wikimedia.org/r/206392 (owner: 10Soeren.oldag) [14:29:29] 10Beta-Cluster, 10CirrusSearch: JobQueueError Redis server error: Could not insert 1 cirrusSearchLinksUpdatePrioritized job(s). 
- https://phabricator.wikimedia.org/T97130#1233688 (10hashar) 3NEW [14:30:21] RECOVERY - Host deployment-elastic07 is UP: PING OK - Packet loss = 0%, RTA = 265.74 ms [14:30:51] RECOVERY - Host deployment-restbase02 is UP: PING OK - Packet loss = 0%, RTA = 91.18 ms [14:30:55] RECOVERY - Host deployment-logstash1 is UP: PING OK - Packet loss = 0%, RTA = 200.24 ms [14:31:36] RECOVERY - Host deployment-cache-mobile03 is UP: PING OK - Packet loss = 0%, RTA = 0.73 ms [14:31:40] RECOVERY - Host deployment-upload is UP: PING OK - Packet loss = 0%, RTA = 0.72 ms [14:31:58] RECOVERY - Host deployment-parsoidcache02 is UP: PING OK - Packet loss = 0%, RTA = 0.87 ms [14:32:02] RECOVERY - Host deployment-memc02 is UP: PING OK - Packet loss = 0%, RTA = 0.87 ms [14:32:24] PROBLEM - Host deployment-db2 is DOWN: PING CRITICAL - Packet loss = 44%, RTA = 4590.46 ms [14:33:17] magic [14:33:34] more to come :( [14:37:01] andrewbogott: ah that is you rebooting them right ? [14:37:08] yeah [14:37:20] RECOVERY - Host deployment-db2 is UP: PING OK - Packet loss = 0%, RTA = 0.61 ms [14:37:27] we had a bunch of up/down spam all the night [14:37:39] I guess that was due to the automatic script :D [14:37:56] RECOVERY - Puppet failure on deployment-memc02 is OK: OK: Less than 1.00% above the threshold [0.0] [14:38:02] RECOVERY - Puppet failure on deployment-mediawiki01 is OK: OK: Less than 1.00% above the threshold [0.0] [14:39:25] andrewbogott: or it just Shinken being crazy cause the host up / down alarms above don't make much sense [14:39:41] deployment-db2 is claimed to be up again but it never went down [14:40:50] hashar: both things are happening. Shinken is freaking out because of the stuttering, and I’m also rebooting things to fix the stuttering. [14:41:27] 10Beta-Cluster: Can't connect to Beta Cluster database deployment-db1 or deployment-db2 (MariaDB down) - https://phabricator.wikimedia.org/T96905#1233736 (10thcipriani) Didn't see mysql in `/etc/rc[1-5].d/` anywhere. Added mysql to both deployment-db{1,2] using: sudo update-rc.d mysql defaults I think this... [14:42:31] ahh [14:43:27] hashar: here's the ticket for the instance stuttering: https://phabricator.wikimedia.org/T97033 [14:43:49] it was sad times yesterday getting to the bottom of that :P [14:44:35] PROBLEM - Host deployment-mx is DOWN: CRITICAL - Host Unreachable (10.68.17.78) [14:44:36] thcipriani: good morning [14:44:54] andrewbogott: so should we just reboot ? :D [14:44:56] hashar: good afternoon [14:45:53] hashar: I’m working as fast as I can :) Need to juggle tools nodes to prevent tool outages. [14:47:37] 10Beta-Cluster, 10Deployment-Systems: beta cluster scap failure - https://phabricator.wikimedia.org/T96920#1233776 (10hashar) It seems scap strip the underlying command output. Running mwscript manually I got: ``` Fatal error: Uncaught exception 'LogicException' with message 'Missing stream uri, the stream can... [14:47:56] andrewbogott: sure :) [14:48:39] 10Beta-Cluster, 10CirrusSearch: JobQueueError Redis server error: Could not insert 1 cirrusSearchLinksUpdatePrioritized job(s). - https://phabricator.wikimedia.org/T97130#1233786 (10hashar) The Redis server for jobs is on the [[ https://wikitech.wikimedia.org/wiki/Nova_Resource:I-0000063f.eqiad.wmflabs || depl... [14:49:02] 10Beta-Cluster, 10CirrusSearch: JobQueueError Redis server error: Could not insert 1 cirrusSearchLinksUpdatePrioritized job(s). 
- https://phabricator.wikimedia.org/T97130#1233789 (10hashar) [14:49:04] 10Beta-Cluster: Beta cluster intermittent failures - https://phabricator.wikimedia.org/T97033#1231310 (10hashar) [14:50:34] 10Beta-Cluster: Beta cluster database no more update beta-update-databases-eqiad - https://phabricator.wikimedia.org/T97138#1233796 (10hashar) 3NEW [14:51:04] 10Beta-Cluster: Beta cluster: exception 'LogicException' with message 'Missing stream uri, the stream can not be opened.' in /mnt/srv/mediawiki-staging/php-master/includes/debug/logger/monolog/LegacyHandler.php:113 - https://phabricator.wikimedia.org/T97138#1233803 (10hashar) [14:51:31] 10Beta-Cluster: Beta cluster: exception 'LogicException' with message 'Missing stream uri, the stream can not be opened.' in /mnt/srv/mediawiki-staging/php-master/includes/debug/logger/monolog/LegacyHandler.php:113 - https://phabricator.wikimedia.org/T97138#1233796 (10hashar) [14:51:33] 10Beta-Cluster, 10Deployment-Systems: beta cluster scap failure - https://phabricator.wikimedia.org/T96920#1233805 (10hashar) [14:52:07] thcipriani: are you aware of bd808 monolog change introduced yesterday ? [14:52:18] that causes mwscript to stacktrace on any beta cluster database BUT testwiki [14:52:20] https://phabricator.wikimedia.org/T97138 [14:52:28] seems to be some mediawiki-config related change [14:53:17] hashar: I'm only aware in the most general way [14:54:10] it's weird that testwiki works fine, you're right: it must be a config thing [14:54:28] 10Beta-Cluster: Beta cluster: exception 'LogicException' with message 'Missing stream uri, the stream can not be opened.' in /mnt/srv/mediawiki-staging/php-master/includes/debug/logger/monolog/LegacyHandler.php:113 - https://phabricator.wikimedia.org/T97138#1233811 (10hashar) Might be caused by https://gerrit.w... [14:54:59] RECOVERY - Host deployment-mx is UP: PING OK - Packet loss = 0%, RTA = 0.49 ms [14:55:45] we need those jenkins jobs to spam this channel :) [14:59:56] * thcipriani sighs [15:00:13] what did I break hashar ? [15:00:19] good morning! [15:00:23] o/ [15:00:26] https://phabricator.wikimedia.org/T97138 [15:00:34] some monolog stacktrace being spurts out [15:00:42] for anyway beta cluster wiki BUT the testwiki one [15:00:50] hmmm [15:00:54] looking [15:01:04] PROBLEM - Host deployment-kafka02 is DOWN: PING CRITICAL - Packet loss = 46%, RTA = 6347.12 ms [15:01:11] 'Missing stream uri, the stream can not be opened.' sounds like a message for an expert :] [15:01:25] seems some setting/log dest is missing [15:01:28] empty string being passed to something... [15:01:31] but would be set on testwiki for ome reason [15:01:36] https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/ [15:01:43] that shows all the wgDBname [15:01:51] and testwiki is the sole being green [15:02:01] oh [15:02:13] <^d> bd808: I pinged you in -ops earlier re: search errors. Already tracked in Phab [15:02:16] it is only green since ~ 8 or 9 hours ago [15:02:18] <^d> The one you filed was actually a dupe [15:02:40] ^d: cool. 
something is still running that query constantly [15:02:41] nm, previously testwiki failed because mysql was not available [15:02:49] fatalmonitor is full of it [15:03:02] <^d> bd808: It's basically harmless because of poolcounter [15:03:08] <^d> If annoying [15:03:18] https://logstash.wikimedia.org/#/dashboard/elasticsearch/fatalmonitor [15:03:35] hundreds per minute [15:04:21] <^d> It's https://phabricator.wikimedia.org/T95021 [15:04:32] <^d> Failing on highlighting [15:04:41] nond [15:04:44] *nod* [15:04:55] PROBLEM - SSH on deployment-redis01 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:04:59] RECOVERY - Host deployment-kafka02 is UP: PING OK - Packet loss = 0%, RTA = 0.83 ms [15:05:00] the more concerning thing for me was the volume of identical requests [15:05:18] <^d> Yeah, obviously somebody's trying to hammer it. [15:05:23] while we were having other problems last night it was annoying me [15:05:33] <^d> But PoolCounter should limit their ability to fallout. [15:05:47] for a while I thought it was the hhvm killer. turned out to be parsoid insteead [15:06:00] <^d> It'd be hard for search to take out hhvm [15:06:05] <^d> yay poolcounter ;-) [15:07:56] 10Beta-Cluster: Beta cluster: exception 'LogicException' with message 'Missing stream uri, the stream can not be opened.' in /mnt/srv/mediawiki-staging/php-master/includes/debug/logger/monolog/LegacyHandler.php:113 - https://phabricator.wikimedia.org/T97138#1233827 (10bd808) My guess is that this is related to $... [15:08:33] <^d> bd808: I totally want to see those fixed too, fwiw [15:08:40] <^d> All 3 bugs [15:08:44] I'm sure :) [15:09:40] PROBLEM - Host deployment-elastic06 is DOWN: CRITICAL - Host Unreachable (10.68.17.186) [15:09:46] RECOVERY - SSH on deployment-redis01 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [15:10:07] hashar: doh. I totally see the problem [15:10:21] PROBLEM - Host deployment-restbase01 is DOWN: PING CRITICAL - Packet loss = 33%, RTA = 2554.48 ms [15:10:25] https://gerrit.wikimedia.org/r/#/c/191259/10/wmf-config/CommonSettings-labs.php,unified [15:10:31] lines 22-26 [15:10:37] that happens too late [15:10:45] I think [15:11:09] PROBLEM - Host deployment-mediawiki03 is DOWN: CRITICAL - Host Unreachable (10.68.17.55) [15:11:17] 10Beta-Cluster, 10Deployment-Systems: scap eats underlying commands output (such as maintenance script stacktrace) - https://phabricator.wikimedia.org/T97140#1233830 (10hashar) 3NEW [15:11:44] 10Beta-Cluster, 10Deployment-Systems: beta cluster scap failure - https://phabricator.wikimedia.org/T96920#1229359 (10hashar) scap not showing the maintenance script trace is T97140 [15:12:25] bd808: yeah I had multiple issues with initialization ordering :/ [15:12:39] you can probably confirm on beta cluster by manually moving the code [15:12:51] and confirm the fix running eval.php --wiki=enwiki [15:12:55] PROBLEM - Host integration-saltmaster is DOWN: CRITICAL - Host Unreachable (10.68.18.24) [15:13:03] not sure why testwiki is unaffected though [15:13:20] I can see how it's wrong now. trying to figure out a reasonable fix [15:13:33] PROBLEM - Host deployment-cache-bits01 is DOWN: CRITICAL - Host Unreachable (10.68.16.12) [15:13:56] would it be horrible if both cli and web mwDebug() output ended up in the same file? 
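(A sketch of the per-wiki check suggested above at 15:12, assuming the usual mwscript wrapper on a beta host with the staging tree, e.g. deployment-bastion; extend the list with the other dbnames shown by the beta-update-databases-eqiad job.)

```bash
# Any wiki still affected throws the LogicException from LegacyHandler as
# soon as the maintenance entry point initialises logging, before eval.php
# even reads input.
for wiki in enwiki testwiki; do
    echo "== $wiki =="
    mwscript eval.php --wiki="$wiki" </dev/null 2>&1 \
        | grep -i 'LogicException' || echo "ok"
done
```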
[15:14:07] PROBLEM - Host deployment-parsoid05 is DOWN: CRITICAL - Host Unreachable (10.68.16.120) [15:14:47] I thought it made sense early on [15:14:59] PROBLEM - Host deployment-kafka02 is DOWN: CRITICAL - Host Unreachable (10.68.17.156) [15:15:14] now that we have logstash, it is probably easier to find errors we are looking for [15:15:19] so yeah might make sense to merge them [15:15:35] my typical use case was to tail -f web.log while browsing the site [15:15:48] that dismiss errors from jobs (which are run in cli) [15:15:53] PROBLEM - Host deployment-test is DOWN: CRITICAL - Host Unreachable (10.68.16.149) [15:15:57] 10Beta-Cluster: Beta cluster: exception 'LogicException' with message 'Missing stream uri, the stream can not be opened.' in /mnt/srv/mediawiki-staging/php-master/includes/debug/logger/monolog/LegacyHandler.php:113 - https://phabricator.wikimedia.org/T97138#1233853 (10bd808) Caused by {rOMWC2680380cba022787f19c7... [15:16:39] one day we will have a suite of integration tests for operations/mediawiki-config.git :D [15:17:18] heh. one day we will rewrite all that mess into something that actually makes logical sense [15:17:39] RECOVERY - Host deployment-elastic06 is UP: PING OK - Packet loss = 0%, RTA = 0.84 ms [15:17:47] bd808: there's always a place for you in RelEng :P [15:18:11] RECOVERY - Host deployment-mediawiki03 is UP: PING OK - Packet loss = 0%, RTA = 0.59 ms [15:18:17] RECOVERY - Host deployment-test is UP: PING OK - Packet loss = 0%, RTA = 0.70 ms [15:18:21] <3 greg-g <3 [15:18:30] RECOVERY - Host deployment-parsoid05 is UP: PING OK - Packet loss = 0%, RTA = 0.77 ms [15:18:32] RECOVERY - Host deployment-cache-bits01 is UP: PING OK - Packet loss = 0%, RTA = 0.67 ms [15:19:18] RECOVERY - Host deployment-restbase01 is UP: PING OK - Packet loss = 0%, RTA = 0.75 ms [15:19:26] parsoid [15:19:29] RECOVERY - Host deployment-kafka02 is UP: PING OK - Packet loss = 0%, RTA = 0.80 ms [15:19:35] RECOVERY - Host integration-saltmaster is UP: PING OK - Packet loss = 0%, RTA = 0.59 ms [15:19:49] RECOVERY - Parsoid on deployment-parsoid05 is OK: HTTP OK: HTTP/1.1 200 OK - 1086 bytes in 0.062 second response time [15:20:45] bd808: based on hiera :D *evil* [15:21:01] RECOVERY - SSH on deployment-restbase01 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [15:21:11] bd808: I gotta go, but the Jenkins job should self recover whenever the traces disappear [15:21:18] *nod* [15:21:21] also andrew is rebooting a bunch of Precise instances [15:21:26] so that causes some more side effects [15:21:52] kudos on getting monolog on prod anyway! 
[15:22:34] moving out have a good weekend [15:23:27] PROBLEM - Puppet failure on deployment-cache-bits01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [15:24:53] PROBLEM - Puppet staleness on deployment-kafka02 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [43200.0] [15:25:29] PROBLEM - Puppet failure on deployment-restbase01 is CRITICAL: CRITICAL: 16.67% of data above the critical threshold [0.0] [15:29:38] PROBLEM - Host deployment-mx is DOWN: CRITICAL - Host Unreachable (10.68.17.78) [15:30:00] RECOVERY - Host deployment-mx is UP: PING OK - Packet loss = 0%, RTA = 0.62 ms [15:30:27] thcipriani, ^d: https://gerrit.wikimedia.org/r/#/c/206399/ for your review [15:30:30] RECOVERY - Puppet failure on deployment-restbase01 is OK: OK: Less than 1.00% above the threshold [0.0] [15:31:19] <^d> commit summary longer than actual patch, the way I like 'em :p [15:44:28] PROBLEM - SSH on deployment-cxserver03 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:44:40] PROBLEM - Host deployment-mediawiki01 is DOWN: CRITICAL - Host Unreachable (10.68.17.170) [15:45:00] PROBLEM - Host deployment-mx is DOWN: CRITICAL - Host Unreachable (10.68.17.78) [15:45:34] PROBLEM - Host deployment-cxserver03 is DOWN: CRITICAL - Host Unreachable (10.68.16.150) [15:46:14] PROBLEM - Host deployment-redis01 is DOWN: CRITICAL - Host Unreachable (10.68.16.177) [15:46:18] PROBLEM - Host deployment-sentry2 is DOWN: CRITICAL - Host Unreachable (10.68.17.204) [15:47:18] PROBLEM - Host deployment-db2 is DOWN: CRITICAL - Host Unreachable (10.68.17.94) [15:47:26] PROBLEM - Host deployment-zookeeper01 is DOWN: CRITICAL - Host Unreachable (10.68.17.157) [15:48:33] 6Release-Engineering, 10MediaWiki-Debug-Logging, 6Security-Team, 6operations, 5Patch-For-Review: Store unsampled API and XFF logs - https://phabricator.wikimedia.org/T88393#1233906 (10bd808) My patch in {rOMWC2680380cba022787f19c783a4535d8794ffda8d8} restores unsampled xff logs to fluorine. I left api sa... [15:49:21] FLAPPINGSTOP - Host deployment-fluoride is DOWN: CRITICAL - Host Unreachable (10.68.16.190) [15:51:38] RECOVERY - Host deployment-zookeeper01 is UP: PING OK - Packet loss = 0%, RTA = 0.76 ms [15:51:50] RECOVERY - Host deployment-cxserver03 is UP: PING OK - Packet loss = 0%, RTA = 0.78 ms [15:52:12] RECOVERY - Host deployment-mx is UP: PING OK - Packet loss = 0%, RTA = 0.90 ms [15:52:16] RECOVERY - Host deployment-db2 is UP: PING OK - Packet loss = 0%, RTA = 0.78 ms [15:52:28] RECOVERY - Host deployment-redis01 is UP: PING OK - Packet loss = 0%, RTA = 0.80 ms [15:52:32] <^d> bd808: Did you want that out today? [15:52:50] RECOVERY - Host deployment-mediawiki01 is UP: PING OK - Packet loss = 0%, RTA = 0.87 ms [15:52:55] ^d: It's apparently breaking beta cluster so ... yes! 
[15:53:31] <^d> gogogo [15:53:53] RECOVERY - Host deployment-sentry2 is UP: PING OK - Packet loss = 0%, RTA = 0.74 ms [15:54:19] RECOVERY - SSH on deployment-cxserver03 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [16:13:13] 10Deployment-Systems: scap eats underlying commands output (such as maintenance script stacktrace) - https://phabricator.wikimedia.org/T97140#1233953 (10greg) [16:13:35] PROBLEM - Host deployment-redis02 is DOWN: CRITICAL - Host Unreachable (10.68.16.231) [16:14:01] PROBLEM - Host integration-slave-trusty-1021 is DOWN: CRITICAL - Host Unreachable (10.68.16.17) [16:14:15] PROBLEM - Host deployment-zotero01 is DOWN: CRITICAL - Host Unreachable (10.68.17.102) [16:14:21] PROBLEM - Host deployment-mediawiki02 is DOWN: CRITICAL - Host Unreachable (10.68.16.127) [16:15:43] PROBLEM - Host deployment-elastic05 is DOWN: CRITICAL - Host Unreachable (10.68.17.182) [16:16:17] PROBLEM - Host deployment-parsoid01-test is DOWN: CRITICAL - Host Unreachable (10.68.17.215) [16:16:17] PROBLEM - Host deployment-apertium01 is DOWN: CRITICAL - Host Unreachable (10.68.16.79) [16:16:23] PROBLEM - Host deployment-jobrunner01 is DOWN: CRITICAL - Host Unreachable (10.68.17.96) [16:16:53] PROBLEM - Host deployment-stream is DOWN: CRITICAL - Host Unreachable (10.68.17.106) [16:17:06] :( [16:19:17] shinken spam is a good thing here. once all these this is over and done with, we'll be on nice new kernel version for the virthosts and all will be well. [16:19:28] * thcipriani repeats to myself [16:19:42] "I think I can I think I can" [16:19:52] RECOVERY - Host deployment-apertium01 is UP: PING OK - Packet loss = 0%, RTA = 0.70 ms [16:20:16] RECOVERY - Host deployment-mediawiki02 is UP: PING OK - Packet loss = 0%, RTA = 0.48 ms [16:20:18] RECOVERY - Host deployment-jobrunner01 is UP: PING OK - Packet loss = 0%, RTA = 0.69 ms [16:20:25] http://wittyandpretty.com/wp-content/uploads/2014/04/little-engine-that-literally-cant-even.jpg [16:20:26] RECOVERY - Host deployment-zotero01 is UP: PING OK - Packet loss = 0%, RTA = 0.71 ms [16:20:32] RECOVERY - Host integration-slave-trusty-1021 is UP: PING OK - Packet loss = 0%, RTA = 0.79 ms [16:20:36] RECOVERY - Host deployment-stream is UP: PING OK - Packet loss = 0%, RTA = 0.70 ms [16:20:38] RECOVERY - Host deployment-redis02 is UP: PING OK - Packet loss = 0%, RTA = 0.75 ms [16:20:46] RECOVERY - Host deployment-elastic05 is UP: PING OK - Packet loss = 0%, RTA = 0.64 ms [16:20:50] RECOVERY - Host deployment-parsoid01-test is UP: PING OK - Packet loss = 0%, RTA = 0.85 ms [16:21:02] that was labvirt1004 I guess: 16:18 < icinga-wm> RECOVERY - Host labvirt1004 is UPING OK - Packet loss = 0%, RTA = 2.13 ms [16:29:02] PROBLEM - Puppet failure on deployment-elastic05 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [16:29:06] PROBLEM - Puppet failure on deployment-zotero01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [16:32:29] thcipriani: that’s all of them. [16:32:42] So… now your job is to form an opinion about whether or not things still suck :) [16:33:14] andrewbogott: awesome. I'll start digging through hosts. The ones rebooted yesterday seemed to stay solid overnight. [16:33:38] lack of shinken spam will also be a positive indicator [16:33:52] FLAPPINGSTART [16:34:03] RECOVERY - Puppet failure on deployment-elastic05 is OK: OK: Less than 1.00% above the threshold [0.0] [16:43:23] Is the backlog at #Wikimedia-log-errors decreasing? 
:) [16:45:24] !log rm stale /var/lib/puppet/state/agent_catalog_run.lock on deployment-kafka02 [16:47:07] oh, huh, no logsbot [16:51:53] hmm, deployment-lucid-salt is still un-ssh-able [16:54:43] I’ll look [16:54:47] wait, /lucid/? [16:54:48] dang [16:55:05] I just kicked it [16:55:40] andrewbogott: would the log bot for this channel have been affected by the reboots? [16:55:52] thcipriani: Maybe, but I don’t know what logbot you use [16:55:54] what was it called? [16:57:48] ah, qa-morebots, sorry, had to dig through logs... [16:57:52] may have been down a while [16:58:07] hm, I just restarted it, I’ll try again [16:58:58] qa-morebots: what’s up? [16:58:58] I am a logbot running on tools-exec-02. [16:58:58] Messages are logged to https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL. [16:58:58] To log a message, type !log . [16:59:55] RECOVERY - Puppet staleness on deployment-kafka02 is OK: OK: Less than 1.00% above the threshold [3600.0] [17:00:12] !log rm stale /var/lib/puppet/state/agent_catalog_run.lock on deployment-kafka02 [17:00:14] Logged the message, Master [17:00:19] neat. Thanks [17:03:29] huh, dkim keys missing from deployment-salt [17:05:54] thcipriani: you first noticed the problem at, what, 7pm PST on Wednesday? [17:06:01] * andrewbogott is documenting [17:06:08] Is the backlog at #Wikimedia-log-errors decreasing? :) [17:06:10] I doubt it [17:06:33] andrewbogott: yup, that's when the shinken started screaming about it [17:06:38] ok, thanks [17:06:43] I didn't notice anything until the following morning, FWIW [17:07:46] PROBLEM - Free space - all mounts on deployment-bastion is CRITICAL: CRITICAL: deployment-prep.deployment-bastion.diskspace._var.byte_percentfree (<30.00%) [17:08:02] Krenair: thanks. Is someone checking whether the old errors keep occurring? [17:08:14] Not actively [17:10:22] !log gzip /var/log/account/pacct.0 on deployment-bastion: ought to revisit logrotate on that instance. [17:10:24] Logged the message, Master [17:20:38] !log rm stale lock on deployment-rsync01, puppet fine [17:20:40] Logged the message, Master [17:22:41] RECOVERY - Free space - all mounts on deployment-bastion is OK: OK: All targets OK [17:24:55] Yippee, build fixed! [17:24:55] Project browsertests-CentralNotice-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #296: FIXED in 48 sec: https://integration.wikimedia.org/ci/job/browsertests-CentralNotice-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/296/ [17:26:48] !log remove deployment-prep from domain in /etc/puppet/puppet.conf on deployment-stream, puppet now OK [17:26:50] Logged the message, Master [17:28:23] RECOVERY - Puppet staleness on deployment-rsync01 is OK: OK: Less than 1.00% above the threshold [3600.0] [17:32:27] 10Beta-Cluster: Can't connect to Beta Cluster database deployment-db1 or deployment-db2 (MariaDB down) - https://phabricator.wikimedia.org/T96905#1234147 (10hashar) to be verified, it is well possible that in production we intentionally prevent mysql from starting manually. Either via the deb package or puppet.... [17:35:32] andre__: I've changed #continuous-integration-infra additional hashtags to lowercase, otherwise the url redirect doesn't work (it normalises to lowercase) [17:35:40] e.g. https://phabricator.wikimedia.org/tag/continuous-integration/ didn't work, it does now [17:35:57] https://phabricator.wikimedia.org/project/profile/401/ [17:52:00] Yippee, build fixed! 
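For context on the stale-lock cleanups logged above (deployment-kafka02 and deployment-rsync01): on Puppet 3 hosts the agent_catalog_run.lock file normally holds the PID of an in-progress agent run, so a rough way to confirm a lock really is stale before removing it looks something like the sketch below (a sketch only; the vardir path is the stock Puppet 3 default seen in the log and may differ per instance).

    # Sketch: confirm the agent lock is stale before removing it.
    LOCK=/var/lib/puppet/state/agent_catalog_run.lock
    if [ -f "$LOCK" ]; then
        pid=$(cat "$LOCK")                       # the lock normally records the PID of the running agent
        if [ -n "$pid" ] && ps -p "$pid" > /dev/null 2>&1; then
            echo "agent run $pid still in progress; leaving the lock alone"
        else
            echo "no live process for '$pid'; lock is stale"
            sudo rm -f "$LOCK"
            sudo puppet agent --test             # re-run to confirm the node converges again
        fi
    fi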
[17:52:01] Project browsertests-CentralNotice-en.wikipedia.beta.wmflabs.org-windows_7-internet_explorer-10-sauce build #228: FIXED in 57 sec: https://integration.wikimedia.org/ci/job/browsertests-CentralNotice-en.wikipedia.beta.wmflabs.org-windows_7-internet_explorer-10-sauce/228/ [17:57:43] 10Beta-Cluster, 5Patch-For-Review: Beta cluster: exception 'LogicException' with message 'Missing stream uri, the stream can not be opened.' in /mnt/srv/mediawiki-staging/php-master/includes/debug/logger/monolog/LegacyHandler.php:113 - https://phabricator.wikimedia.org/T97138#1234249 (10bd808) Aaarrrrgh! More... [17:57:49] 10Beta-Cluster, 5Patch-For-Review: Beta cluster: exception 'LogicException' with message 'Missing stream uri, the stream can not be opened.' in /mnt/srv/mediawiki-staging/php-master/includes/debug/logger/monolog/LegacyHandler.php:113 - https://phabricator.wikimedia.org/T97138#1234250 (10bd808) a:3bd808 [17:58:04] 10Beta-Cluster, 5Patch-For-Review: Beta cluster: exception 'LogicException' with message 'Missing stream uri, the stream can not be opened.' in /mnt/srv/mediawiki-staging/php-master/includes/debug/logger/monolog/LegacyHandler.php:113 - https://phabricator.wikimedia.org/T97138#1233796 (10bd808) p:5Triage>3Un... [17:59:04] Yippee, build fixed! [17:59:04] Project browsertests-Math-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #519: FIXED in 1 min 3 sec: https://integration.wikimedia.org/ci/job/browsertests-Math-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/519/ [18:01:28] ^d: One more tweak for beta logging -- https://gerrit.wikimedia.org/r/#/c/206414/ [18:01:31] !log ran sudo chown -R mwdeploy:mwdeploy /srv/mediawiki on deployment-bastion to fix beta-scap-eqiad, hopefully [18:01:34] Logged the message, Master [18:01:36] 10Deployment-Systems, 6Services: Evaluate Ansible as a deployment tool - https://phabricator.wikimedia.org/T93433#1234264 (10RyanLane) There's a team that's working on deployment right? Are you a member of that team @GWicke? Proposing an alternative outside of that team means you're actively fighting them, mak... [18:01:56] thcipriani: yuck. any idea how that got messed up? [18:02:22] bd808: not at all, actually, it looked like everything was owned by mwdeploy under that directory [18:02:36] but once I ran the chown, sync-common worked [18:02:55] weird [18:02:56] whereas before I got rsync: mkstemp "/srv/mediawiki/.wikiversions-labs.cdb.o2VOpX" failed: Permission denied (13) [18:03:28] oh. top level dir permissions? That could be something in puppet [18:06:52] 10Deployment-Systems, 6Services: Evaluate Ansible as a deployment tool - https://phabricator.wikimedia.org/T93433#1234285 (10RyanLane) As for Ansible itself, see my quite extensive blog post on this: http://ryandlane.com/blog/2014/08/04/moving-away-from-puppet-saltstack-or-ansible/ Ansible using SSH always l... [18:06:55] maybe, well, yay! that fixed that problem in jenkins, now we get a new error: https://phabricator.wikimedia.org/P556 [18:08:58] thcipriani: I have a patch for that. https://gerrit.wikimedia.org/r/#/c/206414/ [18:09:09] heh, I was just looking at that :) [18:10:27] !log cvn Promited Rxy from member to projectadmin [18:10:29] Logged the message, Master [18:14:54] bd808: oic what's happening, okie doke, merging [18:16:50] I didn't notice the function wrapper when I rearranged in the prior patch. [18:16:59] goofy config system is goofy [18:19:05] dat config system tho. 
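A rough reconstruction of the chown fix logged above, for the record (not the canonical procedure): rsync's mkstemp "Permission denied" means it could not create a temp file directly under /srv/mediawiki, so the top-level directory itself, not just the files below it, has to be writable by mwdeploy. The sync-common re-run as mwdeploy is an assumption about how the fix was verified.

    ls -ld /srv/mediawiki                          # owner/group of the directory itself, not its contents
    sudo chown -R mwdeploy:mwdeploy /srv/mediawiki
    sudo -u mwdeploy sync-common                   # assumption: re-running the sync this way mirrors the Jenkins job's check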
[18:19:56] 6Release-Engineering, 10Continuous-Integration-Config: Rewrite beta-update-databases to not use unstable Configuration Matrix - https://phabricator.wikimedia.org/T96199#1234360 (10Krinkle) [18:20:09] 10Beta-Cluster, 6Release-Engineering, 10Continuous-Integration-Config: Rewrite beta-update-databases to not use unstable Configuration Matrix - https://phabricator.wikimedia.org/T96199#1210871 (10Krinkle) [18:21:26] bd808: blerg. Same error :( [18:21:52] did the config update job run yet? [18:21:57] 10Beta-Cluster, 6Release-Engineering, 10Continuous-Integration-Config: Send beta cluster Jenkins alerts to betacluster-alert list - https://phabricator.wikimedia.org/T1125#1234391 (10Krinkle) [18:22:02] 10Beta-Cluster: Beta cluster intermittent failures - https://phabricator.wikimedia.org/T97033#1234392 (10coren) This //should// be all fixed now; I'm not seeing the intermittent VM stalls anymore and all kernels have been upgraded to the fixed kernel. [18:22:24] 10Browser-Tests, 6Collaboration-Team, 10Continuous-Integration-Config, 10Flow, 7Easy: send Flow browser test job notices to #wikimedia-corefeatures channel - https://phabricator.wikimedia.org/T66103#1234393 (10Krinkle) [18:22:25] bd808: yeah, can see the update on /srv/mediawiki-staging [18:22:33] grrr [18:22:49] I see it too... so back to hunting [18:28:15] So $wmfUdp2logDest is set on line 121 of CommonSettings.php; InitializeSettings is loaded on line 169 [18:28:23] * bd808 scratches head [18:30:36] oh ffs [18:30:44] there is yet another sub function [18:33:05] (03PS1) 10Krinkle: Copy LocalSettings.php to "/log" in teardown instead of setup [integration/jenkins] - 10https://gerrit.wikimedia.org/r/206422 (https://phabricator.wikimedia.org/T90613) [18:37:21] Krinkle, uh, wasn't aware. Thank you! [18:41:57] (03CR) 10Krinkle: [C: 032] Copy LocalSettings.php to "/log" in teardown instead of setup [integration/jenkins] - 10https://gerrit.wikimedia.org/r/206422 (https://phabricator.wikimedia.org/T90613) (owner: 10Krinkle) [18:43:37] (03Merged) 10jenkins-bot: Copy LocalSettings.php to "/log" in teardown instead of setup [integration/jenkins] - 10https://gerrit.wikimedia.org/r/206422 (https://phabricator.wikimedia.org/T90613) (owner: 10Krinkle) [18:44:29] thcipriani: how's the beta scap job now? [18:44:43] gonna be fixed here after this merge, I reckon [18:45:01] thcipriani: yay [18:47:19] looks promising https://integration.wikimedia.org/ci/job/beta-scap-eqiad/50358/console [18:47:29] 6Release-Engineering, 10Continuous-Integration-Infrastructure, 5Patch-For-Review, 7Tracking: doc.wikimedia.org: Generate documentation for release tags (tracking) - https://phabricator.wikimedia.org/T73062#771415 (10Krinkle) [18:51:19] woot [18:51:43] that l10nupdate will take a while as long as this has been stuck [18:51:48] probably 15-20 minutes [18:52:01] slow staging server is slow [18:52:46] oh shit, hopefully jenkins doesn't kill it due to the timeout limit.... [18:53:41] Yippee, build fixed! [18:53:42] Project browsertests-CentralNotice-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce build #297: FIXED in 41 sec: https://integration.wikimedia.org/ci/job/browsertests-CentralNotice-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce/297/ [18:57:17] (03PS1) 10Krinkle: Make sure archive-log-dir is consistently after mw-teardown [integration/config] - 10https://gerrit.wikimedia.org/r/206427 (https://phabricator.wikimedia.org/T90613) [18:57:32] forgot to reprime the key on dep-bastion! 
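On the key re-priming mentioned above: keyholder is the shared ssh-agent that scap deploys through, and assuming the deployment bastion uses the same keyholder wrapper as the production deploy hosts (an assumption; the subcommands and socket path below are not verified against the 2015 beta setup), re-arming it after a reboot looks roughly like this.

    sudo keyholder arm        # prompts for the passphrase and loads the deploy key into the shared agent
    sudo keyholder status     # assumption: reports whether the agent is currently armed
    SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh -l mwdeploy deployment-mediawiki01 true   # smoke test; socket path is an assumption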
[18:57:53] :( [19:01:44] (03CR) 10Krinkle: [C: 032] Make sure archive-log-dir is consistently after mw-teardown [integration/config] - 10https://gerrit.wikimedia.org/r/206427 (https://phabricator.wikimedia.org/T90613) (owner: 10Krinkle) [19:03:36] (03Merged) 10jenkins-bot: Make sure archive-log-dir is consistently after mw-teardown [integration/config] - 10https://gerrit.wikimedia.org/r/206427 (https://phabricator.wikimedia.org/T90613) (owner: 10Krinkle) [19:12:18] Yippee, build fixed! [19:12:18] Project beta-scap-eqiad build #50359: FIXED in 14 min: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/50359/ [19:12:27] \o/ [19:12:46] yay! [19:13:21] alright, anything else burning? [19:13:28] there ought to be an upstart job for the keyholder thing. Or maybe there is and it's gone screwy. [19:13:54] greg-g: I don't think there's anything wrong with beta right now that hasn't been. [19:14:10] 10Beta-Cluster, 5Patch-For-Review: Beta cluster: exception 'LogicException' with message 'Missing stream uri, the stream can not be opened.' in /mnt/srv/mediawiki-staging/php-master/includes/debug/logger/monolog/LegacyHandler.php:113 - https://phabricator.wikimedia.org/T97138#1234615 (10greg) 5Open>3Resolve... [19:14:11] 10Beta-Cluster, 10Deployment-Systems: beta cluster scap failure - https://phabricator.wikimedia.org/T96920#1234617 (10greg) [19:14:28] thcipriani: alright then, let's call it a week! ;) [19:15:37] thcipriani: what about https://phabricator.wikimedia.org/T96905 ? [19:16:48] 10Beta-Cluster, 10Analytics-EventLogging: EventLogging schemas are not served properly on beta cluster - https://phabricator.wikimedia.org/T97047#1234623 (10Tgr) 5Open>3Resolved a:3Tgr Seems fixed, presumably due to work done in T97033. [19:16:49] greg-g: yeah, that should be resolved. Don't know about followup tickets for deciding about adding mysql to /etc/rc[x].d/ in puppet [19:17:08] 10Beta-Cluster, 10Deployment-Systems: beta cluster scap failure - https://phabricator.wikimedia.org/T96920#1234628 (10greg) a:5mmodell>3thcipriani https://integration.wikimedia.org/ci/job/beta-scap-eqiad/50359/ Yay! [19:18:22] the problem was mainly the stuttering instances + not having mysql start at boot + me not manually fixing that until after beta was stable [19:18:34] * greg-g nods [19:18:42] * greg-g doesn't like that manual word [19:19:22] mariadb puppet stuff definitely has some different needs in prod vs labs :( [19:20:58] Yippee, build fixed! [19:20:59] Project beta-update-databases-eqiad build #9149: FIXED in 58 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/9149/ [19:28:43] 10Beta-Cluster: Beta cluster intermittent failures - https://phabricator.wikimedia.org/T97033#1234661 (10thcipriani) 5Open>3Resolved [19:36:05] oh Jenkins beta jobs have been fixed! [19:36:34] thcipriani: congratulations! [19:37:20] hashar: thanks, although, mostly what I did was fret a lot about them :) [19:38:47] 10Beta-Cluster: Beta cluster intermittent failures - https://phabricator.wikimedia.org/T97033#1234698 (10hashar) [19:38:49] 10Beta-Cluster, 10CirrusSearch: JobQueueError Redis server error: Could not insert 1 cirrusSearchLinksUpdatePrioritized job(s). - https://phabricator.wikimedia.org/T97130#1234695 (10hashar) 5Open>3Resolved a:3hashar Seems redis is back up just fine now. [19:40:26] thcipriani: have you figured out why MySQL doesn't start on boot ? 
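One way to answer that question on a trusty/sysvinit instance, using only stock Debian tooling (a sketch; whether the links were removed by the package or by a puppet manifest still has to be checked in the mariadb module itself):

    ls -l /etc/rc2.d/ | grep -i mysql            # S?? link = starts at boot, K?? link = explicitly disabled
    ls /etc/init/mysql.conf 2>/dev/null          # an upstart job, if present, would override the sysvinit links on trusty
    sudo update-rc.d mysql defaults              # recreate the rc links if they were removed entirely
    sudo update-rc.d mysql enable                # or flip existing K links back to S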
[19:40:38] that is most probably the intention in prod [19:40:56] but the $1 question is whether it is done in the deb package or some puppet manifest [19:41:20] hashar: no I haven't checked with any opsen [19:43:58] Sorry I broke beta logging so badly guys. I was still seeing logs into logstash and never thought to check jenkins jobs :( [19:44:41] shit happens! [19:44:50] at least it did not land on prod hehe [19:45:17] maybe the jenkins job should keep spamming errors here until the job is fixed [19:47:08] hashar: so, a quick install of mariadb-server on 14.04 does create a symlink in /etc/rc1.d [19:47:39] thcipriani: maybe Sean removed them manually [19:47:41] using the trusty repos [19:48:00] lemme try with the wikimedia repos [19:48:24] don't waste too much time on it though [19:49:09] 10Beta-Cluster, 7Monitoring: Beta cluster: monitor MySQL on deployment-db1 and deployment-db2 - https://phabricator.wikimedia.org/T97120#1234740 (10hashar) [19:49:10] 10Beta-Cluster: Setup monitoring for database servers in beta cluster - https://phabricator.wikimedia.org/T87093#1234741 (10hashar) [19:49:57] 10Beta-Cluster: Setup monitoring for database servers in beta cluster - https://phabricator.wikimedia.org/T87093#983793 (10hashar) From T97120 The beta cluster MySQL servers turned out to be down for a few hours (T96905) and there is no monitoring for it. We would need on both instances (deployment-db1 and dep... [19:50:59] hashar, was mysql really down? [19:51:09] I logged into both of those instances and checked service mysql status at the time [19:51:17] Krenair: yes was not running [19:51:26] at least when I looked at it [19:56:19] 10Browser-Tests, 3Gather Sprint Forward, 6Mobile-Web, 10Mobile-Web-Sprint-45-Snakes-On-A-Plane, 5Patch-For-Review: Fix failed MobileFrontend browsertests Jenkins jobs - https://phabricator.wikimedia.org/T94156#1234791 (10hashar) [19:56:22] 10Beta-Cluster, 10Deployment-Systems: beta cluster scap failure - https://phabricator.wikimedia.org/T96920#1234789 (10hashar) 5Open>3Resolved Magically fixed when T97138 got fixed :) [19:58:42] 5Continuous-Integration-Isolation, 10Continuous-Integration-Infrastructure, 6operations, 7Nodepool, and 2 others: Create a Debian package for NodePool on Debian Jessie - https://phabricator.wikimedia.org/T89142#1234813 (10hashar) We now have a preliminary Debian package which is good enough. We will improv... [19:59:42] 5Continuous-Integration-Isolation, 10Continuous-Integration-Infrastructure, 6operations, 7Nodepool: Create a Debian package for NodePool on Debian Jessie - https://phabricator.wikimedia.org/T89142#1234818 (10hashar) p:5Normal>3Low [20:03:11] 10Beta-Cluster: Can't connect to Beta Cluster database deployment-db1 or deployment-db2 (MariaDB down) - https://phabricator.wikimedia.org/T96905#1234868 (10Jdforrester-WMF) [20:09:48] hashar: I updated your cache_no_hardlinks patch for Zuul and tested it on a depooled slave with git cache enabled locally. Working fine! Clones mediawiki core in 30 seconds. [20:12:46] Krinkle: ohhhhhh [20:12:53] Krinkle: that is quite an old patch, isn't it? [20:13:33] Krinkle: now we have a debian package, it should be fairly trivial to incorporate that patch in our .deb and roll it everywhere [20:15:09] hashar: It's taking rather long for upstream to merge patches... 
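Background for the Zuul git-cache numbers above: when git clones from a plain local path it hard-links the object files instead of copying them, which is where most of the speedup comes from. A quick way to see the difference; the cache path below is illustrative only, not the slaves' actual layout.

    time git clone /srv/git/mediawiki/core.git /tmp/core-hardlinked             # local path: objects get hard-linked
    time git clone --no-hardlinks /srv/git/mediawiki/core.git /tmp/core-copied  # force a full copy, for comparison
    find /tmp/core-hardlinked/.git/objects -name '*.pack' -exec stat -c '%h %n' {} +   # a link count > 1 confirms the objects are shared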
[20:20:15] Krinkle: yeah I am not sure why :/ [20:20:43] might be worth poking them on their openstack-infra mailing list [20:22:46] 5Continuous-Integration-Isolation: Instances created by Nodepool cant run puppet due to missing certificate - https://phabricator.wikimedia.org/T96670#1234939 (10hashar) [20:23:36] I am off [20:23:41] weekend at last [20:30:29] PROBLEM - Free space - all mounts on deployment-eventlogging02 is CRITICAL: CRITICAL: deployment-prep.deployment-eventlogging02.diskspace._var.byte_percentfree (<30.00%) WARN: deployment-prep.deployment-eventlogging02.diskspace.root.byte_percentfree (<100.00%) [20:43:47] 10Browser-Tests, 6Release-Engineering, 5Patch-For-Review: Use rspec-expectations "expect" syntax instead of "should" syntax - https://phabricator.wikimedia.org/T68369#1235037 (10Physikerwelt) [21:17:25] PROBLEM - Puppet staleness on deployment-eventlogging02 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [43200.0] [21:27:41] 10Deployment-Systems: scap eats underlying commands output (such as maintenance script stacktrace) - https://phabricator.wikimedia.org/T97140#1235181 (10mmodell) it's got "--output" to a temp file - is the file empty? [21:32:03] 10Deployment-Systems: scap eats underlying commands output (such as maintenance script stacktrace) - https://phabricator.wikimedia.org/T97140#1235203 (10bd808) The stdout/stderr of the proc should be going to the logger at debug level. Is jenkins running with the `--verbose` flag? I can't remember if I got that... [21:45:40] 10Beta-Cluster, 10Sentry, 10Wikimedia-Logstash: Channel PHP errors from Logstash to Sentry on the beta cluster - https://phabricator.wikimedia.org/T85239#1235275 (10matmarex) [22:08:35] 10Staging, 3releng-201415-Q3: [Quarterly Success Metric] Stable uptime metrics of the Staging cluster - https://phabricator.wikimedia.org/T88705#1235332 (10greg) >>! In T88705#1156024, @mmodell wrote: > [[ https://graphite.wmflabs.org//render?width=600&from=-8hours&until=now&height=400&target=cactiStyle%28alia... [22:12:56] 10Beta-Cluster, 5Patch-For-Review, 15User-Bd808-Test: Beta cluster: exception 'LogicException' with message 'Missing stream uri, the stream can not be opened.' in /mnt/srv/mediawiki-staging/php-master/includes/debug/logger/monolog/LegacyHandler.php:113 - https://phabricator.wikimedia.org/T97138#1235340 (10bd8... [22:17:37] 10Staging, 3releng-201415-Q3: [Quarterly Success Metric] Stable uptime metrics of the Staging cluster - https://phabricator.wikimedia.org/T88705#1235342 (10mmodell) @greg: https://graphite.wmflabs.org/dashboard/#availability should work now [22:20:00] twentyafterfour: thanks! [22:20:17] 10Staging, 3releng-201415-Q3: [Quarterly Success Metric] Stable uptime metrics of the Staging cluster - https://phabricator.wikimedia.org/T88705#1235357 (10mmodell) [22:38:15] greg-g: so when I run jouncebot locally it works just fine :/ [22:38:45] Not sure what is making it sad running from tool labs [22:39:10] * bd808 will try restarting again (definition of insanity?) [22:59:58] Yippee, build fixed! 
[22:59:58] Project browsertests-CentralNotice-en.wikipedia.beta.wmflabs.org-windows_7-firefox-sauce build #31: FIXED in 57 sec: https://integration.wikimedia.org/ci/job/browsertests-CentralNotice-en.wikipedia.beta.wmflabs.org-windows_7-firefox-sauce/31/ [23:15:26] 10Staging, 3releng-201415-Q3: [Quarterly Success Metric] Stable uptime metrics of the Staging cluster - https://phabricator.wikimedia.org/T88705#1235541 (10mmodell) [23:34:33] Project beta-scap-eqiad build #50388: FAILURE in 30 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/50388/ [23:56:53] Yippee, build fixed! [23:56:54] Project browsertests-CentralNotice-en.wikipedia.beta.wmflabs.org-windows_7-chrome-sauce build #31: FIXED in 53 sec: https://integration.wikimedia.org/ci/job/browsertests-CentralNotice-en.wikipedia.beta.wmflabs.org-windows_7-chrome-sauce/31/