[00:03:10] RECOVERY - Host deployment-memc02 is UP: PING OK - Packet loss = 0%, RTA = 0.85 ms [00:03:11] hmmm I should add this to my list of autojoins [00:03:16] so.... [00:03:19] root@deployment-salt:/var/lib/git/operations/puppet# git rebase origin/production [00:03:22] It seems that there is already a rebase-merge directory, and [00:03:36] anyone know what's up there? nobody currently logged in either, but rebase sitting there half-done [00:04:01] bblack: just blow it away and do it again I guess [00:04:22] as in just hard reset to origin/prod, or should I look around for an obvious set of patches we were keeping on top of the rebase before? [00:04:38] bblack: git rebase —abort [00:04:38] I guess I could rebase --abort, too [00:04:38] ? [00:04:46] and then just fetch / rebase again [00:07:23] RECOVERY - Puppet failure on deployment-sca01 is OK: OK: Less than 1.00% above the threshold [0.0] [00:08:11] FLAPPINGSTART - Host deployment-memc02 is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 2369.91 ms [00:10:05] 10Beta-Cluster: Beta cluster intermittent failures - https://phabricator.wikimedia.org/T97033#1232934 (10thcipriani) This problem is still ongoing, although @coren and @andrew may have found the root cause: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1346917 After updating the kernel on labvirt1001 and... [00:10:06] 10Beta-Cluster: Beta cluster intermittent failures - https://phabricator.wikimedia.org/T97033#1232934 (10thcipriani) This problem is still ongoing, although @coren and @andrew may have found the root cause: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1346917 After updating the kernel on labvirt1001 and... [00:12:01] FLAPPINGSTART - Host deployment-elastic07 is UP: PING OK - Packet loss = 0%, RTA = 0.84 ms [00:12:37] ok deployment-salt ops/puppet is clean now. going to also cherry-pick a patch into there for testing myself now.... [00:13:53] FLAPPINGSTOP - Host integration-saltmaster is UP: PING OK - Packet loss = 0%, RTA = 0.52 ms [00:17:18] !log cherry-picked https://gerrit.wikimedia.org/r/#/c/196009/13/ onto deployment-salt ops/puppet [00:18:29] (I'm guessing this is where to do that, docs say wikimedia-qa :P) [00:19:07] heh which docs? [00:19:22] https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/How_code_is_updated#Cherry-picking_a_patch_from_gerrit [00:20:43] {{fixed}} [00:21:05] :) [00:26:40] 10Continuous-Integration: Zuul-cloner should use git cache via hard links - https://phabricator.wikimedia.org/T97098#1232972 (10Krinkle) 3NEW [00:26:45] 10Continuous-Integration: Zuul-cloner should use git cache via hard links - https://phabricator.wikimedia.org/T97098#1232972 (10Krinkle) 3NEW [00:26:52] PROBLEM - Host deployment-sentry2 is DOWN: CRITICAL - Host Unreachable (10.68.17.204) [00:27:15] 10Continuous-Integration: Zuul-cloner should use git cache via hard links - https://phabricator.wikimedia.org/T97098#1232972 (10Krinkle) I've tested this on integration-trusty-slave-1021 with mediawiki/core and noticed that it's not much faster to copy from `/mnt/git` than to clone from gerrit.wikimedia.org. In... [00:27:16] 10Continuous-Integration: Zuul-cloner should use git cache via hard links - https://phabricator.wikimedia.org/T97098#1232972 (10Krinkle) I've tested this on integration-trusty-slave-1021 with mediawiki/core and noticed that it's not much faster to copy from `/mnt/git` than to clone from gerrit.wikimedia.org. In... 
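(A minimal sketch of the stuck-rebase recovery and cherry-pick discussed above, from 00:03 to 00:17, assuming the standard operations/puppet checkout on deployment-salt and the usual Gerrit change-ref layout; the fetch URL and ref are illustrative.)

```bash
# On deployment-salt, in the checkout shown above.
cd /var/lib/git/operations/puppet

# A half-finished rebase leaves .git/rebase-merge (or .git/rebase-apply)
# behind; abort it before starting over.
git rebase --abort

# Re-sync with the upstream branch and replay any local commits on top.
git fetch origin
git rebase origin/production

# Re-apply a Gerrit change on top. Gerrit exposes patchsets as
# refs/changes/<NN>/<change>/<patchset>, NN = last two digits of the change.
git fetch https://gerrit.wikimedia.org/r/operations/puppet refs/changes/09/196009/13
git cherry-pick FETCH_HEAD
```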
[00:27:32] 10Continuous-Integration, 5Patch-For-Review: Set up git replication on integration slaves - https://phabricator.wikimedia.org/T96687#1224500 (10Krinkle) [00:27:33] 10Continuous-Integration, 5Patch-For-Review: Set up git replication on integration slaves - https://phabricator.wikimedia.org/T96687#1224500 (10Krinkle) [00:27:33] 10Continuous-Integration: Zuul-cloner should use git cache via hard links - https://phabricator.wikimedia.org/T97098#1232982 (10Krinkle) [00:27:34] 10Continuous-Integration: Zuul-cloner should use git cache via hard links - https://phabricator.wikimedia.org/T97098#1232982 (10Krinkle) [00:28:41] ok one more dumb question: https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/Overview -> "The mobile traffic (*.m.*.beta.wmflabs.org) is served by the deployment-cache-mobile03 instance" [00:28:54] $ ssh deployment-cache-mobile03.beta.wmflabs.org [00:28:55] Linux deployment-cache-text02 [00:29:28] yet when I hit http://en.m.wikipedia.beta.wmflabs.org/wiki/Main_Page the headers say it did flow through a deployment-cache-mobile03 ... [00:29:54] why does ssh-ing there drop me on text02? [00:31:52] RECOVERY - Host deployment-sentry2 is UP: PING OK - Packet loss = 0%, RTA = 0.74 ms [00:36:58] FLAPPINGSTOP - Host deployment-elastic07 is UP: PING OK - Packet loss = 0%, RTA = 298.11 ms [00:37:16] PROBLEM - Host deployment-fluoride is DOWN: PING CRITICAL - Packet loss = 44%, RTA = 3302.80 ms [00:42:13] RECOVERY - Host deployment-fluoride is UP: PING OK - Packet loss = 0%, RTA = 1.01 ms [00:51:02] PROBLEM - Host deployment-parsoidcache02 is DOWN: CRITICAL - Host Unreachable (10.68.16.145) [00:53:16] PROBLEM - Host deployment-restbase02 is DOWN: CRITICAL - Host Unreachable (10.68.17.189) [00:55:51] RECOVERY - Host deployment-restbase02 is UP: PING OK - Packet loss = 0%, RTA = 1.04 ms [00:55:53] RECOVERY - Host deployment-parsoidcache02 is UP: PING OK - Packet loss = 0%, RTA = 0.87 ms [00:59:52] PROBLEM - SSH on deployment-parsoidcache02 is CRITICAL: No route to host [01:01:16] PROBLEM - Host deployment-elastic07 is DOWN: PING CRITICAL - Packet loss = 57%, RTA = 7109.24 ms [01:01:38] PROBLEM - Host integration-saltmaster is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 2231.75 ms [01:01:54] thcipriani|afk: andrewbogott ^ [01:01:56] not sure if known? 
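(A sketch of how to reconcile the DNS name, the host ssh actually lands on, and the cache that served the mobile request; the specific header names checked, X-Cache and Via, are assumptions based on what Varnish typically emits.)

```bash
# Where does the public name point?
dig +short deployment-cache-mobile03.beta.wmflabs.org

# Which host does ssh really land on? (the prompt above showed text02)
ssh deployment-cache-mobile03.beta.wmflabs.org hostname -f

# Which cache instance actually handled a mobile request? The response
# headers usually name the Varnish host that served it.
curl -sI http://en.m.wikipedia.beta.wmflabs.org/wiki/Main_Page | grep -iE '^(x-cache|via|server):'
```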
[01:02:02] RECOVERY - Host deployment-elastic07 is UP: PING OK - Packet loss = 0%, RTA = 344.77 ms [01:02:03] or if they’re on non labsvirt1000 [01:02:05] 1 [01:08:48] PROBLEM - Host deployment-cache-bits01 is DOWN: CRITICAL - Host Unreachable (10.68.16.12) [01:11:31] RECOVERY - Host deployment-cache-bits01 is UP: PING OK - Packet loss = 0%, RTA = 0.54 ms [01:11:37] RECOVERY - Host integration-saltmaster is UP: PING OK - Packet loss = 0%, RTA = 0.57 ms [01:16:18] PROBLEM - Host deployment-logstash1 is DOWN: CRITICAL - Host Unreachable (10.68.16.134) [01:22:11] FLAPPINGSTART - Host deployment-fluoride is UP: PING OK - Packet loss = 0%, RTA = 242.34 ms [01:22:53] PROBLEM - Host deployment-logstash1 is DOWN: PING CRITICAL - Packet loss = 44%, RTA = 6631.08 ms [01:24:31] PROBLEM - Content Translation Server on deployment-cxserver03 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:25:55] PROBLEM - Host deployment-test is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 2005.45 ms [01:29:24] RECOVERY - Content Translation Server on deployment-cxserver03 is OK: HTTP OK: HTTP/1.1 200 OK - 1103 bytes in 0.009 second response time [01:35:54] RECOVERY - Host deployment-test is UP: PING OK - Packet loss = 0%, RTA = 0.59 ms [01:40:55] PROBLEM - Host integration-saltmaster is DOWN: CRITICAL - Host Unreachable (10.68.18.24) [01:41:45] !log re-cherry-picked ops/puppet https://gerrit.wikimedia.org/r/#/c/196009/13 on deployment-salt (analytics last-access testing) [01:46:36] RECOVERY - Host integration-saltmaster is UP: PING OK - Packet loss = 0%, RTA = 83.98 ms [01:50:53] FLAPPINGSTART - Host deployment-parsoidcache02 is UP: PING OK - Packet loss = 0%, RTA = 0.89 ms [01:51:21] FLAPPINGSTART - Host deployment-logstash1 is UP: PING OK - Packet loss = 0%, RTA = 0.57 ms [02:01:02] PROBLEM - Host deployment-kafka02 is DOWN: PING CRITICAL - Packet loss = 50%, RTA = 3843.12 ms [02:02:48] PROBLEM - Host deployment-elastic07 is DOWN: CRITICAL - Host Unreachable (10.68.17.187) [02:04:20] PROBLEM - Host deployment-cache-mobile03 is DOWN: CRITICAL - Host Unreachable (10.68.16.13) [02:06:08] RECOVERY - Host deployment-elastic07 is UP: PING OK - Packet loss = 0%, RTA = 0.75 ms [02:07:14] FLAPPINGSTOP - Host deployment-fluoride is UP: PING WARNING - Packet loss = 14%, RTA = 704.75 ms [02:07:57] PROBLEM - Host deployment-restbase01 is DOWN: PING CRITICAL - Packet loss = 37%, RTA = 2472.22 ms [02:08:06] YuviPanda: integration-saltmaster is on labvirt1005 (possibly) either way, wasn't on labvirt1001 or 1002 [02:08:23] RECOVERY - Host deployment-cache-mobile03 is UP: PING OK - Packet loss = 0%, RTA = 0.64 ms [02:14:19] PROBLEM - Host deployment-restbase01 is DOWN: PING CRITICAL - Packet loss = 37%, RTA = 4072.31 ms [02:17:53] RECOVERY - Host deployment-restbase01 is UP: PING OK - Packet loss = 0%, RTA = 0.61 ms [02:21:28] PROBLEM - Host deployment-cache-bits01 is DOWN: CRITICAL - Host Unreachable (10.68.16.12) [02:23:32] RECOVERY - Host deployment-cache-bits01 is UP: PING OK - Packet loss = 0%, RTA = 0.76 ms [02:29:24] PROBLEM - Host deployment-cache-mobile03 is DOWN: PING CRITICAL - Packet loss = 93%, RTA = 7295.20 ms [02:33:03] PROBLEM - Host deployment-parsoid05 is DOWN: CRITICAL - Host Unreachable (10.68.16.120) [02:33:11] FLAPPINGSTOP - Host deployment-memc02 is UP: PING OK - Packet loss = 0%, RTA = 0.76 ms [02:36:54] RECOVERY - Host deployment-parsoid05 is UP: PING OK - Packet loss = 0%, RTA = 0.93 ms [02:41:48] PROBLEM - Host integration-saltmaster is DOWN: PING CRITICAL - Packet loss = 73%, RTA = 2107.13 ms 
[02:42:04] Yippee, build fixed! [02:42:05] Project browsertests-PageTriage-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #515: FIXED in 1 min 4 sec: https://integration.wikimedia.org/ci/job/browsertests-PageTriage-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/515/ [02:42:54] FLAPPINGSTART - Host deployment-restbase01 is UP: PING OK - Packet loss = 0%, RTA = 0.59 ms [02:46:37] RECOVERY - Host integration-saltmaster is UP: PING OK - Packet loss = 0%, RTA = 0.77 ms [02:47:06] PROBLEM - Host deployment-elastic07 is DOWN: CRITICAL - Host Unreachable (10.68.17.187) [02:47:44] RECOVERY - Host deployment-elastic07 is UP: PING OK - Packet loss = 0%, RTA = 276.00 ms [02:48:36] PROBLEM - Puppet failure on deployment-elastic07 is CRITICAL: CRITICAL: 25.00% of data above the critical threshold [0.0] [02:53:42] PROBLEM - Host deployment-kafka02 is DOWN: CRITICAL - Host Unreachable (10.68.17.156) [02:55:01] RECOVERY - Host deployment-kafka02 is UP: PING OK - Packet loss = 0%, RTA = 0.54 ms [02:55:51] PROBLEM - Host deployment-test is DOWN: CRITICAL - Host Unreachable (10.68.16.149) [02:56:33] PROBLEM - Host deployment-cache-bits01 is DOWN: PING CRITICAL - Packet loss = 25%, RTA = 2574.53 ms [02:56:55] 10Continuous-Integration, 7Zuul: Zuul-cloner should use hard links when fetching from cache-dir - https://phabricator.wikimedia.org/T97106#1233116 (10Krinkle) 3NEW [02:58:34] RECOVERY - Host deployment-cache-bits01 is UP: PING OK - Packet loss = 0%, RTA = 0.99 ms [02:59:06] PROBLEM - Puppet failure on deployment-cache-mobile03 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [03:00:10] PROBLEM - Host deployment-kafka02 is DOWN: PING CRITICAL - Packet loss = 100% [03:00:54] RECOVERY - Host deployment-kafka02 is UP: PING OK - Packet loss = 0%, RTA = 68.03 ms [03:03:40] PROBLEM - Host deployment-elastic07 is DOWN: CRITICAL - Host Unreachable (10.68.17.187) [03:04:30] 10Beta-Cluster, 10Deployment-Systems: beta cluster scap failure - https://phabricator.wikimedia.org/T96920#1233131 (10greg) @mmodell @thcipriani ping (this looks unrelated to the flappiness of wmf labs as it started before all of that) [03:05:52] PROBLEM - Host deployment-test is DOWN: CRITICAL - Host Unreachable (10.68.16.149) [03:06:00] PROBLEM - Host deployment-kafka02 is DOWN: PING CRITICAL - Packet loss = 54%, RTA = 8120.05 ms [03:06:11] RECOVERY - Host deployment-elastic07 is UP: PING OK - Packet loss = 0%, RTA = 0.62 ms [03:10:55] RECOVERY - Host deployment-test is UP: PING OK - Packet loss = 0%, RTA = 496.01 ms [03:15:52] FLAPPINGSTART - Host deployment-test is DOWN: CRITICAL - Host Unreachable (10.68.16.149) [03:19:25] FLAPPINGSTOP - Host deployment-cache-mobile03 is UP: PING WARNING - Packet loss = 61%, RTA = 41.10 ms [03:21:21] FLAPPINGSTOP - Host deployment-logstash1 is UP: PING OK - Packet loss = 0%, RTA = 1.06 ms [03:23:07] FLAPPINGSTOP - Host deployment-restbase01 is UP: PING WARNING - Packet loss = 73%, RTA = 0.92 ms [03:26:33] PROBLEM - Host deployment-cache-bits01 is DOWN: PING CRITICAL - Packet loss = 28%, RTA = 2341.18 ms [03:28:30] PROBLEM - Host deployment-cache-mobile03 is DOWN: CRITICAL - Host Unreachable (10.68.16.13) [03:38:23] PROBLEM - Host deployment-cache-mobile03 is DOWN: CRITICAL - Host Unreachable (10.68.16.13) [03:38:23] PROBLEM - Puppet failure on deployment-cache-bits01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [03:38:35] RECOVERY - Puppet failure on deployment-elastic07 is OK: OK: Less than 1.00% above the threshold [0.0] [03:45:52] 
FLAPPINGSTOP - Host deployment-parsoidcache02 is UP: PING OK - Packet loss = 0%, RTA = 383.99 ms [03:46:18] PROBLEM - Host deployment-logstash1 is DOWN: CRITICAL - Host Unreachable (10.68.16.134) [03:48:24] RECOVERY - Host deployment-cache-mobile03 is UP: PING OK - Packet loss = 0%, RTA = 0.71 ms [03:51:20] RECOVERY - Host deployment-logstash1 is UP: PING OK - Packet loss = 0%, RTA = 0.77 ms [03:53:32] PROBLEM - Host deployment-cache-mobile03 is DOWN: PING CRITICAL - Packet loss = 86%, RTA = 2249.61 ms [03:54:57] PROBLEM - Puppet staleness on deployment-kafka02 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [43200.0] [03:55:57] FLAPPINGSTOP - Host deployment-kafka02 is UP: PING OK - Packet loss = 0%, RTA = 0.49 ms [03:58:23] RECOVERY - Host deployment-cache-mobile03 is UP: PING OK - Packet loss = 0%, RTA = 0.57 ms [04:01:09] FLAPPINGSTART - Host deployment-elastic07 is UP: PING OK - Packet loss = 0%, RTA = 0.62 ms [04:06:01] PROBLEM - Host deployment-restbase02 is DOWN: CRITICAL - Plugin timed out after 15 seconds [04:06:22] RECOVERY - Host deployment-restbase02 is UP: PING OK - Packet loss = 16%, RTA = 0.56 ms [04:07:03] 10Continuous-Integration: Update jobs to use zuul-cloner with git cache via hard links - https://phabricator.wikimedia.org/T97098#1233144 (10Krinkle) [04:07:12] 10Continuous-Integration: Update jobs to use zuul-cloner with git cache via hard links - https://phabricator.wikimedia.org/T97098#1232972 (10Krinkle) [04:07:13] 10Continuous-Integration, 7Zuul: Zuul-cloner should use hard links when fetching from cache-dir - https://phabricator.wikimedia.org/T97106#1233147 (10Krinkle) [04:09:03] 10Continuous-Integration: Update jobs to use zuul-cloner with git cache via hard links - https://phabricator.wikimedia.org/T97098#1232972 (10Krinkle) I applied https://review.openstack.org/#/c/117626/4 to zuul on the depooled integration-slave-trusty-1021. A quick test confirms it works properly and speeds up th... 
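(For the hard-link idea in T97098/T97106, a sketch of why a local clone beats a plain copy of the cache; the cache path /mnt/git/mediawiki/core.git and the workspace directories are illustrative, and hard links only happen when source and destination sit on the same filesystem.)

```bash
# Cloning from a local path hard-links the object files by default, so it
# costs almost no time or disk, unlike copying the directory outright.
git clone /mnt/git/mediawiki/core.git core-from-cache

# Alternative: keep Gerrit as the remote but borrow objects from the cache,
# fetching only what the cache is missing.
git clone --reference /mnt/git/mediawiki/core.git \
    https://gerrit.wikimedia.org/r/mediawiki/core core-from-gerrit

# For comparison, a straight copy duplicates every object on disk:
#   cp -a /mnt/git/mediawiki/core.git core-copied
```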
[04:25:01] FLAPPINGSTOP - Host deployment-mx is UP: PING WARNING - Packet loss = 0%, RTA = 992.08 ms [04:27:01] FLAPPINGSTOP - Host deployment-elastic07 is UP: PING OK - Packet loss = 0%, RTA = 112.36 ms [04:28:41] RECOVERY - App Server Main HTTP Response on deployment-mediawiki01 is OK: HTTP OK: HTTP/1.1 200 OK - 46153 bytes in 8.993 second response time [04:34:25] PROBLEM - Host deployment-cache-mobile03 is DOWN: PING CRITICAL - Packet loss = 73%, RTA = 9741.31 ms [04:46:10] PROBLEM - Host deployment-memc02 is DOWN: CRITICAL - Host Unreachable (10.68.16.14) [04:47:27] PROBLEM - Host deployment-db2 is DOWN: PING CRITICAL - Packet loss = 61%, RTA = 10674.78 ms [04:53:35] PROBLEM - Host deployment-cache-mobile03 is DOWN: PING CRITICAL - Packet loss = 100% [04:54:15] RECOVERY - Host deployment-cache-mobile03 is UP: PING OK - Packet loss = 0%, RTA = 0.57 ms [04:55:11] PROBLEM - Host deployment-mx is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 4022.02 ms [04:57:18] RECOVERY - Host deployment-db2 is UP: PING OK - Packet loss = 0%, RTA = 0.78 ms [05:01:29] PROBLEM - Host deployment-logstash1 is DOWN: PING CRITICAL - Packet loss = 100% [05:04:14] FLAPPINGSTART - Host deployment-cache-mobile03 is UP: PING OK - Packet loss = 0%, RTA = 0.59 ms [05:04:37] RECOVERY - Host deployment-mx is UP: PING OK - Packet loss = 0%, RTA = 0.67 ms [05:06:47] PROBLEM - Puppet failure on integration-slave-jessie-1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [05:06:47] RECOVERY - Host deployment-logstash1 is UP: PING OK - Packet loss = 0%, RTA = 0.60 ms [05:18:20] PROBLEM - Host integration-slave-trusty-1021 is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 2970.65 ms [05:18:21] PROBLEM - Host deployment-memc02 is DOWN: PING CRITICAL - Packet loss = 100% [05:18:54] beta o beta! 
[05:19:05] PROBLEM - Puppet failure on deployment-cache-mobile03 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [05:23:21] RECOVERY - Host integration-slave-trusty-1021 is UP: PING OK - Packet loss = 0%, RTA = 0.74 ms [05:25:03] PROBLEM - Host deployment-mx is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 4317.80 ms [05:25:57] PROBLEM - Host deployment-parsoidcache02 is DOWN: PING CRITICAL - Packet loss = 37%, RTA = 6045.16 ms [05:29:37] RECOVERY - Host deployment-mx is UP: PING OK - Packet loss = 0%, RTA = 0.76 ms [05:30:51] RECOVERY - Host deployment-parsoidcache02 is UP: PING OK - Packet loss = 0%, RTA = 0.78 ms [05:36:28] FLAPPINGSTOP - Host deployment-cache-bits01 is UP: PING OK - Packet loss = 0%, RTA = 0.49 ms [05:48:17] PROBLEM - Host deployment-memc02 is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 4092.52 ms [05:51:17] PROBLEM - Host deployment-logstash1 is DOWN: CRITICAL - Host Unreachable (10.68.16.134) [05:54:44] PROBLEM - Host deployment-mx is DOWN: PING CRITICAL - Packet loss = 30%, RTA = 5790.14 ms [05:58:11] RECOVERY - Host deployment-memc02 is UP: PING OK - Packet loss = 0%, RTA = 0.58 ms [05:59:37] RECOVERY - Host deployment-mx is UP: PING OK - Packet loss = 0%, RTA = 24.52 ms [06:01:19] RECOVERY - Host deployment-logstash1 is UP: PING OK - Packet loss = 0%, RTA = 460.42 ms [06:02:58] PROBLEM - English Wikipedia Mobile Main page on beta-cluster is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:05:50] PROBLEM - Host deployment-parsoidcache02 is DOWN: CRITICAL - Host Unreachable (10.68.16.145) [06:09:38] FLAPPINGSTART - Host deployment-mx is UP: PING OK - Packet loss = 0%, RTA = 0.71 ms [06:11:32] PROBLEM - Host deployment-cache-bits01 is DOWN: PING CRITICAL - Packet loss = 28%, RTA = 2194.36 ms [06:13:33] RECOVERY - Host deployment-cache-bits01 is UP: PING OK - Packet loss = 0%, RTA = 0.67 ms [06:17:06] PROBLEM - Host deployment-elastic07 is DOWN: CRITICAL - Host Unreachable (10.68.17.187) [06:18:11] FLAPPINGSTART - Host deployment-memc02 is UP: PING OK - Packet loss = 0%, RTA = 0.65 ms [06:22:01] RECOVERY - Host deployment-elastic07 is UP: PING OK - Packet loss = 0%, RTA = 30.70 ms [06:22:17] PROBLEM - Host deployment-parsoidcache02 is DOWN: PING CRITICAL - Packet loss = 100% [06:25:55] RECOVERY - Host deployment-parsoidcache02 is UP: PING OK - Packet loss = 0%, RTA = 0.88 ms [06:30:54] FLAPPINGSTOP - Host deployment-test is UP: PING OK - Packet loss = 0%, RTA = 0.70 ms [06:32:10] PROBLEM - Host deployment-fluoride is DOWN: CRITICAL - Host Unreachable (10.68.16.190) [06:37:11] RECOVERY - Host deployment-fluoride is UP: PING OK - Packet loss = 0%, RTA = 0.79 ms [06:51:26] PROBLEM - Host deployment-logstash1 is DOWN: PING CRITICAL - Packet loss = 58%, RTA = 7360.91 ms [07:15:00] RECOVERY - SSH on deployment-cache-mobile03 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0) [07:19:35] PROBLEM - Puppet failure on deployment-mx is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [07:26:20] PROBLEM - Host deployment-elastic07 is DOWN: PING CRITICAL - Packet loss = 100% [07:26:50] RECOVERY - Host deployment-elastic07 is UP: PING OK - Packet loss = 0%, RTA = 0.55 ms [07:30:20] FLAPPINGSTOP - Host deployment-upload is UP: PING OK - Packet loss = 0%, RTA = 136.31 ms [07:31:01] PROBLEM - Host deployment-parsoidcache02 is DOWN: PING CRITICAL - Packet loss = 45%, RTA = 2472.03 ms [07:32:09] PROBLEM - Host deployment-fluoride is DOWN: CRITICAL - Host Unreachable (10.68.16.190) [07:32:53] RECOVERY - English 
Wikipedia Mobile Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 29317 bytes in 3.570 second response time [07:33:31] PROBLEM - Host deployment-cache-bits01 is DOWN: CRITICAL - Host Unreachable (10.68.16.12) [07:36:28] RECOVERY - Host deployment-cache-bits01 is UP: PING OK - Packet loss = 0%, RTA = 0.69 ms [07:37:10] RECOVERY - Host deployment-fluoride is UP: PING OK - Packet loss = 0%, RTA = 0.74 ms [07:37:24] PROBLEM - Host deployment-db2 is DOWN: PING CRITICAL - Packet loss = 50%, RTA = 3714.69 ms [07:38:23] PROBLEM - Puppet failure on deployment-cache-bits01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [07:38:53] PROBLEM - English Wikipedia Mobile Main page on beta-cluster is CRITICAL: No route to host [07:42:03] FLAPPINGSTART - Host deployment-elastic07 is DOWN: PING CRITICAL - Packet loss = 14%, RTA = 4375.19 ms [07:42:19] RECOVERY - Host deployment-db2 is UP: PING OK - Packet loss = 0%, RTA = 0.69 ms [07:44:36] FLAPPINGSTOP - Host deployment-mx is UP: PING OK - Packet loss = 0%, RTA = 0.60 ms [07:45:54] RECOVERY - Host deployment-parsoidcache02 is UP: PING OK - Packet loss = 0%, RTA = 0.67 ms [07:48:51] RECOVERY - English Wikipedia Mobile Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 29317 bytes in 3.558 second response time [07:49:59] PROBLEM - Host deployment-mx is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 2082.32 ms [07:52:15] FLAPPINGSTART - Host deployment-fluoride is UP: PING OK - Packet loss = 0%, RTA = 445.04 ms [07:54:37] RECOVERY - Host deployment-mx is UP: PING OK - Packet loss = 0%, RTA = 38.43 ms [07:54:57] PROBLEM - English Wikipedia Mobile Main page on beta-cluster is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:57:20] PROBLEM - Host deployment-db2 is DOWN: PING CRITICAL - Packet loss = 16%, RTA = 3230.12 ms [07:59:04] PROBLEM - Host deployment-parsoidcache02 is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 2329.54 ms [07:59:26] PROBLEM - Host deployment-test is DOWN: CRITICAL - Host Unreachable (10.68.16.149) [08:00:52] RECOVERY - Host deployment-test is UP: PING OK - Packet loss = 0%, RTA = 0.55 ms [08:04:59] RECOVERY - English Wikipedia Mobile Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 29317 bytes in 9.301 second response time [08:05:38] 6Release-Engineering, 10MediaWiki-Debug-Logging, 6Security-Team, 6operations, 5Patch-For-Review: Store unsampled API and XFF logs - https://phabricator.wikimedia.org/T88393#1233253 (10Joe) @Andrew, @fgiunchedi is no one working on this? It is an old ticket, marked as high priority and it's unassigned. [08:05:53] FLAPPINGSTART - Host deployment-parsoidcache02 is UP: PING OK - Packet loss = 0%, RTA = 0.61 ms [08:09:49] RECOVERY - SSH on deployment-parsoidcache02 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [08:26:57] FLAPPINGSTOP - Host deployment-elastic07 is UP: PING OK - Packet loss = 0%, RTA = 232.15 ms [08:28:24] FLAPPINGSTOP - Host deployment-cache-mobile03 is UP: PING WARNING - Packet loss = 0%, RTA = 946.79 ms [08:35:55] FLAPPINGSTART - Host deployment-test is UP: PING OK - Packet loss = 0%, RTA = 0.74 ms [08:43:47] 10Continuous-Integration, 7Regression, 7Upstream: ERROR: Failed to notify endpoint 'HTTP:http://127.0.0.1:8001/jenkins_endpoint' - https://phabricator.wikimedia.org/T93321#1233301 (10hashar) >>! In T93321#1145026, @hashar wrote: > Before Zuul migrated to Gearman, the jobs had to notify Zuul on start and comp... 
[08:54:36] PROBLEM - Host deployment-mx is DOWN: CRITICAL - Host Unreachable (10.68.17.78) [08:55:06] !log Enabling puppet on deployment-eventlogging02.eqiad.wmflabs {{bug|T96921}} [08:58:23] PROBLEM - Host deployment-cache-mobile03 is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 2166.59 ms [08:58:37] FLAPPINGSTOP - Host deployment-cache-bits01 is UP: PING OK - Packet loss = 0%, RTA = 0.64 ms [08:59:13] RECOVERY - Host deployment-cache-mobile03 is UP: PING OK - Packet loss = 0%, RTA = 0.63 ms [08:59:23] PROBLEM - Host deployment-restbase01 is DOWN: PING CRITICAL - Packet loss = 50%, RTA = 7360.96 ms [08:59:37] RECOVERY - Host deployment-mx is UP: PING OK - Packet loss = 0%, RTA = 260.11 ms [09:02:53] RECOVERY - Host deployment-restbase01 is UP: PING OK - Packet loss = 0%, RTA = 362.24 ms [09:05:53] 10Beta-Cluster: upgrade salt on deployment-prep to 2014.7 - https://phabricator.wikimedia.org/T92276#1233322 (10ArielGlenn) Packages out two days ago, will be testing with those on deployment prep shortly. [09:06:24] RECOVERY - Puppet staleness on deployment-eventlogging02 is OK: OK: Less than 1.00% above the threshold [3600.0] [09:07:21] 10Beta-Cluster, 10Analytics-EventLogging: puppet agent disabled on beta cluster deployment-eventlogging02.eqiad.wmflabs instance - https://phabricator.wikimedia.org/T96921#1233323 (10hashar) Seems it has not been reenabled so I have did it and ran puppet again. From the diff it changes a bunch of files under /... [09:08:59] !log beta: manually rebased operations/puppet.git [09:10:00] FLAPPINGSTART - Host deployment-kafka02 is UP: PING OK - Packet loss = 0%, RTA = 202.09 ms [09:13:59] PROBLEM - Puppet failure on deployment-eventlogging02 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [09:22:19] FLAPPINGSTOP - Host deployment-db2 is UP: PING OK - Packet loss = 0%, RTA = 178.74 ms [09:23:23] PROBLEM - Host deployment-cache-mobile03 is DOWN: CRITICAL - Host Unreachable (10.68.16.13) [09:23:59] RECOVERY - Puppet failure on deployment-eventlogging02 is OK: OK: Less than 1.00% above the threshold [0.0] [09:24:13] RECOVERY - Host deployment-cache-mobile03 is UP: PING OK - Packet loss = 0%, RTA = 343.79 ms [09:30:54] FLAPPINGSTOP - Host deployment-parsoidcache02 is UP: PING OK - Packet loss = 0%, RTA = 0.49 ms [09:32:15] 10Beta-Cluster, 10Analytics-EventLogging: puppet agent disabled on beta cluster deployment-eventlogging02.eqiad.wmflabs instance - https://phabricator.wikimedia.org/T96921#1233348 (10hashar) I have disabled puppet agent again on the eventlogging02 instance and manually fixed the paths in all `/etc/eventlogging... 
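(The enable, run, disable cycle being !logged here, as a sketch; the disable message is just an example of leaving a note for the next person.)

```bash
# Re-enable the agent and do one foreground run to see the diff it applies.
sudo puppet agent --enable
sudo puppet agent --test

# Disable it again while hand-editing config files the agent would overwrite.
sudo puppet agent --disable "hand-fixing /etc/eventlogging* paths, see T96921"
```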
[09:34:38] 10Beta-Cluster: Can't connect to Beta Cluster database deployment-db1 - https://phabricator.wikimedia.org/T96905#1233359 (10hashar) [09:34:57] !sal [09:34:57] https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [09:36:58] 10Beta-Cluster: Can't connect to Beta Cluster database deployment-db1 or deployment-db2 (MariaDB down) - https://phabricator.wikimedia.org/T96905#1233363 (10hashar) [09:40:03] PROBLEM - Host deployment-mx is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 3144.61 ms [09:40:23] PROBLEM - Host deployment-upload is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 2050.93 ms [09:45:00] RECOVERY - Host deployment-mx is UP: PING OK - Packet loss = 0%, RTA = 0.45 ms [09:45:20] RECOVERY - Host deployment-upload is UP: PING OK - Packet loss = 0%, RTA = 167.00 ms [09:48:12] FLAPPINGSTOP - Host deployment-memc02 is UP: PING WARNING - Packet loss = 0%, RTA = 969.22 ms [09:48:20] PROBLEM - Host deployment-cache-mobile03 is DOWN: CRITICAL - Host Unreachable (10.68.16.13) [09:51:26] 10Beta-Cluster: Can't connect to Beta Cluster database deployment-db1 or deployment-db2 (MariaDB down) - https://phabricator.wikimedia.org/T96905#1233371 (10hashar) The beta cluster has two databases instances: 10.68.17.94 and 10.68.16.193 which are respectively deployment-db2 and deployment-db1. The instances... [09:52:11] FLAPPINGSTOP - Host deployment-fluoride is UP: PING OK - Packet loss = 0%, RTA = 0.69 ms [09:52:39] 10Beta-Cluster: Beta cluster intermittent failures - https://phabricator.wikimedia.org/T97033#1231310 (10hashar) From T96905 it seems MySQL/MariaDB is not started on boot and deployment-db1 got rebooted on Thu Apr 23 23:53. I have restated MySQL :-) [09:52:59] !log restarted mysql on both deployment-db1 and deployment-db2. The service is apparently not started on instance boot. [09:53:31] !log mysql down is https://phabricator.wikimedia.org/T96905 [09:53:33] PROBLEM - Host deployment-cache-mobile03 is DOWN: CRITICAL - Host Unreachable (10.68.16.13) [09:56:09] 10Beta-Cluster, 7Monitoring: Beta cluster: monitor MySQL on deployment-db1 and deployment-db2 - https://phabricator.wikimedia.org/T97120#1233374 (10hashar) 3NEW [09:56:44] 10Beta-Cluster: Can't connect to Beta Cluster database deployment-db1 or deployment-db2 (MariaDB down) - https://phabricator.wikimedia.org/T96905#1233383 (10hashar) a:3hashar [09:56:55] stupid log bot [09:58:37] PROBLEM - Host deployment-cache-bits01 is DOWN: CRITICAL - Host Unreachable (10.68.16.12) [10:00:01] FLAPPINGSTOP - Host deployment-kafka02 is UP: PING OK - Packet loss = 0%, RTA = 0.87 ms [10:03:37] 10Browser-Tests, 6Collaboration-Team, 10Collaboration-Team-Sprint-A-2015-05-06, 10Flow, 5Patch-For-Review: A5. Fix failed Flow browsertests Jenkins jobs - https://phabricator.wikimedia.org/T94153#1233388 (10hashar) [10:04:13] 10Browser-Tests, 6Collaboration-Team, 10Collaboration-Team-Sprint-A-2015-05-06, 10Flow, 5Patch-For-Review: A5. Fix failed Flow browsertests Jenkins jobs - https://phabricator.wikimedia.org/T94153#1156329 (10hashar) Rephrasing the task details. The Echo and Flow jobs are listed at https://integration.wiki... 
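(A sketch of the restart plus enable-on-boot sequence for the beta databases, matching the SysV-style Ubuntu instances in use here; the update-rc.d invocation is the fix that appears later in this log.)

```bash
# Bring MariaDB back up and confirm it answers.
sudo service mysql start
sudo service mysql status

# Is it registered to start on boot at all? No S??mysql symlink in the
# runlevel directories means it is not.
ls /etc/rc2.d/ | grep -i mysql || echo "mysql not enabled on boot"

# Register the init script so the next reboot does not take the wikis down.
sudo update-rc.d mysql defaults
```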
[10:04:59] PROBLEM - Host deployment-kafka02 is DOWN: CRITICAL - Host Unreachable (10.68.17.156) [10:15:54] FLAPPINGSTOP - Host deployment-test is UP: PING WARNING - Packet loss = 0%, RTA = 714.22 ms [10:16:18] FLAPPINGSTOP - Host deployment-logstash1 is UP: PING OK - Packet loss = 0%, RTA = 0.68 ms [10:30:01] PROBLEM - Host deployment-kafka02 is DOWN: CRITICAL - Host Unreachable (10.68.17.156) [10:33:24] PROBLEM - Puppet staleness on deployment-rsync01 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [43200.0] [10:33:40] PROBLEM - Host deployment-logstash1 is DOWN: CRITICAL - Host Unreachable (10.68.16.134) [10:35:27] 6Release-Engineering, 10MediaWiki-Debug-Logging, 6Security-Team, 6operations, 5Patch-For-Review: Store unsampled API and XFF logs - https://phabricator.wikimedia.org/T88393#1233411 (10fgiunchedi) no I don't think anyone is working on this, I mostly worked on it when on clinic duty, my plate is full alrea... [10:35:56] RECOVERY - Host deployment-kafka02 is UP: PING OK - Packet loss = 0%, RTA = 0.84 ms [10:36:18] RECOVERY - Host deployment-logstash1 is UP: PING OK - Packet loss = 0%, RTA = 47.03 ms [10:36:18] RECOVERY - Puppet failure on deployment-logstash1 is OK: OK: Less than 1.00% above the threshold [0.0] [10:39:55] PROBLEM - Puppet staleness on deployment-kafka02 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [43200.0] [10:48:31] FLAPPINGSTOP - Host deployment-cache-bits01 is UP: PING OK - Packet loss = 0%, RTA = 22.72 ms [10:49:21] PROBLEM - Host deployment-restbase01 is DOWN: PING CRITICAL - Packet loss = 58%, RTA = 3278.81 ms [10:52:55] RECOVERY - Host deployment-restbase01 is UP: PING OK - Packet loss = 0%, RTA = 0.51 ms [10:53:25] PROBLEM - Host deployment-cache-mobile03 is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 2308.11 ms [10:58:24] 10Continuous-Integration, 6Release-Engineering, 6Project-Creators: Create "Continuous-Integration-Config" component - https://phabricator.wikimedia.org/T96908#1233421 (10hashar) The idea is to stop using the too generic `Continuous-Integration` and split requests into two disjunct groups: * Continuous-Integ... [11:03:54] 10Continuous-Integration, 5Patch-For-Review: Disable xdebug's html formatting of PHP errors for Apache on Jenkins slaves - https://phabricator.wikimedia.org/T97040#1233424 (10hashar) XDebug overloads var_dump() to generate HTML formatted stracktraces. From http://xdebug.org/docs/display : > **xdebug.overload_... [11:16:28] 10Continuous-Integration, 10pywikibot-core, 5Patch-For-Review: run pep8 and pep257 for pywikibot/core - https://phabricator.wikimedia.org/T87169#1233436 (10Aklapper) Any decision / movement is welcome here (I repeat myself by saying that this task has "Unbreak now" priority since January 2015 so if this is n... [11:18:15] PROBLEM - Host deployment-memc02 is DOWN: PING CRITICAL - Packet loss = 28%, RTA = 2064.00 ms [11:28:21] 6Release-Engineering, 10Continuous-Integration-Infrastructure, 6Project-Creators: Create "Continuous-Integration-Config" component - https://phabricator.wikimedia.org/T96908#1233444 (10Aklapper) 5Open>3Resolved a:3Aklapper >>! In T96908#1233421, @hashar wrote: > So potentially rename #Continuous-Integr... [11:28:30] 10Beta-Cluster, 10Analytics-EventLogging: EventLogging schemas are not served properly on beta cluster - https://phabricator.wikimedia.org/T97047#1233447 (10Tgr) Weird, I distinctly remember I did that. 
[12:02:12] FLAPPINGSTART - Host deployment-fluoride is UP: PING OK - Packet loss = 0%, RTA = 250.74 ms [12:04:24] FLAPPINGSTOP - Host deployment-cache-mobile03 is DOWN: PING CRITICAL - Packet loss = 64%, RTA = 6776.60 ms [12:08:24] RECOVERY - Host deployment-cache-mobile03 is UP: PING OK - Packet loss = 0%, RTA = 110.26 ms [12:13:22] PROBLEM - Host deployment-cache-mobile03 is DOWN: CRITICAL - Host Unreachable (10.68.16.13) [12:14:14] RECOVERY - Host deployment-cache-mobile03 is UP: PING OK - Packet loss = 0%, RTA = 151.54 ms [12:28:23] FLAPPINGSTART - Host deployment-cache-mobile03 is UP: PING OK - Packet loss = 0%, RTA = 0.78 ms [12:39:38] FLAPPINGSTOP - Host deployment-mx is UP: PING WARNING - Packet loss = 0%, RTA = 599.15 ms [12:40:36] PROBLEM - App Server Main HTTP Response on deployment-mediawiki03 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:46:34] Yippee, build fixed! [12:46:35] Project UploadWizard-api-commons.wikimedia.beta.wmflabs.org build #1828: FIXED in 34 sec: https://integration.wikimedia.org/ci/job/UploadWizard-api-commons.wikimedia.beta.wmflabs.org/1828/ [12:50:29] RECOVERY - App Server Main HTTP Response on deployment-mediawiki03 is OK: HTTP OK: HTTP/1.1 200 OK - 48267 bytes in 1.408 second response time [12:53:13] FLAPPINGSTOP - Host deployment-memc02 is UP: PING OK - Packet loss = 0%, RTA = 0.76 ms [12:54:42] PROBLEM - Host deployment-mx is DOWN: CRITICAL - Host Unreachable (10.68.17.78) [12:55:00] RECOVERY - Host deployment-mx is UP: PING OK - Packet loss = 0%, RTA = 0.46 ms [13:00:15] RECOVERY - Host deployment-mx is UP: PING OK - Packet loss = 0%, RTA = 0.46 ms [13:01:30] PROBLEM - App Server Main HTTP Response on deployment-mediawiki03 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1532 bytes in 3.291 second response time [13:07:17] Project browsertests-PageTriage-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce build #524: FAILURE in 1 min 17 sec: https://integration.wikimedia.org/ci/job/browsertests-PageTriage-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce/524/ [13:08:30] FLAPPINGSTOP - Host deployment-cache-bits01 is UP: PING OK - Packet loss = 0%, RTA = 0.58 ms [13:09:20] 10Browser-Tests, 6Collaboration-Team, 10Collaboration-Team-Sprint-A-2015-05-06, 10Flow, 5Patch-For-Review: A5. Fix failed Flow browsertests Jenkins jobs - https://phabricator.wikimedia.org/T94153#1233562 (10SBisson) [13:09:23] 10Beta-Cluster, 6Release-Engineering, 10Continuous-Integration-Infrastructure, 10Parsoid: Parsoid patches don't update Beta Cluster automatically -- only deploy repo patches seem to update that code - https://phabricator.wikimedia.org/T92871#1233563 (10SBisson) [13:19:14] FLAPPINGSTOP - Host deployment-cache-mobile03 is UP: PING WARNING - Packet loss = 0%, RTA = 1918.99 ms [13:21:56] 6Release-Engineering, 10MediaWiki-Debug-Logging, 6Security-Team, 6operations, 5Patch-For-Review: Store unsampled API and XFF logs - https://phabricator.wikimedia.org/T88393#1233579 (10Anomie) Is there anything that actually //needs// doing besides just removing the 'sample' from the 'api' entry in wmgMon... [13:28:47] 10Beta-Cluster, 6Release-Engineering, 10Continuous-Integration-Infrastructure, 10Parsoid: Parsoid patches don't update Beta Cluster automatically -- only deploy repo patches seem to update that code - https://phabricator.wikimedia.org/T92871#1233581 (10greg) Unrelated known problem. [13:33:14] I’ll be rebooting yet more deployment instances today, in attempt to resolve the flapping issue. Most outages should only last a couple of minutes. 
[13:49:36] PROBLEM - Host deployment-mx is DOWN: CRITICAL - Host Unreachable (10.68.17.78) [13:50:00] RECOVERY - Host deployment-mx is UP: PING OK - Packet loss = 0%, RTA = 42.10 ms [14:00:07] PROBLEM - App Server Main HTTP Response on deployment-mediawiki02 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:08:04] PROBLEM - Puppet failure on deployment-mediawiki01 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [14:13:24] PROBLEM - Host deployment-cache-mobile03 is DOWN: PING CRITICAL - Packet loss = 28%, RTA = 2056.55 ms [14:14:14] RECOVERY - Host deployment-cache-mobile03 is UP: PING OK - Packet loss = 0%, RTA = 0.59 ms [14:16:19] PROBLEM - Host deployment-logstash1 is DOWN: CRITICAL - Host Unreachable (10.68.16.134) [14:19:42] (03PS1) 10Hashar: Beta: Add wikis for ContentTranslation [integration/config] - 10https://gerrit.wikimedia.org/r/206389 (https://phabricator.wikimedia.org/T90683) [14:20:49] (03CR) 10Hashar: [C: 04-2] "Hold until we can confirm that each database has been created and properly configured or the update database job will fail ( https://integ" [integration/config] - 10https://gerrit.wikimedia.org/r/206389 (https://phabricator.wikimedia.org/T90683) (owner: 10Hashar) [14:23:16] PROBLEM - Host deployment-mediawiki01 is DOWN: PING CRITICAL - Packet loss = 27%, RTA = 7161.55 ms [14:23:29] 10Beta-Cluster, 7Blocked-on-RelEng, 10ContentTranslation-Deployments, 10MediaWiki-extensions-ContentTranslation, and 3 others: Setup new wikis in Beta Cluster for Content Translation - https://phabricator.wikimedia.org/T90683#1233674 (10hashar) The wiki page https://wikitech.wikimedia.org/wiki/Nova_Resour... [14:25:08] PROBLEM - Host deployment-cache-mobile03 is DOWN: CRITICAL - Host Unreachable (10.68.16.13) [14:26:02] RECOVERY - Host deployment-mediawiki01 is UP: PING OK - Packet loss = 0%, RTA = 296.79 ms [14:26:23] PROBLEM - Host deployment-upload is DOWN: CRITICAL - Host Unreachable (10.68.16.189) [14:26:59] PROBLEM - Host deployment-elastic07 is DOWN: CRITICAL - Host Unreachable (10.68.17.187) [14:27:01] PROBLEM - Host deployment-parsoidcache02 is DOWN: CRITICAL - Host Unreachable (10.68.16.145) [14:27:05] PROBLEM - Host deployment-restbase02 is DOWN: CRITICAL - Host Unreachable (10.68.17.189) [14:27:09] (03PS1) 10Soeren.oldag: Added job for WikidataQuality extension. [integration/config] - 10https://gerrit.wikimedia.org/r/206392 [14:28:03] PROBLEM - Host deployment-memc02 is DOWN: CRITICAL - Host Unreachable (10.68.16.14) [14:29:03] Project browsertests-Wikidata-WikidataTests-linux-chrome-sauce build #1: FAILURE in 2.6 sec: https://integration.wikimedia.org/ci/job/browsertests-Wikidata-WikidataTests-linux-chrome-sauce/1/ [14:29:19] (03CR) 10jenkins-bot: [V: 04-1] Added job for WikidataQuality extension. [integration/config] - 10https://gerrit.wikimedia.org/r/206392 (owner: 10Soeren.oldag) [14:29:29] 10Beta-Cluster, 10CirrusSearch: JobQueueError Redis server error: Could not insert 1 cirrusSearchLinksUpdatePrioritized job(s). 
- https://phabricator.wikimedia.org/T97130#1233688 (10hashar) 3NEW [14:30:21] RECOVERY - Host deployment-elastic07 is UP: PING OK - Packet loss = 0%, RTA = 265.74 ms [14:30:51] RECOVERY - Host deployment-restbase02 is UP: PING OK - Packet loss = 0%, RTA = 91.18 ms [14:30:55] RECOVERY - Host deployment-logstash1 is UP: PING OK - Packet loss = 0%, RTA = 200.24 ms [14:31:36] RECOVERY - Host deployment-cache-mobile03 is UP: PING OK - Packet loss = 0%, RTA = 0.73 ms [14:31:40] RECOVERY - Host deployment-upload is UP: PING OK - Packet loss = 0%, RTA = 0.72 ms [14:31:58] RECOVERY - Host deployment-parsoidcache02 is UP: PING OK - Packet loss = 0%, RTA = 0.87 ms [14:32:02] RECOVERY - Host deployment-memc02 is UP: PING OK - Packet loss = 0%, RTA = 0.87 ms [14:32:24] PROBLEM - Host deployment-db2 is DOWN: PING CRITICAL - Packet loss = 44%, RTA = 4590.46 ms [14:33:17] magic [14:33:34] more to come :( [14:37:01] andrewbogott: ah that is you rebooting them right ? [14:37:08] yeah [14:37:20] RECOVERY - Host deployment-db2 is UP: PING OK - Packet loss = 0%, RTA = 0.61 ms [14:37:27] we had a bunch of up/down spam all the night [14:37:39] I guess that was due to the automatic script :D [14:37:56] RECOVERY - Puppet failure on deployment-memc02 is OK: OK: Less than 1.00% above the threshold [0.0] [14:38:02] RECOVERY - Puppet failure on deployment-mediawiki01 is OK: OK: Less than 1.00% above the threshold [0.0] [14:39:25] andrewbogott: or it just Shinken being crazy cause the host up / down alarms above don't make much sense [14:39:41] deployment-db2 is claimed to be up again but it never went down [14:40:50] hashar: both things are happening. Shinken is freaking out because of the stuttering, and I’m also rebooting things to fix the stuttering. [14:41:27] 10Beta-Cluster: Can't connect to Beta Cluster database deployment-db1 or deployment-db2 (MariaDB down) - https://phabricator.wikimedia.org/T96905#1233736 (10thcipriani) Didn't see mysql in `/etc/rc[1-5].d/` anywhere. Added mysql to both deployment-db{1,2] using: sudo update-rc.d mysql defaults I think this... [14:42:31] ahh [14:43:27] hashar: here's the ticket for the instance stuttering: https://phabricator.wikimedia.org/T97033 [14:43:49] it was sad times yesterday getting to the bottom of that :P [14:44:35] PROBLEM - Host deployment-mx is DOWN: CRITICAL - Host Unreachable (10.68.17.78) [14:44:36] thcipriani: good morning [14:44:54] andrewbogott: so should we just reboot ? :D [14:44:56] hashar: good afternoon [14:45:53] hashar: I’m working as fast as I can :) Need to juggle tools nodes to prevent tool outages. [14:47:37] 10Beta-Cluster, 10Deployment-Systems: beta cluster scap failure - https://phabricator.wikimedia.org/T96920#1233776 (10hashar) It seems scap strip the underlying command output. Running mwscript manually I got: ``` Fatal error: Uncaught exception 'LogicException' with message 'Missing stream uri, the stream can... [14:47:56] andrewbogott: sure :) [14:48:39] 10Beta-Cluster, 10CirrusSearch: JobQueueError Redis server error: Could not insert 1 cirrusSearchLinksUpdatePrioritized job(s). - https://phabricator.wikimedia.org/T97130#1233786 (10hashar) The Redis server for jobs is on the [[ https://wikitech.wikimedia.org/wiki/Nova_Resource:I-0000063f.eqiad.wmflabs || depl... [14:49:02] 10Beta-Cluster, 10CirrusSearch: JobQueueError Redis server error: Could not insert 1 cirrusSearchLinksUpdatePrioritized job(s). 
- https://phabricator.wikimedia.org/T97130#1233789 (10hashar) [14:49:04] 10Beta-Cluster: Beta cluster intermittent failures - https://phabricator.wikimedia.org/T97033#1231310 (10hashar) [14:50:34] 10Beta-Cluster: Beta cluster database no more update beta-update-databases-eqiad - https://phabricator.wikimedia.org/T97138#1233796 (10hashar) 3NEW [14:51:04] 10Beta-Cluster: Beta cluster: exception 'LogicException' with message 'Missing stream uri, the stream can not be opened.' in /mnt/srv/mediawiki-staging/php-master/includes/debug/logger/monolog/LegacyHandler.php:113 - https://phabricator.wikimedia.org/T97138#1233803 (10hashar) [14:51:31] 10Beta-Cluster: Beta cluster: exception 'LogicException' with message 'Missing stream uri, the stream can not be opened.' in /mnt/srv/mediawiki-staging/php-master/includes/debug/logger/monolog/LegacyHandler.php:113 - https://phabricator.wikimedia.org/T97138#1233796 (10hashar) [14:51:33] 10Beta-Cluster, 10Deployment-Systems: beta cluster scap failure - https://phabricator.wikimedia.org/T96920#1233805 (10hashar) [14:52:07] thcipriani: are you aware of bd808 monolog change introduced yesterday ? [14:52:18] that causes mwscript to stacktrace on any beta cluster database BUT testwiki [14:52:20] https://phabricator.wikimedia.org/T97138 [14:52:28] seems to be some mediawiki-config related change [14:53:17] hashar: I'm only aware in the most general way [14:54:10] it's weird that testwiki works fine, you're right: it must be a config thing [14:54:28] 10Beta-Cluster: Beta cluster: exception 'LogicException' with message 'Missing stream uri, the stream can not be opened.' in /mnt/srv/mediawiki-staging/php-master/includes/debug/logger/monolog/LegacyHandler.php:113 - https://phabricator.wikimedia.org/T97138#1233811 (10hashar) Might be caused by https://gerrit.w... [14:54:59] RECOVERY - Host deployment-mx is UP: PING OK - Packet loss = 0%, RTA = 0.49 ms [14:55:45] we need those jenkins jobs to spam this channel :) [14:59:56] * thcipriani sighs [15:00:13] what did I break hashar ? [15:00:19] good morning! [15:00:23] o/ [15:00:26] https://phabricator.wikimedia.org/T97138 [15:00:34] some monolog stacktrace being spurts out [15:00:42] for anyway beta cluster wiki BUT the testwiki one [15:00:50] hmmm [15:00:54] looking [15:01:04] PROBLEM - Host deployment-kafka02 is DOWN: PING CRITICAL - Packet loss = 46%, RTA = 6347.12 ms [15:01:11] 'Missing stream uri, the stream can not be opened.' sounds like a message for an expert :] [15:01:25] seems some setting/log dest is missing [15:01:28] empty string being passed to something... [15:01:31] but would be set on testwiki for ome reason [15:01:36] https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/ [15:01:43] that shows all the wgDBname [15:01:51] and testwiki is the sole being green [15:02:01] oh [15:02:13] <^d> bd808: I pinged you in -ops earlier re: search errors. Already tracked in Phab [15:02:16] it is only green since ~ 8 or 9 hours ago [15:02:18] <^d> The one you filed was actually a dupe [15:02:40] ^d: cool. 
something is still running that query constantly [15:02:41] nm, previously testwiki failed because mysql was not available [15:02:49] fatalmonitor is full of it [15:03:02] <^d> bd808: It's basically harmless because of poolcounter [15:03:08] <^d> If annoying [15:03:18] https://logstash.wikimedia.org/#/dashboard/elasticsearch/fatalmonitor [15:03:35] hundreds per minute [15:04:21] <^d> It's https://phabricator.wikimedia.org/T95021 [15:04:32] <^d> Failing on highlighting [15:04:41] nond [15:04:44] *nod* [15:04:55] PROBLEM - SSH on deployment-redis01 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:04:59] RECOVERY - Host deployment-kafka02 is UP: PING OK - Packet loss = 0%, RTA = 0.83 ms [15:05:00] the more concerning thing for me was the volume of identical requests [15:05:18] <^d> Yeah, obviously somebody's trying to hammer it. [15:05:23] while we were having other problems last night it was annoying me [15:05:33] <^d> But PoolCounter should limit their ability to fallout. [15:05:47] for a while I thought it was the hhvm killer. turned out to be parsoid insteead [15:06:00] <^d> It'd be hard for search to take out hhvm [15:06:05] <^d> yay poolcounter ;-) [15:07:56] 10Beta-Cluster: Beta cluster: exception 'LogicException' with message 'Missing stream uri, the stream can not be opened.' in /mnt/srv/mediawiki-staging/php-master/includes/debug/logger/monolog/LegacyHandler.php:113 - https://phabricator.wikimedia.org/T97138#1233827 (10bd808) My guess is that this is related to $... [15:08:33] <^d> bd808: I totally want to see those fixed too, fwiw [15:08:40] <^d> All 3 bugs [15:08:44] I'm sure :) [15:09:40] PROBLEM - Host deployment-elastic06 is DOWN: CRITICAL - Host Unreachable (10.68.17.186) [15:09:46] RECOVERY - SSH on deployment-redis01 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [15:10:07] hashar: doh. I totally see the problem [15:10:21] PROBLEM - Host deployment-restbase01 is DOWN: PING CRITICAL - Packet loss = 33%, RTA = 2554.48 ms [15:10:25] https://gerrit.wikimedia.org/r/#/c/191259/10/wmf-config/CommonSettings-labs.php,unified [15:10:31] lines 22-26 [15:10:37] that happens too late [15:10:45] I think [15:11:09] PROBLEM - Host deployment-mediawiki03 is DOWN: CRITICAL - Host Unreachable (10.68.17.55) [15:11:17] 10Beta-Cluster, 10Deployment-Systems: scap eats underlying commands output (such as maintenance script stacktrace) - https://phabricator.wikimedia.org/T97140#1233830 (10hashar) 3NEW [15:11:44] 10Beta-Cluster, 10Deployment-Systems: beta cluster scap failure - https://phabricator.wikimedia.org/T96920#1229359 (10hashar) scap not showing the maintenance script trace is T97140 [15:12:25] bd808: yeah I had multiple issues with initialization ordering :/ [15:12:39] you can probably confirm on beta cluster by manually moving the code [15:12:51] and confirm the fix running eval.php --wiki=enwiki [15:12:55] PROBLEM - Host integration-saltmaster is DOWN: CRITICAL - Host Unreachable (10.68.18.24) [15:13:03] not sure why testwiki is unaffected though [15:13:20] I can see how it's wrong now. trying to figure out a reasonable fix [15:13:33] PROBLEM - Host deployment-cache-bits01 is DOWN: CRITICAL - Host Unreachable (10.68.16.12) [15:13:56] would it be horrible if both cli and web mwDebug() output ended up in the same file? 
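(A sketch of the per-wiki check suggested above at 15:12, assuming the usual mwscript wrapper on a beta host with the staging tree, e.g. deployment-bastion; extend the list with the other dbnames shown by the beta-update-databases-eqiad job.)

```bash
# Any wiki still affected throws the LogicException from LegacyHandler as
# soon as the maintenance entry point initialises logging, before eval.php
# even reads input.
for wiki in enwiki testwiki; do
    echo "== $wiki =="
    mwscript eval.php --wiki="$wiki" </dev/null 2>&1 \
        | grep -i 'LogicException' || echo "ok"
done
```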
[15:14:07] PROBLEM - Host deployment-parsoid05 is DOWN: CRITICAL - Host Unreachable (10.68.16.120) [15:14:47] I thought it made sense early on [15:14:59] PROBLEM - Host deployment-kafka02 is DOWN: CRITICAL - Host Unreachable (10.68.17.156) [15:15:14] now that we have logstash, it is probably easier to find errors we are looking for [15:15:19] so yeah might make sense to merge them [15:15:35] my typical use case was to tail -f web.log while browsing the site [15:15:48] that dismiss errors from jobs (which are run in cli) [15:15:53] PROBLEM - Host deployment-test is DOWN: CRITICAL - Host Unreachable (10.68.16.149) [15:15:57] 10Beta-Cluster: Beta cluster: exception 'LogicException' with message 'Missing stream uri, the stream can not be opened.' in /mnt/srv/mediawiki-staging/php-master/includes/debug/logger/monolog/LegacyHandler.php:113 - https://phabricator.wikimedia.org/T97138#1233853 (10bd808) Caused by {rOMWC2680380cba022787f19c7... [15:16:39] one day we will have a suite of integration tests for operations/mediawiki-config.git :D [15:17:18] heh. one day we will rewrite all that mess into something that actually makes logical sense [15:17:39] RECOVERY - Host deployment-elastic06 is UP: PING OK - Packet loss = 0%, RTA = 0.84 ms [15:17:47] bd808: there's always a place for you in RelEng :P [15:18:11] RECOVERY - Host deployment-mediawiki03 is UP: PING OK - Packet loss = 0%, RTA = 0.59 ms [15:18:17] RECOVERY - Host deployment-test is UP: PING OK - Packet loss = 0%, RTA = 0.70 ms [15:18:21] <3 greg-g <3 [15:18:30] RECOVERY - Host deployment-parsoid05 is UP: PING OK - Packet loss = 0%, RTA = 0.77 ms [15:18:32] RECOVERY - Host deployment-cache-bits01 is UP: PING OK - Packet loss = 0%, RTA = 0.67 ms [15:19:18] RECOVERY - Host deployment-restbase01 is UP: PING OK - Packet loss = 0%, RTA = 0.75 ms [15:19:26] parsoid [15:19:29] RECOVERY - Host deployment-kafka02 is UP: PING OK - Packet loss = 0%, RTA = 0.80 ms [15:19:35] RECOVERY - Host integration-saltmaster is UP: PING OK - Packet loss = 0%, RTA = 0.59 ms [15:19:49] RECOVERY - Parsoid on deployment-parsoid05 is OK: HTTP OK: HTTP/1.1 200 OK - 1086 bytes in 0.062 second response time [15:20:45] bd808: based on hiera :D *evil* [15:21:01] RECOVERY - SSH on deployment-restbase01 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [15:21:11] bd808: I gotta go, but the Jenkins job should self recover whenever the traces disappear [15:21:18] *nod* [15:21:21] also andrew is rebooting a bunch of Precise instances [15:21:26] so that causes some more side effects [15:21:52] kudos on getting monolog on prod anyway! 
[15:22:34] moving out have a good weekend [15:23:27] PROBLEM - Puppet failure on deployment-cache-bits01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [15:24:53] PROBLEM - Puppet staleness on deployment-kafka02 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [43200.0] [15:25:29] PROBLEM - Puppet failure on deployment-restbase01 is CRITICAL: CRITICAL: 16.67% of data above the critical threshold [0.0] [15:29:38] PROBLEM - Host deployment-mx is DOWN: CRITICAL - Host Unreachable (10.68.17.78) [15:30:00] RECOVERY - Host deployment-mx is UP: PING OK - Packet loss = 0%, RTA = 0.62 ms [15:30:27] thcipriani, ^d: https://gerrit.wikimedia.org/r/#/c/206399/ for your review [15:30:30] RECOVERY - Puppet failure on deployment-restbase01 is OK: OK: Less than 1.00% above the threshold [0.0] [15:31:19] <^d> commit summary longer than actual patch, the way I like 'em :p [15:44:28] PROBLEM - SSH on deployment-cxserver03 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:44:40] PROBLEM - Host deployment-mediawiki01 is DOWN: CRITICAL - Host Unreachable (10.68.17.170) [15:45:00] PROBLEM - Host deployment-mx is DOWN: CRITICAL - Host Unreachable (10.68.17.78) [15:45:34] PROBLEM - Host deployment-cxserver03 is DOWN: CRITICAL - Host Unreachable (10.68.16.150) [15:46:14] PROBLEM - Host deployment-redis01 is DOWN: CRITICAL - Host Unreachable (10.68.16.177) [15:46:18] PROBLEM - Host deployment-sentry2 is DOWN: CRITICAL - Host Unreachable (10.68.17.204) [15:47:18] PROBLEM - Host deployment-db2 is DOWN: CRITICAL - Host Unreachable (10.68.17.94) [15:47:26] PROBLEM - Host deployment-zookeeper01 is DOWN: CRITICAL - Host Unreachable (10.68.17.157) [15:48:33] 6Release-Engineering, 10MediaWiki-Debug-Logging, 6Security-Team, 6operations, 5Patch-For-Review: Store unsampled API and XFF logs - https://phabricator.wikimedia.org/T88393#1233906 (10bd808) My patch in {rOMWC2680380cba022787f19c783a4535d8794ffda8d8} restores unsampled xff logs to fluorine. I left api sa... [15:49:21] FLAPPINGSTOP - Host deployment-fluoride is DOWN: CRITICAL - Host Unreachable (10.68.16.190) [15:51:38] RECOVERY - Host deployment-zookeeper01 is UP: PING OK - Packet loss = 0%, RTA = 0.76 ms [15:51:50] RECOVERY - Host deployment-cxserver03 is UP: PING OK - Packet loss = 0%, RTA = 0.78 ms [15:52:12] RECOVERY - Host deployment-mx is UP: PING OK - Packet loss = 0%, RTA = 0.90 ms [15:52:16] RECOVERY - Host deployment-db2 is UP: PING OK - Packet loss = 0%, RTA = 0.78 ms [15:52:28] RECOVERY - Host deployment-redis01 is UP: PING OK - Packet loss = 0%, RTA = 0.80 ms [15:52:32] <^d> bd808: Did you want that out today? [15:52:50] RECOVERY - Host deployment-mediawiki01 is UP: PING OK - Packet loss = 0%, RTA = 0.87 ms [15:52:55] ^d: It's apparently breaking beta cluster so ... yes! 
[15:53:31] <^d> gogogo [15:53:53] RECOVERY - Host deployment-sentry2 is UP: PING OK - Packet loss = 0%, RTA = 0.74 ms [15:54:19] RECOVERY - SSH on deployment-cxserver03 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [16:13:13] 10Deployment-Systems: scap eats underlying commands output (such as maintenance script stacktrace) - https://phabricator.wikimedia.org/T97140#1233953 (10greg) [16:13:35] PROBLEM - Host deployment-redis02 is DOWN: CRITICAL - Host Unreachable (10.68.16.231) [16:14:01] PROBLEM - Host integration-slave-trusty-1021 is DOWN: CRITICAL - Host Unreachable (10.68.16.17) [16:14:15] PROBLEM - Host deployment-zotero01 is DOWN: CRITICAL - Host Unreachable (10.68.17.102) [16:14:21] PROBLEM - Host deployment-mediawiki02 is DOWN: CRITICAL - Host Unreachable (10.68.16.127) [16:15:43] PROBLEM - Host deployment-elastic05 is DOWN: CRITICAL - Host Unreachable (10.68.17.182) [16:16:17] PROBLEM - Host deployment-parsoid01-test is DOWN: CRITICAL - Host Unreachable (10.68.17.215) [16:16:17] PROBLEM - Host deployment-apertium01 is DOWN: CRITICAL - Host Unreachable (10.68.16.79) [16:16:23] PROBLEM - Host deployment-jobrunner01 is DOWN: CRITICAL - Host Unreachable (10.68.17.96) [16:16:53] PROBLEM - Host deployment-stream is DOWN: CRITICAL - Host Unreachable (10.68.17.106) [16:17:06] :( [16:19:17] shinken spam is a good thing here. once all these this is over and done with, we'll be on nice new kernel version for the virthosts and all will be well. [16:19:28] * thcipriani repeats to myself [16:19:42] "I think I can I think I can" [16:19:52] RECOVERY - Host deployment-apertium01 is UP: PING OK - Packet loss = 0%, RTA = 0.70 ms [16:20:16] RECOVERY - Host deployment-mediawiki02 is UP: PING OK - Packet loss = 0%, RTA = 0.48 ms [16:20:18] RECOVERY - Host deployment-jobrunner01 is UP: PING OK - Packet loss = 0%, RTA = 0.69 ms [16:20:25] http://wittyandpretty.com/wp-content/uploads/2014/04/little-engine-that-literally-cant-even.jpg [16:20:26] RECOVERY - Host deployment-zotero01 is UP: PING OK - Packet loss = 0%, RTA = 0.71 ms [16:20:32] RECOVERY - Host integration-slave-trusty-1021 is UP: PING OK - Packet loss = 0%, RTA = 0.79 ms [16:20:36] RECOVERY - Host deployment-stream is UP: PING OK - Packet loss = 0%, RTA = 0.70 ms [16:20:38] RECOVERY - Host deployment-redis02 is UP: PING OK - Packet loss = 0%, RTA = 0.75 ms [16:20:46] RECOVERY - Host deployment-elastic05 is UP: PING OK - Packet loss = 0%, RTA = 0.64 ms [16:20:50] RECOVERY - Host deployment-parsoid01-test is UP: PING OK - Packet loss = 0%, RTA = 0.85 ms [16:21:02] that was labvirt1004 I guess: 16:18 < icinga-wm> RECOVERY - Host labvirt1004 is UPING OK - Packet loss = 0%, RTA = 2.13 ms [16:29:02] PROBLEM - Puppet failure on deployment-elastic05 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [16:29:06] PROBLEM - Puppet failure on deployment-zotero01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [16:32:29] thcipriani: that’s all of them. [16:32:42] So… now your job is to form an opinion about whether or not things still suck :) [16:33:14] andrewbogott: awesome. I'll start digging through hosts. The ones rebooted yesterday seemed to stay solid overnight. [16:33:38] lack of shinken spam will also be a positive indicator [16:33:52] FLAPPINGSTART [16:34:03] RECOVERY - Puppet failure on deployment-elastic05 is OK: OK: Less than 1.00% above the threshold [0.0] [16:43:23] Is the backlog at #Wikimedia-log-errors decreasing? 
:) [16:45:24] !log rm stale /var/lib/puppet/state/agent_catalog_run.lock on deployment-kafka02 [16:47:07] oh, huh, no logsbot [16:51:53] hmm, deployment-lucid-salt is still un-ssh-able [16:54:43] I’ll look [16:54:47] wait, /lucid/? [16:54:48] dang [16:55:05] I just kicked it [16:55:40] andrewbogott: would the log bot for this channel have been affected by the reboots? [16:55:52] thcipriani: Maybe, but I don’t know what logbot you use [16:55:54] what was it called? [16:57:48] ah, qa-morebots, sorry, had to dig through logs... [16:57:52] may have been down a while [16:58:07] hm, I just restarted it, I’ll try again [16:58:58] qa-morebots: what’s up? [16:58:58] I am a logbot running on tools-exec-02. [16:58:58] Messages are logged to https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL. [16:58:58] To log a message, type !log . [16:59:55] RECOVERY - Puppet staleness on deployment-kafka02 is OK: OK: Less than 1.00% above the threshold [3600.0] [17:00:12] !log rm stale /var/lib/puppet/state/agent_catalog_run.lock on deployment-kafka02 [17:00:14] Logged the message, Master [17:00:19] neat. Thanks [17:03:29] huh, dkim keys missing from deployment-salt [17:05:54] thcipriani: you first noticed the problem at, what, 7pm PST on Wednesday? [17:06:01] * andrewbogott is documenting [17:06:08] Is the backlog at #Wikimedia-log-errors decreasing? :) [17:06:10] I doubt it [17:06:33] andrewbogott: yup, that's when the shinken started screaming about it [17:06:38] ok, thanks [17:06:43] I didn't notice anything until the following morning, FWIW [17:07:46] PROBLEM - Free space - all mounts on deployment-bastion is CRITICAL: CRITICAL: deployment-prep.deployment-bastion.diskspace._var.byte_percentfree (<30.00%) [17:08:02] Krenair: thanks. Is someone checking whether the old errors keep occurring? [17:08:14] Not actively [17:10:22] !log gzip /var/log/account/pacct.0 on deployment-bastion: ought to revisit logrotate on that instance. [17:10:24] Logged the message, Master [17:20:38] !log rm stale lock on deployment-rsync01, puppet fine [17:20:40] Logged the message, Master [17:22:41] RECOVERY - Free space - all mounts on deployment-bastion is OK: OK: All targets OK [17:24:55] Yippee, build fixed! [17:24:55] Project browsertests-CentralNotice-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #296: FIXED in 48 sec: https://integration.wikimedia.org/ci/job/browsertests-CentralNotice-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/296/ [17:26:48] !log remove deployment-prep from domain in /etc/puppet/puppet.conf on deployment-stream, puppet now OK [17:26:50] Logged the message, Master [17:28:23] RECOVERY - Puppet staleness on deployment-rsync01 is OK: OK: Less than 1.00% above the threshold [3600.0] [17:32:27] 10Beta-Cluster: Can't connect to Beta Cluster database deployment-db1 or deployment-db2 (MariaDB down) - https://phabricator.wikimedia.org/T96905#1234147 (10hashar) to be verified, it is well possible that in production we intentionally prevent mysql from starting manually. Either via the deb package or puppet.... [17:35:32] andre__: I've changed #continuous-integration-infra additional hashtags to lowercase, otherwise the url redirect doesn't work (it normalises to lowercase) [17:35:40] e.g. https://phabricator.wikimedia.org/tag/continuous-integration/ didn't work, it does now [17:35:57] https://phabricator.wikimedia.org/project/profile/401/ [17:52:00] Yippee, build fixed! 
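For context on the stale-lock cleanups logged above (deployment-kafka02 and deployment-rsync01): on Puppet 3 hosts the agent_catalog_run.lock file normally holds the PID of an in-progress agent run, so a rough way to confirm a lock really is stale before removing it looks something like the sketch below (a sketch only; the vardir path is the stock Puppet 3 default seen in the log and may differ per instance).

    # Sketch: confirm the agent lock is stale before removing it.
    LOCK=/var/lib/puppet/state/agent_catalog_run.lock
    if [ -f "$LOCK" ]; then
        pid=$(cat "$LOCK")                       # the lock normally records the PID of the running agent
        if [ -n "$pid" ] && ps -p "$pid" > /dev/null 2>&1; then
            echo "agent run $pid still in progress; leaving the lock alone"
        else
            echo "no live process for '$pid'; lock is stale"
            sudo rm -f "$LOCK"
            sudo puppet agent --test             # re-run to confirm the node converges again
        fi
    fi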
[17:52:01] Project browsertests-CentralNotice-en.wikipedia.beta.wmflabs.org-windows_7-internet_explorer-10-sauce build #228: FIXED in 57 sec: https://integration.wikimedia.org/ci/job/browsertests-CentralNotice-en.wikipedia.beta.wmflabs.org-windows_7-internet_explorer-10-sauce/228/ [17:57:43] 10Beta-Cluster, 5Patch-For-Review: Beta cluster: exception 'LogicException' with message 'Missing stream uri, the stream can not be opened.' in /mnt/srv/mediawiki-staging/php-master/includes/debug/logger/monolog/LegacyHandler.php:113 - https://phabricator.wikimedia.org/T97138#1234249 (10bd808) Aaarrrrgh! More... [17:57:49] 10Beta-Cluster, 5Patch-For-Review: Beta cluster: exception 'LogicException' with message 'Missing stream uri, the stream can not be opened.' in /mnt/srv/mediawiki-staging/php-master/includes/debug/logger/monolog/LegacyHandler.php:113 - https://phabricator.wikimedia.org/T97138#1234250 (10bd808) a:3bd808 [17:58:04] 10Beta-Cluster, 5Patch-For-Review: Beta cluster: exception 'LogicException' with message 'Missing stream uri, the stream can not be opened.' in /mnt/srv/mediawiki-staging/php-master/includes/debug/logger/monolog/LegacyHandler.php:113 - https://phabricator.wikimedia.org/T97138#1233796 (10bd808) p:5Triage>3Un... [17:59:04] Yippee, build fixed! [17:59:04] Project browsertests-Math-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #519: FIXED in 1 min 3 sec: https://integration.wikimedia.org/ci/job/browsertests-Math-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/519/ [18:01:28] ^d: One more tweak for beta logging -- https://gerrit.wikimedia.org/r/#/c/206414/ [18:01:31] !log ran sudo chown -R mwdeploy:mwdeploy /srv/mediawiki on deployment-bastion to fix beta-scap-eqiad, hopefully [18:01:34] Logged the message, Master [18:01:36] 10Deployment-Systems, 6Services: Evaluate Ansible as a deployment tool - https://phabricator.wikimedia.org/T93433#1234264 (10RyanLane) There's a team that's working on deployment right? Are you a member of that team @GWicke? Proposing an alternative outside of that team means you're actively fighting them, mak... [18:01:56] thcipriani: yuck. any idea how that got messed up? [18:02:22] bd808: not at all, actually, it looked like everything was owned by mwdeploy under that directory [18:02:36] but once I ran the chown, sync-common worked [18:02:55] weird [18:02:56] whereas before I got rsync: mkstemp "/srv/mediawiki/.wikiversions-labs.cdb.o2VOpX" failed: Permission denied (13) [18:03:28] oh. top level dir permissions? That could be something in puppet [18:06:52] 10Deployment-Systems, 6Services: Evaluate Ansible as a deployment tool - https://phabricator.wikimedia.org/T93433#1234285 (10RyanLane) As for Ansible itself, see my quite extensive blog post on this: http://ryandlane.com/blog/2014/08/04/moving-away-from-puppet-saltstack-or-ansible/ Ansible using SSH always l... [18:06:55] maybe, well, yay! that fixed that problem in jenkins, now we get a new error: https://phabricator.wikimedia.org/P556 [18:08:58] thcipriani: I have a patch for that. https://gerrit.wikimedia.org/r/#/c/206414/ [18:09:09] heh, I was just looking at that :) [18:10:27] !log cvn Promited Rxy from member to projectadmin [18:10:29] Logged the message, Master [18:14:54] bd808: oic what's happening, okie doke, merging [18:16:50] I didn't notice the function wrapper when I rearranged in the prior patch. [18:16:59] goofy config system is goofy [18:19:05] dat config system tho. 
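A rough reconstruction of the chown fix logged above, for the record (not the canonical procedure): rsync's mkstemp "Permission denied" means it could not create a temp file directly under /srv/mediawiki, so the top-level directory itself, not just the files below it, has to be writable by mwdeploy. The sync-common re-run as mwdeploy is an assumption about how the fix was verified.

    ls -ld /srv/mediawiki                          # owner/group of the directory itself, not its contents
    sudo chown -R mwdeploy:mwdeploy /srv/mediawiki
    sudo -u mwdeploy sync-common                   # assumption: re-running the sync this way mirrors the Jenkins job's check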
[18:19:56] 6Release-Engineering, 10Continuous-Integration-Config: Rewrite beta-update-databases to not use unstable Configuration Matrix - https://phabricator.wikimedia.org/T96199#1234360 (10Krinkle) [18:20:09] 10Beta-Cluster, 6Release-Engineering, 10Continuous-Integration-Config: Rewrite beta-update-databases to not use unstable Configuration Matrix - https://phabricator.wikimedia.org/T96199#1210871 (10Krinkle) [18:21:26] bd808: blerg. Same error :( [18:21:52] did the config update job run yet? [18:21:57] 10Beta-Cluster, 6Release-Engineering, 10Continuous-Integration-Config: Send beta cluster Jenkins alerts to betacluster-alert list - https://phabricator.wikimedia.org/T1125#1234391 (10Krinkle) [18:22:02] 10Beta-Cluster: Beta cluster intermittent failures - https://phabricator.wikimedia.org/T97033#1234392 (10coren) This //should// be all fixed now; I'm not seeing the intermittent VM stalls anymore and all kernels have been upgraded to the fixed kernel. [18:22:24] 10Browser-Tests, 6Collaboration-Team, 10Continuous-Integration-Config, 10Flow, 7Easy: send Flow browser test job notices to #wikimedia-corefeatures channel - https://phabricator.wikimedia.org/T66103#1234393 (10Krinkle) [18:22:25] bd808: yeah, can see the update on /srv/mediawiki-staging [18:22:33] grrr [18:22:49] I see it too... so back to hunting [18:28:15] So $wmfUdp2logDest is set on line 121 of CommonSettings.php; InitializeSettings is loaded on line 169 [18:28:23] * bd808 scratches head [18:30:36] oh ffs [18:30:44] there is yet another sub function [18:33:05] (03PS1) 10Krinkle: Copy LocalSettings.php to "/log" in teardown instead of setup [integration/jenkins] - 10https://gerrit.wikimedia.org/r/206422 (https://phabricator.wikimedia.org/T90613) [18:37:21] Krinkle, uh, wasn't aware. Thank you! [18:41:57] (03CR) 10Krinkle: [C: 032] Copy LocalSettings.php to "/log" in teardown instead of setup [integration/jenkins] - 10https://gerrit.wikimedia.org/r/206422 (https://phabricator.wikimedia.org/T90613) (owner: 10Krinkle) [18:43:37] (03Merged) 10jenkins-bot: Copy LocalSettings.php to "/log" in teardown instead of setup [integration/jenkins] - 10https://gerrit.wikimedia.org/r/206422 (https://phabricator.wikimedia.org/T90613) (owner: 10Krinkle) [18:44:29] thcipriani: how's the beta scap job now? [18:44:43] gonna be fixed here after this merge, I reckon [18:45:01] thcipriani: yay [18:47:19] looks promising https://integration.wikimedia.org/ci/job/beta-scap-eqiad/50358/console [18:47:29] 6Release-Engineering, 10Continuous-Integration-Infrastructure, 5Patch-For-Review, 7Tracking: doc.wikimedia.org: Generate documentation for release tags (tracking) - https://phabricator.wikimedia.org/T73062#771415 (10Krinkle) [18:51:19] woot [18:51:43] that l10nupdate will take a while as long as this has been stuck [18:51:48] probably 15-20 minutes [18:52:01] slow staging server is slow [18:52:46] oh shit, hopefully jenkins doesn't kill it due to the timeout limit.... [18:53:41] Yippee, build fixed! [18:53:42] Project browsertests-CentralNotice-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce build #297: FIXED in 41 sec: https://integration.wikimedia.org/ci/job/browsertests-CentralNotice-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce/297/ [18:57:17] (03PS1) 10Krinkle: Make sure archive-log-dir is consistently after mw-teardown [integration/config] - 10https://gerrit.wikimedia.org/r/206427 (https://phabricator.wikimedia.org/T90613) [18:57:32] forgot to reprime the key on dep-bastion! 
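On the key re-priming mentioned above: keyholder is the shared ssh-agent that scap deploys through, and assuming the deployment bastion uses the same keyholder wrapper as the production deploy hosts (an assumption; the subcommands and socket path below are not verified against the 2015 beta setup), re-arming it after a reboot looks roughly like this.

    sudo keyholder arm        # prompts for the passphrase and loads the deploy key into the shared agent
    sudo keyholder status     # assumption: reports whether the agent is currently armed
    SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh -l mwdeploy deployment-mediawiki01 true   # smoke test; socket path is an assumption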
[18:57:53] :( [19:01:44] (03CR) 10Krinkle: [C: 032] Make sure archive-log-dir is consistently after mw-teardown [integration/config] - 10https://gerrit.wikimedia.org/r/206427 (https://phabricator.wikimedia.org/T90613) (owner: 10Krinkle) [19:03:36] (03Merged) 10jenkins-bot: Make sure archive-log-dir is consistently after mw-teardown [integration/config] - 10https://gerrit.wikimedia.org/r/206427 (https://phabricator.wikimedia.org/T90613) (owner: 10Krinkle) [19:12:18] Yippee, build fixed! [19:12:18] Project beta-scap-eqiad build #50359: FIXED in 14 min: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/50359/ [19:12:27] \o/ [19:12:46] yay! [19:13:21] alright, anything else burning? [19:13:28] there ought to be an upstart job for the keyholder thing. Or maybe there is and it's gone screwy. [19:13:54] greg-g: I don't think there's anything wrong with beta right now that hasn't been. [19:14:10] 10Beta-Cluster, 5Patch-For-Review: Beta cluster: exception 'LogicException' with message 'Missing stream uri, the stream can not be opened.' in /mnt/srv/mediawiki-staging/php-master/includes/debug/logger/monolog/LegacyHandler.php:113 - https://phabricator.wikimedia.org/T97138#1234615 (10greg) 5Open>3Resolve... [19:14:11] 10Beta-Cluster, 10Deployment-Systems: beta cluster scap failure - https://phabricator.wikimedia.org/T96920#1234617 (10greg) [19:14:28] thcipriani: alright then, let's call it a week! ;) [19:15:37] thcipriani: what about https://phabricator.wikimedia.org/T96905 ? [19:16:48] 10Beta-Cluster, 10Analytics-EventLogging: EventLogging schemas are not served properly on beta cluster - https://phabricator.wikimedia.org/T97047#1234623 (10Tgr) 5Open>3Resolved a:3Tgr Seems fixed, presumably due to work done in T97033. [19:16:49] greg-g: yeah, that should be resolved. Don't know about followup tickets for deciding about adding mysql to /etc/rc[x].d/ in puppet [19:17:08] 10Beta-Cluster, 10Deployment-Systems: beta cluster scap failure - https://phabricator.wikimedia.org/T96920#1234628 (10greg) a:5mmodell>3thcipriani https://integration.wikimedia.org/ci/job/beta-scap-eqiad/50359/ Yay! [19:18:22] the problem was mainly the stuttering instances + not having mysql start at boot + me not manually fixing that until after beta was stable [19:18:34] * greg-g nods [19:18:42] * greg-g doesn't like that manual word [19:19:22] mariadb puppet stuff definitely has some different needs in prod vs labs :( [19:20:58] Yippee, build fixed! [19:20:59] Project beta-update-databases-eqiad build #9149: FIXED in 58 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/9149/ [19:28:43] 10Beta-Cluster: Beta cluster intermittent failures - https://phabricator.wikimedia.org/T97033#1234661 (10thcipriani) 5Open>3Resolved [19:36:05] oh Jenkins beta jobs have been fixed! [19:36:34] thcipriani: congratulations! [19:37:20] hashar: thanks, although, mostly what I did was fret a lot about them :) [19:38:47] 10Beta-Cluster: Beta cluster intermittent failures - https://phabricator.wikimedia.org/T97033#1234698 (10hashar) [19:38:49] 10Beta-Cluster, 10CirrusSearch: JobQueueError Redis server error: Could not insert 1 cirrusSearchLinksUpdatePrioritized job(s). - https://phabricator.wikimedia.org/T97130#1234695 (10hashar) 5Open>3Resolved a:3hashar Seems redis is back up just fine now. [19:40:26] thcipriani: have you figured out why MySQL doesn't start on boot ? 
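One way to answer that question on a trusty/sysvinit instance, using only stock Debian tooling (a sketch; whether the links were removed by the package or by a puppet manifest still has to be checked in the mariadb module itself):

    ls -l /etc/rc2.d/ | grep -i mysql            # S?? link = starts at boot, K?? link = explicitly disabled
    ls /etc/init/mysql.conf 2>/dev/null          # an upstart job, if present, would override the sysvinit links on trusty
    sudo update-rc.d mysql defaults              # recreate the rc links if they were removed entirely
    sudo update-rc.d mysql enable                # or flip existing K links back to S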
[19:40:38] that is most probably the intention in prod [19:40:56] but the $1 question is whether it is done in the deb package or some puppet manifest [19:41:20] hashar: no I haven't checked with any opsen [19:43:58] Sorry I broke beta logging so badly guys. I was still seeing logs into logstash and never thought to check jenkins jobs :( [19:44:41] shit happens! [19:44:50] at least it did not land on prod hehe [19:45:17] maybe the jenkins job should keep spamming errors here until the job is fixed [19:47:08] hashar: so, a quick install of mariadb-server on 14.04 does create a symlink in /etc/rc1.d [19:47:39] thcipriani: maybe Sean removed them manually [19:47:41] using the trusty repos [19:48:00] lemme try with the wikimedia repos [19:48:24] don't waste too much time on it though [19:49:09] 10Beta-Cluster, 7Monitoring: Beta cluster: monitor MySQL on deployment-db1 and deployment-db2 - https://phabricator.wikimedia.org/T97120#1234740 (10hashar) [19:49:10] 10Beta-Cluster: Setup monitoring for database servers in beta cluster - https://phabricator.wikimedia.org/T87093#1234741 (10hashar) [19:49:57] 10Beta-Cluster: Setup monitoring for database servers in beta cluster - https://phabricator.wikimedia.org/T87093#983793 (10hashar) From T97120 The beta cluster MySQL servers turned out to be down for a few hours (T96905) and there is no monitoring for it. We would need on both instances (deployment-db1 and dep... [19:50:59] hashar, was mysql really down? [19:51:09] I logged into both of those instances and checked service mysql status at the time [19:51:17] Krenair: yes was not running [19:51:26] at least when I looked at it [19:56:19] 10Browser-Tests, 3Gather Sprint Forward, 6Mobile-Web, 10Mobile-Web-Sprint-45-Snakes-On-A-Plane, 5Patch-For-Review: Fix failed MobileFrontend browsertests Jenkins jobs - https://phabricator.wikimedia.org/T94156#1234791 (10hashar) [19:56:22] 10Beta-Cluster, 10Deployment-Systems: beta cluster scap failure - https://phabricator.wikimedia.org/T96920#1234789 (10hashar) 5Open>3Resolved Magically fixed when T97138 got fixed :) [19:58:42] 5Continuous-Integration-Isolation, 10Continuous-Integration-Infrastructure, 6operations, 7Nodepool, and 2 others: Create a Debian package for NodePool on Debian Jessie - https://phabricator.wikimedia.org/T89142#1234813 (10hashar) We now have a preliminary Debian package which is good enough. We will improv... [19:59:42] 5Continuous-Integration-Isolation, 10Continuous-Integration-Infrastructure, 6operations, 7Nodepool: Create a Debian package for NodePool on Debian Jessie - https://phabricator.wikimedia.org/T89142#1234818 (10hashar) p:5Normal>3Low [20:03:11] 10Beta-Cluster: Can't connect to Beta Cluster database deployment-db1 or deployment-db2 (MariaDB down) - https://phabricator.wikimedia.org/T96905#1234868 (10Jdforrester-WMF) [20:09:48] hashar: I updated your cache_no_hardlinks patch for Zuul and tested it on a depooled slave with git cache enabled locally. Working fine! Clones mediawiki core in 30 seconds. [20:12:46] Krinkle: ohhhhhh [20:12:53] Krinkle: that is quite an old patch, isn't it? [20:13:33] Krinkle: now we have a debian package, it should be fairly trivial to incorporate that patch in our .deb and roll it everywhere [20:15:09] hashar: It's taking rather long for upstream to merge patches... 
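Background for the Zuul git-cache numbers above: when git clones from a plain local path it hard-links the object files instead of copying them, which is where most of the speedup comes from. A quick way to see the difference; the cache path below is illustrative only, not the slaves' actual layout.

    time git clone /srv/git/mediawiki/core.git /tmp/core-hardlinked             # local path: objects get hard-linked
    time git clone --no-hardlinks /srv/git/mediawiki/core.git /tmp/core-copied  # force a full copy, for comparison
    find /tmp/core-hardlinked/.git/objects -name '*.pack' -exec stat -c '%h %n' {} +   # a link count > 1 confirms the objects are shared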
[20:20:15] Krinkle: yeah I am not sure why :/ [20:20:43] might be worth poking them on their openstack-infra mailing list [20:22:46] 5Continuous-Integration-Isolation: Instances created by Nodepool cant run puppet due to missing certificate - https://phabricator.wikimedia.org/T96670#1234939 (10hashar) [20:23:36] I am off [20:23:41] weekend at last [20:30:29] PROBLEM - Free space - all mounts on deployment-eventlogging02 is CRITICAL: CRITICAL: deployment-prep.deployment-eventlogging02.diskspace._var.byte_percentfree (<30.00%) WARN: deployment-prep.deployment-eventlogging02.diskspace.root.byte_percentfree (<100.00%) [20:43:47] 10Browser-Tests, 6Release-Engineering, 5Patch-For-Review: Use rspec-expectations "expect" syntax instead of "should" syntax - https://phabricator.wikimedia.org/T68369#1235037 (10Physikerwelt) [21:17:25] PROBLEM - Puppet staleness on deployment-eventlogging02 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [43200.0] [21:27:41] 10Deployment-Systems: scap eats underlying commands output (such as maintenance script stacktrace) - https://phabricator.wikimedia.org/T97140#1235181 (10mmodell) it's got "--output" to a temp file - is the file empty? [21:32:03] 10Deployment-Systems: scap eats underlying commands output (such as maintenance script stacktrace) - https://phabricator.wikimedia.org/T97140#1235203 (10bd808) The stdout/stderr of the proc should be going to the logger at debug level. Is jenkins running with the `--verbose` flag? I can't remember if I got that... [21:45:40] 10Beta-Cluster, 10Sentry, 10Wikimedia-Logstash: Channel PHP errors from Logstash to Sentry on the beta cluster - https://phabricator.wikimedia.org/T85239#1235275 (10matmarex) [22:08:35] 10Staging, 3releng-201415-Q3: [Quarterly Success Metric] Stable uptime metrics of the Staging cluster - https://phabricator.wikimedia.org/T88705#1235332 (10greg) >>! In T88705#1156024, @mmodell wrote: > [[ https://graphite.wmflabs.org//render?width=600&from=-8hours&until=now&height=400&target=cactiStyle%28alia... [22:12:56] 10Beta-Cluster, 5Patch-For-Review, 15User-Bd808-Test: Beta cluster: exception 'LogicException' with message 'Missing stream uri, the stream can not be opened.' in /mnt/srv/mediawiki-staging/php-master/includes/debug/logger/monolog/LegacyHandler.php:113 - https://phabricator.wikimedia.org/T97138#1235340 (10bd8... [22:17:37] 10Staging, 3releng-201415-Q3: [Quarterly Success Metric] Stable uptime metrics of the Staging cluster - https://phabricator.wikimedia.org/T88705#1235342 (10mmodell) @greg: https://graphite.wmflabs.org/dashboard/#availability should work now [22:20:00] twentyafterfour: thanks! [22:20:17] 10Staging, 3releng-201415-Q3: [Quarterly Success Metric] Stable uptime metrics of the Staging cluster - https://phabricator.wikimedia.org/T88705#1235357 (10mmodell) [22:38:15] greg-g: so when I run jouncebot locally it works just fine :/ [22:38:45] Not sure what is making it sad running from tool labs [22:39:10] * bd808 will try restarting again (definition of insanity?) [22:59:58] Yippee, build fixed! 
[22:59:58] Project browsertests-CentralNotice-en.wikipedia.beta.wmflabs.org-windows_7-firefox-sauce build #31: FIXED in 57 sec: https://integration.wikimedia.org/ci/job/browsertests-CentralNotice-en.wikipedia.beta.wmflabs.org-windows_7-firefox-sauce/31/ [23:15:26] 10Staging, 3releng-201415-Q3: [Quarterly Success Metric] Stable uptime metrics of the Staging cluster - https://phabricator.wikimedia.org/T88705#1235541 (10mmodell) [23:34:33] Project beta-scap-eqiad build #50388: FAILURE in 30 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/50388/ [23:56:53] Yippee, build fixed! [23:56:54] Project browsertests-CentralNotice-en.wikipedia.beta.wmflabs.org-windows_7-chrome-sauce build #31: FIXED in 53 sec: https://integration.wikimedia.org/ci/job/browsertests-CentralNotice-en.wikipedia.beta.wmflabs.org-windows_7-chrome-sauce/31/