[00:00:14] (PS1) Dduvall: Raita Elasticsearch logging [selenium] - https://gerrit.wikimedia.org/r/207324 [00:02:49] (CR) jenkins-bot: [V: -1] Raita Elasticsearch logging [selenium] - https://gerrit.wikimedia.org/r/207324 (owner: Dduvall) [00:06:59] (PS2) Dduvall: Raita Elasticsearch logging [selenium] - https://gerrit.wikimedia.org/r/207324 [00:07:41] (CR) jenkins-bot: [V: -1] Raita Elasticsearch logging [selenium] - https://gerrit.wikimedia.org/r/207324 (owner: Dduvall) [00:12:06] (PS3) Dduvall: Raita Elasticsearch logging [selenium] - https://gerrit.wikimedia.org/r/207324 [00:14:07] PROBLEM - Puppet failure on deployment-mediawiki03 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [00:25:39] I think there's something messed up with how jenkins is doing the unit tests for gwtoolset - https://gerrit.wikimedia.org/r/#/c/207329/ [00:54:05] RECOVERY - Puppet failure on deployment-mediawiki03 is OK: OK: Less than 1.00% above the threshold [0.0] [01:15:03] PROBLEM - Puppet failure on deployment-mediawiki01 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [01:30:04] RECOVERY - Puppet failure on deployment-mediawiki01 is OK: OK: Less than 1.00% above the threshold [0.0] [02:07:13] PROBLEM - Puppet staleness on deployment-elastic07 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [43200.0] [02:23:34] (PS1) Krinkle: Create npm-run-doc job [integration/config] - https://gerrit.wikimedia.org/r/207363 [02:28:33] (PS2) Krinkle: Create npm-run-doc job [integration/config] - https://gerrit.wikimedia.org/r/207363 [02:28:45] (CR) Krinkle: [C: 2] "Deployed npm-run-doc job." [integration/config] - https://gerrit.wikimedia.org/r/207363 (owner: Krinkle) [02:30:49] (Merged) jenkins-bot: Create npm-run-doc job [integration/config] - https://gerrit.wikimedia.org/r/207363 (owner: Krinkle) [02:35:30] (PS1) Legoktm: Make OAI phpunit job voting, use generic job [integration/config] - https://gerrit.wikimedia.org/r/207368 (https://phabricator.wikimedia.org/T67895) [02:36:03] (CR) Legoktm: [C: 2] Make OAI phpunit job voting, use generic job [integration/config] - https://gerrit.wikimedia.org/r/207368 (https://phabricator.wikimedia.org/T67895) (owner: Legoktm) [02:37:56] (Merged) jenkins-bot: Make OAI phpunit job voting, use generic job [integration/config] - https://gerrit.wikimedia.org/r/207368 (https://phabricator.wikimedia.org/T67895) (owner: Legoktm) [02:39:12] Krinkle: crap, I didn't realize you hadn't deployed yet :/ [02:39:57] !log deploying https://gerrit.wikimedia.org/r/207363 and https://gerrit.wikimedia.org/r/207368 [02:40:04] Logged the message, Master [02:42:27] Perfect [02:42:39] legoktm: No worries [02:55:08] PROBLEM - Puppet failure on deployment-mediawiki03 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [02:55:22] PROBLEM - Puppet failure on deployment-kafka02 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [03:29:20] PROBLEM - Puppet staleness on deployment-redis01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [43200.0] [04:04:24] Project beta-scap-eqiad build #50885: FAILURE in 29 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/50885/ [04:15:05] Yippee, build fixed!
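
The recurring "N% of data above the critical threshold" PROBLEM/RECOVERY lines above come from monitoring checks that sample recent datapoints from Graphite and alert on the fraction exceeding a limit: 0.0 for Puppet failure counts, 43200.0 seconds (12 hours) for Puppet staleness. A minimal sketch of that percent-above-threshold computation, assuming a plain list of datapoints rather than the check's actual Graphite query code:

```
# Minimal sketch of the percent-above-threshold logic behind alerts like
# "CRITICAL: 44.44% of data above the critical threshold [0.0]".
# The real check fetches datapoints from Graphite; here we assume a
# plain list of (timestamp, value) pairs.

def percent_above(datapoints, threshold):
    """Return the percentage of non-null datapoints above `threshold`."""
    values = [v for _, v in datapoints if v is not None]
    if not values:
        return 0.0
    over = sum(1 for v in values if v > threshold)
    return 100.0 * over / len(values)

# Example: 4 of 9 recent "puppet failure" samples are non-zero, so the
# check reports 44.44% of data above the [0.0] threshold.
samples = list(enumerate([0, 0, 1, 1, 0, 1, 1, 0, 0]))
print(round(percent_above(samples, 0.0), 2))  # -> 44.44
```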
[04:15:06] Project beta-scap-eqiad build #50886: FIXED in 1 min 7 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/50886/ [04:48:36] RECOVERY - App Server Main HTTP Response on deployment-mediawiki03 is OK: HTTP OK: HTTP/1.1 200 OK - 47285 bytes in 9.787 second response time [04:54:37] PROBLEM - App Server Main HTTP Response on deployment-mediawiki03 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:40:20] RECOVERY - Puppet failure on deployment-kafka02 is OK: OK: Less than 1.00% above the threshold [0.0] [05:41:31] Yippee, build fixed! [05:41:31] Project browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-os_x_10.9-chrome-sauce build #54: FIXED in 25 min: https://integration.wikimedia.org/ci/job/browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-os_x_10.9-chrome-sauce/54/ [05:46:27] PROBLEM - Puppet failure on deployment-parsoid05 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [06:32:14] Project browsertests-Core-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #597: FAILURE in 13 min: https://integration.wikimedia.org/ci/job/browsertests-Core-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/597/ [06:37:41] RECOVERY - Free space - all mounts on deployment-bastion is OK: OK: All targets OK [06:51:18] PROBLEM - Puppet failure on deployment-kafka02 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [07:43:15] Yippee, build fixed! [07:43:16] Project browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-windows_8-internet_explorer-10-sauce build #21: FIXED in 34 min: https://integration.wikimedia.org/ci/job/browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-windows_8-internet_explorer-10-sauce/21/ [07:50:57] PROBLEM - Parsoid on deployment-parsoid05 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:55:54] RECOVERY - Parsoid on deployment-parsoid05 is OK: HTTP OK: HTTP/1.1 200 OK - 1086 bytes in 5.384 second response time [08:21:20] Yippee, build fixed! [08:21:22] Project browsertests-CirrusSearch-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #566: FIXED in 1 min 19 sec: https://integration.wikimedia.org/ci/job/browsertests-CirrusSearch-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/566/ [08:28:31] RECOVERY - App Server Main HTTP Response on deployment-mediawiki03 is OK: HTTP OK: HTTP/1.1 200 OK - 47296 bytes in 3.832 second response time [08:43:32] just so folks know, I am still on track for the salt upgrade in deployment-prep in 1 hour and 15 minutes (10 am utc). [08:43:45] I'll notify here before I proceed. [08:44:01] this may impact salt-related commands including git-deploy. [08:44:25] I got my timezone calculations wrong and thought it would be in 15 minutes :-D [09:08:31] Project browsertests-Wikidata-SmokeTests-linux-firefox-sauce-T89343-DEBUG build #1: FAILURE in 18 sec: https://integration.wikimedia.org/ci/job/browsertests-Wikidata-SmokeTests-linux-firefox-sauce-T89343-DEBUG/1/ [09:12:28] Yippee, build fixed!
[09:12:28] Project browsertests-Wikidata-SmokeTests-linux-firefox-sauce-T89343-DEBUG build #2: FIXED in 1 min 28 sec: https://integration.wikimedia.org/ci/job/browsertests-Wikidata-SmokeTests-linux-firefox-sauce-T89343-DEBUG/2/ [09:16:51] zeljkof: any clue what that job is: browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-windows_7-internet_explorer-11-sauce [09:17:01] https://integration.wikimedia.org/ci/job/browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-windows_7-internet_explorer-11-sauce/ [09:17:22] hashar: not sure what you mean :/ [09:17:29] it is in jenkins [09:17:34] but apparently not in our jjb config file [09:17:39] it should be [09:17:49] ah was created by gilles [09:17:54] if it is not in jjb file, delete it [09:17:54] maybe a patch in progress [09:18:02] yes, that might be it [09:21:41] Project browsertests-Wikidata-SmokeTests-linux-firefox-sauce-T89343-DEBUG build #3: FAILURE in 7 sec: https://integration.wikimedia.org/ci/job/browsertests-Wikidata-SmokeTests-linux-firefox-sauce-T89343-DEBUG/3/ [09:22:48] (PS1) Hashar: Logrotate mwext-VisualEditor-sync-gerrit [integration/config] - https://gerrit.wikimedia.org/r/207400 (https://phabricator.wikimedia.org/T91396) [09:22:50] (PS1) Hashar: browsertest for MultimediaViewer win7+ie11 [integration/config] - https://gerrit.wikimedia.org/r/207401 (https://phabricator.wikimedia.org/T91396) [09:23:00] (CR) Hashar: [C: 2] Logrotate mwext-VisualEditor-sync-gerrit [integration/config] - https://gerrit.wikimedia.org/r/207400 (https://phabricator.wikimedia.org/T91396) (owner: Hashar) [09:24:51] (Merged) jenkins-bot: Logrotate mwext-VisualEditor-sync-gerrit [integration/config] - https://gerrit.wikimedia.org/r/207400 (https://phabricator.wikimedia.org/T91396) (owner: Hashar) [09:26:29] RECOVERY - Puppet failure on deployment-parsoid05 is OK: OK: Less than 1.00% above the threshold [0.0] [09:29:34] (CR) Hashar: [C: 2] browsertest for MultimediaViewer win7+ie11 [integration/config] - https://gerrit.wikimedia.org/r/207401 (https://phabricator.wikimedia.org/T91396) (owner: Hashar) [09:31:42] (Merged) jenkins-bot: browsertest for MultimediaViewer win7+ie11 [integration/config] - https://gerrit.wikimedia.org/r/207401 (https://phabricator.wikimedia.org/T91396) (owner: Hashar) [09:51:24] RECOVERY - Puppet failure on deployment-kafka02 is OK: OK: Less than 1.00% above the threshold [0.0] [09:56:41] (PS1) Hashar: translatewiki-puppetlint-lenient is now voting [integration/config] - https://gerrit.wikimedia.org/r/207409 (https://phabricator.wikimedia.org/T95090) [09:57:21] (CR) Hashar: [C: 2] translatewiki-puppetlint-lenient is now voting [integration/config] - https://gerrit.wikimedia.org/r/207409 (https://phabricator.wikimedia.org/T95090) (owner: Hashar) [09:57:44] I'm going to get started on the salt upgrade in deployment-prep now [09:58:57] (Merged) jenkins-bot: translatewiki-puppetlint-lenient is now voting [integration/config] - https://gerrit.wikimedia.org/r/207409 (https://phabricator.wikimedia.org/T95090) (owner: Hashar) [10:15:38] Project beta-scap-eqiad build #50922: FAILURE in 1 min 40 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/50922/ [10:16:27] apergos: great to hear!
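
Above, a job exists on the Jenkins server but not in the JJB (Jenkins Job Builder) config, which is how such orphans get spotted and deleted. A minimal sketch of a drift check, assuming the python-jenkins library and a local integration/config checkout; the server URL is real, but the YAML path and the naive literal-name scan are illustrative (real JJB job-templates would need expansion first):

```
# Sketch: list Jenkins jobs that are not defined in the JJB YAML config,
# like browsertests-MultimediaViewer-...-windows_7-internet_explorer-11-sauce.
# Assumes the python-jenkins library; paths are illustrative, and
# '{name}'-style job-templates are skipped rather than expanded.
import glob
import jenkins  # pip install python-jenkins
import yaml

server = jenkins.Jenkins('https://integration.wikimedia.org/ci/')
live_jobs = {job['name'] for job in server.get_jobs()}

defined = set()
for path in glob.glob('integration/config/jjb/*.yaml'):
    with open(path) as f:
        for doc in yaml.safe_load_all(f):
            for entry in doc or []:
                if isinstance(entry, dict):
                    name = entry.get('job', {}).get('name')
                    if name and '{' not in name:  # skip unexpanded templates
                        defined.add(name)

for orphan in sorted(live_jobs - defined):
    print('on Jenkins but not in JJB:', orphan)
```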
[10:26:34] doo dee doo dee doo [10:26:50] salt is upgraded on the master; the syndic and the minion there are upgraded as well [10:27:04] and everybody is responsive except parsoid05 which is heavily loaded [10:27:18] now setting up for upgrade of all non jessie minions [10:27:28] PROBLEM - Puppet failure on deployment-parsoid05 is CRITICAL: CRITICAL: 25.00% of data above the critical threshold [0.0] [10:36:28] Yippee, build fixed! [10:36:28] Project beta-scap-eqiad build #50924: FIXED in 2 min 33 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/50924/ [10:38:07] deployment-parsoid05 is still very unhappy.... anyone around to look at it? [10:40:59] PROBLEM - Puppet failure on deployment-test is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [0.0] [11:01:24] I'll take that as a 'no'. It might not get updated til a quieter time for it then [11:15:27] besides deployment-parsoid05, the other host that failed to update is deployment-cache-bits01; I will be doing this host manually [11:15:56] RECOVERY - Puppet failure on deployment-test is OK: OK: Less than 1.00% above the threshold [0.0] [12:02:21] now doing the two jessie instances [12:08:34] apergos: parsoid05 suffers from some misconfiguration that causes the parsoid service to eat 100% cpu [12:08:39] so I guess it is unrelated [12:08:45] oh I know it's not me [12:09:02] it was unhappy before I started. but I got the update done over there [12:09:05] the bug being https://phabricator.wikimedia.org/T97421 [12:09:29] I'll leave that tab open [12:09:35] might be worth looking at later [12:11:31] that might be the instance having some issue though [12:13:20] !log killing puppet on deployment-parsoid05, it eats all CPU for some reason [12:13:28] Logged the message, Master [12:14:20] fyi deployment-restbase01 and 02 are the two jessie instances that are taking some extra time [12:15:42] maybe that is the underlying virt node which is just slow [12:16:10] [414080.036058] BUG: soft lockup - CPU#1 stuck for 21s! [apt-cache:7539] [12:21:42] I am going to upgrade and reboot it [12:22:14] !log deployment-parsoid05 slowdown is https://phabricator.wikimedia.org/T97421 . Running apt-get upgrade and rebooting it, but its slowness issue might be with the underlying hardware [12:22:18] Logged the message, Master [12:23:14] (PS3) JanZerebecki: Added job for WikidataQuality extension. [integration/config] - https://gerrit.wikimedia.org/r/206392 (owner: Soeren.oldag) [12:25:06] (CR) jenkins-bot: [V: -1] Added job for WikidataQuality extension. [integration/config] - https://gerrit.wikimedia.org/r/206392 (owner: Soeren.oldag) [12:29:32] (CR) Soeren.oldag: [C: 1] Added job for WikidataQuality extension. [integration/config] - https://gerrit.wikimedia.org/r/206392 (owner: Soeren.oldag) [12:35:40] (PS4) JanZerebecki: Added job for WikidataQuality extension. [integration/config] - https://gerrit.wikimedia.org/r/206392 (owner: Soeren.oldag) [12:37:59] PROBLEM - Parsoid on deployment-parsoid05 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:38:59] eyeroll [12:42:21] (CR) Soeren.oldag: [C: 1] Added job for WikidataQuality extension. [integration/config] - https://gerrit.wikimedia.org/r/206392 (owner: Soeren.oldag) [12:43:28] (PS5) JanZerebecki: Added job for WikidataQuality extension.
[integration/config] - https://gerrit.wikimedia.org/r/206392 (owner: Soeren.oldag) [12:43:43] (CR) JanZerebecki: "PS5 is rebase only" [integration/config] - https://gerrit.wikimedia.org/r/206392 (owner: Soeren.oldag) [12:50:55] (CR) JanZerebecki: "Deployed the Jenkins jobs this commit adds: mwext-WikidataQuality-npm, mwext-WikidataQuality-qunit, mwext-WikidataQuality-repo-tests-mysql" [integration/config] - https://gerrit.wikimedia.org/r/206392 (owner: Soeren.oldag) [12:52:34] (CR) Soeren.oldag: [C: 1] Added job for WikidataQuality extension. [integration/config] - https://gerrit.wikimedia.org/r/206392 (owner: Soeren.oldag) [12:57:53] RECOVERY - Parsoid on deployment-parsoid05 is OK: HTTP OK: HTTP/1.1 200 OK - 1086 bytes in 3.397 second response time [12:58:29] all instances in deployment prep have been updated to 2015.7.5, all are responsive (at least for now :-P) [13:02:12] !log labvirt1005 seems to have a hardware issue. Impacts a bunch of beta cluster / integration instances as listed on https://phabricator.wikimedia.org/T97521#1245217 [13:02:16] Logged the message, Master [13:04:00] PROBLEM - Parsoid on deployment-parsoid05 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:05:17] zeljkof: CFisch_WMDE will be there in a minute.. ;) [13:05:41] Tobi_WMDE_SW: ok, no rush :) [13:11:28] !log Rebooting deployment-parsoid05 via wikitech interface. [13:11:32] Logged the message, Master [13:14:01] PROBLEM - Host deployment-parsoid05 is DOWN: CRITICAL - Host Unreachable (10.68.16.120) [13:25:06] (CR) JanZerebecki: "npm jobs works:" [integration/config] - https://gerrit.wikimedia.org/r/206392 (owner: Soeren.oldag) [13:25:25] hashar: did you tell me at some point that deployment-bastion has some firewall issues maybe? that host has never done well being a target of git deploy for me [13:25:38] but the other targets respond fine so I'm inclined to ignore [13:26:06] it probably has ferm rules [13:26:12] and might be missing the one for salt? [13:27:51] well salt works, as in test.ping is ok [13:28:05] but some other piece maybe [13:29:03] it's a little odd since it's the deploy server and also a target [13:29:08] (CR) JanZerebecki: "Will fix: https://phabricator.wikimedia.org/T97529" [integration/config] - https://gerrit.wikimedia.org/r/206392 (owner: Soeren.oldag) [13:31:42] doubt there will be [13:34:05] apergos: you might want to announce it somewhere :] [13:34:09] QA list might be a good place [13:34:59] I have to look again to see if I'm on that [13:38:28] I sent to qa@lists; if it doesn't show up in a few minutes holler, I'll send it to you to forward I guess [13:39:13] the subject has "salt upgraded" in it [13:39:19] (PS1) Hashar: Cloner: Implement cache-no-hardlinks argument [integration/zuul] (patch-queue/debian/precise-wikimedia) - https://gerrit.wikimedia.org/r/207438 [13:46:51] (PS1) Hashar: zuul-cloner can now hardlink from cache-dir [integration/zuul] (debian/precise-wikimedia) - https://gerrit.wikimedia.org/r/207442 (https://phabricator.wikimedia.org/T97106) [13:50:02] apergos: else poke the engineering list :] [13:53:46] (CR) Hashar: [C: 2 V: 2] "Build and published at http://people.wikimedia.org/~hashar/debs/zuul_2.0.0-304-g685ca22-wmf2/" [integration/zuul] (debian/precise-wikimedia) - https://gerrit.wikimedia.org/r/207442 (https://phabricator.wikimedia.org/T97106) (owner: Hashar) [13:54:38] guess it didn't show up eh?
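
The upgrade verification above boils down to salt's test.ping: a minion that answers is alive even if some higher-level piece (like git-deploy against deployment-bastion) misbehaves. A minimal sketch of that responsiveness sweep using salt's Python client API, which must run on the salt master; the target glob and sample host set are illustrative:

```
# Sketch: check which deployment-prep minions answer test.ping after the
# upgrade, the same probe mentioned above. Must run on the salt master;
# the target glob and the expected host set are illustrative.
import salt.client

client = salt.client.LocalClient()
expected = {'deployment-bastion', 'deployment-parsoid05',
            'deployment-cache-bits01'}  # sample set, not the full cluster
replies = client.cmd('deployment-*', 'test.ping', timeout=15)

for minion in sorted(expected):
    status = 'ok' if replies.get(minion) is True else 'NO RESPONSE'
    print('{}: {}'.format(minion, status))
```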
[13:58:17] (PS1) Hashar: Merge branch 'debian/precise-wikimedia' into debian/trusty-wikimedia [integration/zuul] (debian/trusty-wikimedia) - https://gerrit.wikimedia.org/r/207444 (https://phabricator.wikimedia.org/T97106) [14:00:26] (CR) Hashar: [C: 2 V: 2] Merge branch 'debian/precise-wikimedia' into debian/trusty-wikimedia [integration/zuul] (debian/trusty-wikimedia) - https://gerrit.wikimedia.org/r/207444 (https://phabricator.wikimedia.org/T97106) (owner: Hashar) [14:03:12] (CR) JanZerebecki: [C: -1] "Needs two composer runs or a combined one..." [integration/config] - https://gerrit.wikimedia.org/r/206392 (owner: Soeren.oldag) [14:11:45] !log rebooting integration-saltmaster stalled. [14:11:52] Logged the message, Master [14:16:21] PROBLEM - Host integration-saltmaster is DOWN: CRITICAL - Host Unreachable (10.68.18.24) [14:18:59] RECOVERY - Host deployment-parsoid05 is UP: PING OK - Packet loss = 0%, RTA = 0.63 ms [14:23:49] RECOVERY - Parsoid on deployment-parsoid05 is OK: HTTP OK: HTTP/1.1 200 OK - 1086 bytes in 0.032 second response time [14:25:23] !log upgrading zuul on integration-slave-precise-1011 for https://phabricator.wikimedia.org/T97106 [14:25:28] Logged the message, Master [14:32:30] RECOVERY - Puppet failure on deployment-parsoid05 is OK: OK: Less than 1.00% above the threshold [0.0] [14:33:24] PROBLEM - Host deployment-test is DOWN: CRITICAL - Host Unreachable (10.68.16.149) [14:34:52] Krinkle|detached: I have incorporated the zuul-cloner git cache hardlink stuff in our .deb packages. Built both the Precise and Trusty ones and put them in /data/project/root/ ; integration-slave-precise1011 has it now. Full details at https://phabricator.wikimedia.org/T97106#1245429 [14:52:56] hashar: cool [14:53:41] Krinkle: I guess there will be no side effect [14:53:48] Yeah [14:53:57] Krinkle: I have written some explanations on the task, so feel free to do the upgrade on the other instances [14:54:08] RECOVERY - Host deployment-test is UP: PING OK - Packet loss = 0%, RTA = 0.84 ms [14:54:10] I would have mass-done it via salt but the saltmaster is dead at the moment :] [14:54:15] a good point [14:54:22] hashar: Aye [14:54:23] cherry picking a patch is reasonably easy [14:54:29] What happened? [14:54:33] oh [14:54:48] integration-saltmaster was migrated last week to new labs hardware (the labvirtXXXX hosts) [14:55:03] and the hardware has faulty memory :] [14:55:08] they are cursed :/ [14:57:18] PROBLEM - Host deployment-cache-bits01 is DOWN: CRITICAL - Host Unreachable (10.68.16.12) [15:00:05] !log Instances are being moved out from labvirt1005 which has some faulty memory. List of instances at https://phabricator.wikimedia.org/T97521#1245217 [15:00:09] Logged the message, Master [15:00:51] golly [15:04:00] (PS1) Aude: Update Wikidata branch to wmf/1.26wmf4 [tools/release] - https://gerrit.wikimedia.org/r/207459 [15:10:43] RECOVERY - Host deployment-cache-bits01 is UP: PING OK - Packet loss = 0%, RTA = 0.98 ms [15:11:29] PROBLEM - Host deployment-elastic06 is DOWN: CRITICAL - Host Unreachable (10.68.17.186) [15:12:12] hashar: Ah, right. [15:45:50] hashar: You told Andrew about https://phabricator.wikimedia.org/T96706 ? [15:46:11] I know he's busy with the migration, just making sure. [15:48:30] Krinkle: yeah, yesterday during the weekly meeting [15:48:37] quite easy to do [15:48:41] RECOVERY - Host deployment-elastic06 is UP: PING OK - Packet loss = 0%, RTA = 377.75 ms [15:48:42] I am off, meeting!
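
The zuul-cloner change packaged above builds on plain git behavior: cloning from a local cache path hardlinks object files instead of copying them, which is what makes workspace clones from a cache-dir cheap, and the new cache-no-hardlinks argument exists to turn that off. A sketch of the underlying git calls, not zuul-cloner's actual code; the paths and the follow-up fetch are illustrative:

```
# Sketch of the git behavior zuul-cloner's cache-dir feature relies on:
# a clone from a local path hardlinks objects by default; --no-hardlinks
# (what the cache-no-hardlinks argument would toggle) forces full copies.
# Paths are illustrative.
import subprocess

CACHE = '/srv/git/cache/mediawiki-core.git'   # pre-seeded local cache (assumed)
WORKSPACE = '/tmp/workspace/mediawiki-core'

def clone_from_cache(hardlinks=True):
    cmd = ['git', 'clone', CACHE, WORKSPACE]
    if not hardlinks:
        cmd.insert(2, '--no-hardlinks')  # copy objects instead of hardlinking
    subprocess.check_call(cmd)
    # zuul-cloner would then fetch the current state (and the Zuul ref)
    # from the real origin on top of the cached objects:
    subprocess.check_call(
        ['git', '-C', WORKSPACE, 'fetch',
         'https://gerrit.wikimedia.org/r/mediawiki/core'])

clone_from_cache(hardlinks=True)
```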
[15:50:55] PROBLEM - Host deployment-kafka02 is DOWN: CRITICAL - Host Unreachable (10.68.17.156) [15:54:37] PROBLEM - Host deployment-mediawiki03 is DOWN: CRITICAL - Host Unreachable (10.68.17.55) [15:55:54] RECOVERY - Host deployment-kafka02 is UP: PING OK - Packet loss = 0%, RTA = 0.86 ms [16:34:37] Project beta-scap-eqiad build #50962: FAILURE in 31 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/50962/ [16:47:07] Release-Engineering, MediaWiki-Maintenance-scripts, MediaWiki-Redirects, Patch-For-Review: namespaceDupes not handling deleted namespace redirects as desired - https://phabricator.wikimedia.org/T91401#1245997 (demon) a:demon>None [16:49:45] Release-Engineering, MediaWiki-Maintenance-scripts, MediaWiki-Redirects, Patch-For-Review: namespaceDupes not handling deleted namespace redirects as desired - https://phabricator.wikimedia.org/T91401#1246003 (demon) a:demon Whoops, didn't mean to unassign. Also: refreshLinks has since finished... [16:55:13] Yippee, build fixed! [16:55:13] Project beta-scap-eqiad build #50964: FIXED in 1 min 11 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/50964/ [16:59:08] huh: cp: cannot create regular file `/srv/mediawiki-staging/php-master/cache/l10n/l10n_cache-ab.cdb': Permission denied [17:04:54] I bet there is a l10nupdate user in both ldap and locally on deployment-bastion [17:15:22] !log removed l10nupdate user from /etc/passwd on deployment-bastion [17:15:29] Logged the message, Master [17:17:24] Release-Engineering, MediaWiki-Vagrant, Documentation: Document RSpec workflow on MediaWiki-Vagrant - https://phabricator.wikimedia.org/T97464#1246100 (dduvall) A slightly easier way would be to invoke the specific gem version using `__` following the bin name. ``` /Users/.../vagrant $ gem in... [17:26:41] Browser-Tests: Create new account at Sauce Labs for running Jenkins jobs - https://phabricator.wikimedia.org/T97549#1246124 (zeljkofilipin) NEW a:zeljkofilipin [17:27:45] Browser-Tests: Create new account at Sauce Labs for running Jenkins jobs - https://phabricator.wikimedia.org/T97549#1246124 (zeljkofilipin) [17:35:12] Browser-Tests: Create new account at Sauce Labs for running Jenkins jobs - https://phabricator.wikimedia.org/T97549#1246154 (zeljkofilipin) User with username wikimedia-jenkins created. Asked OIT to create jenkins@wikimedia.org. [17:50:14] RECOVERY - Host deployment-mediawiki03 is UP: PING OK - Packet loss = 0%, RTA = 0.80 ms [17:52:16] PROBLEM - Host deployment-restbase01 is DOWN: CRITICAL - Host Unreachable (10.68.17.227) [17:52:58] Blocked-on-RelEng, Release-Engineering, Multimedia, Reading-Infrastructure-Team, and 4 others: Create basic puppet role for Sentry - https://phabricator.wikimedia.org/T84956#1246211 (bd808) [18:00:05] RECOVERY - Puppet failure on deployment-mediawiki03 is OK: OK: Less than 1.00% above the threshold [0.0] [18:00:19] Release-Engineering: Shorten/Simply MW train deploy cadence to M->Tu->W - https://phabricator.wikimedia.org/T97553#1246233 (greg) NEW [18:00:41] Beta-Cluster, Blocked-on-RelEng, ContentTranslation-Deployments, MediaWiki-extensions-ContentTranslation, and 3 others: Setup new wikis in Beta Cluster for Content Translation - https://phabricator.wikimedia.org/T90683#1246240 (mmodell) @KartikMistry it should be correct now, I'm going to try to f...
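
The cp permission error and the hunch above describe a classic NSS shadowing problem: the same account defined both locally in /etc/passwd and in LDAP, potentially with different UIDs, with whichever source nsswitch.conf lists first masking the other. A minimal sketch of how to confirm which sources define the user, using glibc's getent -s to query one NSS service at a time (service-name availability depends on the host's libc and nsswitch.conf):

```
# Sketch: detect a user defined both locally and in LDAP, the
# l10nupdate situation diagnosed above.
import subprocess

def uid_from(source, name):
    # glibc getent can query a single NSS source via -s (e.g. files, ldap);
    # availability of the 'ldap' service depends on the host's setup.
    result = subprocess.run(['getent', '-s', source, 'passwd', name],
                            capture_output=True, text=True)
    if result.returncode != 0:
        return None  # not defined by this source
    return int(result.stdout.split(':')[2])  # third passwd field is the UID

name = 'l10nupdate'
uids = {src: uid_from(src, name) for src in ('files', 'ldap')}
print(uids)
if all(uid is not None for uid in uids.values()):
    # Defined in both places: the source listed first in nsswitch.conf
    # wins, shadowing the other entry (and its UID/ownership).
    print(name, 'is defined twice; the local /etc/passwd entry shadows LDAP')
```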
[18:02:04] Release-Engineering: Shorten/Simply MW train deploy cadence to M->Tu->W - https://phabricator.wikimedia.org/T97553#1246257 (greg) (NB: This proposal is basically how FB does their deploy cadence, other than how the gradual rollout happens. Essentially weekly starting on Monday.) [18:06:13] Release-Engineering: Shorten/Simplify MW train deploy cadence to M->Tu->W - https://phabricator.wikimedia.org/T97553#1246287 (greg) [18:16:33] Release-Engineering: Shorten/Simplify MW train deploy cadence to M->Tu->W - https://phabricator.wikimedia.org/T97553#1246327 (demon) If we did this we should automate the cutting (and testing) of the new branch like Sunday night. [18:21:29] Release-Engineering: Shorten/Simplify MW train deploy cadence to M->Tu->W - https://phabricator.wikimedia.org/T97553#1246343 (Legoktm) Why MTuW instead of TuWTh? (deploying on monday means you're rushed to fix any bugs that might have been discovered over the weekend) [18:28:30] Release-Engineering: Shorten/Simplify MW train deploy cadence to M->Tu->W - https://phabricator.wikimedia.org/T97553#1246367 (greg) >>! In T97553#1246327, @demon wrote: > If we did this we should automate the cutting (and testing) of the new branch like Sunday night. +1 >>! In T97553#1246343, @Legoktm wrote... [18:37:54] (CR) Aude: [C: 2] Update Wikidata branch to wmf/1.26wmf4 [tools/release] - https://gerrit.wikimedia.org/r/207459 (owner: Aude) [18:38:04] (Merged) jenkins-bot: Update Wikidata branch to wmf/1.26wmf4 [tools/release] - https://gerrit.wikimedia.org/r/207459 (owner: Aude) [18:54:17] Release-Engineering: Shorten/Simplify MW train deploy cadence to M->Tu->W - https://phabricator.wikimedia.org/T97553#1246452 (mmodell) We really need to automate the branching stuff anyway - it's really time-consuming and error-prone. The way it is now wastes not just my time but @aude's and anyone else who w... [18:54:49] PROBLEM - Parsoid on deployment-parsoid05 is CRITICAL: Connection refused [18:55:53] Release-Engineering: Shorten/Simplify MW train deploy cadence to M->Tu->W - https://phabricator.wikimedia.org/T97553#1246460 (Jdforrester-WMF) I'd also suggest TuWTh so that the preponderance of holiday Mondays doesn't massively disrupt schedules. (As well as @demon's, @mmodell's and @legoktm's points.) [18:58:13] Release-Engineering: Shorten/Simplify MW train deploy cadence to Tu->W->Th - https://phabricator.wikimedia.org/T97553#1246481 (greg) [19:00:26] PROBLEM - Free space - all mounts on deployment-eventlogging02 is CRITICAL: CRITICAL: deployment-prep.deployment-eventlogging02.diskspace._var.byte_percentfree (<100.00%) [19:01:29] greg-g: https://wikitech.wikimedia.org/w/index.php?title=Deployments&type=revision&diff=156472&oldid=156458 [19:06:23] Beta-Cluster, MediaWiki-extensions-GWToolset, Multimedia, HHVM, Patch-For-Review: GWToolset XML upload fails with “The file that was uploaded exceeds the upload_max_filesize and/or the post_max_size directive in php.ini” on hhvm 3.6 - https://phabricator.wikimedia.org/T97415#1246514 (Bawolff) >>... [19:06:53] manybubbles: ty, I think I had an edit window open for that yesterday but never did it [19:07:18] greg-g: I'm just trying to get the last of it done so I can get it to beta tomorrow!
[19:07:31] It's mostly done, but I've got some small stuff to finish up [19:09:35] coolio [19:11:30] Release-Engineering, Wikidata: enable use of production deployed autoloader for extensions that is created by composer - https://phabricator.wikimedia.org/T97560#1246532 (JanZerebecki) NEW [19:13:51] Release-Engineering, Wikidata: enable use of production deployed autoloader for extensions that is created by composer - https://phabricator.wikimedia.org/T97560#1246563 (JanZerebecki) [19:15:32] Project beta-scap-eqiad build #50978: FAILURE in 1 min 28 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/50978/ [19:15:57] (CR) JanZerebecki: "A combined composer run won't work for now, see https://phabricator.wikimedia.org/T97560 ." [integration/config] - https://gerrit.wikimedia.org/r/206392 (owner: Soeren.oldag) [19:17:35] RECOVERY - Host integration-saltmaster is UP: PING OK - Packet loss = 0%, RTA = 4.00 ms [19:20:47] RECOVERY - SSH on integration-saltmaster is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0) [19:21:10] Release-Engineering, Team-Practices-This-Week: Test phabricator sprint extension updates - https://phabricator.wikimedia.org/T95469#1246627 (KLans_WMF) Open>Resolved [19:25:01] Yippee, build fixed! [19:25:02] Project beta-scap-eqiad build #50979: FIXED in 1 min 3 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/50979/ [19:28:40] PROBLEM - Free space - all mounts on deployment-bastion is CRITICAL: CRITICAL: deployment-prep.deployment-bastion.diskspace._var.byte_percentfree (<50.00%) [19:30:15] Release-Engineering, Wikidata, Composer: enable use of production deployed autoloader for extensions that is created by composer - https://phabricator.wikimedia.org/T97560#1246671 (bd808) [19:31:16] PROBLEM - Puppet staleness on integration-saltmaster is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [43200.0] [19:36:55] Release-Engineering: Shorten/Simplify MW train deploy cadence to Tu->W->Th - https://phabricator.wikimedia.org/T97553#1246722 (mmodell) I'm confused at how the best case would be 3 days but worst case 10 days? [19:41:12] Release-Engineering: Shorten/Simplify MW train deploy cadence to Tu->W->Th - https://phabricator.wikimedia.org/T97553#1246771 (greg) >>! In T97553#1246722, @mmodell wrote: > I'm confused at how the best case would be 3 days but worst case 10 days? If you merge into master right after the new branch which hap... [19:46:35] RECOVERY - Host deployment-restbase01 is UP: PING OK - Packet loss = 0%, RTA = 0.94 ms [19:48:41] RECOVERY - Free space - all mounts on deployment-bastion is OK: OK: All targets OK [19:52:09] Release-Engineering: Shorten/Simplify MW train deploy cadence to Tu->W->Th - https://phabricator.wikimedia.org/T97553#1246830 (mmodell) could also do it like this: cut the branch and push to testing wikis on Wednesday like we already do, but promote sooner: wednesday: new branch thursday: group 1 friday...
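
The "best case 3 days, worst case 10 days" arithmetic debated above follows directly from the cadence: a change merged just before the weekly branch cut rides out in the 3-day rollout, while one merged just after waits a full week for the next branch. A small illustration of that lead-time calculation under the proposed Tu->W->Th cadence; pure arithmetic, not any deploy tooling:

```
# Illustration of merge-to-everywhere lead time under the proposed
# Tu->W->Th cadence (branch cut Tuesday, full rollout Thursday).
# Pure arithmetic; day numbers are Mon=0 .. Sun=6.
BRANCH_CUT = 1           # Tuesday
FULL_ROLLOUT_OFFSET = 2  # Thursday, two days after the cut

def lead_time(merge_day):
    """Days from merging into master until the change is on all wikis."""
    days_until_cut = (BRANCH_CUT - merge_day) % 7
    if days_until_cut == 0:
        days_until_cut = 7  # merged after the cut -> wait for next branch
    return days_until_cut + FULL_ROLLOUT_OFFSET

for day, name in enumerate(['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']):
    print(name, lead_time(day), 'days')
# Best case: merge Monday -> 3 days. Worst case: merge right after
# Tuesday's cut -> 7 + 2 = 9 days, or ~10 counting partial days.
```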
[20:00:57] PROBLEM - Host integration-raita is DOWN: CRITICAL - Host Unreachable (10.68.16.53) [20:06:09] RECOVERY - Host integration-raita is UP: PING OK - Packet loss = 0%, RTA = 0.72 ms [20:06:47] PROBLEM - Host integration-saltmaster is DOWN: PING CRITICAL - Packet loss = 100% [20:16:21] RECOVERY - Host integration-saltmaster is UP: PING OK - Packet loss = 0%, RTA = 0.71 ms [20:21:53] Release-Engineering: Convert old wmf/* deployment branches to tags (recurring chore) - https://phabricator.wikimedia.org/T1288#1246928 (Krinkle) a:Krinkle>None [20:49:09] Project browsertests-Flow-en.wikipedia.beta.wmflabs.org-linux-chrome-monobook-sauce build #412: ABORTED in 4 min 47 sec: https://integration.wikimedia.org/ci/job/browsertests-Flow-en.wikipedia.beta.wmflabs.org-linux-chrome-monobook-sauce/412/ [20:54:34] aborted? [21:07:06] PROBLEM - Host deployment-bastion is DOWN: CRITICAL - Host Unreachable (10.68.16.58) [21:10:41] PROBLEM - Host Generic Beta Cluster is DOWN: CRITICAL - Host Unreachable (en.wikipedia.beta.wmflabs.org) [21:10:56] PROBLEM - Host deployment-cache-text02 is DOWN: CRITICAL - Host Unreachable (10.68.16.16) [21:11:02] PROBLEM - Puppet failure on deployment-mediawiki01 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [21:11:28] PROBLEM - Puppet failure on deployment-videoscaler01 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [21:11:54] RECOVERY - Host deployment-cache-text02 is UP: PING OK - Packet loss = 0%, RTA = 0.56 ms [21:11:58] RECOVERY - Host deployment-bastion is UP: PING OK - Packet loss = 0%, RTA = 0.69 ms [21:14:20] RECOVERY - Host Generic Beta Cluster is UP: PING OK - Packet loss = 0%, RTA = 0.81 ms [21:14:56] PROBLEM - Host deployment-db1 is DOWN: CRITICAL - Host Unreachable (10.68.16.193) [21:15:22] PROBLEM - Puppet failure on deployment-jobrunner01 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [21:17:30] PROBLEM - App Server Main HTTP Response on deployment-mediawiki03 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1532 bytes in 3.091 second response time [21:19:18] PROBLEM - Host deployment-db2 is DOWN: CRITICAL - Host Unreachable (10.68.17.94) [21:19:32] PROBLEM - App Server Main HTTP Response on deployment-mediawiki01 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1532 bytes in 2.054 second response time [21:20:00] PROBLEM - App Server Main HTTP Response on deployment-mediawiki02 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1532 bytes in 2.062 second response time [21:20:10] ugh [21:20:19] greg-g: these are "planned" sort of [21:20:23] oh, ok [21:20:28] RECOVERY - Host deployment-db1 is UP: PING OK - Packet loss = 0%, RTA = 0.98 ms [21:20:34] andrewbogott is migrating instances [21:20:56] PROBLEM - English Wikipedia Mobile Main page on beta-cluster is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1888 bytes in 6.103 second response time [21:22:17] thcipriani: please poke me when you are done [21:22:26] PROBLEM - English Wikipedia Main page on beta-cluster is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1885 bytes in 2.050 second response time [21:22:58] matanya: will do [21:23:20] thanks [21:24:32] RECOVERY - App Server Main HTTP Response on deployment-mediawiki01 is OK: HTTP OK: HTTP/1.1 200 OK - 47144 bytes in 0.890 second response time [21:25:00] RECOVERY - App Server Main HTTP Response on deployment-mediawiki02 is OK: HTTP OK: HTTP/1.1 200 OK - 47065 bytes in 1.076 second response time
[21:25:48] RECOVERY - English Wikipedia Mobile Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 28386 bytes in 0.538 second response time [21:26:16] PROBLEM - Host deployment-jobrunner01 is DOWN: CRITICAL - Host Unreachable (10.68.17.96) [21:27:26] RECOVERY - English Wikipedia Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 47358 bytes in 0.576 second response time [21:27:26] RECOVERY - App Server Main HTTP Response on deployment-mediawiki03 is OK: HTTP OK: HTTP/1.1 200 OK - 47281 bytes in 0.557 second response time [21:30:20] RECOVERY - Host deployment-jobrunner01 is UP: PING OK - Packet loss = 0%, RTA = 1.09 ms [21:30:40] PROBLEM - Host deployment-logstash1 is DOWN: CRITICAL - Host Unreachable (10.68.16.134) [21:31:29] RECOVERY - Puppet failure on deployment-videoscaler01 is OK: OK: Less than 1.00% above the threshold [0.0] [21:35:19] RECOVERY - Puppet failure on deployment-jobrunner01 is OK: OK: Less than 1.00% above the threshold [0.0] [21:35:27] PROBLEM - Host deployment-mediawiki02 is DOWN: PING CRITICAL - Packet loss = 100% [21:36:01] RECOVERY - Puppet failure on deployment-mediawiki01 is OK: OK: Less than 1.00% above the threshold [0.0] [21:36:19] RECOVERY - Host deployment-logstash1 is UP: PING OK - Packet loss = 0%, RTA = 0.63 ms [21:37:48] RECOVERY - Host deployment-mediawiki02 is UP: PING OK - Packet loss = 0%, RTA = 1.07 ms [21:39:30] PROBLEM - Host deployment-rsync01 is DOWN: CRITICAL - Host Unreachable (10.68.17.66) [21:41:02] RECOVERY - SSH on deployment-mediawiki02 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [21:41:34] RECOVERY - Host deployment-rsync01 is UP: PING OK - Packet loss = 0%, RTA = 0.61 ms [21:51:34] Beta-Cluster, Release-Engineering, Continuous-Integration-Config, Parsoid: Parsoid patches don't update Beta Cluster automatically -- only deploy repo patches seem to update that code - https://phabricator.wikimedia.org/T92871#1247250 (cscott) We tend to deploy something very close to Parsoid mast... [21:58:46] matanya: all clear!
[21:58:56] thanks much thcipriani [22:04:52] (CR) Dduvall: [C: 2] Enforce jshint linting [integration/raita] - https://gerrit.wikimedia.org/r/207103 (owner: Dduvall) [22:05:54] (Merged) jenkins-bot: Enforce jshint linting [integration/raita] - https://gerrit.wikimedia.org/r/207103 (owner: Dduvall) [22:11:03] thcipriani: still broken: (Cannot access the database: Unknown database 'dawiki' (10.68.17.94)) [22:13:27] * thcipriani looking [22:13:55] thcipriani: i tried to create a user, if that helps [22:14:45] (PS4) Dduvall: Raita Elasticsearch logging [selenium] - https://gerrit.wikimedia.org/r/207324 [22:20:18] (PS2) Dduvall: Field mappings for more build information [integration/raita] - https://gerrit.wikimedia.org/r/207291 [22:21:33] (CR) Dduvall: [C: 2] Field mappings for more build information [integration/raita] - https://gerrit.wikimedia.org/r/207291 (owner: Dduvall) [22:21:45] (Merged) jenkins-bot: Field mappings for more build information [integration/raita] - https://gerrit.wikimedia.org/r/207291 (owner: Dduvall) [22:24:30] hmm, yeah, I don't see it in mysql either, fwiw: ERROR 1049 (42000): Unknown database 'dawiki' [22:27:37] matanya: looks like this is being worked on: https://phabricator.wikimedia.org/T90683 [22:28:26] thanks thcipriani, i'll try again tomorrow [22:28:29] see also: https://phabricator.wikimedia.org/T97388 [22:37:52] Deployment-Systems, Release-Engineering, Services, operations: Streamline our service development and deployment process - https://phabricator.wikimedia.org/T93428#1247411 (ssastry) Goes without saying that the individual services should also be able to work with the fact that multiple versions of t... [23:31:20] PROBLEM - Puppet failure on deployment-pdf01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [23:31:24] PROBLEM - Puppet staleness on deployment-urldownloader is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [43200.0] [23:31:26] (PS1) Dduvall: Moved index.html to a docroot directory [integration/raita] - https://gerrit.wikimedia.org/r/207679 [23:31:44] PROBLEM - Puppet failure on deployment-cache-text02 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
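
The dawiki failure above ("Unknown database", mysql ERROR 1049) is the kind of check that is easy to script across the Beta Cluster wikis instead of probing one database by hand. A minimal sketch, assuming the PyMySQL client; the host, credentials, and wiki list are all illustrative:

```
# Sketch: confirm which expected Beta Cluster wiki databases exist, the
# check behind "Unknown database 'dawiki'" above. Assumes PyMySQL; the
# host, credentials, and expected wiki list are illustrative.
import pymysql

conn = pymysql.connect(host='10.68.17.94', user='wikiadmin',
                       password='...', database='information_schema')
expected = ['enwiki', 'dewiki', 'dawiki']  # sample of wikis Beta should serve

with conn.cursor() as cur:
    cur.execute('SELECT schema_name FROM information_schema.schemata')
    present = {row[0] for row in cur.fetchall()}

for db in expected:
    print(db, 'ok' if db in present else 'MISSING')
```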