[00:06:13] Project browsertests-VisualEditor-language-screenshot-windows_8.1-firefox » ilo,contintLabsSlave && UbuntuTrusty build #8: SUCCESS in 10 hr: https://integration.wikimedia.org/ci/job/browsertests-VisualEditor-language-screenshot-windows_8.1-firefox/LANGUAGE_SCREENSHOT_CODE=ilo,label=contintLabsSlave%20&&%20UbuntuTrusty/8/ [00:08:42] 3Beta-Cluster: HHVM emits logs filling /var/log/upstart/hhvm.log and /var/log/syslog/ filling disk - https://phabricator.wikimedia.org/T71976#808880 (10bd808) [00:08:43] 3Beta-Cluster: hhvm apache fills /var/log/apache2 with access logs - https://phabricator.wikimedia.org/T75262#808881 (10bd808) [00:11:52] 3MediaWiki-Core-Team, Beta-Cluster: no log in deployment-bastion:/data/project/logs from "503 server unavailable" on beta labs - https://phabricator.wikimedia.org/T74275#760174 (10bd808) [00:14:07] 3Verified-in-Phase0, VisualEditor, Verified-in-Phase2, Beta-Cluster: [Regression pre-wmf10] upload.beta.wmflabs.org is throwing 503s so all images are appearing with a broken icon inside VE - https://phabricator.wikimedia.org/T75786#808901 (10Ryasmeen) [00:15:26] 3Verified-in-Phase0, VisualEditor, Verified-in-Phase2, Beta-Cluster: [Regression pre-wmf10] upload.beta.wmflabs.org is throwing 503s so Math function parsing is completely broken inside VE - https://phabricator.wikimedia.org/T75787#808903 (10Ryasmeen) [00:19:28] Project beta-scap-eqiad build #32408: FAILURE in 25 min: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/32408/ [00:26:19] Project browsertests-VisualEditor-language-screenshot-windows_8.1-firefox » lb,contintLabsSlave && UbuntuTrusty build #8: SUCCESS in 11 hr: https://integration.wikimedia.org/ci/job/browsertests-VisualEditor-language-screenshot-windows_8.1-firefox/LANGUAGE_SCREENSHOT_CODE=lb,label=contintLabsSlave%20&&%20UbuntuTrusty/8/ [00:27:53] 3Release-Engineering, MediaWiki-Core-Team: Make ::mediawiki::syslog and ::mediawiki::php logging destinations configurable via hiera - 
https://phabricator.wikimedia.org/T1295#808937 (10bd808) [00:28:10] 3MediaWiki-Core-Team, Beta-Cluster: no log in deployment-bastion:/data/project/logs from "503 server unavailable" on beta labs - https://phabricator.wikimedia.org/T74275#808939 (10bd808) [00:28:11] 3Release-Engineering, MediaWiki-Core-Team: Make ::mediawiki::syslog and ::mediawiki::php logging destinations configurable via hiera - https://phabricator.wikimedia.org/T1295#808938 (10bd808) 5Open>3Resolved [00:29:33] !log deleted instance "udplog" [00:29:38] Logged the message, Master [00:30:29] wow. deployment-prep has 44 instances [00:31:10] PROBLEM - Host udplog is DOWN: CRITICAL - Host Unreachable (10.68.17.218) [00:34:27] There are an awful lot of services boxes that could probably be merged together. (mathoid, pdf, parsoid, sca, restbase) [00:39:29] that sounds like a good idea [00:45:22] Project browsertests-VisualEditor-language-screenshot-windows_8.1-firefox » sv,contintLabsSlave && UbuntuTrusty build #8: SUCCESS in 11 hr: https://integration.wikimedia.org/ci/job/browsertests-VisualEditor-language-screenshot-windows_8.1-firefox/LANGUAGE_SCREENSHOT_CODE=sv,label=contintLabsSlave%20&&%20UbuntuTrusty/8/ [01:56:16] Yippee, build fixed! 
[01:56:16] Project browsertests-MobileFrontend-en.m.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #402: FIXED in 1 hr 19 min: https://integration.wikimedia.org/ci/job/browsertests-MobileFrontend-en.m.wikipedia.beta.wmflabs.org-linux-firefox-sauce/402/ [01:57:37] RECOVERY - Puppet failure on deployment-jobrunner01 is OK: OK: Less than 1.00% above the threshold [0.0] [02:08:36] PROBLEM - Puppet failure on deployment-jobrunner01 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [02:13:06] PROBLEM - Puppet failure on deployment-videoscaler01 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [02:18:47] Project browsertests-MobileFrontend-test2.m.wikipedia.org-linux-firefox-sauce build #326: FAILURE in 1 hr 23 min: https://integration.wikimedia.org/ci/job/browsertests-MobileFrontend-test2.m.wikipedia.org-linux-firefox-sauce/326/ [02:37:35] 3Release-Engineering: Add in Phabricator quarterly milestones for RelEng - https://phabricator.wikimedia.org/T75729#809062 (10Aklapper) a:3Aklapper I think in this case I'd also favor generic quarterly tags. If hashar does not say No. 
:) [02:38:05] RECOVERY - Puppet failure on deployment-videoscaler01 is OK: OK: Less than 1.00% above the threshold [0.0] [02:48:19] Project browsertests-Flow-en.wikipedia.beta.wmflabs.org-windows_8-internet_explorer-sauce build #312: FAILURE in 52 min: https://integration.wikimedia.org/ci/job/browsertests-Flow-en.wikipedia.beta.wmflabs.org-windows_8-internet_explorer-sauce/312/ [03:09:07] PROBLEM - Puppet failure on deployment-videoscaler01 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [03:15:55] 3Continuous-Integration, Citoid, VisualEditor: Set up CI in the mediawiki/services/citoid.git repo - https://phabricator.wikimedia.org/T76069#809103 (10Krinkle) a:3Krinkle [03:18:37] RECOVERY - Puppet failure on deployment-jobrunner01 is OK: OK: Less than 1.00% above the threshold [0.0] [03:29:31] (03PS1) 10Krinkle: Create citoid-npm job and enable (non-voting) [integration/config] - 10https://gerrit.wikimedia.org/r/177484 [03:29:36] PROBLEM - Puppet failure on deployment-jobrunner01 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [03:39:03] RECOVERY - Puppet failure on deployment-videoscaler01 is OK: OK: Less than 1.00% above the threshold [0.0] [03:54:36] RECOVERY - Puppet failure on deployment-jobrunner01 is OK: OK: Less than 1.00% above the threshold [0.0] [04:29:05] (03CR) 10Krinkle: [C: 032] "Deployed to Jenkins." [integration/config] - 10https://gerrit.wikimedia.org/r/177484 (owner: 10Krinkle) [04:30:04] PROBLEM - Puppet failure on deployment-videoscaler01 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [04:30:38] PROBLEM - Puppet failure on deployment-jobrunner01 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [04:32:21] (03CR) 10jenkins-bot: [V: 04-1] Create citoid-npm job and enable (non-voting) [integration/config] - 10https://gerrit.wikimedia.org/r/177484 (owner: 10Krinkle) [04:42:25] (03CR) 10Krinkle: "Hm.. that's a first." 
[integration/config] - 10https://gerrit.wikimedia.org/r/177484 (owner: 10Krinkle) [04:42:29] (03CR) 10Krinkle: [C: 032] Create citoid-npm job and enable (non-voting) [integration/config] - 10https://gerrit.wikimedia.org/r/177484 (owner: 10Krinkle) [04:46:42] (03Merged) 10jenkins-bot: Create citoid-npm job and enable (non-voting) [integration/config] - 10https://gerrit.wikimedia.org/r/177484 (owner: 10Krinkle) [04:47:00] Okay, now you're just being weird zuul-layout [05:00:05] RECOVERY - Puppet failure on deployment-videoscaler01 is OK: OK: Less than 1.00% above the threshold [0.0] [05:35:38] Project browsertests-VisualEditor-test2.wikipedia.org-linux-firefox-sauce build #342: FAILURE in 1 hr 21 min: https://integration.wikimedia.org/ci/job/browsertests-VisualEditor-test2.wikipedia.org-linux-firefox-sauce/342/ [05:51:04] PROBLEM - Puppet failure on deployment-videoscaler01 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [05:55:38] RECOVERY - Puppet failure on deployment-jobrunner01 is OK: OK: Less than 1.00% above the threshold [0.0] [06:21:04] RECOVERY - Puppet failure on deployment-videoscaler01 is OK: OK: Less than 1.00% above the threshold [0.0] [06:31:37] PROBLEM - Puppet failure on deployment-jobrunner01 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [06:37:29] PROBLEM - Puppet failure on deployment-cache-bits01 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [06:51:19] Project browsertests-Core-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #332: FAILURE in 20 min: https://integration.wikimedia.org/ci/job/browsertests-Core-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/332/ [06:52:04] PROBLEM - Puppet failure on deployment-videoscaler01 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [06:56:48] PROBLEM - Puppet failure on deployment-restbase01 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [07:02:30] RECOVERY - Puppet 
failure on deployment-cache-bits01 is OK: OK: Less than 1.00% above the threshold [0.0] [07:17:09] RECOVERY - Puppet failure on deployment-videoscaler01 is OK: OK: Less than 1.00% above the threshold [0.0] [07:21:48] RECOVERY - Puppet failure on deployment-restbase01 is OK: OK: Less than 1.00% above the threshold [0.0] [08:13:23] PROBLEM - Puppet failure on deployment-videoscaler01 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [08:18:17] aha! [08:20:24] 3Scrum-of-Scrums, Beta-Cluster: Create labs ldap user and group "apache" to be used for NFS4 access in deployment-prep project - https://phabricator.wikimedia.org/T76086#809300 (10yuvipanda) [08:38:03] RECOVERY - Puppet failure on deployment-videoscaler01 is OK: OK: Less than 1.00% above the threshold [0.0] [08:52:16] Project browsertests-Flow-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce build #362: FAILURE in 55 min: https://integration.wikimedia.org/ci/job/browsertests-Flow-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce/362/ [09:02:38] PROBLEM - Puppet failure on deployment-logstash1 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [09:22:07] Project browsertests-VisualEditor-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #447: FAILURE in 1 hr 14 min: https://integration.wikimedia.org/ci/job/browsertests-VisualEditor-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/447/ [09:27:39] RECOVERY - Puppet failure on deployment-logstash1 is OK: OK: Less than 1.00% above the threshold [0.0] [09:29:05] PROBLEM - Puppet failure on deployment-videoscaler01 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [09:33:47] (03PS2) 10Zfilipin: mediawiki-ruby-api-bundle-rubocop is now voting [integration/config] - 10https://gerrit.wikimedia.org/r/177231 (owner: 10Hashar) [09:34:47] (03CR) 10Zfilipin: [C: 032] mediawiki-ruby-api-bundle-rubocop is now voting [integration/config] - 10https://gerrit.wikimedia.org/r/177231 (owner: 10Hashar) [09:35:46] 
(03Merged) 10jenkins-bot: mediawiki-ruby-api-bundle-rubocop is now voting [integration/config] - 10https://gerrit.wikimedia.org/r/177231 (owner: 10Hashar) [09:45:48] Yippee, build fixed! [09:45:48] Project browsertests-MobileFrontend-en.m.wikipedia.beta.wmflabs.org-linux-chrome-sauce build #369: FIXED in 1 hr 9 min: https://integration.wikimedia.org/ci/job/browsertests-MobileFrontend-en.m.wikipedia.beta.wmflabs.org-linux-chrome-sauce/369/ [09:56:37] RECOVERY - Puppet failure on deployment-jobrunner01 is OK: OK: Less than 1.00% above the threshold [0.0] [09:57:27] (03CR) 10Hashar: "[integration-zuul-layoutvalidation-gate] 04:29:12 Job citoid-npm not defined" [integration/config] - 10https://gerrit.wikimedia.org/r/177484 (owner: 10Krinkle) [09:59:03] RECOVERY - Puppet failure on deployment-videoscaler01 is OK: OK: Less than 1.00% above the threshold [0.0] [10:04:32] All config deployments to labs are stuck: https://integration.wikimedia.org/zuul/ [10:21:51] (03CR) 10Hashar: "I have deployed the configuration on Zuul server (still has to be done manually)." [integration/config] - 10https://gerrit.wikimedia.org/r/177231 (owner: 10Hashar) [10:32:37] PROBLEM - Puppet failure on deployment-jobrunner01 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [10:57:37] RECOVERY - Puppet failure on deployment-jobrunner01 is OK: OK: Less than 1.00% above the threshold [0.0] [11:12:05] Yippee, build fixed! [11:12:05] Project browsertests-MobileFrontend-test2.m.wikipedia.org-linux-firefox-sauce build #327: FIXED in 1 hr 23 min: https://integration.wikimedia.org/ci/job/browsertests-MobileFrontend-test2.m.wikipedia.org-linux-firefox-sauce/327/ [11:19:01] Yippee, build fixed! 
[11:19:02] Project browsertests-Flow-en.wikipedia.beta.wmflabs.org-windows_8-internet_explorer-sauce build #313: FIXED in 43 min: https://integration.wikimedia.org/ci/job/browsertests-Flow-en.wikipedia.beta.wmflabs.org-windows_8-internet_explorer-sauce/313/ [11:22:34] 3Release-Engineering: Add in Phabricator quarterly milestones for RelEng - https://phabricator.wikimedia.org/T75729#816441 (10Qgil) The context of those Release Engineering quarterly goals is https://www.mediawiki.org/wiki/Wikimedia_Engineering/2014-15_Goals Meaning: at least every WMF Enginnering and Product t... [11:23:54] Who can fix stuck zuul? [11:24:15] https://integration.wikimedia.org/zuul/ [11:25:07] eh, I mean mwconfig deployments on beta [11:50:05] PROBLEM - Puppet failure on deployment-videoscaler01 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [12:07:31] PROBLEM - Puppet failure on deployment-eventlogging02 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [12:28:39] PROBLEM - Puppet failure on deployment-jobrunner01 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [12:35:16] can someone kick jenkins? [12:35:42] https://integration.wikimedia.org/ci/view/BrowserTests/view/Wikidata/job/beta-update-databases-eqiad/5921/console is running since 12 hours and stuff is piling up [12:37:29] RECOVERY - Puppet failure on deployment-eventlogging02 is OK: OK: Less than 1.00% above the threshold [0.0] [12:40:05] RECOVERY - Puppet failure on deployment-videoscaler01 is OK: OK: Less than 1.00% above the threshold [0.0] [12:50:54] Tobi_WMDE_SW: yes, it seems stuck. Waiting for it too. 
[13:18:38] RECOVERY - Puppet failure on deployment-jobrunner01 is OK: OK: Less than 1.00% above the threshold [0.0] [13:28:26] 3Continuous-Integration: Config changes to deployment prep are not deployed - https://phabricator.wikimedia.org/T76714#818748 (10Nikerabbit) [13:29:37] PROBLEM - Puppet failure on deployment-jobrunner01 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [13:36:39] 3Continuous-Integration: Config changes to deployment prep are not deployed - https://phabricator.wikimedia.org/T76714#818774 (10KartikMistry) [13:37:32] 3Continuous-Integration: Config changes to deployment prep are not deployed - https://phabricator.wikimedia.org/T76714#818748 (10KartikMistry) [13:59:37] RECOVERY - Puppet failure on deployment-jobrunner01 is OK: OK: Less than 1.00% above the threshold [0.0] [14:10:39] 3Continuous-Integration: Config changes to deployment prep are not deployed - https://phabricator.wikimedia.org/T76714#818802 (10hashar) [14:11:51] 3Continuous-Integration: Config changes to deployment prep are not deployed - https://phabricator.wikimedia.org/T76714#818806 (10hashar) [14:19:37] hashar: thanks! [14:28:43] 3Continuous-Integration: Config changes to deployment prep are not deployed - https://phabricator.wikimedia.org/T76714#818827 (10Nikerabbit) [14:30:57] Nikerabbit: kart_: yeah Jenkins deadlock from time to time [14:31:12] Nikerabbit: kart_ though apparently the issue magically solved by canceling a few jobs [14:44:25] Yippee, build fixed! 
[14:44:25] Project beta-scap-eqiad build #32409: FIXED in 27 min: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/32409/ [14:50:38] PROBLEM - Puppet failure on deployment-jobrunner01 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [15:17:00] Release-Engineering: Add in Phabricator quarterly milestones for RelEng - https://phabricator.wikimedia.org/T75729#818893 (Aklapper) (And for the record: Project descriptions need to be clear that such quarters refer to the Gregorian calendar and not to the fiscal year of some random country.) [16:15:37] RECOVERY - Puppet failure on deployment-jobrunner01 is OK: OK: Less than 1.00% above the threshold [0.0] [16:31:36] PROBLEM - Puppet failure on deployment-jobrunner01 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [16:51:10] Scrum-of-Scrums, Beta-Cluster: Create labs ldap user and group "apache" to be used for NFS4 access in deployment-prep project - https://phabricator.wikimedia.org/T76086#819074 (bd808) >>! In T76086#809300, @yuvipanda wrote: > Hmm, it's causing puppet failures already: > > ```Error: /Stage[main]/Mediawiki::Us... 
[16:53:15] 3Scrum-of-Scrums, Beta-Cluster: Create labs ldap user and group "apache" to be used for NFS4 access in deployment-prep project - https://phabricator.wikimedia.org/T76086#819076 (10chasemp) [17:04:31] Project browsertests-Wikidata-WikidataTests-linux-firefox-sauce build #62: FAILURE in 2 hr 42 min: https://integration.wikimedia.org/ci/job/browsertests-Wikidata-WikidataTests-linux-firefox-sauce/62/ [17:05:32] 3Scrum-of-Scrums, Beta-Cluster: Create labs ldap user and group "apache" to be used for NFS4 access in deployment-prep project - https://phabricator.wikimedia.org/T76086#819089 (10coren) [17:08:17] 3Scrum-of-Scrums, Beta-Cluster: Create labs ldap user and group "apache" to be used for NFS4 access in deployment-prep project - https://phabricator.wikimedia.org/T76086#819103 (10coren) Just so that it is clear: the actual UID does not need to be the same. You can add the user in the node /labstore100[12]\.eqi... [17:15:53] Yippee, build fixed! [17:15:53] Project browsertests-Core-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #333: FIXED in 21 min: https://integration.wikimedia.org/ci/job/browsertests-Core-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/333/ [17:29:31] (03PS1) 10Jforrester: Add link to already-published UnicodeJS docs [integration/docroot] - 10https://gerrit.wikimedia.org/r/177570 [17:29:33] (03PS1) 10Jforrester: Split the first bullet list into three: MW, Front-End and Operations [integration/docroot] - 10https://gerrit.wikimedia.org/r/177571 [17:32:50] (03CR) 10Esanders: [C: 031] Add link to already-published UnicodeJS docs [integration/docroot] - 10https://gerrit.wikimedia.org/r/177570 (owner: 10Jforrester) [17:50:30] 3Scrum-of-Scrums, Beta-Cluster: beta cluster: deployment-cache-upload02 does not seem to purge content when getting PURGE - https://phabricator.wikimedia.org/T67683#819145 (10BBlack) a:3BBlack [17:53:32] 3Scrum-of-Scrums, Beta-Cluster: beta cluster: deployment-cache-upload02 does not seem to purge content when 
getting PURGE - https://phabricator.wikimedia.org/T67683#819151 (10BBlack) Fix going in here, assuming it works: https://gerrit.wikimedia.org/r/#/c/177576/ (also, apologize for the delay, but relatedly: i... [17:56:05] 3Continuous-Integration, Release-Engineering: [upstream] Jenkins Gearman plugin has deadlock on executor threads (was: Beta Cluster stopped receiving code updates (beta-update-databases-eqiad hung) - https://phabricator.wikimedia.org/T72597#819153 (10greg) [17:56:37] RECOVERY - Puppet failure on deployment-jobrunner01 is OK: OK: Less than 1.00% above the threshold [0.0] [18:03:17] (03CR) 10Krinkle: [C: 032] Add link to already-published UnicodeJS docs [integration/docroot] - 10https://gerrit.wikimedia.org/r/177570 (owner: 10Jforrester) [18:03:19] (03Merged) 10jenkins-bot: Add link to already-published UnicodeJS docs [integration/docroot] - 10https://gerrit.wikimedia.org/r/177570 (owner: 10Jforrester) [18:11:04] PROBLEM - Puppet failure on deployment-videoscaler01 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [18:18:47] 3Scrum-of-Scrums, Beta-Cluster: Create labs ldap user and group "apache" to be used for NFS4 access in deployment-prep project - https://phabricator.wikimedia.org/T76086#819201 (10Andrew) ok... I'm happy to do whatever's needed but I'm now confused. It sounds like the right solution is something like: a) remo... [18:32:36] PROBLEM - Puppet failure on deployment-jobrunner01 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [18:35:07] 3Scrum-of-Scrums, Beta-Cluster: beta cluster: deployment-cache-upload02 does not seem to purge content when getting PURGE - https://phabricator.wikimedia.org/T67683#819224 (10greg) >>! In T67683#819151, @BBlack wrote: > Fix going in here, assuming it works: https://gerrit.wikimedia.org/r/#/c/177576/ Thanks! >... 
[18:49:53] 3Scrum-of-Scrums, Beta-Cluster: Create labs ldap user and group "apache" to be used for NFS4 access in deployment-prep project - https://phabricator.wikimedia.org/T76086#819254 (10Andrew) [18:50:06] 3Release-Engineering: Add in Phabricator quarterly milestones for RelEng - https://phabricator.wikimedia.org/T75729#819256 (10greg) 1) I'm fine with generic wmf quarterly goal workboards, but I don't think we should do it until other teams agree that it's what we all should use to track wmf teams' quarterly goal... [18:56:06] RECOVERY - Puppet failure on deployment-videoscaler01 is OK: OK: Less than 1.00% above the threshold [0.0] [18:57:38] RECOVERY - Puppet failure on deployment-jobrunner01 is OK: OK: Less than 1.00% above the threshold [0.0] [19:03:20] Project browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-windows_8.1-internet_explorer-11-sauce build #169: FAILURE in 37 min: https://integration.wikimedia.org/ci/job/browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-windows_8.1-internet_explorer-11-sauce/169/ [19:15:31] Project beta-scap-eqiad build #32438: FAILURE in 1 min 27 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/32438/ [19:25:29] PROBLEM - Puppet failure on deployment-parsoid04 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [19:35:21] Yippee, build fixed! 
[19:35:21] Project beta-scap-eqiad build #32440: FIXED in 1 min 19 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/32440/ [19:42:19] PROBLEM - Puppet failure on deployment-mediawiki03 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [19:48:32] PROBLEM - Puppet failure on deployment-eventlogging02 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [19:50:29] RECOVERY - Puppet failure on deployment-parsoid04 is OK: OK: Less than 1.00% above the threshold [0.0] [20:07:20] RECOVERY - Puppet failure on deployment-mediawiki03 is OK: OK: Less than 1.00% above the threshold [0.0] [20:18:30] RECOVERY - Puppet failure on deployment-eventlogging02 is OK: OK: Less than 1.00% above the threshold [0.0] [21:21:42] Project beta-update-databases-eqiad build #5930: FAILURE in 1 min 41 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/5930/ [21:22:32] getting 503 from beta labs just now [21:23:22] asking ori on -tech [21:25:07] Project browsertests-UploadWizard-commons.wikimedia.beta.wmflabs.org-linux-firefox-sauce build #341: FAILURE in 10 min: https://integration.wikimedia.org/ci/job/browsertests-UploadWizard-commons.wikimedia.beta.wmflabs.org-linux-firefox-sauce/341/ [21:25:43] gah, I am on the VERY FINAL STEP of the ENTIRE REFACTOR OF ALL OF MOBILEFRONTEND [21:28:30] bd808: if you're interested, beta labs is dying fast right now [21:28:39] greg-g: ^^ [21:28:57] blerg [21:29:07] yah [21:29:29] bits just stopped responding I think [21:29:48] this is when looking at logstash/etc is useful, btw [21:29:59] "Undefined variable: wmgUseContentTranslationCluster in /srv/mediawiki/wmf-config/CommonSettings-labs.php on line 105" [21:30:17] somebody merged config that needs an update in beta [21:30:37] Also "Fatal error: Uncaught exception 'ConfigException' with message 'GlobalVarConfig::get: undefined option: 'UDPProfilerPort'' in /srv/mediawiki/php-master/includes/config/GlobalVarConfig.php:53" [21:30:41] 
also from... yeah [21:30:41] 21:21:08 [0ace469b] [no req] ConfigException from line 53 of /mnt/srv/mediawiki-staging/php-master/includes/config/GlobalVarConfig.php: GlobalVarConfig::get: undefined option: 'UDPProfilerPort' [21:30:49] that's what broke the dbupdate run [21:30:51] sucks that that is so fatal [21:31:00] https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/label=deployment-bastion-eqiad,wikidb=simplewiki/5930/console [21:31:07] jenkins points out the problem ^ [21:31:15] 16:21 < wmf-insec> Project beta-update-databases-eqiad build #5930: FAILURE in 1 min 41 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/5930/ [21:31:18] 16:22 <+chrismcma> getting 503 from beta labs just now [21:31:21] :) [21:31:39] So yeah, find somebody to fix those things... [21:32:06] the udp bit is the profiler that Aaron has been working on I suspect [21:32:18] or... maybe not [21:32:24] Project browsertests-WikiLove-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #298: FAILURE in 16 sec: https://integration.wikimedia.org/ci/job/browsertests-WikiLove-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/298/ [21:32:43] /srv/mediawiki/php-master/includes/StatCounter.php(86): StatCounter->sendDeltasUDP() [21:32:43] StatCounter [21:32:55] Both look like config problems [21:33:32] ok, one of them is from ^d https://gerrit.wikimedia.org/r/#/c/177561/ [21:34:10] maybe both [21:34:42] heh. so StatCounter was piggybacking on another component's config [21:34:48] shame [21:36:02] wtf is wmgUseContentTranslationCluster? is that just not fatal? 
[21:36:27] yeah that's not fatal but it should get fixed [21:36:59] wmg* are feature flags in wmf-config that folks often forget to set properly for beta [21:37:40] indeed [21:38:09] 3Wikimedia-Labs-General, Beta-Cluster: Pages are appearing in a broken way in Betalabs - https://phabricator.wikimedia.org/T76777#819681 (10Ryasmeen) [21:38:28] Project browsertests-Echo-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce build #206: FAILURE in 6 min 2 sec: https://integration.wikimedia.org/ci/job/browsertests-Echo-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce/206/ [21:38:33] chrismcmahon: So now that we have hhvm errors properly in logstash, the first place to go when things start breaking is -- https://logstash-beta.wmflabs.org/#/dashboard/elasticsearch/default [21:38:43] Project browsertests-PageTriage-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #266: FAILURE in 4.6 sec: https://integration.wikimedia.org/ci/job/browsertests-PageTriage-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/266/ [21:39:32] bd808: I don't seem to have credentials for that [21:39:45] chrismcmahon: I'll PM them to you [21:41:16] 3Wikimedia-Labs-General, Beta-Cluster: Pages are appearing in a broken way in Betalabs - https://phabricator.wikimedia.org/T76777#819689 (10greg) p:5Triage>3Unbreak! Confirmed, this is due to a config breakage. @chad / @Aaron: maybe due to https://gerrit.wikimedia.org/r/#/c/177561/ See also: https://integr... [21:41:24] Yippee, build fixed! 
[21:41:24] Project browsertests-CirrusSearch-test2.wikipedia.org-linux-firefox-sauce build #300: FIXED in 2 min 55 sec: https://integration.wikimedia.org/ci/job/browsertests-CirrusSearch-test2.wikipedia.org-linux-firefox-sauce/300/ [21:41:33] Project browsertests-MobileFrontend-en.m.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #404: FAILURE in 10 sec: https://integration.wikimedia.org/ci/job/browsertests-MobileFrontend-en.m.wikipedia.beta.wmflabs.org-linux-firefox-sauce/404/ [21:41:37] 3Wikimedia-Labs-General, Beta-Cluster: Pages are appearing in a broken way in Betalabs - https://phabricator.wikimedia.org/T76777#819692 (10greg) [21:42:42] 3Beta-Cluster: Pages are appearing in a broken way in Betalabs - https://phabricator.wikimedia.org/T76777#819660 (10greg) [21:50:14] 3Release-Engineering, Beta-Cluster: Make logstash in beta public - https://phabricator.wikimedia.org/T76784#819741 (10bd808) [21:50:30] 3Wikimedia-Logstash, Release-Engineering, Beta-Cluster: Make logstash in beta public - https://phabricator.wikimedia.org/T76784#819741 (10bd808) [21:52:04] 3Wikimedia-Logstash, Release-Engineering, Beta-Cluster: Make logstash in beta public - https://phabricator.wikimedia.org/T76784#819741 (10bd808) [21:53:49] man, and I think the current scap just missed the revert [21:53:53] * greg-g waits [21:55:00] as bd808 said: [21:55:25] 14:34 bd808: heh. so StatCounter was piggybacking on another component's config [21:55:26] 14:34 bd808: shame [21:55:52] I've started to call this "stunt programming" and it is EVERYWHERE [21:55:57] 3Beta-Cluster: Pages are appearing in a broken way in Betalabs - https://phabricator.wikimedia.org/T76777#819769 (10greg) After this current scap run in beta, it should be fixed, I hope: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/32452/ (actually, looks like it just missed it, I'll kick off another... [21:56:27] It's easy to be lazy :( [21:56:37] Just try being green. 
[21:56:46] * chrismcmahon grumbles and shuts up [21:56:54] And when we already have hundreds of config vars folks tend to try and keep from making more [21:57:10] chrismcmahon: you're not wrong, it's a horrible practice [21:57:39] the complexity of MW's config system is astounding and yet insufficient [21:57:57] bd808: yeah, it is the difference between "code reuse" and "tight coupling" that seems poorly understood [21:58:29] * greg-g mumbles about how long scap is taking on beta [21:59:58] <^d> bd808: I've got a new plan. If a class annoys me, I'm just going to delete it. [21:59:58] <^d> Move fast and /delete stuff/ [22:00:01] it's been getting slower and slower [22:00:01] l10n is usually the cause [22:00:05] ^d: +2 [22:00:21] In a week we'll have a manageable codebase [22:00:31] like 3 classes or so [22:00:50] <^d> Title, Article and Parser. [22:00:53] <^d> It'll be like 2003 all over again. [22:01:17] just finished https://integration.wikimedia.org/ci/job/beta-scap-eqiad/32452/console [22:01:27] (duration: 26m 21s) [22:01:39] wheeee [22:01:39] Finished mw-update-l10n (duration: 15m 46s) [22:01:53] plus "Finished mw-update-l10n (duration: 15m 46s)" [22:01:55] pretty sure that one missed the revert, chrismcmahon [22:02:03] er "Finished scap-rebuild-cdbs (duration: 04m 10s)" [22:02:22] * greg-g nods [22:02:23] so ~20 minutes of the 26 for l10n [22:02:35] plus whatever that added to the rsync [22:02:36] stupid languages [22:02:49] stupid overloaded deployment-bastion too [22:02:59] that box is sluggish all the time [22:03:03] here we go again :) https://integration.wikimedia.org/ci/job/beta-scap-eqiad/32453/console [22:03:17] Finished mw-update-l10n (duration: 00m 17s) [22:03:31] PROBLEM - Puppet failure on deployment-cache-bits01 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [22:03:40] gah, really [22:03:46] what does that mean? 
^ [22:03:53] oh puppet [22:03:57] that's weird, I mean, I'd expect it before, not now [22:04:03] oh, puppet, right [22:05:03] "Error: Could not retrieve catalog from remote server: Error 400 on SERVER: keys(): Requires hash to work with at /etc/puppet/manifests/role/cache.pp:388 on node i-0000007f.eqiad.wmflabs" [22:05:22] bits is still not right I think [22:05:25] somebody merged something that don't work in beta... :( [22:05:34] seriously [22:06:49] * YuviPanda should really fix the messages shinken-wm spouts [22:06:52] some CentralNotice stuff maybe? I don't see any other likely merges [22:07:01] but for now, it saying *anything* about puppet failures means 'yo, puppet failures' [22:07:11] https://github.com/wikimedia/operations-puppet/commit/b27fb0d40f681615d9899f8de12aecfd76ac696e [22:07:33] ori is pinging ottomata about it [22:07:34] commit 2 weeks ago? [22:07:47] uh, missed that [22:07:54] * greg-g just assumed the commit [22:08:32] "$role::analytics::kafka::config::cluster_config['eqiad']" [22:08:40] That's not going to work in beta [22:08:51] oh wait, things might be OK (ish) [22:09:30] * bd808 goes back to work [22:17:01] thanks bd808 [22:17:15] https://phabricator.wikimedia.org/T76799 for the puppet failure [22:20:14] Beta-Cluster: Pages are appearing in a broken way in Betalabs - https://phabricator.wikimedia.org/T76777#819923 (greg) Open>Resolved a:greg Should now be fixed. Let me know if not. [22:21:37] Yippee, build fixed! 
[22:21:37] Project beta-update-databases-eqiad build #5931: FIXED in 1 min 4 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/5931/ [22:37:26] Project browsertests-MobileFrontend-en.m.wikipedia.beta.wmflabs.org-linux-chrome-sauce build #370: FAILURE in 1 hr 12 min: https://integration.wikimedia.org/ci/job/browsertests-MobileFrontend-en.m.wikipedia.beta.wmflabs.org-linux-chrome-sauce/370/ [22:48:00] !log manually rebased puppet on deployment-prep [22:48:05] Logged the message, Master [22:49:20] chrismcmahon: the latest betalabs outage was a bits breakage, right? [22:49:25] which is why the shinken test didn't catch it [22:49:32] chrismcmahon: I should probably add a check for bits as well. [22:49:49] I think that'll be useful... [22:51:07] YuviPanda: I think it was an everything breakage, this was the root cause: https://gerrit.wikimedia.org/r/#/c/177561/ [22:51:54] chrismcmahon: hmm, the image in https://phabricator.wikimedia.org/T76777 looks like a bits failure? [22:52:39] YuviPanda: bits was definitely 503, but I think that config problem killed everything [22:53:02] hmm, I wonder why the shinken alert didn't trigger [22:53:41] chrismcmahon: do you know if enwiki etc was 503 too? [22:55:32] Project beta-scap-eqiad build #32459: FAILURE in 1 min 21 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/32459/ [22:56:43] YuviPanda: enwiki was also 503, I was getting 503 on login attempts before bits went down [22:56:52] oh, hmmmmmm. [22:57:03] so the check hits the home page and checks for the string 'Wikipedia' [22:57:07] I suppose that might've still been up [22:57:55] YuviPanda: ah, yeah, the page was showing up from being cached, but my Selenium tests were smart enough to report the 503 returned [22:58:06] aaaah, interesting. [22:58:13] chrismcmahon: hmm, should I make it hit the debug=true URL? [22:58:17] that would've caught this, probably [22:58:25] YuviPanda: that would be amusing, go for it :-) [22:58:31] heh, alright! 
[22:59:08] YuviPanda: I just happened to be running browser tests over and over right as beta labs was falling over [22:59:29] *running browser tests locally* [22:59:30] chrismcmahon: right :) [22:59:40] chrismcmahon: I just want to replace some parts of you with a shell script :) [23:00:35] I was trying to merge my 48th and final refactoring patch set to MobileFrontend :-) Tech Debt Go Home. [23:00:46] hehe [23:00:47] nice [23:01:54] YuviPanda: chrismcmahon, it was just bits [23:01:59] I was getting text, just no css/js [23:02:14] hmm, right. [23:02:22] greg-g: yeah? I was failing login with 503s I'm pretty sure [23:02:23] I'll add a bits check as well [23:02:33] I could be wrong [23:02:38] I was already logged in, so didn't test that [23:03:09] greg-g: chrismcmahon I think after every betacluster failure discovered by a human being testing, we should see if we can put checks in place for that [23:03:27] RECOVERY - Puppet failure on deployment-cache-bits01 is OK: OK: Less than 1.00% above the threshold [0.0] [23:03:29] YuviPanda: that is an excellent policy [23:04:33] chrismcmahon: :D can someone remember to include me in any outage reports, etc? [23:04:38] well, or tasks [23:04:39] or discussions [23:04:42] whichever :) [23:05:24] Yippee, build fixed! [23:05:24] Project beta-scap-eqiad build #32460: FIXED in 1 min 20 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/32460/ [23:06:27] Project browsertests-MobileFrontend-test2.m.wikipedia.org-linux-firefox-sauce build #328: FAILURE in 1 hr 22 min: https://integration.wikimedia.org/ci/job/browsertests-MobileFrontend-test2.m.wikipedia.org-linux-firefox-sauce/328/ [23:06:42] YuviPanda: yes. ryasmeen|Away often is the first reporter of beta labs issues and greg-g is tracking this stuff closely, so that is a good reporting axis to work with
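Editor's sketch of the check discussed above: the existing monitor reportedly fetched the home page and looked for the string 'Wikipedia', which a cached page can still satisfy while the backend returns 503. The Python below is a minimal illustration, not the actual Shinken check. The names with_debug, evaluate and probe are hypothetical, the example.org URLs are placeholders, and the assumption that appending debug=true gets past the cache comes only from the suggestion in the log.

```python
from urllib.error import HTTPError
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit
from urllib.request import urlopen


def with_debug(url):
    """Return url with debug=true in its query string (replacing any old value).

    Assumption from the chat: a debug=true request is not served from cache.
    """
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != "debug"]
    query.append(("debug", "true"))
    return urlunsplit(parts._replace(query=urlencode(query)))


def evaluate(status, body, expected_text):
    """Judge a response: the status code is checked before the body text."""
    if status != 200:
        return (False, f"HTTP {status}")
    if expected_text not in body:
        return (False, f"missing text {expected_text!r}")
    return (True, "OK")


def probe(url, expected_text, timeout=10):
    """Fetch url with debug=true and evaluate the response."""
    try:
        with urlopen(with_debug(url), timeout=timeout) as resp:
            status = resp.status
            body = resp.read().decode("utf-8", errors="replace")
    except HTTPError as err:
        # urlopen raises on 4xx/5xx; keep the status so a 503 is reported.
        status = err.code
        body = err.read().decode("utf-8", errors="replace")
    except OSError as err:
        # Covers URLError, connection failures and timeouts.
        return (False, f"unreachable: {err}")
    return evaluate(status, body, expected_text)
```

The same probe could be pointed at a text page and at a bits resource, so that a failure of either one raises its own alert, which is the extra check proposed in the log.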