[00:00:07] greg-g: Special:NovaSudoers or something like that?
[00:00:19] https://wikitech.wikimedia.org/wiki/Special:NovaSudoer
[00:00:35] ah, yeah
[00:01:07] {{done}}
[00:01:18] not sure how long that takes to take effect
[00:01:21] works, thanks
[00:01:43] np and thank you for flying Beta Cluster Air.
[00:06:17] Beta-Cluster, Release-Engineering: deployment-prep mobile sites are down - https://phabricator.wikimedia.org/T87821#1000068 (Krenair) Open>Resolved a:Krenair I just fixed this myself. ```root@deployment-cache-mobile03:~# ls -l /var/lib/varnish/ total 0 drwxr-xr-x 2 root root 100 Jan 28 00:17 deployme...
[00:06:49] RECOVERY - English Wikipedia Mobile Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 29491 bytes in 0.858 second response time
[00:07:52] yw, etc.
[00:19:27] !log Jenkins slave on deployment-bastion.eqiad has been stuck for the past 5 hours
[00:19:31] Logged the message, Master
[00:19:58] greg-g: ^ It's not affecting the other CI parts, and I don't know what causes this. This, too, happens about every other day. I suspect beta is not being updated right now.
[00:20:12] https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/7202/
[00:20:46] Beta-Cluster: Searching for images in VisualEditor with Firefox is painfully slow in Betalabs - https://phabricator.wikimedia.org/T87676#1000116 (Ryasmeen)
[00:20:57] Krinkle: Can we make a cron job that restarts the slave on deployment-bastion without breaking other things?
[00:21:06] James_F: Restarting it doesn't fix it.
[00:21:14] Krinkle: What does?
[00:21:30] James_F: Restarting the slave daemon on deployment-bastion
[00:21:36] Does not fix the clog
[00:21:49] Krinkle: … yes. So what does fix it?
[00:22:17] * Reedy hands James_F a toilet plunger
[00:22:22] I don't know. This is the only recurring failure I've never been able to solve. It just goes away at some point. Presumably beta labs maintainers or Antoine do something that (un)intentionally resets it
[00:22:39] Anyway, I don't have cycles for proactive helpfulness. It's not my problem.
[00:22:44] Sure.
[00:26:20] !log integration-slave1007 rm -rf /mnt/jenkins-workspace/workspace/oojs*
[00:26:24] Logged the message, Master
[00:26:26] Krinkle: Thanks.
[00:34:15] greg-g: Do you know what fixes this? A full Jenkins restart?
[00:41:11] This isn't https://www.mediawiki.org/wiki/Continuous_integration/Jenkins#Hung_beta_code.2Fdb_update is it?
[00:43:40] Krenair: Looks like.
[00:47:26] !log running instructions at https://www.mediawiki.org/wiki/Continuous_integration/Jenkins#Hung_beta_code.2Fdb_update
[00:47:30] Logged the message, Master
[00:47:55] (PS6) Krinkle: [WIP] Add qunit-karma macro [integration/config] - https://gerrit.wikimedia.org/r/186934
[00:49:34] RECOVERY - Content Translation Server on deployment-cxserver03 is OK: HTTP OK: HTTP/1.1 200 OK - 1103 bytes in 0.025 second response time
[00:54:38] greg-g: bd808: Was job 'beta-scap-eqiad' changed in the Jenkins UI directly, or was that a test and is it now back to what is in jjb-config? It should take Antoine, Bryan, Chad or me less than 10 minutes to compile/push a simple job change like that. They should not be changed in Jenkins directly. It would likely be undone the next time any sort of deployment happens.
[00:54:55] I can compile/push https://gerrit.wikimedia.org/r/#/c/184502 now if you like.
[00:55:33] Krinkle: yeah, I manually changed it at some point to see if it was the right choice. Merging and updating from the patch would be great I think
[00:55:46] bd808: I'm trying to correlate the change to the change.
[00:55:52] Which UI component correlates to this?
[00:56:27] * bd808 looks
[00:57:09] IRC Notification > Notification strategy I think
[00:57:48] hmmm... that patch changes for email
[00:57:50] not irc
[00:58:01] I changed irc which is I think what we wanted
[01:04:57] I'm seeing "chmod +rx /var/lib/varnish/*" as being the solution du jour, what's up with that?
[01:13:03] greg-g, Reedy found the permissions on /var/lib/varnish/frontend were broken earlier, which caused varnish-frontend to fail to start on the upload cache vm
[01:13:30] I applied the same thing on the mobile cache vm
[01:18:58] (PS3) Krinkle: Complain on each scap job failure [integration/config] - https://gerrit.wikimedia.org/r/184502 (https://phabricator.wikimedia.org/T84947) (owner: Greg Grossmeier)
[01:19:05] (PS4) Krinkle: Complain on each scap job failure [integration/config] - https://gerrit.wikimedia.org/r/184502 (https://phabricator.wikimedia.org/T84947) (owner: Greg Grossmeier)
[01:21:28] bd808: Travis CI is pushing an environmental update on Feb 3 that should make cdb possible.
[01:21:36] bd808: https://gerrit.wikimedia.org/r/#/c/175317/ https://github.com/travis-ci/travis-cookbooks/pull/401#issuecomment-70025395
[01:22:05] I saw that. Pretty cool that it will finally work (unless I messed up the config change for them)
[01:27:12] (CR) Krinkle: [C: 2] Complain on each scap job failure [integration/config] - https://gerrit.wikimedia.org/r/184502 (https://phabricator.wikimedia.org/T84947) (owner: Greg Grossmeier)
[01:28:15] (CR) Krinkle: [C: -1] "Environment needs MW_SERVER and MW_SCRIPT_PATH to be set." [integration/config] - https://gerrit.wikimedia.org/r/186934 (owner: Krinkle)
[01:34:00] (Merged) jenkins-bot: Complain on each scap job failure [integration/config] - https://gerrit.wikimedia.org/r/184502 (https://phabricator.wikimedia.org/T84947) (owner: Greg Grossmeier)
[01:39:44] !log Restarting Jenkins because deployment-bastion.eqiad isn't depooling even after restart.
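Editor's note on the 01:04 to 01:13 exchange: the fix mentioned there was `chmod +rx /var/lib/varnish/*`. A minimal Python sketch of the same idea follows, assuming the problem is simply that some entries under a directory lack read/execute bits for group or others. The helper names `entries_missing_rx` and `add_rx` are made up for illustration and are not part of Varnish or any Wikimedia tooling; the real cause of the broken permissions is not established in this log.

```python
import os
import stat


def entries_missing_rx(parent):
    """List entries directly under `parent` that lack r+x for group or others."""
    wanted = stat.S_IRGRP | stat.S_IXGRP | stat.S_IROTH | stat.S_IXOTH
    missing = []
    for name in sorted(os.listdir(parent)):
        path = os.path.join(parent, name)
        mode = stat.S_IMODE(os.stat(path).st_mode)
        if mode & wanted != wanted:
            missing.append(path)
    return missing


def add_rx(path):
    """Add read and execute for user, group and others.

    Roughly what `chmod +rx path` does when the umask does not mask
    those bits (an assumption of this sketch).
    """
    mode = stat.S_IMODE(os.stat(path).st_mode)
    os.chmod(path, mode | 0o555)
```

Calling `entries_missing_rx("/var/lib/varnish")` before changing anything would show which entries are affected, which is easier to review than a blanket chmod.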
[01:39:50] Logged the message, Master
[01:59:15] operations, Beta-Cluster: Minimize differences between beta and production (Tracking) - https://phabricator.wikimedia.org/T87220#1000325 (yuvipanda)
[01:59:52] PROBLEM - Puppet failure on deployment-redis02 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[02:00:24] PROBLEM - Puppet failure on deployment-cxserver03 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [0.0]
[02:02:28] PROBLEM - Puppet failure on deployment-mediawiki04 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[02:05:23] RECOVERY - Puppet failure on deployment-cxserver03 is OK: OK: Less than 1.00% above the threshold [0.0]
[02:23:58] (PS1) Krinkle: Add mw-set-env-qunit.sh [integration/jenkins] - https://gerrit.wikimedia.org/r/187320
[02:26:06] MediaWiki-extensions-CodeEditor, Continuous-Integration: Jenkins jobs for CodeEditor need WikiEditor dependency installed - https://phabricator.wikimedia.org/T87838#1000350 (Krinkle) NEW
[02:26:21] (CR) Krinkle: [C: 2] Add mw-set-env-qunit.sh [integration/jenkins] - https://gerrit.wikimedia.org/r/187320 (owner: Krinkle)
[02:26:50] (Merged) jenkins-bot: Add mw-set-env-qunit.sh [integration/jenkins] - https://gerrit.wikimedia.org/r/187320 (owner: Krinkle)
[02:27:18] (PS7) Krinkle: [WIP] Add qunit-karma macro [integration/config] - https://gerrit.wikimedia.org/r/186934
[02:27:26] RECOVERY - Puppet failure on deployment-mediawiki04 is OK: OK: Less than 1.00% above the threshold [0.0]
[02:29:50] RECOVERY - Puppet failure on deployment-redis02 is OK: OK: Less than 1.00% above the threshold [0.0]
[02:36:33] Yippee, build fixed!
[02:36:33] Project beta-code-update-eqiad build #42233: FIXED in 3 min 32 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/42233/
[02:36:59] Project beta-scap-eqiad build #39335: FAILURE in 26 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/39335/
[02:55:38] Continuous-Integration: npm tests intermittently fail; npm cache needs purging - https://phabricator.wikimedia.org/T87666#1000387 (Jdforrester-WMF) p:Triage>High
[03:01:21] 03:00:14 /tmp/hudson5966398102575837696.sh: line 2: /srv/deployment/integration/slave-scripts/bin/mw-set-env-qunit.sh: Permission denied
[03:02:07] Continuous-Integration: Investigate npm cache-min option to speed up npm install - https://phabricator.wikimedia.org/T85961#1000395 (Krinkle) Open>declined a:Krinkle
[03:04:57] Continuous-Integration: npm tests intermittently fail; npm cache needs purging - https://phabricator.wikimedia.org/T87666#1000399 (Krinkle) Unlike most npm cache issues, this is not caused by {T76304}. This particular bug is not about incompatible `node_modules` directories between builds, but about the `.np...
[03:21:13] Project beta-update-databases-eqiad build #7204: FAILURE in 1 min 12 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/7204/
[03:24:14] (PS10) KartikMistry: WIP: Add generic npm-set-env to fix npm on */deploy repos [integration/config] - https://gerrit.wikimedia.org/r/184609
[03:38:51] PROBLEM - App Server bits response on deployment-mediawiki02 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:39:19] PROBLEM - App Server Main HTTP Response on deployment-mediawiki02 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:43:44] RECOVERY - App Server bits response on deployment-mediawiki02 is OK: HTTP OK: HTTP/1.1 200 OK - 3895 bytes in 0.003 second response time
[03:44:10] RECOVERY - App Server Main HTTP Response on deployment-mediawiki02 is OK: HTTP OK: HTTP/1.1 200 OK - 49118 bytes in 0.606 second response time
[03:53:03] Yippee, build fixed!
[03:53:03] Project browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-os_x_10.9-safari-sauce build #435: FIXED in 43 min: https://integration.wikimedia.org/ci/job/browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-os_x_10.9-safari-sauce/435/
[03:54:29] Project browsertests-Flow-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #507: FAILURE in 43 min: https://integration.wikimedia.org/ci/job/browsertests-Flow-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/507/
[04:26:46] Yippee, build fixed!
[04:26:47] Project browsertests-WikiLove-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #408: FIXED in 2 min 49 sec: https://integration.wikimedia.org/ci/job/browsertests-WikiLove-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/408/
[05:10:04] Beta-Cluster: Don't throttle WMF office IP(s) for account creation - https://phabricator.wikimedia.org/T87841#1000418 (greg) NEW
[05:11:10] Beta-Cluster, Release-Engineering: deployment-prep mobile sites are down - https://phabricator.wikimedia.org/T87821#1000424 (greg) From Brandon: ``` 22:33 < bblack> greg-g: re: varnish perms - it's something that was messed up earlier I think, but didn't rear its head until restarts....
[05:16:04] Yippee, build fixed!
[05:16:05] Project browsertests-MobileFrontend-test2.m.wikipedia.org-linux-firefox-sauce build #438: FIXED in 39 min: https://integration.wikimedia.org/ci/job/browsertests-MobileFrontend-test2.m.wikipedia.org-linux-firefox-sauce/438/
[05:16:43] Yippee, build fixed!
[05:16:43] Project browsertests-VisualEditor-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce build #484: FIXED in 19 min: https://integration.wikimedia.org/ci/job/browsertests-VisualEditor-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce/484/
[05:36:15] Beta-Cluster: Remove beta specific mediawiki roles - https://phabricator.wikimedia.org/T87210#1000435 (greg) p:Triage>Normal
[05:36:30] Beta-Cluster: Remove all ::beta roles in puppet - https://phabricator.wikimedia.org/T86644#1000438 (greg) p:Triage>Normal
[05:36:51] operations, Beta-Cluster: Set 'cluster' salt grain appropriately for all instances in beta cluster - https://phabricator.wikimedia.org/T87199#1000441 (greg) p:Triage>Normal
[05:37:30] Beta-Cluster: Searching for images in VisualEditor with Firefox is painfully slow in Beta Cluster - https://phabricator.wikimedia.org/T87676#1000444 (greg)
[05:49:03] Yippee, build fixed!
[05:49:03] Project browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #436: FIXED in 28 min: https://integration.wikimedia.org/ci/job/browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/436/
[06:08:26] Continuous-Integration: Puppet is causing changed/added files in 'slave-scripts' git::clone on integration slaves in labs to become root read-only - https://phabricator.wikimedia.org/T87843#1000457 (Krinkle) NEW
[06:08:39] Continuous-Integration: Puppet is causing changed/added files in 'slave-scripts' git::clone on integration slaves in labs to become root read-only - https://phabricator.wikimedia.org/T87843#1000464 (Krinkle) p:Triage>Unbreak!
[06:31:24] Yippee, build fixed!
[06:31:24] Project browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce build #435: FIXED in 24 min: https://integration.wikimedia.org/ci/job/browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce/435/
[06:38:28] Yippee, build fixed!
[06:38:29] Project browsertests-UniversalLanguageSelector-commons.wikimedia.beta.wmflabs.org-linux-firefox-sauce build #429: FIXED in 14 min: https://integration.wikimedia.org/ci/job/browsertests-UniversalLanguageSelector-commons.wikimedia.beta.wmflabs.org-linux-firefox-sauce/429/
[07:08:23] Continuous-Integration: Puppet is causing changed/added files in 'slave-scripts' git::clone on integration slaves in labs to become root read-only - https://phabricator.wikimedia.org/T87843#1000533 (BBlack) I'm not entirely confident in the above patch given how much reuse git::clone sees all over puppet, but...
[07:23:03] PROBLEM - Puppet failure on deployment-elastic06 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[08:18:48] Phabricator: Fatal error (30 seconds timeout) upon certain maniphest search in a component - https://phabricator.wikimedia.org/T87739#1000583 (FriedhelmW) Confirming the problem (not logged in).
[08:30:46] PROBLEM - Puppet failure on deployment-elastic07 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[08:40:09] PROBLEM - Puppet failure on deployment-elastic05 is CRITICAL: CRITICAL: 77.78% of data above the critical threshold [0.0]
[08:41:33] PROBLEM - Puppet failure on deployment-mediawiki03 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[08:47:56] Beta-Cluster: Puppet failures on deployment-mx - https://phabricator.wikimedia.org/T87848#1000610 (yuvipanda) NEW
[08:48:36] Beta-Cluster: deployment-mx does not have salt master set to deployment-salt - https://phabricator.wikimedia.org/T87849#1000617 (yuvipanda) NEW
[08:58:05] operations, Beta-Cluster: Minimize differences between beta and production (Tracking) - https://phabricator.wikimedia.org/T87220#1000629 (yuvipanda)
[08:58:08] Beta-Cluster: deployment-mx is its own puppetmaster - https://phabricator.wikimedia.org/T86575#1000623 (yuvipanda) Open>Resolved a:yuvipanda Yay, done now. I've filed T87849 and T87848 as things to be fixed by someone who is *not* me :)
[09:06:29] RECOVERY - Puppet failure on deployment-mediawiki03 is OK: OK: Less than 1.00% above the threshold [0.0]
[09:19:15] operations, Beta-Cluster: Renumber apache user/group to uid=48 - https://phabricator.wikimedia.org/T78076#1000652 (yuvipanda) As an update, I think @faidon and @joe are working on moving our apache user to just use www-data instead (in prod).
[09:45:36] Quality-Assurance: use rspec-expectations expect syntax instead of should syntax - https://phabricator.wikimedia.org/T68369#1000683 (Physikerwelt) I was not able to test this for the Math Extension: See my comment in gerrit: " I could not test the effects of this change. I tried to run the tests on labs-vagra...
[14:32:38] MediaWiki-extensions-MathSearch, Beta-Cluster: Broken submodule - https://phabricator.wikimedia.org/T87643#1000865 (Physikerwelt) Open>Resolved
[14:33:10] MediaWiki-extensions-MathSearch, Beta-Cluster: Broken submodule - https://phabricator.wikimedia.org/T87643#996020 (Physikerwelt) Resolved as duplicate of T87820
[14:36:14] Hi, is there someone around who wants to discuss https://phabricator.wikimedia.org/T87820
[14:40:45] MediaWiki-extensions-MathSearch, Release-Engineering: beta-code-update-eqiad has been failing since 24 January - https://phabricator.wikimedia.org/T87820#1000870 (Physikerwelt) https://integration.wikimedia.org/ci/view/Beta/job/beta-code-update-eqiad/ looks happy. Can we close this bug?
[14:45:52] MediaWiki-extensions-MathSearch, Release-Engineering: beta-code-update-eqiad has been failing since 24 January - https://phabricator.wikimedia.org/T87820#1000878 (Physikerwelt) ..mh.. this is not reflected on github.com https://github.com/wikimedia/mediawiki-extensions-MathSearch/commits/master and there was...
[16:17:04] MediaWiki-extensions-CodeEditor, Continuous-Integration: Jenkins jobs for CodeEditor need WikiEditor dependency installed - https://phabricator.wikimedia.org/T87838#1000948 (Umherirrender)
[17:14:00] greg-g: https://www.mediawiki.org/wiki/Talk:MediaWiki_1.24#Release_schedule
[17:14:34] I guess there's gonna have to be a security release soon anyway
[17:14:59] https://phabricator.wikimedia.org/T87275
[17:24:16] legoktm: thanks
[17:28:36] legoktm: see also: https://www.mediawiki.org/wiki/User_talk:Greg_%28WMF%29#Future_releases_of_MediaWiki
[17:31:53] greg-g: https://www.mediawiki.org/wiki/Release_checklist is probably the place to update?
[17:32:24] Beta-Cluster: Puppet failures on deployment-mx - https://phabricator.wikimedia.org/T87848#1001029 (greg) p:Triage>Normal
[17:32:26] Beta-Cluster: deployment-mx does not have salt master set to deployment-salt - https://phabricator.wikimedia.org/T87849#1001032 (greg) p:Triage>Normal
[17:32:55] Wikidata, Beta-Cluster: m.wikidata.beta.wmflabs.org/ redirects to a host that does not exist - https://phabricator.wikimedia.org/T87440#1001035 (greg) p:Triage>Normal
[17:36:04] Beta-Cluster: Migrate beta cluster jobrunner to HAT - https://phabricator.wikimedia.org/T87214#1001046 (greg) p:Triage>Normal
[17:36:30] VisualEditor, VisualEditor-MediaWiki, Beta-Cluster: On Beta Cluster, switching from VisualEditor to edit source mode intermittently loads the wikitext editor without any CSS - https://phabricator.wikimedia.org/T86624#1001047 (greg)
[17:37:20] Beta-Cluster: Unify labs and prod roles for role::deployment::deployment_servers - https://phabricator.wikimedia.org/T86885#1001050 (greg) p:Triage>Normal
[17:37:54] Beta-Cluster: Shinken warnings about free space on beta cluster Varnish instances - https://phabricator.wikimedia.org/T76417#1001053 (greg) p:Triage>Normal
[17:38:17] legoktm: /me sighs, yeah
[17:39:43] Phabricator: Fatal error (30 seconds timeout) upon certain maniphest search in a component when not logged in - https://phabricator.wikimedia.org/T87739#1001056 (Aklapper)
[17:40:07] MediaWiki-File-management, Multimedia, Beta-Cluster: Thumbnail 404s get cached - https://phabricator.wikimedia.org/T69056#1001060 (greg)
[17:47:59] Phabricator: request for deletion: 'shell' project - https://phabricator.wikimedia.org/T87623#1001083 (Aklapper) CC'ing the current Members/Watchers of the "shell" tag on this ticket. +1 for killing the project: most shell usecases are covered by the Wikimedia-Site-requests project anyway IMO, or by Wikimedi...
[18:19:44] Echo, MediaWiki-General-or-Unknown, VisualEditor, Release-Engineering: Get JQuery error "a is undefined" running browser tests locally for Firefox - https://phabricator.wikimedia.org/T87446#1001134 (Jdforrester-WMF)
[18:27:36] operations, Beta-Cluster: Renumber apache user/group to uid=48 - https://phabricator.wikimedia.org/T78076#1001166 (faidon) apache right now has no uid (so all kinds of uid across the fleet) and a gid 48, which is < 100 and thus, wrong (that space is reserved for packages). Rather than renumbering both uid/gi...
[18:44:41] has zuul fallen over?
[18:47:08] legoktm: just overwhelmed
[18:47:26] legoktm:
[18:47:26] https://graphite.wikimedia.org/render/?from=-24hours&height=180&width=400&areaMode=first&colorList=blue,green,red&target=alias(summarize(zuul.pipeline.test.current_changes.value,%2715min%27,%27min%27),%2715min%20minimum%27)&target=alias(summarize(zuul.pipeline.test.current_changes.value,%271min%27),%271min%20avg%27)&target=alias(summarize(zuul.pipeline.test.
[18:47:26] current_changes.value,%271h%27,%27max%27),%27hourly%20max%27)&title=Zuul%20test%20pipeline&_=1422557223824
[18:47:31] ...fail
[18:47:53] legoktm: http://bit.ly/1z8W3M8
[18:48:16] No data :(
[18:48:23] but yeah, https://integration.wikimedia.org/zuul/ shows a lot
[18:48:38] yeah, the link got clipped :-p
[18:48:46] Project browsertests-ZeroBanner-en.m.wikipedia.org-linux-phantomjs build #416: FAILURE in 3 min 28 sec: https://integration.wikimedia.org/ci/job/browsertests-ZeroBanner-en.m.wikipedia.org-linux-phantomjs/416/
[18:49:43] Echo, MediaWiki-General-or-Unknown, Release-Engineering: Get JQuery error "a is undefined" running Echo browser tests locally for Firefox - https://phabricator.wikimedia.org/T87873#1001244 (EBernhardson) NEW
[18:58:27] Echo, MediaWiki-General-or-Unknown, VisualEditor, Release-Engineering: Get JQuery error "a is undefined" running browser tests locally for Firefox - https://phabricator.wikimedia.org/T87446#1001283 (Krinkle) Please provide a reduced (browser) test case that triggers this error.
[19:01:29] VisualEditor, VisualEditor-MediaWiki, Beta-Cluster: On Beta Cluster, switching from VisualEditor to edit source mode intermittently loads the wikitext editor without any CSS - https://phabricator.wikimedia.org/T86624#1001300 (Ryasmeen) @Krenair: This is the issue we observed yesterday , VE was loading in a ve...
[19:04:25] Echo, MediaWiki-General-or-Unknown, VisualEditor, Release-Engineering: Get JQuery error "a is undefined" running browser tests locally for Firefox - https://phabricator.wikimedia.org/T87446#1001309 (Cmcmahon) Literally every browser test in every repo that I run locally produces this error in Firefox.
[19:18:35] PROBLEM - App Server Main HTTP Response on deployment-mediawiki03 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:20:01] PROBLEM - Puppet staleness on deployment-elastic06 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [43200.0]
[19:27:32] Phabricator: "Project creation log" cronjob email for Phab admins - https://phabricator.wikimedia.org/T85183#1001391 (Aklapper) Note to myself: Might want to extend to also cover "newly created workboards" to get a list of column names.
[19:32:18] Yippee, build fixed!
[19:32:19] Project browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-windows_8.1-internet_explorer-11-sauce build #282: FIXED in 33 min: https://integration.wikimedia.org/ci/job/browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-windows_8.1-internet_explorer-11-sauce/282/
[19:40:21] PROBLEM - Puppet failure on deployment-pdf01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[20:01:56] Project browsertests-Flow-en.wikipedia.beta.wmflabs.org-linux-chrome-monobook-sauce build #266: FAILURE in 34 min: https://integration.wikimedia.org/ci/job/browsertests-Flow-en.wikipedia.beta.wmflabs.org-linux-chrome-monobook-sauce/266/
[20:04:19] MediaWiki-extensions-CentralAuth, Beta-Cluster: Create Account fails on beta labs wrt spoofuser table - https://phabricator.wikimedia.org/T88008#1002183 (Cmcmahon)
[20:08:13] reporting all the bugs
[20:08:53] MediaWiki-extensions-CentralAuth, Beta-Cluster: Create Account fails on beta labs wrt spoofuser table - https://phabricator.wikimedia.org/T88008#1002191 (Legoktm) https://dev.mysql.com/doc/refman/5.0/en/myisam-repair.html suggests a simple "repair table" should work.
[20:08:55] Beta-Cluster: Create Account fails on beta labs wrt spoofuser table - https://phabricator.wikimedia.org/T88008#1002192 (Legoktm)
[20:11:57] Beta-Cluster: Create Account fails on beta cluster wrt spoofuser table - https://phabricator.wikimedia.org/T88008#1002205 (greg)
[20:12:08] chrismcmahon: "beta labs" -> "beta cluster" plzkthx
[20:19:12] Project browsertests-Flow-en.wikipedia.beta.wmflabs.org-linux-firefox-monobook-sauce build #282: FAILURE in 46 min: https://integration.wikimedia.org/ci/job/browsertests-Flow-en.wikipedia.beta.wmflabs.org-linux-firefox-monobook-sauce/282/
[20:21:19] Project browsertests-WikiLove-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce build #409: FAILURE in 3 min 18 sec: https://integration.wikimedia.org/ci/job/browsertests-WikiLove-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/409/
[20:21:29] Project browsertests-Flow-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce build #474: FAILURE in 35 min: https://integration.wikimedia.org/ci/job/browsertests-Flow-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce/474/
[20:22:20] (PS1) Krinkle: Example commit to test T87843 [integration/jenkins] - https://gerrit.wikimedia.org/r/187465
[20:22:23] (CR) Krinkle: [C: 2] Example commit to test T87843 [integration/jenkins] - https://gerrit.wikimedia.org/r/187465 (owner: Krinkle)
[20:23:05] (Merged) jenkins-bot: Example commit to test T87843 [integration/jenkins] - https://gerrit.wikimedia.org/r/187465 (owner: Krinkle)
[20:31:03] PROBLEM - Puppet staleness on deployment-elastic07 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [43200.0]
[20:32:46] Project browsertests-PdfHandler-test2.wikipedia.org-linux-firefox-sauce build #349: FAILURE in 1 min 25 sec: https://integration.wikimedia.org/ci/job/browsertests-PdfHandler-test2.wikipedia.org-linux-firefox-sauce/349/
[20:33:12] PROBLEM - App Server bits response on deployment-mediawiki03 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:34:54] PROBLEM - Puppet staleness on deployment-elastic05 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [43200.0]
[20:38:03] RECOVERY - App Server bits response on deployment-mediawiki03 is OK: HTTP OK: HTTP/1.1 200 OK - 3895 bytes in 0.002 second response time
[20:43:48] (PS1) Krinkle: Touch mw-set-env-qunit.sh to test T87843 [integration/jenkins] - https://gerrit.wikimedia.org/r/187477
[20:43:55] (CR) Krinkle: [C: 2] Touch mw-set-env-qunit.sh to test T87843 [integration/jenkins] - https://gerrit.wikimedia.org/r/187477 (owner: Krinkle)
[20:44:23] (Merged) jenkins-bot: Touch mw-set-env-qunit.sh to test T87843 [integration/jenkins] - https://gerrit.wikimedia.org/r/187477 (owner: Krinkle)
[20:48:12] (PS8) Krinkle: [WIP] Add qunit-karma macro [integration/config] - https://gerrit.wikimedia.org/r/186934
[20:49:14] PROBLEM - App Server bits response on deployment-mediawiki03 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:54:01] RECOVERY - App Server bits response on deployment-mediawiki03 is OK: HTTP OK: HTTP/1.1 200 OK - 3895 bytes in 0.003 second response time
[21:07:47] Project browsertests-VisualEditor-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce build #485: FAILURE in 23 min: https://integration.wikimedia.org/ci/job/browsertests-VisualEditor-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce/485/
[21:24:41] Continuous-Integration, Wikimedia-Fundraising-CiviCRM: CI for Civi: provision and run tests under Jenkins/Zuul - https://phabricator.wikimedia.org/T86103#961478 (atgo)
[21:40:05] operations, Beta-Cluster: Renumber apache user/group to uid=48 - https://phabricator.wikimedia.org/T78076#1002444 (bd808) >>! In T78076#1001166, @faidon wrote: > apache right now has no uid (so all kinds of uid across the fleet) and a gid 48, which is < 100 and thus, wrong (that space is reserved for packages...
[21:41:33] Continuous-Integration, MediaWiki-extensions-ZeroPortal, MediaWiki-extensions-ZeroBanner: Jenkins must not load ZeroPortal before ZeroBanner - https://phabricator.wikimedia.org/T88015#1002453 (Krinkle) NEW
[21:43:07] (PS1) Krinkle: mediawiki/conf.d: Remove currentExt-last override [integration/jenkins] - https://gerrit.wikimedia.org/r/187492 (https://phabricator.wikimedia.org/T88015)
[21:43:55] (CR) Krinkle: [C: -2] "Cannot merge due to T87843." [integration/jenkins] - https://gerrit.wikimedia.org/r/187492 (https://phabricator.wikimedia.org/T88015) (owner: Krinkle)
[21:57:48] (CR) Krinkle: [C: 2] mediawiki/conf.d: Remove currentExt-last override [integration/jenkins] - https://gerrit.wikimedia.org/r/187492 (https://phabricator.wikimedia.org/T88015) (owner: Krinkle)
[22:00:41] (Merged) jenkins-bot: mediawiki/conf.d: Remove currentExt-last override [integration/jenkins] - https://gerrit.wikimedia.org/r/187492 (https://phabricator.wikimedia.org/T88015) (owner: Krinkle)
[22:00:50] /nick chrismcmahon
[22:05:24] Continuous-Integration: Puppet is causing changed/added files in 'slave-scripts' git::clone on integration slaves in labs to become root read-only - https://phabricator.wikimedia.org/T87843#1002513 (Krinkle) Open>Resolved a:Krinkle https://gerrit.wikimedia.org/r/187331 has been cherry-picked to integ...
[22:09:05] thanks a ton, Krinkle
[22:11:49] I've used git-deploy countless times but I still don't understand why it says "0/2 minions completed fetch" 9 out of 10 times and we're supposed to just continue.
[22:12:04] From what I understand, it's not that it failed, it's just asynchronous and reports back too early
[22:12:06] It's very confusing
[22:12:45] bd808: I sometimes [r]etry until it's finished.
[22:13:18] It is indeed. This is a consequence of salt.
[22:13:34] salt is async and git-deploy polls for results
[22:13:51] but not long enough?
[22:14:20] And it doesn't poll salt directly, it polls redis which is the reporting endpoint for the custom salt commands
[22:14:41] well what it does is wait N seconds and then tell you what it knows from redis
[22:15:01] the [d]etailed report is basically a dump of the known state
[22:15:15] There is a patch pending somewhere to make this smarter
[22:15:39] * bd808 doesn't remember if it is upstream with trebuchet-trigger or against ops/puppet
[22:15:52] bd808: But it's safe to just continue from fetch to checkout before they've completed?
[22:16:08] well... "maybe"
[22:16:31] the salt messages will be sent but I don't think ordering is guaranteed
[22:16:58] so the fetch may be stuck/waiting and then the checkout may fail
[22:17:36] the checkout is for an explicit tag so it won't check out the wrong code but it may fire before the hash is fetched and then fail
[22:17:47] OK
[22:18:02] I need to go afk, back later (headache/migraine/whatever it is I have)
[22:18:04] I also noticed these new errors in syslog on integration slaves: https://gist.github.com/Krinkle/b219a1d4d3d5112f3be8/raw
[22:18:39] * Krinkle reports in -labs
[22:19:45] the way to find out after the fact what the state of the deploy was is `git deploy report --detailed sync`
[22:20:14] Ryan was working on a web interface to show that at one point but I don't think it was ever installed outside of this test env
[22:20:48] * bd808 realizes he just showed more knowledge of trebuchet guts than he should have ;)
[22:24:02] bd808: Whoops. :-)
[22:24:17] Continuous-Integration, MediaWiki-extensions-ZeroPortal, MediaWiki-extensions-ZeroBanner: Jenkins must not load ZeroPortal before ZeroBanner - https://phabricator.wikimedia.org/T88015#1002542 (Krinkle) Open>Resolved a:Krinkle Now passing: https://gerrit.wikimedia.org/r/#/c/187532/
[22:25:15] bd808: thx
[22:25:20] bd808: I'll poll [d]etailed in the future
[22:25:21] before continuing
[22:29:32] Continuous-Integration: Remove mediawiki-core-regression-* jobs from mediawiki-core#postmerge pipeline - https://phabricator.wikimedia.org/T88018#1002548 (Krinkle) NEW
[22:29:33] Continuous-Integration: Remove mediawiki-core-regression-* jobs from mediawiki-core#postmerge pipeline - https://phabricator.wikimedia.org/T88018#1002555 (Krinkle)
[22:30:03] Project browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce build #436: FAILURE in 24 min: https://integration.wikimedia.org/ci/job/browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce/436/
[22:37:16] Project browsertests-UniversalLanguageSelector-commons.wikimedia.beta.wmflabs.org-linux-firefox-sauce build #430: FAILURE in 19 min: https://integration.wikimedia.org/ci/job/browsertests-UniversalLanguageSelector-commons.wikimedia.beta.wmflabs.org-linux-firefox-sauce/430/
[22:43:53] Phabricator: Fatal error (30 seconds timeout) upon certain maniphest search in a component when not logged in - https://phabricator.wikimedia.org/T87739#1002604 (Qgil) I get the error while being logged in (Chrome).
[22:48:25] Phabricator: Fatal error (30 seconds timeout) upon certain maniphest search in a component when not logged in - https://phabricator.wikimedia.org/T87739#1002611 (chasemp) Best person to ask is @mmodell as he has had luck breaking down their queries before.
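Editor's note on the 22:11 to 22:17 exchange: the behaviour described there is "wait N seconds, then report whatever the reporting store knows", which is why "0/2 minions completed fetch" can appear even though nothing failed. Below is a minimal sketch of the alternative, polling until every expected minion has reported or a deadline passes. It is not Trebuchet's or git-deploy's code: `wait_for_minions` and the `read_state` callable are hypothetical stand-ins for whatever reads the reported state (redis, per the discussion), and no real salt or redis API is used.

```python
import time


def wait_for_minions(read_state, expected, timeout=30.0, interval=1.0,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll until all expected minions have reported, or the timeout expires.

    read_state: hypothetical callable returning the names of minions that
                have reported completion so far.
    Returns (done, pending) as two sets of minion names.
    """
    expected = set(expected)
    deadline = clock() + timeout
    while True:
        done = set(read_state()) & expected
        pending = expected - done
        # Stop as soon as everyone has reported, or when time runs out.
        if not pending or clock() >= deadline:
            return done, pending
        sleep(interval)
```

A caller would continue from fetch to checkout only when `pending` is empty, and otherwise print the pending names, which is closer to what the [d]etailed report gives by hand.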
[22:54:38] operations, Beta-Cluster: Renumber apache user/group to uid=48 - https://phabricator.wikimedia.org/T78076#1002618 (Dzahn) per https://wikitech.wikimedia.org/wiki/UID using www-data means we want uid/gid 33/33 (not 48/48 or random:48)
[22:55:22] operations, Beta-Cluster: Renumber apache user/group to uid=48 - https://phabricator.wikimedia.org/T78076#1002631 (Dzahn) I think apache user might have been in use since before we even used Ubuntu. Like on Fedora...
[23:00:36] Phabricator: request for deletion: 'shell' project - https://phabricator.wikimedia.org/T87623#1002652 (Dzahn) >>! In T87623#1001083, @Aklapper wrote: > Wikimedia-Site-requests project Btw, what is a Site-request. Let's question that as well :) Is that an alias for Team Platform Engineering?
[23:44:40] Phabricator, Community-Engagement: Experiment with a Volunteer team tag - https://phabricator.wikimedia.org/T87808#1002743 (Qgil) I think it is worth creating this tag and experimenting with it. "Volunteer-Team" might sound too much like a team. Just for the sake of productive bikeshedding, what about... Vol...
[23:49:04] Continuous-Integration, Release-Engineering: Jenkins: Implement hhvm based voting jobs for mediawiki and extensions (tracking) - https://phabricator.wikimedia.org/T75521#1002767 (Krinkle)
[23:50:30] Code-Review, Engineering-Community: How to prioritize code review of patches submitted by volunteers - https://phabricator.wikimedia.org/T78768#1002785 (Qgil) There is an interesting discussion about speeding up our code reviews in general: https://lists.wikimedia.org/pipermail/wikitech-l/2015-January/080409....