[00:31:29] (PS1) 20after4: WIP: fix up branch.py so that it's suitable for wmf/ production branches [tools/release] - https://gerrit.wikimedia.org/r/543248
[00:33:49] (CR) jerkins-bot: [V: -1] WIP: fix up branch.py so that it's suitable for wmf/ production branches [tools/release] - https://gerrit.wikimedia.org/r/543248 (owner: 20after4)
[00:33:51] (CR) 20after4: "see wmf-branch.sh for an example of how branch.py would be used to make a production wmf branch (at least in theory, currently untested!)" [tools/release] - https://gerrit.wikimedia.org/r/543248 (owner: 20after4)
[01:53:47] Phabricator, Release-Engineering-Team (Development services), Release-Engineering-Team-TODO, Operations, and 2 others: Prepare Phame to support heavy traffic for a Tech Department blog - https://phabricator.wikimedia.org/T226044 (Krinkle)
[08:42:37] PROBLEM - App Server Main HTTP Response on deployment-mediawiki-09 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:47:30] RECOVERY - App Server Main HTTP Response on deployment-mediawiki-09 is OK: HTTP OK: HTTP/1.1 200 OK - 49284 bytes in 0.548 second response time
[08:57:33] !log created deployment-memc08 in deployment-prep as memcached test host for Buster - T213089
[08:57:36] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[08:57:36] T213089: Upgrade memcached for Debian Stretch/Buster - https://phabricator.wikimedia.org/T213089
[08:57:45] please let me know if --^ is a problem
[09:02:30] PROBLEM - App Server Main HTTP Response on deployment-mediawiki-parsoid10 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:02:55] PROBLEM - Host integration-agent-docker-1008 is DOWN: CRITICAL - Host Unreachable (172.16.3.105)
[09:03:20] PROBLEM - Host deployment-sca01 is DOWN: CRITICAL - Host Unreachable (172.16.5.13)
[09:04:00] PROBLEM - Host deployment-db05 is DOWN: CRITICAL - Host Unreachable (172.16.5.170)
[09:05:25] PROBLEM - Host saucelabs-02 is DOWN: CRITICAL - Host Unreachable (172.16.3.20)
[09:05:47] PROBLEM - Host integration-agent-docker-1005 is DOWN: CRITICAL - Host Unreachable (172.16.7.210)
[09:08:45] PROBLEM - Host deployment-memc05 is DOWN: CRITICAL - Host Unreachable (172.16.5.17)
[09:09:12] PROBLEM - Host integration-agent-jessie-docker-1001 is DOWN: CRITICAL - Host Unreachable (172.16.5.149)
[09:10:36] RECOVERY - Host deployment-sca01 is UP: PING OK - Packet loss = 0%, RTA = 0.81 ms
[09:10:49] RECOVERY - Host integration-agent-docker-1005 is UP: PING OK - Packet loss = 0%, RTA = 0.79 ms
[09:11:31] RECOVERY - Host integration-agent-jessie-docker-1001 is UP: PING OK - Packet loss = 0%, RTA = 1.54 ms
[09:11:31] RECOVERY - Host integration-agent-docker-1008 is UP: PING OK - Packet loss = 0%, RTA = 1.32 ms
[09:12:22] RECOVERY - App Server Main HTTP Response on deployment-mediawiki-parsoid10 is OK: HTTP OK: HTTP/1.1 200 OK - 49268 bytes in 0.690 second response time
[09:12:36] RECOVERY - Host deployment-memc05 is UP: PING OK - Packet loss = 0%, RTA = 0.93 ms
[09:13:01] RECOVERY - Host deployment-db05 is UP: PING OK - Packet loss = 0%, RTA = 0.79 ms
[09:15:32] RECOVERY - Puppet staleness on deployment-sca01 is OK: OK: Less than 1.00% above the threshold [3600.0]
[09:15:46] RECOVERY - Puppet staleness on integration-agent-docker-1005 is OK: OK: Less than 1.00% above the threshold [3600.0]
[09:16:04] RECOVERY - Puppet staleness on saucelabs-02 is OK: OK: Less than 1.00% above the threshold [3600.0]
[09:17:24] RECOVERY - Puppet staleness on integration-agent-docker-1008 is OK: OK: Less than 1.00% above the threshold [3600.0]
[09:17:26] RECOVERY - Puppet staleness on integration-agent-jessie-docker-1001 is OK: OK: Less than 1.00% above the threshold [3600.0]
[09:17:46] RECOVERY - Puppet staleness on deployment-db05 is OK: OK: Less than 1.00% above the threshold [3600.0]
[09:20:59] PROBLEM - Host integration-slave-jessie-1004 is DOWN: CRITICAL - Host Unreachable (172.16.2.228)
[09:22:23] PROBLEM - Host deployment-imagescaler03 is DOWN: CRITICAL - Host Unreachable (172.16.7.231)
[09:25:56] RECOVERY - Host integration-slave-jessie-1004 is UP: PING OK - Packet loss = 0%, RTA = 1.02 ms
[09:27:16] RECOVERY - Host deployment-imagescaler03 is UP: PING OK - Packet loss = 0%, RTA = 1.44 ms
[09:31:19] PROBLEM - Host deployment-webperf12 is DOWN: CRITICAL - Host Unreachable (172.16.4.24)
[09:31:46] PROBLEM - Host deployment-parsoid09 is DOWN: CRITICAL - Host Unreachable (172.16.5.63)
[09:32:02] PROBLEM - Host integration-slave-jessie-1001 is DOWN: CRITICAL - Host Unreachable (172.16.0.86)
[09:32:17] PROBLEM - Host deployment-kafka-main-1 is DOWN: CRITICAL - Host Unreachable (172.16.4.116)
[09:32:36] PROBLEM - Host deployment-deploy02 is DOWN: CRITICAL - Host Unreachable (172.16.4.19)
[09:32:43] PROBLEM - Host deployment-sca02 is DOWN: CRITICAL - Host Unreachable (172.16.5.112)
[09:33:02] PROBLEM - Host deployment-mcs01 is DOWN: CRITICAL - Host Unreachable (172.16.5.64)
[09:33:17] PROBLEM - Host deployment-sca04 is DOWN: CRITICAL - Host Unreachable (172.16.5.54)
[09:33:30] PROBLEM - Host deployment-kafka-jumbo-2 is DOWN: CRITICAL - Host Unreachable (172.16.5.47)
[09:33:37] PROBLEM - Host deployment-mediawiki-09 is DOWN: CRITICAL - Host Unreachable (172.16.4.106)
[09:34:06] PROBLEM - Host deployment-memc04 is DOWN: CRITICAL - Host Unreachable (172.16.5.76)
[09:34:35] PROBLEM - Host deployment-maps04 is DOWN: CRITICAL - Host Unreachable (172.16.4.10)
[09:34:43] PROBLEM - Host integration-agent-docker-1011 is DOWN: CRITICAL - Host Unreachable (172.16.3.126)
[09:35:19] PROBLEM - Host deployment-deploy01 is DOWN: CRITICAL - Host Unreachable (172.16.4.18)
[09:35:55] PROBLEM - Host integration-agent-docker-1012 is DOWN: CRITICAL - Host Unreachable (172.16.3.130)
[09:36:52] PROBLEM - Host deployment-fluorine02 is DOWN: CRITICAL - Host Unreachable (172.16.5.71)
[09:38:04] RECOVERY - Host deployment-mcs01 is UP: PING OK - Packet loss = 0%, RTA = 0.87 ms
[09:38:16] RECOVERY - Host deployment-kafka-main-1 is UP: PING OK - Packet loss = 0%, RTA = 0.88 ms
[09:38:16] RECOVERY - Host deployment-sca04 is UP: PING OK - Packet loss = 0%, RTA = 1.11 ms
[09:38:19] RECOVERY - Host deployment-parsoid09 is UP: PING OK - Packet loss = 0%, RTA = 1.24 ms
[09:38:26] RECOVERY - Host deployment-deploy02 is UP: PING OK - Packet loss = 0%, RTA = 0.78 ms
[09:38:27] RECOVERY - Host deployment-sca02 is UP: PING OK - Packet loss = 0%, RTA = 1.01 ms
[09:38:30] RECOVERY - Host integration-agent-docker-1011 is UP: PING OK - Packet loss = 0%, RTA = 0.53 ms
[09:38:30] RECOVERY - Host deployment-mediawiki-09 is UP: PING OK - Packet loss = 0%, RTA = 0.83 ms
[09:38:32] RECOVERY - Host deployment-kafka-jumbo-2 is UP: PING OK - Packet loss = 0%, RTA = 0.70 ms
[09:38:32] RECOVERY - Host deployment-deploy01 is UP: PING OK - Packet loss = 0%, RTA = 0.83 ms
[09:38:35] RECOVERY - Host integration-slave-jessie-1001 is UP: PING OK - Packet loss = 0%, RTA = 0.70 ms
[09:38:35] RECOVERY - Host deployment-webperf12 is UP: PING OK - Packet loss = 0%, RTA = 1.21 ms
[09:39:06] RECOVERY - Host deployment-memc04 is UP: PING OK - Packet loss = 0%, RTA = 1.14 ms
[09:39:35] RECOVERY - Host deployment-maps04 is UP: PING OK - Packet loss = 0%, RTA = 0.53 ms
[09:39:36] RECOVERY - Host integration-agent-docker-1012 is UP: PING OK - Packet loss = 0%, RTA = 1.27 ms
[09:41:53] RECOVERY - Host deployment-fluorine02 is UP: PING OK - Packet loss = 0%, RTA = 1.16 ms
[09:42:32] RECOVERY - Puppet staleness on deployment-maps04 is OK: OK: Less than 1.00% above the threshold [3600.0]
[09:43:09] RECOVERY - Puppet staleness on deployment-kafka-main-1 is OK: OK: Less than 1.00% above the threshold [3600.0]
[09:43:10] RECOVERY - Puppet staleness on deployment-parsoid09 is OK: OK: Less than 1.00% above the threshold [3600.0]
[09:44:19] RECOVERY - Puppet staleness on deployment-sca02 is OK: OK: Less than 1.00% above the threshold [3600.0]
[09:45:29] RECOVERY - Puppet staleness on integration-agent-docker-1012 is OK: OK: Less than 1.00% above the threshold [3600.0]
[09:55:23] (PS1) Daimona Eaytoy: layout: [mediawiki/tools/phan/SecurityCheckPlugin] Move to PHP72+ [integration/config] - https://gerrit.wikimedia.org/r/543394
[10:33:20] (PS1) Pwirth: Activate tests for new repo BlueSpiceDistributionConnector [integration/config] - https://gerrit.wikimedia.org/r/543399
[10:58:26] Gerrit, Release-Engineering-Team (Development services), Release-Engineering-Team-TODO (201907): Investigate gerrit session expiration - https://phabricator.wikimedia.org/T222472 (LarsWirzenius) Data point: I tend to need to log in about once a day, based on memory. Have not kept a log, though. I use...
[11:00:58] Deployments, MediaWiki-SWAT-deployments: Figure out what to do with `fatalmonitor` script - https://phabricator.wikimedia.org/T234345 (Lucas_Werkmeister_WMDE) Now the script itself has been removed from `mwlog1001`, it seems.
[11:03:31] (CR) jerkins-bot: [V: -1] Activate tests for new repo BlueSpiceDistributionConnector [integration/config] - https://gerrit.wikimedia.org/r/543399 (owner: Pwirth)
[11:05:42] (PS2) Pwirth: Activate tests for new repo BlueSpiceDistributionConnector [integration/config] - https://gerrit.wikimedia.org/r/543399
[11:24:46] (PS8) Kosta Harlan: Add option for using Apache as server [integration/quibble] - https://gerrit.wikimedia.org/r/516729 (https://phabricator.wikimedia.org/T225218)
[11:25:30] (PS9) Kosta Harlan: Add option for using Apache as server [integration/quibble] - https://gerrit.wikimedia.org/r/516729 (https://phabricator.wikimedia.org/T225218)
[11:25:32] (CR) jerkins-bot: [V: -1] Add option for using Apache as server [integration/quibble] - https://gerrit.wikimedia.org/r/516729 (https://phabricator.wikimedia.org/T225218) (owner: Kosta Harlan)
[11:26:26] (CR) jerkins-bot: [V: -1] Add option for using Apache as server [integration/quibble] - https://gerrit.wikimedia.org/r/516729 (https://phabricator.wikimedia.org/T225218) (owner: Kosta Harlan)
[11:28:55] (PS10) Kosta Harlan: Add option for using Apache as server [integration/quibble] - https://gerrit.wikimedia.org/r/516729 (https://phabricator.wikimedia.org/T225218)
[11:29:48] (CR) jerkins-bot: [V: -1] Add option for using Apache as server [integration/quibble] - https://gerrit.wikimedia.org/r/516729 (https://phabricator.wikimedia.org/T225218) (owner: Kosta Harlan)
[11:32:19] (PS11) Kosta Harlan: Add option for using Apache as server [integration/quibble] - https://gerrit.wikimedia.org/r/516729 (https://phabricator.wikimedia.org/T225218)
[11:49:12] kostajh: but. I thought that support for Apache had been merged and deployed already!? :D
[11:49:17] I am outdated :-\
[12:30:13] hashar: no, I shelved it but Krinkle suggested I revive it, so here it is again
[12:31:53] +1
[12:32:19] kostajh: iirc the benchmark you have done definitely proved that apache was wayy faster than the php builtin server
[12:32:28] for good reason, php -S is serially processing requests hehe
[12:32:46] hashar: it was faster on my local machine but not in a CI-like environment (I provisioned a DigitalOcean droplet for some tests)
[12:32:57] ohhh
[12:33:00] strange :-\
[12:33:27] See also https://travis-ci.org/kostajh/quibble/builds/549546283?utm_source=github_status
[12:34:46] kostajh: that is surprising
[12:35:14] or maybe it is less prominent now that wdio runs tests in parallel
[12:35:43] or maybe it does not run them in parallel
[12:36:15] anyway, given you wrote the patch, I guess there is not much work to complete it and have the feature added
[12:36:36] (PS12) Kosta Harlan: Add option for using Apache as server [integration/quibble] - https://gerrit.wikimedia.org/r/516729 (https://phabricator.wikimedia.org/T225218)
[12:36:55] kostajh: will you be there on friday? Could use some of your time to talk about the tech conf sessions
[12:37:23] (CR) jerkins-bot: [V: -1] Add option for using Apache as server [integration/quibble] - https://gerrit.wikimedia.org/r/516729 (https://phabricator.wikimedia.org/T225218) (owner: Kosta Harlan)
[12:37:27] but I have to dig a bit more into them tomorrow and get more familiar with the proposed topic + figure out questions I could have for you :)
[12:37:29] hashar: yes, I'll be around, and hopefully with a proper internet connection too
[12:37:34] good
[12:37:55] hashar: have you seen anywhere what exactly is involved in being a "lead/co-lead" for techconf session?
[12:38:10] well
[12:38:11] no
[12:38:22] but in short I guess it is mostly facilitating during the session
[12:38:35] and prepare the actual session. So reach out to people interested and see what they want to talk about
[12:38:45] maybe even talk about stuff before the session happens
[12:38:55] gather materials for people to read
[12:39:24] and come out with an organization for the session : unconference versus presentation+ questions/answers versus whatever
[12:39:43] I am not too worried, we have mastered the topic so we should be at ease
[12:40:22] I guess the lead / co-lead role is thus to ensure there is a good dynamic, that the session is on track and does not derail in an off topic discussion or rant
[12:40:36] and that in the end there is some positive outcome, eventually even a plan of action for the future
[12:41:39] (PS13) Kosta Harlan: Add option for using Apache as server [integration/quibble] - https://gerrit.wikimedia.org/r/516729 (https://phabricator.wikimedia.org/T225218)
[12:42:02] k
[12:43:13] I am not too worried about the session in itself
[12:43:20] but would need to prepare some stuff before :]
[12:44:07] sure
[12:46:12] anyway, gotta write slides and flip some more paper, so I am disconnecting
[12:46:26] plan for tomorrow: review tech conf stuff and write thoughts about it
[12:46:33] and hopefully review some of the pending Quibble patches!
[13:09:17] one question about memcached in deployment prep
[13:09:38] today I created deployment-memc08 with buster
[13:09:53] so I was looking for the hiera config to add it to mcrouter
[13:10:34] on deployment-mediawiki-07 I can see only two deployment-memc listed, 05 and 04
[13:10:45] and they are listed in operations/puppet
[13:11:10] but we also have deployment-memc06 and 07
[13:11:16] do we use them elsewhere?
[13:11:20] Don't we use Horizon for beta things?
[13:11:43] we have some config in operations/puppet for deployment-prep
[13:33:37] Beta-Cluster-Infrastructure: Global developer for DannyS712 on beta cluster - https://phabricator.wikimedia.org/T235650 (Daimona)
[13:37:42] Continuous-Integration-Config, Wikidata, Wikidata Query UI: Update wikidata-query-gui-build job versions (from Jessie, Node v6, npm v3) - https://phabricator.wikimedia.org/T235651 (Lucas_Werkmeister_WMDE)
[13:39:41] Beta-Cluster-Infrastructure: Global developer for DannyS712 on beta cluster - https://phabricator.wikimedia.org/T235650 (MarcoAurelio) Global permissions are granted through [[ https://deployment.wikimedia.beta.wmflabs.org/wiki/Special:GlobalUserRights | Special:GlobalUserRights ]]. But as noted on [[ https:...
[14:18:21] Release-Engineering-Team, Scap, Operations, Wikimedia-General-or-Unknown, and 2 others: "Currently active MediaWiki versions:" broken on noc/conf - https://phabricator.wikimedia.org/T235338 (thcipriani) >>! In T235338#5569953, @Reedy wrote: > Current implementation: > > `lang=html > Currently...
[14:52:09] Release-Engineering-Team (Deployment services), Security-Team, Wikimedia-Extension-setup, Wikimedia-Site-requests, Wikimedia-extension-review-queue: Deploy WebAuthn to Wikimedia Wikis - https://phabricator.wikimedia.org/T227242 (Reedy)
[15:26:29] Release-Engineering-Team, Scap, Operations, Wikimedia-General-or-Unknown, and 2 others: "Currently active MediaWiki versions:" broken on noc/conf - https://phabricator.wikimedia.org/T235338 (Krinkle) I thought maybe it was user-permission or working-directory related. But, looks like not.. As www...
[15:26:56] (CR) Thcipriani: "Some driveby comments inline" (2 comments) [tools/release] - https://gerrit.wikimedia.org/r/543248 (owner: 20after4)
[15:40:01] Phabricator (Upstream), Upstream: Phabricator fonts look broken on systems with JoyPixels (formerly EmojiOne) installed - https://phabricator.wikimedia.org/T235339 (epriestley) (See for the upstream position on...
[16:27:36] beta doesn’t seem to update anymore, is that a known problem?
[16:27:46] apparently https://integration.wikimedia.org/ci/view/Beta/job/beta-scap-eqiad/ doesn’t find a suitable executor
[16:28:01] since yesterday evening or so
[16:40:19] Beta-Cluster-Infrastructure: Beta cluster doesn’t update since ca. 2019-10-15 21:00 UTC - https://phabricator.wikimedia.org/T235674 (Lucas_Werkmeister_WMDE)
[16:40:22] reported as ^
[16:45:54] Lucas_WMDE: ahh I noticed that and fixed it by killing the queued build
[16:46:37] Beta-Cluster-Infrastructure: Beta cluster doesn’t update since ca. 2019-10-15 21:00 UTC - https://phabricator.wikimedia.org/T235674 (hashar) Open→Resolved a:hashar Fixed it on the spot. I have canceled the queued builds in Jenkins, which eventually unblocked whatever deadlock occurred.
[16:46:50] Lucas_WMDE: I should probably just convert that job to poll the scm instead
[16:46:56] ie just pull from time to time
[16:47:04] instead of on every single merged change
[16:48:52] ok
[17:08:21] hashar: beta-scap-eqiad et al. ain't running since yesterday apparently
[17:08:45] https://integration.wikimedia.org/ci/view/Beta/
[17:08:55] can I scap manually while that's fixed?
[17:09:57] Stalled on executor.
[17:10:01] Don't scap manually.
[17:10:07] I'll fix when I'm out of this meeting.
[17:10:38] Alright!
[17:10:38] Release-Engineering-Team, Operations, Wikimedia Design Style Guide: Automatic pickup of Gerrit clone master doesn't happen - https://phabricator.wikimedia.org/T235677 (Volker_E)
[17:10:40] hashar said he already fixed it…
[17:11:01] Still looks stalled. :-(
[17:11:03] Lucas_WMDE: I should probably scroll up and down more often
[17:11:14] :P
[17:11:15] but indeed it is not executing
[17:11:25] Release-Engineering-Team (Unit & Int & System Tooling), Release-Engineering-Team-TODO (201910), International-Developer-Events, Wikimedia-Technical-Conference-2019, and 2 others: Wikimedia Technical Conference 2019 Session: System level testing: patterns an... - https://phabricator.wikimedia.org/T234635
[17:11:47] Gerrit, Release-Engineering-Team, Operations, Wikimedia Design Style Guide: Automatic pickup of Gerrit clone master doesn't happen - https://phabricator.wikimedia.org/T235677 (Dzahn)
[17:17:18] Release-Engineering-Team (CI & Testing services), Release-Engineering-Team-TODO, Release Pipeline: contint1001 has lot of dangling Docker images - https://phabricator.wikimedia.org/T235680 (hashar)
[17:18:44] (CR) 20after4: WIP: fix up branch.py so that it's suitable for wmf/ production branches (1 comment) [tools/release] - https://gerrit.wikimedia.org/r/543248 (owner: 20after4)
[17:19:27] Release-Engineering-Team (CI & Testing services), Release-Engineering-Team-TODO, Release Pipeline: contint1001 has lot of dangling Docker images - https://phabricator.wikimedia.org/T235680 (hashar) p:Triage→Normal
[17:25:21] failure message reads "xxx do not have the BetaClusterBastion tag"
[17:26:40] Yeah. Still in this meeting.
[17:26:44] perhaps T235674 should be reopened until the issue is actually fixed?
[17:26:45] T235674: Beta cluster doesn’t update since ca. 2019-10-15 21:00 UTC - https://phabricator.wikimedia.org/T235674
[17:27:21] Beta-Cluster-Infrastructure, Release-Engineering-Team-TODO (201910): Beta cluster doesn’t update since ca. 2019-10-15 21:00 UTC - https://phabricator.wikimedia.org/T235674 (Jdforrester-WMF) Resolved→Open p:Triage→High Not fixed.
[17:27:27] Done.
[17:27:31] thanks
[17:42:32] OK, back.
[17:43:05] https://integration.wikimedia.org/ci/label/BetaClusterBastion/ has nodes and projects, so it's not a mis-config.
[17:43:23] However https://integration.wikimedia.org/ci/computer/deployment-deploy01/ doesn't have anything assigned to it?
[17:44:30] !log Marking deployment-deplog01 offline temporarily for T235674
[17:44:32] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[17:44:33] T235674: Beta cluster doesn’t update since ca. 2019-10-15 21:00 UTC - https://phabricator.wikimedia.org/T235674
[17:45:11] Project-Admins: Create Project: Watchlist-Expiry - https://phabricator.wikimedia.org/T235686 (ifried)
[17:46:55] OK, it processed https://integration.wikimedia.org/ci/job/beta-mediawiki-config-update-eqiad/16075/
[17:47:53] PROBLEM - Parsoid on deployment-parsoid09 is CRITICAL: connect to address 172.16.5.63 and port 8000: Connection refused PROBLEM - Parsoid on deployment-mediawiki-parsoid10 is CRITICAL: connect to address 172.16.0.141 and port 8000: Connection refused
[17:48:02] Killed a few more hung jobs and it seems to be processing https://integration.wikimedia.org/ci/job/beta-mediawiki-config-update-eqiad/16075/ now
[17:48:06] * James_F crosses fingers.
[17:49:41] Project beta-scap-eqiad build #271468: FAILURE in 1 min 59 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/271468/
[17:49:59] Release-Engineering-Team, Wikimedia Design Style Guide, Patch-For-Review, User-Ladsgroup: Use `git lfs` for large binary files of Design Style Guide - https://phabricator.wikimedia.org/T235013 (Dzahn)
[17:57:35] Project beta-scap-eqiad build #271469: STILL FAILING in 2 min 7 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/271469/
[17:57:50] Oh dear.
[17:59:01] Could be because there's so much churn, possibly.
[17:59:25] Hmm, no, it's getting refused when it's trying to scap out files.
[17:59:45] `17:49:35 sudo -u mwdeploy -n -- /usr/bin/scap cdb-rebuild on deployment-mediawiki-09.deployment-prep.eqiad.wmflabs returned [255]: Permission denied (publickey).`
[18:01:43] geez
[18:01:47] publickey?
[18:05:05] https://integration.wikimedia.org/ci/view/Beta/job/beta-code-update-eqiad/268080/console was quite an update
[18:05:16] Ha, yes.
[18:05:31] But maybe the keyholder/whatever stuff isn't set right in Beta Cluster?
[18:06:25] Project beta-scap-eqiad build #271470: STILL FAILING in 1 min 56 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/271470/
[18:06:41] puppet thing?
[18:09:49] James_F: puppet said it was 'stopped' for some reason
[18:10:04] Notice: /Stage[main]/Confd/Base::Service_unit[confd]/Service[confd]/ensure: ensure changed 'stopped' to 'running'
[18:10:27] Hmm. That'd not help.
[18:10:49] * James_F wonders if Krenair is around.
[18:10:55] that was from puppet agent -tv
[18:11:01] let's see if that helps
[18:11:27] Beta-Cluster-Infrastructure, Release-Engineering-Team-TODO (201910): Beta cluster doesn’t update since ca. 2019-10-15 21:00 UTC - https://phabricator.wikimedia.org/T235674 (Jdforrester-WMF) Jobs populating correctly, but failing with: `17:49:35 sudo -u mwdeploy -n -- /usr/bin/scap cdb-rebuild on deploym...
[18:15:17] Nope, still failing
[18:15:35] Project beta-scap-eqiad build #271471: STILL FAILING in 1 min 17 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/271471/
[18:16:26] hauskater: you know what's odd.. if i follow that FAILING link above and look at Console Output.. it says SUCCESS at the end
[18:16:45] https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/268082/console
[18:16:55] oh, not the same job ID i guess.. ehm
[18:17:11] beta-scap-eqiad vs beta-code-update-eqiad
[18:17:13] https://github.com/wikimedia/puppet/blob/8897bcbaae5d96f0c0bf2db43c93e5d717b1cd83/modules/mediawiki/manifests/users.pp#L29 <-- ?
[18:17:42] hauskater: do you see a puppet error?
[18:17:42] mutante: it says it's a public key failure but I'm not sure if you've updated mwdeploy user keys?
[18:17:57] mutante: Not errors, some debug messages
[18:18:00] let me show you
[18:18:53] (CR) Jforrester: [C: +2] layout: [mediawiki/tools/phan/SecurityCheckPlugin] Move to PHP72+ [integration/config] - https://gerrit.wikimedia.org/r/543394 (owner: Daimona Eaytoy)
[18:18:58] mutante: https://phabricator.wikimedia.org/P9365
[18:19:04] but I don't think they're related
[18:19:13] (CR) Jforrester: [C: +2] Activate tests for new repo BlueSpiceDistributionConnector [integration/config] - https://gerrit.wikimedia.org/r/543399 (owner: Pwirth)
[18:20:32] hauskater: yea, that looks like a successful puppet run. Does it start confd on every run? repeat it
[18:20:36] (Merged) jenkins-bot: layout: [mediawiki/tools/phan/SecurityCheckPlugin] Move to PHP72+ [integration/config] - https://gerrit.wikimedia.org/r/543394 (owner: Daimona Eaytoy)
[18:20:52] mutante: I'll switch back to deploy01
[18:21:04] (Merged) jenkins-bot: Activate tests for new repo BlueSpiceDistributionConnector [integration/config] - https://gerrit.wikimedia.org/r/543399 (owner: Pwirth)
[18:21:27] hello
[18:21:33] James_F, what's up?
[18:21:52] Krenair: Puppet issues in Beta Cluster, but hauskater seems to be dealing?
[18:22:13] James_F: well, not really. Krenair is the expert
[18:22:21] mutante: same messages as previously
[18:22:27] permission denied errors everywhere? better check keyholder
[18:22:44] hauskater: you can try "keyholder status" https://wikitech.wikimedia.org/wiki/Keyholder
[18:22:56] Paste updated
[18:22:59] it show a list of fingerprints
[18:23:01] should
[18:23:18] -bash: keyholder: command not found
[18:23:26] jenkins-bot@deployment-deploy01:~$ SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh-add -L
[18:23:26] The agent has no identities.
[18:23:35] did nobody arm it?
[18:23:49] * Krenair does
[18:24:16] arm and restart?
[18:24:41] for reference: sudo keyholder arm (and then it asks for a password for the key(s)) .. that's how it would be in prod
[18:24:44] krenair@deployment-deploy01:~$ sudo keyholder arm
[18:24:44] ...
[18:24:51] /etc/keyholder.d/mwdeploy is not an acceptable key. Is it an RSA or ED25519 key with passphrase?
[18:24:55] I've not seen Icinga complain about it as stated in Wikitech
[18:25:14] It looks like an RSA private key... hmm
[18:25:20] sudo did it mutante
[18:25:35] hauskater: i think if that is monitored that would be Shinken
[18:25:36] Project beta-scap-eqiad build #271472: STILL FAILING in 1 min 15 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/271472/
[18:26:10] krenair@deployment-deploy01:~$ sudo file /etc/keyholder.d/mwdeploy
[18:26:10] /etc/keyholder.d/mwdeploy: PEM RSA private key
[18:26:17] Krenair: maybe a new key that does not have a passphrase?
[18:26:27] hmmm
[18:26:35] yes
[18:26:37] looks like it
[18:26:41] but
[18:26:42] When I did sudo keyholder status only keys for analytics_deploy and dumpsdeploy appear listed
[18:26:45] why?
[18:26:56] !log Zuul: Activate tests for new repo BlueSpiceDistributionConnector
[18:26:57] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[18:27:05] !log Zuul: [mediawiki/tools/phan/SecurityCheckPlugin] Move to PHP72+
[18:27:07] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[18:27:17] /etc/keyholder.d/mwdeploy on deployment-deploy01 does not match files/ssh/tin/mwdeploy_rsa from the puppetmaster
[18:27:58] puppet looks normal
[18:28:21] looks like it's picked up modules/secret/secrets/keyholder/mwdeploy instead
[18:28:21] maybe somebody wanted to remove the "tin" remnants and replace with deploy1001
[18:29:53] https://phabricator.wikimedia.org/T235491#5577775 <-- related ?
[18:29:54] but git log mwdeploy only shows one entry
[18:30:00] when it was added
[18:30:15] talking about secret/secrets/keyholder in labs/private ..right
[18:31:02] hauskater, unlikely
[18:31:39] I tried 'cp files/ssh/tin/mwdeploy_rsa modules/secret/secrets/keyholder/mwdeploy' and while I could load that into keyholder, it was not accepted by the remote host
[18:32:03] Is the problem on the keyholder server or the remotes?
[18:32:06] (or both?)
[18:33:40] who knows?
[18:34:16] okay after checking things I am reasonably confident that modules/secret/secrets/keyholder/mwdeploy contains the key that remote hosts are expecting
[18:35:11] sounds like the issue is it has no password and does not like that.
[18:35:15] oh wait
[18:35:19] that's not a private file
[18:35:22] but we dont see a change so far
[18:35:34] Project beta-scap-eqiad build #271473: STILL FAILING in 1 min 15 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/271473/
[18:36:03] are the remote hosts getting the wrong key and we're trying to load the wrong thing into keyholder?
[18:39:16] Gerrit, Release-Engineering-Team, Operations, Wikimedia Design Style Guide: Automatic pickup of Gerrit clone master doesn't happen - https://phabricator.wikimedia.org/T235677 (Dzahn) The changes made in T235013 added a requirement to have git-lfs installed and use a different command to pull data...
[18:43:05] Krenair: this seems interesting https://gerrit.wikimedia.org/r/c/operations/puppet/+/522008
[18:43:23] Project-Admins: Create Project: Watchlist-Expiry - https://phabricator.wikimedia.org/T235686 (MBinder_WMF) Thanks for making a task @ifried . FWIW, this does sound just like #expiring-watchlist-items and this description (even mentioning the Community Tech Wishlist from 2015 in T124752 : https://phabricator....
[18:43:27] see that commit message there
[18:43:38] maybe it needs the override in Hiera
[18:43:49] no
[18:43:56] ... re-introduces a hiera setting that allows this to be changed on
[18:43:56] the key that should be in use has a password
[18:43:58] a per-deploy basis. Allowing unencrypted keys.. ?
[18:44:05] oh, ok
[18:44:30] do we need to add all the keys at /etc/keyholder.d in the keyholder?
[18:44:36] ie, eventlogging
[18:44:39] i dont see any newer changes in Gerrit that mention keyholder
[18:45:18] no
[18:45:40] Project beta-scap-eqiad build #271474: STILL FAILING in 1 min 12 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/271474/
[18:48:25] ugh maybe I should just re-key this
[18:52:41] Project-Admins: Create Project: Watchlist-Expiry - https://phabricator.wikimedia.org/T235686 (ifried) @MBinder_WMF I was thinking of creating a new component because the Expiring-Watchlist-Items component is attached to the work done by WMDE. In the past, the WMDE Community Tech team had taken on a [[ https:...
[18:54:21] now the keyholder agent is refusing to sign things... great...
[18:54:55] key works though
[18:55:09] was keyholder.d/mwdeploy fwiw?
[18:55:15] ok
[18:55:17] for documentation
[18:55:39] I'm not at the stage of dealing with docs yet
[19:02:40] Project beta-scap-eqiad build #271475: STILL FAILING in 7 min 21 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/271475/
[19:06:49] James_F, so if I'm right I think it should be fixed everywhere but snapshot01?
[19:08:39] Project beta-scap-eqiad build #271476: STILL FAILING in 4 min 15 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/271476/
[19:14:35] PROBLEM - App Server Main HTTP Response on deployment-mediawiki-09 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:16:27] Krenair: The job complained about snapshot01, mwmaint01 and deploy02.
[19:16:40] Was that just race condition with your fixing things, or are those broken too?
[19:16:41] so close
[19:17:11] oh is that just about the opcache update?
[19:17:16] it always does that
[19:17:25] Project beta-scap-eqiad build #271477: STILL FAILING in 8 min 43 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/271477/
[19:17:38] snapshot01 is the bit that matters
[19:17:50] Right.
[19:17:50] just gotta do the same fix on deployment-dumps-puppetmaster... :)
[19:17:55] Fun.
[19:17:56] or at least the public part
[19:20:30] ok
[19:25:02] Yippee, build fixed!
[19:25:03] Project beta-scap-eqiad build #271478: FIXED in 6 min 15 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/271478/
[19:25:07] Project-Admins: Create new project for WatchTranslations tool - https://phabricator.wikimedia.org/T235700 (Urbanecm)
[19:27:06] James_F, ^
[19:32:11] Success.
[19:32:40] Beta-Cluster-Infrastructure, Release-Engineering-Team-TODO (201910): Beta cluster doesn’t update since ca. 2019-10-15 21:00 UTC - https://phabricator.wikimedia.org/T235674 (Jdforrester-WMF) Open→Resolved a:hashar→Krenair Fixed by @Krenair re-doing the keyholder configuration.
[19:39:16] Krenair: oh so that was more complicated than just CI being broken. Thank you Krenair!
[19:39:25] yeah
[19:39:27] RECOVERY - App Server Main HTTP Response on deployment-mediawiki-09 is OK: HTTP OK: HTTP/1.1 200 OK - 49270 bytes in 1.034 second response time
[19:39:35] I'm still not sure how this happened
[19:40:38] I blame cosmic rays.
[19:41:34] Deployments, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO (201910), Performance-Team (Radar): Reduce static asset time on disk from five trains' worth to two - https://phabricator.wikimedia.org/T140921 (Jdforrester-WMF) Open→Resolved
[19:41:53] the lack of recent entries in those hosts' puppet logs suggests they may have been accepting this bad key for a while, but then the mystery becomes how were the deployment servers using it if they didn't permit unencrypted keys?
[19:43:04] I should get some food [19:43:30] (PS1) Urbanecm: Add jobs for wikimedia-cz/web-plugin and wikimediacz/web-theme [integration/config] - https://gerrit.wikimedia.org/r/543681 [19:43:40] Krinkle: Is that you using blameStartupRegistry? [19:43:52] See AW3WEXPaghP2xm4vmnOD etc. [19:44:43] https://logstash.wikimedia.org/goto/98450918bb82d3c3b05217dbe951e9cc [19:44:46] does not resolve [19:44:57] I haven't SSH'ed today yet [19:45:02] Oh. [19:45:08] but I did test with that some days ago [19:45:33] A smattering of `Error from line 160 of …/WikimediaMaintenance/blameStartupRegistry.php: Call to private method ResourceLoaderStartUpModule::getConfigSettings() from context 'BlameStartupRegistry'` [19:45:36] PROBLEM - App Server Main HTTP Response on deployment-mediawiki-09 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:45:49] Did you set it on a cron? [19:46:05] It has been a cron for a while yes [19:46:16] Ah, and we made a breaking change in wmf.2 I guess? [19:46:23] Hence the sudden burst just after the train. [19:47:28] Project-Admins: Create Project: Watchlist-Expiry - https://phabricator.wikimedia.org/T235686 (MBinder_WMF) Thanks for laying it all out, helps me understand. I think a separate project is probably OK. I mostly just worry about confusion for those people that aren't part of either project but want to engage... [19:47:52] Gerrit User Summit live for the first time: https://twitter.com/gerritforge/status/1184405834899607553 [19:48:19] James_F: hm.. not exactly, I made it public in master and backported to wmf.1 some days ago [19:48:26] was the branch not cut on Tuesday? [19:48:37] wmf.2 if anything should be fixing it not causing it [19:48:40] It was. [19:49:18] https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/542711/ [19:49:25] OK I guess I was dreaming when I thought the core change landed [19:49:31] :| [19:49:46] Yeah, tip of RLSUM is 5155abe0e6ab6589d4104a221df0a0b2c5142c16 on wmf.2 [19:49:55] Not merged.
:-) [19:50:09] I'm merging it now. [19:50:16] thx [19:50:25] can also revert the WikimediaMaint change if that doesn't fix it [19:50:40] Nah, let's just fix it. [19:50:56] thx for flagging it. [19:51:00] Gotta prep for next meeting now [19:51:05] That's what logspam is for. :-0 [19:51:21] Feel free to deploy whenever. Am also find with rolling it out in 2-3h on my own otherwise [19:51:25] fine* [19:52:02] hi, how's that publickey issue going? [19:52:13] (CR) Jforrester: Add jobs for wikimedia-cz/web-plugin and wikimediacz/web-theme (1 comment) [integration/config] - https://gerrit.wikimedia.org/r/543681 (owner: Urbanecm) [19:56:49] thcipriani: Deployment Freeze affects everything deployed via scap from deployment server or everything [19:59:15] in the past we've done everything that would be scheduled on the deployment page aside from extreme emergency SWAT deployments [19:59:39] so that may be a wider net than everything deployed via scap [20:01:43] Honestly, though, if SRE want to spend their Christmases taking down the e-mail MXes, I guess they can and RelEng don't get involved? :-) [20:02:37] thcipriani: ack [20:03:25] there are more subtle things like puppet is set to clone content from something and changes are in the deploy repo.
but the Deployment Calendar rule applies [20:30:26] RECOVERY - App Server Main HTTP Response on deployment-mediawiki-09 is OK: HTTP OK: HTTP/1.1 200 OK - 49274 bytes in 0.434 second response time [21:16:45] (PS2) Urbanecm: Add jobs for wikimedia-cz/web-plugin and wikimediacz/web-theme [integration/config] - https://gerrit.wikimedia.org/r/543681 [21:19:18] (CR) Urbanecm: Add jobs for wikimedia-cz/web-plugin and wikimediacz/web-theme (1 comment) [integration/config] - https://gerrit.wikimedia.org/r/543681 (owner: Urbanecm) [22:06:06] Project beta-scap-eqiad build #271495: FAILURE in 1 min 46 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/271495/ [22:10:45] !log https://integration.wikimedia.org/ci/job/wmf-quibble-core-vendor-mysql-php72-docker/8502/console got stuck in quibble's composer install step for half an hour; manually aborted. :-( [22:10:46] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [22:11:30] CalledProcessError. Interesting. [22:16:21] Project beta-scap-eqiad build #271496: STILL FAILING in 1 min 51 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/271496/ [22:26:16] Project beta-scap-eqiad build #271497: STILL FAILING in 1 min 49 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/271497/ [22:26:49] Hmm, Beta seems to have broken itself again.
:-( [22:35:26] (PS3) Jforrester: jjb: Point OOUI experimental image at node10-test-browser-php72-composer [integration/config] - https://gerrit.wikimedia.org/r/543227 (https://phabricator.wikimedia.org/T235570) [22:35:28] (PS1) Jforrester: dockerfiles: [node10-test-browser-php72-composer] Make this actually provide both PHP and Node [integration/config] - https://gerrit.wikimedia.org/r/543723 [22:36:03] Project beta-scap-eqiad build #271498: STILL FAILING in 1 min 46 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/271498/ [22:36:40] (CR) Jforrester: Add jobs for wikimedia-cz/web-plugin and wikimediacz/web-theme (1 comment) [integration/config] - https://gerrit.wikimedia.org/r/543681 (owner: Urbanecm) [22:37:24] no idea what that is [22:37:42] some weird permissions problem now? [22:38:19] hrm [22:38:28] did we lose contact with ldap at some point? [22:38:53] (PS3) Urbanecm: Add jobs for wikimedia-cz/web-plugin and wikimediacz/web-theme [integration/config] - https://gerrit.wikimedia.org/r/543681 [22:39:01] (CR) Urbanecm: Add jobs for wikimedia-cz/web-plugin and wikimediacz/web-theme (1 comment) [integration/config] - https://gerrit.wikimedia.org/r/543681 (owner: Urbanecm) [22:39:42] thcipriani, sounds like you may have seen this before?
:) [22:40:08] (CR) Jforrester: [C: +2] Add jobs for wikimedia-cz/web-plugin and wikimediacz/web-theme [integration/config] - https://gerrit.wikimedia.org/r/543681 (owner: Urbanecm) [22:40:43] heh, unexplained permissions can mean that we've created a local mwdeploy user that shadows the ldap mwdeploy user [22:40:50] looking now [22:41:53] which looks like what happened on deployment-deploy02 [22:41:57] (Merged) jenkins-bot: Add jobs for wikimedia-cz/web-plugin and wikimediacz/web-theme [integration/config] - https://gerrit.wikimedia.org/r/543681 (owner: Urbanecm) [22:42:13] now that you mention it there was a weird thing in puppet [22:42:27] changing ownership from mwdeploy to mwdeploy I think? [22:42:38] that fits with this [22:42:42] that would, yeah [22:42:59] !log deployment-deploy02:sudo vipw to remove mwdeploy user [22:43:01] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [22:43:08] vipw? [22:43:26] hm, ok [22:43:49] yep, just removing from passwd file locally is how I've been doing it [22:44:16] (CR) Urbanecm: "thanks" [integration/config] - https://gerrit.wikimedia.org/r/543681 (owner: Urbanecm) [22:45:05] scap pull works on deployment-deploy02, so hopefully that means beta-scap-eqiad'll be happy [22:46:06] Yippee, build fixed! [22:46:07] Project beta-scap-eqiad build #271499: FIXED in 1 min 48 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/271499/ [22:47:29] (y)
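Note on the 22:40-22:45 exchange: the cause found there was a local mwdeploy entry in the passwd file shadowing the LDAP mwdeploy user, fixed by removing the local entry with vipw. A minimal sketch of how such shadowing might be spotted is below. It is not a tool anyone in the channel used. The function names, the sample account lines and the UIDs are all made up for illustration, and it assumes both account lists are available as passwd(5)-formatted text (for example the contents of /etc/passwd and a directory export).

```python
def parse_passwd(text):
    """Map username -> uid (as a string) from passwd(5)-formatted text."""
    users = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")
        if len(fields) < 7:
            # Not a complete passwd entry; skip rather than guess.
            continue
        users[fields[0]] = fields[2]
    return users


def shadowed_users(local_text, directory_text):
    """Return (name, local_uid, directory_uid) for every name present in both.

    A name that appears locally and in the directory is a candidate for the
    kind of shadowing described in the log; differing UIDs would explain
    ownership changes that look like "from mwdeploy to mwdeploy".
    """
    local = parse_passwd(local_text)
    directory = parse_passwd(directory_text)
    return sorted(
        (name, local[name], directory[name])
        for name in local
        if name in directory
    )


if __name__ == "__main__":
    # Hypothetical sample data; the UIDs are invented for this sketch.
    local_sample = (
        "root:x:0:0:root:/root:/bin/bash\n"
        "mwdeploy:x:998:998::/var/lib/mwdeploy:/bin/bash\n"
    )
    directory_sample = "mwdeploy:x:603:603::/home/mwdeploy:/bin/bash\n"
    for name, local_uid, directory_uid in shadowed_users(local_sample, directory_sample):
        print(f"{name}: local uid {local_uid}, directory uid {directory_uid}")
```

The comparison is done on plain text so it can be tried without touching a real host. How the directory side is exported depends on the host's name-service configuration and is left out here.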