[00:03:31] (PS1) Legoktm: Build universal wheels [integration/tox-wikimedia] - https://gerrit.wikimedia.org/r/570750
[00:22:19] Beta-Cluster-Infrastructure, Operations: Upgrade puppet in deployment-prep (Puppet agent broken in Beta Cluster) - https://phabricator.wikimedia.org/T243226 (Krenair) >>! In T243226#5855852, @jbond wrote: > This is similar to productions which still has `issuer=CN = Puppet CA: palladium.eqiad.wmnet`. Tha...
[00:48:38] PROBLEM - English Wikipedia Mobile Main page on beta-cluster is CRITICAL: connect to address en.m.wikipedia.beta.wmflabs.org and port 443: Connection refused
[00:48:47] So I got puppet working.
[00:48:51] Except it broke everything :/
[00:49:00] Oops.
[00:50:19] PROBLEM - English Wikipedia Main page on beta-cluster is CRITICAL: connect to address en.wikipedia.beta.wmflabs.org and port 443: Connection refused
[00:50:25] looks like it wants to have ATS terminate TLS, except it's got a problem trying to pull the cert from acme-chief
[00:50:28] some http 500...hm
[00:54:55] looks like it needed hieradata from I770478dbc07739acd38cfd78b2d8b171093667fe
[00:57:53] that made puppet on cache-text05 a lot happier
[01:00:07] varnish was hogging its port, 3128
[01:05:19] it's listening on *:8443... wonder why not 443
[01:07:51] Release-Engineering-Team, serviceops: Enable phpdbg on mwdebug* servers - https://phabricator.wikimedia.org/T244549 (Jdforrester-WMF)
[01:07:58] root@deployment-cache-text05:/etc/trafficserver# grep proxy.config.http.server_ports . -r
[01:07:58] ./records.config:CONFIG proxy.config.http.server_ports STRING 3128 3128:ipv6
[01:08:14] no :ssl in there... based on modules/trafficserver/templates/records.config.erb this suggests we don't have @inbound_tls_settings set
[01:08:39] Release-Engineering-Team, serviceops: Enable phpdbg on mwdebug* servers - https://phabricator.wikimedia.org/T244549 (EBernhardson) In terms of actual deployment I think we can simply install the php-phpdbg package (available from our php7.2 deb component) and adjust MWScript.php to allow the 'phpdbg' SAP...
[01:09:33] hm, but:
[01:09:35] root@deployment-cache-text05:/srv/trafficserver/tls# grep proxy.config.http.server_ports . -r
[01:09:35] ./etc/records.config:CONFIG proxy.config.http.server_ports STRING 443:ssl 443:ipv6:ssl
[01:13:34] it's definitely running a tls instance process...
[01:15:14] /etc/systemd/system/trafficserver.service.d/puppet-override.conf doesn't seem to contain CAP_NET_BIND_SERVICE which would be needed for ports < 1024...
[01:24:32] okay wtf that was weird
[01:24:43] old ats process from 2019 was bound to 8443
[01:24:44] terminated it
[01:24:50] started trafficserver process again
[01:24:55] it binds to 443 and stuff comes back to life
[01:24:58] computers are dumb
[01:25:24] RECOVERY - English Wikipedia Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 92321 bytes in 3.944 second response time
[01:25:25] and I am tired
[01:28:04] Beta-Cluster-Infrastructure, Operations: Upgrade puppet in deployment-prep (Puppet agent broken in Beta Cluster) - https://phabricator.wikimedia.org/T243226 (Krenair) Got puppet running on -cache-text05, whole beta cluster broke, fixed acme-chief and ATS, going to sleep.
[01:28:41] RECOVERY - English Wikipedia Mobile Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 51268 bytes in 4.203 second response time
[01:28:51] (Also puppet is complaining about GeoIP again. This is fine.)
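The two `proxy.config.http.server_ports` values found above (one per Traffic Server instance) can be compared mechanically. The following is a minimal sketch, assuming the port specs are whitespace-separated and use colon-separated options such as `ssl` and `ipv6`, as in the grep output; `parse_server_ports` is a hypothetical helper, not part of Traffic Server or puppet. It also flags ports below 1024, which is where the CAP_NET_BIND_SERVICE question came from.

```python
def parse_server_ports(line):
    """Parse one records.config line that sets proxy.config.http.server_ports.

    Hypothetical helper for illustration; returns one dict per port spec,
    or an empty list if the line sets some other option.
    """
    fields = line.split()
    # Expected shape: CONFIG <name> STRING <port spec> [<port spec> ...]
    expected = ["CONFIG", "proxy.config.http.server_ports", "STRING"]
    if len(fields) < 4 or fields[:3] != expected:
        return []
    ports = []
    for spec in fields[3:]:
        port, *options = spec.split(":")
        ports.append({
            "port": int(port),
            "tls": "ssl" in options,
            "ipv6": "ipv6" in options,
            # Ports below 1024 need extra privileges to bind.
            "privileged": int(port) < 1024,
        })
    return ports


# The two lines quoted in the log, without the grep filename prefix.
plain = parse_server_ports(
    "CONFIG proxy.config.http.server_ports STRING 3128 3128:ipv6")
tls = parse_server_ports(
    "CONFIG proxy.config.http.server_ports STRING 443:ssl 443:ipv6:ssl")

for name, ports in (("backend", plain), ("tls", tls)):
    for p in ports:
        print(name, p)
```

In this reading, the instance under /etc/trafficserver only listens on 3128 without TLS, while the separate instance under /srv/trafficserver/tls is the one expected to bind 443.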
[01:59:02] MediaWiki-Codesniffer, User-DannyS712: add sniffs for using less specific assertions - https://phabricator.wikimedia.org/T244556 (DannyS712)
[04:49:47] PROBLEM - English Wikipedia Mobile Main page on beta-cluster is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:54:39] RECOVERY - English Wikipedia Mobile Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 51238 bytes in 4.326 second response time
[06:49:06] Release-Engineering-Team-TODO (2020-01 to 2020-03 (Q3)), Patch-For-Review, Release, Train Deployments: 1.35.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T233866 (thiemowmde) We need help with T240858. It appears like the train deleted a message on all group 1 wikis it should n...
[07:27:48] 15:57:57 legoktm: You continuing your trek to get rid of python2? :-) <-- yes, definitely
[07:28:20] (CR) Legoktm: [C: +2] Build universal wheels [integration/tox-wikimedia] - https://gerrit.wikimedia.org/r/570750 (owner: Legoktm)
[07:32:57] (Merged) jenkins-bot: Build universal wheels [integration/tox-wikimedia] - https://gerrit.wikimedia.org/r/570750 (owner: Legoktm)
[07:56:13] (PS1) Legoktm: Update Gerrit URI in test-requirements.txt [integration/jenkins] - https://gerrit.wikimedia.org/r/570815
[07:56:15] (PS1) Legoktm: Use yaml.safe_load to avoid potential RCE [integration/jenkins] - https://gerrit.wikimedia.org/r/570816
[07:56:17] (PS1) Legoktm: Use pytest instead of deprecated nose [integration/jenkins] - https://gerrit.wikimedia.org/r/570817
[07:56:20] (PS1) Legoktm: Simplify configuration with tox-wikimedia [integration/jenkins] - https://gerrit.wikimedia.org/r/570818
[07:56:54] (PS2) Legoktm: Use pytest instead of deprecated nose [integration/jenkins] - https://gerrit.wikimedia.org/r/570817
[07:56:55] (PS2) Legoktm: Simplify configuration with tox-wikimedia [integration/jenkins] - https://gerrit.wikimedia.org/r/570818
[09:05:28] Phabricator, Release-Engineering-Team, DBA, Operations: Upgrade and restart m3 (phabricator) master (db1128) - https://phabricator.wikimedia.org/T244566 (Marostegui)
[09:05:39] Phabricator, Release-Engineering-Team, DBA, Operations: Upgrade and restart m3 (phabricator) master (db1128) - https://phabricator.wikimedia.org/T244566 (Marostegui) p:Triage→Medium
[09:10:55] (PS2) Legoktm: [WIP] Add standard black configuration [integration/tox-wikimedia] - https://gerrit.wikimedia.org/r/566407
[09:14:30] (CR) jerkins-bot: [V: -1] [WIP] Add standard black configuration [integration/tox-wikimedia] - https://gerrit.wikimedia.org/r/566407 (owner: Legoktm)
[10:03:02] PROBLEM - App Server Main HTTP Response on deployment-mediawiki-07 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:07:58] RECOVERY - App Server Main HTTP Response on deployment-mediawiki-07 is OK: HTTP OK: HTTP/1.1 200 OK - 91829 bytes in 6.231 second response time
[10:27:25] Release-Engineering-Team-TODO (2020-01 to 2020-03 (Q3)), Release, Train Deployments: 1.35.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T233867 (Addshore)
[10:30:03] Release-Engineering-Team, Gerrit-Privilege-Requests, Operations, SRE-Access-Requests: Request for +2 access to mediawiki-config - https://phabricator.wikimedia.org/T244508 (jijiki) p:Triage→Medium
[10:30:37] Release-Engineering-Team, Gerrit-Privilege-Requests, Operations, SRE-Access-Requests: Request for +2 access to mediawiki-config - https://phabricator.wikimedia.org/T244508 (jijiki) @thcipriani is that a task for your end? I am not sure :)
[10:39:32] Release-Engineering-Team, Gerrit-Privilege-Requests, Operations, SRE-Access-Requests: Request for +2 access to mediawiki-config - https://phabricator.wikimedia.org/T244508 (Urbanecm) @jijiki Anyone listed at https://gerrit.wikimedia.org/r/admin/groups/21,members plus members of ops LDAP group can...
[10:57:47] Scap, Operations, serviceops: Make canary wait time configurable - https://phabricator.wikimedia.org/T217924 (jijiki) shall we move this forward?
[11:11:05] Phabricator: Rename my Phabricator account - https://phabricator.wikimedia.org/T244537 (Aklapper) Open→Resolved a:Aklapper Renamed; for the AKA please use https://phabricator.wikimedia.org/people/editprofile/21455/
[12:18:29] PROBLEM - English Wikipedia Main page on beta-cluster is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:19:03] PROBLEM - App Server Main HTTP Response on deployment-mediawiki-07 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:24:01] RECOVERY - App Server Main HTTP Response on deployment-mediawiki-07 is OK: HTTP OK: HTTP/1.1 200 OK - 91829 bytes in 9.070 second response time
[12:28:24] RECOVERY - English Wikipedia Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 92321 bytes in 4.211 second response time
[13:06:07] Continuous-Integration-Infrastructure (Slipway), Release-Engineering-Team (CI & Testing services), Release-Engineering-Team-TODO, Wikidata, and 2 others: Migrate wikidata-query-gui-build to Docker containers - https://phabricator.wikimedia.org/T210286 (Addshore) So the decision by the QS team was...
[13:19:52] This is everyone's favorite question I'm sure... I'd like to do a full scap to refresh the l10n cache on group1/wmf.18 wikis. This is to solve a few thousand pages currently broken due to an UBN: T240858
[13:19:54] T240858: Clean up implementation for "follow" cases - https://phabricator.wikimedia.org/T240858
[13:21:18] In the meantime, I'm trying to understand why the l10n files were inconsistent with code. Is it possible that the l10n update is only done for group0 deployments, and not regenerated for the subsequent train deployments to group1 and 2?
[13:32:20] well there was a rollback yesterday to compound things
[13:32:31] but I don't know about when they are generated
[13:37:46] I think I understand, now. The group0 deployment includes a full "scap sync", which builds the l10n cache files, but the group1 and group2 deployments are a simple configuration push, which causes the multiversion framework to load MediaWiki from a different directory depending on the host requested.
[13:38:00] Hindsight is 20:40 or something.
[13:49:06] gotcha
[13:50:23] In other words, "burned by a thing I should have known already" ;-)
[14:04:29] twentyafterfour: Moving here rather than the unrelated ticket. I learned something interesting about the cache/gitinfo just now. It apparently does get updated during scap sync-file, but only one of the pointers is changed. For example, the info-extensions-Cite.json file contains,
[14:04:34] {"head": "71c2f93adcc07c6b933827457c4ff203686e04fa", "remoteURL": "https://gerrit.wikimedia.org/r/mediawiki/extensions/Cite", "branch": "71c2f93adcc07c6b933827457c4ff203686e04fa", "headCommitDate": "1580801583", "headSHA1": "7defec4d7a4c9890564e7373b5fdf609419c5a36", "@directory": "/srv/mediawiki-staging/php-1.35.0-wmf.18/extensions/Cite"}
[14:05:26] Special:Version reports the `headSHA1` which happens to be the original branchpoint, but I would have expected it to report `head`, which is the actual version checked out and deployed.
[14:06:07] awight: interesting
[14:07:09] Hi releng! hoo and I would like to do a config change for an UBN :D
[14:07:20] https://phabricator.wikimedia.org/T244529
[14:07:25] awight: l10n files are not updated during the normal train deployment for group1 and group2 so any swat that updates l10n specifically should run a full scap afterwards
[14:07:39] twentyafterfour: excellent, thanks for the confirmation.
[14:07:50] addshore: I don't see a problem with that
[14:08:06] hoo: I'll leave it up to you when to do it!
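The `head` versus `headSHA1` mismatch described above can be checked directly against a cache/gitinfo file. This is a minimal sketch that only reads the JSON pasted in the log; `stale_pointers` is a hypothetical helper name, and the interpretation of the fields (that `head` is the deployed commit and `headSHA1` the original branch point) is taken from the discussion, not from the GitInfo documentation.

```python
import json

# Contents of info-extensions-Cite.json as pasted in the log.
GITINFO_JSON = """
{"head": "71c2f93adcc07c6b933827457c4ff203686e04fa",
 "remoteURL": "https://gerrit.wikimedia.org/r/mediawiki/extensions/Cite",
 "branch": "71c2f93adcc07c6b933827457c4ff203686e04fa",
 "headCommitDate": "1580801583",
 "headSHA1": "7defec4d7a4c9890564e7373b5fdf609419c5a36",
 "@directory": "/srv/mediawiki-staging/php-1.35.0-wmf.18/extensions/Cite"}
"""


def stale_pointers(info):
    """Return the keys whose value differs from info["head"].

    Hypothetical check: only compares the commit-like fields that
    appear in the pasted file.
    """
    return [key for key in ("branch", "headSHA1") if info[key] != info["head"]]


info = json.loads(GITINFO_JSON)
print(stale_pointers(info))
```

For the pasted file this reports `headSHA1` as the only field out of step with `head`, which matches the observation that only one of the pointers is changed by scap sync-file.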
[14:08:15] twentyafterfour: btw, the documentation for GitInfo::getHead vs getHeadSHA1 does not make it clear that the difference was intended.
[14:08:15] addshore: would you like me to do the needful?
[14:08:21] PROBLEM - App Server Main HTTP Response on deployment-mediawiki-09 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:08:24] I'm on a train at this current second, and will be in a building in 30 mins or so
[14:08:32] twentyafterfour: I think we can manage it ourselves :) thanks though!
[14:08:46] PROBLEM - English Wikipedia Mobile Main page on beta-cluster is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:09:02] awight: yeah I'm not sure why there is a separate head and headSHA1
[14:09:29] PROBLEM - English Wikipedia Main page on beta-cluster is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:09:58] awight: which documentation are you looking at?
[14:10:04] PROBLEM - App Server Main HTTP Response on deployment-mediawiki-07 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:11:31] PROBLEM - App Server Main HTTP Response on deployment-mediawiki-parsoid10 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:16:20] I can't find any clues about why beta just died
[14:21:25] RECOVERY - App Server Main HTTP Response on deployment-mediawiki-parsoid10 is OK: HTTP OK: HTTP/1.1 200 OK - 91849 bytes in 2.887 second response time
[14:23:13] RECOVERY - App Server Main HTTP Response on deployment-mediawiki-09 is OK: HTTP OK: HTTP/1.1 200 OK - 91823 bytes in 1.013 second response time
[14:23:17] hmm
[14:28:42] RECOVERY - English Wikipedia Mobile Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 51280 bytes in 4.828 second response time
[14:29:25] RECOVERY - English Wikipedia Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 92273 bytes in 5.618 second response time
[14:29:56] RECOVERY - App Server Main HTTP Response on deployment-mediawiki-07 is OK: HTTP OK: HTTP/1.1 200 OK - 91829 bytes in 4.568 second response time
[14:37:17] I'll deploy my config change once things have finished settling down and I'm not on a train :)
[14:38:10] Scratch that, done now!
[14:39:46] PROBLEM - English Wikipedia Mobile Main page on beta-cluster is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:40:29] PROBLEM - English Wikipedia Main page on beta-cluster is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:41:02] PROBLEM - App Server Main HTTP Response on deployment-mediawiki-07 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:42:31] PROBLEM - App Server Main HTTP Response on deployment-mediawiki-parsoid10 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:42:54] (CR) Gopavasanth: "> > Patch Set 4:" [integration/config] - https://gerrit.wikimedia.org/r/569352 (https://phabricator.wikimedia.org/T244079) (owner: Gopavasanth)
[14:45:25] RECOVERY - English Wikipedia Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 92297 bytes in 6.042 second response time
[14:45:55] RECOVERY - App Server Main HTTP Response on deployment-mediawiki-07 is OK: HTTP OK: HTTP/1.1 200 OK - 91853 bytes in 3.916 second response time
[14:47:24] RECOVERY - App Server Main HTTP Response on deployment-mediawiki-parsoid10 is OK: HTTP OK: HTTP/1.1 200 OK - 91863 bytes in 2.018 second response time
[14:49:39] RECOVERY - English Wikipedia Mobile Main page on beta-cluster is OK: HTTP OK: HTTP/1.1 200 OK - 51279 bytes in 3.506 second response time
[14:59:08] (PS3) Thiemo Kreuz (WMDE): Rewrite MultipleEmptyLines sniff for performance [tools/codesniffer] - https://gerrit.wikimedia.org/r/566552
[15:03:01] a heads up to addshore and hoo: we should probably figure out a way to coordinate also with sre, when an unscheduled deployment/ubn etc is going to happen
[15:03:44] ideas welcome
[15:04:51] (CR) Thiemo Kreuz (WMDE): "I rebased this again and changed it back to use the 'line' information, instead of string comparisons with $phpcsFile->eolChar. This works" (1 comment) [tools/codesniffer] - https://gerrit.wikimedia.org/r/566552 (owner: Thiemo Kreuz (WMDE))
[15:05:51] apergos: yes! It's an interesting dynamic I guess as sometimes ops point us to releng, and sometimes releng point us to ops :P
[15:06:26] I think both teams have to give a thumbs up
[15:07:28] if sre is not available/busy with outage stuff/skeptical that something is so urgent as to be an ubn (they might be wrong, pushback is ok)
[15:07:37] anyways without coordination there is not even notice
[15:07:52] releng is de facto in charge of deploys so they should be signing off too
[15:08:44] most of the time that should add almost no slowdown to the deploy time...
[15:09:49] That's true, I think we (at least I) just didn't think to ask again
[15:10:14] it should just go into the deploy/swat/whatever procedure someplace
[15:10:23] +1
[15:10:41] wonder who specifically maintains that
[15:15:17] Hey folks. I'm trying to deploy ORES to beta. When I run scap from deployment-deploy01 it fails with "Host key verification failed" on deployment-ores01.
[15:15:27] Any idea what might be going on?
[15:27:08] https://phabricator.wikimedia.org/P10347
[15:27:34] twentyafterfour: I was reading the PHP docs here, https://phabricator.wikimedia.org/source/mediawiki/browse/master/includes/GitInfo.php
[15:32:32] halfak: hi, it looks like the host was reimaged, does that sound likely? If so, you can remove the old hostkey with "ssh-keygen -R deployment-ores01.deployment-prep.eqiad.wmflabs"
[15:32:36] gtg o/
[15:32:46] thanks for the tip!
[15:37:56] Right
[15:38:28] hoo|away: apergos twentyafterfour it looks like this config change we just tried to make actually revealed what is blocking the train. the answer is, Wikibase and this config!
[15:38:53] as it is set wrong in mediawiki-config, a new default that is in Wikibase is being rolled out with .18 unexpectedly.
[15:38:54] addshore: ohh
[15:39:16] So, what we just saw deploying our UBN config change essentially revealed some of the same issues that rolling the train forward did
[15:40:15] addshore: wonderful. Can you update https://wikitech.wikimedia.org/wiki/Incident_documentation/20200206-mediawiki with relevant findings?
[15:40:25] ho ho ho
[15:40:35] that is very fun indeed
[15:40:45] this world is just trying to cram as many more things into my week before I go skiing, I can see
[15:40:50] pretty much
[15:40:55] it's been a bit ridiculous...
[15:41:27] twentyafterfour: Yes I will and I'll make a patch to fix the current state, and also unblock the train
[15:42:17] Looks like I can't remove scap/deploy-service
[15:42:19] 's keys
[15:42:28] woot thanks addshore
[15:42:40] halfak: want me to give it a try?
[15:42:42] please no more deploys today though ... if at all possible
[15:42:50] yes please twentyafterfour :)
[15:43:14] there's already been enough 'excitement' before the weekend
[15:43:22] apergos: Ideally we should still make this config deploy, it's an UBN and all group0 and group1 wikis are reading from tables that are not fully populated for wikidata item labels and descriptions
[15:43:45] Or, we would want to roll the train back
[15:44:15] it's Friday late afternoon/early evening here already
[15:44:46] and we've had two outages (not blaming anyone, just stating energy level/etc) already today
[15:44:54] so that's where I'm coming from on this
[15:45:14] we rolled back to wmf.16, we can deal with it on monday as long as we can find a way to leave the sites in a fairly stable state
[15:45:38] right
[15:46:09] we should definitely respect sre folks getting a chance to rest
[15:46:38] addshore: what's the minimally disruptive / risky course that you would propose?
[15:46:43] I do partly agree, but caches everywhere are being filled with "garbage" / incorrect things currently
[15:46:54] that's not good
[15:47:13] minimally disruptive = resetting this config back to the default that was in .16
[15:47:21] which is what it was set to everywhere at the start of this week
[15:48:15] In fact, rather than touching config, we could revert https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/Wikibase/+/566513/ on .18
[15:48:27] and leave touching mediawiki-config until next week
[15:48:31] I apologize but I have to step out, the meeting we postponed due to outages is about to start...
[15:48:34] back in 30 mins
[15:48:47] I'll try to keep an eye over here but...
[15:50:17] thanks apergos.
[15:51:01] addshore: I leave it up to you and sre to decide the best course of action. I'm fine with either option, I think I agree we should do one or the other as long as sre is ok with it
[15:51:41] halfak: sorry I haven't forgotten you, looking at the beta cluster ssh key thing now
[15:51:46] Release-Engineering-Team-TODO (2020-01 to 2020-03 (Q3)), Patch-For-Review, Release, Train Deployments: 1.35.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T233866 (Addshore)
[15:52:03] Thanks! No worries. Seems like y'all have your hands full. I appreciate it :)
[15:53:14] we will be unavailable for discussion for the next 30 minutes (me too, have to pay attention now)
[15:53:14] Release-Engineering-Team-TODO (2020-01 to 2020-03 (Q3)), Patch-For-Review, Release, Train Deployments: 1.35.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T233866 (Addshore) >>! In T233866#5858082, @Stashbot wrote: > {nav icon=file, name=Mentioned in SAL (#wikimedia-operations),...
[15:55:51] twentyafterfour: ack!
[15:56:04] * addshore will wait 30 mins in that case!
[15:59:00] Beta-Cluster-Infrastructure, Release-Engineering-Team-TODO (2020-01 to 2020-03 (Q3)), MediaWiki-Configuration, MW-1.35-notes (1.35.0-wmf.19; 2020-02-11): Beta: Undefined index: 1x in /srv/mediawiki-staging/php-master/includes/Setup.php on line 186 - https://phabricator.wikimedia.org/T244370 (abi_...
[16:03:55] halfak: give it a shot now, I added the host key to global known hosts on deployment-deploy01. Not sure where else I could add it or why puppet isn't keeping that file up to date
[16:04:07] \o/ working
[16:10:46] Release-Engineering-Team-TODO (2020-01 to 2020-03 (Q3)), Patch-For-Review, Release, Train Deployments: 1.35.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T233866 (Addshore) Summary of the blocker at T244529#5859927
[16:11:08] twentyafterfour: I wrote the things up here https://phabricator.wikimedia.org/T244529#5859927
[16:11:34] addshore: thanks!
[16:17:25] is gerrit incredibly slow for anyone else?
[16:18:03] Beta-Cluster-Infrastructure, Release-Engineering-Team (Other / Uncategorized), Release-Engineering-Team-TODO, Cloud-Services, and 2 others: Horizon hiera UI: investigate data type handling - https://phabricator.wikimedia.org/T243422 (Andrew) Open→Resolved This is quite a bit better now.
[16:18:06] Beta-Cluster-Infrastructure, Release-Engineering-Team (Other / Uncategorized), Release-Engineering-Team-TODO, Cloud-Services, and 2 others: Re-think puppet management for deployment-prep - https://phabricator.wikimedia.org/T161675 (Andrew)
[16:18:14] nvm, seems ok now
[16:22:12] ok folks, our meeting is over, let's move the deploy/not deploy discussion to wikimedia-sre please
[16:22:35] a number of folks are there not here...
[16:22:40] addshore: twentyafterfour hoo
[16:22:57] ack!
[16:23:05] ok
[16:24:55] Release-Engineering-Team-TODO (2020-01 to 2020-03 (Q3)), Release, Train Deployments: 1.35.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T233867 (Addshore)
[16:50:49] Release-Engineering-Team (Development services), Release-Engineering-Team-TODO (2020-01 to 2020-03 (Q3)), Jenkins: Jenkins plugins security advisory - 2012-02-12 - https://phabricator.wikimedia.org/T244582 (greg)
[17:17:25] (07:17:18 PM) apergos: folks, anyone following along: unscheduled deploy happening now (rollback to .16 on groups 0, 1)
[17:43:33] Release-Engineering-Team-TODO (2020-01 to 2020-03 (Q3)), Patch-For-Review, Release, Train Deployments: 1.35.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T233866 (Addshore) If .18 restarts next week then the backport for .18 in T244529 is needed for sure and is a blocker. If ne...
[17:46:37] (CR) Jforrester: [C: +2] Update Gerrit URI in test-requirements.txt [integration/jenkins] - https://gerrit.wikimedia.org/r/570815 (owner: Legoktm)
[17:46:50] (CR) Jforrester: [C: +1] Use yaml.safe_load to avoid potential RCE [integration/jenkins] - https://gerrit.wikimedia.org/r/570816 (owner: Legoktm)
[17:47:22] (Merged) jenkins-bot: Update Gerrit URI in test-requirements.txt [integration/jenkins] - https://gerrit.wikimedia.org/r/570815 (owner: Legoktm)
[17:47:53] PROBLEM - Parsoid on deployment-parsoid09 is CRITICAL: connect to address 172.16.5.63 and port 8000: Connection refused
[17:47:53] PROBLEM - Parsoid on deployment-mediawiki-parsoid10 is CRITICAL: connect to address 172.16.0.141 and port 8000: Connection refused
[17:47:54] (CR) Jforrester: [C: +1] Use pytest instead of deprecated nose [integration/jenkins] - https://gerrit.wikimedia.org/r/570817 (owner: Legoktm)
[17:48:16] (CR) Jforrester: [C: +2] Simplify configuration with tox-wikimedia [integration/jenkins] - https://gerrit.wikimedia.org/r/570818 (owner: Legoktm)
[17:56:13] Release-Engineering-Team-TODO (2020-01 to 2020-03 (Q3)), Patch-For-Review, Release, Train Deployments: 1.35.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T233866 (Krinkle)
[17:57:35] Beta-Cluster-Infrastructure, RESTBase: Restbase [might be] down on beta - https://phabricator.wikimedia.org/T244586 (Jdforrester-WMF)
[18:18:36] matthiasmullie and I were thinking it probably makes most sense to backport the code change necessitating the config change that caused T244591 to wmf.16 (which tbh should have been done in the first place) and redeploy the config change. then it's one less thing to trip over when attempting to reroll wmf.18+. thoughts? when would be a good time to do that?
[18:18:37] T244591: Argument 4 passed to GoogleCloudVisionHandler must be an instance of WikidataDepictsSetter - https://phabricator.wikimedia.org/T244591
[18:25:13] Beta-Cluster-Infrastructure, RESTBase: Restbase [might be] down on beta - https://phabricator.wikimedia.org/T244586 (Mholloway) Looks like a routing problem. The deployment-restbase0* instances are working fine, as is restbase-beta.wmflabs.org ([[ https://restbase-beta.wmflabs.org/en.wikipedia.beta.wmfl...
[18:40:37] (PS1) Jforrester: Follow-up 5e35c61: Zuul: [labs/tools/VideoCutTool] Switch from tox to node [integration/config] - https://gerrit.wikimedia.org/r/570936 (https://phabricator.wikimedia.org/T244079)
[18:40:39] (PS1) Jforrester: Zuul: [wikidata/query/gui] Use node10 template rather than direct jobs [integration/config] - https://gerrit.wikimedia.org/r/570937
[18:54:06] (CR) Jforrester: [C: +2] Zuul: [wikidata/query/gui] Use node10 template rather than direct jobs [integration/config] - https://gerrit.wikimedia.org/r/570937 (owner: Jforrester)
[18:54:09] (CR) Jforrester: [C: +2] Follow-up 5e35c61: Zuul: [labs/tools/VideoCutTool] Switch from tox to node [integration/config] - https://gerrit.wikimedia.org/r/570936 (https://phabricator.wikimedia.org/T244079) (owner: Jforrester)
[18:55:12] (Merged) jenkins-bot: Follow-up 5e35c61: Zuul: [labs/tools/VideoCutTool] Switch from tox to node [integration/config] - https://gerrit.wikimedia.org/r/570936 (https://phabricator.wikimedia.org/T244079) (owner: Jforrester)
[18:55:15] (Merged) jenkins-bot: Zuul: [wikidata/query/gui] Use node10 template rather than direct jobs [integration/config] - https://gerrit.wikimedia.org/r/570937 (owner: Jforrester)
[18:56:18] Continuous-Integration-Config, Wikidata: Wikibase CI: Quibble job should possibly include Math extension - https://phabricator.wikimedia.org/T201496 (Physikerwelt) Yes. Almost certainly. Especially in the beginning of the weekend it is sometimes frustrating if you start to realize that wikibase became in...
[19:01:29] !log Zuul: [labs/tools/VideoCutTool] Switch from tox to node T244079
[19:01:31] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[19:01:31] T244079: Set up continuous integration for VideoCutTool - https://phabricator.wikimedia.org/T244079
[19:06:13] (CR) Jforrester: "> Patch Set 4:" [integration/config] - https://gerrit.wikimedia.org/r/569352 (https://phabricator.wikimedia.org/T244079) (owner: Gopavasanth)
[19:06:24] (PS2) Jforrester: Add labs/tools/video-cut-tool-back-end [integration/config] - https://gerrit.wikimedia.org/r/569697 (https://phabricator.wikimedia.org/T244079) (owner: Gopavasanth)
[19:08:22] (PS3) Jforrester: Zuul: [labs/tools/video-cut-tool-back-end] Add node10 CI [integration/config] - https://gerrit.wikimedia.org/r/569697 (https://phabricator.wikimedia.org/T244079) (owner: Gopavasanth)
[19:09:29] (CR) Jforrester: [C: +2] Zuul: [labs/tools/video-cut-tool-back-end] Add node10 CI [integration/config] - https://gerrit.wikimedia.org/r/569697 (https://phabricator.wikimedia.org/T244079) (owner: Gopavasanth)
[19:10:24] (Merged) jenkins-bot: Zuul: [labs/tools/video-cut-tool-back-end] Add node10 CI [integration/config] - https://gerrit.wikimedia.org/r/569697 (https://phabricator.wikimedia.org/T244079) (owner: Gopavasanth)
[19:10:53] !log Zuul: [labs/tools/video-cut-tool-back-end] Add node10 CI T244079
[19:10:54] Release-Engineering-Team (CI & Testing services), Release-Engineering-Team-TODO (2020-01 to 2020-03 (Q3)), Release Pipeline: Allow additional helm overrides in PipelineLib config - https://phabricator.wikimedia.org/T244512 (dduvall) p:Triage→Medium
[19:10:57] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[19:10:57] T244079: Set up continuous integration for VideoCutTool - https://phabricator.wikimedia.org/T244079
[19:16:36] (PS1) Jforrester: Run Math tests for every change in Wikibase [integration/config] - https://gerrit.wikimedia.org/r/570941 (https://phabricator.wikimedia.org/T201496)
[19:17:22] Continuous-Integration-Config, Wikidata, Patch-For-Review: Wikibase CI: Quibble job should possibly include Math extension - https://phabricator.wikimedia.org/T201496 (Jdforrester-WMF) Happy to make this change, if the Wikibase team are OK with it.
[19:25:59] (PS1) Dduvall: Allow helm chart overrides in deploy configuration [integration/pipelinelib] - https://gerrit.wikimedia.org/r/570944 (https://phabricator.wikimedia.org/T244512)
[19:27:21] Release-Engineering-Team (CI & Testing services), Release-Engineering-Team-TODO (2020-01 to 2020-03 (Q3)), Release Pipeline, Patch-For-Review: Allow additional helm overrides in PipelineLib config - https://phabricator.wikimedia.org/T244512 (dduvall)
[19:44:06] James_F: who 'owns' the page, wherever it is, that says 'for unscheduled/ubn deploys, do this, do that, check with releng'?
[19:44:25] I want to add 'check with sre' on there as a bullet point, to make sure it's covered
[19:50:17] Beta-Cluster-Infrastructure, RESTBase: Restbase [might be] down on beta - https://phabricator.wikimedia.org/T244586 (DLynch)
[19:50:42] apergos: Err. I guess we do (specifically, greg-g). But not sure it's written down.
[19:50:55] hmm
[19:52:21] well I'll add it as an action item for the related incident and we'll figure something out
[19:52:22] https://wikitech.wikimedia.org/w/index.php?search=%22emergency+deploy%22&title=Special:Search isn't filling me with confidence.
[19:52:23] thanks!
[19:52:35] ouch :-D
[19:52:40] There's a 2013 incident report saying we should have an emergency procedures document.
[19:52:53] (https://wikitech.wikimedia.org/wiki/Incident_documentation/20130628-Site)
[19:53:47] Wiki 911
[19:53:53] or 999
[19:55:58] to laugh or to cry...
[19:56:30] apergos: I've started https://wikitech.wikimedia.org/wiki/Deployments/Emergencies
[19:57:15] feel free to use T31508 as an example case James_F :P
[19:57:25] I categorized it :-D
[19:57:34] Ha. Ops.
[19:57:46] :-D
[19:58:37] where can it be linked off the main deployments page?
[19:58:49] I'm looking at the available sections but... brain dead tbh
[19:58:57] been one of those months all week
[19:59:22] I'll add that.
[19:59:37] ty!
[20:25:08] Phabricator, Release-Engineering-Team: Make "Related Gerrit patches" easier to click and/or to visually associate its meta data - https://phabricator.wikimedia.org/T244601 (Krinkle)
[20:59:29] Beta-Cluster-Infrastructure, RESTBase: Restbase [might be] down on beta - https://phabricator.wikimedia.org/T244586 (Ryasmeen) p:Triage→Unbreak!
[21:00:48] Beta-Cluster-Infrastructure, RESTBase: Restbase [might be] down on beta - https://phabricator.wikimedia.org/T244586 (Ryasmeen) I am raising it as UBN, since all testing activities are stalled because of this.
[21:05:51] (CR) Umherirrender: [C: +2] Rewrite MultipleEmptyLines sniff for performance [tools/codesniffer] - https://gerrit.wikimedia.org/r/566552 (owner: Thiemo Kreuz (WMDE))
[21:06:25] (Merged) jenkins-bot: Rewrite MultipleEmptyLines sniff for performance [tools/codesniffer] - https://gerrit.wikimedia.org/r/566552 (owner: Thiemo Kreuz (WMDE))
[21:16:38] (PS1) Umherirrender: Replace isset() by File::numTokens to check for overflow [tools/codesniffer] - https://gerrit.wikimedia.org/r/570953
[21:17:42] (PS2) Umherirrender: Replace isset() by File::numTokens to check for end of array [tools/codesniffer] - https://gerrit.wikimedia.org/r/570953
[21:24:45] Beta-Cluster-Infrastructure, RESTBase: Restbase down on beta, 2020-02-07 - https://phabricator.wikimedia.org/T244586 (Jdforrester-WMF)
[21:26:08] Beta-Cluster-Infrastructure, RESTBase: Restbase routing down on beta, 2020-02-07 - https://phabricator.wikimedia.org/T244586 (Jdforrester-WMF)
[21:43:40] apergos: I do/Release Engineering.
[21:44:08] Beta-Cluster-Infrastructure, RESTBase: Restbase routing down on beta, 2020-02-07 - https://phabricator.wikimedia.org/T244586 (Mholloway) To help isolate possible culprits, when was this last working?
[21:44:08] late to the party, greg-g, see the links :-)
[21:44:21] I know I know :) better late than never
[21:44:24] heh
[21:58:28] Beta-Cluster-Infrastructure, Operations, RESTBase, Traffic: Restbase routing down on beta, 2020-02-07 - https://phabricator.wikimedia.org/T244586 (Pchelolo) RESTBase itself seem to be working correctly. Something is wrong with routing before RESTBase. If you look at https://en.wikipedia.beta.wmfl...
[22:09:24] (PS1) Umherirrender: Detect multiple empty lines after single line comment [tools/codesniffer] - https://gerrit.wikimedia.org/r/570955
[22:26:18] Release-Engineering-Team-TODO (2020-01 to 2020-03 (Q3)), Patch-For-Review, Release, Train Deployments: 1.35.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T233866 (matmarex)
[23:01:40] (PS2) Umherirrender: Detect multiple empty lines after single line comment [tools/codesniffer] - https://gerrit.wikimedia.org/r/570955
[23:14:26] (CR) Brennen Bearnes: [V: +2 C: +2] "Built and checked against docker-compose proof-of-concept, looks good to me." [releng/dev-images] - https://gerrit.wikimedia.org/r/556798 (owner: Kosta Harlan)
[23:15:03] (PS3) Umherirrender: Detect multiple empty lines after single line comment [tools/codesniffer] - https://gerrit.wikimedia.org/r/570955
[23:15:38] (PS2) Brennen Bearnes: Add XDebug-enabled variant of php72-fpm-apache2 [releng/dev-images] - https://gerrit.wikimedia.org/r/556798 (https://phabricator.wikimedia.org/T244382) (owner: Kosta Harlan)
[23:15:59] Phabricator: Rename my Phabricator account - https://phabricator.wikimedia.org/T244537 (AronManning) >>! In T244537#5859341, @Aklapper wrote: > Renamed Thank you, Aklapper!
[23:16:28] Beta-Cluster-Infrastructure, Operations, RESTBase, Traffic: Restbase routing down on beta, 2020-02-07 - https://phabricator.wikimedia.org/T244586 (Jdforrester-WMF) >>! In T244586#5860963, @Mholloway wrote: > To help isolate possible culprits, when was this last working? Yesterday, but not sure w...
[23:17:59] (CR) Brennen Bearnes: [V: +2 C: +2] "Added T24438 to commit, no other changes." [releng/dev-images] - https://gerrit.wikimedia.org/r/556798 (https://phabricator.wikimedia.org/T244382) (owner: Kosta Harlan)
[23:20:28] !log puppetdb on deployment-puppetdb03 was killed by kernel OOM at Feb 7 09:50:29, per syslog. I just ran `systemctl start puppetdb` on that host, to fix puppet issues in beta.
[23:20:29] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[23:20:47] Beta-Cluster-Infrastructure, Operations, RESTBase, Traffic: Restbase routing down on beta, 2020-02-07 - https://phabricator.wikimedia.org/T244586 (Krenair) Did fix puppet on cache-text05 yesterday, it did a lot of stuff to replace some nginx/varnish stuff with ATS. May be related?
[23:20:56] !log Updating dev-images docker-pkg files on contint1001 for T244382
[23:20:58] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[23:20:59] T244382: dev-images: Add XDebug-enabled variant of stretch-php72-fpm-apache2 - https://phabricator.wikimedia.org/T244382
[23:21:23] Beta-Cluster-Infrastructure, Operations, RESTBase, Traffic: Restbase routing down on beta, 2020-02-07 - https://phabricator.wikimedia.org/T244586 (Jdforrester-WMF) >>! In T244586#5861250, @Krenair wrote: > Did fix puppet on cache-text05 yesterday, it did a lot of stuff to replace > some nginx/var...
[23:22:11] Beta-Cluster-Infrastructure, Operations, RESTBase, Traffic: Restbase routing down on beta, 2020-02-07 - https://phabricator.wikimedia.org/T244586 (Mholloway) It looks like routing for /api/rest_v1/ in Beta is set up in the prefix puppet settings for deployment-cache-text (as seen [[ https://gerri...
[23:23:04] Beta-Cluster-Infrastructure, Operations, RESTBase, Traffic: Restbase routing down on beta, 2020-02-07 - https://phabricator.wikimedia.org/T244586 (MarcoAurelio) Not sure T243226 might be related.
[23:24:18] Beta-Cluster-Infrastructure, Operations, RESTBase, Traffic: Restbase routing down on beta, 2020-02-07 - https://phabricator.wikimedia.org/T244586 (Jdforrester-WMF) deployment-restbase01.deployment-prep.eqiad.wmflabs reports `The last Puppet run was at Mon Jan 20 10:54:08 UTC 2020 (26669 minutes a...
[23:25:03] Beta-Cluster-Infrastructure, Operations: Upgrade puppet in deployment-prep (Puppet agent broken in Beta Cluster) - https://phabricator.wikimedia.org/T243226 (dpifke) puppetdb on deployment-puppetdb03 was killed by kernel OOM at Feb 7 09:50:29, per syslog. I just now ran `systemctl start puppetdb` on th...
[23:25:14] Beta-Cluster-Infrastructure, Operations, RESTBase, Traffic: Restbase routing down on beta, 2020-02-07 - https://phabricator.wikimedia.org/T244586 (Pchelolo) >>! In T244586#5861273, @Jdforrester-WMF wrote: > deployment-restbase01.deployment-prep.eqiad.wmflabs reports `The last Puppet run was at Mo...
[23:26:00] Beta-Cluster-Infrastructure, Operations, RESTBase, Traffic: Restbase routing down on beta, 2020-02-07 - https://phabricator.wikimedia.org/T244586 (Jdforrester-WMF) Ah, right, that's the please-upgrade-puppet task that @MarcoAurelio linked above.
[23:36:50] Beta-Cluster-Infrastructure, Operations, observability: Beta puppet patch "prometheus: make ferm DNS record type configurable" - https://phabricator.wikimedia.org/T244624 (Krinkle)
[23:37:45] Release-Engineering-Team-TODO (2020-01 to 2020-03 (Q3)), dev-images, Developer Productivity, Patch-For-Review, User-brennen: dev-images: Add XDebug-enabled variant of stretch-php72-fpm-apache2 - https://phabricator.wikimedia.org/T244382 (brennen) Open→Resolved
[23:40:01] Release-Engineering-Team-TODO (2020-01 to 2020-03 (Q3)), User-greg: Developer Satisfaction Survey 2020 - https://phabricator.wikimedia.org/T243439 (greg)