[00:01:27] Beta-Cluster-Infrastructure: deployment-puppetmaster04 disc use exploding - https://phabricator.wikimedia.org/T246937 (Krenair) Oh, yeah, thanks James. This turned out to be an uncommented debug line lurking in /usr/share/puppet/rack/puppet-master/config.ru.
[00:03:16] Beta-Cluster-Infrastructure, Release-Engineering-Team-TODO (2020-01 to 2020-03 (Q3)): Puppet fails on Beta Cluster because "did not find a value for the name 'profile::envoy::ensure'" - https://phabricator.wikimedia.org/T247147 (Jdforrester-WMF) Possibly broken by `b35d6dca8cccfb6220e0682c24a0cec053f30ff8`?
[00:03:29] Beta-Cluster-Infrastructure, Release-Engineering-Team-TODO (2020-01 to 2020-03 (Q3)): Puppet fails on Beta Cluster because "did not find a value for the name 'profile::envoy::ensure'" - https://phabricator.wikimedia.org/T247147 (Krenair) did you try setting `profile::envoy::ensure: absent` in hiera data...
[00:04:07] Krenair: I didn't; when did puppet last successfully run? Maybe we can work backwards to what broke.
[00:05:29] James_F, a "did not find a value for the name" error is not massively interesting
[00:05:42] in most cases that'll just be the host doesn't have all the config we now expect
[00:06:04] Do any of them?
[00:06:14] often this is due to us not using the hiera role backend thing in labs
[00:06:22] Right.
[00:06:48] But doesn't https://gerrit.wikimedia.org/r/c/operations/puppet/+/572831/2/hieradata/cloud/eqiad1/devtools/common.yaml explicitly set it?
[00:07:41] we're not in the `devtools` project
[00:07:47] we're in the deployment-prep project
[00:10:31] Oh, ha.
[00:10:36] That'd not help.
[00:11:30] I dislike those cloud project files living in operations/puppet.git by the way
[00:11:37] makes it unnecessarily hard to get changed
[00:11:56] and is yet another place to look for hieradata
[00:12:13] * James_F nods.
[00:12:47] Should I just try setting it to something in horizon and see if it works?
[00:13:16] might be time to just merge it all into horizon per-instance/project-wide hieradata (as appropriate) and get rid of the puppet.git files for specific cloud projects/hosts
[00:13:29] Yeah.
[00:13:30] yes, in this case you'll want 'absent' or 'present'
[00:13:43] Though that means that SRE can't see and fix the breakage they're making as they go.
[00:13:53] (the two usual values of 'ensure' things in puppet)
[00:14:04] I guess with Horizon over-rides, as well as one-off cherry-picks to BC puppet, that's already the case, but…
[00:14:06] absent will mean you won't have whatever envoy stuff this controls, but that might be ok?
[00:14:16] envoy is TLS stuff.
[00:14:35] Which Beta Cluster doesn't do at all, and is done at the edge, right?
[00:15:27] well.. if you had the caching ATS servers within beta .. and they would talk to other instances within the project
[00:15:28] we do TLS externally but probably not between most internal hosts
[00:15:42] we do have caching ats servers within beta
[00:15:44] then envoy could run and ensure TLS between the instances
[00:15:55] Do we care?
[00:16:04] probably much less so than prod
[00:16:09] we only run in one DC
[00:16:14] Yeah.
[00:16:24] on one hand we don't but on the other hand this is always where the "it's different from prod anyways" starts and then you get all the follow-ups
[00:16:27] If we absent it, will something break?
[00:16:36] mutante: Yeah, true.
[00:17:18] you can try setting it to present and getting everything set up - I don't know how much of a rabbit hole it is
[00:17:31] might Just Work
[00:17:32] Probably lots. :-(
[00:17:32] it is unclear which rabbit hole is deeper
[00:17:52] try both :p
[00:18:29] Eurgh.
[00:19:03] you never know, sometimes it surprises you and just works
[00:19:17] but often enough there is another unexpected issue behind that
[00:19:57] I don't know much about the envoy stuff in prod
[00:19:59] is it using puppet certs?
[00:20:20] Krenair: yes, using this https://wikitech.wikimedia.org/wiki/Cergen#Cheatsheet
[00:20:33] i do add fake certs to labs/private when i do
[00:21:23] this doesn't look like your usual puppet certs
[00:21:31] might be signed by the puppet CA
[00:21:36] no, but using the puppet CA, yes
[00:21:41] here is an example:
[00:21:45] add envoy: https://gerrit.wikimedia.org/r/c/operations/puppet/+/572381
[00:21:52] add cert https://gerrit.wikimedia.org/r/c/operations/puppet/+/576428
[00:22:03] it looks like this is for generating certs matching hosts other than the server's internal names
[00:22:04] switch ATS backend url to https https://gerrit.wikimedia.org/r/c/operations/puppet/+/576434
[00:22:07] e.g. for load balanced stuff?
[00:22:47] yes, the certs have multiple names as SANs
[00:22:58] the actual hostnames of the backends and a discovery.wmnet record
[00:23:02] given the lack of ability to run LVS and our scale we may as well just use ordinary single internal server names?
[00:23:22] can still have envoy and certs signed by the puppet CA etc
[00:23:26] it does not necessarily mean all services behind it also have LVS config
[00:23:42] misc things just have CNAMEs in DNS
[00:23:55] and something.discovery.wmnet points to a host name directly
[00:25:08] 5539 ; misc web services with multiple backends but without geoip
[00:25:14] OK, absent'ing envoy instead gets me "Conftool::Scripts::Safe_service_restart[php7.2-fpm]: has no parameter named 'lvs_services'"
[00:25:23] dns/templates/wmnet line 5539 ff
[00:25:25] This is just going to be a mess all the way down.
[00:26:11] James_F: has_lvs: false normally fixed/skipped the LVS stuff in cloud
[00:26:21] It already had that.
[00:26:40] maybe the bug is that the new envoy stuff does not respect has_lvs
[00:27:06] But the new envoy stuff is absented.
[00:27:21] eh, sorry, the Safe_service_restart stuff
[00:27:31] Oh, yeah.
[00:27:45] I mean, if LVSing Parsoid makes things easier we could just do that.
[00:28:05] I'd rather not burn an instance, but it'd make it more prod-like, certainly.
[00:34:59] lvs does not exist in Cloud VPS projects.
[00:35:20] Right.
[00:36:26] it maybe could someday, but I'm not sure exactly what Neutron changes would be needed to support it
[00:36:29] therefore all the puppet code would need to support the absence of LVS, which has_lvs: false used to do for some things
[00:38:04] mutante: Hmm, the patch you wrote also has `profile::mediawiki::webserver::has_tls: true`; should that be false?
[00:40:06] James_F: yes!
[00:40:12] that looks pretty related
[00:40:30] there is "if $has_lvs" stuff right around there too
[00:41:16] Sadly it still fails at the Safe_service_restart / lvs_services step.
[00:42:39] PROBLEM - Host deployment-dumps-puppetmaster02 is DOWN: CRITICAL - Host Unreachable (172.16.4.101)
[00:44:24] James_F: profile::mediawiki::php::restarts::has_lvs: false
[00:44:47] try that
[00:44:47] bd808, something like https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Neutron_LVS possibly :)
[00:45:20] in profile::mediawiki::php::restarts it does the "conftool::scripts::safe_service_restart"
[00:45:41] and the class also has a "has_lvs" parameter
[00:45:45] There's stuff going on in there that is not exposed to us as mere mortals however, so we wouldn't be able to/want to use this as-is
[00:54:07] That was also done when we were running Mitaka. We are on Pike now, Queens soon, and hopefully Rocky by the end of the month.
[00:54:22] so things may be different already
[00:55:36] possibly yeah
[00:55:54] we may just need some more permissions in horizon to do stuff
[01:06:03] Beta-Cluster-Infrastructure: Main pages of several Beta Cluster wikis redirect to other production wikis - https://phabricator.wikimedia.org/T247078 (mmodell) So is it mediawiki or beta cluster infrastructure? It's apparently cache related?
[01:18:35] mutante: Sadly no dice.
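For reference, the hiera overrides tried in the exchange above can be collected into one fragment. This is only a sketch: the key names and values are the ones quoted in the log, but whether they were set project-wide or per-instance in Horizon is an assumption, and per the log puppet still failed at the Safe_service_restart / lvs_services step with them in place.

```yaml
# Sketch of Horizon hiera overrides for a deployment-prep instance.
# Keys and values are those quoted in the log; the scope (project-wide vs
# per-instance) is an assumption, not something the log states.
profile::envoy::ensure: absent                      # 'absent' or 'present'
profile::mediawiki::webserver::has_tls: false       # the patch had this as true
profile::mediawiki::php::restarts::has_lvs: false   # LVS does not exist in Cloud VPS
```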
[01:18:58] hrmm, ok
[01:20:53] Beta-Cluster-Infrastructure: Main pages of several Beta Cluster wikis redirect to other production wikis - https://phabricator.wikimedia.org/T247078 (DannyS712) >>! In T247078#5948285, @Urbanecm wrote: > I was live-debugging this with enwiki beta. Few notes: > > * As-of `includes/MediaWiki.php:123`, `$ret->...
[01:20:57] we'll have to get back to it or make a ticket or pastebin or something with the current error ..i guess
[01:21:46] i mean.. yea, part of the existing ticket about the "purer" instance. nod
[01:27:33] Yeah.
[01:28:12] Beta-Cluster-Infrastructure, Release-Engineering-Team-TODO (2020-01 to 2020-03 (Q3)): Puppet fails on Beta Cluster because "did not find a value for the name 'profile::envoy::ensure'" - https://phabricator.wikimedia.org/T247147 (Jdforrester-WMF) Open→Resolved a:Jdforrester-WMF >>! In T247147#...
[01:28:16] Beta-Cluster-Infrastructure, Release-Engineering-Team-TODO (2020-01 to 2020-03 (Q3)), Parsoid, Patch-For-Review: Replace deployment-mediawiki-parsoid10 with a "purer" deployment-parsoid11 box - https://phabricator.wikimedia.org/T246854 (Jdforrester-WMF)
[01:28:26] Beta-Cluster-Infrastructure, Release-Engineering-Team-TODO (2020-01 to 2020-03 (Q3)): Puppet fails on Beta Cluster because "did not find a value for the name 'profile::envoy::ensure'" - https://phabricator.wikimedia.org/T247147 (Jdforrester-WMF) Resolved→Open Actually, we'll need to set this on a...
[01:28:30] Beta-Cluster-Infrastructure, Release-Engineering-Team-TODO (2020-01 to 2020-03 (Q3)), Parsoid, Patch-For-Review: Replace deployment-mediawiki-parsoid10 with a "purer" deployment-parsoid11 box - https://phabricator.wikimedia.org/T246854 (Jdforrester-WMF)
[01:29:38] Beta-Cluster-Infrastructure, Release-Engineering-Team-TODO (2020-01 to 2020-03 (Q3)): Puppet fails on Beta Cluster because Safe_service_restart is expecting lvs_services set though `has_lvs: false` - https://phabricator.wikimedia.org/T247151 (Jdforrester-WMF)
[01:29:44] Done.
[01:33:57] Krenair: BTW, did you need any support/cheering on with your work on puppetmaster/cumin?
[01:37:41] phab tokens?
[01:38:39] James_F, for integration? puppetmaster is done, cumin is blocked on part of https://phabricator.wikimedia.org/T245114
[01:39:35] Ah, yeah, saw that and forgot.
[01:39:40] Thanks for working on it!
[01:40:01] until that gets sorted I'm just handling other random things on https://phabricator.wikimedia.org/T241719 - probably gonna do toolsbeta and openstack
[01:40:35] There's a maps project? Eurgh. Of course there is.
[01:43:21] James_F, it runs wma.wmflabs.org and tiles.wmflabs.org and stuff, not sure what relation it has to other cartographic work around wikimedia
[01:44:12] Knowing us, it was an early test system that was discarded when we built the main thing but never torn down, and now it's a vital part of some community's workflow because of special reasons and we can't get rid of it.
[01:44:19] Not that I'm bitter. ;-)
[01:45:50] eh
[01:46:17] it sounds like maybe it could be a vital part of a community's workflow
[01:46:21] https://meta.wikimedia.org/wiki/WikiMiniAtlas
[01:47:03] it doesn't look like a foundation thing, it came from toolserver
[01:53:43] * James_F sighs.
[01:53:50] I hate to be right.
[02:20:08] Release-Engineering-Team (Local Dev), Release-Engineering-Team-TODO (2020-01 to 2020-03 (Q3)), MediaWiki-Docker, dev-images, User-brennen: ffmpeg in MW-Docker lacks -row-mt option for improved multithreaded VP9 encoding - https://phabricator.wikimedia.org/T247153 (brion)
[02:20:47] Release-Engineering-Team, Release-Engineering-Team-TODO (2020-01 to 2020-03 (Q3)), MediaWiki-Docker, dev-images, and 2 others: Increase Docker image's PHP upload limit from 2 MiB to 100 MiB - https://phabricator.wikimedia.org/T246930 (brion) Confirmed working on a fresh install, thanks all!
[02:37:38] Beta-Cluster-Infrastructure, MediaWiki-Cache: Main pages of several Beta Cluster wikis redirect to other production wikis - https://phabricator.wikimedia.org/T247078 (mmodell)
[03:43:24] Release-Engineering-Team (Local Dev), Release-Engineering-Team-TODO (2020-01 to 2020-03 (Q3)), MediaWiki-Docker, dev-images, User-brennen: ffmpeg in MW-Docker lacks -row-mt option for improved multithreaded VP9 encoding - https://phabricator.wikimedia.org/T247153 (brennen) > I believe in prod...
[04:17:12] Beta-Cluster-Infrastructure, Release-Engineering-Team-TODO (2020-01 to 2020-03 (Q3)): Puppet fails on Beta Cluster because Safe_service_restart is expecting lvs_services set though `has_lvs: false` - https://phabricator.wikimedia.org/T247151 (Krenair) reading the error more closely it seems to be that Sa...
[04:41:28] Beta-Cluster-Infrastructure, Release-Engineering-Team-TODO (2020-01 to 2020-03 (Q3)), Patch-For-Review: Puppet fails on Beta Cluster because Safe_service_restart is expecting lvs_services set though `has_lvs: false` - https://phabricator.wikimedia.org/T247151 (Krenair) cherry-picked the above, your i...
[04:45:55] Beta-Cluster-Infrastructure, Release-Engineering-Team-TODO (2020-01 to 2020-03 (Q3)), Patch-For-Review: Puppet fails on Beta Cluster because Safe_service_restart is expecting lvs_services set though `has_lvs: false` - https://phabricator.wikimedia.org/T247151 (Krenair) Remaining error in there is abo...
[08:55:14] Release-Engineering-Team (Code Health), MediaWiki-extensions-Nuke, Code-Stewardship-Reviews: Nuke Extension: Code Stewardship Review - https://phabricator.wikimedia.org/T221155 (DannyS712) I would be interested in helping out with this extension - do maintainers need to be WMF teams? Regardless, woul...
[09:27:29] Continuous-Integration-Infrastructure (phase-out-jessie), Wikidata, Wikidata-Campsite, Wikidata-Query-Service: Migrate wikidata-query-rdf-release-silent release job to Docker - https://phabricator.wikimedia.org/T247123 (Lydia_Pintscher)
[11:24:22] (PS1) Reedy: Display IPUtils on https://doc.wikimedia.org/ [integration/docroot] - https://gerrit.wikimedia.org/r/577747
[11:26:22] (PS2) Reedy: Display IPUtils on https://doc.wikimedia.org/ [integration/docroot] - https://gerrit.wikimedia.org/r/577747
[11:27:44] (CR) Reedy: [C: +2] Display IPUtils on https://doc.wikimedia.org/ [integration/docroot] - https://gerrit.wikimedia.org/r/577747 (owner: Reedy)
[11:28:19] (Merged) jenkins-bot: Display IPUtils on https://doc.wikimedia.org/ [integration/docroot] - https://gerrit.wikimedia.org/r/577747 (owner: Reedy)
[12:13:46] Does check coverage not work pre-submit
[12:48:07] Beta-Cluster-Infrastructure, Wikidata: Special:UserLogin at beta.wmflabs.org redirects to wikidata.org - https://phabricator.wikimedia.org/T247158 (zeljkofilipin)
[12:48:21] Beta-Cluster-Infrastructure, Wikidata: Special:UserLogin at beta.wmflabs.org redirects to wikidata.org - https://phabricator.wikimedia.org/T247158 (zeljkofilipin) p:Triage→Unbreak!
[12:49:10] Beta-Cluster-Infrastructure, Release-Engineering-Team, Wikidata: Special:UserLogin at beta.wmflabs.org redirects to wikidata.org - https://phabricator.wikimedia.org/T247158 (zeljkofilipin)
[12:49:25] Beta-Cluster-Infrastructure, Release-Engineering-Team, Wikidata: Special:UserLogin at beta.wmflabs.org redirects to wikidata.org - https://phabricator.wikimedia.org/T247158 (Reedy)
[12:49:28] Beta-Cluster-Infrastructure, MediaWiki-Cache: Main pages of several Beta Cluster wikis redirect to other production wikis - https://phabricator.wikimedia.org/T247078 (Reedy)
[12:50:08] Beta-Cluster-Infrastructure, Release-Engineering-Team, Wikidata: Special:UserLogin at beta.wmflabs.org redirects to wikidata.org - https://phabricator.wikimedia.org/T247158 (zeljkofilipin) Thanks @Reedy! Didn't know about the other task. :)
[12:50:18] Beta-Cluster-Infrastructure, Release-Engineering-Team, Wikidata: Special:UserLogin at beta.wmflabs.org redirects to wikidata.org - https://phabricator.wikimedia.org/T247158 (Ammarpad) This is duplicate of T247078, I believe
[12:51:42] Beta-Cluster-Infrastructure, MediaWiki-Cache, User-zeljkofilipin: Main pages of several Beta Cluster wikis redirect to other production wikis - https://phabricator.wikimedia.org/T247078 (zeljkofilipin)
[12:52:24] Beta-Cluster-Infrastructure, MediaWiki-Cache, User-zeljkofilipin: Main pages of several Beta Cluster wikis redirect to other production wikis - https://phabricator.wikimedia.org/T247078 (zeljkofilipin) @Cparle this is causing https://integration.wikimedia.org/ci/view/Selenium/job/selenium-daily-betac...
[12:55:27] Beta-Cluster-Infrastructure, MediaWiki-Cache, User-zeljkofilipin: Main pages of several Beta Cluster wikis redirect to other production wikis - https://phabricator.wikimedia.org/T247078 (zeljkofilipin) @dom_walden you've noticed this yesterday, right?
[13:50:57] (PS1) Reedy: Whitelist RhinosF1 [integration/config] - https://gerrit.wikimedia.org/r/577776
[13:51:31] (CR) Reedy: [C: +2] Whitelist RhinosF1 [integration/config] - https://gerrit.wikimedia.org/r/577776 (owner: Reedy)
[13:52:22] (Merged) jenkins-bot: Whitelist RhinosF1 [integration/config] - https://gerrit.wikimedia.org/r/577776 (owner: Reedy)
[13:52:50] !log Reloading Zuul to deploy https://gerrit.wikimedia.org/r/577776
[13:52:52] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[14:41:08] Continuous-Integration-Config, Pywikibot, Pywikibot-tests: Jenkins output for pywikibot job is hard to read - https://phabricator.wikimedia.org/T117570 (Xqt) >>! In T117570#5947525, @Dvorapa wrote: > But it is hard to find in the log. The example in description is also easy to read, but it takes a wh...
[16:38:22] Continuous-Integration-Config, Pywikibot, Pywikibot-tests: Jenkins output for pywikibot job is hard to read - https://phabricator.wikimedia.org/T117570 (Dzahn) Seems to me what is hard to read and what isn't is a subjective opinion. These kinds of tickets are hard to ever resolve, especially without...
[17:24:18] Continuous-Integration-Config, Pywikibot, Pywikibot-tests: Jenkins output for pywikibot job is hard to read - https://phabricator.wikimedia.org/T117570 (Dvorapa) The goal of this task is clear. One does not want to look 5 minutes into the log to find what's wrong. The output should be either at the e...
[18:08:37] Continuous-Integration-Config, Pywikibot, Pywikibot-tests: Jenkins output for pywikibot job is hard to read - https://phabricator.wikimedia.org/T117570 (Xqt) We could implement a maintenance bot which copies the related messages to gerrit. But a lot of work. Seems easier to give a hint to the develop...
[19:07:24] Beta-Cluster-Infrastructure: deployment-mediawiki-07 socket timeout 2020-03-01/02 - https://phabricator.wikimedia.org/T246577 (AlexisJazz) Request from - via deployment-cache-text05.deployment-prep.eqiad.wmflabs, ATS/8.0.5 Error: 502, Next Hop Connection Failed at 2020-03-07 19:05:57 GMT
[21:05:40] Krinkle: can you see if the error mentioned above by AlexisJazz is related to the task as shinken-wm hasn’t shown any error like it normally does? The db is also locking with no explanation according to them
[21:06:36] Right below they mention a replag issue sometimes as well
[23:13:37] PROBLEM - Host integration-puppetmaster01 is DOWN: CRITICAL - Host Unreachable (172.16.3.17)