[00:28:38] PROBLEM - Free space - all mounts on deployment-fluorine02 is CRITICAL: CRITICAL: deployment-prep.deployment-fluorine02.diskspace._srv.byte_percentfree (<33.33%) [07:03:35] RECOVERY - Free space - all mounts on deployment-fluorine02 is OK: OK: All targets OK [08:33:15] PROBLEM - Free space - all mounts on deployment-mwmaint01 is CRITICAL: CRITICAL: deployment-prep.deployment-mwmaint01.diskspace.root.byte_percentfree (<11.11%) [08:38:16] RECOVERY - Free space - all mounts on deployment-mwmaint01 is OK: OK: All targets OK [12:34:33] 10Project-Admins: Please create project 'conftool' - https://phabricator.wikimedia.org/T229658 (10CDanis) [12:34:38] 10Project-Admins: Please create project 'conftool' - https://phabricator.wikimedia.org/T229658 (10CDanis) [12:52:42] 10Project-Admins: Please create project 'conftool' - https://phabricator.wikimedia.org/T229658 (10CDanis) [13:04:14] 10Project-Admins: Please create project 'conftool' - https://phabricator.wikimedia.org/T229658 (10Aklapper) 05Open→03Resolved a:03Aklapper I changed the description a bit so you hopefully don't get mis-filed #mediawiki-configuration issues or MediaWiki support requests. Feel free to edit / adjust. Request... [13:59:25] (03CR) 1020after4: [C: 03+1] php7x: restart php-fpm after all sync operations [tools/scap] - 10https://gerrit.wikimedia.org/r/525119 (https://phabricator.wikimedia.org/T224857) (owner: 10Thcipriani) [14:03:00] (03CR) 1020after4: "> Patch Set 1:" [integration/config] - 10https://gerrit.wikimedia.org/r/526524 (https://phabricator.wikimedia.org/T229370) (owner: 10markahershberger) [14:03:46] (03CR) 1020after4: "Can we just exempt the files in question without disabling the test completely?" [integration/config] - 10https://gerrit.wikimedia.org/r/526524 (https://phabricator.wikimedia.org/T229370) (owner: 10markahershberger) [14:38:05] 10Project-Admins, 10Core Platform Team, 10Performance-Team: Narrow scope of MediaWiki-Database workboard - https://phabricator.wikimedia.org/T228360 (10Krinkle) 05Open→03Resolved [14:40:38] 10Phabricator (Upstream), 10Developer-Wishlist (2017), 10Upstream: Cannot disable "Notify" for token award in phabricator - https://phabricator.wikimedia.org/T91289 (10Aklapper) [14:46:22] looks like CI is very backlogged this morning [14:46:28] again 🙃 [14:47:54] 10Phabricator: Duplicated Homepage and Welcome in default sidebar - https://phabricator.wikimedia.org/T229161 (10Aklapper) 05Open→03Resolved a:03Aklapper Thanks for catching this! Not sure how this happened. I removed the "Welcome" item via https://phabricator.wikimedia.org/home/menu/configure/global/ [14:47:58] 10Phabricator: Improve start page window title - https://phabricator.wikimedia.org/T229225 (10Aklapper) 05Open→03Resolved a:03Aklapper I renamed the default Dashboard title from `Homepage` to `Wikimedia Phabricator`. [14:52:02] cdanis: Not only backlogged, it's also barely doing anything [14:52:13] Looks like something is broken in Zuul and/or Gearman again [14:52:23] It's processing maybe 5 changes and a dozen jobs. [14:52:27] We have far more capacity than that [14:53:53] (03PS1) 10Lucas Werkmeister (WMDE): Remove l10n-update from manpage [tools/scap] - 10https://gerrit.wikimedia.org/r/527573 [14:53:53] OK. on the Jenkins side I do see it doing much more. 9 docker workers processing 4 jobs each. [14:54:15] maybe it's that the jobs got slower and take longer to process, again. [14:54:25] Yeah.. yikes, that's not good. 
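The "is Jenkins actually busy" question above can be checked against the Jenkins JSON API rather than by eyeballing the dashboard. A minimal sketch, assuming the stock computer API on https://integration.wikimedia.org/ci/ is readable anonymously; the field names are standard Jenkins ones and have not been verified against this instance:

```python
# Sketch: summarize executor usage on the integration Jenkins instance.
# Assumes the stock /computer/api/json endpoint is readable anonymously;
# field names are the default Jenkins ones and may need adjusting here.
import json
import urllib.request

JENKINS = "https://integration.wikimedia.org/ci"
TREE = "busyExecutors,totalExecutors,computer[displayName,offline]"

with urllib.request.urlopen(f"{JENKINS}/computer/api/json?tree={TREE}") as resp:
    data = json.load(resp)

print(f"executors busy: {data['busyExecutors']}/{data['totalExecutors']}")
offline = [c["displayName"] for c in data["computer"] if c.get("offline")]
print("agents offline:", ", ".join(offline) or "none")
```

If busyExecutors sits well below totalExecutors while the Zuul queue keeps growing, the bottleneck is upstream of Jenkins (Zuul/Gearman scheduling) rather than raw capacity, which is the split being debated here.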
[14:54:41] It's back from from ~ 9-15 to 20-25 min :( [14:56:47] Project beta-scap-eqiad build #260627: 04FAILURE in 2 min 23 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/260627/ [15:06:28] Yippee, build fixed! [15:06:29] Project beta-scap-eqiad build #260628: 09FIXED in 2 min 5 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/260628/ [15:08:43] 10Phabricator: Some users can't add the fr-tech private task form to their task selection menu. - https://phabricator.wikimedia.org/T229628 (10Aklapper) 05Open→03Stalled When going to https://phabricator.wikimedia.org/favorites/menu/new/custom/editengine/ and opening the "network" tab of your web browser's d... [15:10:55] 10Continuous-Integration-Config, 10MediaWiki-Release-Tools, 10Patch-For-Review: Disable hhvm/php5.x (composer-hhvm-docker) tests for release-tools - https://phabricator.wikimedia.org/T229370 (10CCicalese_WMF) [15:17:36] 10Release-Engineering-Team-TODO, 10MediaWiki-Release-Tools, 10MediaWiki-Releasing (Workflow Improvements): merge branch.py and make-wmf-branch - https://phabricator.wikimedia.org/T222829 (10CCicalese_WMF) a:05MarkAHershberger→03None [15:30:58] 10Gerrit, 10Release-Engineering-Team (Development services), 10Release-Engineering-Team-TODO, 10DBA, and 2 others: Gerrit is failing to connect to db on gerrit2001 thus preventing systemd from working - https://phabricator.wikimedia.org/T176532 (10Dzahn) a:03Dzahn I'll add it. Thanks Manuel! [15:40:45] (03CR) 1020after4: [C: 03+1] "I'm ok with merging this if we can figure out a way around the build failure" [tools/release] - 10https://gerrit.wikimedia.org/r/521559 (https://phabricator.wikimedia.org/T217960) (owner: 10markahershberger) [16:01:21] Krinkle: I had long believed that the issue was a provisioning one, but if we aren't using capacity that we do have... [16:22:35] cdanis: we are, see the docker* slaves (sic): https://integration.wikimedia.org/ci/ [16:23:12] the jessie* hosts are special one-off ones for running special snowflake tests that haven't migrated to the docker system yet [16:28:57] greg-g: i'm probably missing something but i'm not sure how that caused so much queuing this morning? [16:29:19] lemme look at the graphs, my guess would be too many patches. [16:29:32] see also what Krinkle said above [16:29:39] things like this cause a lot of executors to be used, FWIW https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/Wikibase/+/527592/ [16:29:40] (or was that a reply to that?) [16:29:58] 7 patches each using 9 executors [16:30:23] e.g. https://gerrit.wikimedia.org/r/c/operations/software/conftool/+/527564 was uploaded at 14:22, jenkins reported a successful build with duration 2m36s at 14:59 [16:30:36] and that's just a tox-docker flavor of job [16:31:20] was that queued for a long time? [16:31:32] things are starting to take longer for some reason, but that could be due to any number of reasons including new test runners in place (eg: sonarqube, phan, etc): https://grafana.wikimedia.org/d/000000321/zuul?panelId=14&fullscreen&orgId=1 [16:31:34] ah, yes, I see [16:31:41] 34 minutes or so [16:31:48] so there's a few factors [16:32:42] the test queue has a priority of "normal", so it has a lower priority than gate-and-submit jobs for instance [16:33:04] there could also be a lot of test jobs submitted at once [16:33:48] as is the current case with wikibase, 7 patches, with 9 tests each: 63 executors needed where we have 10 boxen with 4 executors on each, so stuff gets queued. 
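A back-of-envelope version of that arithmetic, using only the numbers quoted in the conversation (the 20-minute job time is the rough figure mentioned just above):

```python
# Rough queueing arithmetic for the Wikibase example above:
# 7 patches x 9 jobs each, against 10 Docker agents x 4 executors each.
import math

patches, jobs_per_patch = 7, 9
agents, executors_per_agent = 10, 4
mean_job_minutes = 20                       # "back from ~9-15 to 20-25 min"

demand = patches * jobs_per_patch           # 63 jobs wanted at once
capacity = agents * executors_per_agent     # 40 executor slots
waves = math.ceil(demand / capacity)        # rounds needed to drain the burst

print(f"{demand} jobs vs {capacity} slots -> {demand - capacity} queued")
print(f"rough time to drain the burst: {waves * mean_job_minutes} min")
```

And that is a single repository's test pipeline; gate-and-submit work preempts it because of the pipeline priorities described above.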
[16:35:04] 10Diffusion, 10Phabricator (Upstream), 10Upstream: Investigate or work on how to make Diffusion repositories deletable via web interface - https://phabricator.wikimedia.org/T180666 (10epriestley) The misleading messaging has been fixed upstream by . [16:35:19] also, it seems like we've added some tests recently for php73 -- I think I counted something like 14 jobs per core patchset, which seems excessive when we have 40 workers at any given moment. [16:35:25] greg-g: is 'gate time' basically 'time spent in the queue before execution began'? [16:35:51] (that's my guess based on also seeing 'gate+pipeline time') [16:37:07] "launch wait" is what we refer to that time as, where do you see "gate+pipeline"? [16:37:49] https://grafana.wikimedia.org/d/000000321/zuul?panelId=7&fullscreen&orgId=1 [16:39:21] not sure why it's called that, but it's just measuring the gate+submit (aka merge) pipeline time (aka: when you hit +2) [16:39:23] https://grafana.wikimedia.org/d/000000321/zuul?panelId=7&fullscreen&edit&orgId=1 [16:39:28] ah [16:39:44] vs the test pipeline, which is run on new patchsets [16:39:56] which has a lower prio (just to be explict/repeat what tyler said) [16:40:32] sure, sure [16:40:35] is it wrong to think of increasing launch wait as likely indicating underprovisioning of workers? [16:40:45] no :) [16:41:08] (unless there's some weird reason why, but in the general case that's right) [16:56:21] 10Phabricator (Upstream), 10Upstream: Multiple grep results in one line displayed incorrectly - https://phabricator.wikimedia.org/T197935 (10epriestley) This should be resolved upstream by . [17:18:45] (03PS1) 10Krinkle: zuul: Remove php70 jobs from operations/mediawik-config [integration/config] - 10https://gerrit.wikimedia.org/r/527607 [17:20:07] cdanis: no more php70 in prod relating to MW anywhere, right? [17:20:19] (just hhvm and php72) [17:20:29] I'm not 100% sure, but I'm 99% sure [17:20:42] cdanis: who would know 100%? [17:20:56] and please don't say me :P [17:21:05] someone on serviceops -- _joe_, jijiki, mutante, etc [17:21:24] k, will double check just in case. thanks :) [17:22:47] <_joe_> yes, we don't use php7.0 for mediawiki in production [17:23:09] <_joe_> unless [17:23:11] <_joe_> dumps [17:23:52] <_joe_> yeah, dumps might still be on php7.0, ping apergos :) [17:28:14] (03CR) 10Jforrester: [C: 03+2] zuul: Remove php70 jobs from operations/mediawik-config [integration/config] - 10https://gerrit.wikimedia.org/r/527607 (owner: 10Krinkle) [17:29:23] WikibaseView appears in ExtensionMessages-1.34.0-wmf.16.php now, so far so good… [17:29:24] Krinkle: We're running php70 jobs for master on prod code (because TechCom haven't let us proceed) but shouldn't be running it anywhere for branches purely for prod at this point. [17:29:49] (03Merged) 10jenkins-bot: zuul: Remove php70 jobs from operations/mediawik-config [integration/config] - 10https://gerrit.wikimedia.org/r/527607 (owner: 10Krinkle) [17:30:19] !log Zuul: [operations/mediawiki-config] Stop running php70 jobs" [17:30:21] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [17:30:29] _joe_: that's worrying. Good to know. Will follow up with Ariel. [17:30:42] dumps where the first to go on PHP 7, before we finished up php72 packaging. [17:30:51] might still be on that indeed. [17:31:03] Clearly we should move dumps to php73 to continue the complexity. [17:31:13] James_F: Has the RFC to require PHP 7.2 started yet? I missed last TC meeting, but haven't seen it yet. 
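Returning to the earlier question of whether a growing launch wait implies under-provisioning: Little's law gives a quick way to reason about it. A sketch with illustrative inputs; the arrival rate is an assumption rather than a Zuul metric, while the job duration and executor count are the figures from this conversation:

```python
# Little's law sanity check for the "launch wait keeps growing" symptom.
# Executors needed ~= job arrival rate x mean job duration; if that exceeds
# the pool, the launch-wait queue grows without bound.
arrival_rate_per_min = 2.5   # assumed, not measured from Zuul
mean_job_minutes = 20        # jobs now take ~20-25 min (see above)
pool = 40                    # 10 Docker agents x 4 executors

needed = arrival_rate_per_min * mean_job_minutes
print(f"executors needed to keep up: {needed:.0f} (pool of {pool})")
if needed > pool:
    print("under-provisioned: launch wait will keep growing")
else:
    print("keeping up: launch wait should stay bounded")
```

Either side of the balance can move: fewer jobs per patch set (such as dropping the php70 variants discussed here) reduces demand, while more executors raise capacity.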
[17:31:30] I know of the CI-specific task, and of the general task about PHP policy. [17:31:30] Krinkle: I don't know. Radio silence from TechCom. [17:31:45] I'm asking whether an RFC task exists that has been submitted to TechCom-RFC. [17:31:59] The general task is the task. Everything is waiting on TechCom moving that task forward. [17:32:25] https://phabricator.wikimedia.org/T228342 is listed as "Under discussion". [17:33:09] It doesn't have a shepherd assigned (is that still a thing?). [17:33:24] Right, that task was filed 2 weeks ago. We've had 1 meeting since, in which we did the fastest thing we can, which is to not require more information and go straight to under discussion. [17:33:40] We've not had a meeting since, but next meeting it could go on last call. That would be an exceptional and fastest RFC ever. And might happen actually. [17:34:00] it woudl then be approved when the last call finishes 2 weeks after that [17:34:03] It was transitioned after three weeks of inaction from a previous task. [17:34:29] .. over which TechCom unfortunately has little authority (CI resourcing and MW product test coverage) [17:34:42] but 2 members did comment positively iirc. [17:35:10] Sorry, no, previous task was the RfC from 10 July to 17 July, it just felt like three weeks. :-) [17:35:39] Anyway. Would love to be able to speed up CI. [17:35:59] https://phabricator.wikimedia.org/T225628 is not an RFC btw. [17:36:19] I should fix that title to make clear what Antoine is proposing. [17:36:23] if it was, we'd have to decline it. Anyway, policy details. [17:36:25] Because people are mis-understsanding it. [17:36:46] 10Continuous-Integration-Config, 10TechCom, 10Patch-For-Review: On CI, stop testing MediaWiki with php7.0 ahead of dropping support - https://phabricator.wikimedia.org/T225628 (10Jdforrester-WMF) [17:37:27] Current trajectory if it works out in my favour, is to require PHP 7.2 for MW 1.34+, and for Lego to show that keeping php70 packages for older branches for a little while longer won't take much effort, is within Debian LTS, and only affects rare commits (so no load on CI overall). [17:37:47] * James_F nods. [17:38:32] PHP 7.2 goes EOL at the end of 2020; are we expecting to migrate to 7.3, to 7.4, or straight to 8? [17:39:25] If that all happens, I'd be pretty satisfied. I"m neutral on whether we test older REL branches on PHP 7.0 or not (until their MW-EOL). But if CPT likes it to be tested and Lego says it can be done with little effort and securely (Debian LTS), that'd be cool too. [17:39:38] James_F: not sure yet. Probably not 7.3, but maybe 7.4 (for WMF). [17:39:43] Right. [17:40:00] I don't have the time to think about 8 yet. [17:40:01] Keeping on running PHP 7.0 anywhere in CI is about as much effort for RelEng as running it on every patch. [17:40:24] And as the person doing the effort, I think I should get a vote. ;-) [17:40:24] I agree. [17:40:57] But I also agree that stopping testing in CI is very likely effective dropping of support, in practice. [17:41:28] But I also think it should be allowed for you to choose not to upgrade those images at the same time and let someone else do it if they wish. There's also non-zero room for improvement with regards to coupling of images and packages so that we can e.g. have the php70 images just be frozen as-is (not getting quibble updates). [17:41:35] Yet another reason I think the people who think that LTS releases are a good idea should be the ones to do all the work, rather than dumping it implicitly on the rest of us. 
[17:41:59] with the exception of the occasional ci-stretch base image update. [17:42:12] Agreed :) [17:42:13] Can we? Won't they break when e.g. npm changes their certificates and we can't back-port because there's too much change? [17:42:38] (Or whatever ugly hack we'll need to fix next.) [17:43:10] Hm.. I think that changes for npm in 3.x or 4.x. They have a full chain now (like composer) which means unless the OS root certs expire, we should be fine. [17:43:13] changed* [17:44:24] In other news though, my mom couldn't use the government website anymore in either Chrome or Safari because the macOS certs where to out of date. Had to install Firefox to get around the issue (which bundles its own certs yay) [17:45:08] (also government was using terrible certs, but that's pretty much implicit, except for UK gov) [17:46:31] _joe_: snapshot hosts are php7.2 and php=hhvm 3.18 [17:47:05] (checked two random hosts) [17:47:43] Cool. [18:15:06] Krinkle: James_F: And in general, I would appreciate is we stop just assuming that Release Engineering can continue to support all changes people-who-can-make-them make without prior agreement. For example, adding new jobs to the mix. How does that effect our merge times? Who's responsible and gets yelled at (RelEng)? We need to be MUCH more deliberate and, dare I say organized with these changes in [18:15:08] CI. [18:15:47] greg-g: +∞ [18:16:48] Dropping PHP 7.0 and 7.1 will reduce demand substantially (roughly 40% drop in jobs), but that's addressing a symptom, not the general problem. [18:25:31] greg-g: Yep! Same feeling here. I would extend it beyond jobs and images, and also to the content of (integration) tests. [18:25:38] In context of https://phabricator.wikimedia.org/T225730 btw, seems that we've regressed recently. [18:25:51] exactly. [18:25:52] looks like the new way of executing wdio tests may've added 5 min to the run time [18:26:04] not entirely sure yet, may've also been a specific test doing it. [18:27:40] there's still a very big mystery waiting to be uncovered about why wdio is so amazingly slow and inefficient. Hopefully just some config setting we haven't found yet, but right now for every step of every test we're re-doing 3 api requests whcih in turn are both taking multiple seconds to return (e.g. the login as admin request). That makes 0 sense to be because at the same time we have Fresnel jobs doing dozens and dozens of index.php and [18:27:40] api.php responses in < 100 ms [18:28:05] so the CI infra itself is definitely not slow (in fact, its an order of magnitude faster than my docker-dev locally) [18:44:57] hey [18:45:44] hi [18:45:45] thcipriani: we can finally activate gerrit on 2001 [18:45:57] dba made the codfw dbproxy for misc [18:46:22] and we just opened the firewall hole on them for gerrit2001 [18:46:57] yay! [18:47:05] double checking replication.config and on cobalt it says to replicate to 2001 and on gerrit2001 it does not exist [18:47:21] though that should be separate anyways [18:47:35] because it doesn't require the gerrit service to run on the target afaict [18:48:34] so what we get is a running slave service one can clone from a restore a master from, hopefully [18:49:28] paladox: wait.. so to replicate _to_ there needs to be no gerrit service running.. but to clone from it it does have to run? is it really like that [18:49:37] It uses ssh [18:49:40] to replicate [18:49:54] if you cat for gerrit2001 in cobalt replication file, it should have ssh:// [18:50:11] *grep [18:50:32] well. 
does not really have protocol [18:50:35] url = gerrit2@gerrit2001.wikimedia.org:/srv/gerrit/git/${name}.git [18:50:39] but looks like it, ack [18:50:52] yeh [18:57:46] nice re:starting :) [18:58:04] I see gerrit push to 2001 all the time... [18:58:13] mutante ^^ [18:59:28] scap looks up-to-date there afaict via spot-check [18:59:39] git -C /srv/gerrit/git/mediawiki/tools/scap.git/ log -p [18:59:54] ok, cool [19:00:08] so what do we get out of being able to start the service [19:00:32] if a master goes down we can now promote the slave to a new master ? [19:00:39] that's what we want at least [19:00:42] yeh [19:01:03] ack, just confirming all this :) [19:01:04] also it'll allow us to offload traffic to the slaves (e.g phabricator) [19:01:07] also you could point folks to it for cloning if they don't require the very latest [19:01:14] like phabricator [19:01:43] yea:) [19:02:00] FWIW, it looks like the replication setup uses ssh git-receive-pack to upload to this machine [19:04:26] ok.. so this is done https://gerrit.wikimedia.org/r/c/operations/puppet/+/527595 [19:04:41] now we could [19:05:05] this is also done https://gerrit.wikimedia.org/r/c/operations/dns/+/527114 [19:05:22] thcipriani: any concerns i just try to activate the service now [19:06:28] * thcipriani double-checks some things [19:06:36] oh.. i also need to check this https://gerrit.wikimedia.org/r/c/operations/dns/+/527462 [19:06:42] another one from Manuel [19:06:49] thcipriani: thanks :) [19:08:25] merging the DNS change.. that's not touching anything existing [19:08:33] just adds the new name for m1 [19:10:35] container.slave = true is set in gerrit.config and --slave is added in the systemctl unit [19:10:42] I think that means it should be good. [19:11:06] yup [19:12:16] yea, that sounds right [19:13:24] "Periodic indexing is intended to run only on slaves" [19:13:36] slaves need an updated group index to resolve memberships of users for ACL validation [19:13:52] "Gerrit slave periodically scans the group refs in the All-Users repository to reindex groups if they are stale." [19:14:56] that's a 2.16 feature [19:15:05] you can ignore that for 2.15 (https://gerrit.wikimedia.org/r/Documentation/config-gerrit.html#index.scheduledIndexer dosen't exist) [19:15:09] there's a section on https://gerrit-review.googlesource.com/Documentation/config-gerrit.html starting at "index.scheduledIndexer" [19:15:30] paladox: ok :) [19:17:40] that would make sense since in 2.15 groups are still in reviewdb [19:17:53] so scanning all-users won't tell you that info yet [19:17:54] ^ [19:18:03] ack, right [19:19:06] yea, so the whole replication part is already running and we dont change it with this [19:19:12] i also see git-receive-pack running, yep [19:19:49] as long as we run with --slave i guess that's it [19:19:54] yup [19:22:22] seems as though [19:22:34] <%- if @slave -%> --slave<% end %> [19:22:47] let's go? [19:23:22] +1 [19:23:23] +1 [19:23:44] ok, re-enabling puppet [19:25:04] Notice: /Stage[main]/Gerrit::Jetty/Systemd::Service[gerrit]/Service[gerrit]/ensure: ensure changed 'stopped' to 'running' [19:25:18] \o/ [19:25:19] :) that's been a while [19:25:30] * paladox waits to see if https://gerrit-slave.wikimedia.org/r/ turns white [19:25:32] happy to remove the exceptions for icinga monitoring [19:26:06] hrm lots of cannot load plugin [19:26:23] I guess that's expected? don't want most plugins on the slave [19:26:34] /var/lib/gerrit2/review_site --slave [19:26:40] just to confirm :p [19:26:55] does gerrit show as started in the logs? 
[19:26:59] (gerrit.log) [19:26:59] well the review_site is an argument to -d [19:27:15] paladox: yeah, also shows that the project cache is loaded [19:27:16] i think it should load the plugins still in slave mode [19:27:18] yes, i only pasted a part of it [19:27:20] \o/ [19:27:30] hmm https://gerrit-slave.wikimedia.org/r/ hasen't turned white [19:27:32] paladox: Active: active (running) since Fri 2019-08-02 19:24:21 UTC; 3min 5s ago [19:27:39] although I don't see port 8080 [19:27:41] yeh, that can fool you apparently [19:28:06] is slave mode headless by default? [19:28:56] hrm, no [19:28:58] Well, it starts jetty (but won't host content over it, only git repos) [19:29:13] --enable-httpd [19:29:13] --disable-httpd [19:29:13] Enable (or disable) the internal HTTP daemon, answering web requests. Enabled by default when --slave is not used. [19:29:16] ^ [19:29:21] ohh [19:29:29] 10Phabricator (Upstream), 10Upstream: Footer is not visible in workboards - https://phabricator.wikimedia.org/T85440 (10epriestley) This is moderately technically complicated, the use case isn't clear to me (why is it important to access the footer on these particular pages?), and making this change on full-sc... [19:29:41] that's not very good phrasing [19:29:46] yeah [19:29:50] but it sounds to me like IF slave is used then it's off [19:29:54] unless you enable it [19:29:55] I said "no" because I also don't see the sshd port open [19:29:55] * paladox adds that [19:30:37] disable is enabled if slave is NOT used :p [19:31:01] oh boy [19:31:10] thcipriani mutante https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/527621/ [19:31:11] so that explains why no http [19:31:14] but why no sshd? [19:31:22] --enable-sshd [19:31:23] --disable-sshd [19:31:29] " Enabled by default." [19:31:47] that also leaves room for speculation that "--slave" means not default [19:31:51] also says --slave implies --enable-sshd [19:32:05] > This option automatically implies '--enable-sshd'. [19:32:53] is there some possibility it's still initialising...? [19:33:25] Nope, it would have initialised before gerrit wrote something like "gerrit started" [19:33:41] oh oh [19:33:45] there is an excepton [19:33:50] in sshd_log [19:34:02] oh? [19:34:23] sshd_log looks empty? [19:34:26] no, i'm sorry, that's plugin_log [19:34:33] i did tail -f * [19:35:39] mutante: /var/lib/gerrit2/tmp is owned by root [19:35:48] and the error_log about mysql is NOT new..July 3rd [19:36:01] er /var/lib/gerrit2/review_site/tmp [19:36:07] seems to be what the plugin log is saying [19:36:24] oh.. so is cache, data, index [19:36:25] > java.nio.file.AccessDeniedException: /var/lib/gerrit2/review_site/tmp/plugin_zuul_190802_1924_156970585319091907.jar [19:37:09] chown gerrit2:gerrit2 tmp [19:37:53] chown gerrit2:gerrit2 cache [19:37:59] (both empty) [19:38:09] probably also index [19:38:14] chown gerrit2:gerrit2 data [19:38:27] index actually has a file [19:38:51] chown -R gerrit2:gerrit2 index [19:39:04] from 2017-09-26 :) [19:39:15] thcipriani aha! [19:39:32] chown gerrit2:gerrit2 plugins (that's a symlink into /srv/deployment !) [19:39:52] mutante it's now gerrit.log, no longer error_log :) [19:40:15] plugin files in /srv/deployment/gerrit/gerrit/plugins/ are gerrit2:gerrit2 [19:40:44] your change to --enable-httpd also looks good. only on slave [19:41:56] yup [19:42:21] merged.. running puppet.. 
then restarting gerrit [19:42:23] [2019-08-02 19:36:23,944] [main] INFO com.google.gerrit.sshd.SshDaemon : Started Gerrit SSHD-CORE-2.0.0 on gerrit.git.wmflabs.org:29418 [19:42:30] ssh works for me in slave mode (2.16) [19:43:00] noop on cobalt. applied on gerrit2001 [19:43:08] :) [19:43:25] paladox: haha, yes, you are right about gerrit.log, i merged it and still look there :P [19:43:32] :D [19:44:15] restarting gerrit on 2001 [19:44:16] ..review_site --slave --enable-httpd [19:44:23] https://gerrit-slave.wikimedia.org/r/ [19:44:28] 404 but that means it works [19:44:28] lol [19:44:47] yay! [19:44:52] now the test is: [19:44:58] hrm, still no ssh [19:45:00] git clone https://gerrit-slave.wikimedia.org/r/operations/puppet.git [19:45:16] why does ssh work for paladox? [19:45:18] works \o/ [19:45:27] anything in the logs? [19:45:28] well that's http [19:45:29] grep ssh [19:45:36] > ss -tlnp | grep -c 29418 => 0 [19:45:45] on gerrit 2001 [19:45:49] git clone https://gerrit-slave.wikimedia.org/r/operations/puppet.git [19:45:50] Cloning into 'puppet'... [19:46:12] thcipriani gerrit would have logged to gerrit.log (if ssh is not working) [19:47:47] nothing about ssh in there [19:47:49] i see it loading plugins including javamelody, fwiw [19:47:59] there is: Unable to determine any canonical git URL from gerrit.config [19:48:01] just fails to provision gitiles [19:48:09] https://gerrit-slave.wikimedia.org/r/monitoring [19:48:22] Forbidden access [19:48:47] i went to /login [19:49:00] then it redirected me to gerrit.w.org. Then i could view https://gerrit-slave.wikimedia.org/r/monitoring [19:49:13] anyways "Unable to determine any canonical git URL from gerrit.config" i wonder why that's throwing. [19:49:24] paladox: the config you have on gerrit.git.wmflabs.org is that a slave? [19:49:25] something in gitiles it seems [19:49:29] is throwing that [19:49:36] yup (i just set it as a slave to test) [19:49:50] paladox: and after doing that and restart.. sshd still starts? [19:49:53] yup [19:50:13] hmm.. let's do a diff of the config files or something [19:51:01] ohhhhhh [19:51:04] thcipriani mutante found it [19:51:08] sshd is switched off [19:51:10] the part that sshd_log is empty but gerrit.log also does not have anything.. [19:51:10] for slave [19:51:16] see https://github.com/wikimedia/puppet/blob/production/modules/gerrit/templates/gerrit.config.erb#L221 [19:51:17] is that logging at all? [19:51:19] that should be removed [19:51:39] heh :) [19:51:40] paladox: nice find :) [19:51:50] * paladox enables it [19:54:00] thcipriani mutante https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/527631/ [19:54:37] my guess is that also fixes our gitiles problem: https://github.com/GerritCodeReview/plugins_gitiles/blob/master/src/main/java/com/googlesource/gerrit/plugins/gitiles/Module.java#L94 [19:55:01] 10Continuous-Integration-Config, 10MediaWiki-extensions-CentralAuth, 10ci-test-error: CentralAuth tests are broken - https://phabricator.wikimedia.org/T229613 (10Jdforrester-WMF) [19:55:20] thcipriani yeh [19:55:57] $host and $ipv6 are hopefully correct.. let's check [19:56:14] but i kind of remember adding tit :p [19:56:34] :) [19:56:42] yeh, you added the ipv6 config [19:57:11] role/codfw/gerrit.yaml [19:57:24] that part looks good [19:57:48] :) [19:58:40] https://gerrit-slave.wikimedia.org/r/monitoring?part=graph&graph=httpMeanTimes i wonder who is trying to hit gerrit-slave already? [19:59:24] well we did to clone some stuff :) [20:00:07] heh [20:00:13] paladox: nope,.. 
fail :p [20:00:19] but over 68k [20:00:24] had a gut feeling about $host [20:00:30] oh? [20:00:43] the IPv6 part is ok [20:00:46] the IPv4 is not [20:00:51] + listenAddress = gerrit.wikimedia.org:29418 [20:00:52] ah, wrong ip? [20:00:55] ohhh [20:01:34] paladox: googlebot? [20:01:53] im not sure [20:01:55] i can do this fix [20:02:25] well.. actually.. [20:03:45] it needs to use $ipv4 not $host [20:04:10] wouldn't that break cloning over the host name? [20:04:19] since it'll be listening for an ip not the hostname [20:05:13] oh [20:05:23] damn i misread the wrong column on javamelody [20:05:29] so that 68k figure was for mean time [20:05:37] yea, so we need "$host-slave". looking [20:05:39] https://gerrit-slave.wikimedia.org/r/monitoring?part=graph&graph=httpHitsRate [20:05:43] shows over 5k hits [20:06:33] gerrit::server::host: [20:06:36] mutante ^ [20:07:02] yea, i see that. and right below we have a list of slave names [20:07:12] so we can either use the first one of that [20:07:29] or add a separate "host-slave" to be used for sshd [20:07:40] what if you have a second slave [20:09:23] it would be per host? (so you would set the hiera key per host, rather then per role/location) [20:09:27] hrm, that httpHits thing seems weird. I see 391 hits in the apache logs... [20:10:30] or we need to split the domain from it, so we can set the host in hiera to "gerrit" and in puppet we do "if $slave then $realhost = ${host}-slave [20:10:57] paladox: needs more changes since now it's an array of slaves [20:11:09] oh, ok [20:12:05] oh, look [20:12:05] $tls_host = $slave_hosts[0] [20:12:12] that's how we do it for httpd [20:12:19] yeh [20:12:22] first of the slaves.. so yea.. use $tls_host [20:12:33] so that'll need an eventual upgrade to support > 1 slaves [20:12:38] $tls_host is also our $sshd_host [20:12:39] but for now yeh [20:12:43] just $host is bad [20:12:47] yeh [20:13:07] (just realized the httpHitsRate graph shows over 5 hits/minute :P) [20:13:17] oh [20:13:35] * paladox facepalms [20:14:21] heh, I did the same doubletake [20:14:47] used to a different order of magnitude [20:15:39] what was the motd plugin doing again [20:15:46] where does it display its message [20:16:05] there was something in the logs about not having one [20:17:01] mutante that displays when cloning [20:17:16] aha [20:17:19] i want to update it to use the new polygerrit sitenotice integration at some point [20:17:30] also.. why would we make sshd listen on the IPv6 IP [20:17:46] additionally [20:17:55] if we already use the host name and it has AAAA records [20:18:11] need to read the "listenAddress" docs again [20:18:13] hmm [20:19:02] also there is "sshd.advertisedAddress" [20:19:23] Specifies the addresses clients should be told to connect to. [20:19:48] well..that leads to that existing ticket about making it listen on 22 [20:24:40] ok.. so i'll use $sshd_host for this and not re-euse $tls_host..it's already confusing enough when looking at the code [20:46:43] now I'm digging through the conduit api in the hopes I don't have to manually edit a bunch of phab uris to point to the read-only gerrit uris :\ [20:47:01] oh god [20:48:41] i am asking #gerrit if listening on the host name should be enough for both IPv4 and IPv6. also need these https://gerrit.wikimedia.org/r/c/operations/puppet/+/527638 [20:49:25] wrong place [20:49:32] no longer a recommended channel [20:49:46] you'll want to ask in the slack channel for gerrit [20:51:07] oh no .. 
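A quick way to see the listenAddress problem being worked through here is to resolve both names and compare what they point at. A minimal standard-library sketch; the port is only passed for getaddrinfo bookkeeping:

```python
# Sketch: what a hostname-based sshd.listenAddress would try to bind.
# Resolving the names shows why "gerrit.wikimedia.org:29418" is wrong on the
# replica: it yields the primary's A/AAAA addresses, which are not local there.
import socket

def addrs(host):
    infos = socket.getaddrinfo(host, 29418, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})

for name in ("gerrit.wikimedia.org", "gerrit-slave.wikimedia.org"):
    print(f"{name:30} -> {', '.join(addrs(name))}")

# A listenAddress only binds if one of these addresses is configured on the
# host doing the listening; a non-local address fails, and sshd never comes up.
```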
[20:51:16] i don't think i'll install that [20:51:23] i've relayed your comment [20:51:40] well.. just check out the change [20:51:54] we can ignore the other part about IPv6 [20:52:02] worst case it's not needed but does nothing bad [20:56:13] ok [20:56:20] + listenAddress = gerrit-slave.wikimedia.org:29418 [20:56:32] listenAddress = [2620:0:860:4:208:80:153:107]:29418 [20:57:37] restarting gerrit on 2001 again [20:58:24] still not really there [20:58:29] i dont see it listening [20:59:17] com.google.gerrit.sshd.SshDaemon : Started Gerrit SSHD-CORE-1.6.0 on gerrit-slave.wikimedia.org:29418, gerrit-slave.wikimedia.org:29418 [20:59:33] \p/ [20:59:34] thcipriani: works now [20:59:38] 29418 that is [20:59:55] Fyi legoktm ^^ (if you want to use the slaves) [20:59:58] ah nice [21:00:03] paladox: look at the line above how it says "gerrit-slave" twice [21:00:10] telnet gerrit-slave.wikimedia.org 29418 works [21:00:19] that second one is the IPv6 IP and gets resolved [21:00:19] heh [21:00:27] ah [21:00:28] so yea, should be duplicate [21:01:00] alright, since that is up now then i will go for lunch :) [21:01:13] :) [21:01:26] https://gerrit-slave.wikimedia.org/g/ [21:01:29] that works heh [21:02:00] https://gerrit-slave.wikimedia.org/r/plugins/gitiles/operations/puppet/ [21:02:33] 10Gerrit, 10Release-Engineering-Team (Development services), 10Release-Engineering-Team-TODO, 10DBA, and 2 others: Gerrit is failing to connect to db on gerrit2001 thus preventing systemd from working - https://phabricator.wikimedia.org/T176532 (10Dzahn) 05Open→03Resolved gerrit, gerrit's httpd and ger... [21:02:39] 10Gerrit, 10Release-Engineering-Team (Backlog), 10Operations, 10Patch-For-Review: Reimage gerrit2001 as stretch - https://phabricator.wikimedia.org/T168562 (10Dzahn) [21:02:59] 10Gerrit, 10Release-Engineering-Team (Development services), 10Release-Engineering-Team-TODO, 10DBA, 10Operations: Gerrit is failing to connect to db on gerrit2001 thus preventing systemd from working - https://phabricator.wikimedia.org/T176532 (10Dzahn) [21:04:24] Couldn't we get jenkins to use the slave (cc thcipriani )? [21:06:11] paladox: would need some experimentation. The slave is going to be slightly behind in some instances, which would be bad for CI; i.e., running a test against a version of master that is missing the most recent commit. [21:06:25] ah, i see [21:07:12] i've at least confirmed the ACL works [21:07:18] nice :) [21:07:21] users carn't clone projects they carn't see [21:07:32] * paladox tried that (using an anon url) [21:07:32] I'm experimenting with https://phabricator.wikimedia.org/conduit/method/diffusion.uri.edit/ [21:07:41] ah, great! [21:33:22] paladox: I just clone gerrit-slave.wikimedia.org? works over HTTPS? [21:33:27] yup [21:33:27] clone from* [21:33:40] like gerrit.wikimedia.org (you just carn't push though) [21:33:57] sweet [21:34:42] how much is it expected to lag behind? [21:34:49] not too much [21:35:02] replication should be pushing regularly [21:35:24] also, before this service is used to much, is it possible to rename it to gerrit-replica or something? since we're trying to avoid the terminology "slave" in other places [21:35:53] oh, right. 
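The "how much does it lag" question lends itself to a spot check that compares branch tips on the primary and the replica over the anonymous HTTPS URLs used above; a sketch, with the repository chosen purely as an example, generalizing the manual git log spot-check done earlier:

```python
# Sketch: spot-check replication lag by diffing branch tips between the
# primary and the replica, generalizing the earlier manual `git log` check.
import subprocess

def branch_tips(url):
    out = subprocess.run(["git", "ls-remote", "--heads", url],
                         capture_output=True, text=True, check=True).stdout
    return {ref: sha for sha, ref in
            (line.split("\t") for line in out.splitlines())}

primary = branch_tips("https://gerrit.wikimedia.org/r/mediawiki/tools/scap")
replica = branch_tips("https://gerrit-slave.wikimedia.org/r/mediawiki/tools/scap")

stale = [ref for ref, sha in primary.items() if replica.get(ref) != sha]
print("replica behind on:", stale or "nothing - branch tips match")
```

For CI that still isn't good enough (a job can race a commit that replicated a moment too late), which is the concern raised above; for read-mostly consumers like Phabricator observation it is fine.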
[21:35:57] thcipriani mutante ^^ [21:37:31] yes please :) [21:43:32] done: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/527656/ and https://gerrit.wikimedia.org/r/#/c/operations/dns/+/527657/ [21:44:58] ty :) [21:47:28] I will merge/deploy https://gerrit.wikimedia.org/r/#/c/labs/codesearch/+/527659 whenever the new name is in place :) [21:48:24] paladox: is gerrit.git.wmflabs.org still a valid test instance? [21:48:34] i'm testing webhook related things there [21:48:40] yup [21:48:41] ohh [21:48:42] i forgot to switch slave mode off [21:48:54] ah, k [21:49:11] should work now marxarelli! [21:49:41] hmm, getting 404s for some reason [21:50:11] the 404 was due to it being a slave, should no longer be a slave [21:50:16] (i've restarted gerrit) [21:50:43] thcipriani i wonder do we push refs/* to gerrit2001? [21:50:47] or just refs/heads? [21:51:01] looks to be working again. paladox: thanks! [21:51:11] your welcome :) [21:55:19] i think it only pushes for refs/heads [21:55:31] we should change that to refs/ [21:55:35] paladox: looks like we don't specify a push for that remote, only for github [21:55:41] ah [21:55:45] 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO: Ensure submodule updates (for security patches) are committed in the MW directory under /srv/mediawiki-staging - https://phabricator.wikimedia.org/T229285 (10sbassett) Pardon the confusion, but are we now wanting to do submodule... [21:56:48] hmm https://github.com/GerritCodeReview/plugins_replication/blob/stable-2.15/src/main/resources/Documentation/config.md [21:56:54] says refs/* is the default [21:56:58] which a spot check on gerrit2001 reveals to be everything [21:57:00] yep [21:57:03] remote.NAME.push [21:57:18] aha [21:57:22] i see what i did wrong [21:57:27] wrong commit for the wrong repo :P [21:57:36] https://gerrit-slave.wikimedia.org/r/plugins/gitiles/operations/puppet/+/403fb1177ff054b22226c92bd89b692fb1b5f700 [22:10:35] (03PS2) 10Paladox: Merge branch 'stable-2.15' into wmf/stable-2.15 [software/gerrit] (wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/525865 [22:11:03] (03PS4) 10Paladox: Testing: Do not merge [software/gerrit] (wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/525867 [22:11:25] (03CR) 10jerkins-bot: [V: 04-1] Merge branch 'stable-2.15' into wmf/stable-2.15 [software/gerrit] (wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/525865 (owner: 10Paladox) [22:11:33] (03CR) 10jerkins-bot: [V: 04-1] Testing: Do not merge [software/gerrit] (wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/525867 (owner: 10Paladox) [22:15:48] there are 1953 phab uris that are observing gerrit uris. 
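Repointing those observed URIs at the replica is exactly what diffusion.uri.edit (linked above) is for. A hedged sketch for a single URI: the transaction type name and the form-parameter style follow what the Conduit console shows for *.edit methods, but verify them there first; the token and PHID below are placeholders.

```python
# Sketch: repoint one observed repository URI at the replica via Conduit's
# diffusion.uri.edit. Transaction type "uri" and the PHP-style form fields
# are as shown by the Conduit console -- verify before running. Token and
# PHID are placeholders; the target would become gerrit-replica once the
# rename discussed above lands.
import urllib.parse
import urllib.request

ENDPOINT = "https://phabricator.wikimedia.org/api/diffusion.uri.edit"
API_TOKEN = "api-xxxxxxxxxxxxxxxxxxxxxxxxxxxx"   # placeholder
URI_PHID = "PHID-RURI-examplexxxxxxxxxxxxx"      # URI object to edit

form = urllib.parse.urlencode({
    "api.token": API_TOKEN,
    "objectIdentifier": URI_PHID,
    "transactions[0][type]": "uri",
    "transactions[0][value]":
        "https://gerrit-slave.wikimedia.org/r/operations/puppet.git",
}).encode()

with urllib.request.urlopen(urllib.request.Request(ENDPOINT, data=form)) as resp:
    print(resp.read().decode())
```

Looping this over the 1953 URIs just means feeding it the PHIDs found by the paste linked below.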
[22:15:59] wow [22:19:05] code used to find uris in-case it's ever useful again: https://phabricator.wikimedia.org/P8857 [22:19:25] forever unfindable in a phab paste :) [22:20:13] i sometimes feel like i should follow thcipriani around the internet curating his pastebins into some sort of edited volume [22:20:14] :) [22:20:30] actually, i feel like this for a host of acquaintances [22:20:53] I had a oneoffs repo for such (personal) things [22:20:53] that'd make sort of a good email newsletter [22:20:56] pastes of the weeek [22:21:02] -e [22:21:04] lol [22:21:33] you could pun off of https://en.wikipedia.org/wiki/Pasty [22:21:41] i started https://code.p1k3.com/gitea/brennen/bpb-kit/src/branch/master/home/fragments at one point for things where someone asked me how to solve a problem or whatever [22:21:54] but i haven't been as good at adding stuff to it as i'd like [22:22:29] https://code.p1k3.com/gitea/brennen/bpb-kit/src/branch/master/home/fragments/mogwai.sh [22:22:48] gremlins kids could have avoided disaster [22:22:53] :) [22:23:03] but then no movie :( [22:23:15] i have spent more time in my life thinking about the definition of "after midnight" than is probably healthy. [22:23:40] but that begs the question...should this be date +%H --utc ? [22:23:50] now you've done it, thcipriani [22:25:14] i think i concluded at one point that it's probably really between midnight and the moment you can tell a black thread from a white one, per the koran [22:25:19] (local midnight) [22:25:36] as that seems like a pretty good folkloric definition of no-longer-night-time [22:25:52] i'm not sure what happens if you're in the land of the midnight sun, though. [22:26:04] is "pre-dawn" simply "night time" and "after midnight"? [22:28:08] i guess i always thought pre-dawn involved some visible light. [22:28:53] isn't that dawn? annnnywho [22:29:59] * greg-g closes these WP tabs [22:30:23] happy friday. :) [22:41:19] greg-g: I like how Aaron's announcement from back then references the Wikipedia article, which now references that same announcement. [22:41:31] hah! [22:47:40] oh [22:47:41] hah [22:47:42] https://gerrit-slave.wikimedia.org/r/q/status:open [22:47:47] polygerrit works on the slave [22:47:51] bet they didn't realise that [22:48:08] rest api dosen't work [22:49:24] renaming that will require more work, Apache vhost or LE won't work and cert error, how to flip DNS [22:49:32] not something to push quickly through [22:51:00] yeh, it should be add domain to dns, add domain to acme. Change the apache vhost then remove gerrit-slave once all that works. [22:52:48] mutante: thanks, yeah, not needed today :) doing it right and safely is most important. [22:54:18] greg-g: ok! it's not like we invented that name today [22:54:54] right, just using the moment in time that we have when we're looking at this stuff (the gerrit replica) to fix it [22:55:48] ok, not going to send an announcement about just yet [22:59:56] one gerrit maintainer thinks the slave feature should be removed all together heh rather it should all be master on master (using the multi master setup) [23:07:22] paladox: i think the first step needed is the Apache vhost [23:07:40] like.. 
if you want to keep both names in acme at the same time [23:07:46] you also need a vhost for each [23:07:56] or cert fetching / renewal probably fails [23:08:21] but traffic knows best how, given that we use acmechief [23:08:22] yeh cert fetching will fail [23:08:25] yup [23:20:09] 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team, 10Zuul: Jenkins 'test' pipeline should cancel old jobs after new patch sets - https://phabricator.wikimedia.org/T229708 (10Krinkle) [23:20:34] greg-g: ^ might explain some recent-ish load increase / slowdown. don't know when it regressed though [23:21:52] 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team, 10Zuul: Jenkins 'test' pipeline should cancel old jobs after new patch sets - https://phabricator.wikimedia.org/T229708 (10Krinkle)
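To put a rough number on why T229708 matters, an illustrative sketch of the capacity wasted when test jobs for superseded patch sets keep running; every input is an assumption rather than a Zuul statistic, and cancelled jobs still burn some partial runtime, so this is an upper bound:

```python
# Illustrative cost of not cancelling test jobs for superseded patch sets
# (the behaviour T229708 asks for). All numbers are assumptions; ignoring
# the partial runtime of cancelled jobs makes this an upper bound.
patchsets_per_change = 4     # pushes in quick succession on one change
jobs_per_patchset = 9
mean_job_minutes = 20

with_cancel = jobs_per_patchset * mean_job_minutes    # only the newest runs
without_cancel = patchsets_per_change * with_cancel   # every push runs fully

print(f"executor-minutes per change: {without_cancel} vs {with_cancel}")
print(f"wasted: {without_cancel - with_cancel} minutes "
      f"({100 * (1 - with_cancel / without_cancel):.0f}%)")
```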