[00:58:33] (PS1) DannyS712: Start branching GlobalWatchlist extension [tools/release] - https://gerrit.wikimedia.org/r/621242 (https://phabricator.wikimedia.org/T260862)
[00:58:57] (PS2) DannyS712: Start branching GlobalWatchlist extension [tools/release] - https://gerrit.wikimedia.org/r/621242 (https://phabricator.wikimedia.org/T260862)
[00:59:02] (CR) DannyS712: "This change is ready for review." [tools/release] - https://gerrit.wikimedia.org/r/621242 (https://phabricator.wikimedia.org/T260862) (owner: DannyS712)
[06:40:31] maintenance-disconnect-full-disks build 204679 integration-agent-docker-1003 (/: 18%, /srv: 100%, /var/lib/docker: 49%): OFFLINE due to disk space
[06:45:28] maintenance-disconnect-full-disks build 204680 integration-agent-docker-1003 (/: 18%, /srv: 88%, /var/lib/docker: 45%): RECOVERY disk space OK
[06:50:28] maintenance-disconnect-full-disks build 204681 integration-agent-docker-1003 (/: 18%, /srv: 98%, /var/lib/docker: 50%): OFFLINE due to disk space
[06:55:31] maintenance-disconnect-full-disks build 204682 integration-agent-docker-1003 (/: 18%, /srv: 83%, /var/lib/docker: 45%): RECOVERY disk space OK
[07:23:50] (PS2) Hashar: Upgrade gear from 0.7.0 to 1.15.1+wmf1 [integration/zuul/deploy] - https://gerrit.wikimedia.org/r/617404 (https://phabricator.wikimedia.org/T258630)
[07:26:27] (CR) Hashar: [V: +2 C: +2] "Rolling rolling!" [integration/zuul/deploy] - https://gerrit.wikimedia.org/r/617404 (https://phabricator.wikimedia.org/T258630) (owner: Hashar)
[08:00:01] Continuous-Integration-Infrastructure, Release-Engineering-Team (CI & Testing services), Release-Engineering-Team-TODO (2020-07-01 to 2020-09-30 (Q1)), Zuul, and 3 others: Improve scheduling of CI jobs invoked by zuul - https://phabricator.wikimedia.org/T258630 (hashar) That also affect the load...
[08:07:14] Continuous-Integration-Infrastructure, Release-Engineering-Team (CI & Testing services), Release-Engineering-Team-TODO (2020-07-01 to 2020-09-30 (Q1)), Zuul, and 4 others: Improve scheduling of CI jobs invoked by zuul - https://phabricator.wikimedia.org/T258630 (hashar) The last action is to upst...
[08:40:42] Continuous-Integration-Config, Release-Engineering-Team (CI & Testing services), Release-Engineering-Team-TODO (2020-07-01 to 2020-09-30 (Q1)), Wikispeech-Text-to-Speech, and 3 others: Tell Jenkins and Zuul about the Speechoid pipielines - https://phabricator.wikimedia.org/T259911 (Lokal_Profil)
[08:48:54] Continuous-Integration-Config, Release-Engineering-Team (CI & Testing services), Release-Engineering-Team-TODO (2020-07-01 to 2020-09-30 (Q1)), Wikispeech-Text-to-Speech, and 3 others: Tell Jenkins and Zuul about the Speechoid pipielines - https://phabricator.wikimedia.org/T259911 (Lokal_Profil)...
[09:21:11] (PS3) Hashar: Squelch ref-replication gerrit warnings [integration/zuul] (patch-queue/debian/jessie-wikimedia) - https://gerrit.wikimedia.org/r/594526 (owner: Jforrester)
[09:21:24] (PS3) Hashar: Add entry for ref-replication-scheduled event [integration/zuul] (patch-queue/debian/jessie-wikimedia) - https://gerrit.wikimedia.org/r/594527 (owner: Jforrester)
[09:22:43] (CR) Hashar: [V: +2 C: +2] "Sorry for the long delay in handling that backporting." [integration/zuul] (patch-queue/debian/jessie-wikimedia) - https://gerrit.wikimedia.org/r/594526 (owner: Jforrester)
[09:22:55] (CR) Hashar: [V: +2 C: +2] "Sorry for the long delay in handling that backporting." [integration/zuul] (patch-queue/debian/jessie-wikimedia) - https://gerrit.wikimedia.org/r/594527 (owner: Jforrester)
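The maintenance-disconnect-full-disks entries above (06:40-06:55) show an agent being marked OFFLINE when /srv fills up and RECOVERY once usage drops again. A minimal sketch of that kind of two-threshold check in Python; the 95%/90% limits and the function names here are illustrative assumptions, not the actual job's configuration:

```python
import shutil

# Illustrative thresholds only; the real maintenance-disconnect-full-disks
# job has its own configuration in Jenkins.
OFFLINE_PCT = 95   # mark the agent OFFLINE at or above this usage
RECOVERY_PCT = 90  # mark it RECOVERY once usage falls back below this


def percent_used(path: str) -> float:
    """Used space of the filesystem holding `path`, as a percentage."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def next_state(currently_offline: bool, used_pct: float) -> bool:
    """Return True if the agent should be offline after this check."""
    if used_pct >= OFFLINE_PCT:
        return True                  # OFFLINE due to disk space
    if currently_offline and used_pct < RECOVERY_PCT:
        return False                 # RECOVERY disk space OK
    return currently_offline         # otherwise keep the current state


if __name__ == "__main__":
    for mount in ("/", "/srv", "/var/lib/docker"):
        print(f"{mount}: {percent_used(mount):.0f}%")
```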
[09:29:05] (PS1) Hashar: Support Gerrit replication events [integration/zuul/deploy] - https://gerrit.wikimedia.org/r/621478
[09:29:54] James_F: hiiii so I finally managed to review the zuul backport patches for ref-replication Gerrit events
[09:30:06] both backport patches are now merged in our integration/zuul fork
[09:30:14] and I have bumped the source module in the deploy repository: https://gerrit.wikimedia.org/r/c/integration/zuul/deploy/+/621478
[09:30:28] so if that looks right, I guess I can then scap deploy the fixup
[09:31:46] (CR) Hashar: "I think I might have to regenerate the zuul wheel first though :-\" [integration/zuul/deploy] - https://gerrit.wikimedia.org/r/621478 (owner: Hashar)
[09:42:20] (PS2) Hashar: Support Gerrit replication events [integration/zuul/deploy] - https://gerrit.wikimedia.org/r/621478
[09:42:26] nicer
[09:46:53] hashar: Nice.
[10:07:30] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[10:22:06] Project-Admins, Developer-Advocacy: Create #Wikimedia-Developer-Portal tag for Developer Advocacy's project - https://phabricator.wikimedia.org/T260892 (Aklapper) p: Triage→Medium
[10:23:04] Project-Admins, Developer-Advocacy: Create #Wikimedia-Developer-Portal tag for Developer Advocacy's project - https://phabricator.wikimedia.org/T260892 (Aklapper) Open→Resolved Done in https://phabricator.wikimedia.org/project/view/4941/
[10:23:28] (CR) Hashar: [V: +2 C: +2] Support Gerrit replication events [integration/zuul/deploy] - https://gerrit.wikimedia.org/r/621478 (owner: Hashar)
[10:26:33] Project-Admins, Developer-Advocacy, Wikimedia-Developer-Portal: Create #Wikimedia-Developer-Portal tag for Developer Advocacy's project - https://phabricator.wikimedia.org/T260892 (Aklapper)
[10:31:54] hashar: Did you deploy? I'm still seeing `2020-08-20 10:31:30,639 WARNING zuul.GerritEventConnector: Received unrecognized event type 'ref-replication-scheduled' from Gerrit.`
[10:32:06] James_F: yeah but only restarted the zuul merger
[10:32:16] cause there are a lot of changes in the queues :/
[10:32:20] Oh, right. :-(
[10:32:46] cause someone sent a fairly long series of patches targeting mediawiki/core
[10:32:53] Mostly epic puppeteer patches.
[10:32:58] Just drop the lot of them.
[10:33:02] They're all WIPs.
[10:33:23] Well, maybe wait for https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/621037 to clear G&S?
[10:33:56] also the wmf-quibble-selenium-php72-docker ends up being super long :/
[10:34:09] Yeah. :-(
[10:34:13] I was looking at it last night and it seems a good chunk of it is simply due to a lot of new selenium tests being added
[10:34:22] such as in MinervaNeue
[10:34:32] which is a good thing, but then we should probably not run it for every change
[10:34:37] Yeah.
[10:34:40] Erk. No no no.
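The Zuul Gearman alerts interleaved through this log (the PROBLEM above at 10:07 and the RECOVERY that follows) flip to CRITICAL when the number of waiting work requests climbs past 150 and only recover once it drops below 90; the gap between the two thresholds keeps the alert from flapping. A simplified sketch of that hysteresis; the real Icinga check evaluates a Graphite series over a time window rather than single samples:

```python
CRITICAL_THRESHOLD = 150  # waiting work requests, from the alert text above
OK_THRESHOLD = 90         # recovery threshold; the gap avoids flapping


def evaluate(previous_state: str, waiting_requests: int) -> str:
    """Return 'CRITICAL' or 'OK' for the Gearman queue length."""
    if waiting_requests > CRITICAL_THRESHOLD:
        return "CRITICAL"
    if waiting_requests < OK_THRESHOLD:
        return "OK"
    return previous_state  # in between: keep whatever was reported last


# Example: the queue spikes, stays busy for a while, then drains.
state = "OK"
for sample in (40, 180, 120, 95, 60):
    state = evaluate(state, sample)
    print(sample, state)
```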
[10:34:46] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[10:34:51] so we could use something like you did for phpunit @group Standalone
[10:34:54] The world breaking three hours after a patch has merged is terrible CI.
[10:35:05] We already do split out the selenium tests into their own job.
[10:35:33] We could split the skin ones into their own special one because there are so many of them, but that'll just be an endless fight of ever-yet-more jobs.
[10:38:09] yeah :-\
[10:38:37] Maybe we need to give each skin and extension a time budget for selenium tests.
[10:38:51] They can have 10s of tests. Pick wisely.
[10:39:29] But K.rinkle keeps pointing out that our selenium tests run disastrously slowly compared to local runs.
[10:40:02] So maybe there's something we could fix in quibble to make them much faster, if only we could find which one.
[10:40:18] Running the different extensions' selenium runs in parallel might help?
[10:41:53] yeah pretty sure there are some optimizations that are required
[10:42:06] one that was identified a few months ago is that the php built-in webserver is single-threaded
[10:42:08] Maybe make it QTE's problem?
[10:42:12] so all requests are serialized
[10:42:19] Yeah, how is the move to Apache going?
[10:42:31] The main work last I looked was with Adam, and stalled.
[10:42:58] Also migrating selenium runs of quibble to buster (and PHP 7.3 and Chrome 83) would make things much faster, IIRC?
[10:43:09] But more complex. :-(
[10:43:17] might be
[10:43:54] running with the old chromium 73 might at least catch issues with that old browser
[10:44:00] but maybe that is not much of an argument
[10:44:15] php7.3, potentially that might not catch a php7.2 issue
[10:44:27] We'd run the phpunit tests on php7.2
[10:44:30] Just the selenium ones.
[10:44:37] Hence the complexity.
[10:44:52] so well, we want to add support for Apache in Quibble
[10:44:55] instead of using php -S
[10:45:06] PHP 7.3's php -S is faster, though?
[10:45:14] So that might be a quicker win before moving to Apache?
[10:45:23] and I guess we need a solution to have some selenium tests filtered out (like @group standalone)
[10:45:31] Eh. :-(
[10:45:44] and I don't think 7.3 is considerably faster than 7.2
[10:45:52] Selenium tests are exactly the things that are hard to isolate and often broken by other extensions' changes.
[10:46:04] yeah
[10:46:20] hashar: OK, the world is quiet. Restart zuul?
[10:46:26] then from a quick look at MinervaNeue tests, it seems a lot of scenarios are just testing that skin and would barely have interaction with other repos
[10:46:28] but might be wrong
[10:50:18] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[10:56:10] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[11:00:12] James_F: will restart zuul when it is quieter
[11:01:09] hashar: I can just start aborting jobs in Jenkins to make it quieter. ;-)
[11:02:49] i need to revisit graceful shutdown
[11:02:57] Yeah. :-(
[11:03:00] sending SIGUSR1 to zuul causes it to stop triggering jobs
[11:03:12] (PS1) Gergő Tisza: Add @see to UnusedUseStatementSniff [tools/codesniffer] - https://gerrit.wikimedia.org/r/621508
[11:03:27] But the queues don't empty?
[11:28:50] James_F: yeah
[11:28:59] SIGUSR1 causes Zuul to stop adding changes in the queues
[11:29:14] it just keeps all newly received events on hold
[11:29:26] once all jobs have been completed, it would then restart
[11:29:34] load the saved events and resume processing them
[11:46:36] (PS3) Esanders: Add PSR2.ControlStructures.SwitchDeclaration [tools/codesniffer] - https://gerrit.wikimedia.org/r/613734 (https://phabricator.wikimedia.org/T182546)
[12:11:03] !log Gracefully stopping Zuul
[12:11:05] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[12:17:01] James_F: I have restarted zuul
[13:00:39] (CR) Lars Wirzenius: [C: +1] "still LGTM (which is short for Lars Giggles To Moomins)" [blubber] - https://gerrit.wikimedia.org/r/619072 (https://phabricator.wikimedia.org/T260830) (owner: Dduvall)
[13:48:58] phabricator seems to be having some DB connection trouble
[13:49:24] "Unable to establish a connection to any database host (while trying "phabricator_token"). All masters and replicas are completely unreachable. AphrontConnectionQueryException: Attempt to connect to phuser@m3-master.eqiad.wmnet failed with error #2002: Cannot assign requested address."
[13:51:38] I saw errors as well, but now it seems to be back for me
[13:53:42] working at the moment here as well
[13:57:50] never mind, still hitting errors
[13:58:50] this one also: "upstream connect error or disconnect/reset before headers. reset reason: connection failure"
[14:05:06] ^ twentyafterfour I'm seeing a few things, too: http://tyler.zone/phab-trouble-2020-08-20.png
[14:05:51] hmm
[14:08:54] so there was a huuuge spike in rows read per second a few minutes ago
[14:09:51] thcipriani: was that just transient? I also had a couple of connections reset a few minutes ago but everything appears to be back to normal now?
[14:10:21] https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&var-site=eqiad&var-group=misc&var-shard=All&var-role=All&from=now-3h&to=now
[14:10:35] a couple of really big spikes but then everything normal
[14:16:28] hrm, lots of messages in the error log, but with the whole stack trace without new lines it's hard to parse :(
[14:40:47] thcipriani: yeah I noticed that as well, kinda strange, I'm not sure when that started happening
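The graceful-stop behaviour hashar describes above (SIGUSR1 makes the Zuul scheduler stop adding changes to its queues, hold newly received Gerrit events, finish the running jobs, then reload the saved events on restart) amounts to sending a signal to the scheduler process, which is what the 12:11 "Gracefully stopping Zuul" entry did. A minimal sketch; the pidfile path is an assumption for illustration, not the actual deployment's location:

```python
import os
import signal
from pathlib import Path

# Hypothetical pidfile location; check the real zuul-server service
# configuration for the actual path.
PIDFILE = Path("/var/run/zuul/zuul.pid")


def graceful_stop() -> None:
    """Ask Zuul to wind down: stop queueing new changes, hold incoming
    events, and let the jobs that are already running finish."""
    pid = int(PIDFILE.read_text().strip())
    os.kill(pid, signal.SIGUSR1)
    print(f"Sent SIGUSR1 to the Zuul scheduler (pid {pid}); "
          "held events are replayed once it comes back up.")


if __name__ == "__main__":
    graceful_stop()
```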
[14:43:16] hi all! If I have questions about worker VMs in the integration project, should I start an async conversation with hashar on phab or are there other people in my TZ involved with that?
[14:43:45] andrewbogott: shoot? ;)
[14:43:54] I mean, please shoot your question!
[14:44:12] oh! You're here :) You're not in the alphabetical user list but that's 'cause you're at the top
[14:45:11] So, background: we're very tentatively moving some VMs to ceph-backed storage. In the interest of caution, we're trying to do this with mostly cattle-type instances.
[14:45:25] I'm wondering if integration workers are good candidates for that
[14:46:12] ceph? isn't openstack coming with Swift and we have in-house knowledge of swift already
[14:46:13] ?
[14:46:22] (by 'cattle' I mean: can definitely be rebuilt without much trouble if they are suddenly deleted)
[14:46:26] anyway, we don't rely on any external storage
[14:46:46] the instances nowadays are just the Stretch image with Docker installed, an extended disk allocated for storing the images, and that is about it
[14:46:55] ah, ok — swift is an object store. Ceph can also be used as an object store, but that's not what we're using it for currently.
[14:47:10] the VMs themselves are moving from local hypervisor storage to distributed storage.
[14:48:37] i have no idea how storage is managed right now though
[14:48:55] but the CI instances do a bunch of disk i/o. I imagine that is written to local storage right now
[14:49:05] is that a qcow image maybe?
[14:49:47] if that moves to a distributed storage maybe the io would end up being slower
[14:50:00] Right now all the 'local' storage on a VM (aka '/') is stored in a local file on the hypervisor.
[14:50:42] We don't have very specific performance stats currently, part of why I'm hoping to use your nodes is that you're good at noticing these changes :)
[14:51:06] But also, everything is moving to ceph eventually, so this isn't optional in the long run, I'm just looking for early adopters.
[14:51:43] yeah got it
[14:51:53] well maybe it is better to pick some instances from tools labs
[14:52:30] I would rather not risk possible issues that would be encountered by being the first to adopt that new system
[14:52:55] most important would be integrity, imho
[14:52:56] guess my main concern is how it might slow down disk i/o (but that may be a red herring)
[14:53:10] I remember there were some issues years ago with... glusterfs?
[14:53:30] ok, next question (tangential): I'm adding some modest short-term backups for VM backing files. We don't have storage for absolutely everything so I'm trying to (again) identify cattle vs. pets so I can exclude cattle from the backup
[14:53:40] I suppose disk i/o will have to be slower
[14:53:51] (since in some cases like e.g. k8s workers it's easier to rebuild from scratch than restore from backup)
[14:53:52] question is whether it is so much slower that it matters
[14:54:30] disk performance with ceph is actually quite fast because of elaborate distributed striping algos
[14:54:38] but I don't have clear side-by-side numbers yet
[14:55:20] so, regarding backups… I have a regexp system for excluding things that don't need backing up: https://gerrit.wikimedia.org/r/c/operations/puppet/+/621281/1/modules/profile/templates/wmcs/backy2/wmcs_backup_instances.yaml.erb
[14:55:25] perhaps, if a "disk" is used by one node exclusively, it is pinned there and you don't notice an io difference
[14:55:46] although being a distributed system, I guess that needs to be replicated somewhere
[14:55:47] hashar: would you be interested in writing a patch to that file to exclude those integration nodes that don't need backups? Or should I just exclude the whole project?
[14:56:16] Platonides: the glusterfs thing was 8 years ago; pretty much everyone in the OpenStack world uses ceph for instance storage these days.
[14:56:33] It's not nearly as simple as one file on one ceph server, it's scattered about block by block.
[14:56:49] so imagine RAID striping except that instead of an array of lots of disks it's an array of lots of servers
[14:57:01] andrewbogott: for backup, the puppetmaster: integration-puppetmaster02, and I guess integration-cumin would be good candidates. The rest we can afford to just lose entirely
[14:57:09] Release-Engineering-Team (Onboarding), Release-Engineering-Team-TODO, Scap, User-dancy: Better error message than "Scap failed!: Call to mwscript eval.php returned: None" - https://phabricator.wikimedia.org/T225109 (dancy) a: dancy
[14:57:33] hashar: ok, I'll look at what you've got and will run the patch by you before it's in
[14:57:44] andrewbogott: I don't mean it is wrong to move
[14:57:50] or that glusterfs was bad, either
[14:57:51] (for what it's worth, the backups only apply to ceph-hosted things, so it'll be moot in the near term)
[14:58:03] glusterfs was pretty bad in 2011 :)
[14:58:13] Folks seem to be happy with it these days for the most part
[14:58:13] I was just remembering that "bad experience", yes, I sound like an old and grumpy grandfather ;)
[14:58:25] andrewbogott: though we might just rebuild those instances. So ideally the decision to back it up or not would be based on the Puppet role maybe
[14:58:48] believe me, glusterfs-related trauma is a big part of why it's taking us so long to adopt distributed storage of any kind
[15:00:02] hm… role-based exclusion is a good idea, although role assignment to VMs isn't nearly as straightforward as it is in prod
[15:00:08] andrewbogott: but yeah essentially I guess it would be nice to back up everything by default then add some regex as to which hosts not to back up ( /^integration-agent-.*/ can be excluded ).
[15:00:43] (PS3) Ahmon Dancy: Improve mwscript error handling [tools/scap] - https://gerrit.wikimedia.org/r/620791 (https://phabricator.wikimedia.org/T225109)
[15:01:58] cool, I will implement that shortly
[15:03:52] Release-Engineering-Team-TODO (2020-07-01 to 2020-09-30 (Q1)), Scap, User-dancy: Scap fails when checking train version number - https://phabricator.wikimedia.org/T259706 (dancy) Open→Resolved
[15:04:30] hashar: simple as this, yes? https://gerrit.wikimedia.org/r/c/operations/puppet/+/621538
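A minimal sketch of the name-based exclusion being discussed: back up everything by default and skip any instance whose name matches an exclusion pattern such as the /^integration-agent-.*/ hashar suggests above. The pattern list and helper below are illustrative only, not the contents of the actual wmcs_backup_instances.yaml.erb template or the linked patch:

```python
import re

# Illustrative exclusion list; the real one is maintained in puppet
# (wmcs_backup_instances.yaml.erb) per project.
EXCLUDE_PATTERNS = [
    r"^integration-agent-.*",  # disposable Jenkins agents, easy to rebuild
]


def should_backup(instance_name: str) -> bool:
    """Back up by default; skip anything matching an exclusion pattern."""
    return not any(re.match(p, instance_name) for p in EXCLUDE_PATTERNS)


for name in ("integration-agent-docker-1003",
             "integration-puppetmaster02",
             "integration-cumin"):
    print(name, "->", "backup" if should_backup(name) else "skip")
```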
[15:13:29] andrewbogott: yup +1ed ;]
[15:13:53] thx
[15:14:45] andrewbogott: as for Ceph, surely I don't mind moving integration to it, but I would rather not be the first adopter :]]]
[15:15:04] that's fair!
[15:15:09] then if the migration is doable on a per-instance basis, surely we could start moving some of the Jenkins agents to ceph and compare
[15:15:17] there's a fair bit of toolforge on it now, we can let it sit for a few weeks before moving your things
[15:15:21] so that out of the X hosts we have, we could have 3 or 4 moved to Ceph
[15:15:35] yeah — moving it per instance would just consist of you rebuilding nodes with a ceph-enabled flavor
[15:15:41] then
[15:15:52] I imagine disk I/O is done on the localhost and is then replicated to the distributed storage
[15:16:13] and maybe we can even tweak it so that the distribution is not a blocker (eg make it so that we can lose data eventually)
[15:18:36] there actually isn't any local disk i/o — our new hypervisors barely have disks at all, just enough for the OS
[15:18:47] kvm mounts the ceph volumes directly
[15:18:47] ahhh
[15:18:50] it's pretty weird :)
[15:19:51] part of the move to ceph will involve moving off of old hypervisors to newer ones, so for things on the low-number hypervisors there will be big CPU improvements. Multiple variables changing at once :/
[15:20:08] big CPU!
[15:20:19] now there is an incentive to migrate! :]]]
[15:20:36] cause the old cloudvirts were a pain (apparently due to the bios CPU scaling governor)
[15:20:56] so anytime we create an instance, we check whether it happened to be scheduled on those faulty cloudvirts
[15:21:11] but yeah again
[15:21:19] if we can move just a few instances that would be nice
[15:22:01] :)
[15:22:16] I'll enable the flavor for your project and then you can build things there at your convenience.
[15:25:45] sounds good
[15:25:57] andrewbogott: so we can just pick the backend when creating a new instance?
[15:27:29] hashar: you should see a new flavor in horizon, 'mediumram-ceph'
[15:27:37] AHH
[15:27:40] VMs with that flavor will be scheduled on ceph-backed hypervisors
[15:27:55] (it looks like 'mediumram' is what you're using for most of your workers?)
[15:28:33] apparently yes
[15:28:59] do you have a meta task for migrating stuff to Ceph?
[15:29:16] or maybe a phabricator project
[15:29:44] only very high-level at this point
[15:29:45] https://phabricator.wikimedia.org/T253365
[15:29:59] I have a fair bit of communication to do with this before we start moving 'normal' VMs
[15:30:05] I imagine
[15:30:15] maybe a Phabricator subproject would work
[15:30:27] that is what I usually do for big migrations, this way I end up with a dedicated workboard
[15:31:11] (CR) Jayprakash12345: [C: +1] Start branching GlobalWatchlist extension [tools/release] - https://gerrit.wikimedia.org/r/621242 (https://phabricator.wikimedia.org/T260862) (owner: DannyS712)
[15:31:12] when the main migration happens it will be one hypervisor at a time rather than one project at a time, so won't be so much 'coordinated' with users as just announced :)
[15:31:31] although I'm still thinking through whether we can do it one VM at a time instead
[15:32:06] damn
[15:32:12] we are still on Stretch instances grblblb
[15:33:57] andrewbogott: is there a way to move an instance manually to ceph?
[15:34:33] Continuous-Integration-Infrastructure, Release-Engineering-Team (CI & Testing services), Release-Engineering-Team-TODO: Move all Wikimedia CI (WMCS integration project) instances from stretch to buster - https://phabricator.wikimedia.org/T252071 (hashar)
[15:34:33] it's not straightforward because a given hypervisor is one or the other
[15:35:06] so the process as it stands now is basically stop everything on a hypervisor, switch the backend to ceph, import everything into ceph, start VMs back up
[15:36:22] but maybe you could stop an instance, import it into ceph, change its flavor and have it scheduled on a ceph-based hypervisor when it has to be started?
[15:36:28] or maybe I am making no sense hehe
[15:36:55] no, that makes sense, it's just a fair bit of coding
[15:37:15] and I'm not sure yet if it's necessary or not.
[15:37:32] Continuous-Integration-Infrastructure, Release-Engineering-Team (CI & Testing services), Release-Engineering-Team-TODO: Move CI instances to use ceph in WMCS - https://phabricator.wikimedia.org/T260916 (hashar)
[15:37:38] ok ok
[15:37:49] guess we can just create fresh new ones since they are easy to provision
[15:37:51] I'm not totally sure how to do the 'import it into ceph' stage from a hv that doesn't have ceph enabled. So it might be 'stop an instance, copy to a ceph-enabled hypervisor, import into ceph, etc. etc.'
[15:37:54] which is 2x copies
[15:38:04] yeah
[15:38:20] but that might be faster than having to provision a fresh new instance ;]
[15:38:22] anyway
[15:38:23] https://phabricator.wikimedia.org/T260916
[15:38:45] filed for it, I am unlikely to think about it anytime soon (if ever :D )
[15:38:58] but I guess it is probably all about creating a new instance with that new flavor
[15:39:53] yep, if you're building new instances anyway you can just use that flavor, the difference should be pretty much invisible
[15:40:57] !log Created dummy instance integration-agent-docker-1020 using a Ceph backed hypervisor # T260916
[15:40:59] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[15:40:59] T260916: Move CI instances to use ceph in WMCS - https://phabricator.wikimedia.org/T260916
[15:41:08] will see how it goes, and I guess I can pool it in the fleet of jenkins agents
[15:41:09] ;D
[15:41:51] great!
[15:44:04] 90 upgraded, 19 newly installed, 0 to remove and 0 not upgraded.
[15:44:10] Need to get 105MB
[15:44:11] ;]
[15:52:55] btw hashar, creating new ceph VMs is currently very slow because the base image has to be converted during startup. Once we've moved to an all-ceph setup we'll rebuild base images in the format that ceph likes and things will be fast again.
[15:54:44] (CR) 20after4: [C: +2] Improve mwscript error handling [tools/scap] - https://gerrit.wikimedia.org/r/620791 (https://phabricator.wikimedia.org/T225109) (owner: Ahmon Dancy)
[15:56:35] andrewbogott: I was merely ranting about how the Stretch base image might deserve a rebuild :]]]
[15:56:55] (Merged) jenkins-bot: Improve mwscript error handling [tools/scap] - https://gerrit.wikimedia.org/r/620791 (https://phabricator.wikimedia.org/T225109) (owner: Ahmon Dancy)
[15:57:39] anyway it is built. Will pool it in later on
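The per-instance idea floated above (stop the VM, copy its backing file to a ceph-enabled hypervisor, import it into ceph, switch it to a ceph flavor, start it again) can be summarised as the outline below. Every helper here is a hypothetical stub that only narrates the step; none of this is an existing OpenStack or WMCS API:

```python
# Outline only: hypothetical stubs standing in for the real tooling.

def stop_instance(name: str) -> None:
    print(f"stop {name}")

def copy_backing_file_to_ceph_host(name: str) -> str:
    print(f"copy the backing file of {name} to a ceph-enabled hypervisor")
    return f"/tmp/{name}.qcow2"          # hypothetical staging path

def import_backing_file_into_ceph(path: str) -> None:
    print(f"import {path} into ceph")    # the second of the '2x copies'

def set_flavor(name: str, flavor: str) -> None:
    print(f"switch {name} to flavor {flavor}")

def start_instance(name: str) -> None:
    print(f"start {name}; the scheduler places it on a ceph-backed hypervisor")


def migrate_instance_to_ceph(name: str) -> None:
    stop_instance(name)
    staged = copy_backing_file_to_ceph_host(name)
    import_backing_file_into_ceph(staged)
    set_flavor(name, "mediumram-ceph")   # the ceph-enabled flavor from the log
    start_instance(name)


if __name__ == "__main__":
    migrate_instance_to_ceph("integration-agent-docker-1020")
```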
[16:02:51] !log Added integration-agent-docker-1020 to Jenkins # T260916
[16:02:53] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[16:02:53] T260916: Move CI instances to use ceph in WMCS - https://phabricator.wikimedia.org/T260916
[16:05:51] Project mediawiki-core-doxygen-docker build #18269: FAILURE in 1 min 22 sec: https://integration.wikimedia.org/ci/job/mediawiki-core-doxygen-docker/18269/
[16:05:52] Phabricator: Phab feature request: Cycle time for a task entering a column to resolution, with support for wildcards - https://phabricator.wikimedia.org/T148805 (mmodell) Another option besides status, is we could have a tag called "Started" which gets added by a column trigger.
[16:07:13] !log Depooled integration-agent-docker-1020 to Jenkins cant connnect to /var/run/docker.sock # T260916
[16:07:15] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[16:16:05] Phabricator: Phab feature request: Cycle time for a task entering a column to resolution, with support for wildcards - https://phabricator.wikimedia.org/T148805 (MBinder_WMF) I think a column trigger would work well, with the one standout limitation being that, as I understand it, column triggers only occur...
[16:44:47] Continuous-Integration-Config, Fresnel, Performance-Team: For the Fresnel job, distinguish system failure from assert failure - https://phabricator.wikimedia.org/T216574 (Krinkle) Stalled→Declined It's fine as it is. It's the same as any other CI job in that it gives a good signal that somet...
[16:46:49] Phabricator: Phab feature request: Cycle time for a task entering a column to resolution, with support for wildcards - https://phabricator.wikimedia.org/T148805 (mmodell) Since we are looking for a way to explicitly mark the start of work, it doesn't seem like the limitations are too much of a problem. The t...
[17:21:59] Phabricator: Phab feature request: Cycle time for a task entering a column to resolution, with support for wildcards - https://phabricator.wikimedia.org/T148805 (MBinder_WMF) Yea, if column triggers are simple enough, it's definitely the most straightforward way, for now. The manual task dragging isn't ideal...
[17:34:56] Continuous-Integration-Infrastructure, Release-Engineering-Team (CI & Testing services), Release-Engineering-Team-TODO (2020-07-01 to 2020-09-30 (Q1)): jenkins-deploy user is not in the docker group - https://phabricator.wikimedia.org/T260930 (hashar)
[17:35:17] Continuous-Integration-Infrastructure, Release-Engineering-Team (CI & Testing services), Release-Engineering-Team-TODO (2020-07-01 to 2020-09-30 (Q1)): jenkins-deploy user is not in the docker group - https://phabricator.wikimedia.org/T260930 (hashar)
[17:35:19] Continuous-Integration-Infrastructure, Release-Engineering-Team (CI & Testing services), Release-Engineering-Team-TODO: Move CI instances to use ceph in WMCS - https://phabricator.wikimedia.org/T260916 (hashar)
[17:39:46] Continuous-Integration-Infrastructure, Release-Engineering-Team (CI & Testing services), Release-Engineering-Team-TODO (2020-07-01 to 2020-09-30 (Q1)): jenkins-deploy user is not in the docker group - https://phabricator.wikimedia.org/T260930 (hashar) The group addition has been removed in Puppet by...
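T260930 above boils down to the jenkins-deploy user no longer being in the docker group, hence the failed connection to /var/run/docker.sock. A quick membership check using only the Python standard library; the user and group names are the ones from the task, and the sketch assumes it runs on the affected host:

```python
import grp
import pwd


def user_in_group(user: str, group: str) -> bool:
    """True if `user` belongs to `group`, either as a supplementary
    member or through the user's primary gid."""
    group_entry = grp.getgrnam(group)
    if user in group_entry.gr_mem:
        return True
    return pwd.getpwnam(user).pw_gid == group_entry.gr_gid


if __name__ == "__main__":
    # Names from T260930; raises KeyError if they don't exist locally.
    print(user_in_group("jenkins-deploy", "docker"))
```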
[17:39:56] Continuous-Integration-Infrastructure, Release-Engineering-Team (CI & Testing services), Release-Engineering-Team-TODO (2020-07-01 to 2020-09-30 (Q1)): jenkins-deploy user is not in the docker group - https://phabricator.wikimedia.org/T260930 (hashar) a: hashar
[17:43:40] Continuous-Integration-Infrastructure, Release-Engineering-Team (CI & Testing services), Release-Engineering-Team-TODO (2020-07-01 to 2020-09-30 (Q1)): jenkins-deploy user is not in the docker group - https://phabricator.wikimedia.org/T260930 (Dzahn) I guess one solution is to add a "if $realm = labs...
[18:11:54] (PS1) Ahmon Dancy: Improve mwscript error handling (followup) [tools/scap] - https://gerrit.wikimedia.org/r/621560 (https://phabricator.wikimedia.org/T225109)
[18:17:57] (CR) 20after4: [C: +2] Improve mwscript error handling (followup) [tools/scap] - https://gerrit.wikimedia.org/r/621560 (https://phabricator.wikimedia.org/T225109) (owner: Ahmon Dancy)
[18:20:06] (Merged) jenkins-bot: Improve mwscript error handling (followup) [tools/scap] - https://gerrit.wikimedia.org/r/621560 (https://phabricator.wikimedia.org/T225109) (owner: Ahmon Dancy)
[18:43:44] Continuous-Integration-Infrastructure, Release-Engineering-Team (CI & Testing services), Release-Engineering-Team-TODO (2020-07-01 to 2020-09-30 (Q1)), Patch-For-Review: jenkins-deploy user is not in the docker group - https://phabricator.wikimedia.org/T260930 (hashar) Cherry picked the puppet pa...
[18:44:10] !log Pooling integration-agent-docker-1020 # T260930 / T260916
[18:44:13] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[18:44:13] T260930: jenkins-deploy user is not in the docker group - https://phabricator.wikimedia.org/T260930
[18:44:14] T260916: Move CI instances to use ceph in WMCS - https://phabricator.wikimedia.org/T260916
[18:52:45] Continuous-Integration-Infrastructure, Release-Engineering-Team (CI & Testing services), Release-Engineering-Team-TODO: Move CI instances to use ceph in WMCS - https://phabricator.wikimedia.org/T260916 (hashar) p: Triage→Medium
[18:52:58] Continuous-Integration-Infrastructure, Release-Engineering-Team (CI & Testing services), Release-Engineering-Team-TODO: Move all Wikimedia CI (WMCS integration project) instances from stretch to buster - https://phabricator.wikimedia.org/T252071 (hashar) p: Triage→Medium
[18:56:02] (PS4) Dduvall: Support scratch images [blubber] - https://gerrit.wikimedia.org/r/619072 (https://phabricator.wikimedia.org/T260830)
[19:04:53] (CR) Ahmon Dancy: [C: +1] Support scratch images [blubber] - https://gerrit.wikimedia.org/r/619072 (https://phabricator.wikimedia.org/T260830) (owner: Dduvall)
[19:09:10] Release-Engineering-Team-TODO (2020-07-01 to 2020-09-30 (Q1)), Patch-For-Review, Release, Train Deployments: 1.36.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T257973 (mmodell) Note: Probably not a blocker but this is happening occasionally: {T260853}
[19:18:35] Phabricator: Phab feature request: Cycle time for a task entering a column to resolution, with support for wildcards - https://phabricator.wikimedia.org/T148805 (Aklapper) I can imagine several potential solutions which each have pros and cons. To evaluate them I'd like to know which "certain criteria" peopl...
[20:17:31] (CR) Jayprakash12345: [C: +1] Add PSR2.ControlStructures.SwitchDeclaration [tools/codesniffer] - https://gerrit.wikimedia.org/r/613734 (https://phabricator.wikimedia.org/T182546) (owner: Esanders)
[22:17:46] !log Deleted deployment-xhgui01.deployment-prep - no longer need MongoDB test instance (T180761)
[22:17:48] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[22:17:48] T180761: Move XHGui from tungsten to xhgui-001 - https://phabricator.wikimedia.org/T180761
[22:49:49] I donno if you already know, but there's something very wrong with one of the CI pipelines: https://integration.wikimedia.org/ci/job/mediawiki-quibble-vendor-mysql-php72-docker/28876/console
[22:51:50] Release-Engineering-Team-TODO (2020-07-01 to 2020-09-30 (Q1)), Patch-For-Review, Release, Train Deployments: 1.36.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T257973 (mmodell) Open→Resolved Easiest train, ever.
[23:36:04] Continuous-Integration-Infrastructure, Release-Engineering-Team (CI & Testing services), Release-Engineering-Team-TODO, Patch-For-Review, Technical-Debt: Clear /srv/.git on contint1001; move integration.wikimedia.org docroot to new location - https://phabricator.wikimedia.org/T149924 (Dzahn)...