[00:58:33] (PS1) DannyS712: Start branching GlobalWatchlist extension [tools/release] - https://gerrit.wikimedia.org/r/621242 (https://phabricator.wikimedia.org/T260862)
[00:58:57] (PS2) DannyS712: Start branching GlobalWatchlist extension [tools/release] - https://gerrit.wikimedia.org/r/621242 (https://phabricator.wikimedia.org/T260862)
[00:59:02] (CR) DannyS712: "This change is ready for review." [tools/release] - https://gerrit.wikimedia.org/r/621242 (https://phabricator.wikimedia.org/T260862) (owner: DannyS712)
[06:40:31] maintenance-disconnect-full-disks build 204679 integration-agent-docker-1003 (/: 18%, /srv: 100%, /var/lib/docker: 49%): OFFLINE due to disk space
[06:45:28] maintenance-disconnect-full-disks build 204680 integration-agent-docker-1003 (/: 18%, /srv: 88%, /var/lib/docker: 45%): RECOVERY disk space OK
[06:50:28] maintenance-disconnect-full-disks build 204681 integration-agent-docker-1003 (/: 18%, /srv: 98%, /var/lib/docker: 50%): OFFLINE due to disk space
[06:55:31] maintenance-disconnect-full-disks build 204682 integration-agent-docker-1003 (/: 18%, /srv: 83%, /var/lib/docker: 45%): RECOVERY disk space OK
[07:23:50] (PS2) Hashar: Upgrade gear from 0.7.0 to 1.15.1+wmf1 [integration/zuul/deploy] - https://gerrit.wikimedia.org/r/617404 (https://phabricator.wikimedia.org/T258630)
[07:26:27] (CR) Hashar: [V: +2 C: +2] "Rolling rolling!" [integration/zuul/deploy] - https://gerrit.wikimedia.org/r/617404 (https://phabricator.wikimedia.org/T258630) (owner: Hashar)
[08:00:01] Continuous-Integration-Infrastructure, Release-Engineering-Team (CI & Testing services), Release-Engineering-Team-TODO (2020-07-01 to 2020-09-30 (Q1)), Zuul, and 3 others: Improve scheduling of CI jobs invoked by zuul - https://phabricator.wikimedia.org/T258630 (hashar) That also affect the load...
[08:07:14] Continuous-Integration-Infrastructure, Release-Engineering-Team (CI & Testing services), Release-Engineering-Team-TODO (2020-07-01 to 2020-09-30 (Q1)), Zuul, and 4 others: Improve scheduling of CI jobs invoked by zuul - https://phabricator.wikimedia.org/T258630 (hashar) The last action is to upst...
[08:40:42] Continuous-Integration-Config, Release-Engineering-Team (CI & Testing services), Release-Engineering-Team-TODO (2020-07-01 to 2020-09-30 (Q1)), Wikispeech-Text-to-Speech, and 3 others: Tell Jenkins and Zuul about the Speechoid pipielines - https://phabricator.wikimedia.org/T259911 (Lokal_Profil)
[08:48:54] Continuous-Integration-Config, Release-Engineering-Team (CI & Testing services), Release-Engineering-Team-TODO (2020-07-01 to 2020-09-30 (Q1)), Wikispeech-Text-to-Speech, and 3 others: Tell Jenkins and Zuul about the Speechoid pipielines - https://phabricator.wikimedia.org/T259911 (Lokal_Profil)...
[09:21:11] (PS3) Hashar: Squelch ref-replication gerrit warnings [integration/zuul] (patch-queue/debian/jessie-wikimedia) - https://gerrit.wikimedia.org/r/594526 (owner: Jforrester)
[09:21:24] (PS3) Hashar: Add entry for ref-replication-scheduled event [integration/zuul] (patch-queue/debian/jessie-wikimedia) - https://gerrit.wikimedia.org/r/594527 (owner: Jforrester)
[09:22:43] (CR) Hashar: [V: +2 C: +2] "Sorry for the long delay in handling that backporting." [integration/zuul] (patch-queue/debian/jessie-wikimedia) - https://gerrit.wikimedia.org/r/594526 (owner: Jforrester)
[09:22:55] (CR) Hashar: [V: +2 C: +2] "Sorry for the long delay in handling that backporting." [integration/zuul] (patch-queue/debian/jessie-wikimedia) - https://gerrit.wikimedia.org/r/594527 (owner: Jforrester)
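The maintenance-disconnect-full-disks entries above (06:40-06:55) show an agent being marked OFFLINE when /srv fills up and RECOVERY once usage drops again. A minimal sketch of that kind of two-threshold check in Python; the 95%/90% limits and the function names here are illustrative assumptions, not the actual job's configuration:

```python
import shutil

# Illustrative thresholds only; the real maintenance-disconnect-full-disks
# job has its own configuration in Jenkins.
OFFLINE_PCT = 95   # mark the agent OFFLINE at or above this usage
RECOVERY_PCT = 90  # mark it RECOVERY once usage falls back below this


def percent_used(path: str) -> float:
    """Used space of the filesystem holding `path`, as a percentage."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def next_state(currently_offline: bool, used_pct: float) -> bool:
    """Return True if the agent should be offline after this check."""
    if used_pct >= OFFLINE_PCT:
        return True                  # OFFLINE due to disk space
    if currently_offline and used_pct < RECOVERY_PCT:
        return False                 # RECOVERY disk space OK
    return currently_offline         # otherwise keep the current state


if __name__ == "__main__":
    for mount in ("/", "/srv", "/var/lib/docker"):
        print(f"{mount}: {percent_used(mount):.0f}%")
```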
[09:29:05] (PS1) Hashar: Support Gerrit replication events [integration/zuul/deploy] - https://gerrit.wikimedia.org/r/621478
[09:29:54] James_F: hiiii so I finally managed to review the zuul backport patches for ref-replication Gerrit events
[09:30:06] both backport patches are now merged in our integration/zuul fork
[09:30:14] and I have bumped the source module in the deploy repository: https://gerrit.wikimedia.org/r/c/integration/zuul/deploy/+/621478
[09:30:28] so if that looks right, I guess I can then scap deploy the fixup
[09:31:46] (CR) Hashar: "I think I might have to regenerate the zuul wheel first though :-\" [integration/zuul/deploy] - https://gerrit.wikimedia.org/r/621478 (owner: Hashar)
[09:42:20] (PS2) Hashar: Support Gerrit replication events [integration/zuul/deploy] - https://gerrit.wikimedia.org/r/621478
[09:42:26] nicer
[09:46:53] hashar: Nice.
[10:07:30] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[10:22:06] Project-Admins, Developer-Advocacy: Create #Wikimedia-Developer-Portal tag for Developer Advocacy's project - https://phabricator.wikimedia.org/T260892 (Aklapper) p: Triage→Medium
[10:23:04] Project-Admins, Developer-Advocacy: Create #Wikimedia-Developer-Portal tag for Developer Advocacy's project - https://phabricator.wikimedia.org/T260892 (Aklapper) Open→Resolved Done in https://phabricator.wikimedia.org/project/view/4941/
[10:23:28] (CR) Hashar: [V: +2 C: +2] Support Gerrit replication events [integration/zuul/deploy] - https://gerrit.wikimedia.org/r/621478 (owner: Hashar)
[10:26:33] Project-Admins, Developer-Advocacy, Wikimedia-Developer-Portal: Create #Wikimedia-Developer-Portal tag for Developer Advocacy's project - https://phabricator.wikimedia.org/T260892 (Aklapper)
[10:31:54] hashar: Did you deploy? I'm still seeing `2020-08-20 10:31:30,639 WARNING zuul.GerritEventConnector: Received unrecognized event type 'ref-replication-scheduled' from Gerrit.`
[10:32:06] James_F: yeah but only restarted the zuul merger
[10:32:16] cause there are a lot of changes in the queues :/
[10:32:20] Oh, right. :-(
[10:32:46] cause someone sent a fairly long series of patches targeting mediawiki/core
[10:32:53] Mostly epic puppeteer patches.
[10:32:58] Just drop the lot of them.
[10:33:02] They're all WIPs.
[10:33:23] Well, maybe wait for https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/621037 to clear G&S?
[10:33:56] also the wmf-quibble-selenium-php72-docker ends up being super long :/
[10:34:09] Yeah. :-(
[10:34:13] I was looking at it last night and it seems a good chunk of it is simply due to a lot of new selenium tests being added
[10:34:22] such as in MinervaNeue
[10:34:32] which is a good thing, but then we should probably not run it for every change
[10:34:37] Yeah.
[10:34:40] Erk. No no no.
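The Zuul Gearman alerts interleaved through this log (the PROBLEM above at 10:07 and the RECOVERY that follows) flip to CRITICAL when the number of waiting work requests climbs past 150 and only recover once it drops below 90; the gap between the two thresholds keeps the alert from flapping. A simplified sketch of that hysteresis; the real Icinga check evaluates a Graphite series over a time window rather than single samples:

```python
CRITICAL_THRESHOLD = 150  # waiting work requests, from the alert text above
OK_THRESHOLD = 90         # recovery threshold; the gap avoids flapping


def evaluate(previous_state: str, waiting_requests: int) -> str:
    """Return 'CRITICAL' or 'OK' for the Gearman queue length."""
    if waiting_requests > CRITICAL_THRESHOLD:
        return "CRITICAL"
    if waiting_requests < OK_THRESHOLD:
        return "OK"
    return previous_state  # in between: keep whatever was reported last


# Example: the queue spikes, stays busy for a while, then drains.
state = "OK"
for sample in (40, 180, 120, 95, 60):
    state = evaluate(state, sample)
    print(sample, state)
```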
[10:34:46] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[10:34:51] so we could use something like you did for phpunit @group Standalone
[10:34:54] The world breaking three hours after a patch has merged is terrible CI.
[10:35:05] We already do split out the selenium tests into their own job.
[10:35:33] We could split the skin ones into their own special one because there are so many of them, but that'll just be an endless fight of ever-yet-more jobs.
[10:38:09] yeah :-\
[10:38:37] Maybe we need to give each skin and extension a time budget for selenium tests.
[10:38:51] They can have 10s of tests. Pick wisely.
[10:39:29] But K.rinkle keeps pointing out that our selenium tests run disastrously slowly compared to local runs.
[10:40:02] So maybe there's something we could fix in quibble to make them much faster, if only we could find which one.
[10:40:18] Running the different extensions' selenium runs in parallel might help?
[10:41:53] yeah pretty sure there are some optimizations that are required
[10:42:06] one that was identified a few months ago is that the php built-in webserver is single-threaded
[10:42:08] Maybe make it QTE's problem?
[10:42:12] so all requests are serialized
[10:42:19] Yeah, how is the move to Apache going?
[10:42:31] The main work last I looked was with Adam, and stalled.
[10:42:58] Also migrating selenium runs of quibble to buster (and PHP 7.3 and Chrome 83) would make things much faster, IIRC?
[10:43:09] But more complex. :-(
[10:43:17] might be
[10:43:54] running with the old chromium 73 might at least catch issues with that old browser
[10:44:00] but maybe that is not much of an argument
[10:44:15] php7.3, potentially that might not catch a php7.2 issue
[10:44:27] We'd run the phpunit tests on php7.2
[10:44:30] Just the selenium ones.
[10:44:37] Hence the complexity.
[10:44:52] so well, we want to add support for Apache in Quibble
[10:44:55] instead of using php -S
[10:45:06] PHP 7.3's php -S is faster, though?
[10:45:14] So that might be a quicker win before moving to Apache?
[10:45:23] and I guess we need a solution to have some selenium tests filtered out (like @group standalone)
[10:45:31] Eh. :-(
[10:45:44] and I don't think 7.3 is considerably faster than 7.2
[10:45:52] Selenium tests are exactly the things that are hard to isolate and often broken by other extensions' changes.
[10:46:04] yeah
[10:46:20] hashar: OK, the world is quiet. Restart zuul?
[10:46:26] then from a quick look at MinervaNeue tests, it seems a lot of scenarios are just testing that skin and would barely have interaction with other repos
[10:46:28] but might be wrong
[10:50:18] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[10:56:10] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[11:00:12] James_F: will restart zuul when it is quieter
[11:01:09] hashar: I can just start aborting jobs in Jenkins to make it quieter. ;-)
[11:02:49] i need to revisit graceful shutdown
[11:02:57] Yeah. :-(
[11:03:00] sending SIGUSR1 to zuul causes it to stop triggering jobs
[11:03:12] (PS1) Gergő Tisza: Add @see to UnusedUseStatementSniff [tools/codesniffer] - https://gerrit.wikimedia.org/r/621508
[11:03:27] But the queues don't empty?
[11:28:50] James_F: yeah
[11:28:59] SIGUSR1 causes Zuul to stop adding changes in the queues
[11:29:14] it just keeps all newly received events on hold
[11:29:26] once all jobs have been completed, it would then restart
[11:29:34] load the saved events and resume processing them
[11:46:36] (PS3) Esanders: Add PSR2.ControlStructures.SwitchDeclaration [tools/codesniffer] - https://gerrit.wikimedia.org/r/613734 (https://phabricator.wikimedia.org/T182546)
[12:11:03] !log Gracefully stopping Zuul
[12:11:05] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[12:17:01] James_F: I have restarted zuul
[13:00:39] (CR) Lars Wirzenius: [C: +1] "still LGTM (which is short for Lars Giggles To Moomins)" [blubber] - https://gerrit.wikimedia.org/r/619072 (https://phabricator.wikimedia.org/T260830) (owner: Dduvall)
[13:48:58] phabricator seems to be having some DB connection trouble
[13:49:24] "Unable to establish a connection to any database host (while trying "phabricator_token"). All masters and replicas are completely unreachable. AphrontConnectionQueryException: Attempt to connect to phuser@m3-master.eqiad.wmnet failed with error #2002: Cannot assign requested address."
[13:51:38] I saw errors as well, but now it seems to be back for me
[13:53:42] working at the moment here as well
[13:57:50] never mind, still hitting errors
[13:58:50] this one also: "upstream connect error or disconnect/reset before headers. reset reason: connection failure"
[14:05:06] ^ twentyafterfour I'm seeing a few things, too: http://tyler.zone/phab-trouble-2020-08-20.png
[14:05:51] hmm
[14:08:54] so there was a huuuge spike in rows read per second a few minutes ago
[14:09:51] thcipriani: was that just transient? I also had a couple of connections reset a few minutes ago but everything appears to be back to normal now?
[14:10:21] https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&var-site=eqiad&var-group=misc&var-shard=All&var-role=All&from=now-3h&to=now
[14:10:35] a couple of really big spikes but then everything normal
[14:16:28] hrm, lots of messages in the error log, but with the whole stack trace without new lines it's hard to parse :(
[14:40:47] thcipriani: yeah I noticed that as well, kinda strange, I'm not sure when that started happening
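The graceful-stop behaviour hashar describes above (SIGUSR1 makes the Zuul scheduler stop adding changes to its queues, hold newly received Gerrit events, finish the running jobs, then reload the saved events on restart) amounts to sending a signal to the scheduler process, which is what the 12:11 "Gracefully stopping Zuul" entry did. A minimal sketch; the pidfile path is an assumption for illustration, not the actual deployment's location:

```python
import os
import signal
from pathlib import Path

# Hypothetical pidfile location; check the real zuul-server service
# configuration for the actual path.
PIDFILE = Path("/var/run/zuul/zuul.pid")


def graceful_stop() -> None:
    """Ask Zuul to wind down: stop queueing new changes, hold incoming
    events, and let the jobs that are already running finish."""
    pid = int(PIDFILE.read_text().strip())
    os.kill(pid, signal.SIGUSR1)
    print(f"Sent SIGUSR1 to the Zuul scheduler (pid {pid}); "
          "held events are replayed once it comes back up.")


if __name__ == "__main__":
    graceful_stop()
```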
[14:43:16] hi all! If I have questions about worker VMs in the integration project, should I start an async conversation with hashar on phab or are there other people in my TZ involved with that?
[14:43:45] andrewbogott: shoot? ;)
[14:43:54] I mean, please shoot your question!
[14:44:12] oh! You're here :) You're not in the alphabetical user list but that's 'cause you're at the top
[14:45:11] So, background: we're very tentatively moving some VMs to ceph-backed storage. In the interest of caution, we're trying to do this with mostly cattle-type instances.
[14:45:25] I'm wondering if integration workers are good candidates for that
[14:46:12] ceph? isn't openstack coming with Swift and we have in-house knowledge of swift already
[14:46:13] ?
[14:46:22] (by 'cattle' I mean: can definitely be rebuilt without much trouble if they are suddenly deleted)
[14:46:26] anyway, we don't rely on any external storage
[14:46:46] the instances nowadays are just the Stretch image with Docker installed, an extended disk allocated for storing the images, and that is about it
[14:46:55] ah, ok — swift is an object store. Ceph can also be used as an object store, but that's not what we're using it for currently.
[14:47:10] the VMs themselves are moving from local hypervisor storage to distributed storage.
[14:48:37] i have no idea how storage is managed right now though
[14:48:55] but the CI instances do a bunch of disk i/o. I imagine that is written to local storage right now
[14:49:05] is that a qcow image maybe?
[14:49:47] if that moves to a distributed storage maybe the io would end up being slower
[14:50:00] Right now all the 'local' storage on a VM (aka '/') is stored in a local file on the hypervisor.
[14:50:42] We don't have very specific performance stats currently, part of why I'm hoping to use your nodes is that you're good at noticing these changes :)
[14:51:06] But also, everything is moving to ceph eventually, so this isn't optional in the long run, I'm just looking for early adopters.
[14:51:43] yeah got it
[14:51:53] well maybe it is better to pick some instances from tools labs
[14:52:30] I would rather not risk possible issues that would be encountered by being the first to adopt that new system
[14:52:55] most important would be integrity, imho
[14:52:56] guess my main concern is how it might slow down disk i/o (but that may be a red herring)
[14:53:10] I remember there were some issues years ago with... glusterfs?
[14:53:30] ok, next question (tangential): I'm adding some modest short-term backups for VM backing files. We don't have storage for absolutely everything so I'm trying to (again) identify cattle vs. pets so I can exclude cattle from the backup
[14:53:40] I suppose disk i/o will have to be slower
[14:53:51] (since in some cases like e.g. k8s workers it's easier to rebuild from scratch than restore from backup)
[14:53:52] question is whether it is so much slower that it matters
[14:54:30] disk performance with ceph is actually quite fast because of elaborate distributed striping algos
[14:54:38] but I don't have clear side-by-side numbers yet
[14:55:20] so, regarding backups… I have a regexp system for excluding things that don't need backing up: https://gerrit.wikimedia.org/r/c/operations/puppet/+/621281/1/modules/profile/templates/wmcs/backy2/wmcs_backup_instances.yaml.erb
[14:55:25] perhaps, if a "disk" is used by one node exclusively, it is pinned there and you don't notice an io difference
[14:55:46] although being a distributed system, I guess that needs to be replicated somewhere
[14:55:47] hashar: would you be interested in writing a patch to that file to exclude those integration nodes that don't need backups? Or should I just exclude the whole project?
[14:56:16] Platonides: the glusterfs thing was 8 years ago; pretty much everyone in the OpenStack world uses ceph for instance storage these days.
[14:56:33] It's not nearly as simple as one file on one ceph server, it's scattered about block by block.
[14:56:49] so imagine RAID striping except that instead of an array of lots of disks it's an array of lots of servers
[14:57:01] andrewbogott: for backup, the puppetmaster: integration-puppetmaster02, and I guess integration-cumin would be good candidates. The rest we can afford to just lose entirely
[14:57:09] Release-Engineering-Team (Onboarding), Release-Engineering-Team-TODO, Scap, User-dancy: Better error message than "Scap failed!: Call to mwscript eval.php returned: None" - https://phabricator.wikimedia.org/T225109 (dancy) a: dancy
[14:57:33] hashar: ok, I'll look at what you've got and will run the patch by you before it's in
[14:57:44] andrewbogott: I don't mean it is wrong to move
[14:57:50] or that glusterfs was bad, either
[14:57:51] (for what it's worth, the backups only apply to ceph-hosted things, so it'll be moot in the near term)
[14:58:03] glusterfs was pretty bad in 2011 :)
[14:58:13] Folks seem to be happy with it these days for the most part
[14:58:13] I was just remembering that "bad experience", yes, I sound like an old and grumpy grandfather ;)
[14:58:25] andrewbogott: though we might just rebuild those instances. So ideally the decision to back it up or not would be based on the Puppet role maybe
[14:58:48] believe me, glusterfs-related trauma is a big part of why it's taking us so long to adopt distributed storage of any kind
[15:00:02] hm… role-based exclusion is a good idea, although role assignment to VMs isn't nearly as straightforward as it is in prod
[15:00:08] andrewbogott: but yeah essentially I guess it would be nice to back up everything by default then add some regex as to which hosts not to back up ( /^integration-agent-.*/ can be excluded ).
[15:00:43] (PS3) Ahmon Dancy: Improve mwscript error handling [tools/scap] - https://gerrit.wikimedia.org/r/620791 (https://phabricator.wikimedia.org/T225109)
[15:01:58] cool, I will implement that shortly
[15:03:52] Release-Engineering-Team-TODO (2020-07-01 to 2020-09-30 (Q1)), Scap, User-dancy: Scap fails when checking train version number - https://phabricator.wikimedia.org/T259706 (dancy) Open→Resolved
[15:04:30] hashar: simple as this, yes? https://gerrit.wikimedia.org/r/c/operations/puppet/+/621538
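A minimal sketch of the name-based exclusion being discussed: back up everything by default and skip any instance whose name matches an exclusion pattern such as the /^integration-agent-.*/ hashar suggests above. The pattern list and helper below are illustrative only, not the contents of the actual wmcs_backup_instances.yaml.erb template or the linked patch:

```python
import re

# Illustrative exclusion list; the real one is maintained in puppet
# (wmcs_backup_instances.yaml.erb) per project.
EXCLUDE_PATTERNS = [
    r"^integration-agent-.*",  # disposable Jenkins agents, easy to rebuild
]


def should_backup(instance_name: str) -> bool:
    """Back up by default; skip anything matching an exclusion pattern."""
    return not any(re.match(p, instance_name) for p in EXCLUDE_PATTERNS)


for name in ("integration-agent-docker-1003",
             "integration-puppetmaster02",
             "integration-cumin"):
    print(name, "->", "backup" if should_backup(name) else "skip")
```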
[15:13:29] andrewbogott: yup +1ed ;]
[15:13:53] thx
[15:14:45] andrewbogott: as for Ceph, surely I don't mind moving integration to it, but I would rather not be the first adopter :]]]
[15:15:04] that's fair!
[15:15:09] then if the migration is doable on a per-instance basis, surely we could start moving some of the Jenkins agents to ceph and compare
[15:15:17] there's a fair bit of toolforge on it now, we can let it sit for a few weeks before moving your things
[15:15:21] so that out of the X hosts we have, we could have 3 or 4 moved to Ceph
[15:15:35] yeah — moving it per instance would just consist of you rebuilding nodes with a ceph-enabled flavor
[15:15:41] then
[15:15:52] I imagine disk I/O is done on the localhost and is then replicated to the distributed storage
[15:16:13] and maybe we can even tweak it so that the distribution is not a blocker (eg make it so that we can lose data eventually)
[15:18:36] there actually isn't any local disk i/o — our new hypervisors barely have disks at all, just enough for the OS
[15:18:47] kvm mounts the ceph volumes directly
[15:18:47] ahhh
[15:18:50] it's pretty weird :)
[15:19:51] part of the move to ceph will involve moving off of old hypervisors to newer ones, so for things on the low-number hypervisors there will be big CPU improvements. Multiple variables changing at once :/
[15:20:08] big CPU!
[15:20:19] now there is an incentive to migrate! :]]]
[15:20:36] cause the old cloudvirts were a pain (apparently due to the bios CPU scaling governor)
[15:20:56] so anytime we create an instance, we check whether it happened to be scheduled on those faulty cloudvirts
[15:21:11] but yeah again
[15:21:19] if we can move just a few instances that would be nice
[15:22:01] :)
[15:22:16] I'll enable the flavor for your project and then you can build things there at your convenience.
[15:25:45] sounds good
[15:25:57] andrewbogott: so we can just pick the backend when creating a new instance?
[15:27:29] hashar: you should see a new flavor in horizon, 'mediumram-ceph'
[15:27:37] AHH
[15:27:40] VMs with that flavor will be scheduled on ceph-backed hypervisors
[15:27:55] (it looks like 'mediumram' is what you're using for most of your workers?)
[15:28:33] apparently yes
[15:28:59] do you have a meta task for migrating stuff to Ceph?
[15:29:16] or maybe a phabricator project
[15:29:44] only very high-level at this point
[15:29:45] https://phabricator.wikimedia.org/T253365
[15:29:59] I have a fair bit of communication to do with this before we start moving 'normal' VMs
[15:30:05] I imagine
[15:30:15] maybe a Phabricator subproject would work
[15:30:27] that is what I usually do for big migrations, this way I end up with a dedicated workboard
[15:31:11] (CR) Jayprakash12345: [C: +1] Start branching GlobalWatchlist extension [tools/release] - https://gerrit.wikimedia.org/r/621242 (https://phabricator.wikimedia.org/T260862) (owner: DannyS712)
[15:31:12] when the main migration happens it will be one hypervisor at a time rather than one project at a time, so won't be so much 'coordinated' with users as just announced :)
[15:31:31] although I'm still thinking through whether we can do it one VM at a time instead
[15:32:06] damn
[15:32:12] we are still on Stretch instances grblblb
[15:33:57] andrewbogott: is there a way to move an instance manually to ceph?
[15:34:33] Continuous-Integration-Infrastructure, Release-Engineering-Team (CI & Testing services), Release-Engineering-Team-TODO: Move all Wikimedia CI (WMCS integration project) instances from stretch to buster - https://phabricator.wikimedia.org/T252071 (hashar)
[15:34:33] it's not straightforward because a given hypervisor is one or the other
[15:35:06] so the process as it stands now is basically stop everything on a hypervisor, switch the backend to ceph, import everything into ceph, start VMs back up
[15:36:22] but maybe you could stop an instance, import it into ceph, change its flavor and have it scheduled on a ceph-based hypervisor when it has to be started?
[15:36:28] or maybe I am making no sense hehe
[15:36:55] no, that makes sense, it's just a fair bit of coding
[15:37:15] and I'm not sure yet if it's necessary or not.
[15:37:32] Continuous-Integration-Infrastructure, Release-Engineering-Team (CI & Testing services), Release-Engineering-Team-TODO: Move CI instances to use ceph in WMCS - https://phabricator.wikimedia.org/T260916 (hashar)
[15:37:38] ok ok
[15:37:49] guess we can just create fresh new ones since they are easy to provision
[15:37:51] I'm not totally sure how to do the 'import it into ceph' stage from a hv that doesn't have ceph enabled. So it might be 'stop an instance, copy to a ceph-enabled hypervisor, import into ceph, etc. etc.'
[15:37:54] which is 2x copies
[15:38:04] yeah
[15:38:20] but that might be faster than having to provision a fresh new instance ;]
[15:38:22] anyway
[15:38:23] https://phabricator.wikimedia.org/T260916
[15:38:45] filed for it, I am unlikely to think about it anytime soon (if ever :D )
[15:38:58] but I guess it is probably all about creating a new instance with that new flavor
[15:39:53] yep, if you're building new instances anyway you can just use that flavor, the difference should be pretty much invisible
[15:40:57] !log Created dummy instance integration-agent-docker-1020 using a Ceph backed hypervisor # T260916
[15:40:59] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[15:40:59] T260916: Move CI instances to use ceph in WMCS - https://phabricator.wikimedia.org/T260916
[15:41:08] will see how it goes, and I guess I can pool it in the fleet of jenkins agents
[15:41:09] ;D
[15:41:51] great!
[15:44:04] 90 upgraded, 19 newly installed, 0 to remove and 0 not upgraded.
[15:44:10] Need to get 105MB
[15:44:11] ;]
[15:52:55] btw hashar, creating new ceph VMs is currently very slow because the base image has to be converted during startup. Once we've moved to an all-ceph setup we'll rebuild base images in the format that ceph likes and things will be fast again.
[15:54:44] (CR) 20after4: [C: +2] Improve mwscript error handling [tools/scap] - https://gerrit.wikimedia.org/r/620791 (https://phabricator.wikimedia.org/T225109) (owner: Ahmon Dancy)
[15:56:35] andrewbogott: I was merely ranting about how the Stretch base image might deserve a rebuild :]]]
[15:56:55] (Merged) jenkins-bot: Improve mwscript error handling [tools/scap] - https://gerrit.wikimedia.org/r/620791 (https://phabricator.wikimedia.org/T225109) (owner: Ahmon Dancy)
[15:57:39] anyway it is built. Will pool it in later on
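The per-instance idea floated above (stop the VM, copy its backing file to a ceph-enabled hypervisor, import it into ceph, switch it to a ceph flavor, start it again) can be summarised as the outline below. Every helper here is a hypothetical stub that only narrates the step; none of this is an existing OpenStack or WMCS API:

```python
# Outline only: hypothetical stubs standing in for the real tooling.

def stop_instance(name: str) -> None:
    print(f"stop {name}")

def copy_backing_file_to_ceph_host(name: str) -> str:
    print(f"copy the backing file of {name} to a ceph-enabled hypervisor")
    return f"/tmp/{name}.qcow2"          # hypothetical staging path

def import_backing_file_into_ceph(path: str) -> None:
    print(f"import {path} into ceph")    # the second of the '2x copies'

def set_flavor(name: str, flavor: str) -> None:
    print(f"switch {name} to flavor {flavor}")

def start_instance(name: str) -> None:
    print(f"start {name}; the scheduler places it on a ceph-backed hypervisor")


def migrate_instance_to_ceph(name: str) -> None:
    stop_instance(name)
    staged = copy_backing_file_to_ceph_host(name)
    import_backing_file_into_ceph(staged)
    set_flavor(name, "mediumram-ceph")   # the ceph-enabled flavor from the log
    start_instance(name)


if __name__ == "__main__":
    migrate_instance_to_ceph("integration-agent-docker-1020")
```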
[16:02:51] !log Added integration-agent-docker-1020 to Jenkins # T260916
[16:02:53] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[16:02:53] T260916: Move CI instances to use ceph in WMCS - https://phabricator.wikimedia.org/T260916
[16:05:51] Project mediawiki-core-doxygen-docker build #18269: FAILURE in 1 min 22 sec: https://integration.wikimedia.org/ci/job/mediawiki-core-doxygen-docker/18269/
[16:05:52] Phabricator: Phab feature request: Cycle time for a task entering a column to resolution, with support for wildcards - https://phabricator.wikimedia.org/T148805 (mmodell) Another option besides status, is we could have a tag called "Started" which gets added by a column trigger.
[16:07:13] !log Depooled integration-agent-docker-1020 to Jenkins cant connnect to /var/run/docker.sock # T260916
[16:07:15] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[16:16:05] Phabricator: Phab feature request: Cycle time for a task entering a column to resolution, with support for wildcards - https://phabricator.wikimedia.org/T148805 (MBinder_WMF) I think a column trigger would work well, with the one standout limitation being that, as I understand it, column triggers only occur...
[16:44:47] Continuous-Integration-Config, Fresnel, Performance-Team: For the Fresnel job, distinguish system failure from assert failure - https://phabricator.wikimedia.org/T216574 (Krinkle) Stalled→Declined It's fine as it is. It's the same as any other CI job in that it gives a good signal that somet...
[16:46:49] Phabricator: Phab feature request: Cycle time for a task entering a column to resolution, with support for wildcards - https://phabricator.wikimedia.org/T148805 (mmodell) Since we are looking for a way to explicitly mark the start of work, it doesn't seem like the limitations are too much of a problem. The t...
[17:21:59] Phabricator: Phab feature request: Cycle time for a task entering a column to resolution, with support for wildcards - https://phabricator.wikimedia.org/T148805 (MBinder_WMF) Yea, if column triggers are simple enough, it's definitely the most straightforward way, for now. The manual task dragging isn't ideal...
[17:34:56] Continuous-Integration-Infrastructure, Release-Engineering-Team (CI & Testing services), Release-Engineering-Team-TODO (2020-07-01 to 2020-09-30 (Q1)): jenkins-deploy user is not in the docker group - https://phabricator.wikimedia.org/T260930 (hashar)
[17:35:17] Continuous-Integration-Infrastructure, Release-Engineering-Team (CI & Testing services), Release-Engineering-Team-TODO (2020-07-01 to 2020-09-30 (Q1)): jenkins-deploy user is not in the docker group - https://phabricator.wikimedia.org/T260930 (hashar)
[17:35:19] Continuous-Integration-Infrastructure, Release-Engineering-Team (CI & Testing services), Release-Engineering-Team-TODO: Move CI instances to use ceph in WMCS - https://phabricator.wikimedia.org/T260916 (hashar)
[17:39:46] Continuous-Integration-Infrastructure, Release-Engineering-Team (CI & Testing services), Release-Engineering-Team-TODO (2020-07-01 to 2020-09-30 (Q1)): jenkins-deploy user is not in the docker group - https://phabricator.wikimedia.org/T260930 (hashar) The group addition has been removed in Puppet by...
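T260930 above boils down to the jenkins-deploy user no longer being in the docker group, hence the failed connection to /var/run/docker.sock. A quick membership check using only the Python standard library; the user and group names are the ones from the task, and the sketch assumes it runs on the affected host:

```python
import grp
import pwd


def user_in_group(user: str, group: str) -> bool:
    """True if `user` belongs to `group`, either as a supplementary
    member or through the user's primary gid."""
    group_entry = grp.getgrnam(group)
    if user in group_entry.gr_mem:
        return True
    return pwd.getpwnam(user).pw_gid == group_entry.gr_gid


if __name__ == "__main__":
    # Names from T260930; raises KeyError if they don't exist locally.
    print(user_in_group("jenkins-deploy", "docker"))
```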
[17:39:56] Continuous-Integration-Infrastructure, Release-Engineering-Team (CI & Testing services), Release-Engineering-Team-TODO (2020-07-01 to 2020-09-30 (Q1)): jenkins-deploy user is not in the docker group - https://phabricator.wikimedia.org/T260930 (hashar) a: hashar
[17:43:40] Continuous-Integration-Infrastructure, Release-Engineering-Team (CI & Testing services), Release-Engineering-Team-TODO (2020-07-01 to 2020-09-30 (Q1)): jenkins-deploy user is not in the docker group - https://phabricator.wikimedia.org/T260930 (Dzahn) I guess one solution is to add a "if $realm = labs...
[18:11:54] (PS1) Ahmon Dancy: Improve mwscript error handling (followup) [tools/scap] - https://gerrit.wikimedia.org/r/621560 (https://phabricator.wikimedia.org/T225109)
[18:17:57] (CR) 20after4: [C: +2] Improve mwscript error handling (followup) [tools/scap] - https://gerrit.wikimedia.org/r/621560 (https://phabricator.wikimedia.org/T225109) (owner: Ahmon Dancy)
[18:20:06] (Merged) jenkins-bot: Improve mwscript error handling (followup) [tools/scap] - https://gerrit.wikimedia.org/r/621560 (https://phabricator.wikimedia.org/T225109) (owner: Ahmon Dancy)
[18:43:44] Continuous-Integration-Infrastructure, Release-Engineering-Team (CI & Testing services), Release-Engineering-Team-TODO (2020-07-01 to 2020-09-30 (Q1)), Patch-For-Review: jenkins-deploy user is not in the docker group - https://phabricator.wikimedia.org/T260930 (hashar) Cherry picked the puppet pa...
[18:44:10] !log Pooling integration-agent-docker-1020 # T260930 / T260916
[18:44:13] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[18:44:13] T260930: jenkins-deploy user is not in the docker group - https://phabricator.wikimedia.org/T260930
[18:44:14] T260916: Move CI instances to use ceph in WMCS - https://phabricator.wikimedia.org/T260916
[18:52:45] Continuous-Integration-Infrastructure, Release-Engineering-Team (CI & Testing services), Release-Engineering-Team-TODO: Move CI instances to use ceph in WMCS - https://phabricator.wikimedia.org/T260916 (hashar) p: Triage→Medium
[18:52:58] Continuous-Integration-Infrastructure, Release-Engineering-Team (CI & Testing services), Release-Engineering-Team-TODO: Move all Wikimedia CI (WMCS integration project) instances from stretch to buster - https://phabricator.wikimedia.org/T252071 (hashar) p: Triage→Medium
[18:56:02] (PS4) Dduvall: Support scratch images [blubber] - https://gerrit.wikimedia.org/r/619072 (https://phabricator.wikimedia.org/T260830)
[19:04:53] (CR) Ahmon Dancy: [C: +1] Support scratch images [blubber] - https://gerrit.wikimedia.org/r/619072 (https://phabricator.wikimedia.org/T260830) (owner: Dduvall)
[19:09:10] Release-Engineering-Team-TODO (2020-07-01 to 2020-09-30 (Q1)), Patch-For-Review, Release, Train Deployments: 1.36.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T257973 (mmodell) Note: Probably not a blocker but this is happening occasionally: {T260853}
[19:18:35] Phabricator: Phab feature request: Cycle time for a task entering a column to resolution, with support for wildcards - https://phabricator.wikimedia.org/T148805 (Aklapper) I can imagine several potential solutions which each have pros and cons. To evaluate them I'd like to know which "certain criteria" peopl...
[20:17:31] (CR) Jayprakash12345: [C: +1] Add PSR2.ControlStructures.SwitchDeclaration [tools/codesniffer] - https://gerrit.wikimedia.org/r/613734 (https://phabricator.wikimedia.org/T182546) (owner: Esanders)
[22:17:46] !log Deleted deployment-xhgui01.deployment-prep - no longer need MongoDB test instance (T180761)
[22:17:48] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[22:17:48] T180761: Move XHGui from tungsten to xhgui-001 - https://phabricator.wikimedia.org/T180761
[22:49:49] I donno if you already know, but there's something very wrong with one of the CI pipelines: https://integration.wikimedia.org/ci/job/mediawiki-quibble-vendor-mysql-php72-docker/28876/console
[22:51:50] Release-Engineering-Team-TODO (2020-07-01 to 2020-09-30 (Q1)), Patch-For-Review, Release, Train Deployments: 1.36.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T257973 (mmodell) Open→Resolved Easiest train, ever.
[23:36:04] Continuous-Integration-Infrastructure, Release-Engineering-Team (CI & Testing services), Release-Engineering-Team-TODO, Patch-For-Review, Technical-Debt: Clear /srv/.git on contint1001; move integration.wikimedia.org docroot to new location - https://phabricator.wikimedia.org/T149924 (Dzahn)...