[15:40:12] <AaronSchulz>	 TimStarling: what is getting the second DB connection all the time on your wiki?
[15:40:28] <AaronSchulz>	 I know some things do that with DB_MASTER in some cases
[17:10:14] <Krinkle>	 I stumbled onto this config file and trying to understand exactly what I'm looking at: https://gerrit.wikimedia.org/g/operations/deployment-charts/+/985f4bb6ee34a4ee8356cef226f97b5a377fdce9/helmfile.d/services/changeprop-jobqueue/values.yaml#67
[17:11:10] <Krinkle>	 I thought we serialise and queue the 'retries' information from MW at runtime when queuing jobs, but it seems this is doing something similar. Are they the same? Is one ignored? Or is this an override only?
[19:28:17] <Krinkle>	 Pchelolo: ^
[19:29:51] <Pchelolo>	 Krinkle: change-prop can retry things by itself. you are right, cirrus jobs do retries from within MW, so we disable generic change-prop retries for them
[19:30:34] <Pchelolo>	 cirrus jobs are smarter about their retires then change-prop can be, it just reexecutes exactly the same job if the job fails
[19:30:53] <Pchelolo>	 cirrus reenqueues a different, simpler job upon failure
[19:31:53] <Pchelolo>	 a better solution would be to make cirrus report success if it successfully enqueued the retries, and have change-prop retry the entire original cirrus job in case of catastrophic failure, but we've never got to this
[19:33:12] <Pchelolo>	 for other jobs, we can override Job::allowRetries and change-prop will respect that
[19:34:36] <Krinkle>	 OK, so in general it does respect it. So all jobs that set allowRetries=true (all jobs do that by default), they will get retried when they fail or e.g. when we kill them by restarting a job runner. 
[19:34:41] <Krinkle>	 and if it sets false, it will honour that too.
[19:35:36] <Krinkle>	 I didn't quite understand how the cirrus jobs are different. When these fail, you say a different simpler job is re-queued in its place?
[19:35:59] <Krinkle>	 where do those come from?
[19:36:54] <Pchelolo>	 yes. That's what Cirrus jobs do afaik - they build the update for cirrus search (expensive), then try writing it to elasticsearch cluster, and then if elastic write fails, they enqueue the ElasticaWrite job, which contains the pre-built update for elasticsearch
[19:37:18] <Pchelolo>	 so their retry is in reality a different job
[19:38:09] <Krinkle>	 ah ok, I think I undertand. but what does change-prop have to do different?
[19:39:39] <Krinkle>	 it seems like in that case cirrus jobs should either disable retries in general then, or mark themselves as success after they catch/requeue write job. I vaguely recall that they do that already, so maybe this was a temporary measure after a miscommunication between teams?
[19:39:54] <Krinkle>	 that would also benefit third parties :)
[19:40:28] <Pchelolo>	 yeah, it's probably a temporary hack that never got removed.
[19:40:44] <Krinkle>	 https://github.com/wikimedia/mediawiki-extensions-CirrusSearch/blob/master/includes/Job/CheckerJob.php
[19:40:52] <Krinkle>	 ok, yeah, I see a year ago or so it was fixed in the extension
[19:41:43] <Krinkle>	 I'd like to help with some documentation for hte new job queue as it is a relatively large system nowadays. I don't know all the moving parts by name and location etc. any features we can remove would be a nice win for stability and debuggability
[19:43:08] <Pchelolo>	 hm.. this code is actually exactly the opposite of being fixed. the job sets 'allowRetries' true, which would make change-prop retry the original job, and it posts retry jobs itself. 
[19:43:32] <Pchelolo>	 so we will have 15 retires overall if I remove that config hack
[19:43:43] <Krinkle>	 Pchelolo: I saw that, but I think Ottomata and I talked about that. It was intentional.
[19:44:00] <Krinkle>	 it only happens if the job fails fatally before it can return true and retry its own way
[19:44:09] <Krinkle>	 there is no return false in the code, and it catches its own failures.
[19:44:20] <Krinkle>	 so if it retried its own way, it is always success so no native retry
[19:44:22] <Pchelolo>	 oh, ok. 
[19:44:38] <Krinkle>	 separation of concerns :)
[19:44:45] <Krinkle>	 but yeah, the code is confusing, could use a better inline comment
[19:45:01] <Krinkle>	 its likely that killed jobs will become more common when mw runs without opcache validation and/or on k8s pods because deployments will have to replace existing pods within a short period of time.
[19:45:09] <Krinkle>	 hence I'm looking for jobs that don't support retries today
[19:45:53] <Pchelolo>	 Lemme make a ticket about this and look into cirrus search situation in more detail and maybe remove that config
[19:45:53] <Krinkle>	 but looks like there is nothing I need to worry about within the change-props part, it only sets 0 for cirrus jobs, which is still not great because it means if it fatals now it will be lost, but it's mostly taken care of.
[19:46:00] <Krinkle>	 cool, thanks!
[19:46:15] <Pchelolo>	 by default it does 2 retries
[19:46:36] <Pchelolo>	 with exponential delay
[19:47:05] <Pchelolo>	 there's a lot of control on retry limit and how it's delayed, so we can tweak it easily if needed
[19:49:23] <Krinkle>	 yeah, maybe for operational reasons there coudl stil be an emergency way to reduce retries from 3 to 1 if an incident happened. but in general I think it should not appear in the config and probably not be able to accidentally start retries for jobs that set it to false, or accidentally set 0 for jobs that set it to true. that will make it difficult to avoid mistakes and make it safe to restart jobrunners without worries.
[20:19:15] <legoktm>	 James_F: https://codesearch.wmcloud.org/deployed/?q=Register%20hooks%20based%20on%20version&i=nope&files=&excludeFiles=&repos= sigh
[20:37:13] <James_F>	 legoktm: Indeed.
[21:46:02] <TimStarling>	 AaronSchulz: it's from JobQueueDB::getDB(), so actually not affecting production
[21:46:28] <TimStarling>	 I'm going to upload some debug log enhancements since it wasn't obvous at first why that connection was being opened
[21:58:28] <AaronSchulz>	 TimStarling: oh, that makes sense if it was from the $wgJobRunRate code
[22:01:34] <TimStarling>	 maybe CONN_TRX_AUTOCOMMIT should only be used for $index == DB_MASTER?
[22:05:52] <TimStarling>	 would you approve that if I put it in gerrit?
[22:26:52] <James_F>	 legoktm: All now fixed (thanks to Reedy for the merges).
[22:27:30] <legoktm>	 <3 to both
[22:51:21] <TimStarling>	 legoktm, regarding the hook interface performance issue, have you thought about optimisation of our PSR-4 code? maybe we could cache a big regex or something