[09:59:45] effie: I'd need another mw host for reimage tests, what's the next one I can sequester?
[10:00:47] * effie checks santa's naughty or nice list
[10:01:06] mw2229.codfw.wmnet
[10:01:17] mw2228.codfw.wmnet was successful?
[10:01:55] effie: yeah it was! it is already repooled and crossed off in the task
[10:02:04] thank you, I'll take 2229
[12:02:03] jbond42: re: https://gerrit.wikimedia.org/r/c/operations/puppet/+/554045 can we add the sni to the certificate instead?
[12:02:36] ah I guess it'd change based on the host
[12:05:39] godog: the cert already has idp[12]001.wikimedia.org so adding idp[12]001 is not a big issue if that's the preference
[12:06:15] jbond42: bah that's fine too, I +1'd the patch
[12:06:23] ok cheers
[13:48:45] there seem to be exceptions ongoing for mediawiki
[13:49:19] interestingly mostly from mw2* hosts
[13:49:54] MediaWiki::restInPeace: transaction round 'MediaWiki\Linter\RecordLintJob::run' still running
[13:51:07] ok, codfw jobrunners
[13:51:19] effie: just to confirm, you are not working on them right?
[13:52:31] currently no one is working on mw servers in codfw
[13:52:41] this is weird
[13:53:13] yeah I checked and the mw2 hosts alarming have a lot of days of uptime
[13:54:45] https://grafana.wikimedia.org/d/ePFPOkqiz/eventgate?orgId=1&refresh=1m&from=now-3h&to=now&var-dc=codfw%20prometheus%2Fk8s&var-service=eventgate-main&var-kafka_topic=codfw_mediawiki_job_RecordLintJob&var-kafka_broker=All&var-kafka_producer_type=All
[13:55:35] no idea if it's a consequence of anything else, but from the exception errors I see RecordLint mentioned, and the codfw metrics showed a big spike
[13:56:17] no such thing in eqiad though?
[13:58:11] I lost track of when mw2* hosts are used, don't recall if anything (like changeprop) uses mw2 jobrunners to avoid impacting eqiad
[14:00:21] marostegui: are you around?
[14:00:24] elukey: I have the same question
[14:00:38] I don't recall anything using codfw jobrunners
[14:01:39] elukey: what's up
[14:02:21] marostegui: hola :) there are some exceptions for mw2 jobrunners in codfw, and it seems related to a record lint job.. I am wondering if you had seen this issue before
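(A minimal sketch, not from the log, of how one might pull a few of the RecordLintJob events behind that eventgate graph straight from Kafka to inspect the job payload; the broker address and consumer settings are placeholders, and the topic name is inferred from the dashboards linked above.)

```python
# Sketch: peek at a handful of RecordLintJob events to see what the jobs look like.
# Broker address below is a placeholder, not a value taken from the log.
import json
from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "codfw.mediawiki.job.RecordLintJob",           # topic name inferred from the dashboards
    bootstrap_servers=["kafka-broker.example:9092"],  # hypothetical broker
    auto_offset_reset="earliest",
    consumer_timeout_ms=10000,
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for i, msg in enumerate(consumer):
    # pretty-print the job spec (truncated) so the parameters are readable
    print(json.dumps(msg.value, indent=2)[:2000])
    if i >= 4:  # only look at a few events
        break
consumer.close()
```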
[14:02:29] https://grafana.wikimedia.org/d/ePFPOkqiz/eventgate?orgId=1&refresh=1m&from=now-3h&to=now&var-dc=codfw%20prometheus%2Fk8s&var-service=eventgate-main&var-kafka_topic=codfw_mediawiki_job_RecordLintJob&var-kafka_broker=All&var-kafka_producer_type=All
[14:02:34] uff sorry wrong link
[14:02:41] /srv/mediawiki/php-1.35.0-wmf.5/includes/libs/rdbms/lbfactory/LBFactory.php: MediaWiki::restInPeace: transaction round 'MediaWiki\Linter\RecordLintJob::run' still running
[14:03:17] elukey: I have seen similar ones: https://phabricator.wikimedia.org/T228911
[14:03:39] elukey: Not particularly related to Linter
[14:03:49] But I have seen errors of "transactions still running"
[14:04:06] oh joy
[14:04:29] ok I will create a task and mention that
[14:06:34] it seems to be changeprop, from one mw2 jobrunner's access log
[14:12:48] this graph is nice
[14:12:50] https://grafana.wikimedia.org/d/000000234/kafka-by-topic?orgId=1&from=now-3h&to=now&refresh=5m&var-datasource=codfw%20prometheus%2Fops&var-kafka_cluster=main-codfw&var-kafka_broker=All&var-topic=codfw.cpjobqueue.retry.mediawiki.job.RecordLintJob
[14:13:25] changeprop is probably getting the failure and then adding the job over and over to the retry queue
[14:15:55] I tried to get the spec of the job from kafkacat but I don't really understand how to read it :D
[14:26:07] so just to sum up, it seems that an issue with MW core is causing changeprop to retry the same problematic job over and over
[14:26:14] this is my understanding
[14:26:37] so to break the tie, we'd need some help from Marko or Petr probably
[14:26:44] any other thoughts?
[14:29:23] no I think that is all
[14:30:13] any deployments marko did before that were Parsoid Proxy ones
[14:30:50] cpt is switching some wikis to parsoid php
[15:28:12] effie: do we know why the api was slow? the dashboard looks horrible (latency wise) :(
[15:29:08] was? it still is
[15:29:10] and no, sadly
[15:29:20] I am restarting the servers with more cpu load
[15:29:26] and see how it goes
[16:39:23] we seem to have a serious issue?
[16:41:29] mhh maybe the api, post switching to parsoid/php?
[16:43:34] just out of a meeting
[16:43:36] what's the status?
[16:45:13] effie probably knows best, seems to be working on related things
[16:46:59] effie: do you need help?
[16:47:25] cdanis: there is not much, I think our next step is to install systemd coredump on one host
[16:47:41] as it appears that at least one host I checked tried to
[16:47:45] and see what is up
[16:47:46] did... did we segv php-fpm across the entire cluster?
[16:47:53] yes
[16:47:59] lovely
[16:48:10] response was at 8s
[16:48:26] so that was practically down
[16:48:43] php slow log was not useful
[16:48:58] and now we could parse a chunk of API log
[16:49:01] 75%ile peaked at something like 12 seconds lol
[16:52:57] effie: I still see a ~4% error rate on POSTs
[16:53:19] let me know if I can help
[16:53:28] cdanis: we are still restarting
[16:53:54] ok just finished
[16:55:15] I think the recent restbase deploy to switch to parsoid/php has sth to do with it, most of the 500s are for zhwiki https://logstash.wikimedia.org/goto/3ca23a801287e69f24ac14383d8a877d
[16:56:13] ok we are done
[16:56:37] did the issue start around 16:27? That's the SAL entry time for mobrovac@deploy1001: Started deploy [restbase/deploy@3516382]: Switch ru, sr and zh wikipediae to Parsoid/PHP - T229015
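(A minimal sketch of how one could check that timing against Prometheus by plotting the 5xx rate over the incident window; the Prometheus URL and the metric/label names here are assumptions, not values taken from the dashboards above.)

```python
# Sketch: query the Prometheus query_range API for the 5xx rate and print it
# per minute, to pin down when the error spike actually started.
import time
from datetime import datetime, timezone

import requests

PROM = "http://prometheus.example:9090"  # hypothetical Prometheus endpoint
# Metric and labels are assumed placeholders, not the real appserver metric names.
query = 'sum(rate(http_requests_total{cluster="appserver", code=~"5.."}[5m]))'

end = time.time()
start = end - 3 * 3600  # adjust to the incident window (spike suspected around 16:27 UTC)

resp = requests.get(
    f"{PROM}/api/v1/query_range",
    params={"query": query, "start": start, "end": end, "step": "60s"},
    timeout=10,
)
resp.raise_for_status()

result = resp.json()["data"]["result"]
if not result:
    print("no data for that query")
else:
    for ts, value in result[0]["values"]:
        print(datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%H:%M"), value)
```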
[16:56:38] T229015: Tracking: Direct live production traffic at Parsoid/PHP - https://phabricator.wikimedia.org/T229015
[16:56:48] shortly after ema
[16:56:55] https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&from=now-30m&to=now&var-datasource=eqiad%20prometheus%2Fops&var-cluster=api_appserver&var-method=GET&var-code=200&fullscreen&panelId=20
[16:56:58] ema: https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&from=1575299208252&to=1575305787601&fullscreen&panelId=20&var-datasource=eqiad%20prometheus%2Fops&var-cluster=appserver&var-method=GET&var-code=200
[16:57:27] latency has somewhat recovered but appserver error rate has not
[16:57:41] ok let's see errors
[16:58:46] referer: envoy
[16:59:01] but maybe that is normal
[16:59:10] yes
[16:59:54] I've asked mobrovac to join us here
[17:59:49] <_joe_> the error rate has recovered, and it was equal across all clusters
[17:59:56] <_joe_> api, appserver, parsoid
[18:01:13] <_joe_> that's because that graph is not by cluster, meh
[18:02:51] <_joe_> the errors were all on parsoid in fact
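(A minimal sketch of how one might confirm that breakdown from Logstash/Elasticsearch by aggregating recent 5xx events per cluster; the endpoint, index pattern, and field names are assumptions rather than the real Logstash schema.)

```python
# Sketch: terms aggregation of 5xx log events by cluster, to see whether the
# errors landed on parsoid or on the api/appserver clusters.
import requests

ES = "http://logstash-es.example:9200"  # hypothetical Elasticsearch endpoint
query = {
    "size": 0,
    "query": {
        "bool": {
            "filter": [
                {"range": {"@timestamp": {"gte": "now-3h"}}},
                {"range": {"http_status": {"gte": 500}}},  # assumed field name
            ]
        }
    },
    "aggs": {
        "by_cluster": {"terms": {"field": "cluster.keyword", "size": 10}}  # assumed field name
    },
}

resp = requests.post(f"{ES}/logstash-*/_search", json=query, timeout=10)
resp.raise_for_status()
for bucket in resp.json()["aggregations"]["by_cluster"]["buckets"]:
    print(f'{bucket["key"]}: {bucket["doc_count"]}')
```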