[09:59:45] effie: I'd need another mw host for reimage tests, what's the next one I can sequester?
[10:00:47] * effie checks santa's naughty or nice list
[10:01:06] mw2229.codfw.wmnet
[10:01:17] mw2228.codfw.wmnet was successful?
[10:01:55] effie: yeah it was! it is already repooled and crossed off in the task
[10:02:04] thank you, I'll take 2229
[12:02:03] jbond42: re: https://gerrit.wikimedia.org/r/c/operations/puppet/+/554045 can we add the sni to the certificate instead?
[12:02:36] ah I guess it'd change based on the host
[12:05:39] godog: the cert already has idp[12]001.wikimedia.org so adding idp[12]001 is not a big issue if that's the preference
[12:06:15] jbond42: bah that's fine too, I +1'd the patch
[12:06:23] ok cheers
[13:48:45] there seem to be exceptions ongoing for mediawiki
[13:49:19] interestingly mostly from mw2* hosts
[13:49:54] MediaWiki::restInPeace: transaction round 'MediaWiki\Linter\RecordLintJob::run' still running
[13:51:07] ok, codfw jobrunners
[13:51:19] effie: just to confirm, you are not working on them right?
[13:52:31] currently no one is working on mw servers in codfw
[13:52:41] this is weird
[13:53:13] yeah I checked and the mw2 hosts alarming have a lot of days of uptime
[13:54:45] https://grafana.wikimedia.org/d/ePFPOkqiz/eventgate?orgId=1&refresh=1m&from=now-3h&to=now&var-dc=codfw%20prometheus%2Fk8s&var-service=eventgate-main&var-kafka_topic=codfw_mediawiki_job_RecordLintJob&var-kafka_broker=All&var-kafka_producer_type=All
[13:55:35] no idea if it's a consequence of anything else, but from the exception errors I see RecordLint mentioned, and the codfw metrics showed a big spike
[13:56:17] no such thing in eqiad though?
[13:58:11] I lost track of when mw2* hosts are used, don't recall if anything (like changeprop) uses mw2 jobrunners to avoid impacting eqiad
[14:00:21] marostegui: are you around?
[14:00:24] elukey: I have the same question
[14:00:38] I don't recall anything using codfw jobrunners
[14:01:39] elukey: what's up
[14:02:21] marostegui: hola :) there are some exceptions for mw2 jobrunners in codfw, and it seems related to a record lint job.. I am wondering if you had seen this issue before
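(A minimal sketch, not from the log, of how one might pull a few of the RecordLintJob events behind that eventgate graph straight from Kafka to inspect the job payload; the broker address and consumer settings are placeholders, and the topic name is inferred from the dashboards linked above.)

```python
# Sketch: peek at a handful of RecordLintJob events to see what the jobs look like.
# Broker address below is a placeholder, not a value taken from the log.
import json
from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "codfw.mediawiki.job.RecordLintJob",           # topic name inferred from the dashboards
    bootstrap_servers=["kafka-broker.example:9092"],  # hypothetical broker
    auto_offset_reset="earliest",
    consumer_timeout_ms=10000,
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for i, msg in enumerate(consumer):
    # pretty-print the job spec (truncated) so the parameters are readable
    print(json.dumps(msg.value, indent=2)[:2000])
    if i >= 4:  # only look at a few events
        break
consumer.close()
```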
[14:02:29] https://grafana.wikimedia.org/d/ePFPOkqiz/eventgate?orgId=1&refresh=1m&from=now-3h&to=now&var-dc=codfw%20prometheus%2Fk8s&var-service=eventgate-main&var-kafka_topic=codfw_mediawiki_job_RecordLintJob&var-kafka_broker=All&var-kafka_producer_type=All
[14:02:34] uff sorry wrong link
[14:02:41] /srv/mediawiki/php-1.35.0-wmf.5/includes/libs/rdbms/lbfactory/LBFactory.php: MediaWiki::restInPeace: transaction round 'MediaWiki\Linter\RecordLintJob::run' still running
[14:03:17] elukey: I have seen similar ones: https://phabricator.wikimedia.org/T228911
[14:03:39] elukey: Not particularly related to Linter
[14:03:49] But I have seen errors of "transactions still running"
[14:04:06] oh joy
[14:04:29] ok I will create a task and mention that
[14:06:34] it seems to be changeprop, from one mw2 jobrunner's access log
[14:12:48] this graph is nice
[14:12:50] https://grafana.wikimedia.org/d/000000234/kafka-by-topic?orgId=1&from=now-3h&to=now&refresh=5m&var-datasource=codfw%20prometheus%2Fops&var-kafka_cluster=main-codfw&var-kafka_broker=All&var-topic=codfw.cpjobqueue.retry.mediawiki.job.RecordLintJob
[14:13:25] changeprop is probably getting the failure and then adding the job over and over to the retry queue
[14:15:55] I tried to get the spec of the job from kafkacat but I don't really understand how to read it :D
[14:26:07] so just to sum up, it seems that an issue with MW core is causing changeprop to retry the same problematic job over and over
[14:26:14] this is my understanding
[14:26:37] so to break the tie, we'd need some help from Marko or Petr probably
[14:26:44] any other thoughts?
[14:29:23] no I think that is all
[14:30:13] any deployments marko did before that were Parsoid Proxy ones
[14:30:50] cpt is switching some wikis to parsoid php
[15:28:12] effie: do we know why the api was slow? the dashboard looks horrible (latency wise) :(
[15:29:08] was? it still is
[15:29:10] and no, sadly
[15:29:20] I am restarting the servers with more cpu load
[15:29:26] and see how it goes
[16:39:23] we seem to have a serious issue?
[16:41:29] mhh maybe the api, post switching to parsoid/php?
[16:43:34] just out of a meeting
[16:43:36] what's the status?
[16:45:13] effie probably knows best, seems to be working on related things
[16:46:59] effie: do you need help?
[16:47:25] cdanis: there is not much, I think our next step is to install systemd coredump on one host
[16:47:41] as it appears that at least one host I checked tried to
[16:47:45] and see what is up
[16:47:46] did... did we segv php-fpm across the entire cluster?
[16:47:53] yes
[16:47:59] lovely
[16:48:10] response was at 8s
[16:48:26] so that was practically down
[16:48:43] php slow log was not useful
[16:48:58] and now we could parse a chunk of API log
[16:49:01] 75%ile peaked at something like 12 seconds lol
[16:52:57] effie: I still see a ~4% error rate on POSTs
[16:53:19] let me know if I can help
[16:53:28] cdanis: we are still restarting
[16:53:54] ok just finished
[16:55:15] I think the recent restbase deploy to switch to parsoid/php has sth to do with it, most of the 500s are for zhwiki https://logstash.wikimedia.org/goto/3ca23a801287e69f24ac14383d8a877d
[16:56:13] ok we are done
[16:56:37] did the issue start around 16:27? That's the SAL entry time for mobrovac@deploy1001: Started deploy [restbase/deploy@3516382]: Switch ru, sr and zh wikipediae to Parsoid/PHP - T229015
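(A minimal sketch of how one could check that timing against Prometheus by plotting the 5xx rate over the incident window; the Prometheus URL and the metric/label names here are assumptions, not values taken from the dashboards above.)

```python
# Sketch: query the Prometheus query_range API for the 5xx rate and print it
# per minute, to pin down when the error spike actually started.
import time
from datetime import datetime, timezone

import requests

PROM = "http://prometheus.example:9090"  # hypothetical Prometheus endpoint
# Metric and labels are assumed placeholders, not the real appserver metric names.
query = 'sum(rate(http_requests_total{cluster="appserver", code=~"5.."}[5m]))'

end = time.time()
start = end - 3 * 3600  # adjust to the incident window (spike suspected around 16:27 UTC)

resp = requests.get(
    f"{PROM}/api/v1/query_range",
    params={"query": query, "start": start, "end": end, "step": "60s"},
    timeout=10,
)
resp.raise_for_status()

result = resp.json()["data"]["result"]
if not result:
    print("no data for that query")
else:
    for ts, value in result[0]["values"]:
        print(datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%H:%M"), value)
```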
[16:56:38] T229015: Tracking: Direct live production traffic at Parsoid/PHP - https://phabricator.wikimedia.org/T229015
[16:56:48] shortly after ema
[16:56:55] https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&from=now-30m&to=now&var-datasource=eqiad%20prometheus%2Fops&var-cluster=api_appserver&var-method=GET&var-code=200&fullscreen&panelId=20
[16:56:58] ema: https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&from=1575299208252&to=1575305787601&fullscreen&panelId=20&var-datasource=eqiad%20prometheus%2Fops&var-cluster=appserver&var-method=GET&var-code=200
[16:57:27] latency has somewhat recovered but appserver error rate has not
[16:57:41] ok let's see errors
[16:58:46] referer: envoy
[16:59:01] but maybe that is normal
[16:59:10] yes
[16:59:54] I've asked mobrovac to join us here
[17:59:49] <_joe_> the error rate has recovered, and it was equal across all clusters
[17:59:56] <_joe_> api, appserver, parsoid
[18:01:13] <_joe_> that's because that graph is not by cluster, meh
[18:02:51] <_joe_> the errors were all on parsoid in fact
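(A minimal sketch of how one might confirm that breakdown from Logstash/Elasticsearch by aggregating recent 5xx events per cluster; the endpoint, index pattern, and field names are assumptions rather than the real Logstash schema.)

```python
# Sketch: terms aggregation of 5xx log events by cluster, to see whether the
# errors landed on parsoid or on the api/appserver clusters.
import requests

ES = "http://logstash-es.example:9200"  # hypothetical Elasticsearch endpoint
query = {
    "size": 0,
    "query": {
        "bool": {
            "filter": [
                {"range": {"@timestamp": {"gte": "now-3h"}}},
                {"range": {"http_status": {"gte": 500}}},  # assumed field name
            ]
        }
    },
    "aggs": {
        "by_cluster": {"terms": {"field": "cluster.keyword", "size": 10}}  # assumed field name
    },
}

resp = requests.post(f"{ES}/logstash-*/_search", json=query, timeout=10)
resp.raise_for_status()
for bucket in resp.json()["aggregations"]["by_cluster"]["buckets"]:
    print(f'{bucket["key"]}: {bucket["doc_count"]}')
```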