[11:23:49] moritzm: i am entertained that you typo'd the word 'typo' :) [11:28:57] yeah :-) [12:42:05] rzl: I was checking https://wikitech.wikimedia.org/wiki/Switch_Datacenter, on phase 6, it says es2-es3, but in reality it needs to be es4 and es5, I remember we discussed that in a meeting, just making sure it is just a matter of wikitech not being updated? [12:44:03] marostegui: I double checked and the automation was updated [12:44:11] ok [12:44:20] as in, mysql_legacy & read only [12:45:08] I remember +1ing a change from volans, that's why I am asking if it is just wikitech not being up-to-date [12:45:14] as that is likely a copy&paste from the previous DC [12:45:29] probably- it has happened to me, docs lag behind [12:46:03] I don't want to enter too deep into this, but I don't know how to keep names of servers updated on wikitech [12:46:16] when you know they only will last 5 year tops [12:47:58] rzl: this is the change https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/576297, so yes, it is just wikitech, I will go ahead and update it [12:47:59] marostegui: I don't recall updating wikitech [12:48:10] the mysql_legacy module uses the new aliases made by kormat [12:48:19] so at the best of my knowledge they are right [12:48:32] volans: yep, I just updated wiktiech [12:48:34] but if you're not 100% sure I can bring the logs from the live-test we did last week [12:48:35] wikitech [12:48:41] and you can double check them [12:48:56] volans: this is enough I think https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/576297/3/spicerack/mysql.py [12:49:05] I remember rzl showing me the logs [12:49:25] ok [12:49:43] And I specifically checked for those two, and they were correct (I just double checked it again) [12:49:49] marostegui: thanks! [12:52:45] anyone else having issues with gerrit over ssh? [12:53:00] https://phabricator.wikimedia.org/P12427 [12:53:25] let me check same repo [12:53:25] i've tried cloning that repo 3 times. once over http, which worked. and twice over ssh, which failed like ^ both times [12:53:28] kormat: fwiw I'd recommend setting up your .gitconfig to always pull over https [12:53:42] yeah, I actually do that, it is way faster [12:54:00] https://phabricator.wikimedia.org/P8871 [12:54:29] it is a big latency improvement, especially the further you get from gerrit [12:55:24] aaand gerrit just started working for me again [12:55:34] i had a git pull on an existing repo hang for minutes which just completed [12:55:35] I was going to say, it went rather fast for taht repo now [12:55:52] kormat: I've seen somtimes the scheduler block [12:56:07] because the queue is doing expensive operations for many people [12:56:07] Good luck everyone for the switch [12:56:33] in the past, to debug we entered ps on gerrit [12:56:42] jynus: huh, interesting [12:56:54] ps as in the application scheduler, not the host [12:57:18] a couple of kills on blocked hosts and things started to flow :-) [12:57:26] kormat: good thing we don't depend on gerrit anymore for the switchdc :D [12:57:40] I have not seen any of those in the last months however [12:57:50] volans: so it's only a bad thing the other 364 days a year then ;) [12:59:23] eheh [14:10:09] _joe_: ema: every 500 i've looked at is restbase [14:10:10] <_joe_> ok so [14:10:10] all: let's discuss here instead [14:10:15] <_joe_> things seem ok to me [14:10:25] is this the thing where nodejs doesn't know how to re-resolve its backends? [14:10:43] <_joe_> cdanis: I doubt it, there is envoy there now [14:11:01] 500s recovering [14:11:13] <_joe_> also I suspect that's a rate over 5 minutes [14:11:36] "description": "Search is currently too busy. Please try again later.", [14:11:37] <_joe_> the errors were mostly with api [14:11:44] <_joe_> yes that too [14:11:55] so I guess we didn't do any cirrussearch warmups in codfw? [14:11:58] but i thought that was a/a [14:12:03] s1 enwiki still behaving badly [14:12:18] marostegui: what does that mean? [14:12:31] mark: response time is still high there on the dbs [14:12:31] all 500s seem to be restbase indeed, and the're about the "related" functionality [14:12:35] eg: https://de.wikipedia.org/api/rest_v1/page/related/Krampfader [14:12:37] but they are not lagging, so that's good [14:12:46] I guess they need to get their buffer fully full [14:12:47] avg. response time on the "appservers red dashboard" is much lower than it was before the switch at around 133ms. that seems good [14:12:51] <_joe_> ema: but they'r egoing away [14:12:56] <_joe_> mutante: that was expected [14:12:57] marostegui: it looks like s8 to me on mysql aggregated [14:13:06] mutante: it got a little worse yesterday when we moved services to codfw, now they're colocated again [14:13:23] <_joe_> ok so [14:13:24] ack. and confirmed edit on mobile apps [14:13:32] cdanis: yep that too, but same as enwiki, not lagging so that's "good" [14:13:52] I am not going to touch weights yet [14:13:52] <_joe_> those restbase urls now all work [14:13:57] <_joe_> I say go on and finish the job [14:14:02] <_joe_> we need maintenance to restart [14:14:19] I'll start phase 8 then, objections? [14:14:23] rzl: fwiw read-only period was 2:49 [14:14:24] +1 [14:14:29] marostegui: do you have https://grafana.wikimedia.org/d/XyoE_N_Wz/wikidata-database-cpu-saturation?orgId=1 but for codfw? :D [14:14:32] mcrouter metrics look good [14:14:36] going [14:14:39] cdanis: no :( [14:15:05] <_joe_> volans: how much better than last time? [14:15:09] going to reduce a bit db2083's weight [14:15:29] ttls restored, starting maintenance jobs [14:16:13] _joe_: can't recall, but we don't have puppet now [14:16:29] <_joe_> we do, for the wikidata dispatcher [14:16:42] s8 is looking good now [14:16:49] https://grafana.wikimedia.org/d/000000278/mysql-aggregated?viewPanel=11&var-site=codfw&var-group=All&var-shard=All&var-role=All [14:17:03] all of the problems we had, tbh, seem like warmup artifacts [14:17:05] restbase errors still there [14:17:44] "During the most critical part of the switch [14:17:44] today, the wikis were in read-only mode for a duration of 4 minutes [14:17:44] and 41 seconds." [14:17:48] that's from 2018 [14:18:12] db land is looking good [14:18:13] forcing a refresh of the alerting db hosts [14:18:13] ema: it's still mostly /related/ urls [14:18:17] indeed [14:18:17] jynus: parsercache ok? [14:18:20] maintenance on mwmaint2001 looks good [14:18:21] mwmaint2001 has wikidata dispatch crons but no other maint crons [14:18:22] kormat: thanks [14:18:31] <_joe_> ema: and all of those url work when I try again [14:18:34] mutante: correct, dispatch job is the only maintenance cron left [14:18:35] marostegui: yes [14:18:36] what team supports cirrussearch / related page feature? [14:18:38] discovery? [14:18:43] everything else is in systemd now [14:18:44] jynus: excellent thanks [14:18:45] <_joe_> yes [14:18:50] <_joe_> cdanis: yes [14:18:56] I also saw some mobileapps failures in restbase [14:19:00] marostegui: 08-update-tendril done, tendril lgty? [14:19:06] rzl: checking [14:19:20] I see db2* roots [14:19:27] akosiaris: i think the switchback was faster, no? [14:19:47] mark: that was the absolute best up to then and it was the switchback [14:19:50] rzl: parsercache (pc1,p2 and pc3) should have been changed too, but they were not, that's not important [14:19:52] the switchover was [14:19:59] 7~8 IIRC [14:20:00] * cdanis asking in #-discovery [14:20:05] "That's a significant improvement over the 7 mins and [14:20:05] 34 seconds we achieved during the inverse process we concluded a month [14:20:05] ago, which was already significantly better than last year" [14:20:08] marostegui: ack, noted for later [14:20:16] rzl: ok [14:20:28] rzl: the rest looks good [14:20:36] thanks [14:20:39] marostegui: isn't pc1 section missing from tendril? [14:20:45] _joe_: confirmed, yes. Re-trying the failing URLs seems to work all the time [14:20:52] jynus: I can see it there [14:21:00] ok, artifact that tendril sometimes does [14:21:05] on my screen [14:21:06] yep [14:21:09] seeing mw fatals still elevated at https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?orgId=1&var-datasource=eqiad%20prometheus%2Fops&from=now-1h&to=now [14:21:26] should be recovered by now I think? [14:21:34] databases are stable now, everything seems to be fine [14:21:45] mobileapps error rate just checked, it looks normal [14:22:03] wikifeeds is also pretty ok. It did note the switchover but it's fine now [14:22:19] from logstash, some "you can't edit right now" errors but they may be backlogged [14:22:29] marostegui: something is wrong with the db r-o monitoring [14:22:50] need a hand? [14:22:52] what the heck is 'servergroup other'? [14:22:54] kormat: let's downtime those masters for 1d and we can check without pages [14:22:57] marostegui: the eqiad masters _should be_ read-only, correct? [14:23:00] ok [14:23:02] krocorrect [14:23:04] yes [14:23:06] kormat: correct [14:23:10] ok. i'll dt them. [14:23:22] rzl: the most recent of the "you can't edit right now" errors are from eqiad jobrunners [14:23:28] long-running jobs still lingering there? [14:23:34] oh good eye [14:23:43] godog: ^ that's your fatals probably [14:23:57] I assume jobrunners are servergroup 'other' [14:24:00] didn't we confirm that all maintenance jobs were turned off? [14:24:14] rzl cdanis agreed, looks like mw eqiad only, I'm looking at https://logstash-next.wikimedia.org/goto/422135a886dc3bc51f5cebda2e395239 [14:24:19] https://logstash.wikimedia.org/goto/1881560cae71490b01001e989d6cd934 [14:24:31] <_joe_> cdanis: uh that's bad [14:25:09] <_joe_> can someone check the dns records for jobrunners please? [14:25:37] <_joe_> yes we're still getting traffic in eqiad [14:25:51] jobrunner.discovery.wmnet is 10.2.1.26, jobrunner.svc.codfw.wmnet [14:25:54] <_joe_> hnowlan: around? [14:25:56] maches jobrunner.svc.codfw.wmnet. [14:25:57] _joe_: what do you mean? [14:26:09] I'm here too, looking at DNS [14:26:13] <_joe_> so the hosts are correct [14:26:15] <_joe_> in dns [14:26:24] <_joe_> it must be cpjobqueue doing something wrong [14:26:35] https://config-master.wikimedia.org/pybal/codfw/jobrunner looks correct also [14:26:37] templates/wmnet:mwmaint 300 IN CNAME mwmaint1002.eqiad.wmnet. [14:26:40] _joe_: yo [14:27:02] right, we have some that are not confd/discovery-controlled [14:27:03] apergos: that's expected, should be unrelated to the jobrunners if I understand [14:27:07] dumb q, does cpjobqueue run on scb still or on k8s nowadays [14:27:07] CNAMEs like that [14:27:26] <_joe_> cdanis: k8s [14:27:41] <_joe_> hnowlan: it seems cpjobqueue is still calling eqiad's jobrunners [14:27:41] cdanis: k8s [14:27:43] <_joe_> at least partly [14:27:49] persistent conns? [14:27:55] marostegui: ok, all affected db masters downtimed. [14:27:57] <_joe_> oh possibly? [14:28:01] <_joe_> bblack: let me test that [14:28:02] kormat: thanks [14:28:08] or long running jobs? [14:28:34] kormat: thank you [14:28:49] _joe_: it should be using jobrunner.discovery.wmnet but there's a chance it's cached bad records [14:29:06] <_joe_> yes it worked [14:29:12] <_joe_> bblack: you were right [14:29:16] persistent conns eh? [14:29:22] <_joe_> uh no nevermind [14:29:24] <_joe_> it didn't [14:29:37] <_joe_> hnowlan: can we please do a redeploy of cpjobqueue? [14:29:48] yep, on it [14:29:49] <_joe_> can someone verify if the codfw jobrunners are receiving traffic? [14:30:15] <_joe_> hnowlan: wait, I see no more requests now in fact [14:30:16] other possibly-notable CNAMEs that still point at eqiad manually (but I don't *think* any of them are the/a problem here): [14:30:19] eventlogging 1H IN CNAME eventlog1002.eqiad.wmnet. [14:30:21] udplog 1H IN CNAME mwlog1001.eqiad.wmnet. [14:30:24] deployment 5M IN CNAME deploy1001.eqiad.wmnet. [14:30:26] maintenance 5M IN CNAME mwmaint1002.eqiad.wmnet. [14:30:40] <_joe_> yes irrelevant [14:30:40] eventlog1002 is fine, it is eqiad only anyway (consumes from kafka-jumbo) [14:31:01] <_joe_> rzl: can you log onto a codfw jobrunner and look at the apache log? [14:31:07] yeah fatals are dropping afaics [14:31:11] <_joe_> yes [14:31:13] <_joe_> ok [14:31:15] _joe_: ack, looks from the CPU/network graphs as if they're serving [14:31:18] checking apache now [14:31:26] <_joe_> yeah so what happened is [14:31:32] check mw2243 and its getti9ng traffic [14:31:34] <_joe_> changeprop creates persistent connections [14:31:39] <_joe_> that we need to kill [14:31:42] mw2267 [14:31:44] <_joe_> we need another step in the switchover [14:31:45] down [14:31:50] we can do a redeploy pretty quickly [14:32:01] jobrunners in codfw are showing traffic in grafana [14:32:07] _joe_: should I goahead with the redeploy? [14:32:20] _joe_: confirmed activity on /var/log/apache2/jobqueue-access.log on mw2246 [14:32:25] <_joe_> hnowlan: nope hold your horses [14:32:28] ack [14:32:31] <_joe_> things should be ok [14:33:21] <_joe_> rzl: we need to add a "restart envoy on $dc_from jobrunners" step [14:33:29] nod [14:33:36] it's nice that we have a month :) [14:33:51] oh, envoy doesn't know to invalidate old conns when they aren't resolved in the hostname anymore? [14:34:03] or at least doesn't it have a timeout? [14:34:08] i hear petr contributes to envoy these days ;) [14:34:09] oh, no, I bet envoy isn't used by all clients [14:34:10] it probably never even checks [14:34:13] <_joe_> cdanis: I was wrong, changeprop doesn't use envoy [14:34:16] I think that's configurable but it may not be default [14:34:23] joe is saying to restart it *on the jobrunners* to force the non-envoy-client conns to be reset [14:34:25] it would be an unusual step anyways, I think, to re-check DNS on an established conn [14:34:26] <_joe_> but there is envoy on the jobrunners [14:34:31] <_joe_> cdanis: correct [14:34:42] bblack: well, it's constantly re-resolving those hostnames anyway, so it does know -- but I misunderstood [14:35:04] <_joe_> envoy DTRT when making http requests [14:35:07] <_joe_> it rechecks [14:35:08] yeah, envoy is built for load-balanced RPCs, so it's constantly redoing discovery [14:35:13] hmmm ok [14:35:14] <_joe_> and so does cpjobqueue [14:35:32] <_joe_> only cpjobqueue checks when it has to start a new connection [14:35:41] <_joe_> and envoy is very good at keeping connections persistent [14:35:50] <_joe_> which btw is also a problem with load-balancing [14:39:07] okay, anything else still ongoing? restbase 5xxs are still up around 10ish/sec, but we're comfortable with it, is that right? [14:39:14] I see some icinga warnings [14:39:16] Stale template error files present for '/var/lib/gdnsd/discovery-jobrunner.state' [14:39:16] rzl: it is a warmup artifact [14:39:28] still a lot cirrussearch errors inrestbase [14:39:29] rzl: talked with dcausse about it and added a note in the doc [14:39:35] jayme: ^ [14:39:37] cdanis: ack, thanks [14:39:45] I assume the deployment. CNAME will be switched at some point, right? no rush since no deploys today, but just checking. [14:39:59] Krinkle: Tyler and I have chatted about that, we might or might not [14:40:04] (and if not automatically, also the motd message and scap lock) [14:40:20] <_joe_> about what? [14:40:22] but either way let's finish tidying up the MW switch proper first [14:40:24] rzl: jayme: this capacity cache https://w.wiki/a$6 is supposed to be replicated cross-DC, and apparently that isn't working [14:40:46] _joe_: about whether to switch the deployment server later this week [14:41:09] cdanis: that's the warmup issue you were talking about, right? [14:41:15] yes [14:41:29] 👍 [14:41:34] <_joe_> rzl: sorry I have horrible lag [14:41:50] exec_error (cpjobqueue, changeprop) in restbase are down es well [14:41:59] _joe_: jetlag from the trip to Dallas, no doubt [14:42:20] <_joe_> jayme: as expected [14:42:41] any other unresolved issues? [14:43:00] <_joe_> no I think the elastic one is the only one [14:43:01] rzl: some of our documentation is still out of date [14:43:18] <_joe_> I think we can page dcausse and ryan and gehel [14:43:21] cdanis: mmm, I don't think you're right, it doesn't say so in the status doc [14:43:31] _joe_: they are aware [14:43:44] <_joe_> anything else left? [14:43:52] sounds like that's all :) [14:43:55] off to the pub? [14:43:56] nice work everybody [14:44:00] woo hoo! [14:44:02] ema: intereting, the cookbook does remove the stale confd files [14:44:10] yes, nicely done, really not bad after 2 years! :) [14:44:10] I'll fill in the timeline in the status doc later today [14:44:12] and well-run [14:44:13] <_joe_> we cut the ro time down to the bare minimum [14:44:16] network wise everything is as usual [14:44:17] please do check over everything else in there, and add any AIs [14:44:24] good job everybody! [14:44:29] rzl: nice job for the train conductor as well :) [14:44:32] awesome well done all [14:44:33] <_joe_> I don't think we can go below 2:30 of read-only without risking [14:44:48] we could automate the save edit step [14:44:59] but it gets a bit messy ;) [14:45:18] 👏 [14:45:35] \o/ [14:45:38] good job :) [14:46:04] anyone know if it's possible for `mediawiki::state('primary_dc')` to be unstable? [14:46:09] icinga warnings about power supplies infeed load in codfw, but yellow, not red [14:46:44] kormat: theoretically? or ever having been observed? [14:46:51] mutante: what are the loads? [14:47:10] akosiaris: we have a puppet role that depends on the value, and it keeps coming up with different results [14:47:13] eheh [14:47:28] codfw PDUs went from ~79 kW to ~103kW? [14:47:36] mark: from 1285 to 1573 [14:47:51] that reads /etc/conftool-state/mediawiki.yaml [14:48:25] they are not in sync [14:48:27] depends on teh master [14:48:29] damn [14:48:39] puppetmaster1002.eqiad.wmnet: [14:48:40] read_only: codfw: true eqiad: false [14:48:45] (5) puppetmaster[2001-2003].codfw.wmnet,puppetmaster[1001,1003].eqiad.wmnet [14:48:48] iverted [14:48:50] is that another confd problem volans ? [14:48:56] I'll restart confd on puppetmaster1002.eqiad.wmnet [14:49:18] ack [14:49:25] Sep 01 14:48:58 puppetmaster1002 confd[13848]: 2020-09-01T14:48:58Z puppetmaster1002 /usr/bin/confd[13848]: ERROR client: etcd cluster is unavailable or misconf [14:49:28] Sep 01 14:48:58 puppetmaster1002 confd[13848]: ; error #1: x509: certificate has expired or is not yet valid [14:49:31] ahahah [14:49:39] :( [14:49:41] bbboooooo [14:49:46] but restarted fine [14:50:05] can we disable that server as a point of contact temporarilly? [14:50:06] kormat, jynus, marostegui: now it should be stable [14:50:09] ah [14:50:11] you made it work [14:50:29] volans: another case of better call volans! thank you [14:50:32] ok. let me re-run puppet on all the nodes. [14:50:33] so what got lagged is etcd there? [14:50:43] ^that is for volans [14:50:55] confd did not react to the etcd key change I guess? [14:50:57] jynus: confd stuck [14:51:01] :-( [14:51:14] ok, it only created spam [14:51:19] stuck? as in entirely stuck ? Why did we not catch this earlier? [14:51:20] but I think not user imapct [14:51:28] that is the positive thing [14:51:59] akosiaris: let's create a ticket to track the whole issue [14:52:49] akosiaris: was complaining about the cert [14:53:03] but [14:53:03] Active: active (running) since Wed 2019-10-16 11:20:02 UTC; 10 months 16 days ago [14:53:40] so, the puppet CA rollover [14:53:43] I think so [14:53:53] and wondering how many others we might have :D [14:53:53] see ya in 5 years then :P [14:53:56] lol [14:54:08] I will create a ticket to track the alert spam reaching to etcd [14:54:15] noted it on the doc [14:54:16] we stumbled across a few indeed [14:54:25] it's a gift that keeps on giving [14:54:43] ok so about mwmaint1002 [14:54:48] what is the deal with the captcha jobs from 2019 [14:55:28] someone still trying to solve them? [14:56:29] lvs2007 - one CPU core of an LVS is saturated- packet drops are likely (CPU=21) (warning only) [14:56:49] mutante: if that is just a blip, it means: puppet was runnig [14:57:43] cdanis: rescheduling that check to see if it was.. and it was. resolved [14:58:09] the warning threshold is a bit too low [15:01:13] someone is clearly searching for all those fire hydrants since 2019 [15:01:40] I'm rebooting mwmaint1002 on Thu (off tomorrow) [15:02:11] we need a better setup for that one in general, like two servers in each DC we can flip over or similar [15:02:25] +1, at the very least that :) [15:02:31] restbase is happy again [15:03:09] sorry I missed current status, are jobs/crons enabled now on codfw? [15:03:16] jynus: yes [15:03:25] thanks [15:03:34] ema: the hit rate on that cache is up from <1% to about 30%, so makes sense, that's a lot less traffic [15:04:10] in terms of elasticsearch-reported cache miss rps, we're only at 2x normal [15:04:29] _joe_: I just run a check and the 4 failing confd are/were: mwmaint2001.codfw.wmnet,mwmaint1002.eqiad.wmnet,puppetmaster2002.codfw.wmnet,puppetmaster1002.eqiad.wmnet [15:04:36] did you restart it in both mwmaint? [15:05:04] I'll check the other puppetmasters [15:05:04] <_joe_> no [15:05:13] <_joe_> but the data on mwmaint2001 seemed ok [15:05:20] my check is just a grep on syslog, so might be a red herring [15:05:22] <_joe_> volans: just restart confd everywhere [15:05:29] ok [15:05:31] <_joe_> on those servers [15:05:45] icinga question: how do you undowntime a host from this page: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=db1093 ? [15:06:03] there is a lit of downtimes [15:06:09] kormat: menu on the list [15:06:10] downtimes [15:06:11] you can select it and remove it [15:06:17] let me try to find the url [15:06:19] jynus: there are still about 140 icinga checks in state "pending" like when new checks are added that have never executed yet. they are all db/es/pc [15:06:27] mutante: when was the lvs2007 cpu issue [15:06:32] volans: what menu? [15:06:58] mutante: that is weird [15:07:07] those are not new checks, or shouldn't be [15:07:15] kormat: you want to instead go to this page https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=db1093# [15:07:36] kormat: and then, tickybox the host in he first group of tickboxes, and in the first menu, "Remove Downtime(s) For Checked Host(s) and All Services" [15:07:40] _joe_: no need, they were restarted 5h 14min ago [15:07:41] cdanis: that's the sort of dumb thing i was hoping to not be disappointed with [15:07:59] bblack: icinga logs show it was in SOFT/warn state since hours before the switch. it happened on each puppet run apparently [15:08:01] XDD [15:08:11] kormat: the menu in icinga homepage https://icinga.wikimedia.org/icinga/ :D [15:08:11] ok [15:08:22] mutante: bblack: yeah, the WARN threshold is a bit too low on that alert, puppet runs generally cause it [15:08:24] volans: ok, so _not_ on the page i linked :P [15:08:31] what is this? https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=all&type=detail&servicestatustypes=1 [15:08:37] mutante: bblack: it's also by nature pretty sensitive, as it is looking at the util % of every individual CPU [15:08:41] sorry, wrong channel [15:08:42] well we should finish the -cpu0 work for ipvs stuff [15:08:46] that would help, I bet [15:09:39] jynus: yea, that's what I mean. it's like they just got added about 22m ago. some eqiad, some codfw [15:10:11] https://phabricator.wikimedia.org/T236208 is the -cpu0 thing. rzl patched up one part of it a while back, but there's more to be done. [15:10:15] that makes no sense, unless it is a fall down of the confctl issue [15:10:19] cdanis: ack, confirmed it was talking about one specific CPU [15:10:25] mutante: looking into it [15:14:04] number is going down. they eventually just run. just that it went through that cycle once before [15:19:28] something is wrong with icinga [15:19:37] the url that jynus linked above was empty a minute ago [15:19:47] now it's full again of checks that icinga claims it hasn't done before [15:19:53] I can see it [15:20:00] are them the same? [15:20:02] or new ones? [15:20:29] i think different hosts [15:20:32] i can confirm it did this at least once before.. new checks in pending.. they become OK.. back in pending [15:20:44] both times they were all db, pc or es [15:20:47] mutante: before as in not on this maintenance? [15:20:49] but mixed dc [15:21:03] no, i mean earlier today, before i pinged you [15:21:04] let's check at icinga on puppetboard [15:21:32] yeah it's changing at each run, let me dig a bit [15:21:42] * volans running puppet on icinga [15:22:35] probably the hastag page changes [15:22:37] is my bet [15:22:47] removing or adding it due to primary dc changes [15:22:54] (we don't page on passive dbs) [15:23:08] sure, that could explain it, but not if happens multiple times [15:23:14] still getting [15:23:14] - contact_groups admins,sms,admins [15:23:14] + contact_groups admins [15:23:16] changes [15:23:21] - service_description mysqld processes #page [15:23:21] + service_description mysqld processes [15:23:34] volans: this sounds like mediawiki::state('primary_dc') again [15:23:36] are we still within 1h? [15:23:36] wow just using the bad word like that [15:23:45] ah sorry, for the hashtag [15:23:46] :) [15:23:47] volans: you paged us! ;P [15:23:53] via m.ark: https://grafana.wikimedia.org/d/000000429/backend-save-timing-breakdown?orgId=1&refresh=5m is still elevated, some of that is just long averages, maybe all? [15:23:55] even I avoided it! [15:24:07] volans: so.. what do you need captain? ;P [15:24:36] uh, p50 has doubled [15:24:41] so if we're within 1h of the last change thatcauses this it's still explained [15:24:48] all the hosts are running puppet and then icinga is catching up [15:25:08] but if it continues we have something weird, I guess that those last oneas are due to teh confd I restarted, were broken before [15:25:12] and need up to 1h to catchup [15:25:20] do we have insights on where is the performance loss? [15:25:23] kormat: we could re-run (in batch of 10) puppet on all the dbs [15:25:26] and the once on icinga [15:25:31] bear in mind that's a 7d graph, so some of the elevation is just when services and mw were split [15:25:32] to ensure we cleanup this earlier [15:25:32] volans: ah ok. looking at 2 of the hosts, they're not flapping. they've changed once. [15:25:42] volans: i'll do it. [15:25:46] yeah I think it's just a catching up effect [15:26:46] _joe_: can you check my grafana link above? trying to figure out why edit latency is still up [15:26:55] request # is equivalent pre-switchover [15:27:10] for gets, checking for posts [15:27:14] does /usr/bin/confd[23664]: ERROR 100: Key not found (/conftool/v1/services) ring any bell to anyone? [15:27:15] (talking mw app level) [15:27:36] seems unrelated but got them checking the status on all confd hosts [15:27:50] <_joe_> rzl: that's up since yesterday? [15:27:59] _joe_: yes, but then didn't fully recover [15:28:00] <_joe_> volans: yes, ignore [15:28:15] <_joe_> rzl: I mean it seems unrelated to the switchover [15:28:18] ack [15:28:45] <_joe_> oh you mean it went up with the switch of services [15:28:54] <_joe_> we're sure it's related? [15:29:09] % of responses <= 100ms went from 10 to 20 on codfw [15:29:13] (percent) [15:29:18] not sure, just got sent the graph and trying to figure out if it's related [15:29:31] <_joe_> the p75 started raising before we did switch the services [15:29:45]