[11:23:49] moritzm: i am entertained that you typo'd the word 'typo' :)
[11:28:57] yeah :-)
[12:42:05] rzl: I was checking https://wikitech.wikimedia.org/wiki/Switch_Datacenter, on phase 6, it says es2-es3, but in reality it needs to be es4 and es5, I remember we discussed that in a meeting, just making sure it is just a matter of wikitech not being updated?
[12:44:03] marostegui: I double checked and the automation was updated
[12:44:11] ok
[12:44:20] as in, mysql_legacy & read only
[12:45:08] I remember +1ing a change from volans, that's why I am asking if it is just wikitech not being up-to-date
[12:45:14] as that is likely a copy&paste from the previous DC
[12:45:29] probably - it has happened to me, docs lag behind
[12:46:03] I don't want to go too deep into this, but I don't know how to keep names of servers updated on wikitech
[12:46:16] when you know they will only last 5 years tops
[12:47:58] rzl: this is the change https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/576297, so yes, it is just wikitech, I will go ahead and update it
[12:47:59] marostegui: I don't recall updating wikitech
[12:48:10] the mysql_legacy module uses the new aliases made by kormat
[12:48:19] so to the best of my knowledge they are right
[12:48:32] volans: yep, I just updated wikitech
[12:48:34] but if you're not 100% sure I can bring the logs from the live-test we did last week
[12:48:41] and you can double check them
[12:48:56] volans: this is enough I think https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/576297/3/spicerack/mysql.py
[12:49:05] I remember rzl showing me the logs
[12:49:25] ok
[12:49:43] And I specifically checked for those two, and they were correct (I just double checked it again)
[12:49:49] marostegui: thanks!
[12:52:45] anyone else having issues with gerrit over ssh?
[12:53:00] https://phabricator.wikimedia.org/P12427
[12:53:25] let me check same repo
[12:53:25] i've tried cloning that repo 3 times. once over http, which worked. and twice over ssh, which failed like ^ both times
[12:53:28] kormat: fwiw I'd recommend setting up your .gitconfig to always pull over https
[12:53:42] yeah, I actually do that, it is way faster
[12:54:00] https://phabricator.wikimedia.org/P8871
[12:54:29] it is a big latency improvement, especially the further you get from gerrit
[12:55:24] aaand gerrit just started working for me again
[12:55:34] i had a git pull on an existing repo hang for minutes which just completed
[12:55:35] I was going to say, it went rather fast for that repo now
[12:55:52] kormat: I've sometimes seen the scheduler block
[12:56:07] because the queue is doing expensive operations for many people
[12:56:07] Good luck everyone for the switch
[12:56:33] in the past, to debug we entered ps on gerrit
[12:56:42] jynus: huh, interesting
[12:56:54] ps as in the application scheduler, not the host
[12:57:18] a couple of kills on blocked hosts and things started to flow :-)
[12:57:26] kormat: good thing we don't depend on gerrit anymore for the switchdc :D
[12:57:40] I have not seen any of those in the last months however
[12:57:50] volans: so it's only a bad thing the other 364 days a year then ;)
[12:59:23] eheh
[14:10:09] _joe_: ema: every 500 i've looked at is restbase
[14:10:10] <_joe_> ok so
[14:10:10] all: let's discuss here instead
[14:10:15] <_joe_> things seem ok to me
[14:10:25] is this the thing where nodejs doesn't know how to re-resolve its backends?
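An aside on the Gerrit-over-https tip recommended above: the .gitconfig setup amounts to a URL rewrite so fetches and clones go over anonymous https while pushes can stay on ssh. This is only a sketch of the general idea, assuming the standard gerrit.wikimedia.org host and ssh port; USERNAME is a placeholder, and the actual config pasted in P8871 may differ:

    # Fetch/clone over anonymous https instead of ssh (USERNAME is a placeholder).
    git config --global url."https://gerrit.wikimedia.org/r/".insteadOf "ssh://USERNAME@gerrit.wikimedia.org:29418/"
    # Optionally keep pushes on ssh:
    git config --global url."ssh://USERNAME@gerrit.wikimedia.org:29418/".pushInsteadOf "https://gerrit.wikimedia.org/r/"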
[14:10:43] <_joe_> cdanis: I doubt it, there is envoy there now
[14:11:01] 500s recovering
[14:11:13] <_joe_> also I suspect that's a rate over 5 minutes
[14:11:36] "description": "Search is currently too busy. Please try again later.",
[14:11:37] <_joe_> the errors were mostly with api
[14:11:44] <_joe_> yes that too
[14:11:55] so I guess we didn't do any cirrussearch warmups in codfw?
[14:11:58] but i thought that was a/a
[14:12:03] s1 enwiki still behaving badly
[14:12:18] marostegui: what does that mean?
[14:12:31] mark: response time is still high there on the dbs
[14:12:31] all 500s seem to be restbase indeed, and they're about the "related" functionality
[14:12:35] eg: https://de.wikipedia.org/api/rest_v1/page/related/Krampfader
[14:12:37] but they are not lagging, so that's good
[14:12:46] I guess they need to get their buffer pool fully warm
[14:12:47] avg. response time on the "appservers red dashboard" is much lower than it was before the switch, at around 133ms. that seems good
[14:12:51] <_joe_> ema: but they're going away
[14:12:56] <_joe_> mutante: that was expected
[14:12:57] marostegui: it looks like s8 to me on mysql aggregated
[14:13:06] mutante: it got a little worse yesterday when we moved services to codfw, now they're colocated again
[14:13:23] <_joe_> ok so
[14:13:24] ack. and confirmed edit on mobile apps
[14:13:32] cdanis: yep that too, but same as enwiki, not lagging so that's "good"
[14:13:52] I am not going to touch weights yet
[14:13:52] <_joe_> those restbase urls now all work
[14:13:57] <_joe_> I say go on and finish the job
[14:14:02] <_joe_> we need maintenance to restart
[14:14:19] I'll start phase 8 then, objections?
[14:14:23] rzl: fwiw read-only period was 2:49
[14:14:24] +1
[14:14:29] marostegui: do you have https://grafana.wikimedia.org/d/XyoE_N_Wz/wikidata-database-cpu-saturation?orgId=1 but for codfw? :D
[14:14:32] mcrouter metrics look good
[14:14:36] going
[14:14:39] cdanis: no :(
[14:15:05] <_joe_> volans: how much better than last time?
[14:15:09] going to reduce db2083's weight a bit
[14:15:29] ttls restored, starting maintenance jobs
[14:16:13] _joe_: can't recall, but we don't have puppet now
[14:16:29] <_joe_> we do, for the wikidata dispatcher
[14:16:42] s8 is looking good now
[14:16:49] https://grafana.wikimedia.org/d/000000278/mysql-aggregated?viewPanel=11&var-site=codfw&var-group=All&var-shard=All&var-role=All
[14:17:03] all of the problems we had, tbh, seem like warmup artifacts
[14:17:05] restbase errors still there
[14:17:44] "During the most critical part of the switch
[14:17:44] today, the wikis were in read-only mode for a duration of 4 minutes
[14:17:44] and 41 seconds."
[14:17:48] that's from 2018
[14:18:12] db land is looking good
[14:18:13] forcing a refresh of the alerting db hosts
[14:18:13] ema: it's still mostly /related/ urls
[14:18:17] indeed
[14:18:17] jynus: parsercache ok?
[14:18:20] maintenance on mwmaint2001 looks good
[14:18:21] mwmaint2001 has wikidata dispatch crons but no other maint crons
[14:18:22] kormat: thanks
[14:18:31] <_joe_> ema: and all of those urls work when I try again
[14:18:34] mutante: correct, dispatch job is the only maintenance cron left
[14:18:35] marostegui: yes
[14:18:36] what team supports cirrussearch / related page feature?
[14:18:38] discovery?
[14:18:43] everything else is in systemd now
[14:18:44] jynus: excellent thanks
[14:18:45] <_joe_> yes
[14:18:50] <_joe_> cdanis: yes
[14:18:56] I also saw some mobileapps failures in restbase
[14:19:00] marostegui: 08-update-tendril done, tendril lgty?
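On the "ttls restored" step above: the discovery DNS TTLs get lowered ahead of the read-only window and restored afterwards. A quick way to eyeball the current TTLs from a production host would be something like the following sketch; the record names and expected values here are assumptions, not the exact list the cookbook touches:

    # Print name and TTL for a few of the MediaWiki discovery records.
    for rec in appservers-rw appservers-ro api-rw api-ro jobrunner; do
        dig +noall +answer "${rec}.discovery.wmnet" | awk '{print $1, $2}'
    done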
[14:19:06] rzl: checking
[14:19:20] I see db2* roots
[14:19:27] akosiaris: i think the switchback was faster, no?
[14:19:47] mark: that was the absolute best up to then and it was the switchback
[14:19:50] rzl: parsercache (pc1, pc2 and pc3) should have been changed too, but they were not, that's not important
[14:19:52] the switchover was
[14:19:59] 7~8 IIRC
[14:20:00] * cdanis asking in #-discovery
[14:20:05] "That's a significant improvement over the 7 mins and
[14:20:05] 34 seconds we achieved during the inverse process we concluded a month
[14:20:05] ago, which was already significantly better than last year"
[14:20:08] marostegui: ack, noted for later
[14:20:16] rzl: ok
[14:20:28] rzl: the rest looks good
[14:20:36] thanks
[14:20:39] marostegui: isn't pc1 section missing from tendril?
[14:20:45] _joe_: confirmed, yes. Re-trying the failing URLs seems to work all the time
[14:20:52] jynus: I can see it there
[14:21:00] ok, artifact that tendril sometimes does
[14:21:05] on my screen
[14:21:06] yep
[14:21:09] seeing mw fatals still elevated at https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?orgId=1&var-datasource=eqiad%20prometheus%2Fops&from=now-1h&to=now
[14:21:26] should be recovered by now I think?
[14:21:34] databases are stable now, everything seems to be fine
[14:21:45] mobileapps error rate just checked, it looks normal
[14:22:03] wikifeeds is also pretty ok. It did note the switchover but it's fine now
[14:22:19] from logstash, some "you can't edit right now" errors but they may be backlogged
[14:22:29] marostegui: something is wrong with the db r-o monitoring
[14:22:50] need a hand?
[14:22:52] what the heck is 'servergroup other'?
[14:22:54] kormat: let's downtime those masters for 1d and we can check without pages
[14:22:57] marostegui: the eqiad masters _should be_ read-only, correct?
[14:23:00] ok
[14:23:02] kormat: correct
[14:23:04] yes
[14:23:06] kormat: correct
[14:23:10] ok. i'll dt them.
[14:23:22] rzl: the most recent of the "you can't edit right now" errors are from eqiad jobrunners
[14:23:28] long-running jobs still lingering there?
[14:23:34] oh good eye
[14:23:43] godog: ^ that's your fatals probably
[14:23:57] I assume jobrunners are servergroup 'other'
[14:24:00] didn't we confirm that all maintenance jobs were turned off?
[14:24:14] rzl cdanis agreed, looks like mw eqiad only, I'm looking at https://logstash-next.wikimedia.org/goto/422135a886dc3bc51f5cebda2e395239
[14:24:19] https://logstash.wikimedia.org/goto/1881560cae71490b01001e989d6cd934
[14:24:31] <_joe_> cdanis: uh that's bad
[14:25:09] <_joe_> can someone check the dns records for jobrunners please?
[14:25:37] <_joe_> yes we're still getting traffic in eqiad
[14:25:51] jobrunner.discovery.wmnet is 10.2.1.26, jobrunner.svc.codfw.wmnet
[14:25:54] <_joe_> hnowlan: around?
[14:25:56] matches jobrunner.svc.codfw.wmnet.
[14:25:57] _joe_: what do you mean?
[14:26:09] I'm here too, looking at DNS
[14:26:13] <_joe_> so the hosts are correct
[14:26:15] <_joe_> in dns
[14:26:24] <_joe_> it must be cpjobqueue doing something wrong
[14:26:35] https://config-master.wikimedia.org/pybal/codfw/jobrunner looks correct also
[14:26:37] templates/wmnet:mwmaint 300 IN CNAME mwmaint1002.eqiad.wmnet.
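The DNS check described above boils down to comparing the discovery record with the per-DC service records. From a production host that can resolve .wmnet, a minimal sketch using the names quoted in the chat:

    # The discovery name should currently resolve to the codfw service VIP.
    dig +short jobrunner.discovery.wmnet
    dig +short jobrunner.svc.codfw.wmnet
    dig +short jobrunner.svc.eqiad.wmnet   # shown only for contrast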
[14:26:40] _joe_: yo
[14:27:02] right, we have some that are not confd/discovery-controlled
[14:27:03] apergos: that's expected, should be unrelated to the jobrunners if I understand
[14:27:07] dumb q, does cpjobqueue run on scb still or on k8s nowadays
[14:27:07] CNAMEs like that
[14:27:26] <_joe_> cdanis: k8s
[14:27:41] <_joe_> hnowlan: it seems cpjobqueue is still calling eqiad's jobrunners
[14:27:41] cdanis: k8s
[14:27:43] <_joe_> at least partly
[14:27:49] persistent conns?
[14:27:55] marostegui: ok, all affected db masters downtimed.
[14:27:57] <_joe_> oh possibly?
[14:28:01] <_joe_> bblack: let me test that
[14:28:02] kormat: thanks
[14:28:08] or long running jobs?
[14:28:34] kormat: thank you
[14:28:49] _joe_: it should be using jobrunner.discovery.wmnet but there's a chance it's cached bad records
[14:29:06] <_joe_> yes it worked
[14:29:12] <_joe_> bblack: you were right
[14:29:16] persistent conns eh?
[14:29:22] <_joe_> uh no nevermind
[14:29:24] <_joe_> it didn't
[14:29:37] <_joe_> hnowlan: can we please do a redeploy of cpjobqueue?
[14:29:48] yep, on it
[14:29:49] <_joe_> can someone verify if the codfw jobrunners are receiving traffic?
[14:30:15] <_joe_> hnowlan: wait, I see no more requests now in fact
[14:30:16] other possibly-notable CNAMEs that still point at eqiad manually (but I don't *think* any of them are the/a problem here):
[14:30:19] eventlogging 1H IN CNAME eventlog1002.eqiad.wmnet.
[14:30:21] udplog 1H IN CNAME mwlog1001.eqiad.wmnet.
[14:30:24] deployment 5M IN CNAME deploy1001.eqiad.wmnet.
[14:30:26] maintenance 5M IN CNAME mwmaint1002.eqiad.wmnet.
[14:30:40] <_joe_> yes irrelevant
[14:30:40] eventlog1002 is fine, it is eqiad only anyway (consumes from kafka-jumbo)
[14:31:01] <_joe_> rzl: can you log onto a codfw jobrunner and look at the apache log?
[14:31:07] yeah fatals are dropping afaics
[14:31:11] <_joe_> yes
[14:31:13] <_joe_> ok
[14:31:15] _joe_: ack, looks from the CPU/network graphs as if they're serving
[14:31:18] checking apache now
[14:31:26] <_joe_> yeah so what happened is
[14:31:32] check mw2243 and it's getting traffic
[14:31:34] <_joe_> changeprop creates persistent connections
[14:31:39] <_joe_> that we need to kill
[14:31:42] mw2267
[14:31:44] <_joe_> we need another step in the switchover
[14:31:45] down
[14:31:50] we can do a redeploy pretty quickly
[14:32:01] jobrunners in codfw are showing traffic in grafana
[14:32:07] _joe_: should I go ahead with the redeploy?
[14:32:20] _joe_: confirmed activity on /var/log/apache2/jobqueue-access.log on mw2246
[14:32:25] <_joe_> hnowlan: nope hold your horses
[14:32:28] ack
[14:32:31] <_joe_> things should be ok
[14:33:21] <_joe_> rzl: we need to add a "restart envoy on $dc_from jobrunners" step
[14:33:29] nod
[14:33:36] it's nice that we have a month :)
[14:33:51] oh, envoy doesn't know to invalidate old conns when the hostname doesn't resolve to them anymore?
[14:34:03] or at least doesn't it have a timeout?
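As a rough sketch of the proposed "restart envoy on $dc_from jobrunners" step: a batched restart of the local envoy on the jobrunners in the datacenter we just switched away from would force changeprop's long-lived connections to be re-established against the new discovery target. The cumin alias and the service unit name below are assumptions about the local tooling, not the actual cookbook step that was later added:

    # Restart envoy on the eqiad jobrunners in small batches, pausing between batches
    # so we never drop all listeners at once.
    sudo cumin -b 5 -s 30 'A:mw-jobrunner-eqiad' 'systemctl restart envoyproxy.service'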
[14:34:08] i hear petr contributes to envoy these days ;)
[14:34:09] oh, no, I bet envoy isn't used by all clients
[14:34:10] it probably never even checks
[14:34:13] <_joe_> cdanis: I was wrong, changeprop doesn't use envoy
[14:34:16] I think that's configurable but it may not be default
[14:34:23] joe is saying to restart it *on the jobrunners* to force the non-envoy-client conns to be reset
[14:34:25] it would be an unusual step anyways, I think, to re-check DNS on an established conn
[14:34:26] <_joe_> but there is envoy on the jobrunners
[14:34:31] <_joe_> cdanis: correct
[14:34:42] bblack: well, it's constantly re-resolving those hostnames anyway, so it does know -- but I misunderstood
[14:35:04] <_joe_> envoy DTRT when making http requests
[14:35:07] <_joe_> it rechecks
[14:35:08] yeah, envoy is built for load-balanced RPCs, so it's constantly redoing discovery
[14:35:13] hmmm ok
[14:35:14] <_joe_> and so does cpjobqueue
[14:35:32] <_joe_> only cpjobqueue checks when it has to start a new connection
[14:35:41] <_joe_> and envoy is very good at keeping connections persistent
[14:35:50] <_joe_> which btw is also a problem with load-balancing
[14:39:07] okay, anything else still ongoing? restbase 5xxs are still up around 10ish/sec, but we're comfortable with it, is that right?
[14:39:14] I see some icinga warnings
[14:39:16] Stale template error files present for '/var/lib/gdnsd/discovery-jobrunner.state'
[14:39:16] rzl: it is a warmup artifact
[14:39:28] still a lot of cirrussearch errors in restbase
[14:39:29] rzl: talked with dcausse about it and added a note in the doc
[14:39:35] jayme: ^
[14:39:37] cdanis: ack, thanks
[14:39:45] I assume the deployment CNAME will be switched at some point, right? no rush since no deploys today, but just checking.
[14:39:59] Krinkle: Tyler and I have chatted about that, we might or might not
[14:40:04] (and if not automatically, also the motd message and scap lock)
[14:40:20] <_joe_> about what?
[14:40:22] but either way let's finish tidying up the MW switch proper first
[14:40:24] rzl: jayme: this capacity cache https://w.wiki/a$6 is supposed to be replicated cross-DC, and apparently that isn't working
[14:40:46] _joe_: about whether to switch the deployment server later this week
[14:41:09] cdanis: that's the warmup issue you were talking about, right?
[14:41:15] yes
[14:41:29] 👍
[14:41:34] <_joe_> rzl: sorry I have horrible lag
[14:41:50] exec_error (cpjobqueue, changeprop) in restbase are down as well
[14:41:59] _joe_: jetlag from the trip to Dallas, no doubt
[14:42:20] <_joe_> jayme: as expected
[14:42:41] any other unresolved issues?
[14:43:00] <_joe_> no I think the elastic one is the only one
[14:43:01] rzl: some of our documentation is still out of date
[14:43:18] <_joe_> I think we can page dcausse and ryan and gehel
[14:43:21] cdanis: mmm, I don't think you're right, it doesn't say so in the status doc
[14:43:31] _joe_: they are aware
[14:43:44] <_joe_> anything else left?
[14:43:52] sounds like that's all :)
[14:43:55] off to the pub?
[14:43:56] nice work everybody
[14:44:00] woo hoo!
[14:44:02] ema: interesting, the cookbook does remove the stale confd files
[14:44:10] yes, nicely done, really not bad after 2 years! :)
[14:44:10] I'll fill in the timeline in the status doc later today
[14:44:12] and well-run
[14:44:13] <_joe_> we cut the ro time down to the bare minimum
[14:44:16] network-wise everything is as usual
[14:44:17] please do check over everything else in there, and add any AIs
[14:44:24] good job everybody!
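For context on the "Stale template error files" warning and the stale confd files mentioned above: the gdnsd state file for discovery-jobrunner is rendered by confd from the discovery objects in etcd, so the authoritative view of what is pooled lives in conftool. A hedged sketch of how one might check it from a cluster management host; the exact confctl invocation is an assumption:

    # Show the pooled/depooled state of the jobrunner discovery record in each DC.
    sudo confctl --object-type discovery select 'dnsdisc=jobrunner' get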
[14:44:29] rzl: nice job for the train conductor as well :)
[14:44:32] awesome well done all
[14:44:33] <_joe_> I don't think we can go below 2:30 of read-only without risking
[14:44:48] we could automate the save edit step
[14:44:59] but it gets a bit messy ;)
[14:45:18] 👏
[14:45:35] \o/
[14:45:38] good job :)
[14:46:04] anyone know if it's possible for `mediawiki::state('primary_dc')` to be unstable?
[14:46:09] icinga warnings about power supplies infeed load in codfw, but yellow, not red
[14:46:44] kormat: theoretically? or ever having been observed?
[14:46:51] mutante: what are the loads?
[14:47:10] akosiaris: we have a puppet role that depends on the value, and it keeps coming up with different results
[14:47:13] eheh
[14:47:28] codfw PDUs went from ~79 kW to ~103kW?
[14:47:36] mark: from 1285 to 1573
[14:47:51] that reads /etc/conftool-state/mediawiki.yaml
[14:48:25] they are not in sync
[14:48:27] depends on the master
[14:48:29] damn
[14:48:39] puppetmaster1002.eqiad.wmnet:
[14:48:40] read_only: codfw: true eqiad: false
[14:48:45] (5) puppetmaster[2001-2003].codfw.wmnet,puppetmaster[1001,1003].eqiad.wmnet
[14:48:48] inverted
[14:48:50] is that another confd problem volans ?
[14:48:56] I'll restart confd on puppetmaster1002.eqiad.wmnet
[14:49:18] ack
[14:49:25] Sep 01 14:48:58 puppetmaster1002 confd[13848]: 2020-09-01T14:48:58Z puppetmaster1002 /usr/bin/confd[13848]: ERROR client: etcd cluster is unavailable or misconf
[14:49:28] Sep 01 14:48:58 puppetmaster1002 confd[13848]: ; error #1: x509: certificate has expired or is not yet valid
[14:49:31] ahahah
[14:49:39] :(
[14:49:41] bbboooooo
[14:49:46] but restarted fine
[14:50:05] can we disable that server as a point of contact temporarily?
[14:50:06] kormat, jynus, marostegui: now it should be stable
[14:50:09] ah
[14:50:11] you made it work
[14:50:29] volans: another case of better call volans! thank you
[14:50:32] ok. let me re-run puppet on all the nodes.
[14:50:33] so what got lagged is etcd there?
[14:50:43] ^that is for volans
[14:50:55] confd did not react to the etcd key change I guess?
[14:50:57] jynus: confd stuck
[14:51:01] :-(
[14:51:14] ok, it only created spam
[14:51:19] stuck? as in entirely stuck? Why did we not catch this earlier?
[14:51:20] but I think no user impact
[14:51:28] that is the positive thing
[14:51:59] akosiaris: let's create a ticket to track the whole issue
[14:52:49] akosiaris: was complaining about the cert
[14:53:03] but
[14:53:03] Active: active (running) since Wed 2019-10-16 11:20:02 UTC; 10 months 16 days ago
[14:53:40] so, the puppet CA rollover
[14:53:43] I think so
[14:53:53] and wondering how many others we might have :D
[14:53:53] see ya in 5 years then :P
[14:53:56] lol
[14:54:08] I will create a ticket to track the alert spam about reaching etcd
[14:54:15] noted it on the doc
[14:54:16] we stumbled across a few indeed
[14:54:25] it's a gift that keeps on giving
[14:54:43] ok so about mwmaint1002
[14:54:48] what is the deal with the captcha jobs from 2019
[14:55:28] someone still trying to solve them?
[14:56:29] lvs2007 - one CPU core of an LVS is saturated - packet drops are likely (CPU=21) (warning only)
[14:56:49] mutante: if that is just a blip, it means: puppet was running
[14:57:43] cdanis: rescheduling that check to see if it was.. and it was: resolved
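The instability kormat hit comes from mediawiki::state() reading /etc/conftool-state/mediawiki.yaml, which confd renders from etcd on each puppetmaster; if one confd is stuck, compiles served by that frontend disagree with the rest. A minimal sketch of the comparison volans did and the fix that followed (the puppetmaster cumin alias is an assumption):

    # Compare the confd-rendered state file across the puppetmaster frontends;
    # a differing checksum points at the host with the stuck confd.
    sudo cumin 'A:puppetmaster' 'md5sum /etc/conftool-state/mediawiki.yaml'
    # Restart confd on the outlier to force a fresh render from etcd.
    sudo cumin 'puppetmaster1002.eqiad.wmnet' 'systemctl restart confd'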
[14:58:09] the warning threshold is a bit too low
[15:01:13] someone is clearly searching for all those fire hydrants since 2019
[15:01:40] I'm rebooting mwmaint1002 on Thu (off tomorrow)
[15:02:11] we need a better setup for that one in general, like two servers in each DC we can flip over or similar
[15:02:25] +1, at the very least that :)
[15:02:31] restbase is happy again
[15:03:09] sorry I missed current status, are jobs/crons enabled now on codfw?
[15:03:16] jynus: yes
[15:03:25] thanks
[15:03:34] ema: the hit rate on that cache is up from <1% to about 30%, so makes sense, that's a lot less traffic
[15:04:10] in terms of elasticsearch-reported cache miss rps, we're only at 2x normal
[15:04:29] _joe_: I just ran a check and the 4 failing confd are/were: mwmaint2001.codfw.wmnet,mwmaint1002.eqiad.wmnet,puppetmaster2002.codfw.wmnet,puppetmaster1002.eqiad.wmnet
[15:04:36] did you restart it in both mwmaint?
[15:05:04] I'll check the other puppetmasters
[15:05:04] <_joe_> no
[15:05:13] <_joe_> but the data on mwmaint2001 seemed ok
[15:05:20] my check is just a grep on syslog, so might be a red herring
[15:05:22] <_joe_> volans: just restart confd everywhere
[15:05:29] ok
[15:05:31] <_joe_> on those servers
[15:05:45] icinga question: how do you undowntime a host from this page: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=db1093 ?
[15:06:03] there is a list of downtimes
[15:06:09] kormat: menu on the list
[15:06:10] downtimes
[15:06:11] you can select it and remove it
[15:06:17] let me try to find the url
[15:06:19] jynus: there are still about 140 icinga checks in state "pending" like when new checks are added that have never executed yet. they are all db/es/pc
[15:06:27] mutante: when was the lvs2007 cpu issue
[15:06:32] volans: what menu?
[15:06:58] mutante: that is weird
[15:07:07] those are not new checks, or shouldn't be
[15:07:15] kormat: you want to instead go to this page https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=db1093#
[15:07:36] kormat: and then, tickybox the host in the first group of tickboxes, and in the first menu, "Remove Downtime(s) For Checked Host(s) and All Services"
[15:07:40] _joe_: no need, they were restarted 5h 14min ago
[15:07:41] cdanis: that's the sort of dumb thing i was hoping to not be disappointed with
[15:07:59] bblack: icinga logs show it was in SOFT/warn state since hours before the switch. it happened on each puppet run apparently
[15:08:01] XDD
[15:08:11] kormat: the menu in icinga homepage https://icinga.wikimedia.org/icinga/ :D
[15:08:11] ok
[15:08:22] mutante: bblack: yeah, the WARN threshold is a bit too low on that alert, puppet runs generally cause it
[15:08:24] volans: ok, so _not_ on the page i linked :P
[15:08:31] what is this? https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=all&type=detail&servicestatustypes=1
[15:08:37] mutante: bblack: it's also by nature pretty sensitive, as it is looking at the util % of every individual CPU
[15:08:41] sorry, wrong channel
[15:08:42] well we should finish the -cpu0 work for ipvs stuff
[15:08:46] that would help, I bet
[15:09:39] jynus: yea, that's what I mean. it's like they just got added about 22m ago. some eqiad, some codfw
[15:10:11] https://phabricator.wikimedia.org/T236208 is the -cpu0 thing. rzl patched up one part of it a while back, but there's more to be done.
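A sketch of the "grep on syslog" check for stuck confd instances and the targeted restart _joe_ asked for above. The grep pattern and the cumin host selection are approximations, not a copy of the actual check:

    # Count confd errors in syslog across the fleet (non-zero counts are suspects).
    sudo cumin '*' 'grep -c "confd.*ERROR" /var/log/syslog || true'
    # Restart confd only on the hosts that were still reporting errors.
    sudo cumin 'mwmaint2001.codfw.wmnet,mwmaint1002.eqiad.wmnet,puppetmaster2002.codfw.wmnet,puppetmaster1002.eqiad.wmnet' 'systemctl restart confd'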
[15:10:15] that makes no sense, unless it is a fallout of the confctl issue
[15:10:19] cdanis: ack, confirmed it was talking about one specific CPU
[15:10:25] mutante: looking into it
[15:14:04] number is going down. they eventually just run. just that it went through that cycle once before
[15:19:28] something is wrong with icinga
[15:19:37] the url that jynus linked above was empty a minute ago
[15:19:47] now it's full again of checks that icinga claims it hasn't done before
[15:19:53] I can see it
[15:20:00] are they the same?
[15:20:02] or new ones?
[15:20:29] i think different hosts
[15:20:32] i can confirm it did this at least once before.. new checks in pending.. they become OK.. back in pending
[15:20:44] both times they were all db, pc or es
[15:20:47] mutante: before as in not on this maintenance?
[15:20:49] but mixed dc
[15:21:03] no, i mean earlier today, before i pinged you
[15:21:04] let's check icinga on puppetboard
[15:21:32] yeah it's changing at each run, let me dig a bit
[15:21:42] * volans running puppet on icinga
[15:22:35] probably the hashtag page changes
[15:22:37] is my bet
[15:22:47] removing or adding it due to primary dc changes
[15:22:54] (we don't page on passive dbs)
[15:23:08] sure, that could explain it, but not if it happens multiple times
[15:23:14] still getting
[15:23:14] - contact_groups admins,sms,admins
[15:23:14] + contact_groups admins
[15:23:16] changes
[15:23:21] - service_description mysqld processes #page
[15:23:21] + service_description mysqld processes
[15:23:34] volans: this sounds like mediawiki::state('primary_dc') again
[15:23:36] are we still within 1h?
[15:23:36] wow just using the bad word like that
[15:23:45] ah sorry, for the hashtag
[15:23:46] :)
[15:23:47] volans: you paged us! ;P
[15:23:53] via m.ark: https://grafana.wikimedia.org/d/000000429/backend-save-timing-breakdown?orgId=1&refresh=5m is still elevated, some of that is just long averages, maybe all?
[15:23:55] even I avoided it!
[15:24:07] volans: so.. what do you need captain? ;P
[15:24:36] uh, p50 has doubled
[15:24:41] so if we're within 1h of the last change that caused this it's still explained
[15:24:48] all the hosts are running puppet and then icinga is catching up
[15:25:08] but if it continues we have something weird, I guess that those last ones are due to the confd I restarted, they were broken before
[15:25:12] and need up to 1h to catch up
[15:25:20] do we have insights on where the performance loss is?
[15:25:23] kormat: we could re-run (in batches of 10) puppet on all the dbs
[15:25:26] and then once on icinga
[15:25:31] bear in mind that's a 7d graph, so some of the elevation is just when services and mw were split
[15:25:32] to ensure we clean this up earlier
[15:25:32] volans: ah ok. looking at 2 of the hosts, they're not flapping. they've changed once.
[15:25:42] volans: i'll do it.
[15:25:46] yeah I think it's just a catching up effect
[15:26:46] _joe_: can you check my grafana link above? trying to figure out why edit latency is still up
[15:26:55] request # is equivalent to pre-switchover
[15:27:10] for gets, checking for posts
[15:27:14] does /usr/bin/confd[23664]: ERROR 100: Key not found (/conftool/v1/services) ring any bells for anyone?
[15:27:15] (talking mw app level)
[15:27:36] seems unrelated but I got them while checking the status on all confd hosts
[15:27:50] <_joe_> rzl: that's up since yesterday?
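To close out the icinga catch-up thread above: the batched puppet re-run kormat picked up ("in batches of 10 on all the dbs, and then once on icinga") would look roughly like this. The cumin aliases and the run-puppet-agent wrapper are assumptions about the local tooling:

    # Re-run puppet on the database fleet in batches of 10, then once on the
    # icinga host so the regenerated monitoring config is reloaded sooner.
    sudo cumin -b 10 'A:db-all' 'run-puppet-agent'
    sudo cumin 'A:icinga' 'run-puppet-agent'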
[15:27:59] _joe_: yes, but then didn't fully recover
[15:28:00] <_joe_> volans: yes, ignore
[15:28:15] <_joe_> rzl: I mean it seems unrelated to the switchover
[15:28:18] ack
[15:28:45] <_joe_> oh you mean it went up with the switch of services
[15:28:54] <_joe_> we're sure it's related?
[15:29:09] % of responses <= 100ms went from 10 to 20 on codfw
[15:29:13] (percent)
[15:29:18] not sure, just got sent the graph and trying to figure out if it's related
[15:29:31] <_joe_> the p75 started rising before we did switch the services
[15:29:45] interesting
[15:29:54] are you sure? I read it as after
[15:30:00] double-checking timestamps, I thought we did that at 14:00
[15:30:36] <_joe_> uhmmm so
[15:30:51] <_joe_> 14:25 is when we had the bump in latencies on appservers
[15:30:52] 14:21 END (PASS) - Cookbook sre.switchdc.services.01-switch-dc (exit_code=0)
[15:31:19] <_joe_> ok so
[15:31:34] <_joe_> we had a first surge to ~ 1.5 seconds immediately
[15:32:03] <_joe_> then it went up more
[15:32:16] <_joe_> and that's unrelated to the switchover
[15:32:19] <_joe_> of services
[15:32:24] there is a small decrease in cpu usage when going to codfw (24% to 20% cluster wide), but it could be just better hw?
[15:32:28] <_joe_> so I guess there are two effects together
[15:32:44] rzl: are we declaring the party over?
[15:32:50] <_joe_> jynus: we're trying to understand a specific problem, can we limit chatter about unrelated stuff please?
[15:33:05] volans: we've declared the party over twice now :) but now we're looking at the graph I posted above ^
[15:33:16] coherence!
[15:33:17] :D
[15:33:23] I was trying to provide info about the app server cluster, but ok
[15:33:40] <_joe_> that's unrelated to backend saving times
[15:33:44] ok
[15:33:48] <_joe_> https://grafana.wikimedia.org/d/000000429/backend-save-timing-breakdown?viewPanel=34&orgId=1&from=now-6h&to=now&refresh=5m
[15:34:09] <_joe_> rzl: I think here you see the effects of moving mw to the same dc as the services
[15:34:34] <_joe_> now we should ask Krinkle (or someone else) how this backend save time is calculated
[15:34:40] <_joe_> does it include post-send?
[15:34:56] <_joe_> in that case, we might want to check latencies in the various eventgates
[15:37:03] <_joe_> jobs seem to work correctly
[15:38:10] <_joe_> https://grafana.wikimedia.org/d/LSeAShkGz/jobqueue?orgId=1&var-dc=codfw%20prometheus%2Fk8s
[15:39:10] <_joe_> rzl: I would say the backend timing thing needs to be figured out but it's not critical at this point.
[15:39:16] nod
[15:39:31] <_joe_> we need to involve devs to understand what might be wrong there
[15:39:46] <_joe_> I don't know enough about how it's measured
[15:41:44] I'll open a task, we can go from there
[15:42:51] i believe this is coming from gilles, so also sync up with him :)
[15:43:21] * volans taking a break if not needed right now
[15:43:24] ack thanks
[15:43:48] _joe_: no, not including post-send
[15:43:51] volans: definitely, nice work :)
[15:44:09] went up a lot yesterday incl from user-perceived frontend metrics (e.g. form submit until TTFB)
[15:44:19] I didn't do much, all yours :)
[15:44:20] ttyl
[15:44:55] backend save timing is php start time until pre-send
[15:45:06] frontend save timing is form submit until TTFB of response
[15:45:28] wait
[15:45:36] didn't we have this last time
[15:45:52] <_joe_> jynus: uh?
[15:46:06] many tools uploading data/edits are on the cloud network (eqiad) so they suffer latency?
[15:46:09] <_joe_> Krinkle: I'm wondering what can cause this by just switching services to codfw
[15:46:15] the time to save edits more or less doubled at the median yesterday, we got an alert but we didn't act thinking it was due to cross-dc services, like session store
[15:46:30] which should recover now
[15:46:36] something like 1/3 of requests?
[15:46:45] https://grafana.wikimedia.org/d/000000085/save-timing?orgId=1
[15:46:45] sorry if I am pointing at something unrelated
[15:46:49] <_joe_> jynus: not by 3 seconds though
[15:46:51] this is a better dash as entry point
[15:47:00] _joe_: indeed
[15:47:44] <_joe_> I don't think 3 seconds is justified by moving services to codfw
[15:47:56] <_joe_> unless there is some service called before post-send that is lagging badly
[15:48:37] the median went from 1s to 2s
[15:48:37] <_joe_> but this is not an emergency and I'm tired, so I'd leave it to the folks in better timezones
[15:48:57] looking at flame graphs now
[15:51:42] affects both wikipedia and wikidata, affects all content models (wikitext vs flow vs wikidata), all user types (anon, reg, bot) and entrypoints (index vs api).
[15:51:45] per https://grafana.wikimedia.org/d/000000429/backend-save-timing-breakdown?orgId=1&refresh=5m
[15:51:59] so not something specific :/
[15:52:19] and rate distribution of these also unchanged
[15:52:43] at the lowest percentile, it went 150ms -> 200ms
[15:53:05] p50 0.7 -> ~1.5s
[15:53:07] opened https://phabricator.wikimedia.org/T261763, assigned to you initially Krinkle just cause you're looking right now, but feel free to move it around
[15:53:21] and p75: 0.8s -> 3.8s
[15:53:24] (and/or just throw it back at me)
[15:56:17] eyeballing it suggests it aligns most closely with yesterday 14:20,
[15:56:21] https://sal.toolforge.org/log/PHHkRHQBLkHzneNNOEPw
[15:57:08] might be useful in the future for that kind of purpose to not move all services at once. even doing them 1 minute apart would have helped here
[15:57:21] afaik sessionstore is the only one in that list that's used in all/most edits pre-send
[15:57:31] gilles: for time-correlating? yeah that's an interesting thought
[15:57:37] yes
[15:57:37] gilles: personally I think it'd be more useful to have distributed RPC tracing by the next time we do this 🙃
[15:57:39] ack yeah, 5min apart would be nice.
[15:57:52] it might also be tricky for inter-service dependencies, the hairpinning might be unfortunate
[15:57:55] esp given it's done ahead of the switch so presumably that should be fine?
[15:58:04] yeah, that is very true rzl
[15:58:14] should I note it on the doc?
[15:58:16] as a suggestion
[15:58:20] sure, thanks
[15:58:23] did we do a kill/readonly on the services yesterday?
[15:58:30] we can discuss before the switchback
[15:58:40] nothing was readonly yesterday, no
[15:58:59] so then a gradual switch would be no different in terms of guarantees than a fast one, right?
[15:59:50] Krinkle: if you have a complicated services-calling-other-services-calling-other-services request, a gradual transition could have much worse latency properties during the switchover duration
[16:01:15] cdanis: right, so just increased latency for a subset of services for up to 20-30min or so during the switch
[16:03:35] sessionstore GET latency distribution is better on codfw than it was on eqiad before the switch. it's the same before/after for POST. As for DELETE, the buckets are useless as the maximum one (>25ms) is hit 100% of the time. So can't tell if it got worse.
[16:03:39] we now have pre-2015 save times in production, and have wasted about 1 person-month of time in terms of human productivity (2.1M edits over 24h, ~+1s, ~ 1 month of time)
[16:05:21] I'd like to roll a wdqs deploy later today
[16:05:31] I imagine there's still some datacenter switchover post-cleanup stuff being done, will it get in the way if I deploy wdqs in an hour or am I clear (production issues permitting)?
[16:05:39] gilles: ack yeah, all traffic there looks identical to before, and even slightly better.
[16:05:48] ryankemper: you should be all clear :)
[16:06:04] great
[16:06:39] ryankemper: yeah, == cdanis, go ahead and thanks for checking
[16:06:45] by comparison, load.php backend timing is not increased, which eliminates a good portion of suspects.
[16:08:20] marostegui: does the remaining traffic here all have reasonable explanations / within expectations? https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1
[16:08:50] 230 op/s for es2
[16:09:12] that is difficult to say, but 230 ops is probably replication writes
[16:09:28] so without any research, looks fine
[16:09:33] Krinkle: it can be pt-heartbeat and/or replication
[16:09:38] oh, es2
[16:09:45] that doesn't have replication
[16:09:48] nor heartbeat
[16:10:11] it can be dumps
[16:10:15] yeah
[16:10:24] or monitoring
[16:10:45] es2 is dumps, I am seeing it now
[16:10:52] I am also running backups, but not on es2
[16:11:02] Krinkle: SELECT /* ExternalStoreDB::fetchBlob dumpsgen@snapsh... */ blob_text FROM `blobs_cluster24
[16:11:14] dumps indeed
[16:11:57] ok, we didn't switch those.
[16:13:22] looking at masters only, the num of conns is very flat, so that suggests it isn't user traffic indeed
[16:13:22] https://grafana.wikimedia.org/d/000000278/mysql-aggregated?var-site=eqiad&var-group=core&var-shard=All&var-role=master
[16:16:28] memcache seems to have quite a bit higher commands per second, but slowly declining to eqiad levels.
[16:19:55] recompute latencies for various parts of the edit process can be seen in the wanobjectcache breakdown metrics, kind of an approximation for time spans
[16:19:55] https://grafana.wikimedia.org/d/lqE4lcGWz/wanobjectcache-key-group?orgId=1&var-kClass=revision_row_1_29
[16:20:00] but nothing stands out there so far either
[16:25:34] Here we can see an example of cross-dc latency:
[16:25:35] https://grafana.wikimedia.org/d/lqE4lcGWz/wanobjectcache-key-group?viewPanel=13&orgId=1&from=1598826112201&to=1598977509162&var-kClass=WikibaseQualityConstraints
[16:25:58] about +30-40ms for wikibase stuff, which has recovered.
[16:26:18] this uses sparql queries I believe
[16:44:06] rzl: I don't see anything in grafana graphs of various backend services and internal time spans. I also don't see anything stand out in the flame graphs (not notably more samples, and no spans that are much slower than before).
[16:44:32] rzl: what I do see when editing with XHGui enabled, is 3-5s being spent post-send, most of which is in cURL for jobs I guess.
[16:45:25] Now, we know our metrics are not affected by that since we also confirm the regression in frontend metrics (which are and can only be based on time to first byte), and I have also confirmed that the backend metric is correctly timestamped before post-send begins.
[16:45:43] So my only guess left after everything I eliminated is that somehow post-send is not working and that it is happening pre-send.
[16:46:02] could it be that something in fpm or apache or wherever is so broken that there is no longer post-send support?
[16:46:12] I
[16:46:25] I'm going afk for a bit, will be back in 1-2h
[17:14:38] congratulations on the smooth switchover folks :)
[17:15:10] Well done
[17:15:22] I missed it, was busy
[17:15:27] Any incidents?
[17:15:35] Most in the wikimedia discord barely noticed RO
[17:15:49] So we're now in codfw primary?
[17:15:59] Yep
[17:16:10] what was the total RO time?
[17:17:46] found it, 2 mins 49 secs
[17:18:27] From Virginia to Texas in 2 minutes, that's speed :-)
[17:21:15] :)
[17:22:38] It's about 2100km, so about 12.4km/s
[17:23:14] barely escape velocity!
[17:38:00] rzl: catching up with backlog, did anything urgent come up recently?
[17:55:34] volans: I think he's taking a break now, but, nothing I've seen
[17:55:56] k
[17:56:04] I'm adding a couple of things in the doc
[18:09:30] back now, yeah nothing huge
[18:49:28] <_joe_> Krinkle: I'm not sure what you meant, but things I'd look out for are stuff like kafka latencies
[18:49:47] <_joe_> if eventgate is slow, then saving is slow because it generates various events and jobs
[18:51:48] _joe_: I'm not aware of job submissions happening within the window measured by either frontend or backend save timing
[18:51:59] Xhgui profile confirms they all happen post-send
[18:52:14] <_joe_> also event emission?
[18:52:20] But they are indeed notably very slow it seems
[18:52:24] <_joe_> so, what happens in that interval? :)
[18:52:43] <_joe_> ok, that is probably due to kafka in codfw having some issues