[07:02:40] there seems to be some kind of mismatch between our installer and the latest kernel modules available on buster
[07:07:49] Debian 10.3 was released February 8th, 2020.
[07:07:52] sigh.. that's the culprit
[07:10:18] vgutierrez: IIRC somebody needs to run a script to update the installer after point releases
[07:10:24] (also good morning)
[07:10:32] yeah
[07:10:35] I was checking https://wikitech.wikimedia.org/wiki/Updating_netboot_image_with_newer_kernel
[07:10:42] it looks kinda deprecated :_)
[07:10:58] and in /var/lib/puppet/volatile/tftpboot/ on puppetmaster1001 the latest installer is for buster 10.1
[07:10:59] yeah, it talks about jessie XD
[07:11:06] dunno what happened to 10.2
[07:12:36] so at this point I don't know if we wanna "backport" the 10.3 kernel to the 10.1 installer or just go with 10.3
[07:17:22] version.info under /var/lib/puppet/volatile/buster/installer reports version 20190702+deb10u2
[07:18:28] so I'm guessing that's 10.2, seeing that u3 is now available
[07:18:57] hey moritzm <3 we were missing you right now
[07:19:28] moritzm: it looks like buster 10.3 was released during the weekend and now the installer is complaining about kernel version mismatches
[07:20:35] ack, for every point release which bumps the kernel ABI we need to rebuild our internal d-i images, I'll do it in a bit
[07:21:04] please ping me when you do it :) I got two reimages in progress
[07:21:37] meanwhile.. let's backport yet another patch for trafficserver :)
[07:32:36] vgutierrez: Buster is updated and I've run Puppet on both install* servers, you're good to go
[07:32:43] wonderful
[07:32:46] thanks moritzm!
[07:32:57] Stretch also needs to be rebuilt, I'll do it in a bit
[07:33:04] moritzm: BTW, could you refresh https://wikitech.wikimedia.org/wiki/Updating_netboot_image_with_newer_kernel?
[07:33:23] or add a section for new point releases?
[07:34:54] that's different documentation; it's only needed if we e.g. want to backport a more recent kernel than 4.9 to the stretch images (it was once written to support some cloudvirt server on jessie)
[07:35:29] I'll have a look whether we have some docs for the refresh (finding them on wikitech is probably the bigger issue anyway)
[07:36:38] it's mostly covered by a script on the puppetmasters (update-netboot-image) and then shipped to the install* servers via the volatile directory
[07:36:56] ack
[09:02:28] vgutierrez: did the installation(s) with the 10.3 installer work fine?
[09:05:33] yes
[09:07:55] ack
[14:38:54] unscheduled deploy starting shortly, roll forward to wmf.18, holdover from friday
[14:49:57] tx apergos
[14:50:22] right now the updated branch is going around but wikis are not being moved to the new version yet
[14:50:27] yw
[14:50:37] mwdebug1001 is still out of puppet I see
[14:50:42] or again, whichever
[14:58:20] <_joe_> apergos: wait
[14:58:38] <_joe_> are we moving to wmf.18 again?
[14:58:49] <_joe_> did we find the origin of the issues last week?
[14:59:09] <_joe_> it was deployed twice and it caused 2 outages
[14:59:19] no
[14:59:24] the fix for the config that caused the issues has been backported to 18
[14:59:28] to wmf.18 with the backported fix
[14:59:33] the underlying cause has not been looked into
[14:59:38] yet.
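(Aside on the netboot refresh covered in the morning backlog above, before the train thread continues: moritzm's description boils down to "after a point release that bumps the kernel ABI, run the update-netboot-image script on the puppetmaster, then run Puppet on the install* servers so the refreshed volatile directory is shipped out". Below is a minimal Python sketch of that sequence; the script arguments, the install host names, and the ssh/puppet invocation are assumptions for illustration, not taken from the log.)

```python
#!/usr/bin/env python3
"""Rough sketch of the point-release image refresh described above.

Only the script name (update-netboot-image) and the overall flow come from
the conversation; the arguments, host names, and ssh/puppet invocation are
assumptions.
"""
import subprocess

INSTALL_SERVERS = ["install1002.wikimedia.org", "install2002.wikimedia.org"]  # hypothetical hosts


def refresh_netboot_image(suite: str = "buster") -> None:
    # 1. Rebuild the internal d-i image on the puppetmaster; the result lands
    #    under /var/lib/puppet/volatile/<suite>/installer.
    subprocess.run(["sudo", "update-netboot-image", suite], check=True)

    # 2. Run Puppet on each install* server so the refreshed volatile content
    #    (tftpboot images, version.info) is picked up.
    for host in INSTALL_SERVERS:
        # puppet agent -t exits 0 (no changes) or 2 (changes applied); both are fine.
        result = subprocess.run(["ssh", host, "sudo", "puppet", "agent", "-t"])
        if result.returncode not in (0, 2):
            raise RuntimeError(f"puppet run failed on {host} (exit {result.returncode})")


if __name__ == "__main__":
    refresh_netboot_image("buster")
```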
[14:59:42] so wrt this issue, it should be the same as wmf.16
[14:59:45] <_joe_> uhm
[14:59:55] <_joe_> I'm not convinced, but we'll see :)
[15:00:08] but yes, the underlying cause still needs to be identified
[15:00:10] <_joe_> when we fixed that config we caused a second outage
[15:00:10] we'll see after manuel's m5 maintenance
[15:00:14] no
[15:00:22] you should investigate a little deeper I think
[15:00:26] this is not the same fix
[15:00:26] we went over this on friday late afternoon
[15:00:33] <_joe_> oh ok
[15:00:37] <_joe_> another config change fix
[15:00:40] * mark meeting
[15:00:58] instead of 'let's fix the typo' it is 'let's leave the typo in so it's just like .16'
[15:01:54] <_joe_> ok, aren't we in that situation already? and that was causing problems?
[15:02:21] <_joe_> anyways, I just hope things have been correctly evaluated
[15:02:45] m5 maintenance finished
[15:02:58] no
[15:03:55] initially there was 'let's fix the typo' (friday); this caused use of the new term store; this caused an outage; it was reverted; then the question was, in order to deal with bad data on groups 0 and 1, what shall we do
[15:04:03] the decision was: roll all the way back to .16
[15:04:33] now we have: 'have the typo in .18' and then we will have 'roll all groups forward', they will still all use the old term store
[15:04:38] that's the nutshell version
[15:05:44] <_joe_> I am not sure we're changing anything from pre-friday afternoon then
[15:05:48] <_joe_> 🤷
[15:08:33] from before hoo's first deploy of the config change? correct, except that group 2 will now be on 18 with the same typo config, on friday it was only groups 0 and 1
[15:10:41] well to be clear I don't know if the default setting was fixed (so it's not write_new) or if the typo was put back in
[15:10:53] either way, we won't have write_new, and that's the thing
[15:14:52] <_joe_> apergos: last time we rolled out .18 to all groups
[15:14:55] <_joe_> we went down
[15:15:01] <_joe_> that was before friday
[15:16:56] <_joe_> https://wikitech.wikimedia.org/wiki/Incident_documentation/20200206-mediawiki <- this
[15:20:02] let me verify that work was done on that
[15:20:51] twentyafterfour: ^^ what was the outcome of the thursday .18 rollout?
[15:21:43] apergos: immediate outage
[15:22:04] yes, I mean as far as fixing the underlying cause :-D
[15:23:39] I was never sure we identified the underlying cause
[15:24:34] some people had a hunch which seemed at least plausible but I wasn't sure (mostly due to the amount of moving pieces; I was having a hard time following all of it, and I hadn't slept, so I couldn't trust myself at that point to fully understand complex interrelated issues)
[15:26:40] might that still hit us then, whatever it is, on a deploy?
[15:27:04] yea, so I hate to say it but I still don't have confidence that we found or fixed the root cause...
[15:28:07] we can roll it out slowly through group0, then group1, then we could even do smaller subsets of group2 and watch the logs closely?
[15:28:25] what time is it starting?
[15:28:33] twentyafterfour: what is your timeline here
[15:28:48] I never did understand how it blew up so quickly on group2, yet nothing showed up on group1 whatsoever
[15:29:12] effie: RhinosF1: was gonna start with group0/1 shortly
[15:29:28] I don't have to do it at all today but if we don't get .18 out then we're in a weird place with .19
[15:29:34] Nice, I hope!
[15:30:00] if 18 is known to be a problem, I don't see how we can move forward with 19, is the thing
[15:30:13] I agree with apergos
[15:30:24] and if this is separate from the whole wb terms storage thing of friday....
[15:30:43] IMO we should roll forward carefully, but also we won't figure out what is wrong if we don't proceed
[15:32:17] part of the joy is that we have, well I have at any rate, 90 minutes of meetings starting in an hour
[15:33:14] apergos: perfect time for such changes, easy excuse to leave ;)
[15:33:28] lol
[15:33:40] I'm fine with putting it off if this is a bad time for everyone
[15:33:51] or we can just roll to group1 and leave group2 for tomorrow
[15:34:45] is there any scrying that can be done of the logs and so on from thursday that might give you folks a handle on things?
[15:35:19] otherwise what i'm hearing is 'let's try it and we'll see what goes wrong and diagnose when it does', whether it's later today or tomorrow
[15:35:45] apergos: I already dug through all the logs I know to look at and I didn't find many clues
[15:36:04] it went down so fast there weren't even many logs around the time of the outage
[15:36:21] right
[15:36:39] at least not many error logs from mediawiki and related services
[15:36:40] and to be clear it is possible that this could be the wb term store issue but we don't know, and are not convinced, right?
[15:36:57] right, it could be, and I'm not claiming it isn't
[15:37:03] right
[15:37:14] just that I wasn't totally confident and there were conflicting pieces of evidence
[15:37:40] if we rolled to group2 more slowly then it might have been noticed before reaching full outage proportions
[15:37:51] but what would we have noticed, I wonder
[15:38:08] load shooting up somewhere? I don't know why it didn't show up at all on group1
[15:38:16] i'm up for group 0 and 1 now, with fingers on the rollback trigger
[15:38:27] and still not sure what to do about group 2 yet
[15:38:32] yeah, previously group0/1 didn't cause an issue at all
[15:39:02] so let's do that as an easy decision first :-D
[15:39:16] the best idea I have really for group2 is we roll out progressively somehow instead of syncing them all at once...
[15:39:22] apergos: sounds good
[15:39:51] shall I do group0 now, followed shortly after by group1?
[15:41:32] thumbs up
[15:42:52] <_joe_> the problem has to do with queries on wikibase, we determined
[15:43:15] <_joe_> I thought that was something that would be looked into before continuing
[15:43:22] _joe_: so you were convinced that the thursday issue was indeed wikibase?
[15:43:28] <_joe_> I don't see the point of re-releasing the code that almost killed our dbs
[15:43:33] it was looked into but not by me and nobody updated incident docs
[15:43:44] (I mean, nobody from wmde did?)
[15:43:52] <_joe_> twentyafterfour: I had enough on my plate with other outages
[15:43:57] _joe_: Our DBs were not overloaded per se, they had a spike in connections, but not overload as in: slow queries or anything like that
[15:44:18] <_joe_> marostegui: we had the average latency of s8 and s4 (iirc?) spike up
[15:44:50] _joe_: right. and +1 to marostegui, the reason for my doubts is that our dbas were contradicting the db theory and their evidence was very convincing :D
[15:45:04] <_joe_> ok, lemme show you the evidence
[15:45:14] there was this https://phabricator.wikimedia.org/T244533 from the incident report
[15:45:16] _joe_: Yeah, but I am not sure if that's the cause or the consequence
[15:45:28] _joe_: I wasn't able to find any slow query
[15:45:40] <_joe_> this is the deploy window
[15:45:42] _joe_: I thought I had heard the issue was eventually understood as incompatible code and configs, and that was fixed?
[15:46:05] that was friday but not clear it was thursday's issue
[15:46:15] <_joe_> https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&from=1581020059491&to=1581021698250
[15:46:22] <_joe_> look at the last row of graphs
[15:46:31] Normally when we see the DBs being hit by slow queries it is super evident, this time it wasn't :(
[15:46:34] <_joe_> the latency of the monitoring query
[15:46:42] cdanis: yes that's what I was hearing but I just never was fully in the loop on the problem and solution enough to be sure it was truly the root cause of the original incident
[15:46:48] <_joe_> I think it's just load from a large amount of queries marostegui
[15:46:52] <_joe_> not slow queries per se
[15:46:53] it was all plausible but not clearly obvious
[15:47:39] _joe_: That could be, although in most cases, when they are super overloaded we'd have seen lag as well and I didn't see any
[15:48:13] <_joe_> marostegui: the correspondence is striking though
[15:48:44] _joe_: Yeah, I am not saying they were not the cause, but that I wasn't able to conclude they were either as the normal symptoms were not present this time
[15:49:09] marostegui: _joe_: has the 'monitoring scrape time' metric been diagnostically meaningful in the past?
[15:50:17] <_joe_> cdanis: only if the dbs are in real trouble
[15:50:24] <_joe_> so also look here https://grafana.wikimedia.org/d/000000607/cluster-overview?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=appserver&var-instance=All&from=1581019910438&to=1581022084564
[15:50:31] <_joe_> the cpu of the appservers went down
[15:50:41] <_joe_> if the problem is elsewhere in the code, it goes up
[15:50:47] <_joe_> anyways, I don't have to convince anyone
[15:51:07] <_joe_> I'm just saying another deployment without any corrective action is just doomed to fail
[15:52:16] we don't understand friday's failure either except that 'well look, here is what the one-line code change did, so this must somehow be at fault'
[15:52:25] _joe_: so you're saying you don't think the fix for friday's issue addressed the problem from thursday?
[15:52:30] is it worth it to understand that better first?
[15:53:10] apergos: yes I would like a better explanation from the wikidata team
[15:53:42] we'd want to be able to match that up to the graphs and the logs, at least have the story be convincing
[15:54:02] if we can do that we have a chance of comparing it to thursday's incident to see if same/not same
[15:54:12] without that... impossible
[15:54:40] ok, who can we/you bring in from the wikidata folks?
[15:56:53] note in 30 mins I'm gone til 8 pm my time (6 pm utc), meetings
[15:57:42] <_joe_> (I am gone in ~ 30 minutes)
[16:00:33] ok so the graphs do look very similar between the two outages (thursday and friday)
[16:00:43] this is friday: https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&from=1581084790469&to=1581088067987
[16:00:52] this is thursday: https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&from=1581020059491&to=1581021698250
[16:01:17] all the exact same patterns
[16:06:11] a lot of similarities, true
[16:13:52] I notice the db latencies in the graph for Friday are for s8 alone basically, but for Thursday's graph it's s8 and s4 if you look at the numbers
[16:26:47] yeah that is kinda odd, and I can't quite explain why that would be different
[16:28:35] <_joe_> well on friday we didn't deploy to group2
[16:28:56] I'm off to a meeting, sorry about this
[16:29:12] well s4 is commons but that's group1 see
[16:29:19] anyways I gotta leave it here for now....
[16:31:36] add*shore had some explanations in _security about why friday would have triggered the same thing as thursday but on a smaller scale. I am re-reading that stuff and I'm going to add more detail to the incident documentation on wiki
[16:39:49] twentyafterfour: the reason is that group2 also reads from s8 (as a client of Wikidata)
[16:40:26] twentyafterfour: I'm on the wikidata team, how can I help with the situation?
[16:41:10] Amir1: we are really just trying to be sure that the issue from friday was resolved and also be confident that there wasn't some other issue on thursday that remains unidentified and unresolved
[16:41:28] because if the thursday issue is unresolved then it's gonna blow up again when wmf.18 rolls to group2
[16:42:10] I can't tell you 100% for sure if the issue is resolved
[16:42:10] it seems like the issue from friday was probably the same issue on thursday although there might be other issues mixed into the same deployment since the train rolls out a lot of changes at once
[16:42:28] Amir1: of course, nothing is 100% but we're just looking to document what we know I guess
[16:42:31] twentyafterfour: the issue on friday was a config mix-up
[16:42:40] that is fixed and should not be an issue
[16:42:54] I'm around to double check everything
[16:43:35] the main issue was that the config had a typo so instead of reading the production value, it was reading the software default. We flipped the software default, the software got shipped to production
[16:43:45] well the problem now is timing, I think there are so many things going on (and few SREs available to watch this) that we might have to put it off until tomorrow
[16:44:05] the revert of the wikibase default flip is already deployed in wmf.18 and master
[16:44:58] I'm around to watch things, but if you think that's not enough, we can wait until tomorrow
[16:45:00] so do you think that the issue on thursday had the same root cause? add*shore seemed to think so
[16:45:07] yup
[16:45:27] my biggest concern is that this big outage might have overshadowed other UBN issues
[16:45:42] otherwise the main outage is definitely fixed
[16:45:50] thanks, btw, I just heard you were supposed to be on vacation during the last incident, and thanks for clarifying for us now
[16:45:52] at least from what I saw
[16:46:21] Amir1: agreed, the only thing that I'm still concerned about is that we missed other stuff because of multiple issues masking each other
[16:46:27] yup, I was on vacation until today, Adam dumped everything on me and left for skiing
[16:46:37] * Amir1 cursed
[16:46:45] *curses
[16:47:12] twentyafterfour: exactly, I think it should be on group1 for now
[16:47:27] so since you are confident about the root cause of thursday, I think I'm confident to deploy the train. I will go to group1 and wait until we have more sre coverage for group2
[16:47:28] so we can see at least some major UBNs before tomorrow
[16:47:34] right
[16:47:42] great
[16:47:51] enjoy! this could be fun
[16:47:53] going to do that now, sorry for the delay, had to re-read a lot of backlog
[16:49:32] all good
[16:50:29] I'm around to make sure everything works fine; we actually shipped some performance improvements, so reads on s8 (rows read especially) should drastically go down
[16:54:51] amir1: i would love it if you could also add your knowledge to the incident report or the task, wherever this info is being collected, especially on the thursday outage
[16:55:07] (still in meetings for over an hour)
[16:57:16] apergos: Sure thing. Let me add it to my todo list for today
[17:01:25] I updated the conclusions on https://wikitech.wikimedia.org/wiki/Incident_documentation/20200206-mediawiki
[17:01:32] please correct me if I got any of it wrong
[17:01:54] there is no incident report for thursday but there is a task
[17:03:16] that is the thursday one that you linked
[17:03:41] is there a Friday one (or two?)? I did look but saw nothing
[17:04:35] there isn't a separate incident report is there?
[17:04:41] there is not one; it would be good to write one (start one? :-))
[17:04:52] since it's not clear they are the same cause
[17:05:04] I think it's looking pretty clear to me now that they are
[17:05:13] which is what I wrote in conclusions on wiki
[17:05:35] trying to at least make a case for it
[17:06:04] making a case is good; if they turn out to be the same then that's good, but we need a place to gather all the friday data, wherever it is
[17:06:19] (I'm actually in a meeting, just drive-by commentary here)
[17:06:25] I put together this little table this morning tracking our wbterms config headaches and what we suspect the problems were https://phabricator.wikimedia.org/T244697
[17:07:24] But I don't really know if there were one or two problems yesterday
[17:07:36] *friday
[17:07:45] tarrow: nice, thanks for that.
[17:08:46] I'm happy to write some bits about the wbterms things. I know nothing at all about the zhwiki stuff that seemed to be happening just before (or contemporaneously with) the terms config change though
[17:09:07] just the terms stuff is great!
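(To make the failure mode Amir1 describes above concrete: the production override sat behind a misspelled key, so the code never saw it and fell through to the software default; when that default was flipped, production behavior flipped with it even though the override still said "off". A minimal Python sketch of that fall-through follows; this is not MediaWiki code, and the setting name is invented for illustration.)

```python
# Not MediaWiki code: a minimal sketch of how a typo'd config key lets a
# flipped software default leak into production. The setting name is invented.
SOFTWARE_DEFAULTS = {"termStoreWriteNew": False}

# Production override, but the key is misspelled, so it never matches.
PRODUCTION_OVERRIDES = {"termStoreWriteNew ": False}  # note the stray trailing space


def effective(name: str) -> bool:
    # Override wins when present; otherwise the software default applies.
    return PRODUCTION_OVERRIDES.get(name, SOFTWARE_DEFAULTS[name])


# Before the deploy: the default is False, so the ignored override goes unnoticed.
assert effective("termStoreWriteNew") is False

# The deploy flips the software default; production silently starts writing to
# the new term store even though the (typo'd) override still says False.
SOFTWARE_DEFAULTS["termStoreWriteNew"] = True
assert effective("termStoreWriteNew") is True
```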
[17:11:58] https://wikitech.wikimedia.org/wiki/Incident_documentation/20200207-wikidata
[17:12:10] template page for friday's incident
[17:12:54] I wrote a short and not very helpful summary but other than that I've so far put all my effort into the previous incident report which has a lot of detail and timeline already well documented
[17:59:49] meeting over
[18:00:25] I'll try to look at those and dump anything else I can think of on there tomorrow (it's already my evening here)
[18:00:58] anything that can help us nail down if the underlying causes are the same and how/why behavior might have differed a bit
[18:04:52] and as our meeting ends another deployment slot starts, of course
[18:13:39] hey sre folks, anyone who can oversee group0, 1 to wmf.18?
[18:13:53] because I really want to be done for the day at this point, 8 pm here and it's been a long day
[18:14:20] no group2, that will be tomorrow
[18:15:35] some sre in an sf timezone.... don't be shy
[18:16:20] I need to get lunch so I don't starve, but I can look in 45m or so if nobody jumps on it first
[18:16:32] let me know what "oversee" means though please :)
[18:16:41] What exactly is involved? I have bandwidth but no idea
[18:17:08] be around during and for a while (30 min to 1 hour?) after in case things go bad
[18:17:27] not expecting any issues since we won't go to group2
[18:17:41] be ready to determine that it's actually the deploy that's a problem and be around for the revert, which can also sometimes be exciting
[18:17:44] yep can do
[18:17:46] I'll be here
[18:18:04] thank you, then I am officially checked out for the day
[18:18:17] thanks apergos, have a good one
[18:18:24] later apergos
[18:18:37] :-)
[18:19:52] ok so thanks chaomodus for being available, I'm going ahead with group0 now, all the fun (probably boring) will continue in #wikimedia-operations