[04:16:45] DBA, Operations, User-Kormat: Add alert for prometheus-mysql-exporter failing to scrape mysql - https://phabricator.wikimedia.org/T257056 (Marostegui) I guess this is the known bug and restarting the exporter fixed it or is there something else?
[04:42:32] DBA, Patch-For-Review: Compress enwiki InnoDB tables - https://phabricator.wikimedia.org/T254462 (Marostegui)
[04:43:35] DBA, Cloud-Services: Prepare and check storage layer for sysop_itwiki - https://phabricator.wikimedia.org/T257125 (Marostegui) a:Kormat As this is going to be private, please do not create this database until we've set the replication filters in place.
[04:43:48] DBA, Cloud-Services: Prepare and check storage layer for sysop_itwiki - https://phabricator.wikimedia.org/T257125 (Marostegui) p:Triage→Medium
[04:54:27] DBA, Operations, ops-eqiad, User-Kormat: Degraded RAID on db1077 - https://phabricator.wikimedia.org/T256939 (Marostegui) Open→Resolved I have tried to check the history for this host/service on icinga to see if it recovered for a while and then got triggered again for some reason (not th...
[04:55:54] DBA, Puppet, cloud-services-team (Kanban): labtestpuppetmaster2001 is failing to backup - https://phabricator.wikimedia.org/T256846 (Marostegui) p:Triage→Medium
[05:07:09] DBA, Analytics, User-Kormat: dbstore1005 s8 mariadb instance crashed - https://phabricator.wikimedia.org/T256966 (Marostegui) >>! In T256966#6274434, @Kormat wrote: > This host was reimaged to buster recently (2020-06-22) as part of T254870, and the symptoms do sound very like https://jira.mariadb.or...
[05:08:55] DBA, Performance-Team: Database for XHGui profiles - https://phabricator.wikimedia.org/T254795 (Marostegui) The tasks description says: ` Likewise, we probably do not need to back it up.
`
[05:22:08] DBA, Operations, CAS-SSO, Patch-For-Review, User-jbond: Request new database for idp-test.wikimedia.org - https://phabricator.wikimedia.org/T256120 (Marostegui) Resolved→Open >>! In T256120#6273923, @jbond wrote: > Thanks @jcrespo Thanks for helping this is all set up and ready for th...
[07:12:06] DBA, Operations: db1097 (m1 master) crashed due to memory issues. - https://phabricator.wikimedia.org/T256717 (Marostegui) @akosiaris @jcrespo let's replace this master on Wednesday at 08:00 AM UTC?
[07:12:57] DBA, Operations: db1097 (m1 master) crashed due to memory issues. - https://phabricator.wikimedia.org/T256717 (akosiaris) Fine by me.
[07:13:09] DBA, Operations, User-Kormat: Add alert for prometheus-mysql-exporter failing to scrape mysql - https://phabricator.wikimedia.org/T257056 (Kormat) Yep, that's it. It's not so much the bug that bothers me as us not being aware of it in some cases for 2 months.
[07:14:07] DBA, Cloud-Services, User-Kormat: Prepare and check storage layer for sysop_itwiki - https://phabricator.wikimedia.org/T257125 (Kormat)
[07:14:11] DBA, Operations, User-Kormat: Add alert for prometheus-mysql-exporter failing to scrape mysql - https://phabricator.wikimedia.org/T257056 (Marostegui) Completely agree - I was just asking in case it was something else what made it fail
[07:18:01] marostegui: congratulations on managing to arrange an array of different issues while you were gone ;)
[07:18:15] \o/
[07:24:16] DBA, Analytics, User-Kormat: dbstore1005 s8 mariadb instance crashed - https://phabricator.wikimedia.org/T256966 (Kormat) > From the reported logs In that case let me supply more logs :) The errors from line 15 onwards are what made me think of that mariadb upstream issue.
{P11741}
[07:29:58] DBA, Analytics, User-Kormat: dbstore1005 s8 mariadb instance crashed - https://phabricator.wikimedia.org/T256966 (Marostegui) Ah, I only saw the ones reported on the task initial creation. Those are definitely similar to the ones we did see during the crashes with labsdb hosts. Going to comment on th...
[07:34:26] DBA, Analytics, Upstream, User-Kormat: dbstore1005 s8 mariadb instance crashed - https://phabricator.wikimedia.org/T256966 (Marostegui)
[07:42:14] DBA, Operations: db1097 (m1 master) crashed due to memory issues. - https://phabricator.wikimedia.org/T256717 (jcrespo) Ok.
[07:54:46] jynus changed the topic to Up | Log: https://bit.ly/wikitech | Channel logs: https://bit.ly/opsirclog | Ops Clinic Duty: jynus -> my condolences jynus
[07:55:15] DBA, Operations, User-Kormat: Port DBA dashboards to thanos - https://phabricator.wikimedia.org/T256730 (Kormat)
[08:02:14] "An upgrade will only be possible after a clean shutdown. mariabackup --prepare will not work with backups taken before version 10.5.2."
[08:02:24] https://mariadb.com/kb/en/changes-improvements-in-mariadb-105/
[08:03:00] jynus: i hope that doesn't fall into the "improvements" category ;)
[08:03:38] Implemented Features: InnoDB: Performance Improvements
[08:03:59] hah, welp
[08:04:07] sorry, "InnoDB: Performance Improvements etc."
[08:04:10] that's going to be a massive pain
[08:04:14] (i assume)
[08:04:16] I am guessing that is on etc
[08:04:20] not really
[08:04:36] we don't normally upgrade after a crash
[08:04:46] and we control the packaging version
[08:04:58] i was thinking of the second half
[08:04:59] but it will create same backup issues
[08:05:06] as with the 10.4 upgrades
[08:05:14] we will need to maintain 2 parallel backup envs
[08:06:05] and that is why we create both logical AND raw backups
[08:51:54] DBA, Product-Infrastructure-Team-Backlog, Push-Notification-Service, Patch-For-Review, User-Marostegui: DBA review for Echo push notification subscription tables - https://phabricator.wikimedia.org/T246716 (Marostegui) `echo_push_subscription` table added to replication filters on sanitarium...
[09:19:58] DBA, Cloud-Services, User-Kormat: Prepare and check storage layer for sysop_itwiki - https://phabricator.wikimedia.org/T257125 (Kormat) The replication filters are in place. @Urbanecm: please ping me (on irc, or by updating the ticket) once the database is created, so we can confirm it's not being re...
[10:29:47] DBA, Operations, CAS-SSO, Patch-For-Review, User-jbond: Request new database for idp-test.wikimedia.org - https://phabricator.wikimedia.org/T256120 (jbond) @Marostegui sorry for the confusion in the initial request. To clarify we would like this database to be available at all times but wit...
[10:59:31] DBA, Operations, CAS-SSO, Patch-For-Review, User-jbond: Request new database for idp-test.wikimedia.org - https://phabricator.wikimedia.org/T256120 (Marostegui) Let's move this to a "more stable" place then. This host is not guaranteed to be up really, we use it for many things, it can fail,...
[11:03:31] DBA, Operations, CAS-SSO, Patch-For-Review, User-jbond: Request new database for idp-test.wikimedia.org - https://phabricator.wikimedia.org/T256120 (jbond) @Marostegui that sounds good to me, no need to copy the current data, just let me know when its in place and ill update my config. Thanks
[11:08:33] DBA, Patch-For-Review: Compress enwiki InnoDB tables - https://phabricator.wikimedia.org/T254462 (Marostegui)
[11:15:14] DBA, Puppet, cloud-services-team (Kanban): labtestpuppetmaster2001 is failing to backup - https://phabricator.wikimedia.org/T256846 (aborrero)
[11:35:58] DBA: imagelinks has index mismatch on s8 - https://phabricator.wikimedia.org/T256680 (Marostegui)
[11:36:05] DBA: imagelinks has index mismatch on s8 - https://phabricator.wikimedia.org/T256680 (Marostegui) Open→Resolved a:Marostegui All done
[11:36:10] DBA, Datasets-General-or-Unknown, Patch-For-Review, Sustainability (Incident Prevention), WorkType-NewFunctionality: Automate the check and fix of object, schema and data drifts between mediawiki HEAD, production masters and slaves - https://phabricator.wikimedia.org/T104459 (Marostegui)
[12:43:45] DBA, Operations, User-Kormat: Port DBA dashboards to thanos - https://phabricator.wikimedia.org/T256730 (Kormat)
[13:23:10] DBA, Operations, User-Kormat: Port DBA dashboards to thanos - https://phabricator.wikimedia.org/T256730 (Kormat)
[13:24:37] https://grafana.wikimedia.org/d/000000273/mysql?orgId=1 is ported to thanos now. let me know if you encounter any issues.
[13:25:19] kormat: thank you
[13:25:31] marostegui: you're somewhat welcome
[13:26:58] DBA, Patch-For-Review: Switchover es5 master from es1023 to es1024 - https://phabricator.wikimedia.org/T255755 (Marostegui)
[13:28:27] DBA, MediaWiki-General, MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), Patch-For-Review, and 2 others: Normalise MW Core database language fields length - https://phabricator.wikimedia.org/T253276 (Marostegui)
[13:42:40] Blocked-on-schema-change, DBA: Extend echo_unread_wikis.euw_wiki - https://phabricator.wikimedia.org/T255174 (jcrespo) Note I believe some wikis have echo tables embedded into the metadata database CC @Reedy T119154 Do these need alter too?
[13:48:47] DBA, MediaWiki-General, MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), Patch-For-Review, and 2 others: Normalise MW Core database language fields length - https://phabricator.wikimedia.org/T253276 (Marostegui)
[13:51:51] DBA, MediaWiki-General, MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), Patch-For-Review, and 2 others: Normalise MW Core database language fields length - https://phabricator.wikimedia.org/T253276 (Marostegui) [] labsdb1012 [] labsdb1011 [] labsdb1010 [] labsdb1009 [x] dbstore1003 [] db1145 [] db1144...
[13:59:46] DBA, Operations, User-Kormat: Port DBA dashboards to thanos - https://phabricator.wikimedia.org/T256730 (Kormat)
[13:59:57] DBA, Operations, User-Kormat: Port DBA dashboards to thanos - https://phabricator.wikimedia.org/T256730 (Kormat) Open→Resolved All done.
[14:03:24] DBA, Operations: db1079 BBU crashed host rebooted - https://phabricator.wikimedia.org/T257216 (Marostegui)
[14:03:26] DBA, observability: Add haproxy-exporter to dbproxy hosts - https://phabricator.wikimedia.org/T191400 (fgiunchedi)
[14:04:27] DBA, observability: Add haproxy-exporter to dbproxy hosts - https://phabricator.wikimedia.org/T191400 (Marostegui) p:Triage→Medium
[14:06:11] DBA, Operations: db1079 BBU crashed host rebooted - https://phabricator.wikimedia.org/T257216 (Marostegui) @wiki_willy @Jclark-ctr do we have spare BBUs around?
[14:06:13] Blocked-on-schema-change, DBA: Extend echo_unread_wikis.euw_wiki - https://phabricator.wikimedia.org/T255174 (Reedy)
[14:08:12] DBA, Operations, Patch-For-Review: db1079 BBU crashed host rebooted - https://phabricator.wikimedia.org/T257216 (Jclark-ctr) @Marostegui yes believe we still have some. I will be on site in a few hours if we wanted to change it today
[14:08:23] DBA, Operations, Patch-For-Review: db1079 BBU crashed host rebooted - https://phabricator.wikimedia.org/T257216 (jcrespo) db1079 was depooled: P11751 Main traffic removed from db1136 as it is currently the only s7 API host on eqiad: P11752 Both should be removed or taken into account if host is rep...
[14:09:00] DBA, Operations, Patch-For-Review: db1079 BBU crashed host rebooted - https://phabricator.wikimedia.org/T257216 (Marostegui) >>! In T257216#6281956, @Jclark-ctr wrote: > @Marostegui yes believe we still have some. I will be on site in a few hours if we wanted to change it today Excellent, I am goi...
[14:09:47] marostegui: you scheduled that to happen while you were away and it took a few days longer than you expected, i'm betting
[14:09:57] kormat: yeah, I messed up the dates
[14:10:28] kormat: It's exam time: now you should go and say everything we did and why!
[14:10:37] (without looking)
[14:11:06] jynus: i saw that it was a BBU failure, but given that the host rebooted on its own i would have looked at the hardware logs in the ilo
[14:11:37] well, I'm less interested in the cause (hw is boring for me :-P) than in attending the page :-D
[14:12:04] cause research is part of it, but less important
[14:12:07] i was wondering if i should jump in and claim it, but manuel's reflexes are cat-like
[14:12:24] it is ok, if you see a few people on it
[14:12:44] just trying to transmit what we did for learning purposes
[14:13:18] in a tongue-in-cheek way, I am not evaluating you! :-D
[14:13:43] if a host reboots on its own, depooling it is a good first step
[14:13:49] yep
[14:13:50] if nothing else it gives you time to figure out what the issue is
[14:14:01] when would it not be a good option?
[14:14:05] to just depool?
[14:14:14] (extra points! :-D)
[14:14:15] if it's a master
[14:14:22] yep
[14:14:39] what would you do next?
[14:15:21] note we didn't necessarily know what happened really
[14:15:29] how would you check?
[14:15:33] at the time
[14:15:45] after it's depooled? investigate why it rebooted. in this case, it's because the BBU failed. as this isn't a high-traffic host it can probably survive without BBU in the short-term
[14:15:50] kormat: just for you to know, the way we normally organize without even saying it is that whoever is on the host CLI just does that, and the other does the surrounding things like depooling, communications, alerts etc, if you are on your own then you have to do all that :)
[14:16:02] marostegui: ok, good to know
[14:16:22] wait, which of the roles did you have now?
[14:16:26] *had
[14:16:36] CLI!
[14:16:40] ok ok
[14:16:51] so I don't think I communicated that well
[14:17:25] one of the things I like to check is actual impact
[14:17:32] sometimes metrics are not a good indication of real impact
[14:17:53] e.g.
lots of db connection errors don't mean actual user requests are failing
[14:18:50] for example, it says there were 130 000 errors: https://logstash.wikimedia.org/goto/d78b0640dbbf02a042dcc07c2b61083f
[14:19:13] but most of those are just the connection checks checking for health again and again
[14:19:23] (not ideal, but that is how it works right now)
[14:20:12] i still have no real idea how to make sense of logstash
[14:20:16] don't worry
[14:20:21] it was just an illustration
[14:20:23] not important
[14:20:34] what I wanted to transmit is that there were a lot of errors reported
[14:20:59] but I think there were fewer than 100 actual user errors of ongoing queries being closed suddenly
[14:23:27] kormat: then did you see why I did a second deploy?
[14:24:02] i didn't notice any deploys
[14:24:11] deploy as in config deploy
[14:24:17] not mw deploy
[14:24:30] ah, dbconf?
[14:24:36] yep
[14:25:03] right, i see it now
[14:25:04] I guess it is not technically a deploy as we only touch etcd?
[14:25:10] makes sense
[14:25:34] I am not worried about the BBU parts/hw, because that will be something different every time
[14:25:56] but those changes will be very similar in certain cases
[14:26:02] so as a general rule when depooling a host like that i should be looking at the expected impact on the other hosts in the section
[14:26:15] well, in this case it wasn't even needed
[14:26:16] (and if i can't predict the impact, i should at least be looking at graphs for them)
[14:26:26] but I am a detailed person
[14:26:44] I guess ideally we could pool another host as api
[14:26:54] to avoid the SPOF on the load group?
[14:27:55] but yes, when one thing breaks and you fix it, it is a good idea in general to make sure that 1) it is really fixed (e.g.
asking other people for confirmation)
[14:28:03] 2) it didn't break something else :-P
[14:28:30] * kormat nods
[14:28:33] it is very easy to depool the wrong host
[14:28:36] or something like that
[14:28:49] or if this was bad traffic
[14:28:58] like, a new query that is very heavy
[14:29:11] depooling will not fix anything, in fact it will break things more
[14:29:23] quite
[14:29:33] I am trying to speak my mind in hope it is helpful
[14:29:36] not sure if it is
[14:29:47] it is, yes. thank you :)
[14:30:03] these real examples are more helpful I think than theoretical examples
[14:30:15] for example, I was rusty with dbctl
[14:30:25] and I realized I have to practice it more
[14:30:34] any thoughts marostegui?
[14:31:19] sorry, I haven't read the backlog since I wrote, I am busy with something else
[14:31:33] ok, sorry
[14:31:51] he's trying to break more BBUs
[14:56:55] mailman, idp, some others that I don't remember right now? Should we budget an m6 +1 host x DC just to be sure for next year's projects?
[14:58:46] I was planning to place idp on m1
[14:58:56] yeah, no problem with that
[14:59:06] I am worried about increased usage one year from now
[14:59:08] about mailman, I don't know, I haven't thought about it
[14:59:21] so as to budget it now when there is time?
[14:59:41] I believe at least mailman was a large project
[14:59:59] although it is true that m1 is mostly underutilized
[15:02:32] as far as I know the budget period is done, I think - this mailman thing came up when it was already over
[15:02:49] we still have 4 hosts for unexpected things I reckon
[15:03:04] But maybe for now m1 is our best approach, I guess?
[15:09:04] mark: budgeting is already finished?
[15:41:40] Blocked-on-schema-change, DBA: Add img_deleted column on wikis where it's missing - https://phabricator.wikimedia.org/T257222 (Reedy)
[15:43:29] Blocked-on-schema-change, DBA: Add img_deleted column on wikis where it's missing - https://phabricator.wikimedia.org/T257222 (Marostegui) Mmmm, this is weird, as we just finished {T250055} cc @Ladsgroup
[15:45:41] Blocked-on-schema-change, DBA: Add img_deleted column on wikis where it's missing - https://phabricator.wikimedia.org/T257222 (Reedy) Sorry :( I guess it wasn't checked where it came from, and whether it would be used? No reference to T90300 in that task
[15:47:01] Blocked-on-schema-change, DBA: Add img_deleted column on wikis where it's missing - https://phabricator.wikimedia.org/T257222 (Reedy)
[15:51:28] Blocked-on-schema-change, DBA: Add img_deleted column - https://phabricator.wikimedia.org/T257222 (Reedy)
[15:52:27] Blocked-on-schema-change, DBA: Add img_deleted column - https://phabricator.wikimedia.org/T257222 (Reedy) >>! In T257222#6282268, @Reedy wrote: > Sorry :( > > I guess it wasn't checked where it came from, and whether it would be used? No reference to T90300 in that task Looks like a phab search would'v...
[15:53:18] Blocked-on-schema-change, DBA: Add img_deleted column - https://phabricator.wikimedia.org/T257222 (Marostegui) Checking `tables.sql` I don't see that `img_deleted` column there, so should we get a patch merged before that to avoid those issues again?
[15:56:48] DBA, CheckUser, Growth-Team, Thanks: Monitor the growth of CheckUser tables after the addition of Thanks data - https://phabricator.wikimedia.org/T257223 (Huji)
[15:57:20] DBA, CheckUser, Growth-Team, Thanks: Monitor the growth of CheckUser tables after the addition of Thanks data - https://phabricator.wikimedia.org/T257223 (Huji)
[15:58:33] Blocked-on-schema-change, DBA: Add img_deleted column - https://phabricator.wikimedia.org/T257222 (Reedy) Yup, already started with that; I've taken the original patch and slimmed it down {F183775} >>! In T90300#6282219, @gerritbot wrote: > Change 609798 had a related patch set uploaded (by Reedy; owner...
[15:59:41] DBA, CheckUser, Growth-Team, Thanks: Monitor the growth of CheckUser tables after the addition of Thanks data - https://phabricator.wikimedia.org/T257223 (Huji) Open→Stalled Stalled until T252226 actually reaches production servers. This would be a good time for the DBA team to come up w...
[16:01:28] DBA, CheckUser, Growth-Team, Thanks, User-DannyS712: Monitor the growth of CheckUser tables after the addition of Thanks data - https://phabricator.wikimedia.org/T257223 (DannyS712)
[16:31:04] Blocked-on-schema-change, DBA: Add img_deleted column - https://phabricator.wikimedia.org/T257222 (Tgr) >>! In T257222#6282328, @Reedy wrote: > With how it was originally implemented, it would add the column, and then do some UPDATE queries against a few tables, unbounded, not batched etc - https://gerri...
[16:31:22] Blocked-on-schema-change, DBA: Add img_deleted column - https://phabricator.wikimedia.org/T257222 (Tgr) Open→Stalled
[17:03:17] DBA, Operations: db1079 BBU crashed host rebooted - https://phabricator.wikimedia.org/T257216 (Jclark-ctr) @Marostegui BBU replaced host is powering up now
[17:39:53] DBA, Performance-Team: Database for XHGui profiles - https://phabricator.wikimedia.org/T254795 (dpifke) I think there was some minor debate on that point. We'll resolve it in our team meeting today and I'll come back with a definitive answer.
[21:46:32] DBA, CheckUser, Trust-and-Safety, WMF-Legal, and 2 others: Configure WMF wikis to log login attempts in CheckUser - https://phabricator.wikimedia.org/T253802 (jrbs) Just flagging here for the record - T&S has been made aware of this task and have no objections.
[22:39:03] DBA, CheckUser, Trust-and-Safety, WMF-Legal, and 2 others: Configure WMF wikis to log login attempts in CheckUser - https://phabricator.wikimedia.org/T253802 (Huji)
[23:28:46] DBA, Performance-Team: Database for XHGui profiles - https://phabricator.wikimedia.org/T254795 (dpifke) The consensus is that this database should be backed up if possible. We've been able to track down performance regressions by comparing the profiler output of a page to profiles taken months (or even...
[23:50:45] DBA, Performance-Team: Database for XHGui profiles - https://phabricator.wikimedia.org/T254795 (Dzahn) a:dpifke→jcrespo