[00:04:59] PROBLEM - MariaDB sustained replica lag on es1022 is CRITICAL: 3 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=es1022&var-port=9104
[00:05:59] RECOVERY - MariaDB sustained replica lag on es1022 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=es1022&var-port=9104
[05:06:37] DBA, decommission-hardware, Patch-For-Review: decommission es2013.codfw.wmnet - https://phabricator.wikimedia.org/T263740 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin1001 for hosts: `es2013.codfw.wmnet` - es2013.codfw.wmnet (**PASS**) - Downtimed host on Icinga...
[05:09:03] DBA, decommission-hardware, Patch-For-Review: decommission es2013.codfw.wmnet - https://phabricator.wikimedia.org/T263740 (Marostegui)
[05:30:00] DBA, Operations, ops-codfw, Patch-For-Review: Degraded RAID on es2026 - https://phabricator.wikimedia.org/T263837 (Marostegui) a: Marostegui→Papaul @Papaul I think we need to ask for another disk or advise from Dell. These are the controller logs after the reboot: ` Time: Tue Sep 29 05:17:...
[05:31:02] DBA, Operations, ops-codfw, Patch-For-Review: Degraded RAID on es2026 - https://phabricator.wikimedia.org/T263837 (Marostegui)
[05:50:41] DBA, CheckUser, Stewards-and-global-tools, I18n, and 2 others: Incomplete i18n for log entries in CheckUser - https://phabricator.wikimedia.org/T41013 (Marostegui) I don't really have context on this, this is probably a question for someone who knows MW in depth.
[05:52:59] DBA, Operations, Performance-Team, Platform Engineering, User-Kormat: Remove groups from db configs - https://phabricator.wikimedia.org/T263127 (Marostegui) Not sure what `category` is as it doesn't appear on https://noc.wikimedia.org/dbconfig/eqiad.json From there we have these active group...
[06:04:36] DBA, decommission-hardware: decommission es2019.codfw.wmnet - https://phabricator.wikimedia.org/T264063 (Marostegui)
[06:07:50] DBA, decommission-hardware, Patch-For-Review: decommission es2019.codfw.wmnet - https://phabricator.wikimedia.org/T264063 (Marostegui)
[06:07:52] DBA, Patch-For-Review: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717 (Marostegui)
[06:07:54] DBA, decommission-hardware, Patch-For-Review: decommission es2019.codfw.wmnet - https://phabricator.wikimedia.org/T264063 (Marostegui)
[06:57:10] I found the mysql.py -h :s4 quite useful if you don't want to think about ports
[06:57:39] e.g. in an emergency where you know the host but don't want to remember the port
[07:45:52] DBA, decommission-hardware, Patch-For-Review: decommission es2019.codfw.wmnet - https://phabricator.wikimedia.org/T264063 (Marostegui)
[07:52:17] elukey: we got the following warning: "Last dump for analytics_meta at eqiad (db1108.eqiad.wmnet:3352) taken on 2020-09-29 02:07:47 is 2 GB, but previous one was 1 GB, a change of 64.9%"
[07:52:58] it is probably nothing, but worth checking
[08:43:40] DBA, Goal, Patch-For-Review: Expand database provisioning/backup service to accomodate for growing capacity and high availability needs - https://phabricator.wikimedia.org/T257551 (jcrespo) db1150 is now fully setup, set as active on netbox, added to tendril and zarcillo/prometheus and notifications...
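(A minimal sketch of how the replica lag behind alerts like the es1022 ones above can be cross-checked by hand, assuming direct SQL access to the replica. The production check itself is driven by the Prometheus data on the linked Grafana dashboard, so this is only a manual sanity check; the second query assumes the standard pt-heartbeat heartbeat.heartbeat table layout.)

```sql
-- Quick manual cross-check of replication lag on a replica such as es1022.
-- Seconds_Behind_Master in this output is the server's own lag estimate:
SHOW SLAVE STATUS\G

-- Lag derived from the pt-heartbeat rows written on the master; assumes the
-- standard heartbeat.heartbeat layout with a `ts` timestamp column and
-- servers running in UTC:
SELECT TIMESTAMPDIFF(SECOND, MAX(ts), UTC_TIMESTAMP()) AS approx_lag_seconds
  FROM heartbeat.heartbeat;
```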
[08:44:43] DBA, Epic, Patch-For-Review: Improve regular production database backups handling - https://phabricator.wikimedia.org/T138562 (jcrespo)
[08:44:45] DBA, Goal, Patch-For-Review: Expand database provisioning/backup service to accomodate for growing capacity and high availability needs - https://phabricator.wikimedia.org/T257551 (jcrespo) Open→Resolved After T257551#6493655 and T257551#6493809, and all pending hardware setup, I consider thi...
[09:20:24] marostegui: for T239238, is the thread_pool_stall_limit significant in https://phabricator.wikimedia.org/P12828 ?
[09:21:09] T239238: Switchover s8 primary database master db1109 -> db1104 - 2020-09-29 08:00 UTC - https://phabricator.wikimedia.org/T239238
[09:22:04] kormat: I believe we keep it 10 for masters, yeah
[09:22:08] and 100 for slaves
[09:23:11] ah i see, so running puppet will change it, at least on disk
[09:23:32] yeah, you'll need to change it live on mysql itself with set global thread_pool_stall_limit = 10;
[09:23:39] (I believe it is dynamic, I haven't checked)
[09:23:55] https://mariadb.com/kb/en/thread-pool-system-status-variables/#thread_pool_stall_limit says dynamic: yes
[09:23:59] \o/
[09:26:43] added as a clean-up step
[09:26:52] \o/
[09:34:56] jynus: o/ - so yesterday I had to purge some binlogs since the partition dedicated to mariadb on an-coord1001 was saturating.. I used the purge command two times, the first with sql_log_bin, the latter probably not now that I think about it. Is this the cause of the warning? Let me know if I made some stupid mistake
[09:38:13] no, binlogs are not backed up
[09:38:27] this means that the database doubled since last week
[09:40:17] if that is normal and expected, I will just ack the alert
[09:40:54] this is the link to the alert: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=alert1001&service=dump+of+analytics_meta+in+eqiad
[09:41:04] ahhh sorry I've read it in the wrong way
[09:42:14] I added a little database during the past days (hue_next), but it is tiny, and I have seen the binlog files growing a lot. I am trying to figure out what db is the root cause
[09:42:36] "I have seen the binlog files growing a lot" that means it has a lot of writes
[09:42:57] e.g. probably updates or insert + deletes
[09:43:07] if the db itself didn't grow that much
[09:43:22] is there a way to see from the dumps if there is a db that increased more than others? (if it is quick of course, don't want to force you to spend time on this)
[09:43:33] elukey: actually yes
[09:43:50] not on the alert, but we gather detailed metadata of every db and table
[09:44:00] let me show you on pm
[09:44:03] wow
[09:44:32] we don't have yet a good dashboard of it because most of those cannot be public
[09:44:46] but the data is very interesting for historical analysis
[09:50:05] marostegui, sobanski: ok, going to start the prep for T239238 now
[09:50:05] T239238: Switchover s8 primary database master db1109 -> db1104 - 2020-09-29 08:00 UTC - https://phabricator.wikimedia.org/T239238
[09:50:20] kormat: +1
[10:09:22] how is it going?
[10:10:43] smoothly
[10:10:46] in clean-up phase now
[10:10:55] cool
[10:14:07] marostegui: for the final step "Check tendril and zarcillo were updated correctly" - tendril at least looks to be displaying the right topology
[10:14:32] ahh - on zarcillo i need to update the `masters` table. anything else?
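(An aside on the thread_pool_stall_limit exchange above, before the switchover conversation continues below: the clean-up step comes down to one dynamic variable. A minimal sketch, assuming direct SQL access to the new master; the 10-for-masters / 100-for-replicas values are the ones stated in the exchange, and puppet manages the same setting in the on-disk config.)

```sql
-- Current value (puppet keeps the on-disk config in sync separately):
SHOW GLOBAL VARIABLES LIKE 'thread_pool_stall_limit';

-- Dynamic per the MariaDB docs linked above, so it can be changed live
-- without a restart; 10 on masters, 100 on replicas:
SET GLOBAL thread_pool_stall_limit = 10;
```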
[10:14:46] kormat: I believe the script does it, it is just to double check it was done correctly
[10:15:00] | s8 | eqiad | db1104 |
[10:15:02] so it does, and did
[10:15:06] \o/
[10:15:08] yes, that part was coded in, but sometimes there are existing inconsistencies
[10:15:27] the only thing I would have wanted to include that wasn't was the dbctl commands
[10:15:29] alrighty, all done then
[10:15:36] marostegui: am i good to resolve this?
[10:16:02] kormat: Just double check the new master is on RO and the old one is also RO and I think you are good to go!
[10:16:24] the new master also has pt-heartbeat running fine, right?
[10:16:39] RO confirmed x2
[10:17:16] oh, finally the read only options were useful!
[10:17:54] the other thing that has to be done manually is the events
[10:17:58] marostegui: heartbeat is running, and i can see updates in the heartbeat table on the old master
[10:18:21] jynus: yes, that is part of the checklist
[10:18:25] cool
[10:18:45] until we get rid of them, we could package them as part of the script and do it automatically
[10:18:47] and betterworkds updated :)
[10:18:49] kormat: sweet! then you are good to go
[10:19:01] so many things I would like to do and so little time! :-D
[10:31:57] DBA, User-Kormat: Switchover s8 primary database master db1109 -> db1104 - 2020-09-29 08:00 UTC - https://phabricator.wikimedia.org/T239238 (Kormat) Open→Resolved Success :)
[10:32:02] DBA: Failover DB masters in row D - https://phabricator.wikimedia.org/T186188 (Kormat)
[10:32:04] Blocked-on-schema-change, DBA, Wikidata: Schema change on production for increase the size of wbt_text_in_lang.wbxl_language - https://phabricator.wikimedia.org/T237120 (Kormat)
[10:32:10] DBA, User-Kormat: Switchover s8 primary database master db1109 -> db1104 - 2020-09-29 08:00 UTC - https://phabricator.wikimedia.org/T239238 (Marostegui) <3
[10:33:31] DBA: Failover DB masters in row D - https://phabricator.wikimedia.org/T186188 (Kormat)
[10:38:08] marostegui: a question occurs to me that i probably should have asked earlier: is there any chance that orchestrator would _replace_ zarcillo?
[10:38:31] kormat: it does have an inventory table yeah
[10:38:34] table(s)
[10:38:47] mmm
[10:40:13] if it cannot do everything, we could do extra things as plugins/extensions
[10:40:19] this calls into question the idea of working on making zarcillo authoritative
[10:40:48] kormat: I think we should for now, I don't see orchestrator being the source of truth in the short term
[10:41:11] it will take time until we understand it completely and be confident about sending patches and all that
[10:41:24] we should have it installed in production ASAP to see what things it can give us (even if not in full use)
[10:41:46] jynus: that is the plan
[10:41:51] marostegui: ack. but on the flip side it does mean i shouldn't do _heavy_ investment in zarcillo in the meantime
[10:41:57] yeah
[10:42:19] kormat: I think just the MVP that would make our life easier for now I would suggest
[10:42:29] on the other side, maybe some of the more custom tables would transfer as is (?)
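(A sketch of the manual post-switchover checks discussed above — the read_only state on both masters and pt-heartbeat activity. This only illustrates the checklist steps, not the switchover script itself, and it assumes the standard heartbeat.heartbeat table layout.)

```sql
-- Run on both the old master (db1109) and the new master (db1104): confirm
-- that the read_only flag on each host matches what the checklist expects.
SELECT @@global.read_only;

-- Confirm pt-heartbeat updates keep arriving (e.g. on the old master, as in
-- the log): the freshest row should be recent and come from the new master.
SELECT server_id, ts
  FROM heartbeat.heartbeat
 ORDER BY ts DESC
 LIMIT 1;
```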
[10:42:57] marostegui: *nod*
[10:44:19] also, while we don't have orchestrator, things like "give me a list of hosts from s1" will be needed anyway, just the query would change, I guess
[10:44:25] kormat: In Q2 we'll know better what we can get from orchestrator's backend I think - but I think working on zarcillo to improve our current workflows is a big need and it will take time for orchestrator to take over (it can)
[10:44:45] +1
[10:44:45] kormat: *if it can
[10:44:56] SGTM
[10:44:57] having utilities with a clean api should be reusable
[10:47:36] I think in general we should avoid thinking of zarcillo as the current db and more like "a single, centralized inventory", be it called zarcillo, or orchestrator or anything else
[10:48:18] and there is metadata we will also need anyway (like db object inventories) that orchestrator won't give us
[10:48:34] (and maybe some other piece of software will)
[10:58:27] DBA, Epic, Patch-For-Review: Improve regular production database backups handling - https://phabricator.wikimedia.org/T138562 (jcrespo)
[11:19:41] I am happy the time spent on gathering metadata pays off sometimes: T264081#6501390
[11:19:42] T264081: Increase in usage of /var/lib/mysql on an-coord1001 after Sept 21st - https://phabricator.wikimedia.org/T264081
[11:58:20] DBA: Clean up DB related pages on Wikitech - https://phabricator.wikimedia.org/T263420 (LSobanski)
[12:08:29] DBA, Data-Persistence, PM: Update the DBA task tracking workflow - https://phabricator.wikimedia.org/T263463 (LSobanski) High level diagram and definitions: https://docs.google.com/document/d/1LANEQuSQ3XIfaraaAcfcGaq7f03TRABjHuqhU-uDTKg/
[12:41:03] DBA, Data-Persistence-Admin: Create a "how to engage us" process and documentation for Data Persistence - https://phabricator.wikimedia.org/T263456 (LSobanski)
[12:42:38] sobanski: can this be satisfied with "Please Don't." in a large font?
[12:43:12] I had bigger plans
[12:43:24] "Pretty Please"
[12:46:14] :D
[14:28:56] Blocked-on-schema-change, DBA, Operations, User-Kormat: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 (Kormat)
[16:25:45] Blocked-on-schema-change, DBA: Schema change to drop three indexes from wb_changes - https://phabricator.wikimedia.org/T264109 (Ladsgroup)
[20:57:37] DBA, Operations, ops-codfw, Patch-For-Review, User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (CDanis) Resolved→Open crashed again
[21:01:38] DBA, Operations, ops-codfw, Patch-For-Review, User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (Volans) from HW logs ` -------------------------------------------------------------------------------- SeqNumber = 286 Message ID...
[22:55:56] DBA, Operations, ops-codfw, Patch-For-Review, User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (Papaul) ` 2020-09-29 20:47:59 287 SYS1003 System CPU Resetting. 2020-09-29 20:47:51 286 PWR2270 The Intel Management Engine has encounte...
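(Closing note on the zarcillo-versus-orchestrator discussion earlier in the day: whatever the backend ends up being, the inventory has to answer lookups like "give me a list of hosts from s1", and only the query changes. The sketch below is purely illustrative; the table and column names are assumptions, not the actual zarcillo schema.)

```sql
-- Hypothetical inventory lookup: all instances that belong to section s1.
-- `instances`, `name`, `server`, `port` and `section` are illustrative names,
-- not the real schema.
SELECT name, server, port
  FROM instances
 WHERE section = 's1'
 ORDER BY name;
```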