[00:04:59] PROBLEM - MariaDB sustained replica lag on es1022 is CRITICAL: 3 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=es1022&var-port=9104
[00:05:59] RECOVERY - MariaDB sustained replica lag on es1022 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=es1022&var-port=9104
[05:06:37] DBA, decommission-hardware, Patch-For-Review: decommission es2013.codfw.wmnet - https://phabricator.wikimedia.org/T263740 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin1001 for hosts: `es2013.codfw.wmnet` - es2013.codfw.wmnet (**PASS**) - Downtimed host on Icinga...
[05:09:03] DBA, decommission-hardware, Patch-For-Review: decommission es2013.codfw.wmnet - https://phabricator.wikimedia.org/T263740 (Marostegui)
[05:30:00] DBA, Operations, ops-codfw, Patch-For-Review: Degraded RAID on es2026 - https://phabricator.wikimedia.org/T263837 (Marostegui) a: Marostegui→Papaul @Papaul I think we need to ask for another disk or advise from Dell. These are the controller logs after the reboot: ` Time: Tue Sep 29 05:17:...
[05:31:02] DBA, Operations, ops-codfw, Patch-For-Review: Degraded RAID on es2026 - https://phabricator.wikimedia.org/T263837 (Marostegui)
[05:50:41] DBA, CheckUser, Stewards-and-global-tools, I18n, and 2 others: Incomplete i18n for log entries in CheckUser - https://phabricator.wikimedia.org/T41013 (Marostegui) I don't really have context on this, this is probably a question for someone who knows MW in depth.
[05:52:59] DBA, Operations, Performance-Team, Platform Engineering, User-Kormat: Remove groups from db configs - https://phabricator.wikimedia.org/T263127 (Marostegui) Not sure what `category` is as it doesn't appear on https://noc.wikimedia.org/dbconfig/eqiad.json From there we have these active group...
[06:04:36] DBA, decommission-hardware: decommission es2019.codfw.wmnet - https://phabricator.wikimedia.org/T264063 (Marostegui)
[06:07:50] DBA, decommission-hardware, Patch-For-Review: decommission es2019.codfw.wmnet - https://phabricator.wikimedia.org/T264063 (Marostegui)
[06:07:52] DBA, Patch-For-Review: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717 (Marostegui)
[06:07:54] DBA, decommission-hardware, Patch-For-Review: decommission es2019.codfw.wmnet - https://phabricator.wikimedia.org/T264063 (Marostegui)
[06:57:10] I found the mysql.py -h :s4 quite useful if you don't want to think about ports
[06:57:39] e.g. in an emergency where you know the host but don't want to remember the port
[07:45:52] DBA, decommission-hardware, Patch-For-Review: decommission es2019.codfw.wmnet - https://phabricator.wikimedia.org/T264063 (Marostegui)
[07:52:17] elukey: we got the following warning: "Last dump for analytics_meta at eqiad (db1108.eqiad.wmnet:3352) taken on 2020-09-29 02:07:47 is 2 GB, but previous one was 1 GB, a change of 64.9%"
[07:52:58] it is probably nothing, but worth checking
[08:43:40] DBA, Goal, Patch-For-Review: Expand database provisioning/backup service to accomodate for growing capacity and high availability needs - https://phabricator.wikimedia.org/T257551 (jcrespo) db1150 is now fully setup, set as active on netbox, added to tendril and zarcillo/prometheus and notifications...
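(A minimal sketch of how the replica lag behind alerts like the es1022 ones above can be cross-checked by hand, assuming direct SQL access to the replica. The production check itself is driven by the Prometheus data on the linked Grafana dashboard, so this is only a manual sanity check; the second query assumes the standard pt-heartbeat heartbeat.heartbeat table layout.)

```sql
-- Quick manual cross-check of replication lag on a replica such as es1022.
-- Seconds_Behind_Master in this output is the server's own lag estimate:
SHOW SLAVE STATUS\G

-- Lag derived from the pt-heartbeat rows written on the master; assumes the
-- standard heartbeat.heartbeat layout with a `ts` timestamp column and
-- servers running in UTC:
SELECT TIMESTAMPDIFF(SECOND, MAX(ts), UTC_TIMESTAMP()) AS approx_lag_seconds
  FROM heartbeat.heartbeat;
```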
[08:44:43] DBA, Epic, Patch-For-Review: Improve regular production database backups handling - https://phabricator.wikimedia.org/T138562 (jcrespo)
[08:44:45] DBA, Goal, Patch-For-Review: Expand database provisioning/backup service to accomodate for growing capacity and high availability needs - https://phabricator.wikimedia.org/T257551 (jcrespo) Open→Resolved After T257551#6493655 and T257551#6493809, and all pending hardware setup, I consider thi...
[09:20:24] marostegui: for T239238, is the thread_pool_stall_limit significant in https://phabricator.wikimedia.org/P12828 ?
[09:21:09] T239238: Switchover s8 primary database master db1109 -> db1104 - 2020-09-29 08:00 UTC - https://phabricator.wikimedia.org/T239238
[09:22:04] kormat: I believe we keep it 10 for masters, yeah
[09:22:08] and 100 for slaves
[09:23:11] ah i see, so running puppet will change it, at least on disk
[09:23:32] yeah, you'll need to change it live on mysql itself with set global thread_pool_stall_limit = 10;
[09:23:39] (I believe it is dynamic, I haven't checked)
[09:23:55] https://mariadb.com/kb/en/thread-pool-system-status-variables/#thread_pool_stall_limit says dynamic: yes
[09:23:59] \o/
[09:26:43] added as a clean-up step
[09:26:52] \o/
[09:34:56] jynus: o/ - so yesterday I had to purge some binlogs since the partition dedicated to mariadb on an-coord1001 was saturating.. I used the purge command two times, the first with sql_log_bin, the latter probably not now that I think about it. Is this the cause of the warning? Let me know if I made some stupid mistake
[09:38:13] no, binlogs are not backed up
[09:38:27] this means that the database doubled since last week
[09:40:17] if that is normal and expected, I will just ack the alert
[09:40:54] this is the link to the alert: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=alert1001&service=dump+of+analytics_meta+in+eqiad
[09:41:04] ahhh sorry I've read it in the wrong way
[09:42:14] I added a little database during the past days (hue_next), but it is tiny, and I have seen the binlog files growing a lot. I am trying to figure out what db is the root cause
[09:42:36] "I have seen the binlog files growing a lot" that means it has a lot of writes
[09:42:57] e.g. probably updates or insert + deletes
[09:43:07] if the db itself didn't grow that much
[09:43:22] is there a way to see from the dumps if there is a db that increased more than others? (if it is quick of course, don't want to force you to spend time on this)
[09:43:33] elukey: actually yes
[09:43:50] not on the alert, but we gather detailed metadata of every db and table
[09:44:00] let me show you on pm
[09:44:03] wow
[09:44:32] we don't have yet a good dashboard of it because most of those cannot be public
[09:44:46] but the data is very interesting for historical analysis
[09:50:05] marostegui, sobanski: ok, going to start the prep for T239238 now
[09:50:05] T239238: Switchover s8 primary database master db1109 -> db1104 - 2020-09-29 08:00 UTC - https://phabricator.wikimedia.org/T239238
[09:50:20] kormat: +1
[10:09:22] how is it going?
[10:10:43] smoothly
[10:10:46] in clean-up phase now
[10:10:55] cool
[10:14:07] marostegui: for the final step "Check tendril and zarcillo were updated correctly" - tendril at least looks to be displaying the right topology
[10:14:32] ahh - on zarcillo i need to update the `masters` table. anything else?
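(An aside on the thread_pool_stall_limit exchange above, before the switchover conversation continues below: the clean-up step comes down to one dynamic variable. A minimal sketch, assuming direct SQL access to the new master; the 10-for-masters / 100-for-replicas values are the ones stated in the exchange, and puppet manages the same setting in the on-disk config.)

```sql
-- Current value (puppet keeps the on-disk config in sync separately):
SHOW GLOBAL VARIABLES LIKE 'thread_pool_stall_limit';

-- Dynamic per the MariaDB docs linked above, so it can be changed live
-- without a restart; 10 on masters, 100 on replicas:
SET GLOBAL thread_pool_stall_limit = 10;
```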
[10:14:46] kormat: I believe the script does it, it is just to double check it was done correctly
[10:15:00] | s8 | eqiad | db1104 |
[10:15:02] so it does, and did
[10:15:06] \o/
[10:15:08] yes, that part was coded in, but sometimes there are existing inconsistencies
[10:15:27] the only thing I would have wanted to include that wasn't was the dbctl commands
[10:15:29] alrighty, all done then
[10:15:36] marostegui: am i good to resolve this?
[10:16:02] kormat: Just double check the new master is on RO and the old one is also RO and I think you are good to go!
[10:16:24] the new master also has pt-heartbeat running fine, right?
[10:16:39] RO confirmed x2
[10:17:16] oh, finally the read only options were useful!
[10:17:54] the other thing that has to be done manually is the events
[10:17:58] marostegui: heartbeat is running, and i can see updates in the heartbeat table on the old master
[10:18:21] jynus: yes, that is part of the checklist
[10:18:25] cool
[10:18:45] until we get rid of them, we could package them as part of the script and do it automatically
[10:18:47] and betterworkds updated :)
[10:18:49] kormat: sweet! then you are good to go
[10:19:01] so many things I would like to do and so little time! :-D
[10:31:57] DBA, User-Kormat: Switchover s8 primary database master db1109 -> db1104 - 2020-09-29 08:00 UTC - https://phabricator.wikimedia.org/T239238 (Kormat) Open→Resolved Success :)
[10:32:02] DBA: Failover DB masters in row D - https://phabricator.wikimedia.org/T186188 (Kormat)
[10:32:04] Blocked-on-schema-change, DBA, Wikidata: Schema change on production for increase the size of wbt_text_in_lang.wbxl_language - https://phabricator.wikimedia.org/T237120 (Kormat)
[10:32:10] DBA, User-Kormat: Switchover s8 primary database master db1109 -> db1104 - 2020-09-29 08:00 UTC - https://phabricator.wikimedia.org/T239238 (Marostegui) <3
[10:33:31] DBA: Failover DB masters in row D - https://phabricator.wikimedia.org/T186188 (Kormat)
[10:38:08] marostegui: a question occurs to me that i probably should have asked earlier: is there any chance that orchestrator would _replace_ zarcillo?
[10:38:31] kormat: it does have an inventory table yeah
[10:38:34] table(s)
[10:38:47] mmm
[10:40:13] if it cannot do everything, we could do extra things as plugins/extensions
[10:40:19] this calls into question the idea of working on making zarcillo authoritative
[10:40:48] kormat: I think we should for now, I don't see orchestrator being the source of truth in the short term
[10:41:11] it will take time until we understand it completely and be confident about sending patches and all that
[10:41:24] we should have it installed in production ASAP to see what things it can give us (even if not in full use)
[10:41:46] jynus: that is the plan
[10:41:51] marostegui: ack. but on the flip side it does mean i shouldn't do _heavy_ investment in zarcillo in the meantime
[10:41:57] yeah
[10:42:19] kormat: I think just the MVP that would make our life easier for now I would suggest
[10:42:29] on the other side, maybe some of the more custom tables would transfer as is (?)
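(A sketch of the manual post-switchover checks discussed above — the read_only state on both masters and pt-heartbeat activity. This only illustrates the checklist steps, not the switchover script itself, and it assumes the standard heartbeat.heartbeat table layout.)

```sql
-- Run on both the old master (db1109) and the new master (db1104): confirm
-- that the read_only flag on each host matches what the checklist expects.
SELECT @@global.read_only;

-- Confirm pt-heartbeat updates keep arriving (e.g. on the old master, as in
-- the log): the freshest row should be recent and come from the new master.
SELECT server_id, ts
  FROM heartbeat.heartbeat
 ORDER BY ts DESC
 LIMIT 1;
```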
[10:42:57] marostegui: *nod*
[10:44:19] also, while we don't have orchestrator, things like "give me a list of hosts from s1" will be needed anyway, just the query would change, I guess
[10:44:25] kormat: In Q2 we'll know better what we can get from orchestrator's backend I think - but I think working on zarcillo to improve our current workflows is a big need and it will take time for orchestrator to take over (it can)
[10:44:45] +1
[10:44:45] kormat: *if it can
[10:44:56] SGTM
[10:44:57] having utilities with a clean api should be reusable
[10:47:36] I think in general we should avoid thinking of zarcillo as the current db and more like "a single, centralized inventory", be it called zarcillo, or orchestrator or anything else
[10:48:18] and there is metadata we will also need anyway (like db object inventories) that orchestrator won't give us
[10:48:34] (and maybe some other piece of software will)
[10:58:27] DBA, Epic, Patch-For-Review: Improve regular production database backups handling - https://phabricator.wikimedia.org/T138562 (jcrespo)
[11:19:41] I am happy the time spent on gathering metadata pays off sometimes: T264081#6501390
[11:19:42] T264081: Increase in usage of /var/lib/mysql on an-coord1001 after Sept 21st - https://phabricator.wikimedia.org/T264081
[11:58:20] DBA: Clean up DB related pages on Wikitech - https://phabricator.wikimedia.org/T263420 (LSobanski)
[12:08:29] DBA, Data-Persistence, PM: Update the DBA task tracking workflow - https://phabricator.wikimedia.org/T263463 (LSobanski) High level diagram and definitions: https://docs.google.com/document/d/1LANEQuSQ3XIfaraaAcfcGaq7f03TRABjHuqhU-uDTKg/
[12:41:03] DBA, Data-Persistence-Admin: Create a "how to engage us" process and documentation for Data Persistence - https://phabricator.wikimedia.org/T263456 (LSobanski)
[12:42:38] sobanski: can this be satisfied with "Please Don't." in a large font?
[12:43:12] I had bigger plans
[12:43:24] "Pretty Please"
[12:46:14] :D
[14:28:56] Blocked-on-schema-change, DBA, Operations, User-Kormat: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 (Kormat)
[16:25:45] Blocked-on-schema-change, DBA: Schema change to drop three indexes from wb_changes - https://phabricator.wikimedia.org/T264109 (Ladsgroup)
[20:57:37] DBA, Operations, ops-codfw, Patch-For-Review, User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (CDanis) Resolved→Open crashed again
[21:01:38] DBA, Operations, ops-codfw, Patch-For-Review, User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (Volans) from HW logs ` -------------------------------------------------------------------------------- SeqNumber = 286 Message ID...
[22:55:56] DBA, Operations, ops-codfw, Patch-For-Review, User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (Papaul) ` 2020-09-29 20:47:59 287 SYS1003 System CPU Resetting. 2020-09-29 20:47:51 286 PWR2270 The Intel Management Engine has encounte...
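(Closing note on the zarcillo-versus-orchestrator discussion earlier in the day: whatever the backend ends up being, the inventory has to answer lookups like "give me a list of hosts from s1", and only the query changes. The sketch below is purely illustrative; the table and column names are assumptions, not the actual zarcillo schema.)

```sql
-- Hypothetical inventory lookup: all instances that belong to section s1.
-- `instances`, `name`, `server`, `port` and `section` are illustrative names,
-- not the real schema.
SELECT name, server, port
  FROM instances
 WHERE section = 's1'
 ORDER BY name;
```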