[00:03:12] 10DBA: iw_url in interwiki is varbinary(127) in production but blob in code - https://phabricator.wikimedia.org/T277118 (10Ladsgroup)
[01:20:02] 10Blocked-on-schema-change, 10DBA: Extend iwlinks.iwl_prefix to VARBINARY(32) - https://phabricator.wikimedia.org/T277123 (10Reedy)
[01:20:12] 10Blocked-on-schema-change, 10DBA: Extend iwlinks.iwl_prefix to VARBINARY(32) - https://phabricator.wikimedia.org/T277123 (10Reedy)
[06:35:37] 10DBA: Check for errors on all tables on some hosts - https://phabricator.wikimedia.org/T276742 (10Marostegui)
[06:35:53] 10DBA: Check for errors on all tables on some hosts - https://phabricator.wikimedia.org/T276742 (10Marostegui) 05Open→03Resolved All checked and cleaned
[06:37:07] 10DBA: Drop testreduce and testreduce_vd from m5 master - https://phabricator.wikimedia.org/T276787 (10Marostegui) 05Open→03Resolved Dropped `testreduce` ` root@db1128.eqiad.wmnet[(none)]> drop database if exists testreduce; Query OK, 5 rows affected (3.438 sec) root@db1128.eqiad.wmnet[(none)]> `
[06:37:09] 10DBA, 10Parsoid, 10Parsoid-Tests: testreduce_vd database in m5 still in use? - https://phabricator.wikimedia.org/T245408 (10Marostegui)
[06:37:14] 10DBA: Drop testreduce and testreduce_vd from m5 master - https://phabricator.wikimedia.org/T276787 (10Marostegui)
[06:55:17] 10DBA, 10Data-Services, 10cloud-services-team (Kanban): mariadb crashed on labsdb1009 - https://phabricator.wikimedia.org/T276980 (10Marostegui) The check on s1 finished. Only one table is reported as corrupted: `enwiki.iwlinks`. Going to fix that and then start replication on s1 and see if it crashes again.
[07:08:59] 10DBA, 10Data-Services, 10cloud-services-team (Kanban): mariadb crashed on labsdb1009 - https://phabricator.wikimedia.org/T276980 (10Marostegui) `enwiki.iwlinks` fixed, started replication on s1. We'll see how it goes
[07:20:51] 10Blocked-on-schema-change, 10DBA: Extend iwlinks.iwl_prefix to VARBINARY(32) - https://phabricator.wikimedia.org/T277123 (10Marostegui) p:05Triage→03Medium
[07:21:57] 10DBA: iw_url in interwiki is varbinary(127) in production but blob in code - https://phabricator.wikimedia.org/T277118 (10Marostegui) p:05Triage→03Medium @Ladsgroup is this a schema change ticket or you need some help from us? Not sure what's expected from us on this task :)
[07:26:54] 10DBA: fa_deleted_timestamp and fa_timestamp are binary(14) in code but varbinary(14) in production - https://phabricator.wikimedia.org/T277116 (10Marostegui) p:05Triage→03Medium @Ladsgroup does the `DEFAULT ' '` need to stay or it should be `DEFAULT NULL` as you mentioned? If that is the case, alter needed...
[07:29:27] 10Blocked-on-schema-change, 10DBA: Schema change to make rc_id unsigned and rc_timestamp BINARY - https://phabricator.wikimedia.org/T276150 (10Marostegui)
[07:29:29] 10DBA: Switchover s7 from db1086 to db1136 - https://phabricator.wikimedia.org/T274336 (10Marostegui)
[07:29:38] 10Blocked-on-schema-change, 10DBA: Drop default of rc_timestamp - https://phabricator.wikimedia.org/T276156 (10Marostegui)
[07:29:40] 10DBA: Switchover s7 from db1086 to db1136 - https://phabricator.wikimedia.org/T274336 (10Marostegui)
[08:01:48] 10DBA, 10Data-Services, 10cloud-services-team (Kanban): mariadb crashed on labsdb1009 - https://phabricator.wikimedia.org/T276980 (10Marostegui) Fixed s8 inconsistencies. s8 and s1 are now replicating. The rest of the sections are for now stopped. If s1 and s8 sync fine, I will start them and see what happens
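The per-table check and fix described for labsdb1009 above can be spot-reproduced by hand. The exact tooling used is not shown in the log, so the following is only a minimal sketch, assuming the affected table is InnoDB:

```sql
-- Manual integrity check of the table reported as corrupted. CHECK TABLE
-- scans the whole table, so it is run with replication stopped or on a
-- depooled host.
CHECK TABLE enwiki.iwlinks EXTENDED;

-- InnoDB has no REPAIR TABLE; a common way to rebuild a damaged-but-readable
-- table in place is a null ALTER, after which replication can be restarted.
ALTER TABLE enwiki.iwlinks ENGINE=InnoDB;
```

In practice the fix on a wiki replica may instead be re-importing the table from another replica; the sketch above only illustrates the on-host option.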
[08:11:06] 10Blocked-on-schema-change, 10DBA: Drop default of revactor_timestamp - https://phabricator.wikimedia.org/T267767 (10Kormat) s5 eqiad progress: [] db1082 api [] db1096:3315 [] db1100 master [] db1110 basic [] db1113:3315 dump,vslow [] db1124:3315 sanitarium [] db1130 api [] db1144:3315 [] db1145:3315 backup s...
[08:11:26] 10Blocked-on-schema-change, 10DBA: Drop default of revactor_timestamp - https://phabricator.wikimedia.org/T267767 (10Kormat)
[08:23:11] 10DBA, 10Patch-For-Review: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 (10Marostegui)
[08:23:18] 10DBA, 10Patch-For-Review: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 (10Marostegui) db2148 placed in s2
[08:27:48] 10Blocked-on-schema-change, 10DBA: Drop default of revactor_timestamp - https://phabricator.wikimedia.org/T267767 (10Kormat) s2 eqiad progress: [] db1074 sanitarium master [] db1076 [] db1105:3312 [] db1122 master [] db1125:3312 sanitarium [] db1129 [] db1146:3312 [] db1155:3312 sanitarium [] db1170:3312 [] d...
[08:28:10] 10Blocked-on-schema-change, 10DBA: Drop default of revactor_timestamp - https://phabricator.wikimedia.org/T267767 (10Kormat)
[08:38:53] 10DBA: Check for errors on all tables on some hosts - https://phabricator.wikimedia.org/T276742 (10Marostegui)
[08:42:53] 10DBA, 10Data-Services, 10cloud-services-team (Kanban): mariadb crashed on labsdb1009 - https://phabricator.wikimedia.org/T276980 (10Marostegui) s1 sync'ed with the master correctly. s8 is still doing so, but so far so good.
[09:10:24] 10DBA, 10Data-Services, 10cloud-services-team (Kanban): mariadb crashed on labsdb1009 - https://phabricator.wikimedia.org/T276980 (10Marostegui) s8 has synced correctly. At this point all the sections sync'ed correctly independently. I am now going to start all of them one by one, and we'll see what happens.
[09:20:38] 10DBA, 10Data-Services, 10cloud-services-team (Kanban): mariadb crashed on labsdb1009 - https://phabricator.wikimedia.org/T276980 (10Marostegui) All threads started and now catching up...fingers crossed!
[09:25:33] 10Blocked-on-schema-change, 10DBA: Schema change to make rc_id unsigned and rc_timestamp BINARY - https://phabricator.wikimedia.org/T276150 (10Marostegui) s7 is done, only pending the master. Will be done after the scheduled switchover. T274336
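For context, the per-host checklists above track schema changes that correspond to ALTERs of roughly the following shape; the target column types are assumptions read off the task titles, not copies of the actual change scripts:

```sql
-- Illustrative only: types inferred from the task titles.
-- T276150 (rc_id unsigned, rc_timestamp BINARY) and T276156 (drop the
-- rc_timestamp default) expressed together on recentchanges:
ALTER TABLE recentchanges
  MODIFY rc_id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  MODIFY rc_timestamp BINARY(14) NOT NULL;

-- T267767: drop the default of revactor_timestamp.
ALTER TABLE revision_actor_temp
  ALTER COLUMN revactor_timestamp DROP DEFAULT;
```

These run host by host while each replica is depooled, which is exactly what the per-host progress lists in the tickets above are tracking.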
[09:25:35] 10DBA: Switchover s7 from db1086 to db1136 - https://phabricator.wikimedia.org/T274336 (10Marostegui)
[09:25:37] 10Blocked-on-schema-change, 10DBA: Drop default of rc_timestamp - https://phabricator.wikimedia.org/T276156 (10Marostegui)
[09:26:33] 10Blocked-on-schema-change, 10DBA: Drop default of rc_timestamp - https://phabricator.wikimedia.org/T276156 (10Marostegui)
[09:27:33] 10Blocked-on-schema-change, 10DBA: Drop default of rc_timestamp - https://phabricator.wikimedia.org/T276156 (10Marostegui)
[09:27:38] 10Blocked-on-schema-change, 10DBA: Schema change to make rc_id unsigned and rc_timestamp BINARY - https://phabricator.wikimedia.org/T276150 (10Marostegui)
[09:29:18] 10Blocked-on-schema-change, 10DBA: Schema change to make rc_id unsigned and rc_timestamp BINARY - https://phabricator.wikimedia.org/T276150 (10Marostegui)
[09:29:23] 10Blocked-on-schema-change, 10DBA: Drop default of rc_timestamp - https://phabricator.wikimedia.org/T276156 (10Marostegui)
[09:44:52] 10DBA: Check for errors on all tables on some hosts - https://phabricator.wikimedia.org/T276742 (10Marostegui) 05Resolved→03Open Checking two more hosts
[09:55:02] 10Blocked-on-schema-change, 10DBA: Drop default of rc_timestamp - https://phabricator.wikimedia.org/T276156 (10Marostegui)
[09:55:06] 10Blocked-on-schema-change, 10DBA: Schema change to make rc_id unsigned and rc_timestamp BINARY - https://phabricator.wikimedia.org/T276150 (10Marostegui)
[10:03:01] 10DBA, 10Data-Services, 10cloud-services-team (Kanban): mariadb crashed on labsdb1009 - https://phabricator.wikimedia.org/T276980 (10Marostegui) All the replication threads synced correctly. I am going to leave the host replicating until Monday and if there were no crashes, I will repool it.
[10:05:40] 10Blocked-on-schema-change, 10DBA: Schema change to make rc_id unsigned and rc_timestamp BINARY - https://phabricator.wikimedia.org/T276150 (10Marostegui) s5 progress [] labsdb1011 [] labsdb1010 [] labsdb1009 [x] dbstore1003 [] db1154 [] db1150 [] db1145 [] db1144 [] db1130 [] db1124 [] db1113 [] db1110 [] db...
[10:05:42] 10Blocked-on-schema-change, 10DBA: Drop default of rc_timestamp - https://phabricator.wikimedia.org/T276156 (10Marostegui) s5 progress [] labsdb1011 [] labsdb1010 [] labsdb1009 [x] dbstore1003 [] db1154 [] db1150 [] db1145 [] db1144 [] db1130 [] db1124 [] db1113 [] db1110 [] db1100 [] db1096 [] db1082 [] clou...
[10:30:02] 10DBA, 10Patch-For-Review: mariadb: Replication lag monitoring does not support circular replication - https://phabricator.wikimedia.org/T275497 (10Kormat) 05Open→03Resolved This should all be in place now.
[10:43:16] kormat, is the puppet right or the state of the server?
[10:43:44] I see read_only: True, shouldn't it be false?
[10:43:53] jynus: the issue is that puppet now expects non-master instances to be read-only. and i had forgotten to flip the bit on pc1010
[10:44:07] but shouldn't it be read-write?
[10:44:21] why? it's not written to
[10:44:31] otherwise I am 100% sure we will forget to change it when pooled
[10:44:35] AND
[10:44:45] being a cache, we don't care if we accidentally write to it
[10:45:05] i'm sure we will, to begin with. that's why we have monitoring for it now
[10:45:22] but I don't see how that will change?
[10:45:32] e.g.
if pc1009 fails, I expect to do dbctl changes, not puppet
[10:45:37] in my recent puppet changes, i've made the management of the various types of clusters a lot more homogenous
[10:45:39] to pool pc1010
[10:46:34] jynus: we don't control the running state of read_only via puppet
[10:46:51] sure, but don't we control the alerting state via puppet?
[10:47:08] jynus: when we change master, we change puppet
[10:47:33] (and _also_ dbctl)
[10:47:41] sure, if you remember to do all changes in an emergency...
[10:48:16] that's the procedure we have in place for all the rest of the sections
[10:48:22] this just makes pc act the same
[10:48:54] but pc1010 is not a "real replica"
[10:49:03] it is a "hot spare"
[10:49:08] that happens to be replicating
[10:49:44] e.g. it may be replicating from pc1009, or whatever, but may substitute a host that is not replicating from
[10:52:12] I see this biting us in an emergency and warning you, doesn't affect me :-)
[11:04:36] oh, heh. i forgot this bit: parsercache isn't in dbctl
[11:04:47] if you need to swap masters, it's still a CR against mw-config
[11:05:45] the main concern is we have a master-switchover.py, but not a pc-pool to remind those manual changes
[11:05:50] in any case, even before my current changes, you'd still need to update puppet. otherwise monitoring/heartbeat/etc will be broken
[11:07:03] if we had something like "pc-pool pc1010 pc1" we would eliminate manual errors
[11:07:24] which is a different process than a metadata db switchover
[11:08:08] that would require being able to make automated changes to gerrit
[11:08:19] i don't see that happening
[11:08:47] last question, what is the advantages of having pc1010 in read only?
[11:09:10] jynus: making parsercache less of a snowflake means less mental load
[11:09:24] you can see the changes i made here: https://gerrit.wikimedia.org/r/c/operations/puppet/+/669845
[11:09:28] pc is already a snowflake
[11:09:47] that's my point. it's a snowflake. making it _less of a snowflake_ is a good thing.
[11:09:47] as you just said 2 lines earlier :-)
[11:13:27] If I may interject (and I just did), having consistent configuration (to the extent where that is possible) and consistent process (to the extent where that is possible) is a good thing.
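The read_only flag being debated here is runtime state, not something puppet enforces; the puppet side only declares the expected value that monitoring checks against. A minimal sketch of the manual side, assuming direct access to the pc1010 instance:

```sql
-- What the monitoring check compares against the puppet-declared expectation:
SELECT @@GLOBAL.read_only;

-- Flipping the bit by hand, e.g. when a spare is promoted or demoted.
-- This only changes the runtime state; the expected value still has to be
-- updated in puppet (and, for parsercache, in mediawiki-config) separately.
SET GLOBAL read_only = 1;  -- replica / hot spare
SET GLOBAL read_only = 0;  -- host actively taking parsercache writes
```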
[11:14:32] With that said, I cannot find a mention of the puppet change in the documentation so I'd say: (1) we add it now and (2) this goes high on my "review documentation" list
[11:19:49] sobanski: if you're looking at https://wikitech.wikimedia.org/wiki/MariaDB#Production_section_failover_checklist, you'll see this step:
[11:19:50] > Merge gerrit puppet change to promote NEW to master (Example): https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/538747/
[11:20:52] that shows a typical master change for a normal section
[12:22:15] 10Data-Persistence-Backup, 10SRE-tools: Make recover-dump show the time taken - https://phabricator.wikimedia.org/T277160 (10Marostegui)
[12:22:20] 10Data-Persistence-Backup, 10SRE-tools: Make recover-dump show the time taken - https://phabricator.wikimedia.org/T277160 (10Marostegui) p:05Triage→03Low
[12:41:54] 10DBA, 10Data-Persistence-Backup: recover-mariadb should use logging (logger) to indicate actions taken - https://phabricator.wikimedia.org/T277162 (10jcrespo)
[12:42:27] 10Blocked-on-schema-change, 10DBA: Drop default of revactor_timestamp - https://phabricator.wikimedia.org/T267767 (10Kormat) s7 eqiad progress: [] db1079 sanitarium master [] db1086 master [] db1098:3317 [] db1101:3317 [] db1116:3317 backup source [] db1125 sanitarium [] db1127 [] db1136 [] db1155 sanitarium [...
[12:42:29] 10DBA, 10Data-Persistence-Backup: recover-mariadb should use logging (logger) to indicate actions taken - https://phabricator.wikimedia.org/T277162 (10jcrespo) p:05Triage→03Low
[12:42:46] 10Blocked-on-schema-change, 10DBA: Drop default of revactor_timestamp - https://phabricator.wikimedia.org/T267767 (10Kormat)
[12:51:02] 10Blocked-on-schema-change, 10DBA: Drop default of revactor_timestamp - https://phabricator.wikimedia.org/T267767 (10Kormat) s4 eqiad progress: [] db1121 sanitarium master [] db1125 sanitarium [] db1138 master [] db1141 [] db1142 [] db1143 [] db1144:3314 [] db1145:3314 backup source [] db1146:3314 [] db1147 [...
[12:51:19] 10Blocked-on-schema-change, 10DBA: Drop default of revactor_timestamp - https://phabricator.wikimedia.org/T267767 (10Kormat)
[13:11:36] 10DBA: iw_url in interwiki is varbinary(127) in production but blob in code - https://phabricator.wikimedia.org/T277118 (10Ladsgroup) Sorry, I should have been more clear. It's a drift report ticket. The field should become blob in production.
[13:12:14] 10DBA: iw_url in interwiki is varbinary(127) in production but blob in code - https://phabricator.wikimedia.org/T277118 (10Marostegui) Thanks! Does it happen everywhere? Asking cause in some other tickets you report specific sections
[13:13:08] 10DBA: fa_deleted_timestamp and fa_timestamp are binary(14) in code but varbinary(14) in production - https://phabricator.wikimedia.org/T277116 (10Ladsgroup) It should be DEFAULT NULL. Empty string is not a valid timestamp (mediawiki core accepts it but not Postgres).
[13:14:58] 10DBA: iw_url in interwiki is varbinary(127) in production but blob in code - https://phabricator.wikimedia.org/T277118 (10Ladsgroup) >>! In T277118#6904529, @Marostegui wrote: > Thanks! > Does it happen everywhere? > Asking cause in some other tickets you report specific sections Good question. At first I tho...
[13:27:56] marostegui: btw, I was up to "i", the rest of alphabet is left
[13:41:32] holy...
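The drift tickets discussed above (T277123, T277118, T277116) eventually translate into ALTERs along these lines once agreed; a rough sketch, with nullability and defaults taken from the discussion rather than from the final change scripts:

```sql
-- T277123: extend iwlinks.iwl_prefix to VARBINARY(32).
ALTER TABLE iwlinks
  MODIFY iwl_prefix VARBINARY(32) NOT NULL DEFAULT '';

-- T277118: iw_url becomes a blob in production to match the code.
ALTER TABLE interwiki
  MODIFY iw_url BLOB NOT NULL;

-- T277116: the filearchive timestamps become BINARY(14), with DEFAULT NULL
-- per the comment above (empty string is not a valid timestamp).
ALTER TABLE filearchive
  MODIFY fa_deleted_timestamp BINARY(14) DEFAULT NULL,
  MODIFY fa_timestamp BINARY(14) DEFAULT NULL;
```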
[13:46:23] 10Blocked-on-schema-change, 10DBA: Drop default of revactor_timestamp - https://phabricator.wikimedia.org/T267767 (10Kormat) s8 eqiad progress: [] db1087 sanitarium master [] db1099:3318 [] db1101:3318 [] db1104 master [] db1109 [] db1111 [] db1114 [] db1116:3318 backup source [] db1124 sanitarium [] db1126 [...
[13:46:41] 10Blocked-on-schema-change, 10DBA: Drop default of revactor_timestamp - https://phabricator.wikimedia.org/T267767 (10Kormat)
[13:49:36] and nothing from wikitech is reported, I clean that mess up personally
[13:50:28] jynus: thinking about this some more, i'm not 100% happy with the current situation re: read-only on PC. we have tools to handle read-only/read-write on the other sections. we'd have to handle this manually for pc hosts.
[13:50:42] (making a puppet change is orthogonal, and already required regardless of this specific decision)
[13:50:50] i think i'll back out that part of the change
[13:53:26] 10DBA, 10Epic, 10Patch-For-Review: Upgrade WMF database-and-backup-related hosts to buster - https://phabricator.wikimedia.org/T250666 (10Majavah)
[13:53:47] beta cluster is now fully on mariadb 10.4!
[13:55:53] Majavah: good job!
[13:56:45] \o/
[13:57:05] PROBLEM - MariaDB sustained replica lag on db1126 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1126&var-port=9104
[13:58:17] RECOVERY - MariaDB sustained replica lag on db1126 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1126&var-port=9104
[14:01:27] 10DBA: Make alerts more specific - https://phabricator.wikimedia.org/T277174 (10Kormat)
[14:01:48] 10DBA: Make DB alerts more specific - https://phabricator.wikimedia.org/T277174 (10LSobanski)
[14:02:33] sobanski applying task concept to task title
[14:03:21] 10DBA, 10Patch-For-Review: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 (10Marostegui) db2149 is now placed in s3.
[14:03:39] 10DBA, 10Patch-For-Review: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 (10Marostegui)
[14:04:57] 10DBA: Make DB alerts more specific - https://phabricator.wikimedia.org/T277174 (10LSobanski) Sample IRC alert: ` 11:37:32 <+icinga-wm> PROBLEM - MariaDB read only pc1 #page on pc1010 is CRITICAL: CRIT: read_only: False, expected True: OK: Version 10.4.18-MariaDB-log, Uptime 873607s, event_scheduler: True, 1098....
[14:05:59] heh, awkward
[14:06:00] 10DBA: Make DB alerts more specific - https://phabricator.wikimedia.org/T277174 (10LSobanski) p:05Triage→03Medium
[14:06:19] sobanski: #.p.a.g.e is commonly configured by people to alert them on irc :)
[14:06:32] yeah, it notified me :)
[14:06:43] Whoops, my bad.
[14:06:46] (i'm not sure what else you could have done, fwiw)
[14:07:00] just.. don't edit that comment :)
[14:07:25] Fortunately it wasn't tagged #sre so the blast radius was smaller
[14:07:47] 10Blocked-on-schema-change, 10DBA: Schema change to make rc_id unsigned and rc_timestamp BINARY - https://phabricator.wikimedia.org/T276150 (10Marostegui)
[14:08:08] * sobanski apologizes all active responders
[14:08:12] 10Blocked-on-schema-change, 10DBA: Drop default of rc_timestamp - https://phabricator.wikimedia.org/T276156 (10Marostegui)
[14:08:47] marostegui, jynus: can i get a review of https://gerrit.wikimedia.org/r/c/operations/puppet/+/670822 please?
it's a one-line change
[14:09:44] (and i've already downtimed the alert on pc1010 for 2h :)
[14:12:23] +1ed
[14:15:04] ty :)
[14:25:01] oh, huh. looks like i broke puppet on the sanitarium hosts earlier
[14:25:18] this day: [ ] success [x] other
[14:25:51] But you didn't break s7 with the schema change, so that's good
[14:26:00] 😅
[14:27:41] uff, yeah. looks like my pcc hosts for the relevant change didn't include any sanitarium instances. sigh.
[15:22:56] 10DBA: Migrate codfw sanitarium hosts (db2094/db2095) to Buster and 10.4 - https://phabricator.wikimedia.org/T275112 (10Marostegui) This can happen after 15th April
[15:23:40] 10DBA, 10cloud-services-team (Kanban): Move wikireplicas under the new sanitarium hosts (db1154, db1155) - https://phabricator.wikimedia.org/T272008 (10Marostegui) This can happen after 15th April
[15:24:49] 10DBA, 10cloud-services-team (Kanban): Move wikireplicas under the new sanitarium hosts (db1154, db1155) - https://phabricator.wikimedia.org/T272008 (10Marostegui)
[15:37:35] 10DBA, 10DC-Ops, 10SRE, 10ops-eqiad: Upgrade firmware on db1136 - https://phabricator.wikimedia.org/T277007 (10Cmjohnson) @Marostegui can we schedule this for Monday next week? 1500/1600UTC timeframe please? Thanks
[15:38:55] 10DBA, 10DC-Ops, 10SRE, 10ops-eqiad: Upgrade firmware on db1136 - https://phabricator.wikimedia.org/T277007 (10Marostegui) Sounds good @Cmjohnson - I will leave the host off beforehand so you can proceed as you wish. Once you are done, just power it back on and I will take it from there. Thank you
[15:48:33] 10DBA, 10SRE, 10ops-eqiad: db1162 crashed - https://phabricator.wikimedia.org/T275309 (10Cmjohnson) Dell will be here tomorrow morning to replace the backplane.
[18:11:31] 10Data-Persistence-Backup, 10Analytics-Clusters: Evaluate possible solutions to backup Analytics Hadoop's HDFS data - https://phabricator.wikimedia.org/T277015 (10Ottomata)
[18:57:15] 10DBA: Check for errors on all tables on some hosts - https://phabricator.wikimedia.org/T276742 (10Marostegui) Fixing db2108 and db2148
[21:56:53] PROBLEM - MariaDB sustained replica lag on db1160 is CRITICAL: 18.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1160&var-port=9104
[22:00:45] RECOVERY - MariaDB sustained replica lag on db1160 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1160&var-port=9104
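The sustained replica lag alerts above fire when lag stays at or over the 2 s critical threshold (warning at 1 s). A quick manual spot-check looks roughly like this; the heartbeat table layout is an assumption based on pt-heartbeat defaults, not something shown in the log:

```sql
-- Manual spot-check of replication lag on a replica; the alert itself is
-- driven by exported metrics, this is only for eyeballing a host by hand.
SHOW SLAVE STATUS\G  -- Seconds_Behind_Master, among other fields

-- If a pt-heartbeat style table exists (assumed: heartbeat.heartbeat with a
-- ts column written in UTC by the master), lag can be derived directly:
SELECT TIMESTAMPDIFF(SECOND, MAX(ts), UTC_TIMESTAMP()) AS lag_seconds
FROM heartbeat.heartbeat;
```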