[05:01:40] Pchelolo: I can see it on metawiki indeed, is that the only place it should be?
[05:41:43] Blocked-on-schema-change, DBA, Wikidata: Schema change on production for increase the size of wbt_text_in_lang.wbxl_language - https://phabricator.wikimedia.org/T237120 (Marostegui) Stalled→Open
[06:19:03] I am going to start disconnecting eqiad -> codfw replication on s1-s8
[06:19:24] Will issue `stop slave; reset slave all;` on the codfw masters; will grab and save the output of show slave status just in case
[06:19:42] ok
[06:27:46] All done: https://phabricator.wikimedia.org/P12440
[06:30:14] x1 you keep?
[06:30:22] yep
[06:31:10] DBA: Enable replication eqiad -> codfw and other checks - https://phabricator.wikimedia.org/T261914 (Marostegui)
[06:31:23] DBA: Enable replication eqiad -> codfw and other checks - https://phabricator.wikimedia.org/T261914 (Marostegui) p: Triage→Medium
[06:32:36] DBA: Enable replication eqiad -> codfw and other checks - https://phabricator.wikimedia.org/T261914 (Marostegui)
[06:34:19] DBA: Enable replication eqiad -> codfw and other checks - https://phabricator.wikimedia.org/T261914 (Marostegui) Replication eqiad -> codfw was disconnected; for the record these are the coordinates from the codfw masters: {P12440}
[06:37:28] I am going to enable GTID on the eqiad masters now that replication has been disconnected
[06:38:15] I will verify that gtid_current_pos and slave_pos are the same
[06:39:20] maybe a few compares after all that?
[06:39:36] Something is going on with s7
[06:39:51] same crash on db2121, same crash we had yesterday, exactly the same table and the same index
[06:39:56] on db2121
[06:41:18] you depool?
[06:41:28] No, I have started replication first
[06:41:32] I am going to repool db2120
[06:41:36] and then we can repool db2121
[06:41:44] depool
[06:41:51] I would preemptively drop and reload the affected table on all s7 hosts
[06:42:00] after handling db2121
[06:42:27] Yeah, it looks like it's always that same database and table
[06:42:57] DBA, User-Kormat: db2120 & db2121 crashed - https://phabricator.wikimedia.org/T261869 (jcrespo)
[06:43:30] metawiki, but which table?
[06:43:33] content
[06:44:16] DBA, User-Kormat: db2120 & db2121 crashed - https://phabricator.wikimedia.org/T261869 (Marostegui) db2121 crashed with the same database and table: ` Sep 03 06:35:19 db2121 mysqld[3365]: 2020-09-03 6:35:19 349115836 [Note] InnoDB: Index 19838 is `PRIMARY` in table `metawiki`.`content` Sep 03 06:35:19 db...
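For reference, the disconnect-and-verify sequence from 06:19 and 06:37 as a minimal sketch, run per section master; db2xxx/db1xxx are placeholders rather than real hostnames, and the exact commands used may have differed:

    # Save the current coordinates of a codfw master before breaking
    # replication, so it could be re-established later if needed (cf. P12440):
    mysql -h db2xxx.codfw.wmnet -e "SHOW SLAVE STATUS\G" > /tmp/db2xxx-slave-status.txt
    # Stop replicating from eqiad and discard the replica configuration:
    mysql -h db2xxx.codfw.wmnet -e "STOP SLAVE; RESET SLAVE ALL;"
    # On an eqiad master (still replicating from its codfw counterpart),
    # switch replication to GTID mode, the standard MariaDB way:
    mysql -h db1xxx.eqiad.wmnet -e "STOP SLAVE; CHANGE MASTER TO MASTER_USE_GTID=slave_pos; START SLAVE;"
    # Then verify that the two GTID positions agree, per the 06:38 message:
    mysql -h db1xxx.eqiad.wmnet -e "SELECT @@gtid_current_pos, @@gtid_slave_pos;"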
[06:44:35] oh, I see
[06:44:58] I thought it was a "content page / data page", not literally "a page from the content table"
[06:45:45] it is 1.5 GB, reasonably small to reload
[06:45:53] yeah, should be fast
[06:46:41] that makes it difficult to be a hw issue
[06:46:59] of course it is not a hw issue
[06:50:51] DBA, User-Kormat: db2120 & db2121 crashed - https://phabricator.wikimedia.org/T261869 (Marostegui)
[06:51:19] Going to depool db2121, upgrade + reload the table
[06:52:10] db2120 processlist (load) is normal, but some kills are happening there
[06:52:41] I think it is going down now
[06:52:48] yeah, I repooled it faster than usual, so it might be colder than usual
[06:53:20] it should have gotten at least minimal load yesterday for warmup
[07:00:48] I think stephen didn't want to repool it so late in his evening, and preferred to do it today so it could be monitored; that's why it wasn't repooled
[07:00:53] anyways, going to depool db2121
[07:04:17] yeah, I would have done the same
[07:04:39] of course it is way easier to speak with the power of hindsight on your side :-)
[07:07:46] DBA: Reload metawiki.content table on s7 hosts - https://phabricator.wikimedia.org/T261917 (Marostegui)
[07:08:03] DBA: Reload metawiki.content table on s7 hosts - https://phabricator.wikimedia.org/T261917 (Marostegui) p: Triage→High
[07:23:39] marostegui: exactly re: repooling db2120. it also hadn't caught up on replication when i was signing off for the evening.
[07:25:08] DBA: Reload metawiki.content table on s7 hosts - https://phabricator.wikimedia.org/T261917 (Marostegui)
[07:26:12] DBA: Reload metawiki.content table on s7 hosts - https://phabricator.wikimedia.org/T261917 (Marostegui) db2121 done. Analyze table also looks good after the reload.
[07:28:16] jynus: I have deployed the CI config change for the wmfbackups repo :]
[07:30:02] thanks, hashar
[07:41:07] I am going to reference a git commit until a release is done, to get me going: https://gerrit.wikimedia.org/r/c/operations/software/wmfbackups/+/623751/6/test-requirements.txt
[07:47:47] jynus: yeah that is a bit hacky (I think pip ends up always cloning the repo on every install and is unable to keep it in cache), but for most purposes it works perfectly fine
[07:48:03] note that `tox -e format` fails for me on wmfbackups
[07:48:11] cause of some imports not being properly sorted
[07:48:25] I haven't looked at the other build failures, hopefully they are straightforward to fix up
[07:48:38] it works for me
[07:49:04] DBA, MediaWiki-extensions-OAuthRateLimiter, Platform Team Initiatives (API Gateway), Platform Team Sprints Board (Sprint 2), and 2 others: Review request for a new database table for OAuthRateLimiter - https://phabricator.wikimedia.org/T258711 (Kormat) Filtering is confirmed functioning, there's...
[07:49:08] DBA, MediaWiki-extensions-OAuthRateLimiter, Platform Team Initiatives (API Gateway), Platform Team Sprints Board (Sprint 2), and 2 others: Review request for a new database table for OAuthRateLimiter - https://phabricator.wikimedia.org/T258711 (Kormat) Open→Resolved
[07:49:21] or I was on the wrong repository hehe
[07:49:31] DBA, User-Kormat: db2120 & db2121 crashed - https://phabricator.wikimedia.org/T261869 (Marostegui) Open→Resolved Resolving this as both hosts are back in production. db2120 was fully rebuilt; db2121 got its table reloaded from the backup source in codfw. The follow up is to reload `metawiki.conten...
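The 07:41 trick, schematically: a pip requirements file can point at a bare git commit instead of a released package. Repo path, sha and egg name below are placeholders; the real line is in the linked Gerrit change:

    # test-requirements.txt: install the dependency straight from git until a
    # release exists; pip re-clones the repo on each install, hence "hacky".
    git+https://gerrit.wikimedia.org/r/<other-repo>@<commit-sha>#egg=<package>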
[07:49:39] DBA: Reload metawiki.content table on s7 hosts - https://phabricator.wikimedia.org/T261917 (Marostegui)
[07:50:34] hashar: I don't intend to leave it like that, but HEAD depends right now on a particular commit of the other repo
[07:50:51] jynus: yeah it is entirely fine ;]
[07:50:56] I am using the same trick on a few repos
[07:50:59] but wanted to make sure the split was functional
[07:51:26] even if it was just a trivial unit test
[07:52:34] DBA: Reload metawiki.content table on s7 hosts - https://phabricator.wikimedia.org/T261917 (Marostegui) db1094 done - analyze looking good. Host upgraded too.
[07:53:16] \o/
[08:05:44] DBA: Reload metawiki.content table on s7 hosts - https://phabricator.wikimedia.org/T261917 (Marostegui) db2086:3317 done - analyze looking good. Host upgraded too.
[08:05:51] DBA: Reload metawiki.content table on s7 hosts - https://phabricator.wikimedia.org/T261917 (Marostegui)
[08:12:52] DBA: Reload metawiki.content table on s7 hosts - https://phabricator.wikimedia.org/T261917 (Marostegui) db1127 done - analyze looking good. Host upgraded too.
[08:12:59] DBA: Reload metawiki.content table on s7 hosts - https://phabricator.wikimedia.org/T261917 (Marostegui)
[08:27:48] DBA: Reload metawiki.content table on s7 hosts - https://phabricator.wikimedia.org/T261917 (Marostegui) db1101 done - analyze looking good. Host upgraded too.
[08:27:58] DBA: Reload metawiki.content table on s7 hosts - https://phabricator.wikimedia.org/T261917 (Marostegui)
[08:29:04] DBA: Reload metawiki.content table on s7 hosts - https://phabricator.wikimedia.org/T261917 (Marostegui)
[08:41:53] DBA: Reload metawiki.content table on s7 hosts - https://phabricator.wikimedia.org/T261917 (Marostegui) db2122 done - analyze looking good. Host upgraded too.
[08:42:02] DBA: Reload metawiki.content table on s7 hosts - https://phabricator.wikimedia.org/T261917 (Marostegui)
[08:43:13] DBA: Reload metawiki.content table on s7 hosts - https://phabricator.wikimedia.org/T261917 (Marostegui) dbstore1003:3317 done - analyze looking good.
[08:43:26] DBA: Reload metawiki.content table on s7 hosts - https://phabricator.wikimedia.org/T261917 (Marostegui)
[08:58:09] DBA: Reload metawiki.content table on s7 hosts - https://phabricator.wikimedia.org/T261917 (Marostegui) db1090:3317 done - analyze looking good. Host upgraded too.
[08:58:18] DBA: Reload metawiki.content table on s7 hosts - https://phabricator.wikimedia.org/T261917 (Marostegui)
[09:01:07] marostegui: good news, testing the updated cumin aliases has led me to discover some incorrect zarcillo data.
[09:01:25] db1103 is registered as being in x1 and s2 and s4
[09:01:26] kormat: I fixed a bunch of stuff last week, so I am not surprised there is more :(
[09:01:45] fixed this one at least
[09:01:48] cheers
[09:18:55] DBA: Reload metawiki.content table on s7 hosts - https://phabricator.wikimedia.org/T261917 (Marostegui) db1098:3317 done - analyze looking good. Host upgraded too.
[09:19:01] DBA: Reload metawiki.content table on s7 hosts - https://phabricator.wikimedia.org/T261917 (Marostegui)
[09:35:34] DBA: Reload metawiki.content table on s7 hosts - https://phabricator.wikimedia.org/T261917 (Marostegui) db2087:3317 done - analyze looking good. Host upgraded too.
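The per-host routine behind the "done - analyze looking good" updates above, as a rough sketch; it assumes dbctl plus a plain mysql client, and elides the actual copy from the backup source:

    # Drain the replica before touching it:
    dbctl instance db1094 depool
    dbctl config commit -m "Depool db1094 to reload metawiki.content T261917"
    # ... upgrade the host and reload metawiki.content from backups here ...
    # Rebuild index statistics and sanity-check the reloaded table:
    mysql -h db1094.eqiad.wmnet metawiki -e "ANALYZE TABLE content;"
    # Put it back in production:
    dbctl instance db1094 pool
    dbctl config commit -m "Repool db1094 after reloading metawiki.content T261917"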
[09:35:46] DBA: Reload metawiki.content table on s7 hosts - https://phabricator.wikimedia.org/T261917 (Marostegui)
[09:42:40] Blocked-on-schema-change, DBA, Wikidata: Schema change on production for increase the size of wbt_text_in_lang.wbxl_language - https://phabricator.wikimedia.org/T237120 (Marostegui) Open→Resolved The master is finally done ` root@db1109.eqiad.wmnet[wikidatawiki]> ALTER TABLE /*_*/wbt_text_in_...
[09:43:00] Blocked-on-schema-change, DBA, Wikidata: Schema change on production for increase the size of wbt_text_in_lang.wbxl_language - https://phabricator.wikimedia.org/T237120 (Marostegui)
[09:48:18] I've uploaded https://gerrit.wikimedia.org/r/c/operations/software/+/623981; not so much worried about the patch itself as about its implication, let me know what you think
[09:48:27] I have +1ed
[09:48:54] as in "agree we should add it to the list of regularly monitored tables", right?
[09:49:11] yes
[09:49:30] not sure BTW if that list should be moved to the wmfmariadb repo now, up to kormat
[09:49:39] or puppet, or somewhere else
[09:49:55] will deploy it there for now anyway
[09:50:43] we can do a run the week before switchback
[09:50:48] after all maintenance is done
[09:51:27] that is what I did before this switchover and what I was planning to do, see: https://phabricator.wikimedia.org/T261914
[09:51:41] cool, so already planned, thank you
[09:51:59] a slightly related topic
[09:52:18] on of the newly big tables, image on commonswiki
[09:52:34] has a non-autoinc PK
[09:53:02] and that creates challenges for both db-compare and dumps (not being able to partition it well)
[09:53:16] s/on/one/
[09:53:56] there has been a lot of growth there, I expect even more this week due to ongoing commons competitions
[09:55:25] DBA, User-Kormat: db2120 & db2121 crashed - https://phabricator.wikimedia.org/T261869 (Marostegui)
[09:55:33] DBA, Patch-For-Review: Reload metawiki.content table on s7 hosts - https://phabricator.wikimedia.org/T261917 (Marostegui) Open→Resolved This is all done and hosts repooled. Also, the hosts that needed a reboot for the kernel upgrade for T261389 were rebooted.
[09:58:33] DBA, Patch-For-Review, User-Kormat, cloud-services-team (Kanban): Upgrade m5 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T260324 (Marostegui) So I will do the pooling and the master promotion while we have read only, on the same `dbctl` commit
[12:06:31] DBA, Operations, ops-codfw, User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (Kormat) Resolved→Open This happened again. ` racadm>>racadm getsel Record: 1 Date/Time: 08/18/2020 15:23:07 Source: system Severity: Ok Desc...
[12:08:40] DBA, Operations, ops-codfw, User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (Kormat) The full lclogs are here: {{P12484}}
[12:10:24] marostegui: db2125 was in the 'api' group at weight 100. i'm wondering if i need to rebalance
[12:11:00] the other two nodes in that group have high qps (11k and 19k)
[12:11:28] let me check
[12:13:13] kormat: maybe let's give db2086 and db2087 200 in main traffic and decrease db2077 to 100 in main traffic
[12:13:19] so db2077 can serve more api and less main
[12:14:17] marostegui: are we looking at the same section?
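Re the commonswiki.image point at 09:52: with an auto-increment PK a compare or dump run can be partitioned into even integer ranges up front, but image's PK is img_name, a string, so a tool has to paginate on the key itself. A keyset-pagination sketch, with hostname and batch size purely illustrative:

    # Fetch one batch and remember the last key seen; the next query resumes
    # strictly after it, so batches never overlap and no OFFSET scan is needed:
    mysql -h db2xxx.codfw.wmnet commonswiki -e "
        SELECT img_name, img_size, img_sha1
        FROM image
        WHERE img_name > 'Example.jpg'  -- last key of the previous batch ('' to start)
        ORDER BY img_name
        LIMIT 10000;"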
db2125 was in s2
[12:14:30] oh, I was looking at s7, sorry
[12:14:33] too much s7 today
[12:14:38] :)
[12:15:22] so let's put db2088 and db2138 to 200 for now and see how it goes
[12:15:43] alrighty
[12:15:59] kormat: and db2108 to 300 even, as that is the dump/vslow host, which won't be used in codfw
[12:17:43] marostegui: can you check the diff on cumin1001 please? (i think i shouldn't be making this big a change in a single commit, so if you can confirm i'm at least touching the right group, i can remake it in smaller increments)
[12:17:58] checking
[12:18:31] kormat: that makes sense, commit that for now
[12:18:46] we'll see if we need to readjust
[12:18:48] grand.
[12:19:57] I clearly jinxed it, as this happened while I was telling lukasz during the meeting that we've been pretty stable in the last few months
[12:20:07] hehe
[12:20:21] on the plus side, we were talking about how i've been waiting for a slave-down event to handle for my OKRs ;)
[12:20:31] indeed XD
[12:24:33] DBA, Operations, ops-codfw, Patch-For-Review, User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (Marostegui) a: Kormat→Papaul
[12:28:10] DBA, Upstream, cloud-services-team (Kanban): Reimage labsdb1011 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T249188 (Marostegui) Update on this: So MariaDB has shipped some fixes that could be somewhat related to this on the last 10.4.14. They've also been able to reproduce the err...
[12:51:29] DBA, Operations, ops-codfw, Patch-For-Review, User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (Kormat) No response on console, idrac says power is on. I've tried `serveraction hardreset`, but no response on console. `serveraction powerdo...
[12:52:31] DBA, Operations, ops-codfw, Patch-For-Review, User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (Marostegui) That sounds like mainboard to me :(
[12:52:35] it came back, didn't it?
[12:52:46] should I downtime the services?
[12:52:55] it is done
[12:53:26] RECOVERY - Host db2125 is UP: PING OK - Packet loss = 0%, RTA = 34.07 ms
[12:54:00] I duplicated it, there is a lag on icinga, sorry
[13:12:35] kormat: my proposal would be to start mysql on db2125 so it catches up, so at least we have it ready in case we want to pool it back for load issues
[13:12:50] so far there is no need to pool it back, but let's cover for that possibility
[13:13:36] i'm kinda assuming it has data corruption, are we feeling lucky?
[13:15:15] kormat: let's start it and see XD
[13:15:34] kormat: we can leave it started, but let's not pool it
[13:16:53] ok :)
[13:17:31] kormat: in case it doesn't have corruption, it can replicate, and if we feel the load in s7 is too much, we can always pool it a bit, but I'd rather not unless we absolutely have to
[13:18:47] alright. it's started and replicating.
[13:19:03] nice, let's see
[13:19:14] last time corruption showed up a few hours later, no?
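The s2 rebalance agreed at 12:15, expressed as dbctl operations; a sketch that assumes single-instance host names and the set-weight subcommand, with the diff reviewed on cumin1001 before committing, as kormat did above:

    # Shift main-traffic weight onto the remaining s2 replicas, and onto the
    # dump/vslow host, which carries no dump traffic while codfw is active:
    dbctl instance db2088 set-weight s2 200
    dbctl instance db2138 set-weight s2 200
    dbctl instance db2108 set-weight s2 300
    dbctl config diff    # review the pending change before it goes live
    dbctl config commit -m "Rebalance s2 in codfw after db2125 crash T260670"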
[13:19:52] DBA, Patch-For-Review, User-Kormat, cloud-services-team (Kanban): Upgrade m5 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T260324 (Marostegui)
[13:20:02] yeah i think so
[14:13:00] DBA, Patch-For-Review, User-Kormat, cloud-services-team (Kanban): Upgrade m5 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T260324 (Marostegui)
[14:15:48] DBA, Patch-For-Review, User-Kormat, cloud-services-team (Kanban): Upgrade m5 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T260324 (Marostegui)
[14:18:41] DBA, Patch-For-Review, User-Kormat, cloud-services-team (Kanban): Upgrade m5 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T260324 (Marostegui) The failover was successfully done. wikitech went read only at 14:01:38 and went back to writable at 16:04:40
[14:19:24] DBA, Patch-For-Review, User-Kormat, cloud-services-team (Kanban): Upgrade m5 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T260324 (Marostegui) I am going to leave the TTL at 1M until tomorrow, just in case. I will revert it back to 5M tomorrow morning.
[15:06:29] marostegui: how would you feel about me merging my puppet patches now?
[15:06:37] (what's the worst that could happen, etc)
[15:07:31] kormat: let me quickly check
[15:07:58] I recall there was one that touched the basedir, but I saw the PCC and it looked good
[15:09:42] kormat: I think it should be fine, but we can be extra careful and disable puppet everywhere and try it manually on a few hosts, like one from s1, es1, pc1, m1 and labsdb?
[15:09:47] how do you feel about that?
[15:10:00] excited! ;)
[15:10:03] XD
[15:12:10] running `cumin A:db-all 'disable-puppet "Deploying puppet packaging changes - T256972 - kormat"'`
[15:12:10] T256972: Refactor mariadb puppet code - https://phabricator.wikimedia.org/T256972
[15:12:21] sweet
[15:12:36] I can take pc1, m1 and labsdb if you want
[15:13:27] that'd be great
[15:13:35] ok, I will do that
[15:13:38] let me know when merged
[15:14:35] merged now.
[15:14:41] ok, testing
[15:15:51] pc1 looks good, there was a change I wasn't expecting, but which actually makes sense to have
[15:15:57] -optimizer_switch = 'mrr=on,mrr_cost_based=on,mrr_sort_keys=on,optimize_join_buffer_size=on'
[15:15:57] +# Disable rowid_filter on 10.4 optimizer - T245489
[15:15:58] +optimizer_switch = 'mrr=on,mrr_cost_based=on,mrr_sort_keys=on,optimize_join_buffer_size=on,rowid_filter=off'
[15:15:58] T245489: Possibly disable optimizer flag: rowid_filter on 10.4 - https://phabricator.wikimedia.org/T245489
[15:16:19] marostegui: right yeah - that's because version detection depends on the basedir, and the basedir used to be wrong
[15:16:23] yep
[15:16:32] labsdb looking clean too
[15:17:06] m1 (multi instance) looks good too
[15:17:10] green light from my side!
[15:18:16] tests on s1/es1 and a random core_multiinstance host all look good too
[15:18:41] \o/
[15:19:04] reenabling puppet everywhere
[15:19:54] I have tested a backup source and it also works fine
[15:20:58] running `cumin -b 10 A:db-all 'run-puppet-agent'` now
[15:21:09] * marostegui runs
[15:21:20] :D
[15:21:56] marostegui: can i get a +1 on https://gerrit.wikimedia.org/r/c/operations/puppet/+/622995 please?
[15:22:01] checking
[15:22:24] I thought I did already
[15:22:36] done
[15:22:47] that's because you feel my work is always +1-worthy. ;)
[15:23:09] * marostegui looks aside...
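The staged rollout from 15:12 to 15:20, generalized as a sketch; the canary hostname is a placeholder, and it assumes enable-puppet wants the same reason string that was given to disable-puppet:

    # 1. Freeze puppet fleet-wide so the merge cannot apply anywhere unsupervised:
    cumin A:db-all 'disable-puppet "Deploying puppet packaging changes - T256972 - kormat"'
    # 2. After merging, re-enable and run one canary per flavour
    #    (s1, es1, pc1, m1, labsdb, backup source) and inspect the diffs by hand:
    cumin 'db1xxx.eqiad.wmnet' 'enable-puppet "Deploying puppet packaging changes - T256972 - kormat"; run-puppet-agent'
    # 3. Once every canary looks clean, re-enable the rest and roll out in batches:
    cumin A:db-all 'enable-puppet "Deploying puppet packaging changes - T256972 - kormat"'
    cumin -b 10 A:db-all 'run-puppet-agent'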
[15:23:20] hehe
[17:37:22] DBA, Operations, ops-codfw, Patch-For-Review, User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (Papaul) I will open a case with Dell
[23:36:09] DBA, CheckUser: Monitor the growth of CheckUser tables thanks to the addition of login data - https://phabricator.wikimedia.org/T261999 (Huji)