[00:46:52] 10DBA, 10Community-Tech, 10Expiring-Watchlist-Items: Watchlist Expiry: Release plan [rough schedule] - https://phabricator.wikimedia.org/T261005 (10ifried) @Marostegui Watchlist Expiry has been enabled on Wikidata & Commons.
[05:36:07] 10DBA, 10Data-Services, 10cloud-services-team (Kanban): Productionize clouddb10[13-20] - https://phabricator.wikimedia.org/T267090 (10Marostegui) This is very bad news. clouddb1013:3311 (s1) and clouddb1017:3311 (s1) crashed at the same time with the same error (the ones we've seen before) with: ` Nov 17 22:...
[05:54:41] 10DBA, 10Data-Services, 10cloud-services-team (Kanban): Productionize clouddb10[13-20] - https://phabricator.wikimedia.org/T267090 (10Marostegui)
[06:05:16] 10DBA, 10CheckUser: Monitor the growth of CheckUser tables at enwiki and few other very large wikis - https://phabricator.wikimedia.org/T267275 (10Marostegui)
[06:15:57] 10DBA, 10Community-Tech, 10Expiring-Watchlist-Items: Watchlist Expiry: Release plan [rough schedule] - https://phabricator.wikimedia.org/T261005 (10Marostegui) Thanks, this should be monitored for a few weeks in a similar way to what we are doing for other tables (i.e. T267275)
[06:23:11] 10DBA, 10Community-Tech, 10Expiring-Watchlist-Items: Monitor the growth of watchlist table at wikidata and wikicommons - https://phabricator.wikimedia.org/T268096 (10Marostegui)
[06:23:52] 10DBA, 10Community-Tech, 10Expiring-Watchlist-Items: Monitor the growth of watchlist table at wikidata and wikicommons - https://phabricator.wikimedia.org/T268096 (10Marostegui) p:05Triage→03Medium
[06:52:51] 10DBA, 10Data-Services, 10cloud-services-team (Kanban): Productionize clouddb10[13-20] - https://phabricator.wikimedia.org/T267090 (10Marostegui) Once the `check table` has finished on db1124:3311 I will transfer it to clouddb1013 and clouddb1017 and I will start them with: clouddb1013: `innodb_change_buffe...
[07:05:10] 10DBA, 10Data-Services, 10cloud-services-team (Kanban): Productionize clouddb10[13-20] - https://phabricator.wikimedia.org/T267090 (10Marostegui) If this doesn't work for some reason, I am thinking about: - Cloning these hosts from sanitarium's master, sanitizing them and starting replication from sanitarium (aft...
[07:18:43] 10DBA, 10Data-Services, 10cloud-services-team (Kanban): Productionize clouddb10[13-20] - https://phabricator.wikimedia.org/T267090 (10Marostegui)
[07:20:26] 10DBA, 10Data-Services, 10cloud-services-team (Kanban): Productionize clouddb10[13-20] - https://phabricator.wikimedia.org/T267090 (10Marostegui) Attempting the same on s6 with: Running a check on s6 tables on db1125 clouddb1015:3316 `innodb_change_buffering=none` and `event_scheduler=OFF` (make sure all t...
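The recovery plan above starts the crashed clouddb instances with `innodb_change_buffering=none` and `event_scheduler=OFF`. A minimal sketch of those settings in SQL form, assuming they are applied dynamically before replication is restarted; both variables are dynamic in MariaDB, though the actual deployment presumably sets them in the instance's config file, which the log does not show:

```
-- Hedged sketch of the startup tweaks named above; in practice these would
-- likely live in the instance's my.cnf rather than be set at runtime.
SET GLOBAL innodb_change_buffering = 'none';  -- disable the change buffer implicated in the crashes
SET GLOBAL event_scheduler = OFF;             -- keep scheduled events quiet while the data is verified
```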
[07:44:07] 10Blocked-on-schema-change, 10DBA: Drop default of protected_titles.pt_expiry - https://phabricator.wikimedia.org/T267335 (10Marostegui) Deployed on db1098:3316 - will leave it for a day to make sure nothing read-related is affected by this change
[07:44:48] 10Blocked-on-schema-change, 10DBA: Drop default of ip_changes.ipc_rev_timestamp - https://phabricator.wikimedia.org/T267399 (10Marostegui) Deployed on db1098:3316 - will leave it for a day to make sure nothing read-related is affected by this change
[08:14:57] 10DBA, 10decommission-hardware: decommission es1011.eqiad.wmnet - https://phabricator.wikimedia.org/T268100 (10Marostegui)
[08:15:27] 10DBA, 10decommission-hardware: decommission es1012.eqiad.wmnet - https://phabricator.wikimedia.org/T268101 (10Marostegui)
[08:15:59] 10DBA, 10decommission-hardware: decommission es1014.eqiad.wmnet - https://phabricator.wikimedia.org/T268102 (10Marostegui)
[08:16:31] the s1 and s6 snapshot failures are expected due to hw maintenance on db1139
[08:16:49] I will generate a manual snapshot before it is taken down again
[08:17:00] 10DBA, 10decommission-hardware: decommission es1014.eqiad.wmnet - https://phabricator.wikimedia.org/T268102 (10Marostegui) p:05Medium→03High Setting this to high as we need to make space for x2 hosts (T267043#6606399)
[08:17:28] 10DBA, 10decommission-hardware: decommission es1012.eqiad.wmnet - https://phabricator.wikimedia.org/T268101 (10Marostegui) p:05Medium→03High Setting this to high as we need to make space for x2 hosts (T267043#6606399)
[08:17:46] 10DBA, 10decommission-hardware: decommission es1011.eqiad.wmnet - https://phabricator.wikimedia.org/T268100 (10Marostegui) p:05Medium→03High Setting this to high as we need to make space for x2 hosts (T267043#6606399)
[08:29:12] 10DBA, 10decommission-hardware: decommission es1011.eqiad.wmnet - https://phabricator.wikimedia.org/T268100 (10Marostegui)
[08:29:19] 10DBA, 10decommission-hardware, 10Patch-For-Review: decommission es1012.eqiad.wmnet - https://phabricator.wikimedia.org/T268101 (10Marostegui)
[08:29:24] 10DBA, 10decommission-hardware: decommission es1014.eqiad.wmnet - https://phabricator.wikimedia.org/T268102 (10Marostegui)
[08:32:45] 10DBA, 10decommission-hardware, 10Patch-For-Review: decommission es1014.eqiad.wmnet - https://phabricator.wikimedia.org/T268102 (10Marostegui)
[08:32:50] 10DBA, 10decommission-hardware, 10Patch-For-Review: decommission es1012.eqiad.wmnet - https://phabricator.wikimedia.org/T268101 (10Marostegui)
[08:32:55] 10DBA, 10decommission-hardware, 10Patch-For-Review: decommission es1011.eqiad.wmnet - https://phabricator.wikimedia.org/T268100 (10Marostegui)
[08:35:43] 10DBA, 10decommission-hardware, 10Patch-For-Review: decommission es1011.eqiad.wmnet - https://phabricator.wikimedia.org/T268100 (10Marostegui)
[08:35:45] 10DBA, 10decommission-hardware, 10Patch-For-Review: decommission es1012.eqiad.wmnet - https://phabricator.wikimedia.org/T268101 (10Marostegui)
[08:35:49] 10DBA, 10decommission-hardware, 10Patch-For-Review: decommission es1014.eqiad.wmnet - https://phabricator.wikimedia.org/T268102 (10Marostegui)
[08:42:32] 10DBA, 10Operations, 10ops-eqiad: db1139 memory errors on boot (issue continues after board change) 2020-08-27 - https://phabricator.wikimedia.org/T261405 (10jcrespo) I will put the server back up temporarily for some hours so it catches up and we can generate a full backup before the maintenance.
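The two drop-default deployments above correspond to simple ALTER statements. A hedged reconstruction from the task titles; the actual migration files are not quoted in the log, so treat these as illustrative:

```
-- Reconstructed from T267335 and T267399; verify against the real migrations
ALTER TABLE protected_titles ALTER COLUMN pt_expiry DROP DEFAULT;
ALTER TABLE ip_changes ALTER COLUMN ipc_rev_timestamp DROP DEFAULT;
```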
[09:51:48] 10Blocked-on-schema-change, 10DBA, 10Operations, 10User-Kormat: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 (10Kormat)
[09:55:17] 10Blocked-on-schema-change, 10DBA, 10Operations, 10User-Kormat: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 (10Kormat) Running one last check over all section instances to confirm the change has been made everywhere.
[10:33:36] marostegui, does this seem right to you per our meeting? (current setup) https://usercontent.irccloud-cdn.com/file/F109CBTa/image.png
[10:35:28] arturo: for the current situation, yes
[10:35:33] ok thanks
[10:36:00] arturo, nitpicking here and offtopic (sorry), but when representing replication, try to avoid "connection arrows" like those, as data transfer happens in that direction, but the connection happens the opposite way
[10:37:26] ok thanks, will see how I can better reflect that
[10:37:36] replicas work in a "pull" configuration, so they are the clients, but of course data goes the way you say
[10:38:07] I am nitpicking because it is important for network drawings, not so much for architecture/logical ones
[10:38:21] (e.g. the port must be open on masters, not replicas)
[10:40:12] maybe invert the arrows and add a dotted one for the data flow direction?
[10:40:36] it all depends on what level the diagram is supposed to be working on
[10:40:39] this is offtopic
[10:40:40] yeah
[10:40:45] for a general overview i think it's perfectly fine
[10:41:46] I just had a replication error on db1139
[10:41:51] which is what we discussed in the meeting
[10:42:02] error reconnecting to master 'repl@db1131.eqiad.wmnet
[10:43:06] Slave_IO_State: Reconnecting after a failed master event read
[10:43:23] there are no issues on s6 production, right?
[10:43:48] none that i'm aware of, at least
[10:43:53] mmm
[10:45:18] Error reading packet from server: Lost connection to MySQL server during query (server_errno=2013)
[10:45:20] in the log
[10:45:28] maybe a network glitch?
[10:45:36] has it recovered?
[10:45:50] yes, it reconnected automatically
[10:45:59] but I saw it "Connecting" for a long time
[10:46:02] FWIW, two-sided arrows in this diagram would confuse me if there was no additional explanation
[10:46:09] sobanski: indeed.
[10:46:19] in fact it lagged for 1000 seconds, kormat
[10:47:14] it is in a "Connecting" state again
[10:47:31] monitoring seems to imply it's happening for both instances on that machine
[10:47:50] s1 is stopped on purpose because of backups
[10:47:59] it may be happening there, but we don't know
[10:48:17] as it should be stopped, and not trying to reconnect at the moment
[10:48:20] ah - is this a dbstore host?
[10:48:23] yes
[10:48:45] I will try to stop the slave and start it again on s6
[10:48:51] to see what happens
[10:49:44] nothing, checking network connectivity between hosts
[10:50:00] ping works
[10:50:53] i'm not seeing any other host in s6 having issues
[10:51:20] ERROR 2013 (HY000): Lost connection to MySQL server at 'reading authorization packet', system error: 2 "No such file or directory"
[10:51:38] I have no idea what that is, but it is not network
[10:51:57] it looks like the machine might be maxing out its network interface
[10:52:02] oh
[10:52:14] it's pegged at tx 122MB/s
[10:52:28] that could be the backups
[10:52:46] but in the past they weren't fast enough to create replication issues
[10:52:47] started at 10:18 UTC
[10:52:54] yes, that was me
[10:53:07] ack. is it possible to apply some rate-limiting?
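To make the "pull" point from the diagram discussion concrete: replication is configured on the replica, which connects out to the master, so the master's MySQL port is the one that must be open. A hedged sketch using the master named in the db1139 error above; the port and the omitted credentials are illustrative, not production values:

```
-- Run on the replica: it is the client side of the replication connection.
CHANGE MASTER TO
  MASTER_HOST = 'db1131.eqiad.wmnet',  -- master from the error message above
  MASTER_USER = 'repl',
  MASTER_PORT = 3306;                  -- assumed port; must be reachable on the master
START SLAVE;
-- Slave_IO_State in this output is what showed "Reconnecting after a failed master event read"
SHOW SLAVE STATUS\G
```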
[10:53:18] in fact, I remember executing backups the same way a few days ago
[10:53:35] yes, but not in the middle of the transfer
[10:53:51] I will stop replication
[10:53:57] alright. might be worth doing that by default in the future
[10:54:44] to be fair, this doesn't normally happen - 2 backups at the same time from the same server
[10:55:00] I just did it because of the ongoing memory issues with the server
[10:58:44] 10DBA, 10Patch-For-Review: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717 (10Marostegui) es1032 has been pooled in es1
[10:58:56] 10DBA, 10Patch-For-Review: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717 (10Marostegui) 05Open→03Resolved
[10:59:59] 10DBA, 10Orchestrator: Investigate hostname/fqdn handling in orchestrator - https://phabricator.wikimedia.org/T267929 (10Kormat) If there's an entry in `database_resolve` that maps to a bare hostname, e.g.: ` +--------------------+--------------------+---------------------+ | hostname | resolved_hos...
[11:00:58] 10DBA, 10Patch-For-Review: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717 (10Marostegui) Next step is to decommission the old hosts, but that will be handled separately on their own tickets
[11:01:28] kormat: in the past, writing to disk serially was much slower than network capacity - but apparently I have optimized the backups too well :-)
[11:14:45] I wonder if that could be the cause of large transfers failing?
[11:24:42] 10Blocked-on-schema-change, 10DBA, 10Operations, 10User-Kormat: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 (10Kormat) 05Open→03Resolved Check completed successfully, we're done \o/
[11:27:37] kormat: I think https://phabricator.wikimedia.org/T267767 is waiting for you, so don't be that happy!
[11:27:58] 😭
[11:28:39] we have 4 in the queue: https://phabricator.wikimedia.org/tag/blocked-on-schema-change/ - I have started 2, and the user_newtalk one is a bit more complex than the drop-default ones, so I am leaving it aside for now
[11:31:41] 10Blocked-on-schema-change, 10DBA: Drop default of revactor_timestamp - https://phabricator.wikimedia.org/T267767 (10Kormat) a:03Kormat
[11:32:10] 🎉
[11:32:29] 🥀
[11:33:19] haha
[12:01:39] 10DBA, 10Orchestrator, 10Patch-For-Review, 10User-Kormat: Enable report_host for mariadb - https://phabricator.wikimedia.org/T266483 (10Marostegui) pc1009 (pc3) is done ` root@pc1009:~# mysql -e "select @@report_host" +--------------------+ | @@report_host | +--------------------+ | pc1009.eqiad.wmnet...
[12:02:27] 10DBA, 10Orchestrator, 10Patch-For-Review, 10User-Kormat: Enable report_host for mariadb - https://phabricator.wikimedia.org/T266483 (10Marostegui)
[12:02:38] ^ kormat I will deploy grants on pc3 and auto-discover it
[12:03:20] 👍
[12:35:28] 10DBA, 10Orchestrator: Orchestrator doesn't use FQDN when manipulating replicas - https://phabricator.wikimedia.org/T267389 (10Kormat) This should now be fixed by https://gerrit.wikimedia.org/r/641402. Needs testing.
[12:41:21] This is not urgent, but I was trying to complete the inventory of SRE datasets for my goal. Currently orchestrator is installed on dborch and its database is on db2093 (I know that may change later), correct?
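The "one last check" that closed T259831 above is not shown in the log. A hedged sketch of the kind of per-instance verification it implies; the expected column type is an assumption based on the task title:

```
-- Illustrative check that ct_rc_id is now unsigned on a given instance
SELECT TABLE_SCHEMA, COLUMN_TYPE
FROM information_schema.COLUMNS
WHERE TABLE_NAME = 'change_tag'
  AND COLUMN_NAME = 'ct_rc_id';
-- Expected COLUMN_TYPE: an unsigned integer type, e.g. 'int(10) unsigned'
```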
[12:41:50] yep
[12:42:04] ah, but tendril and zarcillo have not been deleted from db2093
[12:42:16] so orchestrator may be much smaller than it looks
[12:42:22] zarcillo is replicated to db2093
[12:42:28] the orchestrator database itself is tiny
[12:42:40] <1MB
[12:42:41] I am going to annotate 5.9M for now
[12:43:00] that's the size of the sqldata/orchestrator directory for now
[12:43:14] I don't need very accurate measures with such a small db :-D
[12:43:15] thank you!
[12:43:18] np :)
[12:44:21] you can see what I have been working on: https://docs.google.com/spreadsheets/d/1aAo8COkz3_P3NS73i-ZZXu0gocx1J6mlzA79Drwo8CA/edit (will need help at some point)
[12:57:41] 10DBA, 10Operations, 10Orchestrator, 10Patch-For-Review: orchestrator: Use ssl for talking to db servers - https://phabricator.wikimedia.org/T267401 (10Kormat) 05Open→03Resolved a:03Kormat Fixed by https://gerrit.wikimedia.org/r/639765. From the commit description: > The orchestrator docs are a bit...
[14:40:52] 10DBA, 10Data-Services, 10cloud-services-team (Kanban): Productionize clouddb10[13-20] - https://phabricator.wikimedia.org/T267090 (10Marostegui) >>! In T267090#6629470, @Marostegui wrote: > Attempting the same on s6 with: > > Running a check on s6 tables on db1125 This came back clean, tomorrow I will do...
[15:22:53] 10DBA, 10Operations, 10Orchestrator, 10User-Kormat: Explore orchestrator hooks to integrate them with dbctl, !log, irc alerts and emails - https://phabricator.wikimedia.org/T266452 (10Kormat) One thing that's not currently clear is how to handle starting/stopping pt-heartbeat on masters.
[16:54:50] 10DBA, 10Operations, 10ops-eqiad: db1139 memory errors on boot (issue continues after board change) 2020-08-27 - https://phabricator.wikimedia.org/T261405 (10jcrespo) Server is back down and ready for maintenance after the backup.
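For the dataset-inventory exchange earlier in this stretch (the 5.9M annotation for orchestrator), a hedged sketch of how the on-disk figure could be cross-checked from SQL; the schema name comes from the log, the query itself is illustrative:

```
-- Approximate logical size of the orchestrator database in MB
SELECT table_schema,
       ROUND(SUM(data_length + index_length) / 1024 / 1024, 1) AS size_mb
FROM information_schema.tables
WHERE table_schema = 'orchestrator'
GROUP BY table_schema;
```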