[05:32:38] PROBLEM - MariaDB sustained replica lag on db2106 is CRITICAL: 1.402e+04 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2106&var-port=9104
[05:35:52] DBA: db2106 and db2147 crashed - https://phabricator.wikimedia.org/T279406 (Marostegui)
[05:36:09] DBA: db2106 and db2147 crashed - https://phabricator.wikimedia.org/T279406 (Marostegui) The table looks ok by the way: ` +----------------------+-------+----------+----------+ | Table | Op | Msg_type | Msg_text | +----------------------+-------+----------+----------+ | commonswiki.category |...
[05:43:39] DBA: db2106 and db2147 crashed - https://phabricator.wikimedia.org/T279406 (Marostegui) https://jira.mariadb.org/browse/MDEV-25344
[05:44:00] DBA: db2106 and db2147 crashed - https://phabricator.wikimedia.org/T279406 (Marostegui) Open→Resolved p:Triage→Medium a:Marostegui
[05:44:27] DBA: db2106 and db2147 crashed - https://phabricator.wikimedia.org/T279406 (Marostegui) Resolved→Open a:Marostegui→None
[05:45:11] DBA, Analytics-Clusters, Analytics-Kanban, Data-Services, and 2 others: Convert labsdb1012 from multi-source to multi-instance - https://phabricator.wikimedia.org/T269211 (Marostegui) That's excellent, are we good to close this?
[05:49:44] RECOVERY - MariaDB sustained replica lag on db2106 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2106&var-port=9104
[05:55:13] DBA, SRE, Wikimedia-Mailing-lists: Create production databases for mailman3 - https://phabricator.wikimedia.org/T278614 (Marostegui) That's good news Amir. Thanks for testing it. Let's go for the large mailing lists import to see how it looks indeed.
[05:57:40] DBA: Upgrade 10.4.13 hosts to a higher version - https://phabricator.wikimedia.org/T279281 (Marostegui)
[05:58:33] DBA, GrowthExperiments-MentorDashboard, GrowthExperiments-Mentorship, Growth-Team (Current Sprint), User-Urbanecm_WMF (Engineering): Create growthexperiments_mentor_mentee database table on extension1 for wikis in growthexperiments.dblist - https://phabricator.wikimedia.org/T278573 (Marostegui...
[06:04:01] marostegui: Amir1 is ready to import larger mailing lists to the test database, I was just worried about causing replag if we're bulk importing 1-2GB of archives, should we make sure we have DBA supervision when we do the import or what?
[06:06:11] legoktm: in general "it is ok" to create lag on misc hosts, as we don't use the slaves for reads. But it will definitely fire some alerts, so better if you give us a heads up so we are aware of those :)
[06:06:44] legoktm: shouldn't you be sleeping anyways?
[06:06:58] ack, will make sure he/we ping y'all
[06:07:10] it's only 11pm here, too early for me to sleep :)
[06:08:14] hahaha
[06:15:12] DBA, decommission-hardware: decommission db1076.eqiad.wmnet - https://phabricator.wikimedia.org/T274752 (Marostegui) Stalled→Open
[06:15:16] DBA, SRE, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[06:16:59] DBA, decommission-hardware, Patch-For-Review: decommission db1076.eqiad.wmnet - https://phabricator.wikimedia.org/T274752 (Marostegui)
[06:18:10] DBA: job_cmd is varbinary(255) in production while being varbinary(60) in code since 2007 - https://phabricator.wikimedia.org/T278621 (Marostegui) p:Triage→Medium
[06:28:28] DBA, Analytics-Clusters, Analytics-Kanban, Data-Services, and 2 others: Convert labsdb1012 from multi-source to multi-instance - https://phabricator.wikimedia.org/T269211 (elukey) Almost!
There are a couple of things left: - clouddb1021 is still running with icinga notifications disabled, plus t...
[06:29:28] Data-Persistence-Backup, SRE, netops: Understand (and mitigate) the backup speed differences between backup1002->backup2002 and backup2002->backup1002 - https://phabricator.wikimedia.org/T274234 (ayounsi) As discussed over IRC a while ago, this is mostly due to the network being more used in the eqia...
[06:35:34] DBA, Analytics-Clusters, Analytics-Kanban, Data-Services, and 2 others: Convert labsdb1012 from multi-source to multi-instance - https://phabricator.wikimedia.org/T269211 (Marostegui) Sounds good @elukey - thanks!
[06:36:07] DBA: ipb_timestamp is varbinary(14) in old wikis while being binary(14) in the code since 2007 - https://phabricator.wikimedia.org/T278619 (Marostegui) p:Triage→Medium
[06:57:24] marostegui: is there supposed to be a # at the start of https://phabricator.wikimedia.org/project/manage/5291/ in its name
[07:02:50] RhinosF1: The petition includes the "#" but it is probably a mistake, not a big deal anyways
[07:02:52] Thanks though!
[07:03:21] I noticed it and thought phab had gone crazy then realised it was in the name
[07:03:27] Only project afaics that does it
[07:23:13] DBA, Patch-For-Review: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 (Marostegui) Transfer on-going from db1169 to db1184
[07:26:13] Data-Persistence-Backup, SRE, netops: Understand (and mitigate) the backup speed differences between backup1002->backup2002 and backup2002->backup1002 - https://phabricator.wikimedia.org/T274234 (jcrespo) > Are the backup long TCP sessions or many small ones? I would have to prove myself wrong with...
[07:51:11] DBA, Add-Link, Growth-Team: Determine why querying is slow and what we can do about it - https://phabricator.wikimedia.org/T279411 (kostajh) @Marostegui @jcrespo I would appreciate hearing any ideas you have of what might be going on. While we haven't yet deployed the addition of the primary key sch...
[07:57:12] marostegui: morning :D I have this for you: https://gerrit.wikimedia.org/r/c/operations/puppet/+/675308
[07:57:18] OKR for migrating to puppet 6
[07:57:35] I'll import some mailing lists later today
[08:00:19] Amir1: will check in a few minutes
[08:00:50] DBA, Add-Link, Growth-Team: Determine why querying is slow and what we can do about it - https://phabricator.wikimedia.org/T279411 (jcrespo) Run: ` EXPLAIN SELECT value FROM lr_cswiki_anchors WHERE lookup = {foo} LIMIT 1 ` both locally and in production and that will generate a query plan. You can...
[08:06:29] Thanks!
[08:20:38] Amir1: going to merge it, we'll see if tendril breaks :p
[08:22:02] Amir1: done
[08:24:36] marostegui: haha, thanks!
[08:24:45] keep me posted
[08:29:48] wilco!
[08:30:03] hi marostegui
[08:30:26] hey kostajh
[08:30:32] I don't think I have credentials to access that, or at least I don't remember seeing them. When we productionized the service, the credentials were left in Janis' home directory to place into puppet
[08:30:52] kostajh: don't worry, I can run the explains for you and paste it on the task
[08:31:06] cheers
[08:31:08] kostajh: will do it a bit later if that's ok,
[08:31:13] yeah no rush
[08:31:28] thanks!
[08:39:17] Amir1: icinga-wm> PROBLEM - Check systemd state on dbmonitor1002 is CRITICAL: CRITICAL - degraded: The following units failed: tendril-5m.service,tendril-queries.service
[08:39:39] ugh
[08:39:48] yes
[08:39:54] DBA: db2106 and db2147 crashed - https://phabricator.wikimedia.org/T279406 (Kormat) a:Kormat I'll reclone both servers.
[08:39:57] do you know why it's failing?
[08:40:08] I will take a look
[08:40:41] one change is that crons just silently fail but systemd failures result in page
[08:40:56] which is a curse and a blessing
[08:42:28] Apr 6 08:40:01 dbmonitor1002 tendril-cron-5m.pl[6434]: DBI connect(';mysql_read_default_file=/etc/mysql/tendril.cnf;mysql_read_default_group=tendril','',...) failed: Access denied for user 'watchdog'@'208.80.155.104' (using password: YES) at /usr/local/bin/tendril-cron-5m.pl line 11.
[08:42:47] I can fix that
[08:43:39] Thanks
[08:43:55] how did it work before then?
[08:44:03] did I mess up the users
[08:44:28] Maybe it wasn't :)
[08:44:47] We switched to a new dbmonitor host a week ago (running buster)
[08:44:54] And that IP wasn't in the grants
[08:45:19] mysql using ip-based grants is still the dumbest thing ever.
[08:45:36] root@dbmonitor1002:~# systemctl list-units --failed
[08:45:36] 0 loaded units listed. Pass --all to see loaded but inactive units, too.
[08:45:38] looks good now
[08:45:44] the recovery should arrive soon to -operations
[08:46:16] Apr 6 08:45:20 dbmonitor1002 tendril-cron-5m.pl[9075]: pc1007.eqiad.wmnet:3306
[08:46:17] it works now
[08:46:24] RECOVERY - Check systemd state on dbmonitor1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:49:54] \o/
[08:50:18] kormat: tbf is really safe
[08:50:21] maybe too safe
[08:53:16] DBA, Patch-For-Review: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 (Marostegui) db1184 is now replicating.
[09:01:12] Amir1: it's a management nightmare.
which strongly incentivises people to just use blanket host patterns
[09:01:54] yeah, I think it doesn't support CIDR ranges either
[09:02:34] the follow up clean up https://gerrit.wikimedia.org/r/c/operations/puppet/+/677111
[09:17:37] Amir1: I will give it a couple of hours before merging that last one
[09:18:00] DBA, Patch-For-Review: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 (Marostegui) checking tables on db1184
[09:18:43] sure
[11:45:42] kormat: for db2106 and db2147 might be worth checking if they need their buster kernel upgraded
[11:46:33] mmm, they are not https://phabricator.wikimedia.org/T273281 so maybe it is done already
[11:46:50] db2106 is done
[11:46:58] db2147 was probably installed later than that, so probably has the right one
[11:47:07] first thing i checked :)
[11:48:36] aw <3
[13:53:25] DBA, Analytics-Clusters, Analytics-Kanban, Data-Services, and 2 others: Convert labsdb1012 from multi-source to multi-instance - https://phabricator.wikimedia.org/T269211 (Milimetric) > - @JAllemandou may have some performance questions to add related to indexes IIRC, leaving a note in here to re...
[13:54:50] DBA, Analytics-Clusters, Analytics-Kanban, Data-Services, and 2 others: Convert labsdb1012 from multi-source to multi-instance - https://phabricator.wikimedia.org/T269211 (Marostegui) >>! In T269211#6974963, @elukey wrote: > Almost! There are a couple of things left: > > - clouddb1021 is still r...
[14:02:31] elukey: I have ack'ed the memory usage notification on icinga
[14:20:10] marostegui: ah yes makes sense, I left in there to remember to remove or tweak the alert (other cloud replicas show the same)
[14:22:56] yeah, it needs to be removed
[14:23:00] +1 to that
[17:01:00] Data-Persistence-Backup, SRE, SRE-swift-storage, Epic, Goal: WMF media storage must be adequately backed up in a remote location - https://phabricator.wikimedia.org/T262668 (Papaul)
[17:13:20] DBA, Upstream: DB backup restore skip empty databases - https://phabricator.wikimedia.org/T200035 (LSobanski) Interestingly, https://github.com/maxbube/mydumper/issues/110 received a response 6 days ago.