[05:58:36] DBA, Operations, ops-eqiad: Degraded RAID on es1023 - https://phabricator.wikimedia.org/T268796 (Marostegui) a:wiki_willy @wiki_willy this host is under warranty, can we order a new disk from Dell?
[05:58:53] DBA, Operations, ops-eqiad: Degraded RAID on es1023 - https://phabricator.wikimedia.org/T268796 (Marostegui) p:Triage→Medium
[06:10:37] DBA, Data-Services, Patch-For-Review, cloud-services-team (Kanban): Productionize clouddb10[13-20] - https://phabricator.wikimedia.org/T267090 (Marostegui)
[06:19:21] DBA, Data-Services, Patch-For-Review, cloud-services-team (Kanban): Deploy labsdbuser and views to new clouddb hosts - https://phabricator.wikimedia.org/T268312 (Marostegui) @Bstorm have you found any other grant issues or should I go ahead and deploy all those roles/users to the rest of the clou...
[06:20:58] DBA, Community-Tech, Expiring-Watchlist-Items: Watchlist Expiry: Release plan [rough schedule] - https://phabricator.wikimedia.org/T261005 (Marostegui) Let's go ahead for the 01/12/2020
[06:52:51] DBA, Operations, CAS-SSO, User-jbond: Request new database for idp.wikimedia.org - https://phabricator.wikimedia.org/T268327 (Marostegui) Added the new DB to the misc doc https://wikitech.wikimedia.org/w/index.php?title=MariaDB%2Fmisc&type=revision&diff=1889656&oldid=1889330
[07:05:17] DBA, Orchestrator, User-Kormat: Enable report_host for mariadb - https://phabricator.wikimedia.org/T266483 (Marostegui)
[07:07:32] DBA, Data-Services, Patch-For-Review, cloud-services-team (Kanban): Productionize clouddb10[13-20] - https://phabricator.wikimedia.org/T267090 (Marostegui)
[07:09:04] DBA, Data-Services, Patch-For-Review, cloud-services-team (Kanban): Productionize clouddb10[13-20] - https://phabricator.wikimedia.org/T267090 (Marostegui) clouddb1016:3315: - Data copied from db1124:3315 - Host added to tendril and zarcillo - Root password changed - Replication started from:...
[07:13:31] DBA, Data-Services, Patch-For-Review, cloud-services-team (Kanban): Productionize clouddb10[13-20] - https://phabricator.wikimedia.org/T267090 (Marostegui)
[07:13:54] DBA, Data-Services, Patch-For-Review, cloud-services-team (Kanban): Productionize clouddb10[13-20] - https://phabricator.wikimedia.org/T267090 (Marostegui)
[08:06:49] Blocked-on-schema-change, DBA: Schema change for renaming namespace_title index on watchlist - https://phabricator.wikimedia.org/T268004 (Marostegui)
[08:10:45] I am checking x1 backups
[08:26:05] Blocked-on-schema-change, DBA: Schema change for renaming namespace_title index on watchlist - https://phabricator.wikimedia.org/T268004 (Marostegui)
[08:32:46] something weird happened with the systemd timer yesterday - it didn't execute
[08:33:19] cumin2001 systemd[1]: regular_snapshot.service: Current command vanished from the unit file, execution of the command list won't be resumed.
[08:34:24] So how exactly do I convince systemd to execute it? and why did it fail on codfw but not on eqiad?
[08:35:20] Nov 25 16:41:48 cumin2001 puppet-agent[7994]: (/Stage[main]/Profile::Mariadb::Backup::Transfer/Systemd::Timer::Job[regular_snapshot]/Systemd::Unit[regular_snapshot.service]/File[/lib/systemd/system/regular_snapshot.service]/content) content changed '{md5}90e5b0cd94a53b44591913ebe688f247' to '{md5}583d2cd189709b2a570ac3325b8de746'
[08:35:25] yes
[08:35:27] So something changed there?
[08:35:27] that I know
[08:35:33] I deployed this:
[08:35:57] https://gerrit.wikimedia.org/r/c/operations/puppet/+/643223/4/modules/profile/manifests/mariadb/backup/transfer.pp
[08:36:22] but from that I expected it to run the new command, not to fail, and it only failed on 1 out of 2 hosts (it worked on cumin1001)
[08:36:29] But from that same puppet run, it looks like it broke?
[08:36:37] cause it is at the same time that the refresh was done
[08:36:42] at 16:41:48
[08:36:57] although later it does a reload that apparently worked: Nov 25 16:41:48 cumin2001 puppet-agent[7994]: (/Stage[main]/Profile::Mariadb::Backup::Transfer/Systemd::Timer::Job[regular_snapshot]/Systemd::Unit[regular_snapshot.service]/Exec[systemd daemon-reload for regular_snapshot.service]) Triggered 'refresh' from 1 event
[08:37:26] refresh is ok, but backups were 2 hours later?
[08:38:11] so I am worried that our puppet timer code is unreliable under some conditions
[08:38:55] maybe there is a race condition on update or something?
[08:39:07] there are no logs apart from those on why it failed?
[08:39:19] on the timer side, no
[08:39:34] just the "it vanished, bye!" :-)
[08:39:52] I am going to do an ensure => absent, ensure => present
[08:40:16] and then report to the code maintainers to see if they have some idea
[08:40:33] check if it repeats again
[08:41:00] we have monitoring for this, so we would have caught it (This is why x1 and soon other backups checks will fail)
[08:41:30] but if it is a systemd puppet code timer issue it is more worrying because it is used for other stuff too
[08:41:41] https://phabricator.wikimedia.org/T255132#6214939
[08:41:57] Maybe worth checking with him to see if he found something else about it
[08:42:00] yeah
[08:42:17] although if you can see, there was a merge after that
[08:42:27] supposedly avoiding the issue
[08:42:34] I will report there that it happened for us
[08:42:38] yep
[08:42:40] maybe as a workaround
[08:42:53] we should disable and reenable timers when modifying them
[08:43:21] it is not like cron didn't have its own issues (e.g. when disabling them)
[08:43:28] (re:puppet)
[08:43:40] thanks marostegui, you helped me a lot with this
[08:48:38] Blocked-on-schema-change, DBA: Schema change for renaming namespace_title index on watchlist - https://phabricator.wikimedia.org/T268004 (Marostegui) s7 eqiad progress [x] dbstore1003 [] db1136 [] db1127 [x] db1116 [x] db1101 [x] db1098 [] db1094 [x] db1090 [] db1086 [] db1079
[08:53:50] Blocked-on-schema-change, DBA: Schema change for renaming namespace_title index on watchlist - https://phabricator.wikimedia.org/T268004 (Marostegui)
[09:09:37] DBA, Orchestrator: Configure mariadb to notice/recover from replication issues quicker - https://phabricator.wikimedia.org/T268320 (Marostegui)
[09:25:28] DBA, Operations, Patch-For-Review, User-Kormat, User-jbond: Refactor mariadb puppet code - https://phabricator.wikimedia.org/T256972 (Kormat)
[09:25:36] DBA, Operations, Patch-For-Review, User-Kormat, User-jbond: Standardize/centralize mapping from section to mariadb port/socket and prom-mysql-exporter port - https://phabricator.wikimedia.org/T257033 (Kormat) Open→Resolved a:Kormat I think it's good enough to resolve at this point...
[09:27:43] DBA, decommission-hardware: decommission es1015.eqiad.wmnet - https://phabricator.wikimedia.org/T268810 (Marostegui)
[09:28:44] DBA, decommission-hardware: decommission es1015.eqiad.wmnet - https://phabricator.wikimedia.org/T268810 (Marostegui)
[09:30:47] DBA, decommission-hardware: decommission es1016.eqiad.wmnet - https://phabricator.wikimedia.org/T268812 (Marostegui)
[09:31:14] DBA, decommission-hardware: decommission es1016.eqiad.wmnet - https://phabricator.wikimedia.org/T268812 (Marostegui)
[09:39:10] DBA, decommission-hardware: decommission es1016.eqiad.wmnet - https://phabricator.wikimedia.org/T268812 (Marostegui)
[09:51:08] I have added documentation at: https://wikitech.wikimedia.org/wiki/Mysql.py
[09:53:59] jynus: nice! i just made a small edit to correct one bit
[09:59:19] DBA, Operations, ops-eqiad: Degraded RAID on es1023 - https://phabricator.wikimedia.org/T268796 (wiki_willy) a:wiki_willy→Cmjohnson @Marostegui - for sure. Moving over to @cmjohnson to start the RMA process with Dell (S/N: DTJT513 for a Dell PowerEdge R740xd). Thanks, Willy >>! In T268796#...
[09:59:56] with a combination of those we could start thinking about creating man pages, but sadly, while synchronizing doc.wikimedia.org and packages documentation is trivial, not so much for wikitech pages
[10:00:35] for transfer.py I ended up linking it: https://wikitech.wikimedia.org/wiki/Transfer.py#Usage
[10:04:24] DBA, Operations, Release-Engineering-Team-TODO, Continuous-Integration-Config, and 2 others: Create integration test env for wmfmariadbpy - https://phabricator.wikimedia.org/T265266 (Kormat)
[10:30:10] sobanski: free offsite hosting? https://aws.amazon.com/opendata/open-data-sponsorship-program/
[10:49:06] jynus: sounds like we could also use this for Wiki replicas ;)
[10:49:37] It's only for two years though
[11:37:53] Blocked-on-schema-change, DBA: Schema change for renaming namespace_title index on watchlist - https://phabricator.wikimedia.org/T268004 (Marostegui)
[11:58:15] Blocked-on-schema-change, DBA: Schema change for renaming namespace_title index on watchlist - https://phabricator.wikimedia.org/T268004 (Marostegui)
[12:17:29] DBA, Operations, ops-eqiad: db1139 memory errors on boot (issue continues after board change) 2020-08-27 - https://phabricator.wikimedia.org/T261405 (LSobanski) @Jclark-ctr based on the DC entry schedule, when do you expect you will be able to take a look at this? Knowing this would allow us to bette...
[12:22:11] DBA, Operations, ops-eqiad: db1139 memory errors on boot (issue continues after board change) 2020-08-27 - https://phabricator.wikimedia.org/T261405 (Jclark-ctr) @lsobanski I will be on site Monday
[12:24:04] DBA, DC-Ops, Operations, ops-eqiad: (Need By: 2020-11-29) rack/setup/install db11[51-76] - https://phabricator.wikimedia.org/T267043 (LSobanski) @Cmjohnson Would it be possible to plan for racking 5 instead of 3 of the new hosts in one go? It would help us prepare for Sanitarium host Buster/10.4...
[12:24:32] DBA, Operations, ops-eqiad: db1139 memory errors on boot (issue continues after board change) 2020-08-27 - https://phabricator.wikimedia.org/T261405 (LSobanski) Thanks!
[12:40:33] having an onsite wiki-replica is the wet dream of many companies (google, amazon, etc) RE: amazon free hosting
[12:41:30] Blocked-on-schema-change, DBA: Schema change for renaming namespace_title index on watchlist - https://phabricator.wikimedia.org/T268004 (Marostegui) s8 progress [x] dbstore1005 [] db1126 [x] db1116 [x] db1114 [] db1111 [] db1109 [] db1104 [x] db1101 [x] db1099 [x] db1092 [] db1087
[13:43:29] DBA, decommission-hardware: decommission es1017.eqiad.wmnet - https://phabricator.wikimedia.org/T268825 (Marostegui)
[13:43:41] DBA, decommission-hardware: decommission es1017.eqiad.wmnet - https://phabricator.wikimedia.org/T268825 (Marostegui) a:LSobanski
[13:43:48] \o/
[13:48:51] :D
[14:03:12] DBA, mariadb-optimizer-bug: Investigate possible optimizer regression on 10.4.17 with DELETE statements - https://phabricator.wikimedia.org/T268457 (Marostegui) a:Marostegui
[14:04:09] DBA: Evaluate the impact of changing innodb_change_buffering to inserts - https://phabricator.wikimedia.org/T263443 (Marostegui) This is running by default on all the clouddb hosts.
[14:06:16] DBA: Add a link engineering: Database for link recommendation service - https://phabricator.wikimedia.org/T267214 (Marostegui) @kostajh - reminder we are still waiting on knowing from where this database will be accessed. I could grant 10.64.% or whatever, but if there's something more concrete, that'd be us...
[15:12:30] DBA, Beta-Cluster-Infrastructure, Operations, serviceops, Patch-For-Review: Upgrade the MediaWiki servers to ICU 63 - https://phabricator.wikimedia.org/T264991 (jijiki)
[15:13:19] DBA, Beta-Cluster-Infrastructure, Operations, serviceops, Patch-For-Review: Upgrade the MediaWiki servers to ICU 63 - https://phabricator.wikimedia.org/T264991 (jijiki) Open→Resolved a:jijiki I am marking this as resolved 🎉
[15:58:38] DBA, Community-Tech, Expiring-Watchlist-Items: Watchlist Expiry: Release plan [rough schedule] - https://phabricator.wikimedia.org/T261005 (ifried) @Marostegui Fantastic, thank you so much! We'll update you when the release is complete.