[05:49:51] Blocked-on-schema-change, DBA: Schema change for dropping defaults of ipb_timestamp and ipb_expiry - https://phabricator.wikimedia.org/T273358 (Marostegui)
[05:50:11] Blocked-on-schema-change, DBA: Schema change for dropping defaults of ipb_timestamp and ipb_expiry - https://phabricator.wikimedia.org/T273358 (Marostegui) Open→Resolved All done
[06:08:42] DBA, SRE, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db1162.eqiad.wmnet'] ` The log ca...
[06:28:49] DBA, SRE, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1162.eqiad.wmnet'] ` and were **ALL** successful.
[06:43:56] DBA, SRE, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui) Fully pooled: db1170:3312 db1170:3317
[06:44:15] DBA, SRE, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[06:44:55] DBA, decommission-hardware: decommission db1090.eqiad.wmnet - https://phabricator.wikimedia.org/T274333 (Marostegui)
[06:45:17] DBA, SRE, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[06:45:19] DBA, decommission-hardware: decommission db1090.eqiad.wmnet - https://phabricator.wikimedia.org/T274333 (Marostegui)
[06:45:32] DBA, SRE, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[08:01:15] DBA, Orchestrator, Patch-For-Review, User-Kormat: Enable report_host for mariadb - https://phabricator.wikimedia.org/T266483 (Marostegui) s7 eqiad [] labsdb1012 - not needed [] labsdb1011 - not needed [] labsdb1010 - not needed [] labsdb1009 - not needed [x] dbstore1003 [x] db1174 [x] db1170 [x]...
[08:12:05] DBA, Orchestrator, Patch-For-Review, User-Kormat: Enable report_host for mariadb - https://phabricator.wikimedia.org/T266483 (Marostegui)
[08:21:25] DBA, Continuous-Integration-Config, Patch-For-Review, Release-Engineering-Team (CI & Testing services), User-Kormat: Run wmfmariadbpy integration test suite on CI - https://phabricator.wikimedia.org/T261098 (hashar) The CI configuration has been changed and the job Docker image no more contai...
[08:44:32] DBA, Orchestrator, Patch-For-Review, User-Kormat: Enable report_host for mariadb - https://phabricator.wikimedia.org/T266483 (Marostegui)
[08:44:34] DBA, SRE: db1080-95 batch possibly suffering BBU issues - https://phabricator.wikimedia.org/T258386 (Marostegui)
[08:45:15] DBA: Switchover s7 from db1086 to db1136 - https://phabricator.wikimedia.org/T274336 (Marostegui) Waiting for the new kernel to be released for Stretch @MoritzMuehlenhoff
[08:57:50] DBA, SRE, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui) db1162 is now replicating, but I won't pool it until I'm back next week.
[09:47:23] DBA, Continuous-Integration-Config, Patch-For-Review, Release-Engineering-Team (CI & Testing services), User-Kormat: Run wmfmariadbpy integration test suite on CI - https://phabricator.wikimedia.org/T261098 (Kormat) @hashar: i've been working on a new integration framework for wmfmariadbpy ({...
[09:52:01] DBA, Continuous-Integration-Config, Patch-For-Review, Release-Engineering-Team (CI & Testing services), User-Kormat: Run wmfmariadbpy integration test suite on CI - https://phabricator.wikimedia.org/T261098 (jcrespo) I am very happy with @Kormat proposals and the direction she is taking the r...
[10:08:07] orchestrator will complain about db1081 as I am depooling it, please do not do anything, I want to see if orchestrator successfully forgets that host after a few hours/days
[10:08:19] (as it is supposed to)
[10:10:15] It should take 10 days from what I can see, we can always reduce it. Let's see if it works anyways
[10:14:51] DBA, decommission-hardware, Patch-For-Review: decommission db1081.eqiad.wmnet - https://phabricator.wikimedia.org/T273040 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin1001 for hosts: `db1081.eqiad.wmnet` - db1081.eqiad.wmnet (**PASS**) - Downtimed host on Icinga...
[10:15:39] DBA, decommission-hardware, Patch-For-Review: decommission db1081.eqiad.wmnet - https://phabricator.wikimedia.org/T273040 (Marostegui) This is ready for DCOps!
[10:16:38] DBA, SRE, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[10:48:12] I've tuned the "mysql-aggregated" graph in 2 ways: added cloud dbs to the list of exclusions from the memory panel
[10:48:42] and added the number of kill commands to the errors (to match the errors on the per-instance graph)
[10:49:41] I know kills are not errors, but they are normally produced by the query killer. I can revert if you don't like it
[10:51:27] maybe I can rename the panel to "Errors and query kills"
[11:30:26] Blocked-on-schema-change: Schema change for dropping default of img_timestamp - https://phabricator.wikimedia.org/T273360 (Marostegui) p:Triage→Medium
[11:30:59] Blocked-on-schema-change: Schema change for dropping default of img_timestamp - https://phabricator.wikimedia.org/T273360 (Marostegui)
[12:04:33] Data-Persistence-Backup, SRE, netops: Understand (and mitigate) the backup speed differences between backup1002->backup2002 and backup2002->backup1002 - https://phabricator.wikimedia.org/T274234 (jcrespo) It looks congestion-dependent? It peaks around ~22 UTC and improves at ~6 UTC: https://grafana-...
[13:17:54] Data-Persistence-Backup, SRE, netops: Understand (and mitigate) the backup speed differences between backup1002->backup2002 and backup2002->backup1002 - https://phabricator.wikimedia.org/T274234 (jcrespo)
[14:06:14] "this is a simple refactoring", I said, "It is just renaming two lines on site.pp" https://gerrit.wikimedia.org/r/c/operations/puppet/+/662740/9/manifests/site.pp
[14:06:46] :D
[14:06:47] ^how it started. How it is going -> https://gerrit.wikimedia.org/r/c/operations/puppet/+/662740
[14:10:44] hehe
[14:31:12] DBA, Orchestrator, Patch-For-Review, User-Kormat: Enable communication between orchestrator and clouddb hosts - https://phabricator.wikimedia.org/T273606 (Marostegui) @Kormat once the above patch is pushed, let's not force orchestrator to discover the new hosts, I want to see if it will actually...
[15:05:03] DBA: transfer.py fails when copying data between es hosts - https://phabricator.wikimedia.org/T262388 (jcrespo) This could explain random network errors between some hosts: T274234
[15:05:37] Data-Persistence-Backup, SRE, netops: Understand (and mitigate) the backup speed differences between backup1002->backup2002 and backup2002->backup1002 - https://phabricator.wikimedia.org/T274234 (jcrespo)
[18:40:25] db2121 MariaDB sustained replica lag CRITICAL 6.266e+05 ge 2 🤔
[18:41:07] now 0, that's a weird fluke
[21:57:23] DBA, Patch-For-Review, Performance-Team (Radar): Productionize x2 databases - https://phabricator.wikimedia.org/T269324 (Krinkle) >>! In T269324#6794983, @Marostegui wrote: >>>! In T269324#6794656, @Krinkle wrote: >> [...] There are broadly two options: >> >> * Use only one of them for all reads/wri...