[00:42:10] PROBLEM - MariaDB sustained replica lag on pc2008 is CRITICAL: 2.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2008&var-port=9104
[00:46:48] RECOVERY - MariaDB sustained replica lag on pc2008 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2008&var-port=9104
[02:10:42] DBA, Analytics, Event-Platform, WMF-Architecture-Team, Services (later): Reliable (atomic) MediaWiki event production / MediaWiki events as source of truth - https://phabricator.wikimedia.org/T120242 (Ladsgroup) >>! In T120242#6914982, @Ottomata wrote: > >> I'd like to see a better explanati...
[04:18:33] DBA, Data-Persistence-Backup, Patch-For-Review: recover-mariadb should use logging (logger) to indicate actions taken - https://phabricator.wikimedia.org/T277162 (Rohitesh-Kumar-Jain) Hi @jcrespo, Thanks for giving detailed comments, I would like to focus on the logger, for now, will work on change...
[05:38:33] DBA, SRE, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui) db1076 -> db1162 transfer is on-going now
[05:39:19] DBA, DC-Ops, SRE, ops-eqiad: Upgrade firmware on db1136 - https://phabricator.wikimedia.org/T277007 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db1136.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/20210316053...
[05:59:22] DBA, DC-Ops, SRE, ops-eqiad: Upgrade firmware on db1136 - https://phabricator.wikimedia.org/T277007 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1136.eqiad.wmnet'] ` and were **ALL** successful.
[06:02:56] Blocked-on-schema-change, DBA: Schema change to make rc_id unsigned and rc_timestamp BINARY - https://phabricator.wikimedia.org/T276150 (Marostegui)
[06:03:00] Blocked-on-schema-change, DBA: Drop default of rc_timestamp - https://phabricator.wikimedia.org/T276156 (Marostegui)
[06:45:20] DBA, DC-Ops, SRE, ops-eqiad, Patch-For-Review: Upgrade firmware on db1136 - https://phabricator.wikimedia.org/T277007 (Marostegui) Open→Resolved Host is being repooled. Thanks!
[06:45:22] DBA: Switchover s7 from db1086 to db1136 - https://phabricator.wikimedia.org/T274336 (Marostegui)
[06:45:24] DBA, SRE: db1080-95 batch possibly suffering BBU issues - https://phabricator.wikimedia.org/T258386 (Marostegui)
[06:45:27] DBA, SRE: Reboot, upgrade firmware and kernel of db1096-db1106, db2071-db2092 - https://phabricator.wikimedia.org/T216240 (Marostegui)
[07:06:35] DBA, SRE, Wikimedia-Mailing-lists: Create databases for mailman3 - https://phabricator.wikimedia.org/T256538 (Marostegui) >>! In T256538#6889860, @Legoktm wrote: >>>! In T256538#6889858, @Marostegui wrote: >> @Ladsgroup are these just testing databases that will be deleted at some point or are these...
[07:52:09] PROBLEM - MariaDB sustained replica lag on pc2008 is CRITICAL: 2.8 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2008&var-port=9104
[07:56:29] DBA, Patch-For-Review: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 (Marostegui)
[07:58:03] RECOVERY - MariaDB sustained replica lag on pc2008 is OK: (C)2 ge (W)1 ge 0.2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2008&var-port=9104
[08:02:19] DBA, Epic: DB alarming updates - https://phabricator.wikimedia.org/T277498 (Peachey88)
[08:06:15] DBA, SRE, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui) I am slowly repooling db1162
[08:07:13] Data-Persistence-Backup, SRE-tools: transfer.py argument parsing exception - https://phabricator.wikimedia.org/T268258 (rafayghafoor) Sorry, @jcrespo, I am interested in this project as I am willing to participate in GSoC this year and would like to work on this task.
[08:22:01] Blocked-on-schema-change, DBA: Drop default of rc_timestamp - https://phabricator.wikimedia.org/T276156 (Marostegui) s8 progress [] labsdb1011 [] labsdb1010 [] labsdb1009 [x] dbstore1005 [] db1172 [] db1154 [] db1126 [] db1124 [] db1116 [] db1114 [] db1111 [] db1109 [] db1104 [] db1101 [] db1099 [] db10...
[08:22:03] Blocked-on-schema-change, DBA: Schema change to make rc_id unsigned and rc_timestamp BINARY - https://phabricator.wikimedia.org/T276150 (Marostegui) s8 progress [] labsdb1011 [] labsdb1010 [] labsdb1009 [x] dbstore1005 [] db1172 [] db1154 [] db1126 [] db1124 [] db1116 [] db1114 [] db1111 [] db1109 [] db...
[08:22:34] Blocked-on-schema-change, DBA: Schema change to make rc_id unsigned and rc_timestamp BINARY - https://phabricator.wikimedia.org/T276150 (Marostegui)
[08:22:44] Blocked-on-schema-change, DBA: Drop default of rc_timestamp - https://phabricator.wikimedia.org/T276156 (Marostegui)
[08:46:22] DBA, Data-Persistence-Backup, Patch-For-Review: recover-mariadb should use logging (logger) to indicate actions taken - https://phabricator.wikimedia.org/T277162 (jcrespo) > So firstly I am required to install a Debian virtual machine on my mac, as I will only be able to run the unit test and build a...
[08:47:18] DBA: Check for errors on all tables on some hosts - https://phabricator.wikimedia.org/T276742 (Marostegui)
[08:53:36] DBA, Data-Persistence-Backup, Patch-For-Review: recover-mariadb should use logging (logger) to indicate actions taken - https://phabricator.wikimedia.org/T277162 (Marostegui) I would not encourage to set up your environment on Mac, I'd recommend you use a virtual machine for that, using for instance...
[09:07:32] Data-Persistence-Backup, SRE-tools, Patch-For-Review: Make recover-dump show the time taken - https://phabricator.wikimedia.org/T277160 (jcrespo) > Regarding decompression, In code review you had mentioned that it'd be interesting to track decompression if that part ran - is it just the tarball decom...
[09:32:13] DBA, Data-Persistence-Backup, Patch-For-Review: recover-mariadb should use logging (logger) to indicate actions taken - https://phabricator.wikimedia.org/T277162 (Rohitesh-Kumar-Jain) Hi @jcrespo & @Marostegui, Thanks for answering my queries, I will try to set up the environment on a virtual machi...
[09:53:25] DBA, SRE, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db1161.eqiad.wmnet'] ` The log ca...
[10:10:50] PROBLEM - MariaDB sustained replica lag on pc2009 is CRITICAL: 2.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2009&var-port=9104
[10:14:38] DBA, SRE, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1161.eqiad.wmnet'] ` and were **ALL** successful.
[10:16:24] RECOVERY - MariaDB sustained replica lag on pc2009 is OK: (C)2 ge (W)1 ge 0.6 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2009&var-port=9104
[10:17:46] DBA: Switchover s7 from db1086 to db1136 - https://phabricator.wikimedia.org/T274336 (Marostegui)
[10:18:29] DBA: Switchover s7 from db1086 to db1136 - https://phabricator.wikimedia.org/T274336 (Marostegui)
[10:25:50] DBA: Switchover s7 from db1086 to db1136 - https://phabricator.wikimedia.org/T274336 (Marostegui) a:Marostegui
[10:31:10] DBA: Switchover s7 from db1086 to db1136 - https://phabricator.wikimedia.org/T274336 (Marostegui) Reserved maintenance window on the Deployments' calendar
[10:41:27] DBA, SRE, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[11:06:58] DBA: Switchover s7 from db1086 to db1136 - https://phabricator.wikimedia.org/T274336 (Marostegui) ` ~# mysql.py -hdb1136 -e "select @@report_host"
+--------------------+
| @@report_host      |
+--------------------+
| db1136.eqiad.wmnet |
+--------------------+
`
[14:00:40] DBA, SRE, Wikimedia-Mailing-lists: Create databases for mailman3 - https://phabricator.wikimedia.org/T256538 (Ladsgroup) >>! In T256538#6916381, @Marostegui wrote: > Do you have some estimations on how long you want to run this test for? For me, hopefully a month or two. Add one or two to be safe.
[14:01:32] DBA, SRE, Wikimedia-Mailing-lists: Create databases for mailman3 - https://phabricator.wikimedia.org/T256538 (Marostegui) >>! In T256538#6917363, @Ladsgroup wrote: >>>! In T256538#6916381, @Marostegui wrote: >> Do you have some estimations on how long you want to run this test for? > > For me, hopef...
[14:38:06] Data-Persistence-Backup, Patch-For-Review: recover-mariadb should use logging (logger) to indicate actions taken - https://phabricator.wikimedia.org/T277162 (Marostegui)
[14:41:43] Data-Persistence-Backup: transfer.py argument parsing exception - https://phabricator.wikimedia.org/T268258 (Marostegui)
[15:06:45] DBA, Analytics, Event-Platform, WMF-Architecture-Team, Services (later): Reliable (atomic) MediaWiki event production / MediaWiki events as source of truth - https://phabricator.wikimedia.org/T120242 (Marostegui) >>! In T120242#6914507, @Ottomata wrote: >> Debezium requires binlog_format=ROW,...
[15:08:02] DBA, Analytics, Event-Platform, WMF-Architecture-Team, Services (later): Reliable (atomic) MediaWiki event production / MediaWiki events as source of truth - https://phabricator.wikimedia.org/T120242 (Ottomata) Hm, @ladsgroup I'm certainly not suggesting that we should ever bypass MediaWiki a...
[15:33:29] DBA, Analytics, Event-Platform, WMF-Architecture-Team, Services (later): Reliable (atomic) MediaWiki event production / MediaWiki events as source of truth - https://phabricator.wikimedia.org/T120242 (Ottomata) > clouddb1021 is owned by Analytics so we can set up ROW there if that's Cool soun...
[16:00:29] PROBLEM - MariaDB sustained replica lag on pc2008 is CRITICAL: 2.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2008&var-port=9104
[16:04:59] RECOVERY - MariaDB sustained replica lag on pc2008 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2008&var-port=9104
[16:05:28] good work pc2008
[17:15:47] kormat: so, i did some investigation on where MW learns about the hostname, and i found that it consults etcd, which apparently does not have fqdn names, per https://noc.wikimedia.org/dbconfig/eqiad.json
[17:16:10] i can make mysql.php get the hostname and not the ip, but it'd be whatever etcd provides
[17:16:56] so...we should probably make etcd provide FQDNs, if that's what's desired
[17:17:46] and also...will the change affect MediaWiki itself (ie. actual queries, not just manual)? I'm wondering how SSL is done there
[17:25:29] i just tried my change live on mwmaint, and it works with the snippet you proposed, but I'm afraid it tries to use www-data's .my.cnf, and i can't edit it to test
[17:37:11] Urbanecm: if it's db1*** then FQDN would always be .eqiad.wmnet though, right?
[17:37:23] RhinosF1: in wikimedia environment, yes
[17:37:38] in my own wiki, it can be db123.my.really.cool.wikifarm.urbanec.cz :)
[17:37:50] Urbanecm: could you not build the FQDN off the first number then for wikimedia
[17:38:00] RhinosF1: not in mediawiki core
[17:38:09] Oh
[17:38:24] but whatever feeds etcd can probably get the FQDN easily
[17:38:26] This has got to be in core :(
[17:38:34] which is what i proposed :)
[17:38:47] Yeah would have to be
[18:08:47] DBA, Analytics, Event-Platform, WMF-Architecture-Team, Services (later): Reliable (atomic) MediaWiki event production / MediaWiki events as source of truth - https://phabricator.wikimedia.org/T120242 (Joe) >>! In T120242#6917670, @Ottomata wrote: >> clouddb1021 is owned by Analytics so we can...
[18:54:25] Blocked-on-schema-change, DBA: Drop default of revactor_timestamp - https://phabricator.wikimedia.org/T267767 (Kormat)
[19:52:41] DBA, Analytics, Event-Platform, WMF-Architecture-Team, Services (later): Reliable (atomic) MediaWiki event production / MediaWiki events as source of truth - https://phabricator.wikimedia.org/T120242 (Ottomata) > I think they will need further discussion (in the tech forum? with the interest...
[20:12:07] DBA, Analytics, Event-Platform, WMF-Architecture-Team, Services (later): Reliable (atomic) MediaWiki event production / MediaWiki events as source of truth - https://phabricator.wikimedia.org/T120242 (Joe) The best practices I am talking about are, basically: - **Don't use the database as an...
[22:16:44] DBA, Analytics, Event-Platform, WMF-Architecture-Team, Services (later): Reliable (atomic) MediaWiki event production / MediaWiki events as source of truth - https://phabricator.wikimedia.org/T120242 (Ottomata) Thanks for responses, I want to respond more in full too, but here's a quick thoug...