[05:19:03] 10DBA, 10Cloud-Services, 10CPT Initiatives (MCR Schema Migration), 10Core Platform Team Workboards (Clinic Duty Team), and 2 others: Apply updates for MCR, actor migration, and content migration, to production wikis. - https://phabricator.wikimedia.org/T238966 (10ops-monitoring-bot) Script wmf-auto-reimage... [05:25:28] 10DBA, 10CheckUser, 10Trust-and-Safety, 10WMF-Legal, and 2 others: Set wgCheckUserLogLogins to true on WMF wikis to log successful and unsuccessful login attempts in CheckUser - https://phabricator.wikimedia.org/T253802 (10Marostegui) Yes, having an estimation on how many extra rows we'll store will be hel... [05:39:30] 10DBA, 10Cloud-Services, 10CPT Initiatives (MCR Schema Migration), 10Core Platform Team Workboards (Clinic Duty Team), and 2 others: Apply updates for MCR, actor migration, and content migration, to production wikis. - https://phabricator.wikimedia.org/T238966 (10ops-monitoring-bot) Completed auto-reimage... [07:21:38] 10DBA, 10Patch-For-Review: Make partman/custom/no-srv-format.cfg work - https://phabricator.wikimedia.org/T251768 (10Kormat) @ArielGlenn: I had a look at your use-case, and made a more general solution that should cover your needs as well: https://gerrit.wikimedia.org/r/c/operations/puppet/+/601761 [07:23:30] 10DBA, 10Operations: In-place conversion from LVM to normal partition - https://phabricator.wikimedia.org/T252195 (10Kormat) 05Open→03Stalled This is on-hold for now. It looks like we don't need to get rid of lvm (https://gerrit.wikimedia.org/r/c/operations/puppet/+/601761). [07:23:33] 10DBA, 10Patch-For-Review: Make partman/custom/no-srv-format.cfg work - https://phabricator.wikimedia.org/T251768 (10Kormat) [07:26:29] 10DBA, 10Operations, 10ops-eqiad, 10Wikimedia-Incident: db1138 (s4 master) crashed due to memory issues - https://phabricator.wikimedia.org/T253808 (10Marostegui) 05Open→03Resolved Host repooled. All done. Thanks John for replacing the memory! 
[07:41:10] 10DBA: Productionize db213[6-9] and db2140 - https://phabricator.wikimedia.org/T252985 (10Kormat) 05Open→03Resolved Marking as resolved - the remaining bits are blocked by labsdb not being on 10.4 yet, and will be taken care of as part of decommissioning the old hosts once that block is cleared. [07:41:12] 10DBA, 10DC-Ops, 10Operations, 10ops-codfw: (Need By: 31st May) rack/setup/install db213[6-9] and db2140 - https://phabricator.wikimedia.org/T251639 (10Kormat) [08:34:27] 10DBA, 10Cloud-Services, 10CPT Initiatives (MCR Schema Migration), 10Core Platform Team Workboards (Clinic Duty Team), and 2 others: Apply updates for MCR, actor migration, and content migration, to production wikis. - https://phabricator.wikimedia.org/T238966 (10Marostegui) Self note, db2124 from T238966#... [09:03:47] oh, I was still away [09:05:10] recovering 1.5TB of data over 1G is a pain [09:06:41] 10DBA, 10Epic, 10Patch-For-Review: Upgrade WMF database-and-backup-related hosts to buster - https://phabricator.wikimedia.org/T250666 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db1080.eqiad.wmnet'] ` The log can be found in `/var/log/wmf... [09:06:55] 10DBA, 10Patch-For-Review: Upgrade parsercache to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T252182 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by kormat on cumin1001.eqiad.wmnet for hosts: ` ['pc2009.codfw.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/202006030... [09:27:19] 10DBA, 10Epic, 10Patch-For-Review: Upgrade WMF database-and-backup-related hosts to buster - https://phabricator.wikimedia.org/T250666 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1080.eqiad.wmnet'] ` and were **ALL** successful. [09:30:23] question to any of you, how did the grants + 10.4 upgrade end up? Is there any specific procedure needed, aside from mysql_upgrade? 
[09:30:49] and aside from DELETE HISTORY, already documented [09:32:09] no, I haven't seen anything else apart from the DELETE HISTORY [09:32:23] so it is only weird for roles, I am guessing? [09:32:35] yeah, I opened the bug with mariadb about it [09:33:20] https://jira.mariadb.org/browse/MDEV-22645 [09:33:48] yeah, still waiting for an answer [09:40:56] 10DBA: Upgrade parsercache to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T252182 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['pc2009.codfw.wmnet'] ` and were **ALL** successful. [10:20:23] 10DBA: Relocate "old" s4 hosts - https://phabricator.wikimedia.org/T253217 (10Marostegui) [10:22:56] 10DBA: Productionize db114[1-9] - https://phabricator.wikimedia.org/T252512 (10Marostegui) [10:22:59] 10DBA: Relocate "old" s4 hosts - https://phabricator.wikimedia.org/T253217 (10Marostegui) 05Stalled→03Open [10:23:33] 10DBA, 10DC-Ops, 10Operations, 10ops-eqiad: db1140 (backup source) crashed - https://phabricator.wikimedia.org/T250602 (10jcrespo) 05Open→03Resolved a:05jcrespo→03Jclark-ctr db1140 has been repopulated from dbprov snapshots of s1 and s6 and upgraded to 10.4. It has been added to tendril and zarcil... [10:24:00] ^almost 2 months to handle a board change [10:25:36] I will now try snapshots and dumps from 10.4, will call those "sX-10.4" on dbprov hosts [10:26:58] wow, it's been two months already? 
[10:27:06] time flies a lot faster lately :( [10:27:30] actually, will call them sX-104, as we use the dot to separate fields in names, just in case [10:28:42] although now that I realize, I can do dumps with mydumper [10:29:00] but I don't have any dbprov/xtrabackup with 10.4 [10:38:25] how much space do you need for a testing dbprov on 10.4? [10:38:38] Maybe I can suggest a temporary host [11:07:18] 10DBA, 10CheckUser, 10Trust-and-Safety, 10WMF-Legal, and 2 others: Set wgCheckUserLogLogins to true on WMF wikis to log successful and unsuccessful login attempts in CheckUser - https://phabricator.wikimedia.org/T253802 (10Ladsgroup) Looking at [[https://grafana.wikimedia.org/d/000000385/loginnotify?orgId=... [11:07:30] 10DBA, 10Cloud-Services, 10CPT Initiatives (MCR Schema Migration), 10Core Platform Team Workboards (Clinic Duty Team), and 2 others: Apply updates for MCR, actor migration, and content migration, to production wikis. - https://phabricator.wikimedia.org/T238966 (10Marostegui) [11:07:57] 10DBA, 10Cloud-Services, 10CPT Initiatives (MCR Schema Migration), 10Core Platform Team Workboards (Clinic Duty Team), and 2 others: Apply updates for MCR, actor migration, and content migration, to production wikis. - https://phabricator.wikimedia.org/T238966 (10Marostegui) >>! In T238966#6188293, @gerrit... [11:25:49] 10DBA, 10CheckUser, 10Trust-and-Safety, 10WMF-Legal, and 2 others: Set wgCheckUserLogLogins to true on WMF wikis to log successful and unsuccessful login attempts in CheckUser - https://phabricator.wikimedia.org/T253802 (10Ladsgroup) login notify is not great, it only counts login from a new device or unsu... [11:27:19] 10DBA, 10CheckUser, 10Trust-and-Safety, 10WMF-Legal, and 2 others: Set wgCheckUserLogLogins to true on WMF wikis to log successful and unsuccessful login attempts in CheckUser - https://phabricator.wikimedia.org/T253802 (10Reedy) The other question is how big are the rows? 
[11:34:08] 10DBA, 10CheckUser, 10Trust-and-Safety, 10WMF-Legal, and 2 others: Set wgCheckUserLogLogins to true on WMF wikis to log successful and unsuccessful login attempts in CheckUser - https://phabricator.wikimedia.org/T253802 (10Marostegui) Thanks @Ladsgroup for the figures. If we are looking at doubling the siz... [11:35:11] dumps are running now, I may ask for a temporary host when they finish [11:35:46] 10DBA, 10Cloud-Services, 10CPT Initiatives (MCR Schema Migration), 10Core Platform Team Workboards (Clinic Duty Team), 10Schema-change: Apply updates for MCR, actor migration, and content migration, to production wikis. - https://phabricator.wikimedia.org/T238966 (10Marostegui) >>! In T238966#6187900, @M... [11:35:46] cool [11:46:48] 10DBA, 10CheckUser, 10Trust-and-Safety, 10WMF-Legal, and 2 others: Set wgCheckUserLogLogins to true on WMF wikis to log successful and unsuccessful login attempts in CheckUser - https://phabricator.wikimedia.org/T253802 (10Reedy) I don't think there's anymore indexes; things still fit in the existing queri... [11:49:53] 10DBA, 10CheckUser, 10Trust-and-Safety, 10WMF-Legal, and 2 others: Set wgCheckUserLogLogins to true on WMF wikis to log successful and unsuccessful login attempts in CheckUser - https://phabricator.wikimedia.org/T253802 (10Marostegui) If we could get a better estimation on either wikidatawiki or enwiki wor... [13:13:34] I realized that I do have a pair of dbprov role hosts already: backup1002 and backup2002, I will use those [13:14:07] ok, I might have some hosts to give you if you need a temp one [13:14:21] I think those should work [13:40:49] marostegui: pc2009 catching up on replication today was managing... 1s/s. 
[13:41:32] https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-dc=codfw%20prometheus%2Fops&var-server=pc2009&var-port=9104&from=1591176977755&to=1591181136761&fullscreen&edit&panelId=6 [13:41:47] so i'm inclined to give a full day for pc1009, just in case anything goes wrong [13:41:58] pc1009 won't have lag [13:42:02] as it doesn't replicate from anything :) [13:42:12] 10DBA, 10CheckUser, 10Trust-and-Safety, 10WMF-Legal, and 2 others: Set wgCheckUserLogLogins to true on WMF wikis to log successful and unsuccessful login attempts in CheckUser - https://phabricator.wikimedia.org/T253802 (10Huji) I think @Reedy has a valid point: the numbers from logstash and grafana are fo... [13:42:19] * kormat gently smacks forehead [13:42:28] i'm so used to dealing with non-master nodes :) [13:42:33] pc is confusing :) [13:42:43] to be fair [13:42:56] in theory, in an ideal state, all pc hosts would replicate [13:43:27] marostegui: in that case, yolo! [13:43:29] as when setup the idea was to create a master-master topology [13:43:36] even if pc1009 replicate, it wouldn't have lag, as nothing gets inserted on pc2009 [13:43:40] yep [13:44:48] FYI, I renamed zarcillo on db1115 to zarcillo_do_not_use_use_db2093_instead [13:45:07] will delete after a few weeks, we have backups [13:45:14] +1 [13:45:27] I almost made the mistake to edit that [13:47:47] another FYI to the DBAs: https://gerrit.wikimedia.org/r/c/operations/puppet/+/602080 [13:48:07] I am converting backupX002 into full dbprovs, but with no snapshots scheduled [13:48:20] so it can be used as temporary storage in case of disaster [13:48:42] I will test it right away to check 10.4 patch for transfer works [13:49:12] so if dbprov doesn't work for you or you need a buster host / 10.4 for it, you have one host on each dc to use [14:00:59] 10DBA, 10Data-Services, 10cloud-services-team (Kanban): Prepare and check storage layer for gomwiktionary - https://phabricator.wikimedia.org/T250706 (10Andrew) 05Open→03Resolved 
a:03Andrew I always feel like I'm flying blind when I do this, but I've run the steps on https://wikitech.wikimedia.org/wiki... [14:01:06] 10DBA, 10Data-Services, 10cloud-services-team (Kanban): Prepare and check storage layer for awawiki - https://phabricator.wikimedia.org/T251410 (10Andrew) 05Open→03Resolved a:03Andrew wmcs steps done [14:05:09] 10DBA, 10Data-Services, 10cloud-services-team (Kanban): Prepare and check storage layer for gomwiktionary - https://phabricator.wikimedia.org/T250706 (10Marostegui) 05Resolved→03Open This needs to be run on labsdb1011 as well. The reason why it needs to be run there is because it is being recovered from... [14:06:01] 10DBA, 10Patch-For-Review: Upgrade parsercache to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T252182 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by kormat on cumin1001.eqiad.wmnet for hosts: ` ['pc1009.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/202006031... [14:06:41] 10DBA, 10Data-Services, 10cloud-services-team (Kanban): Prepare and check storage layer for awawiki - https://phabricator.wikimedia.org/T251410 (10Marostegui) 05Resolved→03Open This needs to be run on labsdb1011 as well. The reason why it needs to be run there is because it is being recovered from severa... [14:31:36] 10DBA: Upgrade parsercache to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T252182 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['pc1009.eqiad.wmnet'] ` and were **ALL** successful. [14:31:48] careful, you logged !log stopping replication on pc1010 [14:31:54] but stopped it on pc2010 [14:32:12] oh crap. you're right. 
[14:32:31] no production impact, just FYI [14:32:49] this whole procedure is a giant mess from my PoV [14:33:13] I think you just need more context [14:33:26] that will come with time, don't worry [14:33:31] i'm sure that will help, [14:33:43] but i still maintain that the complexity is too damn high [14:34:12] aand i broke tendril [14:34:20] because i removed pc1009 [14:34:24] nah [14:34:31] just reload [14:34:36] I don't think stopping replication on pc1010 is needed, but up to you [14:34:57] just mentioned it because of the mismatch between log and action [14:35:00] kormat: pc1010 is going to page [14:35:10] unless it was downtimed for replication thread [14:35:20] it wasn't [14:35:46] it is probably still on warning state, so you might be on time :) [14:35:52] not sure about that, pt-heartbeat should be smart [14:35:56] about the local heartbeat [14:36:04] check icinga, it is on warning [14:36:29] yeah, the replica status, but that will be warning forever [14:36:32] and replication lag doesn't really matter on parsercache on MW level, but anyways, it is handled so I will keep doing other things [14:36:41] lag is based on heartbeat and it shouldn't page [14:36:47] but not 100% sure [14:38:38] sadly, pc2010 keeps lagging, not sure why now [14:40:29] 10DBA, 10Data-Services, 10cloud-services-team (Kanban): Prepare and check storage layer for gomwiktionary - https://phabricator.wikimedia.org/T250706 (10Andrew) I've run maintain-replica-indexes and maintain-views on labsdb1011; the maintain_meta_p command fails: ` andrew@labsdb1011:~$ sudo /usr/local/sbin... 
[14:40:47] oh, I know why [14:40:59] because of pc1010 coldness [14:41:11] I said so earlier on -operations [14:41:21] sorry, I read it as cache coldness [14:41:26] of the local host [14:41:33] pc2020 [14:41:44] I didn't realize that pc1010 was cold too [14:45:24] so I inherited mariadb_replication_check.pl [14:45:32] and did more fixes of my own [14:45:49] the reason why we still keep the perl script is that it is quite reliable [14:46:04] lots of logic in the small check to prevent false positives [14:46:46] in this case, it notices that replication from pc1 stopped, but it still has the local replication heartbeat, so it didn't alert- but warned about the "strange" stopped slave [14:49:35] 10DBA, 10Data-Services, 10cloud-services-team (Kanban): Prepare and check storage layer for gomwiktionary - https://phabricator.wikimedia.org/T250706 (10Marostegui) labsdb1011 is still reimporting lots of data, so maybe we should wait till it is already finished - some moving pieces for that script might not... [14:50:29] 10DBA, 10Cloud-Services, 10CPT Initiatives (MCR Schema Migration), 10Core Platform Team Workboards (Clinic Duty Team), and 2 others: Apply updates for MCR, actor migration, and content migration, to production wikis. - https://phabricator.wikimedia.org/T238966 (10daniel) >>! In T238966#6188395, @Marostegui... [14:53:30] 10DBA: Upgrade parsercache to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T252182 (10Kormat) pc3 is done now too. Let us never speak of this again. 
[14:53:37] 10DBA: Upgrade parsercache to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T252182 (10Kormat) 05Open→03Resolved [14:53:39] 10DBA, 10Epic: Upgrade WMF database-and-backup-related hosts to buster - https://phabricator.wikimedia.org/T250666 (10Kormat) [14:54:40] 10DBA: Degraded performance on parsercache with buster/mariadb upgrade - https://phabricator.wikimedia.org/T252761 (10jcrespo) [14:55:11] 10DBA: Research changes on prometheus-mysql-exporter after buster/mariadb upgrade - https://phabricator.wikimedia.org/T252761 (10jcrespo) [14:55:25] 10DBA: Research changes on prometheus-mysqld-exporter after buster/mariadb upgrade - https://phabricator.wikimedia.org/T252761 (10jcrespo) [14:55:41] 10DBA: Research changes on prometheus-mysqld-exporter after buster/mariadb upgrade - https://phabricator.wikimedia.org/T252761 (10jcrespo) [14:55:43] 10DBA, 10Epic: Upgrade WMF database-and-backup-related hosts to buster - https://phabricator.wikimedia.org/T250666 (10jcrespo) [14:58:40] no need to say T252182#6189217 [14:58:40] T252182: Upgrade parsercache to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T252182 [14:58:43] you did good [15:00:07] I am wondering if we should buy an extra pc host or improve the warmup of the standby host by using multisource or something [15:00:45] weaknesses are tracked on T133523 [15:00:45] T133523: Decide how to improve parsercache replication, sharding and HA - https://phabricator.wikimedia.org/T133523 [15:02:34] I only mentioned the difference between message and action because it looked like a silly mistake, hopefully I didn't stress you [16:02:11] 10DBA, 10MediaWiki-General, 10TechCom-RFC, 10Performance-Team (Radar): RFC: Discourage use of MySQL's ENUM type - https://phabricator.wikimedia.org/T119173 (10dbarratt) According to [[ https://www.wikidata.org/wiki/Q96032086 | SQL Antipatterns ]] using an `ENUM` is an antipattern unless the values are set... 
[16:31:04] 10Blocked-on-schema-change, 10DBA: CentralNotice: Update DB schema on Meta for new features - https://phabricator.wikimedia.org/T254371 (10AndyRussG) [16:32:16] 10Blocked-on-schema-change, 10DBA: CentralNotice: Update DB schema on Meta for new features - https://phabricator.wikimedia.org/T254371 (10AndyRussG) [17:11:37] 10DBA, 10Data-Services, 10cloud-services-team (Kanban): Prepare and check storage layer for gomwiktionary - https://phabricator.wikimedia.org/T250706 (10Andrew) works for me! Can you ping on this task when ready? Or do you have an eta? [20:22:11] 10DBA, 10CheckUser, 10Trust-and-Safety, 10WMF-Legal, and 2 others: Set wgCheckUserLogLogins to true on WMF wikis to log successful and unsuccessful login attempts in CheckUser - https://phabricator.wikimedia.org/T253802 (10Ladsgroup) > For bots, although they sometimes login more frequently than human user... [21:48:44] 10DBA, 10CheckUser, 10Trust-and-Safety, 10WMF-Legal, and 2 others: Set wgCheckUserLogLogins to true on WMF wikis to log successful and unsuccessful login attempts in CheckUser - https://phabricator.wikimedia.org/T253802 (10Huji) >>! In T253802#6190510, @Ladsgroup wrote: >> For bots, although they sometimes...