[05:08:23] DBA, Cloud-Services, MW-1.35-notes (1.35.0-wmf.36; 2020-06-09), Platform Team Initiatives (MCR Schema Migration), and 2 others: Apply updates for MCR, actor migration, and content migration, to production wikis. - https://phabricator.wikimedia.org/T238966 (Marostegui) The schema change on `enwiki...
[05:08:58] DBA, Cloud-Services, MW-1.35-notes (1.35.0-wmf.36; 2020-06-09), Platform Team Initiatives (MCR Schema Migration), and 2 others: Apply updates for MCR, actor migration, and content migration, to production wikis. - https://phabricator.wikimedia.org/T238966 (Marostegui) s1 eqiad progress [] labsdb...
[05:09:19] DBA, Cloud-Services, MW-1.35-notes (1.35.0-wmf.36; 2020-06-09), Platform Team Initiatives (MCR Schema Migration), and 2 others: Apply updates for MCR, actor migration, and content migration, to production wikis. - https://phabricator.wikimedia.org/T238966 (Marostegui)
[05:15:09] DBA, Parsoid, Parsoid-Tests: mysqldump of testreduce_vd database on scandium - https://phabricator.wikimedia.org/T258429 (Marostegui) @jcrespo could you handle this?
[05:15:20] DBA, Parsoid, Parsoid-Tests: mysqldump of testreduce_vd database on scandium - https://phabricator.wikimedia.org/T258429 (Marostegui) p:Triage→Medium
[06:15:35] DBA, Patch-For-Review, User-Kormat: Switchover es4 master from es1020 to es1021 - https://phabricator.wikimedia.org/T257847 (Kormat)
[06:54:35] DBA, Patch-For-Review, User-Kormat: Switchover es4 master from es1020 to es1021 - https://phabricator.wikimedia.org/T257847 (Kormat)
[06:55:14] DBA, Patch-For-Review, User-Kormat: Switchover es4 master from es1020 to es1021 - https://phabricator.wikimedia.org/T257847 (Kormat)
[07:01:27] DBA, Patch-For-Review, User-Kormat: Switchover es4 master from es1020 to es1021 - https://phabricator.wikimedia.org/T257847 (Kormat)
[07:15:35] DBA, Patch-For-Review, User-Kormat: Switchover es4 master from es1020 to es1021 - https://phabricator.wikimedia.org/T257847 (Kormat)
[07:29:27] DBA, Patch-For-Review, User-Kormat: Switchover es4 master from es1020 to es1021 - https://phabricator.wikimedia.org/T257847 (Kormat)
[07:35:41] DBA, Patch-For-Review, User-Kormat: Switchover es4 master from es1020 to es1021 - https://phabricator.wikimedia.org/T257847 (Kormat)
[07:35:53] DBA, Patch-For-Review, User-Kormat: Switchover es4 master from es1020 to es1021 - https://phabricator.wikimedia.org/T257847 (Kormat) Open→Resolved All done.
[07:35:55] DBA, Epic, Patch-For-Review, User-Kormat: Upgrade es4 to debian buster + mariadb 10.4 - https://phabricator.wikimedia.org/T257284 (Kormat)
[07:37:07] DBA, Patch-For-Review, User-Kormat: Switchover es4 master from es1020 to es1021 - https://phabricator.wikimedia.org/T257847 (Marostegui) Congratulations on handling your first switchover!
[07:54:32] DBA, Epic, Patch-For-Review, User-Kormat: Upgrade es4 to debian buster + mariadb 10.4 - https://phabricator.wikimedia.org/T257284 (ops-monitoring-bot) Script wmf-auto-reimage was launched by kormat on cumin1001.eqiad.wmnet for hosts: ` ['es1020.eqiad.wmnet'] ` The log can be found in `/var/log/wm...
[08:12:40] aaand yet another machine fails to pxe boot
[08:13:18] so it looks very specific for es hosts, no?
[08:13:25] i think so, yeah
[08:13:26] I haven't seen that on any of the db hosts for now
[08:14:27] https://netbox.wikimedia.org/dcim/devices/?device_type_id=72 - the es hosts are the only ones we have of that model
[08:15:09] yeah, they are very new
[08:15:15] and they have dual ethernets and all that
[08:15:24] they gave some troubles to install them
[08:15:28] with stretch
[08:16:35] manually selecting pxe at the bios worked first time
[08:36:15] DBA, Epic, Patch-For-Review, User-Kormat: Upgrade es4 to debian buster + mariadb 10.4 - https://phabricator.wikimedia.org/T257284 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['es1020.eqiad.wmnet'] ` and were **ALL** successful.
[08:39:55] DBA, Parsoid, Security-Team, Parsoid-Tests, Security: mysqldump of testreduce_vd database on scandium - https://phabricator.wikimedia.org/T258429 (jcrespo) This is physically possible, but moving data outside of the production realm/network (specially to publish on cloud VPS instances) will r...
[08:44:29] DBA, Patch-For-Review, User-Kormat: Switchover es4 master from es1020 to es1021 - https://phabricator.wikimedia.org/T257847 (jcrespo) I will keep using es1022 for backups unless you tell me not to.
[09:03:25] thoughts? https://gerrit.wikimedia.org/r/c/operations/puppet/+/615155
[09:25:24] yeah, makes sense indeed
[09:25:59] we always knew those were to be adjusted based on actual failures
[09:26:31] yeah, definitely
[09:26:39] this is the first time we change them actually, no?
[09:26:51] no, it was changed once before
[09:27:04] I think it was at 1% at first
[09:27:11] ah yes, sounds familiar
[09:27:18] but then when schema changes run, etc, there are larger changes
[09:27:41] yeah, 25% reduction on enwiki.revision table (compressed)
[09:27:44] for example
[09:28:24] e.g. "Last dump for zarcillo at codfw (db2093.codfw.wmnet) taken on 2020-07-21 00:57:01 is 0 GB, but previous one was 1 GB, a change of 99.9%"
[09:28:36] XD
[09:29:28] BTW es4/5 are already 500GB in size
[09:30:11] the backups you mean?
[09:31:04] https://grafana.wikimedia.org/d/000000377/host-overview?panelId=12&fullscreen&orgId=1&from=1579599060886&to=1595323860886&var-server=es2025&var-datasource=thanos&var-cluster=misc
[09:32:00] nice
[09:32:05] steady growth
[09:32:14] we installed them in march?
[09:32:21] I thought it was past year XD
[09:32:28] https://grafana.wikimedia.org/d/000000377/host-overview?panelId=28&fullscreen&orgId=1&from=1583932263230&to=1595323907362&var-server=es2025&var-datasource=thanos&var-cluster=misc
[09:52:57] DBA, Operations, Patch-For-Review, User-Kormat: Database cumin aliases without a matching host - https://phabricator.wikimedia.org/T258376 (Kormat) Open→Resolved All fixed now.
[09:53:43] DBA, Epic, Patch-For-Review, User-Kormat: Upgrade es4 to debian buster + mariadb 10.4 - https://phabricator.wikimedia.org/T257284 (Kormat) Open→Resolved All of es4 is now running buster.
[09:53:45] DBA, Epic: Upgrade WMF database-and-backup-related hosts to buster - https://phabricator.wikimedia.org/T250666 (Kormat)
[11:13:15] DBA, Operations: db1085 crashed - https://phabricator.wikimedia.org/T258360 (Marostegui) Open→Resolved a:Marostegui I have fully repooled this host. It doesn't have a BBU, but s6 doesn't really have much load, so it will probably be able to keep up with replication without issues. Next follo...
[11:13:18] DBA, Operations: db1080-95 batch possibly suffering BBU issues - https://phabricator.wikimedia.org/T258386 (Marostegui)
[11:22:00] DBA, Data-Services, cloud-services-team (Kanban): Prepare and check storage layer for arywiki - https://phabricator.wikimedia.org/T257725 (Marostegui) Check_private_data came back clean: `_p_` database created. Grants for `labsdbuser` role changed. This is ready for views creation by the #cloud-serv...
[11:22:06] DBA, Data-Services, cloud-services-team (Kanban): Prepare and check storage layer for arywiki - https://phabricator.wikimedia.org/T257725 (Marostegui) a:Marostegui→None
[11:22:41] DBA, Data-Services, cloud-services-team (Kanban): Prepare and check storage layer for lijwikisource - https://phabricator.wikimedia.org/T258389 (Marostegui) a:Marostegui→None Check_private_data came back clean: `_p_` database created. Grants for `labsdbuser` role changed. This is ready for v...
[11:29:30] DBA, CheckUser, Growth-Team, Thanks, User-DannyS712: Monitor the growth of CheckUser tables after the addition of Thanks data - https://phabricator.wikimedia.org/T257223 (Marostegui) 21st July: ` -rw-r--r-- 1 dump dump 1.1G Jul 21 00:26 dump.s4.2020-07-21--00-00-01/commonswiki.cu_changes.000...
[11:30:58] DBA, CheckUser, Growth-Team, Thanks, User-DannyS712: Monitor the growth of CheckUser tables after the addition of Thanks data - https://phabricator.wikimedia.org/T257223 (Marostegui)
[12:10:17] jynus: thanks again for saving my ass yesterday. <3
[12:17:08] https://jynus.com/gif/tip_of_the_hat.gifv
[12:17:18] :D
[12:17:33] XDDD
[12:17:51] https://jynus.com/gif/cheers.gifv
[12:18:50] what's the story with the `tendril` database on db2093? we don't replicate it
[12:19:02] we just have it empty
[12:19:08] in case db1115 dies forever
[12:19:08] kormat: maybe consider reprioritizing your goals to start thinking about grant handling
[12:19:11] it's got 4.3k tables
[12:19:46] kormat: yeah, we don't replicate it cause it cannot even keep up with replication
[12:20:03] So we decided to have it there with just the schema, just in case db1115 dies
[12:20:09] ah, i see
[12:20:18] high parallelization + WAN
[12:20:52] jynus: re: grant handling, yeah, that's definitely something with a lot of room for improvement. is there an existing task for it?
[12:20:59] yes
[12:21:38] as I mentioned on doc, I believe that is one of the main blockers for automation of several workflows, including autoprovisioning
[12:22:15] I am not saying you should work on it, just think about it as it is a huge gap for us
[12:22:47] yeah understood
[12:23:06] but maybe the next big automation goal after the switchover thing?
[12:23:46] this is the big thing, but there is no specific actionables yet https://phabricator.wikimedia.org/T146149
[12:24:27] but the general idea is that they used to be handled on puppet and that wasn't great
[12:24:59] yeah puppet doesn't seem like the right tool for this
[12:25:05] indeed
[12:25:33] even if puppet was, there was serious concerns about how it was done, and that is why the ticket is private
[12:26:01] so some preparation work could be actionable already:
[12:26:15] 1) thinking of a general layer/method
[12:26:41] 2) proper storage of secretes (which could be separate from the actuall acounting/checking system)
[12:26:47] *secrets
[12:27:13] 3) mapping client and services, as accounts in mysql relate 2 separate set of hosts normally
[12:27:42] e.g. "otrs needs access to m1"
[12:27:58] and that means maintainging a list of otrs application servers and a list of m1 servers
[12:28:07] * kormat nods
[12:28:16] and that is quite dynamic
[12:28:29] this is beyond db needs
[12:28:48] this is also for backup needs, grants right now are backed up, but very poorly
[12:29:20] of and of course backups == provisioning system for us, so in the end everything is related
[12:29:36] which explains why our provisioning is not yet as simple
[12:30:58] i'm thinking that if we had automatic management of grants that we'd probably want to disable replication of the `mysql` table, and manage this from the outside
[12:31:40] so I am not as worried on a first phase for automation as much as tracking
[12:31:49] but... yeah. there's a large scope for things to improve
[12:31:50] ack
[12:32:07] if we can track and maintain, automation is at least an option
[12:32:13] +1
[12:32:49] maintain and audit would be the first step IMHO, even if not perfect
[12:33:06] marostegui did a first approach for the worst cases
[12:33:22] ah yes, with that horror of a bash script
[12:33:32] well it is better than nothing
[12:33:41] technically yes. :)
[12:34:01] I think checking is not that hard
[12:34:01] so i won't turn him in to the war-crime tribunal _just_ yet
[12:34:16] it is the tracking and inventory of access is what we want to do first
[12:34:49] and that loops in into what we would like zarcillo to be
[12:35:09] both for access and for data: T104459
[12:35:10] T104459: Automate the check and fix of object, schema and data drifts between mediawiki HEAD, production masters and slaves - https://phabricator.wikimedia.org/T104459
[12:36:49] so inventory => automated check => automated handling should probably the steps
[12:36:56] *be
[12:41:34] SGTM
[12:57:48] DBA, Patch-For-Review: Upgrade dbproxyXXXX to Buster - https://phabricator.wikimedia.org/T255408 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['dbproxy1012.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/202007211257_maro...
[13:18:43] DBA, Patch-For-Review: Upgrade dbproxyXXXX to Buster - https://phabricator.wikimedia.org/T255408 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['dbproxy1012.eqiad.wmnet'] ` and were **ALL** successful.
[13:53:24] DBA, CheckUser, Growth-Team, Thanks, User-DannyS712: Monitor the growth of CheckUser tables after the addition of Thanks data - https://phabricator.wikimedia.org/T257223 (Huji)
[13:53:42] DBA, CheckUser, Growth-Team, Thanks, User-DannyS712: Monitor the growth of CheckUser tables after the addition of Thanks data - https://phabricator.wikimedia.org/T257223 (Huji)
[13:55:18] DBA, CheckUser, Growth-Team, Thanks, User-DannyS712: Monitor the growth of CheckUser tables after the addition of Thanks data - https://phabricator.wikimedia.org/T257223 (Huji) @Marostegui is it okay that we are looking at the GZipped file size? Could there be an edge case where the unzipped...
[13:57:49] DBA, CheckUser, Growth-Team, Thanks, User-DannyS712: Monitor the growth of CheckUser tables after the addition of Thanks data - https://phabricator.wikimedia.org/T257223 (Marostegui) We have some alerts on the backups that measure the delta between weekly backups and that has been working fin...
[14:03:31] DBA, CheckUser, Growth-Team, Thanks, User-DannyS712: Monitor the growth of CheckUser tables after the addition of Thanks data - https://phabricator.wikimedia.org/T257223 (Marostegui)
[14:15:59] DBA, CheckUser, Growth-Team, Thanks, User-DannyS712: Monitor the growth of CheckUser tables after the addition of Thanks data - https://phabricator.wikimedia.org/T257223 (Marostegui)
[14:20:38] DBA, CheckUser, Growth-Team, Thanks, User-DannyS712: Monitor the growth of CheckUser tables after the addition of Thanks data - https://phabricator.wikimedia.org/T257223 (Marostegui)
[15:31:57] DBA, Parsoid, Security-Team, Parsoid-Tests, Security: mysqldump of testreduce_vd database on scandium - https://phabricator.wikimedia.org/T258429 (sbassett) The db in question (testreduce_vd) will likely need a #privacy review performed by @JFishback_WMF. The #security-team will plan to tria...
[15:49:47] DBA, Parsoid, Security-Team, Parsoid-Tests, Security: mysqldump of testreduce_vd database on scandium - https://phabricator.wikimedia.org/T258429 (ssastry) Open→Declined Actually, I didn't mean to add all this work on all your plates. This is not that important. I can just create a ne...
[20:02:48] DBA, Product-Infrastructure-Team-Backlog, Push-Notification-Service, Patch-For-Review, User-Marostegui: DBA review for Echo push notification subscription tables - https://phabricator.wikimedia.org/T246716 (Mholloway) Hi @Marostegui, thanks for the merge! Anticipating launching to production...
[20:18:52] DBA, Operations, ops-eqiad: db1145 crashed - memory issues - https://phabricator.wikimedia.org/T258249 (Jclark-ctr) @Marostegui Replacement Dimm has arrived please reach out to me for scheduling down time i am available for the next 2 hours but will be on site tomorrow 9:30am est
[21:03:41] DBA, Platform Team Workboards (Clinic Duty Team), Schema-change, User-DannyS712: iwlinks indexes should be UNIQUE INDEXes - https://phabricator.wikimedia.org/T256842 (eprodromou) OK, accepting this for Clinic Duty.