[05:51:14] Blocked-on-schema-change, DBA: Schema change for renaming several indexes in sites table - https://phabricator.wikimedia.org/T270621 (Marostegui)
[05:51:31] Blocked-on-schema-change, DBA: Schema change for renaming several indexes in sites table - https://phabricator.wikimedia.org/T270621 (Marostegui) Open→Resolved All done
[05:58:39] Blocked-on-schema-change, DBA: Increase size of slot_roles.role_id - https://phabricator.wikimedia.org/T270054 (Marostegui) s6 eqiad progress [x] labsdb1012 [x] labsdb1011 [x] labsdb1010 [x] labsdb1009 [x] dbstore1005 [x] db1155 [x] db1140 [x] db1139 [x] db1131 [x] db1125 [x] db1113 [x] db1098 [x] db109...
[05:59:01] Blocked-on-schema-change, DBA: Increase size of slot_roles.role_id - https://phabricator.wikimedia.org/T270054 (Marostegui)
[06:02:42] marostegui: thank you so much!
[06:02:49] We are 86% done with abstraction
[06:02:52] Amir1: :*****
[06:24:57] DBA, ops-codfw: cold reset and upgrade pc2010's idrac - https://phabricator.wikimedia.org/T272337 (Marostegui)
[06:25:07] DBA, ops-codfw: cold reset and upgrade pc2010's idrac - https://phabricator.wikimedia.org/T272337 (Marostegui) p:Triage→Medium
[06:30:37] DBA, ops-codfw: cold reset and upgrade pc2010's idrac - https://phabricator.wikimedia.org/T272337 (Marostegui) Open→Resolved I tried the cold reset from the CLI and looks like I was able to reboot the host from the idrac, so considering this fixed
[06:44:38] Blocked-on-schema-change, DBA: Increase size of slot_roles.role_id - https://phabricator.wikimedia.org/T270054 (Marostegui)
[07:21:29] Blocked-on-schema-change, DBA: Increase size of slot_roles.role_id - https://phabricator.wikimedia.org/T270054 (Marostegui)
[07:23:09] Blocked-on-schema-change, DBA: Increase size of slot_roles.role_id - https://phabricator.wikimedia.org/T270054 (Marostegui)
[07:24:03] DBA, cloud-services-team (Kanban): Move wikireplicas under the new sanitarium hosts (db1154, db1155) - https://phabricator.wikimedia.org/T272008 (Marostegui)
[07:24:28] DBA, cloud-services-team (Kanban): Move wikireplicas under the new sanitarium hosts (db1154, db1155) - https://phabricator.wikimedia.org/T272008 (Marostegui) clouddb1016:3315 and clouddb1020:3315 moved
[07:34:17] Blocked-on-schema-change, DBA: Increase size of slot_roles.role_id - https://phabricator.wikimedia.org/T270054 (Marostegui)
[07:34:38] marostegui: for when you have a bit of time: https://gerrit.wikimedia.org/r/c/operations/puppet/+/657043
[07:40:18] Amir1: I will merge in a sec
[07:44:39] marostegui: thanks. no rush
[07:44:54] Amir1: merged :)
[07:45:11] Thanks. one down 131 to go :D
[07:47:07] When I started cleaning those, they were 508 though
[08:10:19] DBA, GrowthExperiments, Growth-Team (Current Sprint), Patch-For-Review, and 2 others: Slow load times for Special:Homepage on cswiki - https://phabricator.wikimedia.org/T267216 (Marostegui) Open→Resolved Closing this as this query didn't show up in the last 96h Thanks everyone for getting...
[08:17:22] Blocked-on-schema-change, DBA: Increase size of slot_roles.role_id - https://phabricator.wikimedia.org/T270054 (Marostegui)
[09:07:39] while checking for backup alert monitoring, I ran into db1155 prometheus default mysql scraper enabled FYI (I am guessing lots of those hosts are WIP) or I can disable it if that helps
[09:08:51] yep, please disable it if you are still there
[09:08:56] I think I am
[09:08:59] if not I can take care of it
[09:09:27] this is technically a bug on how we use the package
[09:09:43] maybe we can disable it on puppet
[09:10:12] systemd now clean, alerts should recover soon
[09:11:23] the other ongoing issue is that matomo backups got reduced by 114.0%, I may create a ticket for dataEng
[09:12:39] reduced by 114%?
[09:12:45] so they are negative now?
[09:13:18] negative backups - every time they run, we get a new disk added ;)
[09:13:28] yeah, I have to check how the formula works
[09:13:42] but I am guessing it is a rounding error
[09:15:47] I see, the issue is that we calculate the delta of change based on the last backup, not the previous one
[09:16:26] 535M -> 251M
[09:18:31] will fix it on next release, but will need time as we may need to recalibrate thresholds
[09:20:04] DBA, cloud-services-team (Kanban): Move wikireplicas under the new sanitarium hosts (db1154, db1155) - https://phabricator.wikimedia.org/T272008 (Marostegui)
[09:20:30] DBA, cloud-services-team (Kanban): Move wikireplicas under the new sanitarium hosts (db1154, db1155) - https://phabricator.wikimedia.org/T272008 (Marostegui) clouddb1013:3313 and clouddb1017:3313 moved
[09:36:02] Amir1: do you happen to know from top of your head if slot_roles table gets lots of reads?
[09:36:51] marostegui: it doesn't, it's supposed to be cached in NameTableStore
[09:37:02] (in memcached and memory)
[09:37:22] Amir1: cool, I was about to deploy it on enwiki directly on the master, and I noticed no issues with wikidata or commons, but worth asking :)
[09:37:23] thanks
[09:37:49] if things break, I deny everything
[09:38:47] hahaha
[09:39:44] "i deny everything, including this denial"
[09:44:28] do you deny denying that denial?
[09:45:46] https://usercontent.irccloud-cdn.com/file/GjciR1MC/image.png
[09:45:58] :)
[09:53:28] Data-Persistence-Backup, Analytics: Matomo backup size got halved, normally pointing to a backup or underlying data issue - https://phabricator.wikimedia.org/T272344 (jcrespo)
[09:54:10] Data-Persistence-Backup, Analytics: Matomo backup size got halved, normally pointing to a backup or underlying data issue - https://phabricator.wikimedia.org/T272344 (jcrespo) Luca may know matomo the most?
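Editor's note: the two sizes quoted above (535M and 251M) give very different signed percent changes depending on which one is treated as the previous backup. The sketch below is only an illustration of that arithmetic; `percent_change` is a hypothetical helper, not the formula used by the actual backup monitoring, and the sizes are the rounded values from the chat.

```python
def percent_change(previous_size: float, latest_size: float) -> float:
    """Signed percent change from previous_size to latest_size."""
    if previous_size <= 0:
        raise ValueError("previous_size must be positive")
    return (latest_size - previous_size) / previous_size * 100.0


# Sizes in MB, taken from the "535M -> 251M" message above (rounded values).
shrink = percent_change(535, 251)  # reading the pair left to right
growth = percent_change(251, 535)  # reading the pair right to left

print(f"535M -> 251M: {shrink:+.1f}%")
print(f"251M -> 535M: {growth:+.1f}%")
```

A shrink can never go below -100%, so a reported change larger than 100% can only describe growth. With these rounded sizes the growth reading comes out near 113%, so it will not exactly reproduce the 114.0% shown by the alert.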
[10:02:50] I was wrong
[10:03:15] the 114% is actually right, just it doubled, not halved
[10:04:06] I have to learn to read
[10:04:21] and/or improve the error message
[10:04:32] Data-Persistence-Backup, Analytics: Matomo backup size doubled, we should check this is normal operation - https://phabricator.wikimedia.org/T272344 (jcrespo)
[10:05:17] Data-Persistence-Backup, Analytics: Matomo backup size doubled, we should check this is normal operation - https://phabricator.wikimedia.org/T272344 (jcrespo) p:Triage→Low I just realized it doubled, not halved, which would be a way more common operation to happen.
[10:10:02] Data-Persistence-Backup, Analytics: Matomo backup size doubled, we should check this is normal operation - https://phabricator.wikimedia.org/T272344 (jcrespo) piwiki log tables seem to be the main responsible for the growth: {P13830}
[10:14:56] Data-Persistence-Backup, Analytics: Matomo database backup size doubled, we should check this is normal operation - https://phabricator.wikimedia.org/T272344 (jcrespo)
[10:19:28] DBA, Orchestrator: orchestrator: Monitor for non-fqdns in the host resolve cache - https://phabricator.wikimedia.org/T272347 (Kormat) p:Triage→Medium
[10:21:57] DBA, Orchestrator: orchestrator: Monitor for non-fqdns in the host resolve cache - https://phabricator.wikimedia.org/T272347 (Kormat)
[10:22:31] DBA, Orchestrator, SRE, CAS-SSO, User-Kormat: orchestrator: Support SSO - https://phabricator.wikimedia.org/T266106 (Kormat)
[10:36:38] DBA, Orchestrator: orchestrator: Monitor for non-fqdns in the host resolve cache - https://phabricator.wikimedia.org/T272347 (Marostegui) Maybe we could also add a self healing to run the `reset-hostname-resolve-cache` + orchestrator restart if issues are found (maybe in a concrete hour of the day). But...
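Editor's note: the monitoring idea in T272347 above (flagging non-FQDN entries in the host resolve cache) could be approached roughly as follows. This is a minimal sketch under stated assumptions: `is_fqdn` and `find_non_fqdns` are hypothetical helpers, the list of cached hostnames is assumed to have been obtained separately, and nothing here calls orchestrator itself.

```python
import re

# One DNS label: letters, digits, hyphens; no leading or trailing hyphen; 1-63 chars.
_LABEL = re.compile(r"^(?!-)[A-Za-z0-9-]{1,63}(?<!-)$")


def is_fqdn(hostname: str) -> bool:
    """Return True if hostname has at least two valid dot-separated labels."""
    name = hostname.rstrip(".")  # tolerate a trailing root dot
    labels = name.split(".")
    return (
        len(labels) >= 2
        and len(name) <= 253
        and all(_LABEL.match(label) for label in labels)
    )


def find_non_fqdns(hostnames):
    """Return the entries that do not look like fully qualified names."""
    return [h for h in hostnames if not is_fqdn(h)]


# Hypothetical cache contents: one FQDN from the log and one short name.
cached = ["dborch1001.wikimedia.org", "db1155"]
print(find_non_fqdns(cached))
```

The check is purely syntactic, so a dotted IP address would also pass it; a real monitor would probably need to handle that case separately.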
[10:40:19] Blocked-on-schema-change, DBA: Increase size of slot_roles.role_id - https://phabricator.wikimedia.org/T270054 (Marostegui)
[10:53:30] Blocked-on-schema-change, DBA: Increase size of slot_roles.role_id - https://phabricator.wikimedia.org/T270054 (Marostegui)
[11:00:02] any chance there was maintenance tonight on some codfw es2 hosts? es2022 and es2025?
[11:01:58] dumping failed at 00:02:28 with "Lost connection to MySQL server during query" maybe network error?
[11:03:11] jynus: not that I am aware
[11:03:19] jynus: at what time was that?
[11:03:28] 00:02:28
[11:03:34] definitely not me :)
[11:03:37] he he
[11:04:08] I will check log if there was maintenance or any anomaly at that time
[11:05:03] are those on the same rack?
[11:06:02] probably not, but it could be a network error from where they are taken (backup2002)
[11:08:42] I don't see any anomaly on retry
[11:09:35] because it is codfw, I will leave it running during the day, we should have no issue there
[11:21:46] checking mysql stats, it gets reported as Aborted_clients, so not a server problem
[11:22:22] weird
[13:40:59] DBA, Data-Services: Prepare and check storage layer for trwikivoyage - https://phabricator.wikimedia.org/T271261 (Urbanecm) >>! In T271261#6725446, @LSobanski wrote: > Thanks, let us know when the database is created, so we can sanitize it. The database was just created
[14:06:16] DBA, Data-Services: Prepare and check storage layer for trwikivoyage - https://phabricator.wikimedia.org/T271261 (Marostegui) a:Marostegui
[14:06:34] Urbanecm: just ^ today? or should I expect more wikis to be created?
[14:06:46] marostegui: yes, just one wiki today :)
[14:06:52] Urbanecm: sweet, thank you
[14:07:21] DBA, Orchestrator, SRE, CAS-SSO, and 2 others: orchestrator: Support SSO - https://phabricator.wikimedia.org/T266106 (Kormat)
[14:13:14] DBA, Orchestrator, SRE, CAS-SSO, and 2 others: orchestrator: Support SSO - https://phabricator.wikimedia.org/T266106 (Kormat)
[14:15:52] sobanski: https://orchestrator.wikimedia.org/
[14:16:43] \o/
[14:16:48] ohhhh
[14:17:28] can I randomly drag stuff around? :-P
[14:17:48] volans: it's in read-only mode, so... maybe?
[14:18:05] :-)
[14:18:15] volans: as long as you don't drag anything into the bin
[14:18:39] which host is the web server running?
[14:18:41] well done kormat :)
[14:19:07] thanks <3
[14:19:15] jynus: dborch1001.wikimedia.org, a ganeti vim.
[14:19:17] *vm
[14:19:19] thanks
[14:20:03] nice, it already has "orchestrator.wikimedia.org requires authentication"
[14:20:18] that makes me sleep better :-)
[14:21:35] SSO FTW
[14:21:41] yep. getting it behind idp was a prereq for making it available
[14:22:17] did you find your objectives already? planning on deploying wider this Q?
[14:22:27] as in, more clusters?
[14:22:35] also cool work!
[14:23:07] oh, I am guessing it is blocked on the restarts
[14:23:11] jynus: expanding it to cover a misc section is on the roadmap for this Q
[14:23:18] cool!
[14:23:47] sorry, when I see something exciting, I can only think over new possibilities
[14:23:52] DBA, Data-Services: Prepare and check storage layer for trwikivoyage - https://phabricator.wikimedia.org/T271261 (Marostegui) This has been sanitized. Tested with my user and everything works fine, ran check private data on clouddb1013 and 1020 which came back clean. Now waiting for labsdb* hosts to finish i...
[14:23:59] rather than recognizing good work
[14:24:13] I am trying to treat myself of that :-PPP
[14:25:24] also I now understand that many bash scripts maybe could be replaced by orchestrator cli tools
[14:28:40] jynus: yes, especially as soon as we install orchestrator-client on cumin1001, we can start playing with it there and interact with it to see hosts per section etc
[14:29:04] and if the db metadata is sane
[14:29:19] we can move zarcillo to orchestrator
[14:29:35] at least the "what instances we have" part
[14:29:41] jynus: that's a possibility, yes.
[14:31:48] thank you both for working towards that, kormat: you should definitely announce the milestone next week at sre meeting
[14:32:22] good point!
[15:17:35] Data-Persistence-Backup, Analytics: Matomo database backup size doubled, we should check this is normal operation - https://phabricator.wikimedia.org/T272344 (elukey) @jcrespo Thanks a lot for the ping, I'll review the data with @razzi and we'll get back to you asap. Really great alert! I like it :)
[15:37:49] DBA, wikitech.wikimedia.org, cloud-services-team (Kanban): Restart m5 master (db1128) - https://phabricator.wikimedia.org/T272388 (Marostegui)
[15:38:05] DBA, wikitech.wikimedia.org, cloud-services-team (Kanban): Restart m5 master (db1128) - https://phabricator.wikimedia.org/T272388 (Marostegui) p:Triage→Medium
[15:38:11] DBA, wikitech.wikimedia.org, cloud-services-team (Kanban): Restart m5 master (db1128) - https://phabricator.wikimedia.org/T272388 (Marostegui)
[15:38:25] DBA, Orchestrator, User-Kormat: Enable report_host for mariadb - https://phabricator.wikimedia.org/T266483 (Marostegui)
[15:38:27] DBA, wikitech.wikimedia.org, cloud-services-team (Kanban): Restart m5 master (db1128) - https://phabricator.wikimedia.org/T272388 (Marostegui)
[15:43:47] DBA, wikitech.wikimedia.org, cloud-services-team (Kanban): Restart m5 master (db1128) - https://phabricator.wikimedia.org/T272388 (aborrero) I think this works for us, thanks for the heads up. Do you think the downtime should be announced to stakeholders? Wikitech being down seems like something som...
[15:46:15] DBA, wikitech.wikimedia.org, cloud-services-team (Kanban): Restart m5 master (db1128) - https://phabricator.wikimedia.org/T272388 (Marostegui) >>! In T272388#6758254, @aborrero wrote: > I think this works for us, thanks for the heads up. > > Do you think the downtime should be announced to stakehold...
[15:53:27] DBA, wikitech.wikimedia.org, cloud-services-team (Kanban): Restart m5 master (db1128) - https://phabricator.wikimedia.org/T272388 (ssastry) Works for me. Tangentially, we are currently in process of possibly stopping all use of testreduce database for our tests and it is possible we might get it done...
[15:53:58] DBA, wikitech.wikimedia.org, cloud-services-team (Kanban): Restart m5 master (db1128) - https://phabricator.wikimedia.org/T272388 (ssastry) Works for me. Tangentially, we are currently in process of possibly stopping all use of testreduce database for our tests.
[15:54:08] DBA, wikitech.wikimedia.org, cloud-services-team (Kanban): Restart m5 master (db1128) - https://phabricator.wikimedia.org/T272388 (Marostegui) >>! In T272388#6758291, @ssastry wrote: > Works for me. Tangentially, we are currently in process of possibly stopping all use of testreduce database for our...
[15:54:37] DBA, wikitech.wikimedia.org, cloud-services-team (Kanban): Restart m5 master (db1128) - https://phabricator.wikimedia.org/T272388 (jcrespo) One small detail- I am unsure if labswiki use the proxy / has its failover service configured, due to it being handled by mediawiki, so for wikitech it may be a...
[15:55:56] DBA, wikitech.wikimedia.org, cloud-services-team (Kanban): Restart m5 master (db1128) - https://phabricator.wikimedia.org/T272388 (Marostegui) @jcrespo there is no active proxy for m5 - as I stated on the task there will be no reads and no writes.
[16:00:33] DBA, wikitech.wikimedia.org, User-notice, cloud-services-team (Kanban): Restart m5 master (db1128) - https://phabricator.wikimedia.org/T272388 (Marostegui)
[16:08:23] DBA, Data-Services, cloud-services-team (Kanban): Prepare and check storage layer for trwikivoyage - https://phabricator.wikimedia.org/T271261 (Marostegui) a:Marostegui→None This is ready for the views creation. * `_p` database created * grant added to `labsdbuser` role Please #cloud-servic...
[16:18:40] DBA, Orchestrator, SRE, Patch-For-Review, User-Kormat: orchestrator: Puppetize - https://phabricator.wikimedia.org/T265990 (Kormat)
[16:18:54] DBA, Orchestrator, SRE, CAS-SSO, and 2 others: orchestrator: Support SSO - https://phabricator.wikimedia.org/T266106 (Kormat) Open→Resolved
[16:20:21] DBA, Orchestrator, SRE, CAS-SSO, and 2 others: orchestrator: Support SSO - https://phabricator.wikimedia.org/T266106 (Marostegui) Great work!
[18:01:59] DBA, Performance-Team, Platform Engineering Roadmap Decision Making, SRE, User-Kormat: Remove groups from db configs - https://phabricator.wikimedia.org/T263127 (Krinkle) @daniel @WDoranWMF Now that the docs have landed (thanks @nnikkhoui), I believe the next step is removing the obsolete gr...
[18:03:11] DBA, Performance-Team, Platform Engineering Roadmap Decision Making, SRE, User-Kormat: Remove groups from db configs - https://phabricator.wikimedia.org/T263127 (Krinkle)
[22:01:45] DBA, MediaWiki-extensions-Translate, User-brennen, Wikimedia-production-error: Error 1146: Table 'mediawikiwiki.translate_cache' doesn't exist - https://phabricator.wikimedia.org/T272428 (brennen) p:Triage→Unbreak!
[22:04:51] DBA, MediaWiki-extensions-Translate, User-brennen, Wikimedia-production-error: Error 1146: Table 'mediawikiwiki.translate_cache' doesn't exist - https://phabricator.wikimedia.org/T272428 (Jdforrester-WMF) Code defining and using this table was added for {T182433} in https://gerrit.wikimedia.org/r...
[22:12:45] DBA, MediaWiki-extensions-Translate, User-brennen, Wikimedia-production-error: Error 1146: Table 'mediawikiwiki.translate_cache' doesn't exist - https://phabricator.wikimedia.org/T272428 (RhinosF1) I think https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Translate/+/606424/54/utils/MessageUp...
[22:14:43] DBA, DC-Ops, SRE, ops-eqiad: (Need By: 2020-11-29) rack/setup/install db11[51-76] - https://phabricator.wikimedia.org/T267043 (Jclark-ctr) DYT7773 is correct ST for db1156 located last DB server racked in D3 U12
[22:14:47] DBA, MediaWiki-extensions-Translate, User-brennen, Wikimedia-production-error: Error 1146: Table 'mediawikiwiki.translate_cache' doesn't exist - https://phabricator.wikimedia.org/T272428 (Jdforrester-WMF) Aha, yeah, that's it. Creating the table on all wikis is trivial, but I'd like the Language...
[22:15:14] DBA, DC-Ops, SRE, ops-eqiad: (Need By: 2020-11-29) rack/setup/install db11[51-76] - https://phabricator.wikimedia.org/T267043 (Jclark-ctr)
[22:17:08] DBA, MediaWiki-extensions-Translate, User-brennen, Wikimedia-production-error: Error 1146: Table 'mediawikiwiki.translate_cache' doesn't exist - https://phabricator.wikimedia.org/T272428 (Reedy) And a reminder to stop this regressing again in future; `sql/translate_cache.sql` should be added to `...
[22:23:20] DBA, Language-Team, MediaWiki-extensions-Translate, User-brennen, Wikimedia-production-error: Error 1146: Table 'mediawikiwiki.translate_cache' doesn't exist - https://phabricator.wikimedia.org/T272428 (Urbanecm) >>! In T272428#6759850, @Jdforrester-WMF wrote: > Aha, yeah, that's it. > > Cre...
[22:44:12] DBA, Language-Team, MediaWiki-extensions-Translate, User-brennen, Wikimedia-production-error: Error 1146: Table 'mediawikiwiki.translate_cache' doesn't exist - https://phabricator.wikimedia.org/T272428 (brennen) We're coming up on 15:00 Pacific. Per [[https://wikitech.wikimedia.org/wiki/Depl...
[23:01:08] DBA, Language-Team, MediaWiki-extensions-Translate, User-brennen, Wikimedia-production-error: Error 1146: Table 'mediawikiwiki.translate_cache' doesn't exist - https://phabricator.wikimedia.org/T272428 (Ladsgroup) And new tables in production should be coordinated with DBAs before rolling out.