[06:47:42] DBA, decommission-hardware, Patch-For-Review: decommission es1015.eqiad.wmnet - https://phabricator.wikimedia.org/T268810 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin1001 for hosts: `es1015.eqiad.wmnet` - es1015.eqiad.wmnet (**PASS**) - Downtimed host on Icinga...
[07:25:55] DBA, decommission-hardware, Patch-For-Review: decommission es1015.eqiad.wmnet - https://phabricator.wikimedia.org/T268810 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin1001 for hosts: `es1015.eqiad.wmnet` - es1015.eqiad.wmnet (**FAIL**) - **Failed downtime host o...
[07:31:41] Blocked-on-schema-change, DBA: Schema change for renaming namespace_title index on watchlist - https://phabricator.wikimedia.org/T268004 (Marostegui)
[07:34:23] Blocked-on-schema-change, DBA: Schema change for renaming namespace_title index on watchlist - https://phabricator.wikimedia.org/T268004 (Marostegui) s1 eqiad progress [x] dbstore1003 [x] db1140 [] db1139 [] db1135 [] db1134 [x] db1119 [] db1118 [] db1106 [x] db1105 [x] db1099 [] db1089 [x] db1084 [] db...
[07:45:19] DBA, decommission-hardware, Patch-For-Review: decommission es1015.eqiad.wmnet - https://phabricator.wikimedia.org/T268810 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin1001 for hosts: `es1015.eqiad.wmnet` - es1015.eqiad.wmnet (**FAIL**) - **Failed downtime host o...
[07:49:27] DBA, decommission-hardware, Patch-For-Review: decommission es1015.eqiad.wmnet - https://phabricator.wikimedia.org/T268810 (Marostegui)
[07:49:33] DBA, decommission-hardware, Patch-For-Review: decommission es1015.eqiad.wmnet - https://phabricator.wikimedia.org/T268810 (Marostegui) a:Marostegui→wiki_willy
[07:49:46] DBA, decommission-hardware, Patch-For-Review: decommission es1015.eqiad.wmnet - https://phabricator.wikimedia.org/T268810 (Marostegui) Ready for #dc-ops
[08:25:18] DBA, Data-Services, Patch-For-Review, cloud-services-team (Kanban): Productionize clouddb10[13-20] - https://phabricator.wikimedia.org/T267090 (Marostegui)
[08:27:18] DBA, Data-Services, Patch-For-Review, cloud-services-team (Kanban): Productionize clouddb10[13-20] - https://phabricator.wikimedia.org/T267090 (Marostegui)
[08:27:50] DBA, Data-Services, Patch-For-Review, cloud-services-team (Kanban): Productionize clouddb10[13-20] - https://phabricator.wikimedia.org/T267090 (Marostegui)
[08:53:38] jynus: db1133 will be on s1 for a long time? Asking cause it is not showing up on ./section s1, I can fix that if you want me to. It is useful to track schema changes
[08:53:59] let me see
[08:54:48] normally I add it to tendril and zarcillo, maybe I made a mistake?
[08:55:11] jynus: don't know, not a big deal, just saw it when tracking a schema change on s1, do you want me to fix it?
[08:55:55] I am working on it
[08:56:04] oh, thanks
[08:56:21] ah, I know why
[08:56:26] it is on test-s1
[08:56:27] not s1
[08:56:40] so ./section test-s1
[08:57:08] yeah, but normally I just do sX for schema changes
[08:57:26] it cannot be on s1 because otherwise it misleads how mediawiki people use the prometheus graphs
[08:57:27] on codfw isn't a problem as the schema change is deployed via master
[08:57:45] so up to you, but it was changed because service ops complained
[08:57:51] :(
[08:58:26] so there is an issue with groups
[08:58:54] we need 2 classifications: "is s1 the replica group" and "is s1 the mediawiki classification"
[08:59:20] we can update section to check for test-(section)? would that work?
[08:59:42] yeah, I think that'd be enough
[08:59:51] and later rethink how we want to classify things
[09:00:02] one that works for both uses
[09:00:09] replication set and group
[09:00:57] I would wait to see how orchestrator does things
[09:01:09] and remodel it around it, maybe, what do you think?
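Editor's note: the "check for test-(section)" idea above can be sketched roughly as follows. This is only an illustration, not the actual `section` script or the Gerrit change; the `HOSTS` mapping and the `hosts_in_section` helper are hypothetical stand-ins for whatever the real tool reads from zarcillo, and the host labels are taken from this log purely as sample data.

```python
# Hypothetical inventory: host -> section label (assumed data, for illustration only).
HOSTS = {
    "db1140": "s1",
    "db1119": "s1",
    "db1133": "test-s1",  # replicates from s1 but is labelled test-s1
    "db1077": "pc1",
}


def hosts_in_section(section, inventory=HOSTS):
    """Return hosts labelled `section` or `test-<section>`, sorted by name."""
    wanted = {section, f"test-{section}"}
    return sorted(host for host, label in inventory.items() if label in wanted)


if __name__ == "__main__":
    # A schema-change run over s1 now also sees the test-s1 host.
    print(hosts_in_section("s1"))
```

With this kind of lookup a test-sX host would show up when tracking a schema change on sX, while its label stays distinct for monitoring that groups hosts by mediawiki section.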
[09:01:36] sure, as long as we never put test-sX hosts to serve production
[09:01:45] as they might (or might not) miss schema changes
[09:01:59] I think that is the whole point of being in a different section :-)
[09:02:17] maybe we can rename the command from section to replica-set
[09:02:17] yes, I mean as a mistake: "oh this host is replicating from s1, let's put it to help with the load"
[09:02:39] so one means "hosts that replicate from one another"
[09:02:48] and the other "mediawiki group"
[09:03:09] as both classifications may be useful, depending on the action
[09:03:31] I don't think renaming the command solves anything
[09:03:40] The possibility of missing the host is the same
[09:05:13] Anyways, if we never use the host to serve production without re-installing it, I guess then we are ok
[09:05:15] well, maybe we could change replication-tree into working as section
[09:05:27] so we don't even have to query tendril
[09:06:09] https://gerrit.wikimedia.org/r/c/operations/software/+/644173
[09:07:02] that works
[09:07:05] +1ed - thanks
[09:10:02] let me test it before merging
[09:10:12] I tested the query manually on zarcillo
[09:10:17] ah
[09:10:20] and it worked?
[09:10:22] yep
[09:10:30] cool, thanks
[09:10:35] sorry, it is indeed confusing
[09:10:49] but the issue is that sometimes "group of mediawiki hosts" is requested
[09:11:08] and sometimes "list of hosts that replicate from primary" is
[09:11:20] and it is difficult to find the right env for each group
[09:12:11] we should probably add section to wmfmariadbpy for deployment
[09:12:32] gerrit is doing weird stuff for me, cannot merge
[09:12:35] I have updated db1077 from test-s1 to pc1, as it is now there
[09:13:04] so tendril is a temporary thing until we sort out what orchestrator offers us
[09:13:13] Do you want me to try to merge? are you logged in? Sometimes I don't notice when I have been logged out
[09:13:16] we should keep these issues in mind and find the best way
[09:13:26] "Credentials expired" it says now :-(
[09:13:30] please merge if you can
[09:13:42] merged
[09:13:48] pull everywhere
[09:14:06] but we will definitely need to rethink classifications
[09:15:21] maybe a table for replication updated automatically, and a table for other classifications (mw vs test vs cloud, etc)
[09:16:02] and you can use the first for schema changes
[09:16:12] but the second is used by production monitoring
[09:17:47] also, test-s1 gets normally destroyed often, so it would have gotten the schema change eventually :-D
[09:18:45] BTW, backups are running on codfw because the systemd.timer didn't execute again
[09:19:10] I don't know what to do exactly to reenable a "vanished command" timer
[09:19:16] I disabled and reenabled
[09:19:25] yeah, I saw a bunch of alerts there
[09:19:37] unreliable timers are scary
[09:19:59] because we were supposed to remove all crons and replace them with systemd.timers
[09:21:03] Maybe it is worth a ticket to track those, as they are relatively new lately, or at least I haven't seen those issues being so recurrent before
[09:21:04] have you?
[09:21:41] I replied on a ticket of the same error, and everybody (even me) was like "this is a one time thing"
[09:22:06] but as it completely disabled the timer, I may create an UBN
[09:32:07] I created T268974
[09:32:07] T268974: systemd.timer not executing on cumin2001 after command was modified - https://phabricator.wikimedia.org/T268974
[09:34:32] I have subscribed myself
[09:34:32] thanks
[09:34:54] so once the timer is working again, we can lower from UBN
[09:37:31] DBA, Data-Services, Patch-For-Review, cloud-services-team (Kanban): Productionize clouddb10[13-20] - https://phabricator.wikimedia.org/T267090 (Marostegui) I did a transfer from db1124:3318 to clouddb1016:3318 and there are InnoDB errors right after I started replication: ` Nov 30 09:34:36 cloudd...
[09:42:38] DBA, Data-Services, Patch-For-Review, cloud-services-team (Kanban): Productionize clouddb10[13-20] - https://phabricator.wikimedia.org/T267090 (Marostegui)
[10:29:14] DBA, Data-Services, Patch-For-Review, cloud-services-team (Kanban): Productionize clouddb10[13-20] - https://phabricator.wikimedia.org/T267090 (Marostegui)
[12:36:20] DBA, Privacy Engineering, Security-Team, Patch-For-Review: Drop (and archive?) aft_feedback - https://phabricator.wikimedia.org/T250715 (ArielGlenn) Soooo... any approval forthcoming or should I just merge this? @Reedy what do you think?
[13:11:01] DBA, Data-Services, Patch-For-Review, cloud-services-team (Kanban): Productionize clouddb10[13-20] - https://phabricator.wikimedia.org/T267090 (Marostegui) s8 situation: * Transfer from db1087 (sanitarium master) to clouddb1016:3318 completed successfully. * Sanitization on clouddb1016:3318 was...
[13:17:51] DBA, Data-Services, Patch-For-Review, cloud-services-team (Kanban): Productionize clouddb10[13-20] - https://phabricator.wikimedia.org/T267090 (Marostegui)
[13:42:42] DBA, Data-Services, Patch-For-Review, cloud-services-team (Kanban): Productionize clouddb10[13-20] - https://phabricator.wikimedia.org/T267090 (Marostegui)
[14:20:34] Blocked-on-schema-change, DBA: Schema change for renaming namespace_title index on watchlist - https://phabricator.wikimedia.org/T268004 (Marostegui) From s1 only db1139 is pending the schema change - as it is now under maintenance due to HW issues.
[14:21:19] Blocked-on-schema-change, DBA: Schema change for renaming namespace_title index on watchlist - https://phabricator.wikimedia.org/T268004 (Marostegui)
[14:45:15] DBA, Data-Services, Patch-For-Review, cloud-services-team (Kanban): Productionize clouddb10[13-20] - https://phabricator.wikimedia.org/T267090 (Marostegui)
[15:04:30] DBA, Data-Services, Patch-For-Review, cloud-services-team (Kanban): Deploy labsdbuser and views to new clouddb hosts - https://phabricator.wikimedia.org/T268312 (Bstorm) Sorry, I haven't had a chance to test. I plan to today.
[15:16:26] Blocked-on-schema-change, DBA: Schema change for renaming namespace_title index on watchlist - https://phabricator.wikimedia.org/T268004 (Marostegui)
[16:38:18] DBA, DC-Ops, Operations, ops-codfw: (Need By: 2020-11-29) rack/setup/install db214[234] - https://phabricator.wikimedia.org/T267041 (Papaul)
[17:17:15] DBA, DC-Ops, Operations, ops-codfw: (Need By: 2020-11-29) rack/setup/install db214[234] - https://phabricator.wikimedia.org/T267041 (Papaul)
[17:45:57] DBA, Privacy Engineering, Security-Team, Patch-For-Review: Drop (and archive?) aft_feedback - https://phabricator.wikimedia.org/T250715 (ArielGlenn) Ok, merged, thanks for the +1's. What's next on this task?
[18:41:28] DBA, DC-Ops, Operations, ops-codfw: (Need By: 2020-11-29) rack/setup/install db214[234] - https://phabricator.wikimedia.org/T267041 (Papaul)
[18:58:42] DBA, DC-Ops, Operations, ops-codfw: (Need By: 2020-11-29) rack/setup/install db214[234] - https://phabricator.wikimedia.org/T267041 (Papaul)
[19:01:57] DBA, DC-Ops, Operations, ops-codfw: (Need By: 2020-11-29) rack/setup/install db214[234] - https://phabricator.wikimedia.org/T267041 (Papaul)
[19:05:15] DBA, GrowthExperiments, Growth-Team (Current Sprint), Patch-For-Review, and 2 others: Slow load times for Special:Homepage on cswiki - https://phabricator.wikimedia.org/T267216 (MMiller_WMF) We want to do the work of breaking up the query into smaller pieces. This goes into Ready for Development...
[19:40:39] DBA, Operations, ops-eqiad: db1139 memory errors on boot (issue continues after board change) 2020-08-27 - https://phabricator.wikimedia.org/T261405 (Jclark-ctr) reseated DIMMs; errors persisted, following up with HP
[20:16:07] DBA, DC-Ops, Operations, ops-codfw: (Need By: 2020-11-29) rack/setup/install db214[234] - https://phabricator.wikimedia.org/T267041 (Papaul)
[22:50:52] DBA, Data-Services, Patch-For-Review, cloud-services-team (Kanban): Deploy labsdbuser and views to new clouddb hosts - https://phabricator.wikimedia.org/T268312 (Bstorm) @Marostegui The views are created on s1@clouddb1013. That was nice and smooth. The indexes are in process. It's taking a littl...
[22:59:28] DBA: New database request: sockpuppet - https://phabricator.wikimedia.org/T268505 (WDoranWMF)