[05:39:05] DBA, Operations, ops-eqiad, Patch-For-Review: db1078 s3 primary DB master BBU pre-failure - https://phabricator.wikimedia.org/T219115 (Marostegui) @jcrespo would you mind taking a look at the above patches ^ I have also updated our etherpad with the plan Thanks!
[06:10:05] DBA, Goal: Address Database infrastructure blockers on datacenter switchover & multi-dc deployment - https://phabricator.wikimedia.org/T220170 (Marostegui)
[06:10:28] DBA, Goal: Address Database infrastructure blockers on datacenter switchover & multi-dc deployment - https://phabricator.wikimedia.org/T220170 (Marostegui) p:Triage→Normal a:Marostegui
[06:11:35] DBA, Goal: Address Database infrastructure blockers on datacenter switchover & multi-dc deployment - https://phabricator.wikimedia.org/T220170 (Marostegui)
[06:11:37] DBA, Operations, ops-eqiad, Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (Marostegui)
[06:16:01] jynus: check the sre etherpad, I have added our goals but please double check them and re-org as you feel, line 62 to 81
[06:16:19] jynus: also added this week updates to our section (line 84)
[06:41:19] DBA, Gerrit, Operations, Release-Engineering-Team (Next): Gerrit is failing to connect to db on gerrit2001 thus preventing systemd from working - https://phabricator.wikimedia.org/T176532 (Dzahn) Resolved→Open let's only resolve stuff that is actually resolved, not what will be resolved i...
[06:58:12] sorry, I was looking at our etherpad, and was getting very confused
[06:58:46] I have updated the SRE one only
[06:59:01] it is ok, I was like "but I don't see it" :-D
[06:59:06] did it move up or down?
[06:59:08] haha
[06:59:12] and I was like, wat?
[06:59:13] on the SRE?
[06:59:16] ah
[06:59:17] hahahah
[06:59:18] on ours
[06:59:41] ours only have the usual new entry when we do a failover
[07:02:47] I need to fix the backups before updating that, they failed but run fast without replication
[07:03:55] yeah, I didn't add any updates there really
[07:03:58] just the structure
[07:04:02] as it was empty
[07:04:16] title, key points tasks etc
[07:04:31] thanks for doing that
[07:04:34] it is very useful
[07:09:10] no problem!
[07:23:28] I am acking the other alarm for db2044
[07:23:36] (yes, don't ask me why there are 2)
[07:24:13] thanks
[07:24:23] and apparently dbstore1001 is low again on disk
[07:26:07] I don't think it can take snapshots of s1 at the time
[07:26:13] so I will change it
[07:41:11] question, marostegui: will private data check fire wrongly this monday again, or was it created already?
[07:41:24] no, it was created fine
[07:41:27] cool
[07:41:28] And I ran it to double check
[07:41:30] thanks
[07:41:31] And it was good
[07:42:06] https://phabricator.wikimedia.org/T212625#5074335
[07:43:11] thanks, I wasn't subscribed to that one
[07:43:59] no worries!
[08:56:07] Hi, i am trying to find the right Icinga URLs for Monitoring checks, what about this one:
[08:56:10] description => 'eventlogging_sync processes',
[08:56:30] it checks if eventlogging_sync.sh is running
[08:56:38] is that more https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging or more DBA
[08:56:54] because it is in the mariadb profile
[08:57:31] profile/manifests/mariadb/misc/eventlogging/replication.pp
[08:58:30] mutante: that is eventlogging indeed, not us as DBA
[08:59:06] ok, thanks !
[08:59:22] mutante: those database db1107 and db1108 are databases, but the eventlogging ones :)
[09:00:16] the other ones in this profile are 2x "haproxy_failover" a
[09:00:55] ok, yep. 
in this case i am only going by puppet modules/profiles and had no idea about host names
[09:01:04] makes sense
[09:02:23] so if we had an alert about haproxy failing to load balance between replicas.. that would go to https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[09:02:39] it also mentions haproxy there i see. so that's cool
[09:03:25] mutante: not really, it doesn't explain how that works on misc (haproxy is only on misc now)
[09:04:09] We need a new section, point it there, but we need a new section to troubleshoot it
[09:05:12] ok, well. as long as the page is the right one we can use it and add sections later
[09:09:26] mutante: we also have https://wikitech.wikimedia.org/wiki/MariaDB#HAProxy which is incomplete
[09:10:34] marostegui: cool, that seems good. it doesn't necessarily have to have all the content yet, just want to get to the place where we can make it a required param for new checks
[11:59:50] DBA, Goal, Patch-For-Review: Implement database binary backups into the production infrastructure - https://phabricator.wikimedia.org/T206203 (jcrespo) With replication down, backup of s8 (with x1 concurrency, but it shouldn't matter in this case) took 4h45m: * 2h45m for the 1.5TB transference * 25s...
[12:12:37] Apr 5 12:06:03 cloudservices2002-dev mysqld[2134]: 2019-04-05 12:06:03 0 [Note] /opt/wmf-mariadb103/bin/mysqld (mysqld 10.3.8-MariaDB-log) starting as process 2134 ...
[12:12:37] Apr 5 12:06:03 cloudservices2002-dev mysqld[2134]: 2019-04-05 12:06:03 0 [ERROR] Can't find messagefile '/opt/wmf-mariadb10/share/errmsg.sys'
[12:12:37] Apr 5 12:06:03 cloudservices2002-dev mysqld[2134]: 2019-04-05 12:06:03 0 [ERROR] Aborting
[12:12:45] should I just create that file?
[12:13:42] no, you have a misconfiguration there
[12:14:24] oh I see, 103 vs 10
[12:14:35] we don't even have 10.3 packages
[12:15:13] you should probably review your profile and how it uses the mariadb module
[12:15:47] or, alternatively, not use the mariadb module, which is intended for production configuration
[12:16:01] (and can be used for non production, but needs work)
[12:17:16] blame andrewb (jk) who I think was the person that set that up in the past :-)
[12:17:20] :-P
[12:17:44] I'm staring at modules/profile/manifests/openstack/base/pdns/auth/db.pp with little clues on what to do
[12:17:57] also affected because I'm rushing :-P
[12:18:13] one trick I highly recommend
[12:18:19] no matter what you do
[12:18:38] set up on hiera enable_notifications: 0
[12:18:48] so that after install, even if there are errors
[12:19:07] they don't go off (disabled notifications by default)
[12:19:17] DBA, Goal, Patch-For-Review: Implement database binary backups into the production infrastructure - https://phabricator.wikimedia.org/T206203 (Marostegui) I think it is pretty good compared with doing it without stopping replication. Probably once the sources have been migrated to the final HW will r...
[12:19:22] mysql servers are tricky to setup because they require provisioning
[12:19:46] and that is automated outside of puppet
[12:20:55] we always do the setup with either disabled_notifications:0 or install them as spares, which has the same effect (just a tip)
[13:05:15] DBA, MediaWiki-Cache, Patch-For-Review, Performance-Team (Radar), User-Marostegui: Replace parsercache keys to something more meaningful on db-XXXX.php - https://phabricator.wikimedia.org/T210725 (Marostegui) I have blocked a window on Tuesday to tentatively get it deployed if no one objects...
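Side note on the errmsg.sys failure pasted at 12:12:37: the "103 vs 10" diagnosis is that the server binary and the message file sit under different install prefixes. A minimal Python sketch of spotting that from the log text could look like the following. The function name and both regular expressions are illustrative assumptions built only from the three log lines quoted in this channel; this is not a tool from the puppet repo, and other mysqld versions may log these lines differently.

```python
import re
from pathlib import PurePosixPath

# Assumption: patterns are derived only from the log lines quoted in the channel.
STARTING = re.compile(r"\[Note\] (\S+/bin/mysqld) \(mysqld")
MISSING = re.compile(r"\[ERROR\] Can't find messagefile '([^']+)'")


def find_prefix_mismatch(log_text):
    """Return (binary_prefix, messagefile) if the missing message file is
    outside the install prefix of the mysqld binary, else None."""
    binary = None
    messagefile = None
    for line in log_text.splitlines():
        started = STARTING.search(line)
        if started:
            binary = PurePosixPath(started.group(1))
        missing = MISSING.search(line)
        if missing:
            messagefile = missing.group(1)
    if binary is None or messagefile is None:
        return None
    # <prefix>/bin/mysqld -> <prefix>
    prefix = str(binary.parent.parent)
    if messagefile.startswith(prefix + "/"):
        return None
    return (prefix, messagefile)


LOG = """\
2019-04-05 12:06:03 0 [Note] /opt/wmf-mariadb103/bin/mysqld (mysqld 10.3.8-MariaDB-log) starting as process 2134 ...
2019-04-05 12:06:03 0 [ERROR] Can't find messagefile '/opt/wmf-mariadb10/share/errmsg.sys'
2019-04-05 12:06:03 0 [ERROR] Aborting
"""

if __name__ == "__main__":
    print(find_prefix_mismatch(LOG))
```

On the quoted lines this reports the binary prefix /opt/wmf-mariadb103 against a message file under /opt/wmf-mariadb10, which matches the advice above: review how the profile configures the mariadb module rather than creating the file by hand.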
[13:07:28] Reserved also a window for the failover
[15:49:43] DBA, Operations, ops-codfw: Degraded RAID on db2044 - https://phabricator.wikimedia.org/T220102 (Papaul) a:Papaul→Marostegui complete
[16:21:33] DBA, Operations, ops-codfw: Degraded RAID on db2044 - https://phabricator.wikimedia.org/T220102 (Marostegui) Thanks, it is rebuilding ` logicaldrive 1 (3.3 TB, RAID 1+0, Recovering, 54% complete) physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, OK) physicaldrive 1I:1:2 (port...
[17:56:37] DBA, Operations, ops-codfw: Degraded RAID on db2044 - https://phabricator.wikimedia.org/T220102 (Marostegui) Open→Resolved Finished correctly, thanks! ` root@db2044:~# hpssacli controller all show config Smart Array P420i in Slot 0 (Embedded) (sn: 0014380264FFFB0) Port Name: 1I P...
[18:12:30] DBA, MediaWiki-Cache, Patch-For-Review, Performance-Team (Radar), User-Marostegui: Replace parsercache keys to something more meaningful on db-XXXX.php - https://phabricator.wikimedia.org/T210725 (aaron) >>! In T210725#5065110, @Marostegui wrote: > I would like to elaborate more on my idea on...
[18:14:48] DBA, MediaWiki-Cache, Patch-For-Review, Performance-Team (Radar), User-Marostegui: Replace parsercache keys to something more meaningful on db-XXXX.php - https://phabricator.wikimedia.org/T210725 (Marostegui) >>! In T210725#5089115, @aaron wrote: >>>! In T210725#5065110, @Marostegui wrote: >>...
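Side note on the db2044 rebuild status quoted at 16:21:33: the logicaldrive line carries the size, RAID level, state and (while recovering) a progress percentage. A rough Python sketch for pulling those fields out of pasted `hpssacli controller all show config` output is below. The function name and the assumed field order (size, RAID level, status, optional "N% complete") are inferred only from the single line quoted in this log, so treat it as a starting point rather than a complete parser.

```python
import re

# Assumption: field layout inferred from the one logicaldrive line quoted above.
LOGICALDRIVE = re.compile(r"logicaldrive (\d+) \(([^)]*)\)")
PROGRESS = re.compile(r"(\d+)% complete")


def parse_logicaldrives(output):
    """Return one dict per logicaldrive entry found in the pasted output."""
    drives = []
    for match in LOGICALDRIVE.finditer(output):
        fields = [field.strip() for field in match.group(2).split(",")]
        if len(fields) < 3:
            continue  # unexpected layout, skip rather than guess
        percent = None
        if len(fields) > 3:
            progress = PROGRESS.match(fields[3])
            if progress:
                percent = int(progress.group(1))
        drives.append({
            "id": int(match.group(1)),
            "size": fields[0],
            "raid": fields[1],
            "status": fields[2],
            "percent_complete": percent,
        })
    return drives


SAMPLE = "logicaldrive 1 (3.3 TB, RAID 1+0, Recovering, 54% complete)"

if __name__ == "__main__":
    for drive in parse_logicaldrives(SAMPLE):
        print(drive)
```

A check built on this could stay quiet while the status is Recovering and the percentage keeps rising, and only alert when the status is something other than OK or Recovering.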