[00:18:41] DBA, Core Platform Team Workboards (Clinic Duty Team), MW-1.35-notes (1.35.0-wmf.8; 2019-11-26), Wikimedia-production-error: After deploy of 1.35.0-wmf.8 to group1, surge of "Cannot access the database: Unknown error" - https://phabricator.wikimedia.org/T239877 (brennen)
[00:59:56] DBA, Core Platform Team Workboards (Clinic Duty Team), MW-1.35-notes (1.35.0-wmf.8; 2019-11-26), Wikimedia-production-error: After deploy of 1.35.0-wmf.8 to group1, surge of "Cannot access the database: Unknown error" - https://phabricator.wikimedia.org/T239877 (Jdforrester-WMF)
[06:01:16] DBA: Compress new Wikibase tables - https://phabricator.wikimedia.org/T232446 (Marostegui)
[06:08:32] DBA: Remove ar_comment from sanitarium triggers - https://phabricator.wikimedia.org/T234704 (Marostegui)
[06:45:13] DBA, Patch-For-Review: Decommission db1062.eqiad.wmnet - https://phabricator.wikimedia.org/T239188 (Marostegui)
[06:45:24] DBA, Patch-For-Review: Decommission db1062.eqiad.wmnet - https://phabricator.wikimedia.org/T239188 (Marostegui)
[06:48:16] DBA, Growth-Team, Operations, StructuredDiscussions, WorkType-Maintenance: Setup separate logical External Store for Flow in production - https://phabricator.wikimedia.org/T107610 (Marostegui) The new hosts for es4 and es5 have been ordered and will be most likely set up (installed and racked...
[09:19:53] DBA, Operations, Puppet, User-jbond: DB: perform rolling restart of mariadb deamons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (Marostegui)
[09:39:39] DBA: Compress wikisahred.cx_corpora on x1 hosts - https://phabricator.wikimedia.org/T240325 (Marostegui)
[09:53:18] DBA, Operations, Puppet, User-jbond: DB: perform rolling restart of mariadb deamons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (Marostegui)
[10:03:45] I see lots of ErrorException from line 186 of /srv/mediawiki/php-1.35.0-wmf.8/extensions/Cite/src/ReferenceStack.php: PHP Notice: Undefined index:
[10:03:54] ErrorException from line 186 of /srv/mediawiki/php-1.35.0-wmf.8/extensions/Cite/src/ReferenceStack.php: PHP Notice: Undefined index: text
[10:03:56] and so on
[10:04:12] yep
[10:04:15] there is a task for it
[10:04:20] I didn't find it
[10:04:22] looks like it comes from the train last night
[10:04:27] do you remember where?
[10:04:36] https://phabricator.wikimedia.org/T240248
[10:04:40] thanks
[10:28:18] hey guys
[10:28:24] since you both missed yesterday's meeting:
[10:28:31] we're gonna need to define our OKRs for next quarter
[10:28:35] so I propose we discuss them on Thursday
[10:28:46] sounds good to me
[10:32:49] ok
[10:40:59] DBA, Operations, Puppet, User-jbond: DB: perform rolling restart of mariadb deamons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (Marostegui)
[11:14:45] rerunning backups on db2099, s4 replication will be stopped for an hour or so
[11:15:04] cool
[11:15:23] I created a task for backup2001, just checking you've got it on your radar
[11:15:32] oh
[11:15:51] I think I added you as a subscriber
[11:15:52] let me check
[11:16:00] https://phabricator.wikimedia.org/T240177
[11:16:38] DBA, Operations: backup2001 rebooted itself - https://phabricator.wikimedia.org/T240177 (jcrespo) p:Triage→Normal a:jcrespo
[11:46:09] DBA, Operations: backup2001 rebooted itself - https://phabricator.wikimedia.org/T240177 (jcrespo) a:jcrespo→Papaul Not the first time this happens: T237730 And firmware was updated at that time. @Papaul Could you file a support issue to vendor, given it is the second time this happened? What inf...
[12:06:45] !log disable puppet to restart puppetdb service
[12:06:45] jbond42: Not expecting to hear !log here
[12:10:46] ^^ re-enabled
[12:12:32] !log disable puppet to restart puppetdb (postgresql) service
[12:12:32] jbond42: Not expecting to hear !log here
[12:12:42] sorry, spoke too soon
[12:13:28] DBA, Operations, Patch-For-Review, Puppet, User-jbond: Document all uses of the puppetCA certificate - https://phabricator.wikimedia.org/T237259 (jbond)
[12:15:33] and just noticed this is the wrong channel, doh!
[12:16:33] DBA, Operations, Patch-For-Review, Puppet, User-jbond: Document all uses of the puppetCA certificate - https://phabricator.wikimedia.org/T237259 (jbond)
[12:24:59] bconsole: "Connecting to Director backup1001.eqiad.wmnet:9101. Director authorization problem. Most likely the passwords do not agree."
[12:26:07] jbond42: did you recently merge the password change?
[12:26:21] it hasn't been working since 12:00 UTC
[12:27:22] "Skipping run of Puppet configuration client; administratively disabled (Reason: 'perform puppetdb restart T237259 - jbond'" at that time
[12:27:23] T237259: Document all uses of the puppetCA certificate - https://phabricator.wikimedia.org/T237259
[12:28:06] jynus: i did the password change on Thursday
[12:28:19] i thought i had re-enabled puppet, just running cumin again to make sure
[12:28:31] I cannot connect to bacula now
[12:28:43] maybe it wasn't changed on the client config or something
[12:28:47] or needs a restart?
[12:29:14] what do you mean by "i can't connect to bacula now"
[12:29:24] the change i made would only affect client to server connections
[12:29:28] but it stopped working at 12:00: https://grafana.wikimedia.org/d/413r2vbWk/bacula?orgId=1&from=1575808153397&to=1575980953397&var-dc=eqiad%20prometheus%2Fops&var-job=gerrit1001.wikimedia.org-Hourly-Sun-production-srv-gerrit-git
[12:30:16] if I run "bconsole" on backup1001, it no longer connects
[12:31:00] looking
[12:31:04] bnet_server.c:197 Connection from 10.64.48.36:57074 refused by hosts.access
[12:32:00] maybe it is unrelated
[12:32:09] maybe, but it is suspicious
[12:33:36] Dec 10 12:00:21 backup1001 bacula-dir[5569]: 10-Dec 12:00 Message delivery ERROR: fopen /var/lib/bacula/log failed: ERR=Too many open files
[12:33:44] i just had a look and the passwords in /etc/bacula/bconsole.conf and /etc/bacula/bacula-dir.conf both match
[12:34:24] I am going to restart bacula-dir, maybe it is just a coincidence
[12:34:43] ok, let me know if it's still a problem and i'll help dig
[12:35:07] at :00 many backup jobs start at the same time, so it could be it
[12:35:51] it works now
[12:35:55] DBA, Operations, Patch-For-Review, Puppet, User-jbond: Document all uses of the puppetCA certificate - https://phabricator.wikimedia.org/T237259 (jbond)
[12:35:57] so it is not some puppet weirdness
[12:36:15] ok cool thanks :)
[12:36:18] or at least I don't think it is
[12:39:24] these copy jobs need some tuning I don't fully understand
[12:39:33] especially the first time they run
[12:39:55] plus I think it may be trying to copy jobs whose data doesn't exist, like old backups from the old db
[12:41:03] no idea, i had not even heard of bacula till here. however i have just checked and the bacula user is limited to 1024 open files, probably worth upping that a bit
[12:41:40] yeah
[12:41:59] well, to be fair, bacula-dir should not need to handle lots of fails
[12:42:10] only clients and storage should
[12:42:41] your suggestion is, however, a great point
[12:43:23] s/fails/files/ and that is a fail in itself
[12:43:44] feel free to add me as a reviewer to up the limits :)
[12:44:12] stats are back, and only one job was cancelled
[12:44:35] will take a break and later see what is the best way to handle this
[12:44:59] good to hear, and ack
[12:50:32] DBA, Operations, Patch-For-Review, Puppet, User-jbond: Document all uses of the puppetCA certificate - https://phabricator.wikimedia.org/T237259 (jbond)
[12:56:34] DBA, Operations, Patch-For-Review, Puppet, User-jbond: Document all uses of the puppetCA certificate - https://phabricator.wikimedia.org/T237259 (jbond)
[12:56:53] DBA, Operations, Patch-For-Review, Puppet, User-jbond: Document all uses of the puppetCA certificate - https://phabricator.wikimedia.org/T237259 (jbond)
[14:09:51] it got overloaded again
[14:09:57] :-(
[14:10:07] the copy thing needs tuning
[14:11:29] that's the fine tuning we said would take time :-)
[14:12:52] I will schedule 1 copy in the next hour for 1 day
[14:13:08] so it can at least make some progress
[14:13:13] I guess no HW logs on backup2001, btw?
[14:13:26] I didn't check
[14:13:31] ah
[14:13:36] but there were none last time, I asked papaul what he needs
[14:54:41] I think the copy should now happen without issues until next week
[14:54:58] fingers crossed!
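For context on the 12:33 password check (confirming that /etc/bacula/bconsole.conf and /etc/bacula/bacula-dir.conf agree): the "Director authorization problem" message is bconsole failing console authentication, which succeeds only when the Password in the console's Director resource matches the one in the Director's own resource. A trimmed sketch of the two stanzas follows; the resource name and secret are placeholders, not the production values, and several required directives are omitted.

    # /etc/bacula/bconsole.conf -- what bconsole presents when connecting
    Director {
      Name = backup1001-dir                  # placeholder resource name
      DIRport = 9101
      address = backup1001.eqiad.wmnet
      Password = "not-the-real-secret"
    }

    # /etc/bacula/bacula-dir.conf -- the Director resource it authenticates against
    Director {
      Name = backup1001-dir
      DIRport = 9101
      QueryFile = "/etc/bacula/scripts/query.sql"
      Password = "not-the-real-secret"       # must be identical to the console's copy
      Messages = Daemon
      # WorkingDirectory, PidDirectory, etc. omitted for brevity
    }

In this case the passwords did agree, and the conversation converged on the "Too many open files" error on the director as the actual problem.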
[14:55:07] once the copy is done, we will reevaluate the policy
[14:55:35] my suspicion is that there are many errors on job metadata whose data is not physically present due to the migration
[14:55:48] the question is whether the errors will repeat on every attempt
[14:56:02] and the metadata needs to be purged
[14:56:22] or whether it will fix itself by detecting the data was not present
[14:56:38] will reevaluate when the copy finishes
[14:58:43] did you by any chance downtime db2090? (maybe it was me), marostegui
[14:58:54] I did
[14:58:56] Why?
[14:59:01] I downtimed all of s4 codfw
[14:59:04] it was running backups
[14:59:08] db2090?
[14:59:15] and if it was stopped, it may have started replication
[14:59:20] db2090 is the master
[14:59:24] mm
[14:59:32] maybe I am confusing it then
[14:59:48] sorry
[14:59:50] db2099
[14:59:53] I have downtimed db2090 (and all s4 codfw) as there is a long-running alter table on the s4 codfw master
[14:59:54] my fault
[15:00:13] is it ok that db2099 is running replication?
[15:00:22] yeah
[15:00:29] replication is only stopped on db2090 (the master)
[15:00:30] ok, then I have not caused any harm
[15:00:33] cool
[15:00:48] sorry, I thought I had caused an issue by starting/stopping replication there
[15:00:51] no :)
[15:00:53] all good!
[15:00:57] thanks for asking though
[15:01:12] I should warn you if I do off-hours snapshots
[15:01:13] DBA: Compress wikisahred.cx_corpora on x1 hosts - https://phabricator.wikimedia.org/T240325 (Marostegui)
[15:01:37] as those could accidentally start and stop replication
[15:01:53] I am not stopping replication for this massive alter, so we should be good on that front
[15:02:38] An alter can mess up the snapshots via xtrabackup I guess, no?
[15:02:44] As the frm might be different before and after
[15:02:52] if it is caught in the middle of the alter
[15:04:47] mmmm, not sure
[15:05:16] I think locking prevents it
[15:05:31] but it would in any case be undesirable
[15:05:59] that is one of the reasons we have multiple copies at the same time
[15:06:11] the other thing this shows is a weakness in the backups
[15:06:28] backups are checked to confirm they were recently generated
[15:06:49] but they don't show the last time an update happened on the db
[15:07:02] so if for some reason replication is stopped, or some intermediate master is stopped
[15:07:12] I don't think it would be detected immediately
[15:07:16] not even in testing
[15:07:33] ah I see what you mean
[15:07:46] sure, we have backup redundancy between dcs also mitigating this
[15:08:00] so it could be we are taking a backup of a host that had replication stopped for 30 days
[15:08:04] is that what you mean?
[15:08:09] yes
[15:08:17] it would be nice to check something like "timestamp according to heartbeat"
[15:08:22] yeah, interesting
[15:08:26] or better, revision table/recentchanges
[15:08:48] and those are the kind of issues that would be frequent, rather than a full crash
[15:09:08] you could approach that with a check of the last time the ibd was changed I guess
[15:09:14] one mitigation would be dc duplication, as we already do
[15:09:35] but yes, maybe some extra heuristic check would be nice
[15:09:51] maybe note this conversation on the wishlist?
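On the 15:02 concern about a snapshot catching a table mid-ALTER: one pragmatic guard, if it were ever wanted, would be to look for long-running DDL on the source host before starting the snapshot. This is a rough sketch only; the host, the credentials path and the 60-second threshold are made up, and it is not part of the existing backup tooling.

    # Check information_schema.processlist for in-flight ALTERs before snapshotting.
    import pymysql

    def running_alters(host, min_seconds=60):
        conn = pymysql.connect(host=host, database='information_schema',
                               read_default_file='/root/.my.cnf')  # placeholder credentials path
        try:
            with conn.cursor() as cur:
                cur.execute(
                    "SELECT id, time, info FROM processlist "
                    "WHERE info LIKE 'ALTER%%' AND time > %s",
                    (min_seconds,))
                return cur.fetchall()
        finally:
            conn.close()

    if running_alters('db2099.codfw.wmnet'):   # example source host
        print('long-running ALTER in progress; consider delaying the snapshot')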
[15:09:55] but I cannot think of a perfect one
[15:10:22] ibdata may, like pt-heartbeat, miss an update only happening on the master, but not the real data
[15:10:43] I guess a way would be to do a query or check the ibd before attempting the backup, and then doing the backup and showing a warning like: the backup was done ok, but the xxxxx table's last write happened XX days ago
[15:10:44] revision/recentchanges may be edited very little on a small wiki
[15:11:21] the last edit on table X, however, would maybe be good enough
[15:11:43] I will add it to the backup improvement ticket
[15:11:53] good! thank you
[15:14:04] DBA, Epic: Improve regular production database backups handling - https://phabricator.wikimedia.org/T138562 (jcrespo)
[15:14:32] DBA, Epic: Improve regular production database backups handling - https://phabricator.wikimedia.org/T138562 (jcrespo)
[15:15:08] I am thinking something like "s1: check_table: enwiki.revision"
[15:15:29] if enwiki.revision not edited in the last hour, alert + store that info on the metadata database
[15:15:53] that way it would also work with non-MW databases
[15:16:06] yeah, that'd be useful
[15:16:21] but the backup should continue either way
[15:16:56] "m3: check_table: phabricator_whatever.tickets / check_table_time: 24 hours"
[15:17:20] and make that optional, I guess, or mandatory?
[15:17:28] optional I think
[15:17:31] +1
[15:17:53] but on the db: "last_table_edit: 2019-17-23 23:45:51"
[15:18:09] yeah
[15:18:10] if last_table_edit < now() - 1 hour, alert
[15:18:36] or < backup_time - 1 hour
[15:19:21] we would even detect issues outside of backups
[15:19:41] "codfw enwiki has not been updated in the last 24 hours"
[15:20:02] maybe it could be done outside of backups
[15:20:46] at the same time as T207253
[15:20:47] T207253: Compare a few tables per section between hosts and DC - https://phabricator.wikimedia.org/T207253
[15:21:21] yeah, we could even detect unused tables
[15:21:21] Going to call it a day
[15:21:21] Talk to you tomorrow! :*
[15:21:23] as long as the backup sources are also compared, that may work
[15:21:28] bye!
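To make the check_table idea from 15:15-15:18 concrete, here is a rough sketch of what a per-section freshness check could look like. It assumes pymysql, the MediaWiki YYYYMMDDHHMMSS rev_timestamp format, and placeholder section names, hosts, thresholds and credentials; the metadata-database write is only hinted at, and none of this is the actual backup code.

    from datetime import datetime, timedelta
    import pymysql

    CHECKS = {
        # section: (database, canary table, timestamp column, acceptable staleness)
        's1': ('enwiki', 'revision', 'rev_timestamp', timedelta(hours=1)),
    }

    def last_write(host, db, table, column):
        """Newest timestamp in the canary table (MediaWiki YYYYMMDDHHMMSS format)."""
        conn = pymysql.connect(host=host, database=db,
                               read_default_file='/root/.my.cnf')  # placeholder credentials path
        try:
            with conn.cursor() as cur:
                cur.execute('SELECT MAX({}) FROM {}'.format(column, table))
                (ts,) = cur.fetchone()
        finally:
            conn.close()
        if isinstance(ts, (bytes, bytearray)):
            ts = ts.decode()
        return datetime.strptime(ts, '%Y%m%d%H%M%S')

    def check_freshness(section, host):
        db, table, column, max_lag = CHECKS[section]
        newest = last_write(host, db, table, column)
        lag = datetime.utcnow() - newest
        if lag > max_lag:
            # alert only; the backup itself still runs, as agreed at 15:16
            print('WARNING: {} {}.{} last write was {} ago'.format(section, db, table, lag))
        return newest   # would also be recorded in the backup metadata database

    check_freshness('s1', 'some-backup-source.eqiad.wmnet')   # placeholder host

Comparing against backup_time rather than now(), as suggested at 15:18, only changes the lag computation; running the same check against both DCs would also catch the "codfw enwiki has not been updated" scenario mentioned above.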
[15:37:59] jynus: in relation to increasing open file limits i created https://gerrit.wikimedia.org/r/c/operations/puppet/+/556207 which should do the job but of course i don't know the correct limits or if you want to apply them to all these services
[15:39:46] oh, you didn't have to
[15:39:50] it was my problem
[15:40:17] I only pinged you because the error message pointed me at passwords + you happened to do something at that time
[15:40:23] it later turned out I was wrong
[15:40:33] in fact, I am not sure whether to increase the limit
[15:40:42] because it should never happen
[15:40:49] it's no problem, i wanted to see if it could just be controlled in systemd or if you had to do something in limits.conf as well
[15:41:20] after testing, the change was minimal, so really not a problem
[15:41:30] Please allow me to put that on the back burner
[15:41:48] because I may fix the "too many open files" issue in a different way
[15:42:05] and later rethink whether that is needed
[15:42:08] yes, sure, that's not a problem
[15:42:15] I think it would be nice for fd and sd
[15:42:33] but in the director it may be a symptom of something else going wrong
[15:42:46] ack
[15:42:49] especially as we are planning to increase concurrency
[15:43:04] of the number of jobs running at the same time
[15:55:11] DBA, Growth-Team, Operations, StructuredDiscussions, WorkType-Maintenance: Setup separate logical External Store for Flow in production - https://phabricator.wikimedia.org/T107610 (Anomie) >>! In T107610#5454434, @Catrope wrote: > It doesn't look like `ExternalStoreDB` currently supports over...
[17:43:46] DBA, Core Platform Team Workboards (Clinic Duty Team), MW-1.35-notes (1.35.0-wmf.8; 2019-11-26), Wikimedia-production-error: After deploy of 1.35.0-wmf.8 to group1, surge of "Cannot access the database: Unknown error" - https://phabricator.wikimedia.org/T239877 (Arlolra)
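For reference, the kind of open-file-limit bump discussed above for the Bacula daemons is usually just a systemd override on the service unit; the linked Gerrit change presumably does the puppetised equivalent. A minimal hand-rolled sketch, assuming the unit is named bacula-director.service and picking an arbitrary 4096 limit:

    # /etc/systemd/system/bacula-director.service.d/override.conf
    [Service]
    LimitNOFILE=4096

followed by a systemctl daemon-reload and a restart of the director. As noted in the discussion above, raising the director's limit mostly papers over the symptom: the fd and sd are the daemons expected to hold many descriptors, so a spike on the director is more likely a sign of too many copy jobs starting at the same time.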