[06:53:49] DBA, Dumps-Generation, Patch-For-Review: db1053 transient replica lag during dumping - https://phabricator.wikimedia.org/T132279#2482868 (ArielGlenn) Failed because I forgot how to set the config setting properly. Interestingly enough we had lag of only about 15 seconds max during the en wiki stubs...
[08:21:44] DBA, Operations, ops-eqiad: dbstore1002 disk errors - https://phabricator.wikimedia.org/T140337#2482951 (jcrespo) I had that very same problem with the old disk, but I assumed it was because it had failed. :-( Let me see if I see anything else bad.
[09:58:14] DBA, Dumps-Generation, Patch-For-Review: db1053 transient replica lag during dumping - https://phabricator.wikimedia.org/T132279#2483136 (jcrespo) 15 seconds is not a concern. Also, as I said many times, if you told me you need to stop replication, that would be ok (and I would like to explore that o...
[13:10:34] DBA, Dumps-Generation, Patch-For-Review: db1053 transient replica lag during dumping - https://phabricator.wikimedia.org/T132279#2483313 (ArielGlenn) Yeah I don't think we should need to stop it. This approach of running the problematic job in two batches should be just fine, as long as I remember h...
[13:13:51] DBA, Operations, ops-eqiad: dbproxy1002 down - https://phabricator.wikimedia.org/T140983#2483315 (jcrespo)
[13:14:09] DBA, Operations, ops-eqiad: dbproxy1002 down - https://phabricator.wikimedia.org/T140983#2483328 (jcrespo) ``` MariaDB MISC m2 localhost (none) > SHOW DATABASES; +--------------------+ | Database | +--------------------+ | bugzilla_testing | | frimpressions | | heartbeat | |...
[13:16:34] mysql, information_schema, performance_schema, sys are all system databases
[13:16:40] so forget about it
[13:21:08] <_joe_> jynus: I know :)
[13:21:09] _joe_, ping
[13:21:22] <_joe_> I am looking at "scholarships"
[13:21:26] thank you for the help, I really appreciate the archeology work
[13:21:41] <_joe_> apart from that, everything relevant is accounted for
[13:21:43] I am checking now the access stats, maybe they return 0
[13:22:24] <_joe_> scholarships.wikimedia.org
[13:22:51] <_joe_> it's up
[13:22:53] the problem is that I have the stats, but as they have not yet been upgraded, we do not have a last_access
[13:22:54] <_joe_> so it's ok
[13:23:09] yes, and I think if it was down
[13:23:15] it wouldn't be so critical
[13:23:22] <_joe_> honestly scholarships seems basically a static site now
[13:23:24] <_joe_> yes,
[13:23:29] they would send a ticket and we would solve it
[13:23:47] it is just I felt stupid not knowing what it was used for
[13:23:54] <_joe_> frimpressions is the only one I don't know shit about
[13:23:59] that is why I created https://wikitech.wikimedia.org/wiki/MariaDB/misc
[13:25:05] <_joe_> templates/mariadb/production-grants-m2.sql.erb:24: ON `frimpressions`.* TO 'frack'@'10.64.0.166';
[13:25:06] _joe_, I can confirm those have not been accessed since the server started
[13:25:13] <_joe_> so I guess it's frack
[13:25:21] but probably not in use
[13:25:32] it does not have access (or shouldn't)
[13:25:50] I am pinging jeff
[13:25:59] <_joe_> uhm the only grant I see for frimpressions is for dbproxy
[13:26:01] <_joe_> heh
[13:26:03] <_joe_> anyways
[13:26:12] <_joe_> that I see on puppet, didn't check on the db
[13:26:28] I think that is all
[13:26:29] <_joe_> btw I am reinstalling dbproxy1002 now
[13:26:37] <_joe_> it's jessie, is that ok?
[13:26:49] yes, I literally changed it on monday
[13:26:58] <_joe_> I remembered that much :P
[13:27:01] and I asked myself how to failover a SPOF
[13:27:29] you contributed to https://phabricator.wikimedia.org/T125027 more than I did already!
[13:28:02] (but I hope you understand why it takes me so much time, as I inherited so many SPOFs)
[13:28:31] the fix is to have a couple of proxies, but I needed time to architect that
[13:29:02] (which is not easy with so many applications, and so different)
[13:29:33] thank you, _joe_ I will keep it from here
[13:30:31] *take
[13:30:38] I will not waste more of your time
[13:31:46] <_joe_> jynus: yeah I think it's normal when you're alone doing the job of a team of 3/4 people
[13:31:53] <_joe_> to have things that are not "perfect"
[13:32:14] well, what I want you to understand (which I think you do, but I need to say it)
[13:32:21] <_joe_> :)
[13:32:30] is, I inherit X, I know X is wrong
[13:32:45] I bring it up, and I slowly work to fix it
[13:32:57] but it will be "bad" for some time
[13:33:03] <_joe_> we're all mainly in the same boat
[13:33:05] think labsdb RAID level
[13:33:18] <_joe_> I mean even things I did are not as good as I'd like
[13:33:22] <_joe_> because time constraints
[13:33:29] it is not in this case
[13:33:40] <_joe_> I know
[13:33:54] <_joe_> I mean even if things I did are not done as well as I'd like
[13:34:00] sure
[13:34:03] <_joe_> try to imagine things that predate me :)
[13:34:13] of course
[13:34:37] <_joe_> I think dbproxy1002 will be back in 10-15 minutes btw
[13:34:53] I will continue the installation
[13:35:05] happily, if I had to decide one to break
[13:35:17] this would be one of those, because puppet will automate everything
[13:35:20] it has no data
[13:35:40] that was also why it was not in the top priority thing
[13:36:00] but without gerrit, I was not comfortable to set up another host as quickly
[13:37:54] thinking back, and it is obvious now, is that dbproxies being l3 can't simply take traffic without configuration beforehand
[13:39:11] what I mean is that I thought all proxies would be configured the same, which is obviously not the case
[13:39:21] <_joe_> godog: heh, nope
[13:41:58] <_joe_> jynus: dbproxy1002 is up btw
[13:43:26] oh, no
[13:43:53] godog, I thought of failovering the proxy, because it would only need a hot config change
[13:44:11] but probably the other proxies were not even set up
[13:44:30] (not without config, but even without ha installed, etc.)
[13:44:36] and the dns would be needed anyway
[13:44:55] which is why we use proxies in the first time (avoid dns failovers)
[13:45:02] *place
[13:45:29] but the right fix is to set up HA for the proxies (it is not clear to me how)
[13:45:38] heh
[13:45:52] I would like to have the proxy within the services, because that allows to failover every service separately
[13:46:04] but I do not think that is possible
[13:46:15] I will ask for your expertise at some point
[13:46:29] for brainstorming
[13:47:09] maybe VIPs is the easier option, I just hate it
[13:53:09] I cannot find dbproxy1002 now on icinga :-/
[14:09:39] <_joe_> jynus: sorry I disappeared but restbase outage
[14:10:13] today is a bad day to stop playing pokemon go
[14:12:33] It still won't let me start
[14:13:47] Reedy, are you talking about restbase or pokemon?
[14:13:56] pokemon :)
[14:14:35] <_joe_> jynus: so dbproxy1002 is up and running since a few
[14:14:45] <_joe_> should be correctly configured too
[14:15:15] ah, that is why I couldn't install console
[14:16:19] <_joe_> I told you
[14:16:25] <_joe_> 15:42 < _joe_> jynus: dbproxy1002 is up btw
[14:16:28] <_joe_> :)
[14:16:42] well I understood up == reimaged, not puppetized
[14:20:05] confess, this was all a plan of yours to check that my puppet was correct :-P
[14:36:35] <_joe_> jynus: yes!
[14:36:40] <_joe_> apparently it is :)
[16:10:13] DBA, Operations, ops-eqiad: dbproxy1002 down - https://phabricator.wikimedia.org/T140983#2483793 (jcrespo) p:Triage>Normal dbproxy1002 seems to be back up again thanks to @fgiunchedi and @Joe. I will point the DNS back to the proxy again at an appropriate window.
[16:20:44] DBA, Operations, ops-eqiad: dbproxy1002 down - https://phabricator.wikimedia.org/T140983#2483855 (jcrespo) a:jcrespo
[16:42:06] DBA, Labs, Labs-Infrastructure, Striker: Investigate moving labsdb (replicas) user credential management to 'Striker' (codename) - https://phabricator.wikimedia.org/T140832#2483968 (bd808)
[17:39:17] DBA, Operations, ops-eqiad: dbproxy1002 down - https://phabricator.wikimedia.org/T140983#2484406 (jcrespo) I am checking times, according to logs (request numbers are too low) gerrit and OTRS were down between 12:43 and 12:58.
[18:33:55] DBA, MediaWiki-Database, User-notice, Wikimedia-log-errors: Special:Block request makes the database timeout on db1055 - https://phabricator.wikimedia.org/T140650#2484691 (Bsadowski1) Happened to me a few minutes ago when attempting to block the same user at the same time as another administrator...
[20:01:56] DBA, Community-Tech, MediaWiki-extensions-PageAssessments, Reports-bot, Schema-change: Create tables for PageAssessments in enwiki database - https://phabricator.wikimedia.org/T139552#2484944 (kaldari) @jcrespo: I would be happy to do this (for future wikis as well) if you can just let me kno...
[20:45:00] DBA, Community-Tech, MediaWiki-extensions-PageAssessments, Reports-bot, Schema-change: Create tables for PageAssessments in enwiki database - https://phabricator.wikimedia.org/T139552#2485172 (jcrespo) I think that needs an extra parameter to run it on the master. Please seek advice regarding...
[20:58:10] DBA, Operations, ops-eqiad: dbstore1002 disk errors - https://phabricator.wikimedia.org/T140337#2485251 (jcrespo) ``` megacli -PDRbld -ShowProg -PhysDrv'[32:6]' -a0 Rebuild Progress on Device at Enclosure 32, Slot 6 Completed 98% in 908 Minutes. ```
[20:59:31] DBA, DC-Ops, Operations, ops-eqiad: dbstore1002.mgmt.eqiad.wmnet: "No more sessions are available for this type of connection!" - https://phabricator.wikimedia.org/T119488#2485255 (jcrespo) stalled>Resolved a:Cmjohnson
[21:01:28] DBA, Operations, Patch-For-Review: upgrade dbproxy1001/1002 to jessie - https://phabricator.wikimedia.org/T125027#2485260 (jcrespo) It seems dbproxy1002 was "accidentally" upgraded to jessie today: T140983
[21:05:04] DBA, Operations, ops-eqiad: dbproxy1002 down - https://phabricator.wikimedia.org/T140983#2485265 (jcrespo) We need to revert https://gerrit.wikimedia.org/r/300254 once we check everything is working and have a window where it is not disruptive.
[21:08:11] DBA, Operations, Patch-For-Review: upgrade dbproxy1001/1002 to jessie - https://phabricator.wikimedia.org/T125027#2485279 (jcrespo)
[21:08:14] DBA, Operations, Phabricator, Patch-For-Review: Upgrade m3 (phabricator) db servers - https://phabricator.wikimedia.org/T138460#2485280 (jcrespo)
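
Note on the 20:58:10 entry (T140337): the rebuild progress line quoted there can be turned into a rough time-to-completion figure. The following is only a sketch, assuming the progress line always has the exact wording shown in that entry and that the rebuild advances at a constant rate (real RAID rebuilds often do not). The function names `parse_rebuild_progress` and `estimate_remaining_minutes` are hypothetical and are not part of megacli or any tool mentioned in the log.

```python
import re

# Assumption: the progress line always matches the wording quoted in the log.
PROGRESS_RE = re.compile(
    r"Enclosure (?P<enclosure>\d+), Slot (?P<slot>\d+) "
    r"Completed (?P<percent>\d+)% in (?P<minutes>\d+) Minutes"
)


def parse_rebuild_progress(line):
    """Return enclosure, slot, percent and minutes as ints, or None if no match."""
    match = PROGRESS_RE.search(line)
    if match is None:
        return None
    return {key: int(value) for key, value in match.groupdict().items()}


def estimate_remaining_minutes(percent, minutes):
    """Linear extrapolation (an assumption): remaining = elapsed * (100 - p) / p."""
    if percent <= 0:
        return None  # no progress yet, nothing to extrapolate from
    return minutes * (100 - percent) / percent


# The line below is copied from the 20:58:10 log entry.
sample = (
    "Rebuild Progress on Device at Enclosure 32, Slot 6 "
    "Completed 98% in 908 Minutes."
)
progress = parse_rebuild_progress(sample)
remaining = estimate_remaining_minutes(progress["percent"], progress["minutes"])
print(progress)
print(round(remaining, 1))
```

With the figures from the log (98% after 908 minutes), the linear estimate is 908 × 2 / 98 minutes, so the rebuild was presumably close to finishing when the comment was posted.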