[06:16:52] DBA, Patch-For-Review, cloud-services-team (Kanban): Drop nova and nova_api databases from m5 - https://phabricator.wikimedia.org/T248313 (Marostegui) Renamed `nova_api` tables: ` root@cumin1001:/home/marostegui# mysql.py -hdb1133 nova_api -e "show tables" +----------------------------+ | Tables_in_n...
[06:31:23] Blocked-on-schema-change, DBA: Schema change: Make page.page_restrictions column NULL - https://phabricator.wikimedia.org/T248333 (Marostegui)
[06:35:38] DBA, Data-Services, Operations, Patch-For-Review: Replace labsdb (wikireplicas) dbproxies: dbproxy1010 and dbproxy1011 - https://phabricator.wikimedia.org/T231520 (Marostegui) This can now go after merging: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/582961/ https://gerrit.wikimedia.o...
[06:39:09] DBA, Data-Services, Operations, Patch-For-Review: Replace labsdb (wikireplicas) dbproxies: dbproxy1010 and dbproxy1011 - https://phabricator.wikimedia.org/T231520 (Marostegui) Everything seems to be working fine on dbproxy1019 and dbproxy1018 after merging the above changes. Everything is reachab...
[07:37:00] Blocked-on-schema-change, DBA: Schema change: Make page.page_restrictions column NULL - https://phabricator.wikimedia.org/T248333 (Marostegui)
[08:26:44] DBA, Operations, Patch-For-Review: Add favicon to icinga and tendril - https://phabricator.wikimedia.org/T204110 (jcrespo)
[08:27:01] ^yay
[08:27:55] oh nice!
[08:28:04] did you see my comment on your change?
[08:28:22] I fixed it :)
[08:28:47] sorry, didn't see the email
[08:30:46] thanks :*
[08:40:09] DBA, Patch-For-Review: Install 1 buster+10.4 host per section - https://phabricator.wikimedia.org/T246604 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db1137.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/202003250839_...
[08:59:15] DBA, Patch-For-Review: Install 1 buster+10.4 host per section - https://phabricator.wikimedia.org/T246604 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1137.eqiad.wmnet'] ` and were **ALL** successful.
[10:44:00] there is some collection failures on db1077, should I follow the guide for prometheus exporter?
[10:46:19] go for it
[10:46:23] it is a test host
[10:46:28] but yeah, if you can clear those errors, that's good
[10:46:46] asking to check you were finished with it
[10:46:53] will proceed
[10:46:54] yep, all done
[10:46:55] thanks
[10:53:01] scraping errors starting to decrease now
[10:53:58] with your permissions I will tune s4 today as I did with s1
[10:54:04] (core weights)
[10:54:14] db1091 seems too loaded compared to the others
[10:55:00] latency is 20ms+
[10:57:57] It is not go for it
[10:58:09] sorry: go for it
[10:58:10] XD
[10:58:11] ah
[10:58:19] I was like, why?
[10:58:22] haha
[10:58:42] will check with you on deploy to make sure I don't interfere with your maintenance
[11:06:47] marostegui: about to deploy a weight reduction for db1091
[11:08:10] Go for it
[11:08:14] I am done with dbctl
[11:10:13] I think that got most of the instances under 10ms or so per query
[11:15:15] nice
[11:21:29] https://gerrit.wikimedia.org/r/c/operations/software/tendril/+/583203
[11:21:31] +1?
[11:21:42] Ah sorry, I didn't see that
[11:21:51] first time I send it to you
[11:22:05] don't want to bother you with those patches until I +1 ed
[11:22:23] Done!
[11:22:27] Nice first contrib :)
[11:22:32] but I want to be aware of them
[11:22:35] *you
[11:27:15] there is 2 deployed changes on tendril
[11:27:19] *undeployed
[11:27:54] what do you mean?
[11:27:55] your change wasn't deployed to production b98d81453bc4e788491a400a7db4c238b
[11:28:12] let me see
[11:28:14] I think that is ok because it was documentation, but FYI
[11:28:37] I deployed it on dbmonitor2001, if you want to check it
[11:28:46] what do you mean deployed?
[11:28:51] ah, a pull?
[11:28:54] that's the issue :-D
[11:29:04] I setup tendril as manually deployed long ago
[11:29:19] Yes, I just probably forgot to do the pull
[11:29:24] so it doesn't have git(latest), just the clone
[11:29:31] that is ok, but you knew that, right?
[11:29:35] yes yes
[11:29:43] if not it is a fail on my documentation (me) not you
[11:29:57] it was a comment only change
[11:29:59] But as we rarely deploy on tendril...
[11:30:02] It is easy to forget
[11:30:07] But better that way
[11:30:12] I thought that it was safer
[11:30:27] and easy to spot
[11:30:36] unless the change was noop, like yours
[11:30:56] will make sure docs are updated, though
[11:32:05] yeah it is safer
[11:32:12] but as we are not used to deploy there...easy to forget
[11:32:33] tendril replacement will have a saner workflow, so that is why I didn't invest time on improving it
[11:33:00] now I just need to invalidate my tendril cache
[11:36:46] look, as I was telling you, it happened to me- I rebased on dbprov2001 instead of dbprov1001 :-D
[11:37:16] what we could do is creating an alert when web < HEAD
[11:37:31] and tell someone else to do that 0:-D
[11:43:03] DBA, Operations, Patch-For-Review: Add favicon to icinga and tendril - https://phabricator.wikimedia.org/T204110 (jcrespo) Open→Resolved The work is completed after we deployed the change to production: {F31701300} Thank you very much @Privacybatm for your contribution.
[11:43:31] ^🥳
[11:45:14] <3
[11:46:05] I can see the favicon!!! \o/
[12:03:00] I am going to start taxing my regex checking skills :-D
[12:03:12] hahaha
[12:03:17] last time I did it, I did 10x at the same time
[12:03:24] so it was easier
[12:03:31] in case that works for you
[12:03:31] This is the last one :)
[12:03:42] well, I hope not
[12:03:49] For this batch, yes :)
[12:03:50] I hope we reimage more in the future!
[12:03:53] :-D
[12:47:22] DBA, Patch-For-Review: Install 1 buster+10.4 host per section - https://phabricator.wikimedia.org/T246604 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db2115.codfw.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/202003251247_...
[12:57:13] I love having a favicon
[12:57:17] Now I can see the tab super quick
[13:04:54] DBA, Patch-For-Review: Install 1 buster+10.4 host per section - https://phabricator.wikimedia.org/T246604 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db2115.codfw.wmnet'] ` Of which those **FAILED**: ` ['db2115.codfw.wmnet'] `
[13:36:14] DBA, Patch-For-Review: Install 1 buster+10.4 host per section - https://phabricator.wikimedia.org/T246604 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db2115.codfw.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/202003251336_...
[14:07:59] yep
[14:13:38] DBA: Install 1 buster+10.4 host per section - https://phabricator.wikimedia.org/T246604 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db2115.codfw.wmnet'] ` and were **ALL** successful.
[14:50:37] DBA, Core Platform Team, Wikimedia-Rdbms: Mysterious replication lag observed by MW in Codfw - https://phabricator.wikimedia.org/T248481 (Krinkle)
[14:52:42] DBA, Performance-Team, WMF-JobQueue, Wikimedia-Rdbms, Wikimedia-Incident: read only on mediawiki generates "LoadBalancer.php: Cannot access the database: Unknown error" - https://phabricator.wikimedia.org/T218692 (Krinkle) Stalled→Resolved This is now confirmed to be fixed. With @Maro...
[14:53:11] DBA, Core Platform Team, Wikimedia-Rdbms: Mysterious replication lag observed by MW in Codfw - https://phabricator.wikimedia.org/T248481 (Marostegui) For the record, confirmed no lag shown on pt-heartbeat table entries or `show slave status\G`.
[14:56:23] DBA, Core Platform Team, Wikimedia-Rdbms: Mysterious replication lag observed by MW in Codfw - https://phabricator.wikimedia.org/T248481 (Marostegui) And also it happens from time to time with eqiad: https://logstash.wikimedia.org/goto/e1b9fb34d80a6e9f1255764a341c7c74
[15:02:50] DBA, Core Platform Team, Wikimedia-Rdbms: Mysterious replication lag observed by MW in Codfw - https://phabricator.wikimedia.org/T248481 (jcrespo) > there is no real lag I've already written many dissertations :-) about why this happens, you will find them on many other tickets. This happens on eqiad...
[15:12:08] DBA, Core Platform Team, Wikimedia-Rdbms: Mysterious replication lag observed by MW in Codfw - https://phabricator.wikimedia.org/T248481 (Krinkle) We have code in various places that ask for the "lag" of the data they are about to process. For example, to inform a TTL for a cache or long-term storage...
[15:18:47] DBA, Core Platform Team, Wikimedia-Rdbms: Mysterious replication lag observed by MW in Codfw - https://phabricator.wikimedia.org/T248481 (jcrespo) Just to be clear, I know there is a reason to it- but it is something to have into account when logging/taking decisions. Note it is *not only* REPEATABLE...
[15:22:39] DBA, Core Platform Team, Wikimedia-Rdbms: Mysterious replication lag observed by MW in Codfw - https://phabricator.wikimedia.org/T248481 (Anomie) It seems unlikely that anyone will be able to debug this unless they catch it actively happening. At the moment, that does not seem to be the case. ` line...
[15:24:55] DBA, Wikimedia-Rdbms: Mysterious replication lag observed by MW in Codfw - https://phabricator.wikimedia.org/T248481 (Anomie)
[15:33:28] DBA, Wikimedia-Rdbms: Mysterious replication lag observed by MW in Codfw - https://phabricator.wikimedia.org/T248481 (Marostegui) >>! In T248481#5998733, @Anomie wrote: > > Is it possible that there was real lag at 14:14:30, but it had resolved itself by the time @Marostegui checked it? > From what I...
[15:45:13] DBA, cloud-services-team (Kanban): Drop nova and nova_api databases from m5 - https://phabricator.wikimedia.org/T248313 (aborrero) We talked about this in our team meeting. Please go ahead and clean this up :-) thanks!