[00:30:21] DBA, Data-Services, Patch-For-Review, cloud-services-team (Kanban): Deploy labsdbuser and views to new clouddb hosts - https://phabricator.wikimedia.org/T268312 (Bstorm) I haven't created all the users yet either. I'm going to need to deploy https://gerrit.wikimedia.org/r/c/operations/puppet/+/64...
[00:31:32] DBA, Data-Services, Patch-For-Review, cloud-services-team (Kanban): Deploy labsdbuser and views to new clouddb hosts - https://phabricator.wikimedia.org/T268312 (Bstorm) PS - I am aware of the wmf-pt-killer script setup causing puppet to fail. I'll get that tomorrow.
[05:52:03] DBA, Data-Services, Patch-For-Review, cloud-services-team (Kanban): Deploy labsdbuser and views to new clouddb hosts - https://phabricator.wikimedia.org/T268312 (Marostegui) >>! In T268312#6657838, @Bstorm wrote: > I haven't created all the users yet either. I'm going to need to deploy https://ge...
[06:04:04] DBA, decommission-hardware: decommission es1017.eqiad.wmnet - https://phabricator.wikimedia.org/T268825 (Marostegui) I have depooled this host to give it a kernel upgrade for T264154 (I won't repool it anymore).
[06:07:13] DBA, decommission-hardware: decommission es1017.eqiad.wmnet - https://phabricator.wikimedia.org/T268825 (Marostegui)
[06:07:53] DBA, decommission-hardware: decommission es1018.eqiad.wmnet - https://phabricator.wikimedia.org/T269069 (Marostegui)
[06:08:48] Blocked-on-schema-change, DBA: Schema change for renaming namespace_title index on watchlist - https://phabricator.wikimedia.org/T268004 (Marostegui) s3 eqiad progress [x] dbstore1004 [] db1123 [] db1112 [x] db1095 [] db1078 [] db1075
[06:13:37] DBA, decommission-hardware, Patch-For-Review: decommission es1018.eqiad.wmnet - https://phabricator.wikimedia.org/T269069 (Marostegui)
[06:17:12] DBA, decommission-hardware: decommission es1017.eqiad.wmnet - https://phabricator.wikimedia.org/T268825 (Marostegui) This host was rebooted and, as expected, never came back. The idrac also doesn't work...
[06:17:14] DBA, decommission-hardware: decommission es1017.eqiad.wmnet - https://phabricator.wikimedia.org/T268825 (Marostegui)
[06:17:16] DBA, decommission-hardware, Patch-For-Review: decommission es1018.eqiad.wmnet - https://phabricator.wikimedia.org/T269069 (Marostegui)
[06:40:35] DBA: Replication broken on db1124:3311 - https://phabricator.wikimedia.org/T269072 (Marostegui)
[06:40:46] DBA: Replication broken on db1124:3311 - https://phabricator.wikimedia.org/T269072 (Marostegui) Open→Resolved p:Triage→Medium
[06:54:34] DBA, decommission-hardware, Patch-For-Review: decommission es1017.eqiad.wmnet - https://phabricator.wikimedia.org/T268825 (Marostegui)
[07:46:09] es growth seems to have slowed down
[07:46:21] (to normal levels)
[07:58:59] DBA, Data-Services, Patch-For-Review, cloud-services-team (Kanban): Deploy labsdbuser and views to new clouddb hosts - https://phabricator.wikimedia.org/T268312 (Marostegui) labsdb user and grants deployed: ` clouddb1013:3311 929 clouddb1013:3313 929 clouddb1014:3312 929 clouddb1014:3317 929 clou...
[07:59:57] DBA, Data-Services, Patch-For-Review, cloud-services-team (Kanban): Deploy labsdbuser and views to new clouddb hosts - https://phabricator.wikimedia.org/T268312 (Marostegui)
[08:04:22] DBA: Add a link engineering: Database for link recommendation service - https://phabricator.wikimedia.org/T267214 (Marostegui) a:Marostegui
[08:28:54] the grants for m1 pki and idp weren't created correctly, so backups didn't run there
[08:30:35] ok, will check later
[08:30:42] I will take care
[08:30:45] ok thanks
[08:30:54] I have to redo the grants and redo the m1 backups
[08:41:38] DBA, Patch-For-Review: Add a link engineering: Database for link recommendation service - https://phabricator.wikimedia.org/T267214 (Marostegui) The database has been created under m2 (so it lives together with `recommendationapi`). As agreed, two users were created: - RW one to do the imports: `adminli...
[08:45:51] DBA, Patch-For-Review: Add a link engineering: Database for link recommendation service - https://phabricator.wikimedia.org/T267214 (Marostegui) @jcrespo can you add this database to the backups? Thanks.
[08:47:42] DBA, Patch-For-Review: Add a link engineering: Database for link recommendation service - https://phabricator.wikimedia.org/T267214 (jcrespo) > @jcrespo can you add this database to the backups? roger and willco. Doing right now.
[08:54:18] there is no reviewdb (gerrit) on m2 anymore, does anyone know if that was deleted at some point?
[08:55:09] I think they were, but grants not cleaned up: T255715
[08:55:09] T255715: Make sure both `reviewdb-test` (used for gerrit upgrade testing) and `reviewdb` (formerly production) databases get torn down - https://phabricator.wikimedia.org/T255715
[08:57:27] root@db1107.eqiad.wmnet[mysql]> select * from user where user like '%gerrit%' or user like '%review%';
[08:57:27] Empty set (0.002 sec)
[08:58:18] yeah, it is the dump grants I mentioned at: https://phabricator.wikimedia.org/T255715#6304445
[08:58:35] I am taking care of those as I add the new ones
[08:58:41] ok cool, thanks
[08:59:33] don't worry about those, just ping me when creating and deleting dbs and I will take care, no problem on my side, and less worries for you :-D
[09:00:07] I will also set up monitoring so we don't need any manual checks
[09:00:36] for "backups of non-existent dbs" and "no backups of existent dbs"
[09:00:47] we really need a better grant management system :-)
[09:01:27] so we will trick kormat into working on that next :-)
[09:01:58] * kormat makes sign of "warding off evil"
[09:04:49] kormat: when we hire one more person, you will be able to drop piles of work into them!
[09:05:13] lol:
[09:05:16] > To form the gesture, use your thumb to hold down your middle and ring fingers, then extend your pointer and pinkie like horns. Though this might ward off evil spirits, it could also attract heavy metal fans or University of Texas fans.
[09:05:43] is that from Wikipedia, the Free Encyclopedia?
[09:06:21] https://www.chicagotribune.com/news/ct-xpm-2006-10-13-0610130345-story.html in this case :)
[09:21:40] DBA, Patch-For-Review: Add a link engineering: Database for link recommendation service - https://phabricator.wikimedia.org/T267214 (jcrespo) >>! In T267214#6658399, @Marostegui wrote: > @jcrespo can you add this database to the backups? {icon check} Backups of m2 added for the mwaddlink, on both datac...
[09:22:01] DBA, Patch-For-Review: Add a link engineering: Database for link recommendation service - https://phabricator.wikimedia.org/T267214 (Marostegui) Thank you!
[09:22:40] marostegui: i'm so very glad we have the instance<->port mapping as config now instead of code re: ^. props to jynus for the work on that <3
[09:24:29] \o/
[09:32:41] DBA, decommission-hardware, Patch-For-Review: decommission es1018.eqiad.wmnet - https://phabricator.wikimedia.org/T269069 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by volans@cumin2001 for hosts: `es1018.eqiad.wmnet` - es1018.eqiad.wmnet (**PASS**) - Downtimed host on Icinga -...
[09:33:40] thanks, kormat, although I think you were the one who inspired it
[09:34:30] one thing you should double check is the "alternative port" puppet code conflicts
[09:34:58] I think we jumped to 3350 for some reason, but I cannot remember why
[09:35:53] DBA, decommission-hardware, Patch-For-Review: decommission es1018.eqiad.wmnet - https://phabricator.wikimedia.org/T269069 (Marostegui)
[09:36:14] jynus: we jumped to 3350 to leave some space for production sections, aiui
[09:37:22] ah, it was before my time: https://gerrit.wikimedia.org/r/c/operations/puppet/+/489644
[09:37:28] just I am worried we could be forgetting about something; prometheus, alternative ports, etc.
[09:37:43] what do you mean re: "alternative port"?
[09:38:00] kormat: mysql opens 2 ports, one for normal operation
[09:38:24] and another for X additional slots for accounts with SUPER privileges (e.g. if the first is full)
[09:38:52] ahh, i see. i'll have a look.
[09:38:54] check the templates to see where those fall
[09:39:31] I think it shouldn't be an issue, but knowing for sure would be safer
[09:40:11] DBA, decommission-hardware, Patch-For-Review: decommission es1018.eqiad.wmnet - https://phabricator.wikimedia.org/T269069 (Marostegui) Homer was run successfully: ` # homer asw2-d-eqiad* commit "T269069" INFO:homer.devices:Initialized 35 devices INFO:homer:Committing config for query asw2-d-eqiad* wit...
[09:41:55] jynus: hmm. it seems for core single-instance hosts the extra port is 3307. extra-port is not configured for anything else except for phabricator, where it's port_num+20
[09:42:30] it is not configured for multiinstance core?
[09:42:57] I think that is the place where it could conflict with other instances
[09:43:37] correct.
[09:44:00] just give it a look and do any adjustments if needed, just something that appeared on my mind when I saw 3340
[09:44:02] you know there is also the - sign, not only the + sign right? :-P
[09:44:22] volans--
[09:44:23] oh yeah
[09:46:58] DBA, decommission-hardware: decommission es1017.eqiad.wmnet - https://phabricator.wikimedia.org/T268825 (Marostegui) a:LSobanski→Marostegui
[09:51:16] DBA, Patch-For-Review: Add a link engineering: Database for link recommendation service - https://phabricator.wikimedia.org/T267214 (Marostegui) Added the new database to our wikitech page: https://wikitech.wikimedia.org/w/index.php?title=MariaDB%2Fmisc&type=revision&diff=1890045&oldid=1889656
[10:04:26] DBA, Patch-For-Review: Add a link engineering: Database for link recommendation service - https://phabricator.wikimedia.org/T267214 (kostajh) Thank you @Marostegui and @jcrespo! > @kostajh if you need the credentials also to start the initial imports and all that, let me know on which server I can left...
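Editor's note on the extra-port discussion above: the worry is that an instance's extra port could land on another instance's main port on a multi-instance host. A minimal sketch of such a check follows, assuming a made-up instance-to-port mapping (`EXAMPLE_PORTS`) and helper names (`extra_port_plus_20`, `find_conflicts`) that are not taken from the puppet code; only the `port_num+20` rule comes from the log.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical instance -> main port mapping, for illustration only.
# The real mapping lives in the configuration mentioned in the log.
EXAMPLE_PORTS: Dict[str, int] = {"a": 3311, "b": 3312, "c": 3331}


def extra_port_plus_20(port: int) -> int:
    """Rule quoted in the log for phabricator: extra port = port_num + 20."""
    return port + 20


def find_conflicts(
    ports: Dict[str, int], rule: Callable[[int], int]
) -> List[Tuple[str, int, str]]:
    """Return (instance, extra_port, clashing_instance) for every extra port
    that equals the main port of a different instance."""
    by_port = {port: name for name, port in ports.items()}
    conflicts = []
    for name, port in sorted(ports.items()):
        extra = rule(port)
        other = by_port.get(extra)
        if other is not None and other != name:
            conflicts.append((name, extra, other))
    return conflicts


if __name__ == "__main__":
    # Instance "a" (3311) would get extra port 3331, which "c" already uses.
    print(find_conflicts(EXAMPLE_PORTS, extra_port_plus_20))
```

The same function could be run with any candidate rule (a fixed offset, a subtraction, a separate range) to see whether it stays clear of the main ports before standardizing on it.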
[10:12:33] Blocked-on-schema-change, DBA: Schema change for renaming namespace_title index on watchlist - https://phabricator.wikimedia.org/T268004 (Marostegui) db1139:3311 done
[10:12:43] Blocked-on-schema-change, DBA: Schema change for renaming namespace_title index on watchlist - https://phabricator.wikimedia.org/T268004 (Marostegui)
[10:14:18] Blocked-on-schema-change, DBA: Schema change for renaming namespace_title index on watchlist - https://phabricator.wikimedia.org/T268004 (Marostegui)
[10:14:47] DBA, Patch-For-Review: Add a link engineering: Database for link recommendation service - https://phabricator.wikimedia.org/T267214 (Marostegui) No, they are the same ones
[10:21:39] DBA: Standardize extra-port for mariadb instances - https://phabricator.wikimedia.org/T269097 (Kormat)
[10:38:26] DBA, CheckUser: Monitor the growth of CheckUser tables at enwiki and few other very large wikis - https://phabricator.wikimedia.org/T267275 (Marostegui)
[11:23:14] Blocked-on-schema-change, DBA: Schema change for renaming namespace_title index on watchlist - https://phabricator.wikimedia.org/T268004 (Marostegui)
[11:23:30] Blocked-on-schema-change, DBA: Schema change for renaming namespace_title index on watchlist - https://phabricator.wikimedia.org/T268004 (Marostegui) Open→Resolved This is all done
[11:26:49] Blocked-on-schema-change, DBA: Schema change for timestamp fields of jobs table - https://phabricator.wikimedia.org/T268391 (Marostegui) a:Marostegui
[11:31:15] DBA: New database request: sockpuppet - https://phabricator.wikimedia.org/T268505 (Marostegui) p:Triage→Medium @hnowlan let's throttle the writes if we can and it is not too much of a hassle. This database will live with many more, so let's make sure we don't overload the host. Regarding the grants...
[11:32:19] DBA: New database request: sockpuppet - https://phabricator.wikimedia.org/T268505 (Marostegui) Also, if possible I would prefer if we go for a more generic username, instead of `hnowlan`, just in case, as this can lead to confusion if, for instance, you stop being the owner of this service :)
[11:43:05] DBA, Orchestrator, User-Kormat: Enable report_host for mariadb - https://phabricator.wikimedia.org/T266483 (Marostegui)
[11:44:23] DBA, Data-Services, Patch-For-Review, cloud-services-team (Kanban): Productionize clouddb10[13-20] - https://phabricator.wikimedia.org/T267090 (Marostegui)
[12:14:33] DBA, Data-Services, Patch-For-Review, cloud-services-team (Kanban): Productionize clouddb10[13-20] - https://phabricator.wikimedia.org/T267090 (Marostegui)
[12:35:01] FYI, restarting apache2 on dbmonitor1001 to pick up a security update
[12:35:11] thanks
[13:10:50] marostegui: https://gerrit.wikimedia.org/r/c/operations/software/+/644515 needs revert
[13:10:57] after checking the tables, it will not work as is
[13:11:13] why is that?
[13:11:19] it needs a PK, and those are not PKs, so it will need software changes
[13:11:32] it will work but it will generate lots of false positives
[13:11:51] :_(
[13:11:54] as it assumes the column used for index will not have repeating values
[13:12:04] it should be able to make it work with multiple columns
[13:12:11] yeah, sounds familiar
[13:12:14] but needs small software changes
[13:12:35] small doesn't necessarily mean easy, I haven't checked
[13:12:44] probably not easy
[13:13:16] so ideally it should say on the cols pl_from, pl_namespace,pl_title
[13:13:32] check if it works as is, and we can modify the patch
[13:13:42] if it doesn't work, it will need software changes
[13:13:44] I have reverted for now
[13:13:48] but it has to use the pk
[13:14:14] if it is just an ORDER BY it may work?
[13:14:46] I don't think so, cause those PK have 3 columns so the first one can be ok, but then the others?
[13:15:00] oh, the iteration
[13:15:00] does it check the whole row or just the PK value?
[13:15:22] it checks the whole value always, those are just how it iterates
[13:15:30] the "index" to iterate
[13:15:49] but it is true it may not work with the where
[13:16:10] yeah, I don't think it would
[13:16:22] wait
[13:16:34] I think it will, with some caveats
[13:16:44] it has a column argument, and an order by argument
[13:16:47] it has to use both
[13:16:58] Yeah, but the PK can have 3, and we are checking just one of them
[13:17:16] that's why I asked if it compares the whole row or just the PK values
[13:17:22] if it checks the whole row, it might
[13:17:44] so wiki pagelinks pl_from ... --order-by="pl_from, pl_namespace,pl_title"
[13:17:50] should kinda-work
[13:18:02] I remember adding that option precisely for this use case
[13:18:28] indexing only on one column, but ordering by the whole thing
[13:18:28] you mean just doing pagelinks PK and then using --order-by and adding the whole PK there?
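Editor's note on the idea being discussed here (iterate in ranges over a single column, order each chunk by the whole primary key, compare whole rows): the sketch below illustrates it on a simplified `pagelinks` table in in-memory SQLite. It is not the code of the compare tool mentioned in the log; `make_db`, `chunk` and `differing_ranges` are hypothetical helpers, and the table has only the three key columns.

```python
import sqlite3
from typing import List, Sequence, Tuple

Row = Tuple[int, int, str]
PK = "pl_from, pl_namespace, pl_title"  # composite key named in the log


def make_db(rows: Sequence[Row]) -> sqlite3.Connection:
    """Build an in-memory, simplified pagelinks table (illustration only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE pagelinks (pl_from INTEGER, pl_namespace INTEGER, "
        f"pl_title TEXT, PRIMARY KEY ({PK}))"
    )
    conn.executemany("INSERT INTO pagelinks VALUES (?, ?, ?)", rows)
    return conn


def chunk(conn: sqlite3.Connection, low: int, high: int) -> List[Row]:
    """Select a range on the single, non-unique column pl_from, but order by
    the whole key so both sides return rows in the same order."""
    query = (
        "SELECT pl_from, pl_namespace, pl_title FROM pagelinks "
        f"WHERE pl_from BETWEEN ? AND ? ORDER BY {PK}"
    )
    return conn.execute(query, (low, high)).fetchall()


def differing_ranges(
    a: sqlite3.Connection, b: sqlite3.Connection, max_id: int, step: int
) -> List[Tuple[int, int]]:
    """Compare whole rows chunk by chunk; return the ranges that differ."""
    bad = []
    for low in range(1, max_id + 1, step):
        high = low + step - 1
        if chunk(a, low, high) != chunk(b, low, high):
            bad.append((low, high))
    return bad


if __name__ == "__main__":
    rows = [(1, 0, "A"), (1, 0, "B"), (2, 0, "A"), (5, 1, "C")]
    same = make_db(list(reversed(rows)))           # same data, other insert order
    drifted = make_db(rows[:-1] + [(5, 1, "D")])   # one row differs
    print(differing_ranges(make_db(rows), same, max_id=6, step=2))
    print(differing_ranges(make_db(rows), drifted, max_id=6, step=2))
```

Because the range is taken on a column that repeats, the number of rows per chunk is not bounded by `step`; a single `pl_from` value with many links still produces one large query.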
[13:18:39] yeah
[13:18:56] but the check_tables.txt would still not be enough to iterate
[13:19:33] well, we can modify it to add a third column orderby, and duplicate the 2nd parameters
[13:19:47] the thing is to know if it would work
[13:20:45] let me double check the code to see how it iterates
[13:22:03] in theory it would work, the problem is it would generate very large queries
[13:22:30] as it does between id 1000 and 2000
[13:22:45] so it would not generate safe queries - they would be long running ones
[13:23:01] yeah, better not to try
[13:23:18] I think it would be ok, with very small steps
[13:23:41] try on a very small wiki
[13:23:45] on codfw
[13:23:48] see how it works
[13:24:18] but the parameters would be the ones you defined for column, and an extra one --order-by with the whole PK
[13:24:22] I will try some other day, don't have much spare time now
[13:24:24] ok
[13:24:31] I will create a task for it later
[13:25:04] for now, I think it is easier to dump + diff
[13:25:09] or that is what I would do
[13:25:26] yeah, but dumping+diff those massive tables...
[13:25:38] better than nothing :-)
[13:25:56] it is the problem of multi-column pks
[13:26:00] or string-pks
[13:26:03] :-/
[13:29:37] I think I will try with a normal compare and examine the false positives if any
[13:29:46] So far it worked fine on templatelinks
[13:30:04] there were two results and they were real
[13:30:31] so if the files were generated from the same source, they most likely return in the same order
[13:30:41] but I am unsure it is guaranteed
[13:31:12] it may be for innodb
[13:31:48] the problem is it still can generate large selects (e.g. all templates on commons using CC-BY-SA-3.0)
[13:35:22] similar error just happened on clouddb1013
[13:35:45] I am on that
[13:38:14] DBA, Data-Services, Patch-For-Review, cloud-services-team (Kanban): Productionize clouddb10[13-20] - https://phabricator.wikimedia.org/T267090 (Marostegui)
[13:44:08] DBA, Data-Services, Patch-For-Review, cloud-services-team (Kanban): Deploy labsdbuser and views to new clouddb hosts - https://phabricator.wikimedia.org/T268312 (Marostegui) @Bstorm clouddb1017:3311 clouddb1013:3311 are currently down as I need to rebuild them due to some data inconsistency which...
[14:49:10] DBA, Data-Services, Patch-For-Review, cloud-services-team (Kanban): Productionize clouddb10[13-20] - https://phabricator.wikimedia.org/T267090 (Marostegui)
[14:55:25] DBA: transfer.py fails when copying data between es hosts - https://phabricator.wikimedia.org/T262388 (Marostegui) This just happened: ` # transfer.py --no-checksum --no-encrypt db1106.eqiad.wmnet:/srv/sqldata clouddb1017.eqiad.wmnet:/srv 2020-12-01 13:42:02 INFO: About to transfer /srv/sqldata from db1106....
[14:56:06] DBA: transfer.py fails when copying data between es hosts - https://phabricator.wikimedia.org/T262388 (Marostegui) And the same on the other transfer: ` # transfer.py --no-checksum --no-encrypt db1106.eqiad.wmnet:/srv/sqldata clouddb1013.eqiad.wmnet:/srv 2020-12-01 13:41:21 INFO: About to transfer /srv/sqld...
[14:57:45] DBA: transfer.py fails when copying data between es hosts - https://phabricator.wikimedia.org/T262388 (Marostegui) However the process is still there on cumin1001: ` root 12196 0.0 0.0 15820 6376 pts/16 S 13:41 0:00 ssh -F /etc/cumin/ssh_config -oForwardAgent=no -oForwardX11=no -oConnectTimeou...
[15:05:51] DBA: transfer.py fails when copying data between es hosts - https://phabricator.wikimedia.org/T262388 (jcrespo) If some process is killing cumin1001 connections, the transfer will fail. The most likely cause is a temporary loss of connections between cumin and some of the hosts, which closes the ssh connection.
[15:07:47] DBA: transfer.py fails when copying data between es hosts - https://phabricator.wikimedia.org/T262388 (jcrespo) I just read the last comment, then that wouldn't explain that. I started a transfer.py run through remote-backup-mariadb at 14:58:36, would that match?
[15:09:13] DBA: transfer.py fails when copying data between es hosts - https://phabricator.wikimedia.org/T262388 (jcrespo) BTW, you don't need "--no-checksum --no-encrypt", those options are by default on '/etc/transferpy/transferpy.conf'.
[15:09:51] DBA: transfer.py fails when copying data between es hosts - https://phabricator.wikimedia.org/T262388 (Marostegui) What should we do with this task then? Is it worth having it open or should we just assume it can happen and we need to retry?
[15:11:03] DBA: transfer.py fails when copying data between es hosts - https://phabricator.wikimedia.org/T262388 (Marostegui) >>! In T262388#6659978, @jcrespo wrote: > I just read the last comment, then that wouldn't explain that. > > I started a transfer.py run through remote-backup-mariadb at 14:58:36, would that ma...
[15:11:43] DBA, CheckUser: Monitor the growth of CheckUser tables at enwiki and few other very large wikis - https://phabricator.wikimedia.org/T267275 (Marostegui)
[15:12:17] DBA: transfer.py fails when copying data between es hosts - https://phabricator.wikimedia.org/T262388 (jcrespo) > What should we do with this task then? Is it worth having it open or should we just assume it can happen and we need to retry? I don't know why it is failing, I would need time to debug. @LSoba...
[15:17:07] DBA, Community-Tech, Expiring-Watchlist-Items: Monitor the growth of watchlist table at wikidata and wikicommons - https://phabricator.wikimedia.org/T268096 (Marostegui)
[15:35:30] DBA, CheckUser: Monitor the growth of CheckUser tables at enwiki and few other very large wikis - https://phabricator.wikimedia.org/T267275 (Marostegui)
[15:36:53] DBA, CheckUser: Monitor the growth of CheckUser tables at enwiki and few other very large wikis - https://phabricator.wikimedia.org/T267275 (Marostegui) I want to monitor enwiki size two more weeks, as there was a big increase from one week to another. Let's see if that's a trend
[15:38:42] DBA, CheckUser: Monitor the growth of CheckUser tables at enwiki and few other very large wikis - https://phabricator.wikimedia.org/T267275 (Marostegui)
[16:44:07] DBA: New database request: sockpuppet - https://phabricator.wikimedia.org/T268505 (hnowlan)
[17:04:56] DBA: New database request: sockpuppet - https://phabricator.wikimedia.org/T268505 (hnowlan) >>! In T268505#6659031, @Marostegui wrote: > @hnowlan let's throttle the writes if we can and it is not too much of a hassle. This database will live with many more, so let's make sure we don't overload the host. Thr...
[18:33:03] FYI, I submitted the wikibugs patch to send Data-Persistence-Backup activity to the channel, so that you're not surprised when these show up.
[20:00:21] DBA, Data-Services, Patch-For-Review, cloud-services-team (Kanban): Deploy labsdbuser and views to new clouddb hosts - https://phabricator.wikimedia.org/T268312 (Bstorm) >>! In T268312#6658333, @Marostegui wrote: > @Bstorm you can go ahead from your side, and let's mark the hosts as done once eve...
[21:09:08] DBA, GrowthExperiments, Growth-Team (Current Sprint), Patch-For-Review, and 2 others: Slow load times for Special:Homepage on cswiki - https://phabricator.wikimedia.org/T267216 (kostajh) a:Tgr→None
[21:38:03] DBA, Data-Services, Patch-For-Review, cloud-services-team (Kanban): Deploy labsdbuser and views to new clouddb hosts - https://phabricator.wikimedia.org/T268312 (Bstorm) Running create views across all the hosts except clouddb1013 and clouddb1017, I got this anomaly: ` 2020-12-01 21:30:17,981 W...
[22:38:53] DBA, SRE-tools, conftool, serviceops, and 2 others: Alerting spam and wrong state of primary dc source info on databases while switching dc from eqiad -> codfw - https://phabricator.wikimedia.org/T261767 (RLazarus) Open→Resolved a:RLazarus >>! In T261767#6450152, @Marostegui wrote: >...
[23:29:54] DBA, DC-Ops, Operations, ops-codfw, Patch-For-Review: (Need By: 2020-11-29) rack/setup/install db214[234] - https://phabricator.wikimedia.org/T267041 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` db2142.codfw.wmnet ` The log can be f...
[23:29:59] DBA, DC-Ops, Operations, ops-codfw, Patch-For-Review: (Need By: 2020-11-29) rack/setup/install db214[234] - https://phabricator.wikimedia.org/T267041 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db2142.codfw.wmnet'] ` Of which those **FAILED**: ` ['db2142.codfw.wmnet'] `
[23:30:09] DBA, DC-Ops, Operations, ops-codfw, Patch-For-Review: (Need By: 2020-11-29) rack/setup/install db214[234] - https://phabricator.wikimedia.org/T267041 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` db2142.codfw.wmnet ` The log can be f...