[06:38:22] DBA, Data-Services, Patch-For-Review, cloud-services-team (Kanban): Productionize clouddb10[13-20] - https://phabricator.wikimedia.org/T267090 (Marostegui) Thanks @Dzahn for the above!. I have fixed them, basically it was the old single instance pt-kill service, which has been replaced by a multi...
[06:40:57] DBA, Data-Services, Patch-For-Review, cloud-services-team (Kanban): Deploy labsdbuser and views to new clouddb hosts - https://phabricator.wikimedia.org/T268312 (Marostegui)
[06:54:53] DBA, Security: Update Buster DB hosts to 4.19.160 - https://phabricator.wikimedia.org/T272121 (Marostegui)
[06:58:16] DBA, Security: Update Buster DB hosts to 4.19.160 - https://phabricator.wikimedia.org/T272121 (Marostegui) p:Triage→Medium
[07:08:57] DBA, Security: Update Buster DB hosts to 4.19.160 - https://phabricator.wikimedia.org/T272121 (Marostegui)
[07:09:25] DBA, Security: Update Buster DB hosts to 4.19.160 - https://phabricator.wikimedia.org/T272121 (Marostegui) ` root@dbproxy2004:~# dpkg -l | grep linux | grep 160 ii linux-image-4.19.0-13-amd64 4.19.160-2 amd64 Linux 4.19 for 64-bit PCs (signed) ii linux-perf-4.19...
[07:13:18] DBA, Security: Update Buster DB hosts to 4.19.160 - https://phabricator.wikimedia.org/T272121 (Marostegui) @elukey can you take care of db1108?
[07:46:22] DBA, SRE: Refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[08:39:56] DBA, Security: Update Buster DB hosts to kernel 4.19.160 - https://phabricator.wikimedia.org/T272121 (Kormat)
[09:09:42] DBA, SRE, ops-eqiad: Memory errors on clouddb1019 - https://phabricator.wikimedia.org/T272125 (Marostegui)
[09:11:26] DBA, SRE, ops-eqiad: Memory errors on clouddb1019 - https://phabricator.wikimedia.org/T272125 (Marostegui)
[09:12:40] DBA, SRE, ops-eqiad: Memory errors on clouddb1019 - https://phabricator.wikimedia.org/T272125 (Marostegui) p:Triage→High Setting to high as we are trying to finish up the new wiki replicas infra
[09:24:48] hey folks
[09:25:00] hi
[09:26:02] we may or may not have some issues with toolsdb clouddb1001.clouddb-services.eqiad1.wikimedia.cloud, I'm investigating
[09:28:50] mmmm
[09:28:59] perhaps something going on with wiki replicas?
[09:29:01] Jan 15 09:23:37 labstore1004 /usr/local/sbin/maintain-dbusers[25861]: Could not connect to clouddb1019.eqiad.wmnet:3316 due to (2003, "Can't connect to MySQL server on 'clouddb1019.eqiad.wmnet' (timed out)"). Skipping.
[09:29:06] arturo: yes, it is down
[09:29:13] arturo: https://phabricator.wikimedia.org/T272125
[09:29:57] oh, ok!
[09:30:27] I wonder if this could be related to this as well?
[09:30:29] Jan 15 08:48:40 labstore1004 /usr/local/sbin/maintain-dbusers[17708]: Could not connect to clouddb1017.eqiad.wmnet:3311 due to (2003, "Can't connect to MySQL server on 'clouddb1017.eqiad.wmnet' ([Errno 111] Connection refused)"). Skipping.
[09:30:49] DBA, SRE, ops-eqiad: Memory errors on clouddb1019 - https://phabricator.wikimedia.org/T272125 (Marostegui) There's something going on with this host: ` racadm>>serveraction powerstatus Server power status: OFF racadm>>serveraction powerup Server power operation initiated successfully racadm>>serverac...
[09:30:59] arturo: yes, they were all restarted for kernel upgrade
[09:31:04] and clouddb1019 never came back
[09:31:09] ok, excellent
[09:32:43] so I expect the maintain-dbusers to be failing for a while then. I'm glad the root cause surfaced
[09:34:04] DBA, SRE, ops-eqiad: Memory errors on clouddb1019 - https://phabricator.wikimedia.org/T272125 (aborrero)
[09:34:07] DBA, SRE, ops-eqiad: Memory errors on clouddb1019 - https://phabricator.wikimedia.org/T272125 (Marostegui) And without doing anything again: ` racadm>>serveraction powerstatus Server power status: ON `
[09:34:39] DBA, SRE, ops-eqiad: Memory errors on clouddb1019 - https://phabricator.wikimedia.org/T272125 (Marostegui) Interestingly, I cannot see anything on the console, so I have no idea what it is doing and if it is rebooting or doing something else.
[09:35:58] Data-Persistence-Backup, SRE, SRE-swift-storage, Traffic, netops: Depool codfw swift cluster - https://phabricator.wikimedia.org/T267338 (jcrespo) p:Triage→High
[09:39:24] let me know if we can help with anything
[09:40:14] arturo: it requires on-site love I am afraid
[09:41:56] ok :-/
[09:42:51] if they were already available, the only thing is to depool them
[09:43:13] none of the hosts are used
[09:45:27] we cannot easily depool them from maintain-dbusers, it gets DB names from puppetdb
[10:16:12] DBA, Orchestrator, SRE, CAS-SSO, User-Kormat: orchestrator: Support SSO - https://phabricator.wikimedia.org/T266106 (Kormat)
[11:05:35] DBA, Orchestrator, SRE, CAS-SSO, User-Kormat: orchestrator: Support SSO - https://phabricator.wikimedia.org/T266106 (Kormat)
[12:34:31] I am going to restart prometheus exporter for db2132, probably just the buster issue on metrics monitoring
[12:49:48] restarting db1139
[12:49:52] * jynus crosses fingers
[14:48:52] * bstorm reads backscroll
[14:49:39] Looks like I should add a way to skip replica servers in maintain-dbusers. It is *supposed* to skip gracefully, but not if things are hard down, I think.
[14:51:15] :(
[14:51:42] Yes, that'd be useful I think as this sort of thing can happen or even maintenances, and now we have more instances, so more things to fail too \o/
[16:03:57] DBA, SRE, ops-eqiad: Memory errors on clouddb1019 - https://phabricator.wikimedia.org/T272125 (wiki_willy) a:Cmjohnson
[16:11:15] DBA, Data-Services, cloud-services-team (Kanban): Prepare and check storage layer for bclwiktionary - https://phabricator.wikimedia.org/T270280 (Bstorm)
[16:12:59] DBA, Data-Services, cloud-services-team (Kanban): Prepare and check storage layer for diqwiktionary - https://phabricator.wikimedia.org/T270276 (Bstorm)
[16:19:29] DBA, SRE, ops-eqiad: Memory errors on clouddb1019 - https://phabricator.wikimedia.org/T272125 (Bstorm)
[21:59:19] DBA, Data-Services, cloud-services-team (Kanban): Prepare and check storage layer for skrwiktionary - https://phabricator.wikimedia.org/T268458 (Bstorm) Open→Resolved a:Bstorm This one is all set. `lang=mysql MariaDB [skrwiktionary_p]> select * from page limit 10; +---------+------------...
[22:11:47] DBA, Data-Services, cloud-services-team (Kanban): Prepare and check storage layer for diqwiktionary - https://phabricator.wikimedia.org/T270276 (Bstorm) Open→Resolved a:Bstorm All set `lang=mysql MariaDB [diqwiktionary_p]> select * from page limit 2; +---------+----------------+---------...
[22:21:45] DBA, Data-Services, cloud-services-team (Kanban): Prepare and check storage layer for bclwiktionary - https://phabricator.wikimedia.org/T270280 (Bstorm) Open→Resolved a:Bstorm This is done. `lang=mysql MariaDB [bclwiktionary_p]> select page_id,page_namespace,page_title from page limit 2;...
[22:58:43] DBA, Data-Services, cloud-services-team (Kanban): Prepare and check storage layer for wawikisource - https://phabricator.wikimedia.org/T269432 (Bstorm) Open→Resolved a:Bstorm This is done. `lang=mysql MariaDB [wawikisource_p]> select page_id,page_namespace,page_title from page limit 2; +...
[23:05:50] DBA, Data-Services, cloud-services-team (Kanban): Prepare and check storage layer for niawiki - https://phabricator.wikimedia.org/T270414 (Bstorm) Open→Resolved a:Bstorm This is done. `lang=mysql MariaDB [niawiki_p]> select page_id,page_namespace,page_title from page limit 2; +---------+...
[23:17:05] DBA, Data-Services, cloud-services-team (Kanban): Prepare and check storage layer for niawiktionary - https://phabricator.wikimedia.org/T270410 (Bstorm) Open→Resolved a:Bstorm This is done, `lang=mysql MariaDB [niawiktionary_p]> select page_id,page_namespace,page_title from page limit 2;...