[05:21:07] 10DBA, 10Wikimedia-Rdbms, 10Goal, 10MW-1.36-notes (1.36.0-wmf.20; 2020-12-01), and 5 others: FY18/19 TEC1.6 Q4: Improve or replace the usage of GTID_WAIT with pt-heartbeat in MW - https://phabricator.wikimedia.org/T221159 (10aaron) 05Open→03Declined I don't think it would be worth using pt-heartbeat fo...
[06:59:26] 10DBA, 10Data-Services, 10User-Kormat, 10cloud-services-team (Kanban): Parametrize wmf-pt-kill so it can connect to different sockets - https://phabricator.wikimedia.org/T260511 (10Marostegui) 05Open→03Resolved a:03Bstorm Thanks a lot Brooke for working on this. Looks like it is working fine, pt-kill...
[07:02:13] 10DBA, 10Data-Services, 10Patch-For-Review, 10cloud-services-team (Kanban): Deploy labsdbuser and views to new clouddb hosts - https://phabricator.wikimedia.org/T268312 (10Marostegui)
[08:22:09] 10Blocked-on-schema-change, 10DBA: Schema change for renaming user_properties_property index - https://phabricator.wikimedia.org/T270187 (10Marostegui) Started the schema change on s3 master with NO replication. It will take around 15 hours.
[08:22:42] 10Blocked-on-schema-change, 10DBA: Schema change for renaming user_properties_property index - https://phabricator.wikimedia.org/T270187 (10Marostegui)
[08:25:40] 10DBA, 10Data-Services, 10User-Kormat, 10cloud-services-team (Kanban): Parametrize wmf-pt-kill so it can connect to different sockets - https://phabricator.wikimedia.org/T260511 (10Marostegui) Note: I also tested the old hosts and a stop/start on them for each service.
[09:05:25] 10DBA, 10MediaWiki-Page-derived-data, 10Platform Engineering, 10TechCom-RFC, and 2 others: RFC: Normalize MediaWiki link tables - https://phabricator.wikimedia.org/T222224 (10jcrespo) I am happy for the above, don't get me wrong! But I may suggest a direction of priority of optimization, from a backup's p...
[09:39:30] 10DBA, 10Data-Services: Prepare and check storage layer for trwikivoyage - https://phabricator.wikimedia.org/T271261 (10Marostegui) >>!
In T271261#6738735, @Stashbot wrote: > {nav icon=file, name=Mentioned in SAL (#wikimedia-operations), href=https://sal.toolforge.org/log/PBjA9XYBhxWNv8gI9tM8} [2021-01-12T08:4...
[09:43:45] 10DBA, 10Data-Services: Prepare and check storage layer for bclwiktionary - https://phabricator.wikimedia.org/T270280 (10Marostegui) Database sanitized, `_p` database created, and grants given to `labsdbuser`. check private data came back clean. I also tested the triggers with my user on this wiki and it was o...
[09:44:03] 10DBA, 10Data-Services, 10cloud-services-team (Kanban): Prepare and check storage layer for niawiki - https://phabricator.wikimedia.org/T270414 (10Marostegui) Database sanitized, `_p` database created, and grants given to `labsdbuser`. check private data came back clean. I also tested the triggers with my us...
[09:44:10] 10DBA, 10Data-Services: Prepare and check storage layer for diqwiktionary - https://phabricator.wikimedia.org/T270276 (10Marostegui) Database sanitized, `_p` database created, and grants given to `labsdbuser`. check private data came back clean. I also tested the triggers with my user on this wiki and it was o...
[09:44:35] 10DBA, 10Data-Services, 10cloud-services-team (Kanban): Prepare and check storage layer for niawiktionary - https://phabricator.wikimedia.org/T270410 (10Marostegui) Database sanitized, `_p` database created, and grants given to `labsdbuser`. check private data came back clean. I also tested the triggers with...
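The repeated wiki-preparation updates above (sanitize, create the `_p` database, grant to `labsdbuser`) can be sketched roughly as follows. This is a minimal illustration, not the production tooling: the real views are generated by the maintain-views scripts, and the view list and columns here are assumptions.

```sql
-- Sketch of the "`_p` database created, grants given to labsdbuser" step,
-- using bclwiktionary as the example wiki. View definitions are illustrative.
CREATE DATABASE IF NOT EXISTS bclwiktionary_p;

-- The _p views expose only public columns of the underlying tables.
CREATE OR REPLACE VIEW bclwiktionary_p.page AS
    SELECT page_id, page_namespace, page_title
    FROM bclwiktionary.page;

-- Cloud replica users access the data through the labsdbuser role.
GRANT SELECT ON bclwiktionary_p.* TO labsdbuser;
```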
[10:15:48] 10DBA, 10Orchestrator, 10User-Kormat: Enable report_host on candidate masters - https://phabricator.wikimedia.org/T271106 (10Marostegui)
[10:16:54] 10DBA, 10Orchestrator, 10User-Kormat: Enable report_host on candidate masters - https://phabricator.wikimedia.org/T271106 (10Marostegui)
`
root@db1138:~# mysql -e "select @@report_host"
+--------------------+
| @@report_host      |
+--------------------+
| db1138.eqiad.wmnet |
+--------------------+
`
[10:17:05] 10DBA: Switchover s4 (commonswiki) from db1081 to db1138 - https://phabricator.wikimedia.org/T271427 (10Marostegui)
[10:22:08] 10DBA: Switchover s4 (commonswiki) from db1081 to db1138 - https://phabricator.wikimedia.org/T271427 (10Marostegui)
[10:27:49] 10DBA, 10Patch-For-Review: Test upgrading sanitarium hosts to Buster + 10.4 - https://phabricator.wikimedia.org/T268742 (10Marostegui) Sanitarium positions to replicate from on s7 {P13737}
[11:20:22] 10DBA: Switchover s4 (commonswiki) from db1081 to db1138 - https://phabricator.wikimedia.org/T271427 (10Marostegui)
[11:29:50] 10DBA, 10GrowthExperiments, 10Growth-Team (Current Sprint), 10Patch-For-Review, and 2 others: Slow load times for Special:Homepage on cswiki - https://phabricator.wikimedia.org/T267216 (10Urbanecm_WMF) >>! In T267216#6730772, @Marostegui wrote: >>>! In T267216#6730372, @Etonkovidova wrote: > [...] >> It se...
[11:30:52] 10DBA, 10GrowthExperiments, 10Growth-Team (Current Sprint), 10Patch-For-Review, and 2 others: Slow load times for Special:Homepage on cswiki - https://phabricator.wikimedia.org/T267216 (10Marostegui) @Urbanecm_WMF that explains it.....thanks much for the explanation :)
[11:38:36] 10DBA, 10Wikimedia-Etherpad: Upgrade and restart m1 master (db1080) - https://phabricator.wikimedia.org/T271540 (10Marostegui) Procedure:
Pre restart
[] Silence m1 hosts
[] buffer pool dump + disablement in advance to make the restart faster
Restart
[] `!log m1 master restart - T271540`
[] db1080: restart my...
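The "buffer pool dump + disablement" step in the restart checklist above can be sketched with the standard MariaDB/InnoDB variables; the exact commands run on db1080 are an assumption.

```sql
-- Sketch of pre-restart buffer pool handling (standard InnoDB variables;
-- the actual procedure on db1080 may differ).

-- Dump the current buffer pool contents to disk now, so the warm set is
-- captured well before the restart window.
SET GLOBAL innodb_buffer_pool_dump_now = ON;

-- Poll until the status reports the dump completed.
SHOW STATUS LIKE 'Innodb_buffer_pool_dump_status';

-- Disable the automatic dump at shutdown so the mysqld stop is faster
-- (the dump taken above will still be loaded at startup if
-- innodb_buffer_pool_load_at_startup is enabled).
SET GLOBAL innodb_buffer_pool_dump_at_shutdown = OFF;
```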
[12:33:10] 10DBA, 10DC-Ops, 10SRE, 10ops-eqiad: (Need By: 2020-11-29) rack/setup/install db11[51-76] - https://phabricator.wikimedia.org/T267043 (10Marostegui) Reminder: do not add IPV6 entries to these hosts (T267043#6692741)
[13:33:23] 10DBA, 10Wikimedia-Etherpad: Upgrade and restart m1 master (db1080) - https://phabricator.wikimedia.org/T271540 (10Trizek-WMF) >>! In T271540#6736583, @Trizek-WMF wrote: > I forgot about the wikitech-l email. Good idea! I will send one tomorrow. Sent to wikitech-l and wikitech-ambassadors https://lists.wikim...
[14:43:55] 10DBA, 10Patch-For-Review: Test upgrading sanitarium hosts to Buster + 10.4 - https://phabricator.wikimedia.org/T268742 (10Marostegui) I have finished the sanitization of db1155:3317, and triple checked that `centralauth` is clean.
[15:33:08] 10Data-Persistence-Backup: export_smart_data_dump.service failed on dbprov2001 because of a timeout in the raid facter - https://phabricator.wikimedia.org/T271821 (10jcrespo)
[15:38:46] 10Data-Persistence-Backup: export_smart_data_dump.service failed on dbprov2001 because of a timeout in the raid facter - https://phabricator.wikimedia.org/T271821 (10jcrespo) Sadly, I cannot reproduce right now:
`
dbprov2001# time /usr/bin/ruby /var/lib/puppet/lib/facter/raid.rb
{"raid":["megaraid"]}

real 0m0...
[16:03:36] 10DBA, 10SRE, 10ops-codfw: db2140 crashed due to HW memory errors - https://phabricator.wikimedia.org/T271084 (10Papaul) a:05Papaul→03Marostegui DIMM B6 replaced, server is back up. Return tracking information below. {F33996733}
[16:04:58] 10DBA, 10SRE, 10ops-codfw: db2140 crashed due to HW memory errors - https://phabricator.wikimedia.org/T271084 (10Marostegui) Thanks Papaul. Going to start mysql, check its data, enable replication and later repool it.
Will close this task once fully done
[16:33:44] 10Data-Persistence-Backup: export_smart_data_dump.service failed on dbprov2001 because of a timeout in the raid facter - https://phabricator.wikimedia.org/T271821 (10jcrespo) 05Open→03Resolved Relevant logs before failure:
`lines=10
Jan 12 15:11:01 dbprov2001 systemd[1]: Started Collect SMART information f...
[16:35:21] 10Data-Persistence-Backup: export_smart_data_dump.service failed on dbprov2001 because of a timeout in the raid facter - https://phabricator.wikimedia.org/T271821 (10jcrespo) 05Resolved→03Open Sorry, I was checking the wrong prometheus error. The error can be clearly explained by high (100%) disk usage at th...
[16:39:25] 10Data-Persistence-Backup: export_smart_data_dump.service failed on dbprov2001 because of a timeout in the raid facter - https://phabricator.wikimedia.org/T271821 (10jcrespo) And this is due to commonswiki image dump- this is a larger issue than I anticipated.
[16:50:36] 10Data-Persistence-Backup: export_smart_data_dump.service failed on dbprov2001 because of a timeout in the raid facter - https://phabricator.wikimedia.org/T271821 (10jcrespo) p:05Triage→03Low I don't think this is a high importance issue, as the only fallout is temporary metric monitoring failure (which was...
[16:50:46] 10Data-Persistence-Backup: export_smart_data_dump.service failed on dbprov2001 because of a timeout in the raid facter - https://phabricator.wikimedia.org/T271821 (10jcrespo) a:05jcrespo→03None
[16:54:16] 10DBA, 10Patch-For-Review: Test upgrading sanitarium hosts to Buster + 10.4 - https://phabricator.wikimedia.org/T268742 (10Marostegui) Replication started on db1155:3317.
Also InnoDB compression has been started
[16:55:06] 10DBA, 10Patch-For-Review: Test upgrading sanitarium hosts to Buster + 10.4 - https://phabricator.wikimedia.org/T268742 (10Marostegui) Compressing the recently created databases on db1154:3315 and on clouddb1016:3315 and clouddb1020 too
[18:24:21] tendril seems to be down, I am investigating
[18:24:58] the usual...
[18:25:13] memory leak of cpu/contention?
[18:25:15] *or
[18:25:32] clean up locking up everything, I will stop the event scheduler + truncate that huge useless table
[18:26:07] web interface is now responsive, but scraping is not going through
[18:26:23] yup
[18:26:31] let me know if you need help
[18:29:59] tendril is back and data will start to show up as events start running
[18:30:55] as this is happening fairly often, should we auto-delete stuff or auto-truncate from that table?
[18:31:03] sorry if I'm missing context here
[18:31:07] volans: that is what is killing it :)
[18:31:09] the clean ups
[18:31:14] ahahah
[18:31:17] I will need to space them out or reduce the throttling even more
[18:31:40] we should really truncate that table daily, it is not really used for anything useful
[18:31:48] stop event scheduler + truncate + enable event scheduler
[18:32:54] volans, the hard background issue is balancing working on fires on the old system vs investing time on a new one
[18:33:05] thanks marostegui for taking care of it
[18:33:26] I am going to stop the event scheduler for now to let all the piled-up queries finish before enabling it again
[18:35:45] heads up in case you see something weird on db2078 in the morning- I may have missed some step on tendril, will take care of it tomorrow
[18:36:19] sounds good
[18:45:51] 10DBA: db2078 m1 mysqld process crashed - https://phabricator.wikimedia.org/T270877 (10jcrespo) db2078 has been reprovisioned, but some finishing steps and checks may be needed tomorrow to check things are working as expected/db2078 is back into regular production access.
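The "stop event scheduler + truncate + enable event scheduler" recovery discussed at 18:31 above can be sketched as follows; the table name is an assumption for illustration, since the log only calls it "that huge useless table".

```sql
-- Sketch of the tendril cleanup procedure described in the chat.
-- Stop the event scheduler so no cleanup/collection events fire
-- while the table is being truncated.
SET GLOBAL event_scheduler = OFF;

-- Truncate the oversized log table (name is hypothetical here).
TRUNCATE TABLE tendril.global_status_log;

-- Re-enable the scheduler; data will repopulate as events start running.
SET GLOBAL event_scheduler = ON;
```

Truncating rather than deleting avoids the long row-by-row cleanup transactions that were locking up the instance in the first place.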
[18:46:27] ok - I think tendril is back to a stable situation
[18:46:42] I had to truncate a few more tables and restart the daemon
[18:46:47] I am going to log off, it's been a long day
[18:46:49] o/
[18:47:01] thank you, bye!
[18:47:10] :*
[19:05:33] 10DBA, 10DC-Ops, 10SRE, 10ops-eqiad: (Need By: 2020-11-29) rack/setup/install db11[51-76] - https://phabricator.wikimedia.org/T267043 (10Cmjohnson) quick update, all the servers are cabled, need to add to netbox next and then set up idrac. These will be ready to be handed over tomorrow.