[05:21:07] 10DBA, 10Wikimedia-Rdbms, 10Goal, 10MW-1.36-notes (1.36.0-wmf.20; 2020-12-01), and 5 others: FY18/19 TEC1.6 Q4: Improve or replace the usage of GTID_WAIT with pt-heartbeat in MW - https://phabricator.wikimedia.org/T221159 (10aaron) 05Open→03Declined I don't think it would be worth using pt-heartbeat fo...
[06:59:26] 10DBA, 10Data-Services, 10User-Kormat, 10cloud-services-team (Kanban): Parametrize wmf-pt-kill so it can connect to different sockets - https://phabricator.wikimedia.org/T260511 (10Marostegui) 05Open→03Resolved a:03Bstorm Thanks a lot Brooke for working on this. Looks like it is working fine, pt-kill...
[07:02:13] 10DBA, 10Data-Services, 10Patch-For-Review, 10cloud-services-team (Kanban): Deploy labsdbuser and views to new clouddb hosts - https://phabricator.wikimedia.org/T268312 (10Marostegui)
[08:22:09] 10Blocked-on-schema-change, 10DBA: Schema change for renaming user_properties_property index - https://phabricator.wikimedia.org/T270187 (10Marostegui) Started the schema change on s3 master with NO replication. It will take around 15 hours.
[08:22:42] 10Blocked-on-schema-change, 10DBA: Schema change for renaming user_properties_property index - https://phabricator.wikimedia.org/T270187 (10Marostegui)
[08:25:40] 10DBA, 10Data-Services, 10User-Kormat, 10cloud-services-team (Kanban): Parametrize wmf-pt-kill so it can connect to different sockets - https://phabricator.wikimedia.org/T260511 (10Marostegui) Note: I also tested the old hosts and a stop/start on them for each service.
[09:05:25] 10DBA, 10MediaWiki-Page-derived-data, 10Platform Engineering, 10TechCom-RFC, and 2 others: RFC: Normalize MediaWiki link tables - https://phabricator.wikimedia.org/T222224 (10jcrespo) I am happy for the above, don't get me wrong! But I may suggest a direction of priority of optimization, from a backup's p...
[09:39:30] 10DBA, 10Data-Services: Prepare and check storage layer for trwikivoyage - https://phabricator.wikimedia.org/T271261 (10Marostegui) >>!
In T271261#6738735, @Stashbot wrote: > {nav icon=file, name=Mentioned in SAL (#wikimedia-operations), href=https://sal.toolforge.org/log/PBjA9XYBhxWNv8gI9tM8} [2021-01-12T08:4...
[09:43:45] 10DBA, 10Data-Services: Prepare and check storage layer for bclwiktionary - https://phabricator.wikimedia.org/T270280 (10Marostegui) Database sanitized, `_p` database created, and grants given to `labsdbuser`. check private data came back clean. I also tested the triggers with my user on this wiki and it was o...
[09:44:03] 10DBA, 10Data-Services, 10cloud-services-team (Kanban): Prepare and check storage layer for niawiki - https://phabricator.wikimedia.org/T270414 (10Marostegui) Database sanitized, `_p` database created, and grants given to `labsdbuser`. check private data came back clean. I also tested the triggers with my us...
[09:44:10] 10DBA, 10Data-Services: Prepare and check storage layer for diqwiktionary - https://phabricator.wikimedia.org/T270276 (10Marostegui) Database sanitized, `_p` database created, and grants given to `labsdbuser`. check private data came back clean. I also tested the triggers with my user on this wiki and it was o...
[09:44:35] 10DBA, 10Data-Services, 10cloud-services-team (Kanban): Prepare and check storage layer for niawiktionary - https://phabricator.wikimedia.org/T270410 (10Marostegui) Database sanitized, `_p` database created, and grants given to `labsdbuser`. check private data came back clean. I also tested the triggers with...
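The repeated wiki-preparation updates above (sanitize, create the `_p` database, grant to `labsdbuser`) can be sketched roughly as follows. This is a minimal illustration, not the production tooling: the real views are generated by the maintain-views scripts, and the view list and columns here are assumptions.

```sql
-- Sketch of the "`_p` database created, grants given to labsdbuser" step,
-- using bclwiktionary as the example wiki. View definitions are illustrative.
CREATE DATABASE IF NOT EXISTS bclwiktionary_p;

-- The _p views expose only public columns of the underlying tables.
CREATE OR REPLACE VIEW bclwiktionary_p.page AS
    SELECT page_id, page_namespace, page_title
    FROM bclwiktionary.page;

-- Cloud replica users access the data through the labsdbuser role.
GRANT SELECT ON bclwiktionary_p.* TO labsdbuser;
```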
[10:15:48] 10DBA, 10Orchestrator, 10User-Kormat: Enable report_host on candidate masters - https://phabricator.wikimedia.org/T271106 (10Marostegui)
[10:16:54] 10DBA, 10Orchestrator, 10User-Kormat: Enable report_host on candidate masters - https://phabricator.wikimedia.org/T271106 (10Marostegui)
`
root@db1138:~# mysql -e "select @@report_host"
+--------------------+
| @@report_host      |
+--------------------+
| db1138.eqiad.wmnet |
+--------------------+
`
[10:17:05] 10DBA: Switchover s4 (commonswiki) from db1081 to db1138 - https://phabricator.wikimedia.org/T271427 (10Marostegui)
[10:22:08] 10DBA: Switchover s4 (commonswiki) from db1081 to db1138 - https://phabricator.wikimedia.org/T271427 (10Marostegui)
[10:27:49] 10DBA, 10Patch-For-Review: Test upgrading sanitarium hosts to Buster + 10.4 - https://phabricator.wikimedia.org/T268742 (10Marostegui) Sanitarium positions to replicate from on s7 {P13737}
[11:20:22] 10DBA: Switchover s4 (commonswiki) from db1081 to db1138 - https://phabricator.wikimedia.org/T271427 (10Marostegui)
[11:29:50] 10DBA, 10GrowthExperiments, 10Growth-Team (Current Sprint), 10Patch-For-Review, and 2 others: Slow load times for Special:Homepage on cswiki - https://phabricator.wikimedia.org/T267216 (10Urbanecm_WMF) >>! In T267216#6730772, @Marostegui wrote: >>>! In T267216#6730372, @Etonkovidova wrote: > [...] >> It se...
[11:30:52] 10DBA, 10GrowthExperiments, 10Growth-Team (Current Sprint), 10Patch-For-Review, and 2 others: Slow load times for Special:Homepage on cswiki - https://phabricator.wikimedia.org/T267216 (10Marostegui) @Urbanecm_WMF that explains it.....thanks much for the explanation :)
[11:38:36] 10DBA, 10Wikimedia-Etherpad: Upgrade and restart m1 master (db1080) - https://phabricator.wikimedia.org/T271540 (10Marostegui) Procedure:
Pre restart
[] Silence m1 hosts
[] buffer pool dump + disablement in advance to make the restart faster
Restart
[] `!log m1 master restart - T271540`
[] db1080: restart my...
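The "buffer pool dump + disablement" step in the restart checklist above can be sketched with the standard MariaDB/InnoDB variables; the exact commands run on db1080 are an assumption.

```sql
-- Sketch of pre-restart buffer pool handling (standard InnoDB variables;
-- the actual procedure on db1080 may differ).

-- Dump the current buffer pool contents to disk now, so the warm set is
-- captured well before the restart window.
SET GLOBAL innodb_buffer_pool_dump_now = ON;

-- Poll until the status reports the dump completed.
SHOW STATUS LIKE 'Innodb_buffer_pool_dump_status';

-- Disable the automatic dump at shutdown so the mysqld stop is faster
-- (the dump taken above will still be loaded at startup if
-- innodb_buffer_pool_load_at_startup is enabled).
SET GLOBAL innodb_buffer_pool_dump_at_shutdown = OFF;
```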
[12:33:10] 10DBA, 10DC-Ops, 10SRE, 10ops-eqiad: (Need By: 2020-11-29) rack/setup/install db11[51-76] - https://phabricator.wikimedia.org/T267043 (10Marostegui) Reminder: do not add IPV6 entries to these hosts (T267043#6692741)
[13:33:23] 10DBA, 10Wikimedia-Etherpad: Upgrade and restart m1 master (db1080) - https://phabricator.wikimedia.org/T271540 (10Trizek-WMF) >>! In T271540#6736583, @Trizek-WMF wrote: > I forgot about the wikitech-l email. Good idea! I will send one tomorrow. Sent to wikitech-l and wikitech-ambassadors https://lists.wikim...
[14:43:55] 10DBA, 10Patch-For-Review: Test upgrading sanitarium hosts to Buster + 10.4 - https://phabricator.wikimedia.org/T268742 (10Marostegui) I have finished the sanitization of db1155:3317, and triple checked that `centralauth` is clean.
[15:33:08] 10Data-Persistence-Backup: export_smart_data_dump.service failed on dbprov2001 because of a timeout in the raid facter - https://phabricator.wikimedia.org/T271821 (10jcrespo)
[15:38:46] 10Data-Persistence-Backup: export_smart_data_dump.service failed on dbprov2001 because of a timeout in the raid facter - https://phabricator.wikimedia.org/T271821 (10jcrespo) Sadly, I cannot reproduce right now:
`
dbprov2001# time /usr/bin/ruby /var/lib/puppet/lib/facter/raid.rb
{"raid":["megaraid"]}

real 0m0...
[16:03:36] 10DBA, 10SRE, 10ops-codfw: db2140 crashed due to HW memory errors - https://phabricator.wikimedia.org/T271084 (10Papaul) a:05Papaul→03Marostegui DIMM B6 replaced, server is back up. Return tracking information below. {F33996733}
[16:04:58] 10DBA, 10SRE, 10ops-codfw: db2140 crashed due to HW memory errors - https://phabricator.wikimedia.org/T271084 (10Marostegui) Thanks Papaul. Going to start mysql, check its data, enable replication and later repool it.
Will close this task once fully done
[16:33:44] 10Data-Persistence-Backup: export_smart_data_dump.service failed on dbprov2001 because of a timeout in the raid facter - https://phabricator.wikimedia.org/T271821 (10jcrespo) 05Open→03Resolved Relevant logs before failure:
`lines=10
Jan 12 15:11:01 dbprov2001 systemd[1]: Started Collect SMART information f...
[16:35:21] 10Data-Persistence-Backup: export_smart_data_dump.service failed on dbprov2001 because of a timeout in the raid facter - https://phabricator.wikimedia.org/T271821 (10jcrespo) 05Resolved→03Open Sorry, I was checking the wrong prometheus error. The error can be clearly explained by high (100%) disk usage at th...
[16:39:25] 10Data-Persistence-Backup: export_smart_data_dump.service failed on dbprov2001 because of a timeout in the raid facter - https://phabricator.wikimedia.org/T271821 (10jcrespo) And this is due to commonswiki image dump- this is a larger issue than I anticipated.
[16:50:36] 10Data-Persistence-Backup: export_smart_data_dump.service failed on dbprov2001 because of a timeout in the raid facter - https://phabricator.wikimedia.org/T271821 (10jcrespo) p:05Triage→03Low I don't think this is a high importance issue, as the only fallout is temporary metric monitoring failure (which was...
[16:50:46] 10Data-Persistence-Backup: export_smart_data_dump.service failed on dbprov2001 because of a timeout in the raid facter - https://phabricator.wikimedia.org/T271821 (10jcrespo) a:05jcrespo→03None
[16:54:16] 10DBA, 10Patch-For-Review: Test upgrading sanitarium hosts to Buster + 10.4 - https://phabricator.wikimedia.org/T268742 (10Marostegui) Replication started on db1155:3317.
Also InnoDB compression has been started
[16:55:06] 10DBA, 10Patch-For-Review: Test upgrading sanitarium hosts to Buster + 10.4 - https://phabricator.wikimedia.org/T268742 (10Marostegui) Compressing the recently created databases on db1154:3315 and on clouddb1016:3315 and clouddb1020 too
[18:24:21] tendril seems to be down, I am investigating
[18:24:58] the usual...
[18:25:13] memory leak of cpu/contention?
[18:25:15] *or
[18:25:32] clean up locking up everything, I will stop the event scheduler + truncate that huge useless table
[18:26:07] web interface is now responsive, but scraping is not going through
[18:26:23] yup
[18:26:31] let me know if you need help
[18:29:59] tendril is back and data will start to show up as events start running
[18:30:55] as this is happening fairly often, should we auto-delete stuff or auto-truncate from that table?
[18:31:03] sorry if I'm missing context here
[18:31:07] volans: that is what is killing it :)
[18:31:09] the clean ups
[18:31:14] ahahah
[18:31:17] I will need to space them out or reduce the throttling even more
[18:31:40] we should really truncate that table daily, it is not really used for anything useful
[18:31:48] stop event scheduler + truncate + enable event scheduler
[18:32:54] volans, the hard background issue is balancing working on fires on the old system vs investing time on a new one
[18:33:05] thanks marostegui for taking care of it
[18:33:26] I am going to stop the event scheduler for now to let all the piled-up queries finish before enabling it again
[18:35:45] heads up in case you see something weird on db2078 in the morning- I may have missed some step on tendril, will take care of it tomorrow
[18:36:19] sounds good
[18:45:51] 10DBA: db2078 m1 mysqld process crashed - https://phabricator.wikimedia.org/T270877 (10jcrespo) db2078 has been reprovisioned, but some finishing steps and checks may be needed tomorrow to check things are working as expected/db2078 is back into regular production access.
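The "stop event scheduler + truncate + enable event scheduler" recovery discussed at 18:31 above can be sketched as follows; the table name is an assumption for illustration, since the log only calls it "that huge useless table".

```sql
-- Sketch of the tendril cleanup procedure described in the chat.
-- Stop the event scheduler so no cleanup/collection events fire
-- while the table is being truncated.
SET GLOBAL event_scheduler = OFF;

-- Truncate the oversized log table (name is hypothetical here).
TRUNCATE TABLE tendril.global_status_log;

-- Re-enable the scheduler; data will repopulate as events start running.
SET GLOBAL event_scheduler = ON;
```

Truncating rather than deleting avoids the long row-by-row cleanup transactions that were locking up the instance in the first place.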
[18:46:27] ok - I think tendril is back to a stable situation
[18:46:42] I had to truncate a few more tables and restart the daemon
[18:46:47] I am going to log off, it's been a long day
[18:46:49] o/
[18:47:01] thank you, bye!
[18:47:10] :*
[19:05:33] 10DBA, 10DC-Ops, 10SRE, 10ops-eqiad: (Need By: 2020-11-29) rack/setup/install db11[51-76] - https://phabricator.wikimedia.org/T267043 (10Cmjohnson) quick update, all the servers are cabled, need to add to netbox next and then set up idrac. These will be ready to be handed over tomorrow.