[04:50:38] 10DBA, 10Operations: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (10Marostegui) Once the disk have failed we will get an automatic ticket for getting that disk replaced. I don't think we need this tracking taks. [05:36:44] 10DBA, 10Operations: db2061 has predictive disk errors - https://phabricator.wikimedia.org/T208957 (10Marostegui) I would ignore this until the disks fail and we get the automatic failed disk task created [05:39:06] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review, 10User-Banyek: rack/setup/install pc2007-pc2010 - https://phabricator.wikimedia.org/T207259 (10Marostegui) 05Open>03Resolved I agree with Jaime, a bigger stripe size shouldn't be an issue. Plus, we will be having SSDs, which will probably compensa... [05:39:42] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review, 10User-Banyek: rack/setup/install pc2007-pc2010 - https://phabricator.wikimedia.org/T207259 (10Marostegui) [05:49:34] 10DBA: Duplicate rows error in db2095 replication @s7 - https://phabricator.wikimedia.org/T208672 (10Marostegui) >>! In T208672#4724031, @Banyek wrote: > I re-run the check_private_data.py to see the if reimport was good as the filters/triggers were working What's the status of this? Was this fully done? [05:59:03] 10Blocked-on-schema-change, 10MediaWiki-Change-tagging, 10MediaWiki-Database, 10Wikidata, and 2 others: Schema change for adding indexes of ct_tag_id - https://phabricator.wikimedia.org/T203709 (10Marostegui) >>! In T203709#4724115, @Addshore wrote: > @Marostegui Do we have an ETA on these indexes being re... [06:08:01] 10Blocked-on-schema-change, 10MediaWiki-Change-tagging, 10MediaWiki-Database, 10Wikidata, and 2 others: Schema change for adding indexes of ct_tag_id - https://phabricator.wikimedia.org/T203709 (10Marostegui) This is very weird, I am _pretty_ sure I did s8 eqiad whilst codfw was active. The fact that this... [06:08:20] 10Blocked-on-schema-change, 10MediaWiki-Change-tagging, 10MediaWiki-Database, 10Wikidata, and 2 others: Schema change for adding indexes of ct_tag_id - https://phabricator.wikimedia.org/T203709 (10Marostegui) [06:17:41] 10DBA, 10MediaWiki-API, 10MediaWiki-Database: prop=revisions API timing out for a specific user and pages they edited - https://phabricator.wikimedia.org/T197486 (10Marostegui) Yeah, I am curious to see if 10.0.37 really fixes this :-) [06:32:07] 10DBA, 10Operations, 10ops-eqiad: db1117 went away - https://phabricator.wikimedia.org/T208150 (10Marostegui) 05Open>03Resolved a:03Cmjohnson I see no more errors on the idrac logs since the reboot. Let's close this and re-open if this happens again and then we'll need to get the vendor involved. ` E... [06:34:31] 10DBA, 10Operations: BBU Fail on dbstore2002 - https://phabricator.wikimedia.org/T208320 (10Marostegui) a:05Papaul>03None I guess this is the reason why dbstore2002:3313 is lagging so much behind? (It could be something else, I am just catching up on emails) Should we ease a bit replication options to make... [06:43:32] 10DBA, 10MediaWiki-extensions-WikibaseMediaInfo, 10SDC Engineering, 10StructuredDataOnCommons, 10Wikidata: MediaInfo extension should not use the wb_terms table - https://phabricator.wikimedia.org/T208330 (10Marostegui) I would prefer if we don't enable more stuff in production that uses `wb_term` table,... [07:02:36] 10DBA, 10Data-Services, 10Datasets-General-or-Unknown: Archive and drop education program (ep_*) tables on all wikis - https://phabricator.wikimedia.org/T174802 (10Marostegui) >>! 
In T174802#4733544, @Urbanecm wrote: > What about sending a notification via #user-notice and drop the data after a month since... [07:43:32] 10DBA, 10Analytics, 10Data-Services: Not able to scoop comment table in labs for mediawiki reconstruction process - https://phabricator.wikimedia.org/T209031 (10jcrespo) This is not something we handle- we don't decide on the table structure (this refactoring, comment storage, was owned by Platform team), wh... [08:07:58] 10DBA, 10Operations: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (10Marostegui) Already caught up with Jaime about why this ticket exists. All good here [08:12:22] 10DBA, 10Epic, 10Tracking: Database tables to be dropped on Wikimedia wikis and other WMF databases (tracking) - https://phabricator.wikimedia.org/T54921 (10Marostegui) [08:12:24] 10DBA, 10foundation.wikimedia.org: Drop the petition_data table from production - https://phabricator.wikimedia.org/T208979 (10Marostegui) [08:12:35] 10DBA, 10foundation.wikimedia.org: Drop the petition_data table from production - https://phabricator.wikimedia.org/T208979 (10Marostegui) p:05Triage>03Normal [08:17:01] 10DBA, 10DC-Ops: disk error on db1065 - https://phabricator.wikimedia.org/T208709 (10Marostegui) I would suggest waiting for this disk to fail completely before replacing it. [08:27:01] 10DBA: Failover m3 codfw master - https://phabricator.wikimedia.org/T209261 (10Marostegui) [08:27:37] 10DBA: Failover m3 codfw master - https://phabricator.wikimedia.org/T209261 (10Marostegui) p:05Triage>03Normal [08:27:52] 10DBA, 10Operations, 10ops-codfw, 10User-Banyek: db2042 (m3) master RAID battery failed - https://phabricator.wikimedia.org/T202051 (10Marostegui) This failed again and I have created T202051 to track it [08:29:18] Is dbstore2002:3313 so lagged because of the BBU failure? https://phabricator.wikimedia.org/T208320#4738624 [08:30:28] no idea, I saw it on Thursday when it was only 1 day behind and saw it replicating the maintenance, so I supposed it was going to be temporary [08:30:47] it may still be the ongoing maintenance [08:30:53] banyek: any idea? [08:32:16] no, I didn't spend time on that. Practically speaking that wasn't a good idea, but if I recall correctly, after I realized it's normally lagging because of backups I didn't pay too much attention to it, as I had other, more scary stuff to do [08:33:44] it is 6 days behind now :( [08:34:33] we shall enable the write-back cache even though the BBU is bad, then [08:36:13] I don't know if it is that, I am catching up with emails but saw that alert there, so I was wondering what's been done [08:36:30] I don't know if there are scripts running for maintenance or what [08:36:36] Last week was pretty rough, the sanitariums broke constantly [08:37:22] I would expect 3311 to also lag if it was the BBU [08:37:38] So maybe there are scripts running for maintenance, I just don't know, I am not in the loop [08:38:19] The tables are compressed in dbstore2002 too, that could also be a reason - even if it is unprobable [08:38:24] (is that a word?) [08:40:36] I am going to ease consistency variables to get it back to a more recent state, 6 days is too much (and it is still increasing) [08:40:40] so UPDATE /* MigrateComments::migrate www-data@mwmain... */ is still running there [08:40:54] yeah, I saw that one [08:43:00] 10DBA, 10Operations: BBU Fail on dbstore2002 - https://phabricator.wikimedia.org/T208320 (10Marostegui) >>!
In T208320#4738624, @Marostegui wrote: > I guess this is the reason why dbstore2002:3313 is lagging so much behind? (It could be something else, I am just catching up on emails) > Should we ease a bit re... [08:50:48] hope you had a good vacation marostegui [08:53:57] hehe I did! [08:53:58] Thanks :) [09:08:38] 10Blocked-on-schema-change, 10DBA, 10Patch-For-Review, 10Schema-change: Dropping rc_moved_to_title/rc_moved_to_ns on wmf databases - https://phabricator.wikimedia.org/T51191 (10Marostegui) 05stalled>03Open [09:09:07] 10DBA, 10Schema-change: Drop externallinks.el_from_namespace on wmf databases - https://phabricator.wikimedia.org/T114117 (10Marostegui) 05stalled>03Open [09:09:15] 10DBA, 10Epic, 10Tracking: Database tables to be dropped on Wikimedia wikis and other WMF databases (tracking) - https://phabricator.wikimedia.org/T54921 (10Marostegui) [09:09:33] 10DBA, 10Patch-For-Review: Drop ct_ indexes on change_tag - https://phabricator.wikimedia.org/T205913 (10Marostegui) 05stalled>03Open [09:09:43] 10DBA, 10Datasets-General-or-Unknown, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External), and 2 others: Automate the check and fix of object, schema and data drifts between mediawiki HEAD, production masters and slaves - https://phabricator.wikimedia.org/T104459 (10Marostegui) [09:09:45] 10DBA, 10Schema-change, 10Tracking: [DO NOT USE] Schema changes for Wikimedia wikis (tracking) [superseded by #Blocked-on-schema-change] - https://phabricator.wikimedia.org/T51188 (10Marostegui) [09:09:47] 10Blocked-on-schema-change, 10DBA, 10Patch-For-Review, 10Schema-change: Dropping rc_cur_time on wmf databases - https://phabricator.wikimedia.org/T67448 (10Marostegui) 05stalled>03Open [09:12:29] 10DBA, 10Data-Services, 10Datasets-General-or-Unknown, 10User-notice: Archive and drop education program (ep_*) tables on all wikis - https://phabricator.wikimedia.org/T174802 (10Urbanecm) Suggested wording for the notice: "Tables used by nowadays archived extension [Education Program](https://www.mediawik... [09:30:07] hullo hullo [09:30:41] i'd like to start running this script https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/472596/ to fix ~1.5m rows with poorly distributed page.page_random values [09:31:06] the script has been reviewed by a whole bunch of folk but not by a dba afaict [09:33:32] 10DBA, 10Operations: BBU Fail on dbstore2002 - https://phabricator.wikimedia.org/T208320 (10Banyek) s3 is the only section there which is not compressed. Btw. We can check if the BBU causes it, because if we enable write caching we can see the results. [09:33:54] anomie confirmed on the associated task (https://phabricator.wikimedia.org/T208909) that it won't conflict with the scripts that they're running, which should be about finished [09:34:56] brb [09:35:00] 10DBA, 10Schema-change, 10Tracking: [DO NOT USE] Schema changes for Wikimedia wikis (tracking) [superseded by #Blocked-on-schema-change] - https://phabricator.wikimedia.org/T51188 (10Marostegui) [09:35:02] 10Blocked-on-schema-change, 10DBA, 10Patch-For-Review, 10Schema-change: Dropping rc_cur_time on wmf databases - https://phabricator.wikimedia.org/T67448 (10Marostegui) 05Open>03Resolved [09:35:39] phuedx: I have been out on holidays for 2 weeks, so I don't have the context, I am still catching up, so not sure what to say :) [09:35:42] banyek: ^ [09:35:55] banyek: you've got more details I guess? 
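The fix phuedx describes here (and details a bit further down) follows the usual batch-and-wait pattern: pull a small batch of affected page IDs from a replica, update them on the master, then block until replication catches up before the next batch. The sketch below only illustrates that pattern; the connection details, the predicate for "badly distributed" page_random values and the lag threshold are assumptions, not what the actual MediaWiki maintenance script in the Gerrit change does.

```python
"""Minimal sketch of the batch-and-wait update pattern (illustrative only).

Hypothetical pieces: host names, credentials, the WHERE clause used to find
"affected" rows, and the lag threshold.
"""
import random
import time

import pymysql

BATCH_SIZE = 200        # batch size agreed on in the discussion
MAX_REPLICA_LAG = 1.0   # seconds; hypothetical threshold

master = pymysql.connect(host="master.example", db="enwiki", user="maint", password="...")
replica = pymysql.connect(host="replica.example", db="enwiki", user="maint", password="...")


def replica_lag_seconds():
    """Read Seconds_Behind_Master from the replica (simplified lag check)."""
    with replica.cursor(pymysql.cursors.DictCursor) as cur:
        cur.execute("SHOW SLAVE STATUS")
        row = cur.fetchone()
        return (row or {}).get("Seconds_Behind_Master") or 0


def next_affected_batch(last_id):
    """Fetch the next batch of page IDs whose page_random needs fixing.

    The WHERE clause below is a placeholder, not the real test for badly
    distributed values.
    """
    with replica.cursor() as cur:
        cur.execute(
            "SELECT page_id FROM page "
            "WHERE page_id > %s AND page_random < 0.000001 "
            "ORDER BY page_id LIMIT %s",
            (last_id, BATCH_SIZE),
        )
        return [r[0] for r in cur.fetchall()]


last_id = 0
while True:
    batch = next_affected_batch(last_id)
    if not batch:
        break
    with master.cursor() as cur:
        for page_id in batch:
            cur.execute(
                "UPDATE page SET page_random = %s WHERE page_id = %s",
                (random.random(), page_id),
            )
    master.commit()
    last_id = batch[-1]
    # wait for replicas before touching the next batch
    while replica_lag_seconds() > MAX_REPLICA_LAG:
        time.sleep(1)
```

Keeping each write batch small (200 rows, as agreed below) is what bounds the replica lag and lock time introduced per iteration.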
[09:36:08] i can add some context, if you'd like [09:36:53] Let's wait for banyek as he was here for those two weeks, so I guess he knows what all this is about :) [09:40:46] marostegui, banyek: sure. my team's been operating with a lot of caution because large updates to the db are outside of our regular wheelhouse. the script's been reviewed by quite a few folk so we're more confident. i guess what i'm asking is: is there any reason why i shouldn't start running the script? :) [09:40:59] also, i hope you had a good vacation! [10:03:24] marostegui: tldr; iirc ; so there's a 14 year old problem with some random number distribution around the entire database which could be fixed running a maintenance script and phuedx needs our blessing to start the maintenance script now [10:04:05] Sure, I assume it has all the wait for replications, it will be done in reasonable size batches and all that? [10:04:51] marostegui: we were thinking a batch size of 1000. we query the replicas for affected rows in the batch. update the affected rows. wait for replicas. continue [10:05:07] 1.6m rows are affected. 600,000 of which are in the enwiki database [10:05:25] phuedx: I would start with lower batch size, just to be on the safe side [10:05:46] marostegui: acknowledged. 200 (which i think is set in the change itself) [10:05:48] ? [10:05:54] sure, that sounds good [10:06:01] https://gerrit.wikimedia.org/r/#/c/472596/ [10:06:16] banyek: did you review it yourself too? [10:06:55] I checked the php code, but I wasn't through - I didn't +1'd [10:07:23] actually I don't care much about the code, but about the logic :) [10:08:10] phuedx: where will you run the script on at first? like, which wiki [10:08:11] I need to re-check it, because I don't know. As I see tgr approved and that could be a sign of good code [10:08:45] marostegui: this is the first time i've done something like this, so your advice (and banyek's!) will be invaluable [10:08:58] phuedx: are all the wikis affected? [10:10:17] marostegui: yes, aiui [10:10:37] phuedx: why not starting with testwiki then? [10:10:50] marostegui: seems reasonable, then mediawikiwiki etc [10:11:17] it affects any wikis that were in production between October 11, 2004 and July 7, 2005 [10:12:16] phuedx: after those you can check small.dblist [10:12:38] sweet! so i could foreachwiki against small.dblist [10:12:55] sure, after test wiki and mediawikiwiki that is a good option [10:13:36] alright. if you're happy. then i'll pull the script onto mwmaint1002 and run it. i'll shout out in -operations too [10:13:49] sure, make sure to !log it [10:16:35] marostegui: i appear to be denied access to the wmf_raw.mediawiki_revision table on quarry [10:16:39] if i could run this query: https://quarry.wmflabs.org/query/31134 [10:16:48] then i could know which wikis are affected [10:16:55] *all of the wikis [10:17:38] is that on labs? [10:17:56] what is that wmf_raw? [10:18:00] wait. 
that's a hive table :/ [10:18:04] *facepalm* [10:59:40] 10Blocked-on-schema-change, 10DBA, 10Patch-For-Review, 10Schema-change: Dropping rc_moved_to_title/rc_moved_to_ns on wmf databases - https://phabricator.wikimedia.org/T51191 (10Marostegui) 05Open>03Resolved [12:21:55] :D [12:23:28] I go eat something [13:20:24] marostegui: the script has finally been deployed and i'm going to run it against testwiki [13:23:53] 10DBA, 10MediaWiki-General-or-Unknown, 10MW-1.33-notes (1.33.0-wmf.4; 2018-11-13), 10Patch-For-Review, 10Readers-Web-Backlog (Readers-Web-Kanbanana-Board-2018-19-Q2): [Bug] Update old nonuniformly distributed page_random values - https://phabricator.wikimedia.org/T208909 (10phuedx) a:03phuedx [13:28:43] no failures seen. 32 rows updated. [13:28:58] marostegui: i'm going to run the script against mediawikiwiki [13:35:12] no failures. 246 rows updated [13:49:49] marostegui: small.dblist covers 227 of the 577 wikis in my list of affected wikis [13:53:58] i'm going to feed that dblist to foreach wiki, if that's alright by you [13:58:42] sure [14:00:36] sorry i meant mwscriptwikiset [14:00:40] foreachwiki is all wikis :) [14:00:53] *foreachwikiindblist [14:03:30] 10Blocked-on-schema-change, 10MediaWiki-Change-tagging, 10MediaWiki-Database, 10Wikidata, and 3 others: Schema change for adding indexes of ct_tag_id - https://phabricator.wikimedia.org/T203709 (10Marostegui) I know realised why the change was "gone" from all the eqiad s8 hosts except db1071, db1087,db1124... [14:04:37] 10DBA, 10User-Banyek: Checking archive tables across the databases - https://phabricator.wikimedia.org/T209048 (10Marostegui) p:05Triage>03Normal [14:05:45] 10DBA, 10MediaWiki-General-or-Unknown, 10MW-1.33-notes (1.33.0-wmf.3; 2018-11-06), 10Patch-For-Review, 10Readers-Web-Backlog (Readers-Web-Kanbanana-Board-2018-19-Q2): [Bug] Update old nonuniformly distributed page_random values - https://phabricator.wikimedia.org/T208909 (10phuedx) The script is currentl... [14:09:33] small wikis are done [14:13:26] marostegui: i'll move on to medium.dblist [14:13:31] 10Blocked-on-schema-change, 10MediaWiki-Change-tagging, 10MediaWiki-Database, 10Wikidata, and 3 others: Schema change for adding indexes of ct_tag_id - https://phabricator.wikimedia.org/T203709 (10jcrespo) Could you check the list of schema changes and maintenance to be ran during switchover to test if the... [14:14:32] 10Blocked-on-schema-change, 10MediaWiki-Change-tagging, 10MediaWiki-Database, 10Wikidata, and 3 others: Schema change for adding indexes of ct_tag_id - https://phabricator.wikimedia.org/T203709 (10Marostegui) >>! In T203709#4739468, @jcrespo wrote: > Could you check the list of schema changes and maintenan... [14:49:19] I start to roll out the new parsercache code [14:49:36] cool [14:53:02] puppet disabled, now merging the patch [14:56:04] did you disable alerting on the new host? 
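For the per-wiki runs mentioned above (testwiki, then mediawikiwiki, then small.dblist), wrappers like foreachwikiindblist / mwscriptwikiset essentially iterate over a dblist and invoke the maintenance script once per wiki. A rough Python equivalent of that loop is sketched below; the dblist path, script name and flags are placeholders rather than the real ones.

```python
"""Rough sketch of a per-wiki maintenance run driven by a dblist.

Assumptions: the dblist file holds one dbname per line, `mwscript` is the
wrapper available on the maintenance host, and the script path and
--batch-size flag are hypothetical stand-ins for the real change.
"""
import subprocess
import sys

DBLIST = "/srv/mediawiki/dblists/small.dblist"  # assumed location
SCRIPT = "maintenance/updatePageRandom.php"     # hypothetical script name


def wikis(dblist_path):
    """Yield dbnames from a dblist, skipping blanks and comments."""
    with open(dblist_path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#"):
                yield line


for wiki in wikis(DBLIST):
    print(f"=== {wiki} ===")
    result = subprocess.run(
        ["mwscript", SCRIPT, "--wiki", wiki, "--batch-size", "200"]
    )
    if result.returncode != 0:
        sys.exit(f"{wiki} failed with exit code {result.returncode}")
```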
[14:56:14] otherwise it may page/alert [14:56:45] it has to be done on puppet because with icinga it would not apply to new alerts [14:56:58] no [14:57:01] damn it [14:58:52] I downtimed it on icinga [15:02:04] on icinga it will do nothing [15:02:10] as it will not apply to the new services [15:02:17] they will alert anyway [15:03:19] you should quickly disable alerting on the new services [15:07:21] if you add a +1 on https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/473021/ I merge it [15:11:12] ok, it's out [15:12:37] I enable puppet on pc2004 and run the agent there too [15:12:46] 🤞 [15:14:26] cool, there's no changes except the motd [15:14:31] https://www.irccloud.com/pastebin/r3AiKlJx/ [15:14:53] I enable the puppet on the hosts one-by-one and run the agent [15:17:42] 10DBA, 10MediaWiki-General-or-Unknown, 10MW-1.33-notes (1.33.0-wmf.3; 2018-11-06), 10Patch-For-Review, 10Readers-Web-Backlog (Readers-Web-Kanbanana-Board-2018-19-Q2): [Bug] Update old nonuniformly distributed page_random values - https://phabricator.wikimedia.org/T208909 (10phuedx) … and now against [all... [15:28:22] I am going to upgrade db2094 to 10.1.37 [15:43:01] marostegui: I see it ^downtime for a schema change, is it something I can stop? [15:43:19] should have finished, let me check [15:43:40] I can also check [15:43:56] it is finished [15:43:59] which sections, all? [15:44:01] you can proceed [15:44:03] only s1 [15:44:09] I downtimed it all [15:44:30] ok, reusing your window, will later put it back up again [15:44:38] great, thank you [15:47:34] BTW, test-s1 (db1118) is still broken [15:47:48] yep [15:47:49] will eventually install 10.3 there [15:47:58] but not a priority, so I left it alone [15:48:00] +1! [15:48:17] 10DBA, 10Analytics, 10Data-Services: Not able to scoop comment table in labs for mediawiki reconstruction process - https://phabricator.wikimedia.org/T209031 (10Nuria) Pinging @bd808 and @Fjalapeno and @tstarling per above comment. [15:49:13] 10DBA, 10Analytics, 10Data-Services: Not able to scoop comment table in labs for mediawiki reconstruction process - https://phabricator.wikimedia.org/T209031 (10Milimetric) @Nuria, I'm catching up on the task that Jaime recommended and will comment there. [15:53:00] 10DBA, 10Analytics, 10Data-Services: Not able to scoop comment table in labs for mediawiki reconstruction process - https://phabricator.wikimedia.org/T209031 (10Pintoch) I have taken the liberty to remove "Cloud Services" as a subscriber to this ticket as I do not think every toollabs user wants to receive n... [15:54:55] 10DBA, 10Analytics, 10Data-Services: Not able to scoop comment table in labs for mediawiki reconstruction process - https://phabricator.wikimedia.org/T209031 (10Cyberpower678) Why am I getting emails to this task? [16:01:47] 10DBA, 10Analytics, 10Data-Services: Not able to scoop comment table in labs for mediawiki reconstruction process - https://phabricator.wikimedia.org/T209031 (10jcrespo) Nuria apparently subscribed 120 cloud users to this task by mistake- please be careful when using Phabricator to not annoy (with spam) our... [16:04:30] running the script against the large wikis now. as part of the medium wikis, the script chomped through some fairly large rowsets with ease. 
i predict that it'll be done in the next two hours [16:04:56] i'm still using a batch size of 200 [16:06:18] good, so far no lag apparently https://grafana.wikimedia.org/dashboard/db/mysql-aggregated?panelId=6&fullscreen&orgId=1 [16:06:25] the s8 spikes are my alter tables on some hosts [16:07:30] phuedx: do you know how to check those issues? one is the link above [16:08:21] the other is https://logstash.wikimedia.org/app/kibana#/dashboard/DBReplication [16:08:45] of course, any regular "high rate of errors" is always theoretically possible [16:09:19] e.g. webrequests taking a lot of time due to extra locking, etc. [16:10:39] moore's law says that if done carefuly and checking, errors don't happen, while if you are not, they will :-D [16:11:43] ^ lol and that's absolutely true [16:12:33] so we check to exect no errors to cover for that :-D [16:12:40] *expect [16:13:29] also it is normally the things you don't think will go bad- e.g. like weird interactions with other ongoing scripts [16:15:21] jynus: yeah. the only script that's called out anywhere that i can see is anomie's ongoing script (on the deployments page) but there could well be others! [16:15:39] 10DBA, 10Operations: db2061 has predictive disk errors - https://phabricator.wikimedia.org/T208957 (10Marostegui) 05Open>03declined As spoken via IRC, let's wait for these disks to finally fail (we don't have spares anyways) and hosts with predictive errors are being tracked at {T208323} [16:16:02] phuedx: that is why we ask people to put the ongoing things on deployments [16:16:07] 10DBA, 10DC-Ops: disk error on db1065 - https://phabricator.wikimedia.org/T208709 (10Marostegui) 05Open>03declined As spoken via IRC, let's wait for these disks to finally fail and hosts with predictive errors are being tracked at {T208323} [16:16:14] not to annoy them or make burocratic steps [16:16:19] 10DBA, 10Operations: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (10Marostegui) [16:16:26] just to make sure they are aware of things ongoing [16:18:39] 10DBA, 10MediaWiki-General-or-Unknown, 10MW-1.33-notes (1.33.0-wmf.3; 2018-11-06), 10Patch-For-Review, 10Readers-Web-Backlog (Readers-Web-Kanbanana-Board-2018-19-Q2): [Bug] Update old nonuniformly distributed page_random values - https://phabricator.wikimedia.org/T208909 (10phuedx) … and now against [all... [16:18:54] jynus: +1 visibility is important [16:19:24] May I have some question about thise (just out of curiosity) [16:19:34] I see the task as high [16:19:46] and I have seen an unusual speed to deploy [16:20:19] but I never considered "get random article" a priority feature [16:20:45] randomness of course is important in a crypto context [16:21:11] but I don't think it is the case, is there something I may be missing about this that makes it high priority? [16:21:42] we're relying on the randomness of page_random in the context of a large a/b test we're running to assess the impact of adding schema.org data to mainspace pages on our search result rankings [16:21:52] ah, I see [16:22:00] so it is a depentency of other functionality [16:22:04] I didn't know that [16:22:16] that makes sense now [16:22:23] page_random is a really great field for this [16:22:36] especially with the current definition of wfRandom (or mysql's RAND) [16:22:52] I was perplexed about the importance to "make get random page" really random [16:22:58] haha! 
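A toy illustration of the point being made here: bucketing pages for an A/B test by thresholding page_random only yields the intended split if the values are uniformly distributed, and a cluster of rows stuck in a narrow range skews the assignment. The numbers below are invented for illustration and have nothing to do with the real 1.5M affected rows.

```python
"""Why non-uniform page_random biases threshold-based bucketing (toy example)."""
import random


def bucket(page_random, treatment_fraction=0.5):
    # pages with page_random below the threshold go to the treatment group
    return "treatment" if page_random < treatment_fraction else "control"


uniform = [random.random() for _ in range(100_000)]
# simulate a distribution defect: a chunk of rows clustered in a narrow low range
skewed = uniform[:98_000] + [random.uniform(0.0, 0.01) for _ in range(2_000)]

for name, values in [("uniform", uniform), ("skewed", skewed)]:
    treated = sum(1 for v in values if bucket(v) == "treatment")
    print(f"{name}: {treated / len(values):.3f} of pages land in treatment")
```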
[16:22:58] and now I understand [16:23:05] thanks [16:23:13] no worries :) [16:25:50] happily this was a page priority [16:26:08] page* feature [16:26:32] if it had been a field of revision it would have taken a few weeks to run [16:26:58] noted! [16:26:59] so it is just a few million rows to update rather than dozens of thousands of millions [16:27:20] * phuedx makes a note to avoid doing updates on the revision table [16:27:38] actually, anomie's work is to make that possible [16:27:46] thinning the revision table [16:28:03] because it reached a point where it was very difficult to work with it [16:29:20] we (systems) should never be an obstacle to what you want to do [16:29:27] but only a facilitator [16:33:39] :) that's a nice way of looking at it. we (readers web) haven't looked at y'all as an obstacle but as a source of wisdom that we wanted to tap before touching anything! [16:33:56] even though we're updating a single value on a couple million rows, we're pretty darn nervous [16:40:25] I'll install the pc* on the rest of the parsercache hosts in codfw - I'll disable puppet again on the old hosts [16:41:16] great [16:42:39] banyek: assuming you don't touch the role/profile or classes, it should be safe [16:43:17] I changed the node in manifests/site.pp to a regex instead of strings, that's the one I want to make sure about [16:43:26] sure [16:43:33] +1 then [16:47:40] not a big issue now, but I would suggest changing the regex to only match the right hosts [16:47:59] so it doesn't match pc2014/pc2017 [16:48:09] those don't exist now [16:48:32] but they eventually will and we don't want to create traps for ourselves [16:48:59] e.g. (04|07|10) is even easier to read [16:52:43] banyek: I think something went wrong with the enable_notifications [16:52:56] I've seen them [16:52:57] they are still showing as enabled for some reason [16:53:07] no, not talking about 10 [16:53:20] talking about 2007 [16:53:54] notifications there are still enabled, so either puppet didn't run on icinga [16:53:56] It's in maintenance mode [16:54:00] or something else went wrong [16:54:08] icinga is in maintenance? [16:54:13] :-/ [16:55:01] the host & services are downtimed in icinga, and notifications are disabled via puppet [16:55:07] what else could be done? [16:55:29] well, I am looking at icinga, and puppet didn't apply [16:55:35] just giving you a heads up [16:55:42] I don't know why [16:55:51] ah [16:55:53] I know [16:56:14] the pt-heartbeat is not able to be set up via puppet because there's no database runnung [16:56:17] *running [16:56:26] ah [16:56:28] after the DBs are cloned it will be good [16:56:32] I see [16:57:00] ack it manually at least for some days [16:57:32] I don't care, but other people will ask if they are ongoing alerts [16:57:42] there are* [17:03:22] 👍 [17:03:28] I do that and call this a day [17:03:55] marostegui: all the hosts are prepared, you can start cloning any of them, I'll do the rest when I am in tomorrow [17:04:03] sure, in the middle of an emergency :-) [17:04:17] of course [17:04:26] of course? [17:04:46] thank you very much, banyek [17:06:15] I just wanted to be positive [17:06:31] (is there an ongoing emergency that I missed? I thought you were joking) [17:06:55] check operations- [17:14:29] banyek: why are pc paging? didn't you disable notifications? [17:23:50] banyek: ping?
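On the site.pp regex discussion above: the point of the suggestion is that the node pattern should match only the intended parsercache hosts and not future ones such as pc2014/pc2017. The actual pattern used in site.pp is not quoted in the log, so the regex below is only an assumed illustration of the tighter "(04|07|10)" form.

```python
"""Quick check that a tightened node regex matches only the intended hosts.

The pattern itself is an assumption for illustration, not the one merged
into site.pp.
"""
import re

NODE_RE = re.compile(r"^pc20(04|07|10)\.codfw\.wmnet$")

hosts = [
    "pc2004.codfw.wmnet",  # should match
    "pc2007.codfw.wmnet",  # should match
    "pc2010.codfw.wmnet",  # should match
    "pc2014.codfw.wmnet",  # must not match (hypothetical future host)
    "pc2017.codfw.wmnet",  # must not match (hypothetical future host)
]

for host in hosts:
    print(host, "matches" if NODE_RE.match(host) else "does not match")
```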
[17:26:37] 10DBA, 10Analytics, 10Analytics-Kanban, 10Data-Services: Not able to scoop comment table in labs for mediawiki reconstruction process - https://phabricator.wikimedia.org/T209031 (10fdans) a:03JAllemandou [17:27:05] 10DBA, 10Analytics, 10Analytics-Kanban, 10Data-Services: Not able to scoop comment table in labs for mediawiki reconstruction process - https://phabricator.wikimedia.org/T209031 (10fdans) p:05Triage>03High [17:28:20] sorry the painter was here [17:28:36] read back, I disabled them, and downtimes the hosts, so I don't know [17:28:44] I ACK everything [17:28:48] marostegui ^ [17:31:00] I don't see them as disabled [17:31:10] notifications are enabled [17:35:01] ```banyek@shambala:~/projects/wikimedia/puppet (production) $ cat hieradata/hosts/pc20* [17:35:01] mariadb::parsercache::shard: 'pc1' [17:35:01] mariadb::parsercache::shard: 'pc2' [17:35:01] mariadb::parsercache::shard: 'pc3' [17:35:01] mariadb::parsercache::shard: 'pc1' [17:35:01] profile::base::notifications_enabled: '0' [17:35:01] mariadb::parsercache::shard: 'pc2' [17:35:02] profile::base::notifications_enabled: '0' [17:35:02] mariadb::parsercache::shard: 'pc3' [17:35:03] profile::base::notifications_enabled: '0' [17:35:03] mariadb::parsercache::shard: 'pc1' [17:35:04] profile::base::notifications_enabled: '0'``` [17:35:17] hm... [17:35:29] what did I miss? [17:42:00] don't know, check icinga host for those hosts definition I guess could be an start [17:43:11] Hmmm... Yeay that would be the reason I didn't run puppet on einsteinium [17:53:22] so I would suggest you either fix that or give them more downtime and investigate tomorrow [17:53:37] But make sure they don't page anymore either way ;) [17:58:34] I am going to disconect, getting ill [18:08:41] I downtimed them until 20181120 [18:08:49] and also ran on einsteinium [18:09:06] I am pretty sure they wont page [18:09:09] now I am out too [18:09:18] see you tomorrow [18:09:51] (be better jaim3) [18:25:56] 10DBA, 10Analytics, 10Analytics-Kanban, 10Data-Services: Not able to scoop comment table in labs for mediawiki reconstruction process - https://phabricator.wikimedia.org/T209031 (10Anomie) Note the `actor` view will likely turn out to have similar issues. As suggested in T209031#4736006, one solution woul... [19:27:21] 10DBA, 10MediaWiki-General-or-Unknown, 10MW-1.33-notes (1.33.0-wmf.3; 2018-11-06), 10Patch-For-Review, 10Readers-Web-Backlog (Readers-Web-Kanbanana-Board-2018-19-Q2): [Bug] Update old nonuniformly distributed page_random values - https://phabricator.wikimedia.org/T208909 (10Niedzielski) a:05phuedx>03T... [19:45:47] 10DBA, 10MediaWiki-Commenting, 10Patch-For-Review, 10Schema-change: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089 (10MGChecker) [19:46:01] 10Blocked-on-schema-change, 10DBA, 10Data-Services, 10Dumps-Generation, and 2 others: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569 (10MGChecker) [19:52:01] 10DBA, 10Cloud-VPS, 10MediaWiki-Commenting: Decide whether back-compat views for upcoming major schema changes will be provided in the Labs replicas - https://phabricator.wikimedia.org/T166798 (10MGChecker) [20:07:14] 10DBA, 10Wikimedia-Site-requests, 10Community-consensus-needed, 10Patch-For-Review, 10User-Zoranzoki21: Remove FlaggedRevs and add back rights autopatrolled, patroller (with enabled RCPatrol), rollbacker on srwikinews - https://phabricator.wikimedia.org/T209251 (10Framawiki) I suppose that #mediawiki-ext... 
[21:21:06] 10DBA, 10Analytics, 10Analytics-Kanban, 10Data-Services: Not able to scoop comment table in labs for mediawiki reconstruction process - https://phabricator.wikimedia.org/T209031 (10JAllemandou) Thanks @Anomie. We (analytics team) also had thought of a third potential solution. I list the 3 solutions below... [21:49:39] 10DBA, 10MediaWiki-General-or-Unknown, 10MW-1.33-notes (1.33.0-wmf.3; 2018-11-06), 10Patch-For-Review, 10Readers-Web-Backlog (Readers-Web-Kanbanana-Board-2018-19-Q2): [Bug] Update old nonuniformly distributed page_random values - https://phabricator.wikimedia.org/T208909 (10Tbayer) Repeating query [4] fr... [23:08:05] 10DBA, 10MediaWiki-General-or-Unknown, 10MW-1.33-notes (1.33.0-wmf.3; 2018-11-06), 10Patch-For-Review, 10Readers-Web-Backlog (Readers-Web-Kanbanana-Board-2018-19-Q2): [Bug] Update old nonuniformly distributed page_random values - https://phabricator.wikimedia.org/T208909 (10Niedzielski) /cc @mpopov