[04:50:38] 10DBA, 10Operations: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (10Marostegui) Once the disk have failed we will get an automatic ticket for getting that disk replaced. I don't think we need this tracking taks. [05:36:44] 10DBA, 10Operations: db2061 has predictive disk errors - https://phabricator.wikimedia.org/T208957 (10Marostegui) I would ignore this until the disks fail and we get the automatic failed disk task created [05:39:06] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review, 10User-Banyek: rack/setup/install pc2007-pc2010 - https://phabricator.wikimedia.org/T207259 (10Marostegui) 05Open>03Resolved I agree with Jaime, a bigger stripe size shouldn't be an issue. Plus, we will be having SSDs, which will probably compensa... [05:39:42] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review, 10User-Banyek: rack/setup/install pc2007-pc2010 - https://phabricator.wikimedia.org/T207259 (10Marostegui) [05:49:34] 10DBA: Duplicate rows error in db2095 replication @s7 - https://phabricator.wikimedia.org/T208672 (10Marostegui) >>! In T208672#4724031, @Banyek wrote: > I re-run the check_private_data.py to see the if reimport was good as the filters/triggers were working What's the status of this? Was this fully done? [05:59:03] 10Blocked-on-schema-change, 10MediaWiki-Change-tagging, 10MediaWiki-Database, 10Wikidata, and 2 others: Schema change for adding indexes of ct_tag_id - https://phabricator.wikimedia.org/T203709 (10Marostegui) >>! In T203709#4724115, @Addshore wrote: > @Marostegui Do we have an ETA on these indexes being re... [06:08:01] 10Blocked-on-schema-change, 10MediaWiki-Change-tagging, 10MediaWiki-Database, 10Wikidata, and 2 others: Schema change for adding indexes of ct_tag_id - https://phabricator.wikimedia.org/T203709 (10Marostegui) This is very weird, I am _pretty_ sure I did s8 eqiad whilst codfw was active. The fact that this... [06:08:20] 10Blocked-on-schema-change, 10MediaWiki-Change-tagging, 10MediaWiki-Database, 10Wikidata, and 2 others: Schema change for adding indexes of ct_tag_id - https://phabricator.wikimedia.org/T203709 (10Marostegui) [06:17:41] 10DBA, 10MediaWiki-API, 10MediaWiki-Database: prop=revisions API timing out for a specific user and pages they edited - https://phabricator.wikimedia.org/T197486 (10Marostegui) Yeah, I am curious to see if 10.0.37 really fixes this :-) [06:32:07] 10DBA, 10Operations, 10ops-eqiad: db1117 went away - https://phabricator.wikimedia.org/T208150 (10Marostegui) 05Open>03Resolved a:03Cmjohnson I see no more errors on the idrac logs since the reboot. Let's close this and re-open if this happens again and then we'll need to get the vendor involved. ` E... [06:34:31] 10DBA, 10Operations: BBU Fail on dbstore2002 - https://phabricator.wikimedia.org/T208320 (10Marostegui) a:05Papaul>03None I guess this is the reason why dbstore2002:3313 is lagging so much behind? (It could be something else, I am just catching up on emails) Should we ease a bit replication options to make... [06:43:32] 10DBA, 10MediaWiki-extensions-WikibaseMediaInfo, 10SDC Engineering, 10StructuredDataOnCommons, 10Wikidata: MediaInfo extension should not use the wb_terms table - https://phabricator.wikimedia.org/T208330 (10Marostegui) I would prefer if we don't enable more stuff in production that uses `wb_term` table,... [07:02:36] 10DBA, 10Data-Services, 10Datasets-General-or-Unknown: Archive and drop education program (ep_*) tables on all wikis - https://phabricator.wikimedia.org/T174802 (10Marostegui) >>! 
In T174802#4733544, @Urbanecm wrote: > What about sending a notification via #user-notice and drop the data after a month since... [07:43:32] 10DBA, 10Analytics, 10Data-Services: Not able to scoop comment table in labs for mediawiki reconstruction process - https://phabricator.wikimedia.org/T209031 (10jcrespo) This is not something we handle- we don't decide on the table structure (this refactoring, comment storage, was owned by Platform team), wh... [08:07:58] 10DBA, 10Operations: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (10Marostegui) Already caught up with Jaime about why this ticket exists. All good here [08:12:22] 10DBA, 10Epic, 10Tracking: Database tables to be dropped on Wikimedia wikis and other WMF databases (tracking) - https://phabricator.wikimedia.org/T54921 (10Marostegui) [08:12:24] 10DBA, 10foundation.wikimedia.org: Drop the petition_data table from production - https://phabricator.wikimedia.org/T208979 (10Marostegui) [08:12:35] 10DBA, 10foundation.wikimedia.org: Drop the petition_data table from production - https://phabricator.wikimedia.org/T208979 (10Marostegui) p:05Triage>03Normal [08:17:01] 10DBA, 10DC-Ops: disk error on db1065 - https://phabricator.wikimedia.org/T208709 (10Marostegui) I would suggest waiting for this disk to fail completely before replacing it. [08:27:01] 10DBA: Failover m3 codfw master - https://phabricator.wikimedia.org/T209261 (10Marostegui) [08:27:37] 10DBA: Failover m3 codfw master - https://phabricator.wikimedia.org/T209261 (10Marostegui) p:05Triage>03Normal [08:27:52] 10DBA, 10Operations, 10ops-codfw, 10User-Banyek: db2042 (m3) master RAID battery failed - https://phabricator.wikimedia.org/T202051 (10Marostegui) This failed again and I have created T202051 to track it [08:29:18] Is dbstore2002:3313 so lagged because of the BBU failure? https://phabricator.wikimedia.org/T208320#4738624 [08:30:28] no idea, I saw it on Thursday when it was only 1 day behind and saw it replicating the maintenance, so I supposed it was going to be temporary [08:30:47] it may still be the ongoing maintenance [08:30:53] banyek: any idea? [08:32:16] no, I didn't spend time on that. Practically speaking that wasn't a good idea, but if I recall correctly, after I realized it's normally lagging because of backups I didn't pay too much attention to it, as I had other, more scary stuff to do [08:33:44] it is 6 days behind now :( [08:34:33] we shall enable the write-back cache even though the BBU is bad, then [08:36:13] I don't know if it is that, I am catching up with emails but saw that alert there, so I was wondering what's been done [08:36:30] I don't know if there are scripts running for maintenance or what [08:36:36] Last week was pretty rough, the sanitariums broke constantly [08:37:22] I would expect 3311 to also lag if it was the BBU [08:37:38] So maybe there are scripts running for maintenance, I just don't know, I am not in the loop [08:38:19] The tables are compressed in dbstore2002 too, that could also be a reason - even if it is unprobable [08:38:24] (is that a word?) [08:40:36] I am going to ease consistency variables to get it back to a more recent state, 6 days is too much (and it is still increasing) [08:40:40] so UPDATE /* MigrateComments::migrate www-data@mwmain... */ is still running there [08:40:54] yeah, I saw that one [08:43:00] 10DBA, 10Operations: BBU Fail on dbstore2002 - https://phabricator.wikimedia.org/T208320 (10Marostegui) >>!
In T208320#4738624, @Marostegui wrote: > I guess this is the reason why dbstore2002:3313 is lagging so much behind? (It could be something else, I am just catching up on emails) > Should we ease a bit re... [08:50:48] hope you had a good vacation marostegui [08:53:57] hehe I did! [08:53:58] Thanks :) [09:08:38] 10Blocked-on-schema-change, 10DBA, 10Patch-For-Review, 10Schema-change: Dropping rc_moved_to_title/rc_moved_to_ns on wmf databases - https://phabricator.wikimedia.org/T51191 (10Marostegui) 05stalled>03Open [09:09:07] 10DBA, 10Schema-change: Drop externallinks.el_from_namespace on wmf databases - https://phabricator.wikimedia.org/T114117 (10Marostegui) 05stalled>03Open [09:09:15] 10DBA, 10Epic, 10Tracking: Database tables to be dropped on Wikimedia wikis and other WMF databases (tracking) - https://phabricator.wikimedia.org/T54921 (10Marostegui) [09:09:33] 10DBA, 10Patch-For-Review: Drop ct_ indexes on change_tag - https://phabricator.wikimedia.org/T205913 (10Marostegui) 05stalled>03Open [09:09:43] 10DBA, 10Datasets-General-or-Unknown, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External), and 2 others: Automate the check and fix of object, schema and data drifts between mediawiki HEAD, production masters and slaves - https://phabricator.wikimedia.org/T104459 (10Marostegui) [09:09:45] 10DBA, 10Schema-change, 10Tracking: [DO NOT USE] Schema changes for Wikimedia wikis (tracking) [superseded by #Blocked-on-schema-change] - https://phabricator.wikimedia.org/T51188 (10Marostegui) [09:09:47] 10Blocked-on-schema-change, 10DBA, 10Patch-For-Review, 10Schema-change: Dropping rc_cur_time on wmf databases - https://phabricator.wikimedia.org/T67448 (10Marostegui) 05stalled>03Open [09:12:29] 10DBA, 10Data-Services, 10Datasets-General-or-Unknown, 10User-notice: Archive and drop education program (ep_*) tables on all wikis - https://phabricator.wikimedia.org/T174802 (10Urbanecm) Suggested wording for the notice: "Tables used by nowadays archived extension [Education Program](https://www.mediawik... [09:30:07] hullo hullo [09:30:41] i'd like to start running this script https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/472596/ to fix ~1.5m rows with poorly distributed page.page_random values [09:31:06] the script has been reviewed by a whole bunch of folk but not by a dba afaict [09:33:32] 10DBA, 10Operations: BBU Fail on dbstore2002 - https://phabricator.wikimedia.org/T208320 (10Banyek) s3 is the only section there which is not compressed. Btw. We can check if the BBU causes it, because if we enable write caching we can see the results. [09:33:54] anomie confirmed on the associated task (https://phabricator.wikimedia.org/T208909) that it won't conflict with the scripts that they're running, which should be about finished [09:34:56] brb [09:35:00] 10DBA, 10Schema-change, 10Tracking: [DO NOT USE] Schema changes for Wikimedia wikis (tracking) [superseded by #Blocked-on-schema-change] - https://phabricator.wikimedia.org/T51188 (10Marostegui) [09:35:02] 10Blocked-on-schema-change, 10DBA, 10Patch-For-Review, 10Schema-change: Dropping rc_cur_time on wmf databases - https://phabricator.wikimedia.org/T67448 (10Marostegui) 05Open>03Resolved [09:35:39] phuedx: I have been out on holidays for 2 weeks, so I don't have the context, I am still catching up, so not sure what to say :) [09:35:42] banyek: ^ [09:35:55] banyek: you've got more details I guess? 
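The fix phuedx describes here (and details a bit further down) follows the usual batch-and-wait pattern: pull a small batch of affected page IDs from a replica, update them on the master, then block until replication catches up before the next batch. The sketch below only illustrates that pattern; the connection details, the predicate for "badly distributed" page_random values and the lag threshold are assumptions, not what the actual MediaWiki maintenance script in the Gerrit change does.

```python
"""Minimal sketch of the batch-and-wait update pattern (illustrative only).

Hypothetical pieces: host names, credentials, the WHERE clause used to find
"affected" rows, and the lag threshold.
"""
import random
import time

import pymysql

BATCH_SIZE = 200        # batch size agreed on in the discussion
MAX_REPLICA_LAG = 1.0   # seconds; hypothetical threshold

master = pymysql.connect(host="master.example", db="enwiki", user="maint", password="...")
replica = pymysql.connect(host="replica.example", db="enwiki", user="maint", password="...")


def replica_lag_seconds():
    """Read Seconds_Behind_Master from the replica (simplified lag check)."""
    with replica.cursor(pymysql.cursors.DictCursor) as cur:
        cur.execute("SHOW SLAVE STATUS")
        row = cur.fetchone()
        return (row or {}).get("Seconds_Behind_Master") or 0


def next_affected_batch(last_id):
    """Fetch the next batch of page IDs whose page_random needs fixing.

    The WHERE clause below is a placeholder, not the real test for badly
    distributed values.
    """
    with replica.cursor() as cur:
        cur.execute(
            "SELECT page_id FROM page "
            "WHERE page_id > %s AND page_random < 0.000001 "
            "ORDER BY page_id LIMIT %s",
            (last_id, BATCH_SIZE),
        )
        return [r[0] for r in cur.fetchall()]


last_id = 0
while True:
    batch = next_affected_batch(last_id)
    if not batch:
        break
    with master.cursor() as cur:
        for page_id in batch:
            cur.execute(
                "UPDATE page SET page_random = %s WHERE page_id = %s",
                (random.random(), page_id),
            )
    master.commit()
    last_id = batch[-1]
    # wait for replicas before touching the next batch
    while replica_lag_seconds() > MAX_REPLICA_LAG:
        time.sleep(1)
```

Keeping each write batch small (200 rows, as agreed below) is what bounds the replica lag and lock time introduced per iteration.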
[09:36:08] i can add some context, if you'd like [09:36:53] Let's wait for banyek as he was here for those two weeks, so I guess he knows what all this is about :) [09:40:46] marostegui, banyek: sure. my team's been operating with a lot of caution because large updates to the db are outside of our regular wheelhouse. the script's been reviewed by quite a few folk so we're more confident. i guess what i'm asking is: is there any reason why i shouldn't start running the script? :) [09:40:59] also, i hope you had a good vacation! [10:03:24] marostegui: tldr; iirc ; so there's a 14 year old problem with some random number distribution around the entire database which could be fixed running a maintenance script and phuedx needs our blessing to start the maintenance script now [10:04:05] Sure, I assume it has all the wait for replications, it will be done in reasonable size batches and all that? [10:04:51] marostegui: we were thinking a batch size of 1000. we query the replicas for affected rows in the batch. update the affected rows. wait for replicas. continue [10:05:07] 1.6m rows are affected. 600,000 of which are in the enwiki database [10:05:25] phuedx: I would start with lower batch size, just to be on the safe side [10:05:46] marostegui: acknowledged. 200 (which i think is set in the change itself) [10:05:48] ? [10:05:54] sure, that sounds good [10:06:01] https://gerrit.wikimedia.org/r/#/c/472596/ [10:06:16] banyek: did you review it yourself too? [10:06:55] I checked the php code, but I wasn't through - I didn't +1'd [10:07:23] actually I don't care much about the code, but about the logic :) [10:08:10] phuedx: where will you run the script on at first? like, which wiki [10:08:11] I need to re-check it, because I don't know. As I see tgr approved and that could be a sign of good code [10:08:45] marostegui: this is the first time i've done something like this, so your advice (and banyek's!) will be invaluable [10:08:58] phuedx: are all the wikis affected? [10:10:17] marostegui: yes, aiui [10:10:37] phuedx: why not starting with testwiki then? [10:10:50] marostegui: seems reasonable, then mediawikiwiki etc [10:11:17] it affects any wikis that were in production between October 11, 2004 and July 7, 2005 [10:12:16] phuedx: after those you can check small.dblist [10:12:38] sweet! so i could foreachwiki against small.dblist [10:12:55] sure, after test wiki and mediawikiwiki that is a good option [10:13:36] alright. if you're happy. then i'll pull the script onto mwmaint1002 and run it. i'll shout out in -operations too [10:13:49] sure, make sure to !log it [10:16:35] marostegui: i appear to be denied access to the wmf_raw.mediawiki_revision table on quarry [10:16:39] if i could run this query: https://quarry.wmflabs.org/query/31134 [10:16:48] then i could know which wikis are affected [10:16:55] *all of the wikis [10:17:38] is that on labs? [10:17:56] what is that wmf_raw? [10:18:00] wait. 
that's a hive table :/ [10:18:04] *facepalm* [10:59:40] 10Blocked-on-schema-change, 10DBA, 10Patch-For-Review, 10Schema-change: Dropping rc_moved_to_title/rc_moved_to_ns on wmf databases - https://phabricator.wikimedia.org/T51191 (10Marostegui) 05Open>03Resolved [12:21:55] :D [12:23:28] I go eat something [13:20:24] marostegui: the script has finally been deployed and i'm going to run it against testwiki [13:23:53] 10DBA, 10MediaWiki-General-or-Unknown, 10MW-1.33-notes (1.33.0-wmf.4; 2018-11-13), 10Patch-For-Review, 10Readers-Web-Backlog (Readers-Web-Kanbanana-Board-2018-19-Q2): [Bug] Update old nonuniformly distributed page_random values - https://phabricator.wikimedia.org/T208909 (10phuedx) a:03phuedx [13:28:43] no failures seen. 32 rows updated. [13:28:58] marostegui: i'm going to run the script against mediawikiwiki [13:35:12] no failures. 246 rows updated [13:49:49] marostegui: small.dblist covers 227 of the 577 wikis in my list of affected wikis [13:53:58] i'm going to feed that dblist to foreach wiki, if that's alright by you [13:58:42] sure [14:00:36] sorry i meant mwscriptwikiset [14:00:40] foreachwiki is all wikis :) [14:00:53] *foreachwikiindblist [14:03:30] 10Blocked-on-schema-change, 10MediaWiki-Change-tagging, 10MediaWiki-Database, 10Wikidata, and 3 others: Schema change for adding indexes of ct_tag_id - https://phabricator.wikimedia.org/T203709 (10Marostegui) I know realised why the change was "gone" from all the eqiad s8 hosts except db1071, db1087,db1124... [14:04:37] 10DBA, 10User-Banyek: Checking archive tables across the databases - https://phabricator.wikimedia.org/T209048 (10Marostegui) p:05Triage>03Normal [14:05:45] 10DBA, 10MediaWiki-General-or-Unknown, 10MW-1.33-notes (1.33.0-wmf.3; 2018-11-06), 10Patch-For-Review, 10Readers-Web-Backlog (Readers-Web-Kanbanana-Board-2018-19-Q2): [Bug] Update old nonuniformly distributed page_random values - https://phabricator.wikimedia.org/T208909 (10phuedx) The script is currentl... [14:09:33] small wikis are done [14:13:26] marostegui: i'll move on to medium.dblist [14:13:31] 10Blocked-on-schema-change, 10MediaWiki-Change-tagging, 10MediaWiki-Database, 10Wikidata, and 3 others: Schema change for adding indexes of ct_tag_id - https://phabricator.wikimedia.org/T203709 (10jcrespo) Could you check the list of schema changes and maintenance to be ran during switchover to test if the... [14:14:32] 10Blocked-on-schema-change, 10MediaWiki-Change-tagging, 10MediaWiki-Database, 10Wikidata, and 3 others: Schema change for adding indexes of ct_tag_id - https://phabricator.wikimedia.org/T203709 (10Marostegui) >>! In T203709#4739468, @jcrespo wrote: > Could you check the list of schema changes and maintenan... [14:49:19] I start to roll out the new parsercache code [14:49:36] cool [14:53:02] puppet disabled, now merging the patch [14:56:04] did you disable alerting on the new host? 
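For the per-wiki runs mentioned above (testwiki, then mediawikiwiki, then small.dblist), wrappers like foreachwikiindblist / mwscriptwikiset essentially iterate over a dblist and invoke the maintenance script once per wiki. A rough Python equivalent of that loop is sketched below; the dblist path, script name and flags are placeholders rather than the real ones.

```python
"""Rough sketch of a per-wiki maintenance run driven by a dblist.

Assumptions: the dblist file holds one dbname per line, `mwscript` is the
wrapper available on the maintenance host, and the script path and
--batch-size flag are hypothetical stand-ins for the real change.
"""
import subprocess
import sys

DBLIST = "/srv/mediawiki/dblists/small.dblist"  # assumed location
SCRIPT = "maintenance/updatePageRandom.php"     # hypothetical script name


def wikis(dblist_path):
    """Yield dbnames from a dblist, skipping blanks and comments."""
    with open(dblist_path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#"):
                yield line


for wiki in wikis(DBLIST):
    print(f"=== {wiki} ===")
    result = subprocess.run(
        ["mwscript", SCRIPT, "--wiki", wiki, "--batch-size", "200"]
    )
    if result.returncode != 0:
        sys.exit(f"{wiki} failed with exit code {result.returncode}")
```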
[14:56:14] otherwise it may page/alert [14:56:45] it has to be done on puppet because with icinga it would not apply to new alerts [14:56:58] no [14:57:01] damn it [14:58:52] I downtimed it on icinga [15:02:04] on icinga it will do nothing [15:02:10] as it will not apply to the new services [15:02:17] they will alert anyway [15:03:19] you should quickly disable alerting on the new services [15:07:21] if you add a +1 on https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/473021/ I merge it [15:11:12] ok, it's out [15:12:37] I enable puppet on pc2004 and run the agent there too [15:12:46] 🤞 [15:14:26] cool, there's no changes except the motd [15:14:31] https://www.irccloud.com/pastebin/r3AiKlJx/ [15:14:53] I enable the puppet on the hosts one-by-one and run the agent [15:17:42] 10DBA, 10MediaWiki-General-or-Unknown, 10MW-1.33-notes (1.33.0-wmf.3; 2018-11-06), 10Patch-For-Review, 10Readers-Web-Backlog (Readers-Web-Kanbanana-Board-2018-19-Q2): [Bug] Update old nonuniformly distributed page_random values - https://phabricator.wikimedia.org/T208909 (10phuedx) … and now against [all... [15:28:22] I am going to upgrade db2094 to 10.1.37 [15:43:01] marostegui: I see it ^downtime for a schema change, is it something I can stop? [15:43:19] should have finished, let me check [15:43:40] I can also check [15:43:56] it is finished [15:43:59] which sections, all? [15:44:01] you can proceed [15:44:03] only s1 [15:44:09] I downtimed it all [15:44:30] ok, reusing your window, will later put it back up again [15:44:38] great, thank you [15:47:34] BTW, test-s1 (db1118) is still broken [15:47:48] yep [15:47:49] will eventually install 10.3 there [15:47:58] but not a priority, so I left it alone [15:48:00] +1! [15:48:17] 10DBA, 10Analytics, 10Data-Services: Not able to scoop comment table in labs for mediawiki reconstruction process - https://phabricator.wikimedia.org/T209031 (10Nuria) Pinging @bd808 and @Fjalapeno and @tstarling per above comment. [15:49:13] 10DBA, 10Analytics, 10Data-Services: Not able to scoop comment table in labs for mediawiki reconstruction process - https://phabricator.wikimedia.org/T209031 (10Milimetric) @Nuria, I'm catching up on the task that Jaime recommended and will comment there. [15:53:00] 10DBA, 10Analytics, 10Data-Services: Not able to scoop comment table in labs for mediawiki reconstruction process - https://phabricator.wikimedia.org/T209031 (10Pintoch) I have taken the liberty to remove "Cloud Services" as a subscriber to this ticket as I do not think every toollabs user wants to receive n... [15:54:55] 10DBA, 10Analytics, 10Data-Services: Not able to scoop comment table in labs for mediawiki reconstruction process - https://phabricator.wikimedia.org/T209031 (10Cyberpower678) Why am I getting emails to this task? [16:01:47] 10DBA, 10Analytics, 10Data-Services: Not able to scoop comment table in labs for mediawiki reconstruction process - https://phabricator.wikimedia.org/T209031 (10jcrespo) Nuria apparently subscribed 120 cloud users to this task by mistake- please be careful when using Phabricator to not annoy (with spam) our... [16:04:30] running the script against the large wikis now. as part of the medium wikis, the script chomped through some fairly large rowsets with ease. 
i predict that it'll be done in the next two hours [16:04:56] i'm still using a batch size of 200 [16:06:18] good, so far no lag apparently https://grafana.wikimedia.org/dashboard/db/mysql-aggregated?panelId=6&fullscreen&orgId=1 [16:06:25] the s8 spikes are my alter tables on some hosts [16:07:30] phuedx: do you know how to check those issues? one is the link above [16:08:21] the other is https://logstash.wikimedia.org/app/kibana#/dashboard/DBReplication [16:08:45] of course, any regular "high rate of errors" is always theoretically possible [16:09:19] e.g. webrequests taking a lot of time due to extra locking, etc. [16:10:39] moore's law says that if done carefuly and checking, errors don't happen, while if you are not, they will :-D [16:11:43] ^ lol and that's absolutely true [16:12:33] so we check to exect no errors to cover for that :-D [16:12:40] *expect [16:13:29] also it is normally the things you don't think will go bad- e.g. like weird interactions with other ongoing scripts [16:15:21] jynus: yeah. the only script that's called out anywhere that i can see is anomie's ongoing script (on the deployments page) but there could well be others! [16:15:39] 10DBA, 10Operations: db2061 has predictive disk errors - https://phabricator.wikimedia.org/T208957 (10Marostegui) 05Open>03declined As spoken via IRC, let's wait for these disks to finally fail (we don't have spares anyways) and hosts with predictive errors are being tracked at {T208323} [16:16:02] phuedx: that is why we ask people to put the ongoing things on deployments [16:16:07] 10DBA, 10DC-Ops: disk error on db1065 - https://phabricator.wikimedia.org/T208709 (10Marostegui) 05Open>03declined As spoken via IRC, let's wait for these disks to finally fail and hosts with predictive errors are being tracked at {T208323} [16:16:14] not to annoy them or make burocratic steps [16:16:19] 10DBA, 10Operations: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (10Marostegui) [16:16:26] just to make sure they are aware of things ongoing [16:18:39] 10DBA, 10MediaWiki-General-or-Unknown, 10MW-1.33-notes (1.33.0-wmf.3; 2018-11-06), 10Patch-For-Review, 10Readers-Web-Backlog (Readers-Web-Kanbanana-Board-2018-19-Q2): [Bug] Update old nonuniformly distributed page_random values - https://phabricator.wikimedia.org/T208909 (10phuedx) … and now against [all... [16:18:54] jynus: +1 visibility is important [16:19:24] May I have some question about thise (just out of curiosity) [16:19:34] I see the task as high [16:19:46] and I have seen an unusual speed to deploy [16:20:19] but I never considered "get random article" a priority feature [16:20:45] randomness of course is important in a crypto context [16:21:11] but I don't think it is the case, is there something I may be missing about this that makes it high priority? [16:21:42] we're relying on the randomness of page_random in the context of a large a/b test we're running to assess the impact of adding schema.org data to mainspace pages on our search result rankings [16:21:52] ah, I see [16:22:00] so it is a depentency of other functionality [16:22:04] I didn't know that [16:22:16] that makes sense now [16:22:23] page_random is a really great field for this [16:22:36] especially with the current definition of wfRandom (or mysql's RAND) [16:22:52] I was perplexed about the importance to "make get random page" really random [16:22:58] haha! 
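A toy illustration of the point being made here: bucketing pages for an A/B test by thresholding page_random only yields the intended split if the values are uniformly distributed, and a cluster of rows stuck in a narrow range skews the assignment. The numbers below are invented for illustration and have nothing to do with the real 1.5M affected rows.

```python
"""Why non-uniform page_random biases threshold-based bucketing (toy example)."""
import random


def bucket(page_random, treatment_fraction=0.5):
    # pages with page_random below the threshold go to the treatment group
    return "treatment" if page_random < treatment_fraction else "control"


uniform = [random.random() for _ in range(100_000)]
# simulate a distribution defect: a chunk of rows clustered in a narrow low range
skewed = uniform[:98_000] + [random.uniform(0.0, 0.01) for _ in range(2_000)]

for name, values in [("uniform", uniform), ("skewed", skewed)]:
    treated = sum(1 for v in values if bucket(v) == "treatment")
    print(f"{name}: {treated / len(values):.3f} of pages land in treatment")
```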
[16:22:58] and now I understand [16:23:05] thanks [16:23:13] no worries :) [16:25:50] happily this was a page priority [16:26:08] page* feature [16:26:32] if it had been a field of revision it would have taken a few weeks to run [16:26:58] noted! [16:26:59] so it is just a few million rows to update rather than dozens of thousands of millions [16:27:20] * phuedx makes a note to avoid doing updates on the revision table [16:27:38] actually, anomie's work is to make that possible [16:27:46] thinning the revision table [16:28:03] because it reached a point where it was very difficult to work with it [16:29:20] we (systems) should never be an obstacle to what you want to do [16:29:27] but only a facilitator [16:33:39] :) that's a nice way of looking at it. we (readers web) haven't looked at y'all as an obstacle but as a source of wisdom that we wanted to tap before touching anything! [16:33:56] even though we're updating a single value on a couple million rows, we're pretty darn nervous [16:40:25] I'll install the pc* on the rest of the parsercache hosts in codfw - I'll disable puppet again on the old hosts [16:41:16] great [16:42:39] banyek: assuming you don't touch the role/profile or classes, it should be safe [16:43:17] I changed the node in manifests/site.pp to a regex instead of strings, that's the one I want to make sure about [16:43:26] sure [16:43:33] +1 then [16:47:40] not a big issue now, but I would suggest changing the regex to only match the right hosts [16:47:59] so it doesn't match pc2014/pc2017 [16:48:09] those don't exist now [16:48:32] but they eventually will and we don't want to create traps for ourselves [16:48:59] e.g. (04|07|10) is even easier to read [16:52:43] banyek: I think something went wrong with the enable_notifications [16:52:56] I've seen them [16:52:57] they are still showing as enabled for some reason [16:53:07] no, not talking about 10 [16:53:20] talking about 2007 [16:53:54] notifications there are still enabled, so either puppet didn't run on icinga [16:53:56] It's in maintenance mode [16:54:00] or something else went wrong [16:54:08] icinga is in maintenance? [16:54:13] :-/ [16:55:01] the host & services are downtimed in icinga, and notifications are disabled via puppet [16:55:07] what else could be done? [16:55:29] well, I am looking at icinga, and puppet didn't apply [16:55:35] just giving you a heads up [16:55:42] I don't know why [16:55:51] ah [16:55:53] I know [16:56:14] the pt-heartbeat is not able to be set up via puppet because there's no database runnung [16:56:17] *running [16:56:26] ah [16:56:28] after the DBs are cloned it will be good [16:56:32] I see [16:57:00] ack it manually at least for some days [16:57:32] I don't care, but other people will ask if they are ongoing alerts [16:57:42] there are* [17:03:22] 👍 [17:03:28] I do that and call this a day [17:03:55] marostegui: all the hosts are prepared, you can start cloning any of them, I'll do the rest when I am in tomorrow [17:04:03] sure, in the middle of an emergency :-) [17:04:17] of course [17:04:26] of course? [17:04:46] thank you very much, banyek [17:06:15] I just wanted to be positive [17:06:31] (is there an ongoing emergency that I missed? I thought you were joking) [17:06:55] check operations- [17:14:29] banyek: why are pc paging? didn't you disable notifications? [17:23:50] banyek: ping?
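On the site.pp regex discussion above: the point of the suggestion is that the node pattern should match only the intended parsercache hosts and not future ones such as pc2014/pc2017. The actual pattern used in site.pp is not quoted in the log, so the regex below is only an assumed illustration of the tighter "(04|07|10)" form.

```python
"""Quick check that a tightened node regex matches only the intended hosts.

The pattern itself is an assumption for illustration, not the one merged
into site.pp.
"""
import re

NODE_RE = re.compile(r"^pc20(04|07|10)\.codfw\.wmnet$")

hosts = [
    "pc2004.codfw.wmnet",  # should match
    "pc2007.codfw.wmnet",  # should match
    "pc2010.codfw.wmnet",  # should match
    "pc2014.codfw.wmnet",  # must not match (hypothetical future host)
    "pc2017.codfw.wmnet",  # must not match (hypothetical future host)
]

for host in hosts:
    print(host, "matches" if NODE_RE.match(host) else "does not match")
```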
[17:26:37] 10DBA, 10Analytics, 10Analytics-Kanban, 10Data-Services: Not able to scoop comment table in labs for mediawiki reconstruction process - https://phabricator.wikimedia.org/T209031 (10fdans) a:03JAllemandou [17:27:05] 10DBA, 10Analytics, 10Analytics-Kanban, 10Data-Services: Not able to scoop comment table in labs for mediawiki reconstruction process - https://phabricator.wikimedia.org/T209031 (10fdans) p:05Triage>03High [17:28:20] sorry the painter was here [17:28:36] read back, I disabled them, and downtimes the hosts, so I don't know [17:28:44] I ACK everything [17:28:48] marostegui ^ [17:31:00] I don't see them as disabled [17:31:10] notifications are enabled [17:35:01] ```banyek@shambala:~/projects/wikimedia/puppet (production) $ cat hieradata/hosts/pc20* [17:35:01] mariadb::parsercache::shard: 'pc1' [17:35:01] mariadb::parsercache::shard: 'pc2' [17:35:01] mariadb::parsercache::shard: 'pc3' [17:35:01] mariadb::parsercache::shard: 'pc1' [17:35:01] profile::base::notifications_enabled: '0' [17:35:01] mariadb::parsercache::shard: 'pc2' [17:35:02] profile::base::notifications_enabled: '0' [17:35:02] mariadb::parsercache::shard: 'pc3' [17:35:03] profile::base::notifications_enabled: '0' [17:35:03] mariadb::parsercache::shard: 'pc1' [17:35:04] profile::base::notifications_enabled: '0'``` [17:35:17] hm... [17:35:29] what did I miss? [17:42:00] don't know, check icinga host for those hosts definition I guess could be an start [17:43:11] Hmmm... Yeay that would be the reason I didn't run puppet on einsteinium [17:53:22] so I would suggest you either fix that or give them more downtime and investigate tomorrow [17:53:37] But make sure they don't page anymore either way ;) [17:58:34] I am going to disconect, getting ill [18:08:41] I downtimed them until 20181120 [18:08:49] and also ran on einsteinium [18:09:06] I am pretty sure they wont page [18:09:09] now I am out too [18:09:18] see you tomorrow [18:09:51] (be better jaim3) [18:25:56] 10DBA, 10Analytics, 10Analytics-Kanban, 10Data-Services: Not able to scoop comment table in labs for mediawiki reconstruction process - https://phabricator.wikimedia.org/T209031 (10Anomie) Note the `actor` view will likely turn out to have similar issues. As suggested in T209031#4736006, one solution woul... [19:27:21] 10DBA, 10MediaWiki-General-or-Unknown, 10MW-1.33-notes (1.33.0-wmf.3; 2018-11-06), 10Patch-For-Review, 10Readers-Web-Backlog (Readers-Web-Kanbanana-Board-2018-19-Q2): [Bug] Update old nonuniformly distributed page_random values - https://phabricator.wikimedia.org/T208909 (10Niedzielski) a:05phuedx>03T... [19:45:47] 10DBA, 10MediaWiki-Commenting, 10Patch-For-Review, 10Schema-change: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089 (10MGChecker) [19:46:01] 10Blocked-on-schema-change, 10DBA, 10Data-Services, 10Dumps-Generation, and 2 others: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569 (10MGChecker) [19:52:01] 10DBA, 10Cloud-VPS, 10MediaWiki-Commenting: Decide whether back-compat views for upcoming major schema changes will be provided in the Labs replicas - https://phabricator.wikimedia.org/T166798 (10MGChecker) [20:07:14] 10DBA, 10Wikimedia-Site-requests, 10Community-consensus-needed, 10Patch-For-Review, 10User-Zoranzoki21: Remove FlaggedRevs and add back rights autopatrolled, patroller (with enabled RCPatrol), rollbacker on srwikinews - https://phabricator.wikimedia.org/T209251 (10Framawiki) I suppose that #mediawiki-ext... 
[21:21:06] 10DBA, 10Analytics, 10Analytics-Kanban, 10Data-Services: Not able to scoop comment table in labs for mediawiki reconstruction process - https://phabricator.wikimedia.org/T209031 (10JAllemandou) Thanks @Anomie. We (analytics team) also had thought of a third potential solution. I list the 3 solutions below... [21:49:39] 10DBA, 10MediaWiki-General-or-Unknown, 10MW-1.33-notes (1.33.0-wmf.3; 2018-11-06), 10Patch-For-Review, 10Readers-Web-Backlog (Readers-Web-Kanbanana-Board-2018-19-Q2): [Bug] Update old nonuniformly distributed page_random values - https://phabricator.wikimedia.org/T208909 (10Tbayer) Repeating query [4] fr... [23:08:05] 10DBA, 10MediaWiki-General-or-Unknown, 10MW-1.33-notes (1.33.0-wmf.3; 2018-11-06), 10Patch-For-Review, 10Readers-Web-Backlog (Readers-Web-Kanbanana-Board-2018-19-Q2): [Bug] Update old nonuniformly distributed page_random values - https://phabricator.wikimedia.org/T208909 (10Niedzielski) /cc @mpopov