[09:09:20] jynus: can I ask for a CR?
[09:10:01] sure
[09:10:12] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/483066/
[09:10:14] thanks!
[09:10:28] (This will close 2 pending tickets ;)
[09:11:22] check it works properly before applying it to the other 2
[09:25:55] deletes are dangerous so double check you only deleted the right things
[09:27:44] What do you think?
[09:27:51] I mean I don't understand
[09:28:02] I'll use this:
[09:28:06] https://www.irccloud.com/pastebin/1bguYLMY/
[09:28:07] just suggesting to be careful
[09:28:29] for example, getting a list of tables before and after and comparing that you only deleted what you really wanted
[09:28:42] or running it once only printing but not deleting
[09:28:52] OK
[09:28:54] you get the idea
[09:30:59] I added an 'echo' before the mysql command to see what will happen
[09:31:07] here's a snippet of the output
[09:31:10] https://www.irccloud.com/pastebin/fhtT8ICL/
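
A minimal sketch of the "list the tables before and after" check suggested above: snapshot the candidate tables from information_schema before running the deletes, re-run the same query afterwards, and diff the two lists. The schema name and naming pattern here are placeholders, since the actual pastebin script is not reproduced in this log.

```sql
-- Hypothetical example: snapshot the tables that match the cleanup pattern.
-- Run once before and once after the deletes, then diff the two result sets.
SELECT table_schema, table_name
FROM information_schema.tables
WHERE table_schema = 'enwiki'            -- placeholder schema
  AND table_name LIKE 'some_prefix_%'    -- placeholder pattern for the tables to drop
ORDER BY table_schema, table_name;
```
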
[09:31:21] were labsdb1009-labsdb1011 dist-upgraded from jessie to stretch? I noticed that there's an old copy of libssl1.0.0 (which is only in jessie) installed and there are no reverse dependencies so I'd remove that one?
[09:31:25] cool
[09:31:52] moritzm: no idea but I can check
[09:32:29] I just checked the install log, it was installed in 2016, so that seems to be the case
[09:32:39] I'll clean out the old libssl1.0.0, then
[09:32:55] doing the cleanup now
[09:32:59] moritzm: probably at the time we didn't have enough redundancy
[09:33:05] to do a clean reinstall
[09:33:23] as those servers were very unique
[09:33:47] probably, that's not an issue, it's just triggering a corner case in debdeploy as it fails to upgrade libssl1.0.0 which is not in stretch
[09:33:56] I'm fixing that up now
[09:34:08] you may want to check other hosts
[09:34:17] I am sure they were not the only ones
[09:34:27] even if most were reimaged
[09:38:57] ack, will check others as well
[09:50:50] this also affected db2062 and dbstore2002, I've fixed those up as well (plus sodium as a non-DB host)
[09:53:05] 10DBA, 10Analytics, 10Analytics-Kanban, 10Data-Services, 10Core Platform Team Backlog (Watching / External): Not able to scoop comment table in labs for mediawiki reconstruction process - https://phabricator.wikimedia.org/T209031 (10Banyek)
[09:53:07] 10DBA, 10Analytics, 10Analytics-Kanban, 10Data-Services, and 3 others: Create materialized views on Wiki Replica hosts for better query performance - https://phabricator.wikimedia.org/T210693 (10Banyek) 05Open→03Resolved a:03Banyek I cleaned up the tables, so I close the ticket
[09:56:47] I ran aptitude to detect other potentially obsolete packages (as the UI displays packages not in the package repository), but aptitude seems to auto-remove packages it deems obsolete via auto removals
[09:57:06] which is really dangerous and for which I'll file a separate task
[09:58:13] that's why it's flagging a broken dpkg state in Icinga, I'll clean that up
[11:21:22] I am going to stop db1082 and db2052 in sync to migrate replication back to eqiad
[11:29:12] 10DBA, 10Operations, 10StructuredDiscussions, 10Growth-Team (Current Sprint), 10WorkType-Maintenance: Setup separate logical External Store for Flow in production - https://phabricator.wikimedia.org/T107610 (10Banyek)
[11:29:33] ack
[11:29:45] I'll leave soon for food and errands
[11:29:50] then I'll finish the doc
[11:30:26] we should have a handover chat tomorrow I think - if we all three would like to be there.
[11:30:37] I'll finish my document before that
[12:57:59] 10DBA, 10Wikidata, 10wikidata-tech-focus: Set wb_changes_dispatch ROW_FORMAT=COMPRESSED on install and update - https://phabricator.wikimedia.org/T207006 (10Addshore) 05Open→03Stalled
[12:58:05] 10DBA, 10MediaWiki-General-or-Unknown, 10Operations, 10Wikidata, and 6 others: Investigate decrease in wikidata dispatch times due to eqiad -> codfw DC switch - https://phabricator.wikimedia.org/T205865 (10Addshore)
[14:19:28] s3 backups are in a weird state
[14:19:37] 564 | dump.s3.2019-01-08--17-00-01 | ongoing | db1095.eqiad.wmnet:3313 | dbstore1001.eqiad.wmnet | dump | s3 | 2019-01-08 17:00:01 | NULL | NULL |
[14:19:41] | 567 | dump.s3.2019-01-08--17-00-01 | ongoing | dbstore2002.codfw.wmnet:3313 | es2001.codfw.wmnet | dump | s3 | 2019-01-08 17:00:01 | NULL | NULL |
[14:19:45] +-----+------------------------------+----------+------------------------------+-------------------------+------+---------+---------------------+---------------------+--------------+
[14:19:49] That is from yesterday
[14:19:49] But the backup is actually taken
[14:19:52] Still ongoing
[14:20:05] as in, the metadata file is correct, with a start and a finish
[14:23:03] then it failed to be packed
[14:23:21] There is no log for that process, no?
[14:23:31] there should be an error
[14:23:53] but if it is a permission error it cannot log it, for example
[14:23:53] But it still says ongoing, so maybe it didn't catch it yet?
[14:24:12] how sure are you that it finished already?
[14:24:17] the packing takes some time
[14:24:20] I just checked the metadata
[14:24:31] I mean mydumper finished
[14:24:34] no, but mydumper is only the first part of the process
[14:24:38] it does more things
[14:24:41] But normally it takes 2-3 hours
[14:24:50] it only says successful if everything is ok
[14:24:50] And that has almost taken 24h since it started (as per the paste above)
[14:26:38] I will have a look
[14:27:03] it is very difficult to differentiate between a failed and a frozen process
[14:27:08] so I don't try
[14:27:09] Thanks! Please let me know later how you debugged it, as I have never seen this issue before
[14:27:27] the check just checks that it is in a successful state
[14:27:31] yeah
[14:27:48] But I feel that 24h is too much, especially because I compared it with the other weeks and it was around 2-3h
[14:29:27] I see now what you mean
[14:29:51] it is tar'ed, as moving it to latest only happens if it is successful
[14:30:22] maybe the tendril connection failed and the very last action, marking it as successful, failed
[14:30:23] yeah
[14:30:42] or something related to the metadata database
[14:31:09] right
[14:31:29] the problem is there are no logs about creating logs :-)
[14:31:32] to me the backup looked ok, as in, right metadata, right directory, log was clean…but as it was still "ongoing" I wasn't sure anymore what was going on
[14:32:26] yeah, I just didn't understand that
[14:32:42] I thought mydumper finished but it was still ongoing
[14:32:54] hehe no no, sorry I wasn't clear enough
[14:33:17] I will see what got inserted on the db
[14:33:21] it is morning here still :p
[14:33:26] and check if tendril rebooted or something
[14:33:52] but it is weird both s3 failed
[14:34:11] yeah, only those
[14:39:26] they are not the only ones ongoing
[14:41:57] from yesterday they are
[14:42:00] ah, also m3
[14:42:23] on both dcs too
[14:42:25] interesting
[14:42:45] so m3 and s3 from yesterday are still on-going on both dcs
[14:51:04] db1095 wasn't some alert ongoing?
[14:51:18] *alter
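
A minimal sketch of the check being done by hand above: find dump jobs still marked as ongoing long past the usual 2-3 hour runtime. The table and column names are assumptions loosely modelled on the pasted rows, not the real backup-metadata schema.

```sql
-- Assumed schema: a backups table with name/status/source/host/type/section/start_date.
SELECT id, name, status, source, host, section, start_date
FROM backups
WHERE type = 'dump'
  AND status = 'ongoing'
  AND start_date < NOW() - INTERVAL 8 HOUR   -- well beyond the usual 2-3h
ORDER BY start_date;
```
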
[14:52:04] db1095:3313?
[14:52:22] no, even if it was, the dump would have failed, not the other stuff
[14:52:38] yeah, I don't recall any alter on db1095:3313 (and of course not m3)
[14:53:01] I actually noticed this because I had to drop like 900 tables from s3
[14:53:10] And I was waiting yesterday for db1095:3313 to finish
[14:53:20] understandable
[14:53:28] let me try to debug
[14:53:32] And then when I left for the day, I was like: maybe it is still cleaning up things
[14:53:38] but I checked today…and still there :(
[14:53:48] it takes some time after mydumper
[14:53:52] for the tar
[14:53:58] but indeed not as much
[14:54:16] plus it only moves it to latest if something fails
[14:54:43] *nothing fails
[15:14:29] marostegui: so the metadata gathering failed - which I allow to happen so that backups can happen even if the metadata database is unavailable
[15:14:42] it logs the error to the output
[15:14:42] zarcillo you mean?
[15:14:54] but we discard it to prevent cronspam
[15:15:08] what exactly do you mean by "metadata gathering failed"?
[15:15:10] as in the end we are only interested in whether it fully worked or not
[15:15:26] it inserted the "starting backup"
[15:15:38] but didn't insert the size nor the success
[15:15:51] yeah
[15:16:04] if you are asking why, it is in the logs we discard
[15:16:05] so it did all the mydumper, packaging, moving and failed on that last step of inserting the data?
[15:16:28] it failed 2 steps, both related to metadata gathering
[15:16:45] but I allow for those to "fail"
[15:16:53] without interrupting the process
[15:16:58] yeah, that makes sense
[15:17:09] if it is moved to latest, the process is correct (it does checking)
[15:17:26] however the alert would go off in a few hours as missing backup - human check needed
[15:17:51] yeah
[15:18:08] How did you debug all that?
[15:18:16] I did not debug
[15:18:42] I just checked what metadata was available and the code flow
[15:19:06] the metadata gathering sends error codes e.g. "unable to connect" "failed to read backup entry"
[15:19:14] but we discard those
[15:19:38] so what I would do is a custom backup, but without disabling the output
[15:19:46] and see if it fails, and where
[15:20:19] I wonder why both DCs
[15:20:28] isn't that too much of a coincidence?
[15:20:32] at first I thought it was the packing
[15:20:42] s3 is one of the few hosts that get tarred
[15:20:56] ah true
[15:20:57] but x1 gets tarred too, and the tarring worked well last week
[15:21:00] and m3?
[15:21:40] let me double check the config to see nothing changed
[15:22:38] could be some kind of db overload?
[15:22:51] From my side, no config was changed for backups or backup sources since last week's backups
[15:22:52] they are the first 4 backups that happened, at 17:00
[15:23:13] and we know tendril is not the most stable service ever
[15:23:36] so some kind of race condition on the first trial?
[15:23:57] the tendril backup happens at the same time
[15:24:11] which also is maybe not the most clever idea :-)
[15:24:20] Oh I didn't notice they were all at 17:00 sharp
[15:24:36] to the second even
[15:24:40] yes
[15:25:02] but the first insert "backup started" worked well
[15:27:03] dumps are still ongoing on codfw
[15:27:12] which is normal because of way fewer resources
[15:27:13] for s3??
[15:27:24] no, s5 and s7
[15:27:31] ah right
[15:27:32] yeah
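
The "missing backup" alert mentioned above only looks at whether a recent backup reached the successful state. A sketch of that condition as a query, under the same assumed schema as before (the 'finished' status value is also an assumption):

```sql
-- Age of the newest successful dump per section; sections whose last good dump
-- is too old are the ones the alert would flag for a human check.
SELECT section, MAX(start_date) AS last_good_dump
FROM backups
WHERE type = 'dump'
  AND status = 'finished'   -- assumed name of the "successful" state
GROUP BY section
ORDER BY last_good_dump;
```
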
[15:28:50] I will try to reproduce it, but my suspicion is some kind of database race condition
[15:29:03] yeah, it looks like too much of a coincidence
[15:29:04] with locking on metadata insert, not on the source
[15:29:07] like to the second
[15:29:16] *not metadata locking, I mean backup metadata
[15:29:25] yeah
[15:30:07] or some stupid sql error
[15:30:46] I will set up logging to file as part of the ongoing improvements
[15:31:00] the logging exists, it is just that we discard it
[15:32:22] that would help yeah!
[15:45:36] 10DBA, 10SDC Engineering, 10SDC General, 10Wikidata, and 2 others: Create a production test wiki in group0 to parallel Wikimedia Commons - https://phabricator.wikimedia.org/T197616 (10Marostegui) 05Open→03Resolved I just sync'ed up with @Jdforrester-WMF - we are re-closing this - we will have a proper...
[15:47:04] jynus: Can I drop 900 tables from db1095 or do you prefer me to wait until we are fine with s3 backups?
[15:47:24] the backup worked
[15:47:34] please check it is in some of the tars
[15:47:42] and proceed
[15:47:47] my new backup will take some time
[15:48:04] ok!
[15:48:49] I'm going to the kindergarten again, I'll check back later
[15:53:04] 10DBA, 10User-Banyek, 10User-Zoranzoki21: Drop FlaggedRevs tables in database for ptwikipedia - https://phabricator.wikimedia.org/T211544 (10Bstorm)
[15:58:11] dumps are now ongoing on eqiad, I will see if we get an error or they are successful
[16:21:40] Sorry I was in a meeting
[16:21:42] Great!
[16:22:53] I may leave soon
[16:23:02] they are running in a screen on dbstore1001
[16:23:18] great!
[16:23:20] have a good day :)
[16:23:24] I will check later
[16:23:39] either they work or they will so an error
[16:23:43] *show
[16:24:52] 10DBA, 10Data-Services, 10Operations, 10Patch-For-Review: db1082 power loss resulted on mysql crash - https://phabricator.wikimedia.org/T213108 (10jcrespo) 05Open→03Resolved db1082 is fully repooled, it and db1124 had gtid reeenabled.
[16:24:54] we will see what we get
[16:26:22] Ah db1124 is already back under db1082
[16:26:25] Nice work!
[16:28:13] I may connect a bit later to check it
[16:28:34] I will check it :)
[16:28:59] if you get an error, copy it to a ticket and I will attend to it tomorrow
[16:29:08] will do!
[16:29:10] thanks :)
[16:29:11] if not, it will have fixed itself
[16:29:20] we may get an alert on codfw though
[16:29:42] an alert on db1115 you mean, no?
[16:29:43] not a page, just a regular alert
[16:29:45] yes
[16:29:47] yep
[16:30:22] If that happens, I will leave it there until we see if running the backup again on eqiad works, and if it does, I will ack it and re-run it on codfw
[16:43:30] bd808:
[16:43:43] Are there problems with the enwiki databases?
[16:43:51] select count(*) from revision join page on rev_page = page_id where page_title = 'Castle_Conway' and page_namespace = 0
[16:43:59] SQL queries like this ^ do not work
[16:44:09] what's going on
[16:45:36] Amir1: anyone here?
[16:45:38] "do not work" is too generic
[16:45:48] nothing happens
[16:45:57] you need to state what you are trying to do and what fails (error code, symptom)
[16:46:38] it's been running for an hour ... and running and running, no result, no return
[16:46:55] ok, so you are getting no result, that is more useful
[16:47:03] where are you running the query?
[16:47:48] from dwl@taxonbot
[16:48:16] are you executing code, which service are you connecting to?
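
For the hanging enwiki query reported above, the diagnosis that follows in the log is a long-running SELECT piling up behind an ALTER on the table metadata lock. A sketch of how that can be spotted from the processlist (the 3600-second threshold is arbitrary):

```sql
-- Sessions waiting on the table metadata lock, plus long-running statements
-- that could be holding it; the blocker can then be killed with KILL <id>.
SELECT id, user, time, state, LEFT(info, 100) AS query
FROM information_schema.processlist
WHERE state LIKE 'Waiting for table metadata lock%'
   OR (command = 'Query' AND time > 3600)
ORDER BY time DESC;
```
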
[16:48:19] enwiki only
[16:48:34] all the other wikis are running
[16:48:36] are you talking about the wikireplicas?
[16:50:39] I can see there is an alter running on enwiki, I would suggest killing your query and trying again
[16:50:51] it may be blocked by the alter
[16:51:22] yes, replica
[16:52:20] blocked by the alter <- what does this mean?
[16:52:54] so sometimes, aletrs come from production
[16:53:10] if there is a long running select at that moment
[16:53:14] *alters
[16:53:28] the alter and all subsequent queries get blocked
[16:53:29] I tried one more, no success
[16:54:00] yeah, try the other service, if you are using analytics, web, or the other way around
[16:54:09] there are several servers
[16:54:10] bstorm_> doctaxon: looks like someone dropped a column in the page table. I'll
[16:54:13] precisely for that
[16:54:13] check if they are rebuilding the views yet or if that is something I can
[16:54:16] jump on.
[16:55:05] from #wikimedia-cloud
[16:55:37] I think I have fixed it
[16:55:47] there was a long running query that, because of the alter,
[16:55:50] was blocking all others
[16:55:53] try again?
[16:56:10] I see no metadata locking now
[16:56:30] that just happens and cannot be avoided, but time would have fixed it
[16:56:52] bstorm_ jynus: is it running now
[16:56:56] now it's okay
[16:58:51] doctaxon: there is an alter table going on for the page table
[16:58:59] ah I see jynus already replied
[17:00:24] ya, thank you, it is working fine now
[17:00:57] doctaxon: in the future, be more specific so we can help you faster :-)
[17:01:08] "X doesn't work" is not very understandable
[17:01:33] see some advice at https://www.mediawiki.org/wiki/How_to_report_a_bug#Reporting_a_new_bug_or_feature_request
[17:01:48] but I didn't know the failing process
[17:02:15] no, you don't need to know about it, just be more verbose about what you were trying
[17:02:37] but I have given you the database query
[17:03:05] yeah, but I needed more context - wikireplica service, which endpoint, etc.
[17:03:19] we handle a lot of databases :-D
[17:03:29] like 200 servers or so
[17:03:43] yes I know, but my knowledge about it is not the best
[17:03:53] I am thanking you for the report, just trying to help for the next time
[17:04:08] so we can have a look faster and fix it faster :-D
[17:04:52] sorry for the problems caused
[17:05:04] I know your how-to
[17:05:31] but I couldn't do it better
[17:05:49] it is ok, I am telling you for next time, no problem now :-)
[17:05:56] but thanks for any help, we got it and that's the point
[17:06:42] e.g. "I ran a query to enwiki.analytics.wikireplicas.svc.wmflabs.org and it seems to get blocked"
[17:07:01] (or whatever happens)
[17:07:14] "I cannot connect"
[17:07:20] "I get error XXXX"
[17:09:35] okay, but there was no return, it ran without end
[17:10:14] I don't know where my queries are running
[17:10:38] only that it is a replica
[17:12:02] If you used `sql` to connect, it defaults to the analytics replicas
[17:12:53] okay
[17:15:55] jynus: I still get "View 'enwiki_p.page' references invalid table(s) or column(s) or function(s) or definer/invoker of view lack rights to use them" on the replicas with any query that touches 'page'
[17:16:06] e.g. `SELECT page_title FROM enwiki_p.page LIMIT 1;`
[17:17:11] this is on enwiki.web.db.svc.eqiad.wmflabs
[17:17:25] musikanimal: yeah, I am going to run the script now
[17:17:31] as it was still running on some labs hosts
[17:17:32] that would be the alter breaking the views, hopefully manuel and brook can help you with that, even if only a quick fix for enwiki
[17:17:36] musikanimal: should be fixed in a sec
[17:17:50] I leave you in good company then :-)
[17:17:54] sweet, thanks!
[17:18:07] musikanimal: sadly, one thing that happens when accessing the internals
[17:18:15] is that we have to play catch up
[17:18:25] because there is no guarantee of compatibility
[17:18:27] musikanimal: should be fixed on 1009 and 1010
[17:18:32] 1011 is still running the alter
[17:18:36] once it is done, I will fix that one too
[17:18:43] marostegui: make sure there is no metadata locking
[17:19:01] and kill blocking queries as I did for the other issue
[17:19:07] works! many thanks
[17:19:29] jynus: nope, nothing
[18:09:05] 10DBA, 10Epic, 10Tracking: Database tables to be dropped on Wikimedia wikis and other WMF databases (tracking) - https://phabricator.wikimedia.org/T54921 (10Marostegui)
[18:09:08] 10DBA, 10Patch-For-Review: Drop valid_tag table - https://phabricator.wikimedia.org/T212254 (10Marostegui) 05Open→03Resolved This is all done! Almost 2000 files less in the backups! (schema and data files)
[18:09:23] 10DBA, 10Patch-For-Review: Drop valid_tag table - https://phabricator.wikimedia.org/T212254 (10Marostegui)
[18:17:32] 10DBA, 10Patch-For-Review: Drop tag_summary table - https://phabricator.wikimedia.org/T212255 (10Marostegui)
[18:22:48] 10DBA, 10Patch-For-Review: Drop tag_summary table - https://phabricator.wikimedia.org/T212255 (10Marostegui) I have renamed tag_summary on enwiki on db1089: ` root@db1089.eqiad.wmnet[enwiki]> set session sql_log_bin=0; rename table tag_summary to T212255_tag_summary; Query OK, 0 rows affected (0.00 sec) Query...
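
The rename-before-drop step quoted in the T212255 update above boils down to the following, run with binary logging disabled for the session so the rename does not replicate; the actual DROP only happens later, once nothing has complained about the missing table.

```sql
SET SESSION sql_log_bin = 0;
RENAME TABLE tag_summary TO T212255_tag_summary;
-- Later, after a waiting period with no errors reported:
-- DROP TABLE T212255_tag_summary;
```
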
[18:40:03] 10DBA, 10Analytics, 10Analytics-Kanban, 10User-Elukey: Review dbstore1002's non-wiki databases and decide which ones needs to be migrated to the new multi instance setup - https://phabricator.wikimedia.org/T212487 (10Marostegui)
[18:41:33] 10DBA, 10Analytics, 10Analytics-Kanban, 10User-Elukey: Review dbstore1002's non-wiki databases and decide which ones needs to be migrated to the new multi instance setup - https://phabricator.wikimedia.org/T212487 (10Marostegui) @elukey I have updated the original task, to add the last statuses of the curr...
[18:55:23] 10DBA, 10Analytics, 10Analytics-Kanban, 10User-Elukey: Review dbstore1002's non-wiki databases and decide which ones needs to be migrated to the new multi instance setup - https://phabricator.wikimedia.org/T212487 (10Marostegui)
[18:55:58] 10DBA, 10Analytics, 10Analytics-Kanban, 10User-Elukey: Review dbstore1002's non-wiki databases and decide which ones needs to be migrated to the new multi instance setup - https://phabricator.wikimedia.org/T212487 (10Marostegui) @elukey For those databases that we have decided, so far, to backup and archiv...
[18:57:27] 10DBA, 10Analytics, 10Analytics-Kanban, 10User-Elukey: Review dbstore1002's non-wiki databases and decide which ones needs to be migrated to the new multi instance setup - https://phabricator.wikimedia.org/T212487 (10Marostegui) I asked @chasemp about `fab_migration` and I think we need to have a final wor...
[18:57:54] 10DBA, 10Analytics, 10Analytics-Kanban, 10User-Elukey: Review dbstore1002's non-wiki databases and decide which ones needs to be migrated to the new multi instance setup - https://phabricator.wikimedia.org/T212487 (10Marostegui)
[18:59:01] 10DBA, 10Analytics, 10Analytics-Kanban, 10User-Elukey: Review dbstore1002's non-wiki databases and decide which ones needs to be migrated to the new multi instance setup - https://phabricator.wikimedia.org/T212487 (10Marostegui)
[19:17:47] 10DBA, 10Beta-Cluster-Infrastructure, 10Continuous-Integration-Infrastructure, 10MediaWiki-Database, and 2 others: Enable MariaDB/MySQL's Strict Mode - https://phabricator.wikimedia.org/T108255 (10Marostegui)
[21:13:46] 10Blocked-on-schema-change, 10MediaWiki-Change-tagging, 10User-Ladsgroup: Drop change_tag.ct_tag column in production - https://phabricator.wikimedia.org/T210713 (10Marostegui)
[21:16:22] 10Blocked-on-schema-change, 10MediaWiki-Change-tagging, 10User-Ladsgroup: Drop change_tag.ct_tag column in production - https://phabricator.wikimedia.org/T210713 (10Marostegui)
[21:20:08] 10Blocked-on-schema-change, 10MediaWiki-Change-tagging, 10User-Ladsgroup: Drop change_tag.ct_tag column in production - https://phabricator.wikimedia.org/T210713 (10Marostegui) As always, I will do the first schema change carefully, I will start with s6 hosts (only in codfw), to make sure we have no issues w...
[21:20:32] 10Blocked-on-schema-change, 10MediaWiki-Change-tagging, 10User-Ladsgroup: Drop change_tag.ct_tag column in production - https://phabricator.wikimedia.org/T210713 (10Marostegui)
[21:49:38] 10DBA, 10Recommendation-API, 10Research, 10Core Platform Team Backlog (Watching / External), and 2 others: Recommendation API exceeds max_user_connections in MySQL - https://phabricator.wikimedia.org/T212154 (10bmansurov)
[22:04:41] 10DBA, 10Patch-For-Review: Drop tag_summary table - https://phabricator.wikimedia.org/T212255 (10Marostegui) I have seen no errors so far. So tomorrow I will rename the table on a wikidata server (as in wikidata is where the table is bigger in size) and leave it renamed on those two servers (1 enwiki, 1 wikida...
[22:14:39] 10DBA, 10Patch-For-Review: Drop tag_summary table - https://phabricator.wikimedia.org/T212255 (10Marostegui) @Ladsgroup I just realised that a wiki that was created the 4th of January (T197616#4859783 s4 - testcommonswiki) has the `tag_summary` table even if your change was merged the 18th of Dec (https://gerr...
[22:20:39] 10DBA, 10Patch-For-Review: Drop tag_summary table - https://phabricator.wikimedia.org/T212255 (10Marostegui) The interesting part is that it doesn't have `valid_tag` table but it does have `tag_summary` and the patch https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/479266/ contains the drop for both of them...
[22:25:05] 10DBA, 10MediaWiki-extensions-WikibaseMediaInfo, 10SDC Engineering, 10StructuredDataOnCommons, 10Wikidata: MediaInfo extension should not use the wb_terms table - https://phabricator.wikimedia.org/T208330 (10Addshore) So, it doesn't look like WBMI actually uses the data in the table. If that is the case...
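
A sketch of the check behind the T212255 comments above about testcommonswiki: per those comments the newly created wiki still has `tag_summary` but not `valid_tag`, which a quick information_schema lookup against that host would confirm.

```sql
-- Which of the tables dropped from tables.sql still exist on the new wiki?
SELECT table_name
FROM information_schema.tables
WHERE table_schema = 'testcommonswiki'
  AND table_name IN ('tag_summary', 'valid_tag');
```
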