[06:39:53] 10DBA, 06Labs, 10Labs-Infrastructure, 06Operations, 13Patch-For-Review: labsdb1006/1007 (postgresql) maintenance - https://phabricator.wikimedia.org/T157359#3086795 (10Marostegui) It has actually done some recovering as the file it is scanning now has changed since last night: ``` postgres 7189 0.0 0....
[06:59:45] 10DBA, 13Patch-For-Review: Rampant differences in indexes and PK on s6 (frwiki, jawiki, ruwiki) for revision table - https://phabricator.wikimedia.org/T159414#3086814 (10Marostegui) db1023: ``` root@neodymium:~# for i in frwiki jawiki ruwiki; do echo $i;mysql --skip-ssl -hdb1023 $i -e "show create table revisi...
[07:44:29] once all the issues have passed, are you planning to repool db1051 soon or do you still need to work with it?
[07:50:04] no, I was going to repool it, but it was still lagging yesterday
[07:50:14] I will do it now, with low weight
[07:50:19] ok :)
[08:34:58] 10DBA, 06Labs, 10Labs-Infrastructure, 06Operations, 13Patch-For-Review: labsdb1006/1007 (postgresql) maintenance - https://phabricator.wikimedia.org/T157359#3086995 (10Marostegui) Not sure from which time this is: ``` FATAL: the database system is starting up FATAL: terminating walreceiver process due...
[08:45:19] 10DBA, 06Labs, 10Labs-Infrastructure, 06Operations, 13Patch-For-Review: labsdb1006/1007 (postgresql) maintenance - https://phabricator.wikimedia.org/T157359#3086996 (10jcrespo) That is me killing the replication, which will not work anyway. @akosiaris can you point us to the osm load process, do you have...
[10:20:20] 10DBA, 13Patch-For-Review: Rampant differences in indexes and PK on s6 (frwiki, jawiki, ruwiki) for revision table - https://phabricator.wikimedia.org/T159414#3087191 (10Marostegui) db1093 ``` root@neodymium:~# for i in frwiki jawiki ruwiki; do echo $i;mysql --skip-ssl -hdb1093 $i -e "show create table revisio...
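The schema comparisons quoted (and truncated) in the Phabricator updates above boil down to a loop like the following. This is a hedged reconstruction, not the exact command: the host and wiki names are the ones visible in the log, while the loop variable name, the `\G` formatting and the output redirect are illustrative assumptions.

```
# Dump the revision table definition (indexes and PK) for each s6 wiki on one
# host, so the output can be diffed against other hosts such as db1093.
for wiki in frwiki jawiki ruwiki; do
  echo "== $wiki =="
  mysql --skip-ssl -hdb1023 "$wiki" -e "SHOW CREATE TABLE revision\G"
done > db1023-s6-revision.txt   # illustrative output file, not from the log
```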
[10:23:24] jynus: marostegui I am finally around to help with labsdb1007
[10:23:33] akosiaris: \o/
[10:23:39] I'll login now and try and see what it is I can do
[10:23:55] waking up to attacks is not nice
[10:24:22] akosiaris: check the latest news starting here: https://phabricator.wikimedia.org/T157359#3085505
[10:24:29] yeah, what a great week to come back eh alex XD
[10:26:13] akosiaris, there is not much to see- restarting 9.1 again is taking ages because it is in recovery mode
[10:26:47] and the upgrade script needs it up to do that
[10:29:53] so I'd say let's dump from the master or, probably better as you said, do a fresh refresh of osm and dump only the other dbs
[10:36:10] may I ask what OSM is? I have seen it written a couple of times but I am not sure what it refers to
[10:36:21] open street maps I assume?
[10:36:35] yes
[10:36:39] ok :)
[10:36:53] and there is a whole debate with OpenStackManager (also known as wikitech :P)
[10:37:03] OpenStreetMap, without the S! :-)
[10:37:10] OpenStackManager is an extension to mediawiki btw
[10:37:18] run on only one wiki in the entire world
[10:37:19] wikitech
[10:37:26] but we are finally killing it
[10:37:30] I thought it was replaced already
[10:37:31] ah
[10:37:41] it's in the process last I heard
[10:37:52] but don't take me as an authoritative source on that
[10:38:21] so I am stopping postgres
[10:38:25] also, what is being imported from osm- last I heard, an osm dump was 2 TB
[10:38:44] and that has barely 1 TB
[10:38:51] /home/jynus/pgsql-9.1/bin/postgres ?
[10:38:57] arg ?
[10:39:02] do I want to know ?
[10:39:02] that is running
[10:39:08] yes
[10:39:12] that is expected
[10:39:16] ok lemme read the task
[10:39:17] it is the old cluster
[10:39:34] running on a separate port to do the conversion
[10:39:58] the upgrade requires both servers
[10:40:20] and obviously 9.1 is not in jessie
[10:41:13] pg_upgrade requires a running postgres ?
[10:41:22] isn't it supposed to do it in place ?
[10:41:31] in fact, pg_upgrade starts postgres
[10:41:42] lol
[10:41:45] niow???
[10:41:49] we run it manually because we didn't know why it failed
[10:41:50] *now
[10:42:05] it turns out it starts but gets stuck in recovery mode
[10:42:29] I mean, we can kill it now and wipe the old data
[10:42:35] despite the recovery.conf file getting deleted ?
[10:42:38] but that is the "documented" procedure
[10:43:19] ok.. I'd like to try something just to obtain some knowledge about this. We are anyway in the clear, no downtime or anything
[10:43:33] do we have a clean copy of the datadir anywhere ?
[10:43:37] yes
[10:43:45] but I warn you
[10:43:52] we had to do so many hacks
[10:43:59] because of differences between 9.1
[10:44:04] and 9.4 packages
[10:44:12] and ubuntu and debian differences
[10:44:29] omg ...
[10:44:40] maybe it's just easier to just resync from osm
[10:44:47] that is what I proposed
[10:44:52] probably yes
[10:44:53] if it was replication
[10:45:03] or data corruption
[10:45:08] don't you love mysql now ? :P
[10:45:12] or the many times the upgrade procedure killed postgres
[10:45:15] I do not want to know
[10:45:41] because the upgrade had the nice idea of auto-starting postgres
[10:45:49] but if it didn't start- kill it
[10:46:38] to be fair, it would have been easier if we had 2x the disk, and we could have done tests in advance in a local copy
[10:47:16] So I can kill the 9.1 instance, wipe everything clean
[10:47:44] 6. Install custom shared object files: Install any custom shared object files (or DLLs) used by the old cluster into the new
[10:47:44] cluster, e.g. pgcrypto.so, whether they are from contrib or some other source. Do not install the schema definitions, e.g.
[10:47:44] pgcrypto.sql, because these will be upgraded from the old cluster. Also, any custom full text search files (dictionary,
[10:47:44] synonym, thesaurus, stop words) must also be copied to the new cluster.
[10:47:47] ok I am giving up
[10:47:54] pg_upgrade is crap
[10:47:55] in theory
[10:48:03] that is done with the local bin copy I did
[10:48:04] all I had to do was read the man page
[10:48:08] :-)
[10:48:10] yeah it's a mess
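For context, the "documented" procedure discussed here (the man-page steps quoted at [10:47:44]) comes down to an invocation roughly like the sketch below. The old 9.1 binary path is the one visible in the log; the 9.4 paths, the data directories and the use of --check first are illustrative assumptions, not the exact command run on labsdb1007.

```
# Minimal pg_upgrade sketch (assumed paths, run as the postgres user).
# pg_upgrade starts both the old and the new server itself, which is why a
# 9.1 cluster stuck in crash recovery blocks the whole procedure.
sudo -u postgres /usr/lib/postgresql/9.4/bin/pg_upgrade \
  --old-bindir=/home/jynus/pgsql-9.1/bin \
  --new-bindir=/usr/lib/postgresql/9.4/bin \
  --old-datadir=/srv/postgresql/9.1/main \
  --new-datadir=/srv/postgresql/9.4/main \
  --check        # consistency check only; drop --check for the real upgrade
```

Step 6 of the man page quoted above (copying custom shared objects, which here would include the PostGIS libraries) has to be done before this runs, which is where the gis extensions made the outcome doubtful.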
[10:48:27] but because of the gis extensions
[10:48:34] it was going to be doubtful
[10:48:57] those are probably changed from 9.1 to 9.4
[10:49:06] but I was stubborn enough to try!
[10:49:14] it was good that we tried
[10:49:18] heh.. I would too
[10:49:22] I wanted to, like 5 mins ago
[10:49:30] xddd
[10:49:36] but not.. this is a mess.. this is not a good procedure
[10:49:45] and it's the actual documented one
[10:50:02] uber's article on the mysql migration makes even more sense right now
[10:50:26] I complain about the mysql upgrade procedure- but the only time we had an issue- it was data corruption
[10:50:27] so I'll promote labsdb1007 to a master in puppet
[10:50:36] and let it start the sync
[10:50:38] of very specific compatibility issues
[10:50:40] it's handled by puppet IIRC
[10:50:56] oh, good
[10:51:00] that is why I asked
[10:51:07] I didn't want to work too much
[10:51:13] let me kill all remnants of 9.1
[10:54:51] ok starting postgres first
[10:54:56] wait
[10:55:07] let me delete the 9.1 files
[10:55:12] ok with that, marostegui ?
[10:55:12] ok
[10:55:21] we still have the copy remotely
[10:55:26] go ahead
[10:56:40] man, these disks are slow
[10:56:50] is it a raid5? 6?
[10:57:21] 5 :(
[10:58:10] I am ready- if you want to run initdb again or let puppet do it, just say it
[10:58:10] raid5 SSDs
[10:58:22] akosiaris, doesn't look like it :-/
[10:58:50] they are definitely SSDs because OSM was dying with rotating disks
[10:59:02] I don't remember if they were high perf ones
[10:59:07] probably not
[10:59:08] anyway
[10:59:34] there is no 9.1 anymore, and the disk should have enough space
[10:59:51] 1.5T Avail
[11:00:29] I've started a curl to fetch the latest pbf file
[11:02:45] gosh, I have to google every single term of postgres xddd
[11:02:47] embarrassing
[11:03:00] that is osm, actually
[11:03:24] well, google, but the format used by osm
[11:04:18] I see, I am educating myself a bit
[11:04:50] let's leave it at that: alex and I have a history... with osm
[11:05:03] Is it a love one?
[11:07:02] for me it is http://hdyc.neis-one.org/?jynus but ask alex
[11:08:05] Mapper since: April 19th, 2007!!!!!!!
[11:10:59] This is my first contribution to wikipedia (it was a collaboration): https://es.wikipedia.org/w/index.php?title=Friki&oldid=22335
[11:12:22] 2003 :o
[11:19:45] ah, I see, we use the one without history, making it smaller
[11:24:12] the pbf is downloading at 3-4 MB/s, needs another 28 mins
[12:01:32] it finished already
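The pbf fetch mentioned at [11:00:29] is, in essence, just pulling the latest planet extract (the one without history, per [11:19:45]). A minimal sketch follows; the URL is the standard OSM planet mirror and the flags are assumptions, since the log does not quote the exact command.

```
# Hedged sketch of fetching the latest planet extract as a .pbf file.
curl -L -O https://planet.openstreetmap.org/pbf/planet-latest.osm.pbf
```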
[12:09:25] 10DBA, 10Expiring-Watchlist-Items, 10MediaWiki-Watchlist, 06TCB-Team, and 2 others: Allow setting the watchlist table to read-only on a per-wiki basis - https://phabricator.wikimedia.org/T160062#3087424 (10Lea_WMDE)
[12:10:02] 10DBA, 10Expiring-Watchlist-Items, 10MediaWiki-Watchlist, 06TCB-Team, and 3 others: Allow setting the watchlist table to read-only on a per-wiki basis - https://phabricator.wikimedia.org/T160062#3087441 (10Lea_WMDE)
[12:46:06] 10DBA, 13Patch-For-Review: Rampant differences in indexes and PK on s6 (frwiki, jawiki, ruwiki) for revision table - https://phabricator.wikimedia.org/T159414#3087506 (10Marostegui) db1088: ``` root@neodymium:~# for i in frwiki jawiki ruwiki; do echo $i;mysql --skip-ssl -hdb1088 $i -e "show create table revisi...
[13:00:23] jynus: are you saying you can't view the database quotes in phab?
[13:03:22] oh because they aren't linked from the task
[13:18:22] the corrected/updated ones
[13:21:14] yeah noticed
[13:21:15] so
[13:21:29] agreed, 800 GB SSDs would be best, even with longer lead time
[13:21:36] we're not constrained on lead time in codfw I think, are we?
[13:21:47] wait, shouldn't there be a separate purchase for s8 in eqiad?
[13:22:13] robh and I noticed that on the spreadsheet
[13:22:19] it is separate on phab
[13:22:20] ah yes there is
[13:22:31] let me find it
[13:23:13] mark, the only concern is if it would go beyond the EoFY
[13:23:21] ok
[13:23:22] yes
[13:23:31] 60 days isn't a problem
[13:23:32] and of course, almost surely not ready for the switchover
[13:23:34] 100 days is tight
[13:23:38] but that's ok, right?
[13:23:42] yes
[13:23:56] it is from our side, is it from the finance point of view?
[13:24:05] as I said, only non-technical opinions were missing
[13:24:13] the technical part was mostly done and said
[13:24:23] what about the 1y vs 3y coverage for the SSDs?
[13:24:26] in terms of ok with wait time, etc.
[13:24:35] marostegui, yes, I said we want coverage
[13:24:48] If the difference is 3k (in total) I would go for the 3y one
[13:24:55] I think the question was on #2 vs. #3
[13:25:40] and we all seem to agree on keeping a standard disk size
[13:26:46] https://phabricator.wikimedia.org/T158580 is the other ticket
[13:27:26] apparently no response from finances yet/not reviewed by them yet
[13:28:10] while a different chassis is an issue- I actually like having 50-50 different vendors
[13:28:30] yeah, it is good not to stick to one
[13:28:31] not all apples in the same basket
[13:31:15] ok i've replied on the task that probably #2 looks best, but we need the quotes
[13:31:24] yes
[13:31:48] it is not about distrusting 3rd parties, it is just a proper technical review
[13:36:10] marostegui, I broke labsdb1009 while securing db1095
[13:36:21] (only the replication, do not freak out :-P)
[13:36:23] oh, what happened?
[13:36:32] need help?
[13:36:35] you can see if you connect
[13:36:36] no
[13:36:47] it is just one admin query that failed
[13:36:47] ah
[13:36:48] i see
[13:36:48] haha
[13:37:09] you were changing to connect thru socket?
[13:37:10] and it is actually good that it failed
[13:37:22] let's talk in private
[13:37:29] ok!
[13:39:48] 10DBA: Import x1 on dbstore2001 - https://phabricator.wikimedia.org/T159707#3087637 (10Marostegui) After all the checks to make sure we do not break dbstore2001 I am importing x1 into it.
[14:34:38] 10DBA, 10Wikidata: Migrate wb_terms to using prefixed entity IDs instead of numeric IDs - https://phabricator.wikimedia.org/T114903#3087785 (10daniel)
[15:10:00] 10DBA, 13Patch-For-Review: Rampant differences in indexes and PK on s6 (frwiki, jawiki, ruwiki) for revision table - https://phabricator.wikimedia.org/T159414#3087937 (10Marostegui) db1085: ``` root@neodymium:~# for i in frwiki jawiki ruwiki; do echo $i;mysql --skip-ssl -hdb1085 $i -e "show create table revisi...
[15:18:31] Osm2pgsql failed due to ERROR: CREATE UNLOGGED TABLE planet_osm_point
[15:18:35] grrrr
[15:21:35] ?
[15:22:13] hstore extension?
[15:23:17] postgresql-contrib-9.4 is installed
[15:24:29] no it was installed
[15:24:33] anyway fixed
[15:24:36] not sure what happened
[15:24:41] or it was postgis?
[15:24:42] but after a postgres restart it's working fine
[15:24:55] ok, cool
[15:24:56] but I had to reenable them after the restart
[15:25:09] as if the entire cluster dir after the restart went away
[15:25:17] oh but it did..
[15:25:18] heh
[15:25:19] ok
[15:25:32] anyway.. I am running a screen in labsdb1007
[15:25:39] in it it is doing the initial import
[15:25:42] leave it be for now
[15:25:49] will take the rest of the day if not more
[15:25:55] sure
[15:25:58] thank you!
[15:26:08] you are welcome
[15:26:15] sorry for that part being such a mess
[15:26:25] we should probably clean up all that mess soon
[15:26:33] with precise being deprecated it should be easier
[15:26:41] well, you are helping with this, which is more than I could ask for
[15:27:23] I would have had to reverse engineer all that without your help
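The "reenable them" above refers to re-creating the PostgreSQL extensions osm2pgsql needs before it can create its planet_osm_* tables. A minimal sketch of that step and of a typical initial import follows; the database name "gis", the psql invocations and the osm2pgsql flags are assumptions based on a standard osm2pgsql setup, not the exact commands run on labsdb1007.

```
# Re-create the extensions osm2pgsql expects in the target database
# ("gis" is an assumed name; hstore ships in postgresql-contrib-9.4,
# postgis in the separate postgis packages).
sudo -u postgres psql -d gis -c 'CREATE EXTENSION IF NOT EXISTS postgis;'
sudo -u postgres psql -d gis -c 'CREATE EXTENSION IF NOT EXISTS hstore;'

# Typical initial import from the downloaded planet file; this is the kind
# of long-running job kept in the screen session mentioned above.
osm2pgsql --create --slim --hstore -d gis planet-latest.osm.pbf
```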
[15:33:20] actually it should never have been that mess... but that's what happens when you meet the goals and the goals are set realistically
[15:33:43] live to fight another day
[15:34:01] and that "another day" has this tendency to finally arrive
[15:34:15] damn you "another day"
[15:34:38] again, I am not complaining at all
[15:34:58] it is exactly the same position with dbs- we have to improve very slowly
[16:34:26] jynus let me know when you are around, for a heartbeat table question
[16:34:36] shoot
[16:34:40] soooo
[16:34:52] dbstore2001 - I imported x1, and was about to start replication
[16:35:00] but?
[16:35:02] the heartbeat table has the entry for x1
[16:35:21] but replication complains about not being able to find a record for it
[16:35:35] it may be a different master
[16:36:16] the current master is db1031
[16:36:23] yep
[16:36:28] and the entry refers to db1031
[16:36:29] or maybe the id changed
[16:36:33] server_id
[16:36:37] let me see
[16:36:55] nope
[16:36:57] looks the same
[16:37:24] then look at the change that the binlog wants to do
[16:37:49] see why the update fails
[16:38:12] I assume that is row based replication
[16:38:33] so you will get the previous and new values on the binary log
[16:38:35] dbstore2001 is statement
[16:38:42] it doesn't matter
[16:38:51] what matters is its master
[16:39:18] update_rows_v1 would look to me like a row-based binlog
[16:39:26] yes, yes, x1 is row
[16:39:29] db1031 is row
[16:39:34] let's check the binlog then
[16:39:51] yeah, check what is not right, and change it
[16:40:18] normally it is a different server_id
[16:40:33] I think that is the PK
[16:41:41] yep, the PK is server_id
[16:42:30] that is normally what fails- I am not 100% sure it is
[16:42:40] but it is how I fixed it the last few times
[16:43:34] Replicate_Wild_Do_Table: flowdb.%,wikishared.%,heartbeat.% is worrying, though
[16:44:02] did you import all of x1, then started replication?
[16:44:18] oh the filters!
[16:44:19] damn
[16:44:25] I am afraid to say
[16:44:31] you may need to reimport again
[16:44:34] well, the import didn't take long
[16:44:41] what took long was to check the affected tables
[16:44:46] but yes, I will need to reimport
[16:44:48] as you may have lost already 1 second of writes
[16:44:51] yep
[16:45:06] maybe you can replay those based on the binlog
[16:45:28] but if I were you, I would take the way that takes less of my own time
[16:45:34] I will reimport
[16:45:36] even if it takes more server time
[16:45:38] it doesn't take much
[16:45:44] I will try to fix the heartbeat thingy
[16:45:47] just for the sake of it
[16:45:50] server time is plenty, your time is more valuable
[16:46:34] I do not understand though, you reimported everything
[16:46:44] but did not reset slave for that connection?
[16:46:53] I did a mysqldump
[16:46:56] yes
[16:46:58] that connection didn't exist there
[16:47:00] no problem on that
[16:47:14] ah, so it got it from the config?
[16:47:33] we need to change that on puppet
[16:47:56] e.g. dbstore2
[16:50:42] yeah
[16:50:47] I will start to reimport it again :(
[16:51:00] we need to change modules/role/templates/mariadb/mysqld_config/dbstore2.my.cnf.erb
[16:51:25] I think that should affect only dbstore2001
[16:51:34] yes, only servers in codfw
[16:51:48] I will change that tomorrow, as I will leave the import running now and I have to run errands
[16:52:00] thanks
[16:52:01] going to reset all 'x1'
[16:52:08] so it will be clean tomorrow
[16:52:31] it was difficult to handle because of the db names
[16:52:55] that is why I initially only imported dbs that were separate
[16:57:08] still weird, the heartbeat thing, as the row is identical to what it is in db2033 (where it is being dumped from)
[16:57:17] but I will investigate tomorrow
[16:57:33] marostegui, that cannot be
[16:58:00] there must be a gap between when the backup was taken and when it was loaded
[16:58:29] in any case, whatever it is, it is not an "issue"
[16:58:49] but if the ts doesn't matter…that shouldn't be an issue indeed
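For reference, the "look at the change that the binlog wants to do" step suggested above can be sketched as follows. The hosts and table are the ones mentioned in the log; the binlog file name, position and column list are placeholders and assumptions, not values from this incident.

```
# Inspect the failing row event on the master's binlog (file and position are
# placeholders; take them from SHOW SLAVE 'x1' STATUS on dbstore2001):
mysqlbinlog --base64-output=decode-rows -vv \
  --start-position=<Exec_Master_Log_Pos> <Relay_Master_Log_File> | less

# Then compare against what the slave actually has; server_id is the primary
# key of heartbeat.heartbeat, so a mismatch there makes a row-based UPDATE
# fail to find the record it wants to change.
mysql --skip-ssl -hdbstore2001 heartbeat \
  -e "SELECT server_id, file, position, ts FROM heartbeat"
```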
[17:05:58] marostegui, should I reset slave all x1 on dbstore2001 or ack the alert?
[17:06:19] oh yes
[17:06:21] did it ping?
[17:06:22] damn
[17:06:29] sorry
[17:06:48] I can do it, just tell me which of the 2 :-)
[17:07:31] do the reset x1 slave
[17:07:34] I am silencing icinga
[17:08:35] done
[17:08:44] thank you
[22:16:20] 10DBA, 10AbuseFilter, 06Performance-Team, 05MW-1.27-release-notes: AFComputedVariable::compute query timeouts - https://phabricator.wikimedia.org/T116557#3089604 (10Krinkle) . Still upto 100 exceptions per day from this code path for dat...
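The clean-up agreed on above (reset the 'x1' connection "so it will be clean tomorrow", then reimport) corresponds roughly to the sequence below. The host names come from the log; the dump options, database list, file name and CHANGE MASTER details (credentials and GTID settings omitted) are assumptions for illustration, since the exact commands are not quoted.

```
# Start from a clean x1 connection on dbstore2001 (MariaDB multi-source
# syntax), then redo the dump and load; the database list is illustrative.
mysql --skip-ssl -hdbstore2001 -e "STOP SLAVE 'x1'; RESET SLAVE 'x1' ALL;"

mysqldump --skip-ssl -hdb2033 --single-transaction --master-data=2 \
  --databases flowdb wikishared heartbeat > x1.sql
mysql --skip-ssl -hdbstore2001 < x1.sql

# Re-point the x1 connection at its master using the coordinates recorded by
# --master-data, then start it again (user/password/GTID options left out).
mysql --skip-ssl -hdbstore2001 -e "CHANGE MASTER 'x1' TO \
  MASTER_HOST='db2033.codfw.wmnet', MASTER_LOG_FILE='<file>', \
  MASTER_LOG_POS=<pos>; START SLAVE 'x1';"
```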